SeqLab and Molecular Evolutionary Phylogenetic Inference: How GCG's SeqLab Graphical User Interface (GUI) to the Wisconsin Sequence Analysis Package can be used to develop and refine input for programs from PHYLIP (PHYLogeny Inference Package) and how to use the available molecular evolution inference tools to ascertain and draw phylogenetic trees.
Authors:
Steven M. Thompson and Susan Jean Johns
The inscription on the inner cover of Evolution by Dobzhansky, Ayala, Stebbins and Valentine reads "Nothing in biology makes sense except in the light of evolution." (Dobzhansky, 1973). These words ring true in my ears. To me, evolution provides the single, unifying, cohesive force that allows all of life to be explained. It is to the life sciences what the missing holy grail of the unified field theory is to astrophysics.
The concepts of common ancestry and homology extend beyond anatomy, physiology, and genes down to the smallest heritable unit of life, the individual base pairs in DNA. Thus, even after homologous molecules are identified, it is necessary to establish the correspondence between individual sequence positions. A multiple sequence alignment is a hypothesis about evolutionary history. Therefore, use everything available to insure that you have prepared a satisfying one! Remember the old adage: "garbage in -- garbage out!"
Using any moleclular evolution inference program effectively depends upon your understanding of the operation of the software packages and the quality of the sequence alignments that are used. Devoting considerable time and energy toward developing the most satisfying multiple sequence alignment possible is invaluable in evolutionary inference as quality alignments mean everything for obtaining meaningful results from phylogenetic inference algorithms. Use all available information and understanding to insure that your alignment is as good as it can be. Be sure that the alignment makes biological sense. Make sure that known enzymatic, regulatory, and structural elements all align, for the results of your inference are absolutely dependent upon your alignment. The new GCG X-interface in the VADMS suite, SeqLab, can greatly assist you in this process. All of the molecular sequence phylogenetic inference programs make the validity of your input alignment their first and most critical assumption.
The available software can process any alignment that you feed it of the proper format. Whether or not they are appropriate and should be used, is up for you to determine. Beware of comparing "apples and oranges." Make sure that the family of sequences that you align are in fact related and that the alignment is meaningful. The programs will work with almost any input sequences but only make sense if they actually do belong to the same gene family. Either make paralogous comparisons (i.e. evolution via gene duplication) to ascertain gene phylogenies, or orthologous (within one ancestral loci) comparisons to ascertain organismal phylogenies; try not to mix them up. Lots of confusion can arise and extremely misleading interpretations can result otherwise. Also be wary of trying to align genomic sequences with cDNA when working with DNA. Similarly, don't try to align mature and precursor proteins; trim them even if they are different sequences. Otherwise it doesn't make evolutionary sense to use both of them, as one is not evolved from the other, rather one is the other -- these are all easy mistakes to make and can cause all sorts of problems.
After you have obtained a satisfying alignment, the next step is to feed it to some program for evolutionary reconstruction. GCG now has an phylogenetic inference program section. These programs, Distances and GrowTree, offer a neighbor-joining distance matrix solution using several different methods for the correction of multiple substitutions at homologous sites. PAUP (Phylogenetic Analysis Using Parsimony) by David Swofford of the Smithsonian Institute will be incorporated into GCG at some point; beta-testing, licensing, and documentation problems have all delayed its eventual incorporation. This week's session will introduce the University of Washingtion's Dr. Joseph Felsenstein's PHYLIP (PHYLogenetic Inference Package) programs. PHYLIP (http://evolution.genetics.washington.edu/phylip.html) is a comprenisive suite of 31 programs that can handle both molecular sequence and morphological character data.
PHYLIP's molecular sequence programs have versions that accept either amino acid or nucleotide sequence data. Again, I strongly encourage all searching and alignment to be performed on an amino acid basis if dealing with coding sequence. If you want to perform your phylogenetic inference on nucleotide sequence data, which may be desirable particularly with very similar sequences, it should be converted to nucleotides after alignment. In addition to the more easily achieved alignment, this also insures that gaps are not placed within codons. The SeqLab interface can also be used to help with these matters. Another general guideline to keep in mind when using any phylogenetic inference algorithm is to never initially impose a molecular clock. The existence of a good clock for your system can be evaluated after the phylogeny has been inferred; however, the universal molecular clock is a hotly disputed issue so don't base your entire analysis upon this principle. This is the main reason why I never recommend using UPGMA analysis -- it mandates a perfect clock. I will only illustrate those programs in this exercise that do not impose a clock and that accept proteins, in order to simplify matters.
One of the biggest problems in this field is that of sequence format. Each suite of programs requires different sequence format. We have so far introduced GCG sequence format both as single and multiple sequences and PIR format. Now we have to deal with still more types. SeqLab has its own format called Rich Sequence Format (rsf) and PHYLIP has its own unique input data format requirements. A couple of different programs are available to allow us to convert the alignment into a format usable with the PHYLIP package. One interface, tophylip, is a part of EGCG; it takes alignments generated in the various GCG formats and converts them into usable PHYLIP format. Another program on the VADMS system, ReadSeq by Dr. Don Gilbert at Indiana University, allows for the back and forth conversion between several different formats.
Access to PHYLIP on ribozyme is always available. All program names are found through the standard login path assignment. All documentation on the PHYLIP package is assessible to the user through the VADMS environment variable $DOC_DIR. Review the available PHYLIP documentation to aid in selecting the appropiate software to fit your needs. Pay particular atttention to the general package documentation, main.doc, that describes all the general options available within the package. Some of them should definitely be taken advantage of since they can make your analysis much more robust. Some of PHYLIP's programs require special input files. Others are meant to work on the output data created by another program in the package. Be sure you are familiar with the requirements for running each program you wish to use prior to trying it.
Many programs for phylogenetic inference are best run in background mode when faced with very large data sets. Determining when to do this is based on both the program and the size of your dataset -- it's largely a matter of experience. Because sooner or later all the programs in the PHYLIP package become too slow to run interactively, batch mode documentation and special template parameter and script files have been created to assist you in operating this way. Review this information prior to running any of these programs in this manner. When background processing is required, copy over the desired file, edit it to reflect the data that you are working with, and then submit the job.
Many of the programs also generate information that can be used to create plots of the results. The PHYLIP package can output graphics to many different devices. However most are not readily amiable to displaying terminal images through a network connection. Therefore, a locally produced program, phylip_plot, allows you to take advantage of HP plotter text output from PHYLIP. This text file is processed by phylip_plot to produce GCG style graphics. Because this plotting software was written with GCG subroutines, the size and appearance of the plotted data can be manipulated with the usual GCG plotting command switches.
I reiterate, the most important factor in inferring reliable phylogenies is the accuracy of the multiple sequence alignment. The interpretation of your results is utterly dependent on the quality of your input. In fact, many experts advice against using any parts of the sequence data that are at all questionable. Only analyze those portions which assuredly do align. If any portions of the alignment are in doubt, throw them out. This usually means trimming down the alignment's terminal ends and may require internal trimming as well. SeqLab makes this process much easier. Another possibility is to exclude portions with PHYLIP's Weight option. Many of the PHYLIP programs allow the user to differentially weight different parts of their alignment to reflect their confidence in it. This can be a handy trick with some data sets, especially those with both highly conserved and highly variable regions.
To properly use the two software packages used in this week's exercise, it is necessary to understand a little of the philosophies behind the two packages. GCG was created with the user in mind. In spite of common opinion, it is quite user friendly, expects little in the way of background from its users, and provides excellent documentation for its software. GCG provides default parmeters for the programs which, while not always ideal, in each case will produce meaningful results. And the new SeqLab interface makes GCG even easier to use. All GCG functions can be launched from SeqLab. It allows full editing control of your alignments, and furthermore, it can parse feature information from database entries so that you can see where important structural and functional sites lay in your alignments. PHYLIP, on the other hand, was developed with the accomplished evolutionary biologist in mind. It expects a great deal from its users in the way of background knowledge and experience. It expects the user to know enough about the subject matter to be able to effectively make decisions regarding the options available in each program. While documentation is provided, it is written with these expectations in mind. The use of alignments created by GCG in the PHYLIP programs means that the user needs to understand the characteristics of both packages in regard to displaying the names of sequences, the interpretation of symbols used, and file accessing. PHYLIP also has a number of other general considerations that will be discussed in this section
The PHYLIP system has a limitation of ten characters for naming purposes. GCG, while it will truncate names, does not cut them back to such a small limit, rather it uses the name itself as the determining factor for the length of the name field used. The field will be truncated so that fifty bases or amino acids will be shown on the data line in a sequence file. There is also a difference in the manner in which the two systems treat names. PHYLIP believes that a name should be shown for a sequence only once in a data set. GCG displays the name of the sequence as often as it can to avoid user confusion. PHYLIP's ten character name limitation can be gotten around just before the final plotting process, but it is a bother. In general, it's a good idea to begin the process with the names that you will want on the final graphic.
Because GCG and PHYLIP were created independently they use alignment symbols differently. PHYLIP's symbol usage needs to be clarified because of this. GCG alignment programs insert periods, "."'s, to represent gaps in the alignment and tildes "~"'s to show uneven end lengths. However, periods mean "the same symbol as the above sequence" to PHYLIP and it doesn't regognize the tildes at all, therefore, ToPhylip converts periods and to minuses, "-"'s, which mean deletions to PHYLIP. ReadSeq does not do this automatically and it must be done afterwards. However, not all gaps in sequences should be interpreted as deletions. Interior gaps are probably okay to represent this way, as regardless of whether a deletion, insertion or a duplication event created the gap, logically they will be treated the same by the algorithms. These are called indels. However, end gaps, the tildes, should not be represented as indels because a lack of information beyond the length of a given sequence may not be due to a deletion or insertion event; it may have nothing to do with the particular stretch being analyzed at all. It may just not have been sequenced! Therefore, you need to manually edit the output from any alignment format converter to change leading and trailing gap symbols to either "x"'s or "?"'s depending on the situation. In peptide alignments these symbols mean "unknown amino acid" and "unknown amino acid or deletion" respectively to PHYLIP. This will assure that the progams do not make incorrect assumptions about your sequences.
The two systems also treat files differently. PHYLIP does not allow you to assign output file names; it uses a set of standard output file names of the format outfile, treefile, and plotfile. Because of this, and UNIX's lack of a version number, if you run concurrent PHYLIP analyses, you must run them in separate directories, otherwise the output from one program will overwrite the output of any other. Furthermore, even when running PHYLIP programs sequentially, be sure to rename every output file as it is produced or the same "clobbering" effect will destroy your results when the next program is run. This process can be very annoying and bothersome but it is vitally important!
The length of time PHYLIP takes to do the processing on any given data set depends on the nature of the data being used and the program doing the processing. Some of the PHYLIP programs are just very slow by the nature of what they are trying to do. Others only slow down when a large number of sequences are being worked with that are relatively long (i.e. over around 500 bases or amino acids). The individual program documentation is very good about letting the user know if the program is slow or not. Read through some level of documentation on a program prior to running it. If you are familiar with the ideas behind evolutionary analyses, you may need to only read the main.doc information on how to run the programs. Otherwise, read the author supplied documentation on each program being used to get the entire breadth available.
This week's exercise will explore several methods for inferring molecular phylogenies as well as introduce the Wisconsin Sequence Analysis Package's SeqLab Graphical User Interface (GUI). You will prepare, refine, and use a sequence alignment from the results of your FastA search of the SwissProtein database back in Exercise 8. I will illustrate the techniques with an alignment prepared from the prion FastA output. You will determine the most reasonable phylogeny of a group of about twenty representative organisms based on your Selected Molecule. To start out, create this week's subdirectory and copy the necessary FastA output file from Exercise 8's subdirectory to it. Be sure to activate GCG and EGCG, and to use SetPlot to designate your VersaTerm-Pro graphics configuration, before beginning the exercise.
1) Prepare an appropriate dataset using organisms in your FastA output.
Load your FastA output file into the pico editor. I want you to be able to perform all of this week's analyses in real time so we need to trim down the size of our dataset to about twenty organisms. However, I do not want you to merely reiterate established clear-cut phylogenies. We need to do some real science here. Therefore, choose about twenty entries total, but be sure to include a couple of sequences from the list that are not so clearly alike your specific Selected Molecule. Also try to choose sequences from as many different organisms as possible, in other words, don't just grab the top twenty scores! However, be sure that all of the sequences that you choose are all of the same protein type. We don't want to confuse issues by using paralogous molecules; it's just that we want some further diverged phylogenetic members in the dataset. My example FastA list only included 26 prion molecules overall, so I will use all of them but two here. I commented out prip_bovin and prp2_trast because they are clearly paralogous molecules with prio_bovin and prio_trast respectively. You will need to choose from among many on your lists. The relevant portion from my example edited FastA list follows:
The best scores are: init1 initn opt z-sc E(51759).. sw:prio_human Begin: 23 End: 253 ! P04156 homo sapiens (human). major ... 1690 1690 1690 1682.2 0 sw:prio_gorgo Begin: 23 End: 253 ! P40252 gorilla gorilla gorilla (low... 1686 1686 1686 1678.3 0 sw:prio_pantr Begin: 23 End: 253 ! P40253 pan troglodytes (chimpanzee)... 1680 1680 1680 1672.4 0 sw:prio_ponpy Begin: 23 End: 253 ! P40256 pongo pygmaeus (orangutan). ... 1668 1668 1668 1660.5 0 sw:prio_colgu Begin: 23 End: 253 ! P40251 colobus guereza. major prion... 1655 1655 1655 1647.7 0 sw:prio_prefr Begin: 23 End: 253 ! P40257 presbytis francoisi. major p... 1654 1654 1654 1646.7 0 sw:prio_macfa Begin: 23 End: 253 ! P40254 macaca fascicularis (crab ea... 1645 1645 1645 1637.8 0 sw:prio_calja Begin: 23 End: 252 ! P40247 callithrix jacchus (common m... 1406 1406 1631 1624.1 0 sw:prio_saisc Begin: 23 End: 260 ! P40258 saimiri sciureus (common squ... 1416 1416 1626 1618.9 0 sw:prio_calmo Begin: 16 End: 241 ! P40248 callicebus moloch (dusky tit... 1612 1612 1612 1605.6 0 sw:prio_cebap Begin: 23 End: 252 ! P40249 cebus apella (brown-capped c... 1396 1396 1611 1604.3 0 sw:prio_mansp Begin: 16 End: 241 ! P40255 mandrillus sphinx (mandrill)... 1610 1610 1610 1603.6 0 !sw:prip_bovin Begin: 25 End: 256 ! Q01880 bos taurus (bovine). major p... 1549 1549 1598 1591.4 0 sw:prio_sheep Begin: 25 End: 256 ! P23907 ovis aries (sheep). major pr... 1539 1539 1588 1581.5 0 sw:prio_odohe Begin: 25 End: 256 ! P47852 odocoileus hemionus (mule de... 1534 1534 1583 1576.6 0 sw:prio_aottr Begin: 16 End: 239 ! P40245 aotus trivirgatus (night mon... 1363 1363 1580 1574.1 0 sw:prio_mesau Begin: 23 End: 254 ! P04273 mesocricetus auratus (golden... 1451 1451 1576 1569.7 0 sw:prio_bovin Begin: 26 End: 264 ! P10279 bos taurus (bovine). major p... 1360 1360 1565 1558.6 0 sw:prio_musvi Begin: 25 End: 257 ! P40244 mustela vison (american mink... 1066 1066 1563 1556.8 0 !sw:prp2_trast Begin: 25 End: 256 ! P40243 tragelaphus strepsiceros (gr... 1511 1511 1560 1553.9 0 sw:prio_mouse Begin: 23 End: 254 ! P04925 mus musculus (mouse). major ... 1199 1321 1554 1548.0 0 sw:prio_trast Begin: 26 End: 264 ! P40242 tragelaphus strepsiceros (gr... 1352 1352 1546 1539.9 0 sw:prio_cerae Begin: 23 End: 245 ! P40250 cercopithecus aethiops (gree... 1335 1335 1545 1539.4 0 sw:prio_rat Begin: 1 End: 226 ! P13852 rattus norvegicus (rat). maj... 1396 1396 1524 1519.1 0 sw:prio_atege Begin: 21 End: 232 ! P40246 ateles geoffroyi (black-hand... 1327 1327 1381 1377.9 0 sw:prio_chick Begin: 28 End: 270 ! P27177 gallus gallus (chicken). maj... 330 363 488 495.8 4.3e-21 \\End of List
2) Introducing SeqLab.
To begin using SeqLab, go to the Apple "Finder" icon in the upper right-hand corner of the screen and open the Finder. The Apple "Launcher" will now be visible. Double click the "Ribozyme X" icon. This will launch the X-server MacX and connect you to ribozyme using the xterm program. Log in as usual. Change directory to your Exercise 10 subdirectory and initialize GCG. This is necessary because this session is completely independent of your concurrent VersaTerm-Pro vt100 session. In fact, you can easily switch between the two using the Apple Finder icon. Now launch GCG's GUI by typing seqlab &. The ampersand, "&," is not necessary but it allows you to retain control of the xterm window by running SeqLab in the background in another window and thereby not tying up the xterm window. This way you can switch back and forth between the xterm and SeqLab windows. Furthermore, if you are working at a smaller monitor, about 15 inches or less, then you will want to launch SeqLab with its small option to prevent windows from overflowing your screen. That should not be required with the computers in our teaching lab. After a few moments two blue windows will open; click in the smaller one and check "OK." This will put you in the main SeqLab window where all analyses may be performed. All menus that I refer to from this point on in SeqLab will be within the SeqLab display, not across the very top of the monitor -- those are MacX commands. Before beginning the analyses, go to the "Options" menu and select "Preferences . . ." We are going to customize a few features to make SeqLab a bit more intuitive to run.
First notice that there are three different "Preferences" settings that can be changed: "General", "Output," and "Editor Properties;" start with "General." The "Working Dir . . ." setting will be the directory from which SeqLab was initially launched. This is where all SeqLab's working files will be stored; it can be changed in later sessions if desired, however, it is appropriate to leave it as your Exercise 10 directory for now. Be sure that the "Start SeqLab in:" choice has "List mode" selected and that "Close the window" is selected under the "After I push the 'Run' button:" choice. Next select the "Output Preferences." Change the default by selecting "Automatically display new output." Leave the other choices alone. Take a look at the "Editor Properties" menu next. We will leave all these choices as is but I want to point out that if you are dealing with very large alignments, then changing to the "Small" "Font Size" may be desirable. Click "OK" to accept all of your changes. Next under the "Options" menu, choose "Graphics Devices . . .". The standard SetPlot menu will be displayed; select the "Encapsulated PostScript File" choice by clicking on it and then check "OK." Now the SeqLab interface is ready to be utilized.
Be sure the "Mode:" "Main List" choice is selected and then go to the "File" menu. Pick "Add sequences from" and select "Sequence Files." This will produce an "Add Sequences" window from which you can select sequences to add to your working.list. The "Filter" box is very important here! By default files are filtered such that only those that end with the extension ".seq" are displayed. This won't do us any good as the sequences that we want to add are all in the modified FastA list file that we worked on earlier. Therefore, delete the ".seq" extension in the "Filter" box; be sure to leave the "*" wildcard. Press the "Filter" button to display all of the files in your working directory. Select your FastA file from the "Files" box and then check the "Add" and then "Close" buttons at the bottom of the window to put the file in your working.list. It will appear in the SeqLab "Main List" window. Now switch to "Editor" "Mode:" to load the sequences into the SeqLab editor. You will be asked whether you want to modify the sequences because some of them have attributes set; check "Modify the sequences." This will automatically trim all of them down to that portion that FastA identified as being similar to your query, a nice thing to do for our purposes. Notice that all of the sequences now appear in the editor window with color-coded amino acids and that few, if any, begin with Methionine any more. Expand the window full-screen by clicking on its grow box in the upper right-hand corner. You should be able to see all of your sequences now. The display will look something like this:

Explore the editor interface for a while. The scroll bar at the bottom allows you to move through the sequences linerally. You can select any positions by 'capturing' them with the mouse. The "pos:" and "col:" indicators show you where the cursor is located in any particular sequence and the overall dataset respectively. The "1:1" scroll bar near the upper right-hand corner allows you to 'zoom' in or out on the sequences; move it to 2:1 and beyond and notice the difference in the display. Go to the "Display:" box and change it from "Residue Coloring" to "Feature Coloring." The colors are now based on the information from the database Feature Table for each entry. Change the "Display:" to "Graphic Features;" now the features are represented using the same colors as before but in a 'cartoon' fashion. Use the mouse to move your cursor to one of the colored areas; double-click it. This will produce a new window that describes the features located at the cursor. Click on one of the features to get more information on it. All the features are fully editable through the "Edit" checkbox in this panel and new features can be added with several desired shapes and colors through the "Add" checkbox. The display will look something like my example below:

Click the "Close" box and return your display to "1:1" and "Residue Coloring." Now use your mouse to select all of your sequence names; either click and drag through the whole list or click the top entry then shift-click the bottom one (control-click allows you to pick separate, multiple entries). Another way to select them all would be to go to the "Edit" menu and click on "Select All" there. Once all of your sequences are selected, go to the "Functions" menu and select "Multiple Sequence Analysis." Click on "PileUp . . ." to align the members of your list. This will produce a new window with the parameters for running PileUp. For this first pass accept all of the program defaults by merely pressing the "Run" button. The window will go away and after a few moments, depending on the complexity of your alignment and the load on ribozyme, new output windows will automatically display. The top window will be the msf output from your PileUp run. Notice the BLOSUM62 matrix and gap introduction and extension penalties of 12 and 4 respectively are used by default. In most cases these will work very well. Scroll through your alignment to check it out and then "Close" the window. The next window will be the "Output Manager." This is a very important window and will contain all of the output from your current SeqLab session. Files may be displayed, printed, saved in other locations with other names, and deleted from this window. We need to use an extremely important function at this point; click on "Add to Editor." The "Reloading Same Sequence" dialog box will appear; click on "Overwrite old with new" to replace the unaligned dataset with the new one. This will take your msf output and move it into the open editor, keeping all feature information intact, yet renumbering all of its reference locations. "Close" the window after loading your new alignment. The next window will contain PileUp's cluster dendrogram. If desired, you can directly print from this window to a PostScript file by picking "Print . . ." Be sure that the "Output Device:" chosen is "Encapsulated PostScript File." You can rename the output file from tmp.ps to anything that you may want in this window; click "Proceed" to create the EPSF output in your current directory. "Close" the dendrogram window to return to the editor.
Now notice that your residues align by color. Be sure that all sequences are still selected and then go back to "Functions" "Multiple Sequence Analysis" and chose "PlotSimilarity . . ." This time we need to change some of the program defaults so chose "Options . . ." Check "Save SeqLab colormask to" and "Scale the plot between:" the "minimum and maximum values calculated from the alignment." "Close" the window; notice that the "Command Line:" box now reflects your updated options. Click the "Run" box to launch the program. The output will quickly return. "Close" the plotsimilarity.cmask display and the "Output Manager" then take a look at the similarity plot. Again, make a PostScript file of the plot, if desired, and then "Close" the window. Now go to the "File" menu and click on "Open Color Mask Files." This will produce another window from which you should select your new "plotsimilarity.cmask" file and then click on "Add" and "Close" the window. This will produce a gray scale overlay on your sequences that describes their regional similarity. My prion alignment, using a zoom factor of 4 to 1 looks like this:

The beauty of this representation is you can now select only those regions of low similarity to try to improve their alignment. This is possible because of PileUp's new InSitu option. I may not be able to improve my alignment much since these sequences are all so similar but I will go through the steps so that you will see how it can be done. Be sure that all of your sequences are still selected then zoom back in to your alignment to 1:1 so that you can see individual residues and then scroll to the end. It's best to start at the carboxy termini so that the position of the low similarity regions does not become skewed as you proceed through this procedure. Now select a region of low similarity, either by using the mouse or by using the "Edit" "Select Range" function (determine the positions by placing your cursor at the begining and end of the range to be selected and noting the column number). Relaunch PileUp by going through the "Functions" menu. (The "Windows" menu also contains a listing of all of the programs that you have used in the current session; you can launch any of them from there as well.) You will be asked whether you want to use the "Selected sequences" or "Selected region;" specify "Selected region." This will again produce a PileUp window, only this time we need to use some options so click on "Options . . ." In the "Options" window check the gap creation and extension boxes and change their respective values to much less than the default 12 and 4. Four times less seems to work pretty well, i.e. change them to 3 and 1 respectively. Also be sure to check "Realign a portion of an existing alignment;" this calls up the InSitu option. Furthermore, we really don't need another similarity dendrogram, so uncheck the "Plot dendrogram" box. "Close" the window and notice the new options in the PileUp "Command Line:" "Run" the program to improve your alignment.
Your results should be returned very quickly since you are only realigning a portion of the alignment. Check out the new msf file and then close its window. Be sure to "Add to Editor" the new msf file in the "Output Manager." This will put the new alignment into the open editor and yet retain all the feature information. This feature information may help guide your alignment efforts in the next steps. Your alignment should now be a bit better within the specified region. Repeat this process in all areas of low similarity, again, working from the carboxy termini toward the amino end. Notice that all of the options that you last specified are retained by the program so you don't need to respecify them. You can also save these run parameters so that they will come up in subsequent sessions by clicking on the "Save Settings" box in any of the program run windows. You may want to go to the "File" menu periodically to save your work using the "Save as . . ." function in case of a computer or network problem. It's also probably a good idea to reperform the PlotSimilarity and color mask procedure after going through the entire alignment to see how things have improved after you've finished the various InSitu PileUps. If you discover an area that you can not improve through this automated procedure, then it is time to either manually 'correct' it or throw it away. Again, note those 'problem' areas and then switch back to "Residue Coloring." This will ease manual alignment by allowing your eyes to work with columns of color.
Other things that can help manual alignment are "GROUP"ing and "Protections." The "GROUP" function allows you to manipulate 'families' of sequences as a whole -- any change in one will be propagated throughout them all, much the same as in the 'anchoring' function of the Fragment Assembly System (FAS). To "GROUP" sequences, select those that you want to behave collectively and then click on the "GROUP" icon right above your alignment. You can have as many groups as you want. Also as in FAS, the space bar will introduce a gap into the sequence and the delete key will take a gap away. However, you can not delete a sequence residue without changing that sequence's (or the entire alignment's) "Protections." Click on the padlock icon to produce a "Protections" window. Notice that the default protection allows you to modify "Gap Characters" and "Reversals" only. Check "All other characters" to allow you to "Cut" regions out of your alignment and/or delete individual residues and then click "OK" to close the window. A very powerful manual alignment function can be thought of as the 'abacus' function. To take advantage of this function select the region that you want to slide and then press the shift key as you move the region with the right or left arrow key. You can slide residues greater distances by prefacing the command keystrokes with the number of spaces that you want them to slide.
Make subjective decisions regarding your alignment. Is it good enough; do things line up the way that they should? If, after all else, you decide that you just can't align some region, or even an entire sequence, then get rid of it with the "Cut" function. Amino and carboxy termini seldom align properly although allowing SeqLab to trim the ends based on FastA's beginning and ending constraints considerably improves this situation. You can trim them even more within SeqLab now, if required, or you can trim out the jagged ends later with the ToPHYLIP conversion program -- it's particularly the interior here that you need to decide upon. Remember, everything we do in this exercise from this point on, is absolutely dependent on this alignment! In my example prion case, I decided that trying to align the chicken prion to all of the mammal sequences was just too hard, so I cut it out of the dataset by selecting only it and then pressing the "Cut" icon. It was just too divergent for the analysis at hand -- for one thing the characteristic prion Proline-Histidine repeat unit occurs in most of my sequences four times, in a few of them three or five times, but in the chicken sequence it occurs six times, and the disulphide bridge location was also all messed up. Cutting out this sequence left some columns of gaps in my alignment so I reselected all of my sequences and then went to the "Edit" menu where I selected "Remove Gaps . . ." "Columns of gaps." This produced a very clean, unambiguous alignment that I now have a very high confidence in.
After you have finished tweaking, evaluating, and readjusting your alignment to make it as 'satisfying' as possible, go to the "File" menu. Click on "Export . . .;" under the "Format" selection in the new window be sure that "MSF" (Multiple Sequence Format) is specified; click "Save" to save a copy of your file in GCG's msf format. Be sure to change its name to give it the msf extension in order to prevent overwriting your rsf file; click "OK" to close the window. Now get out of SeqLab by going back to the "File" menu and clicking on "Exit." You will probably be asked if you want to save your rsf file and any changes in your list. Accept the suggested changes and SeqLab will close. This will return you to the xterm window. Log out of ribozyme in the xterm window and quit MacX using its "File" menu. Go back to the Apple Finder and switch to the VersaTerm-Pro session that's been hanging out in the background while we used SeqLab.
Obviously I have only touched the 'tip of the iceberg' regarding SeqLab's full potential. Please refer to the GCG documentation on SeqLab to fully explore its many possibilities. It is an incredibly powerful way to run the Wisconsin Sequence Analysis Package.
Many other alignment editors are also available for cleaning up multiple sequence alignments. However, I think that you will find SeqLab most satisfying, and only using a GCG compatible editor assures you that the format will not be corrupted. If you do not have access to an X-server interface, then the editor LineUp in the Wisconsin Package with the msf option is provided for this purpose. However, LineUp is not very powerful; it only allows a maximum of thirty sequences to be worked on. The EGCG program ELineUp expands this capability but is still not nearly as powerful as SeqLab. If you do make any changes to a GCG sequence data file with a non-GCG compatible editor, you must reformat the alignment afterwards. However, reformatting MSF files requires a couple of tricks. If this step is not done exactly correct, you will get very weird results. You must use the msf option and you must specify all the sequences within your MSF file, {*}, for example:
% reformat -msf my_favorite.msf{*}
Here you will not need to perform this step, unless for some perverse reason you decided to edit your alignment with a non-GCG compliant editor such as pico; however, it may prove necessary in other situations outside of class. After reformatting, the new MSF file will follow GCG convention, with updated format, numbering, and checksums.
Anytime you do alter an alignment, even using SeqLab, be sure to check the updated checksums in the msf file. If any of the sequence's checksums are now the same, they should be excluded from the rest of the analysis because they are now identical. Comment out all but one of each identical sequence by placing an exclamation mark in front of their checksum lines and then run reformat -msf to exclude it from subsequent steps. Since the sequences would be identical it doesn't make any sense to ascertain their phylogenetic relationship.
3) Alignment visualization and publication tools.
Another good way to 'see' a multiple sequence alignment is with the two EGCG programs PrettyBox and PrettyPlot. They can help visualize the conservation within an alignment and are great for producing publication quality graphics. They display the alignment in an interleaved fashion, boxing areas of conservation. PrettyBox uses different gray tones to differentiate various levels of conservation. PrettyPlot encloses conserved residues in an open box, basing its conservation decisions on the options you specify at runtime. PrettyBox only generates PostScript output which must be redirected to an output file. PrettyPlot uses standard GCG graphics standards so its output can be displayed to any supported emulation including Encapsulated PostScript or redirected to a Figure file for further manipulation. Run PrettyPlot on your alignment to display the results at your terminal (you may also wish to rerun the program to prepare PostScript graphics of the result for harcopy). A screen trace follows:
% prettyplot prion.msf{*}
PRETTYPLOT displays multiple sequence alignments and calculates a
consensus sequence. It does not create the alignment, it simply
displays it.
Start (* 1 *) ? <rtn>
End (* 243 *) ? <rtn>
prio_bovin len: 243 wgt: 1.00
prio_trast len: 243 wgt: 1.00
prio_saisc len: 243 wgt: 1.00
prio_sheep len: 243 wgt: 1.00
prio_odohe len: 243 wgt: 1.00
prio_musvi len: 243 wgt: 1.00
prio_mesau len: 243 wgt: 1.00
prio_rat len: 243 wgt: 1.00
prio_mouse len: 243 wgt: 1.00
prio_cerae len: 243 wgt: 1.00
prio_atege len: 243 wgt: 1.00
prio_human len: 243 wgt: 1.00
prio_gorgo len: 243 wgt: 1.00
prio_pantr len: 243 wgt: 1.00
prio_ponpy len: 243 wgt: 1.00
prio_prefr len: 243 wgt: 1.00
prio_macfa len: 243 wgt: 1.00
prio_mansp len: 243 wgt: 1.00
prio_colgu len: 243 wgt: 1.00
prio_calmo len: 243 wgt: 1.00
prio_calja len: 243 wgt: 1.00
prio_aottr len: 243 wgt: 1.00
prio_cebap len: 243 wgt: 1.00
Find consensus to what minimum plurality (* 12.0 *) ? <rtn>
When your VERSATERM-TEK4105 attached to tty is ready, press <Return>. <rtn>





4) Format conversion: ToPHYLIP and ReadSeq.
Once a user understands the possible problems associated with converting data from GCG format into PHYLIP format, as described in the introduction, it is time to proceed. Data conversion can be done with either EGCG's ToPHYLIP or Don Gilbert's ReadSeq program. You will be using ToPHYLIP here because it allows you to select a portion of an alignment and it automatically handles the dot to dash problem. You may have already edited off the divergent ends while in SeqLab, which is great, but the auto' dot to dash feature still makes using it worthwhile. Enter the command tophylip to launch the process; specify all of your alignment's sequences using GCG's {*} convention; exclude your alignment's divergent ends if you didn't trim them earlier:
% tophylip
TOPHYLIP writes GCG sequences into a single file in PHYLIP format.
TOPHYLIP uses any sequences
TOPHYLIP of what sequence(s) ? prion.msf{*}
prion.msf{prio_bovin}, len: 243
prion.msf{prio_trast}, len: 243
prion.msf{prio_saisc}, len: 243
prion.msf{prio_sheep}, len: 243
prion.msf{prio_odohe}, len: 243
prion.msf{prio_musvi}, len: 243
prion.msf{prio_mesau}, len: 243
prion.msf{prio_rat}, len: 243
prion.msf{prio_mouse}, len: 243
prion.msf{prio_cerae}, len: 243
prion.msf{prio_atege}, len: 243
prion.msf{prio_human}, len: 243
prion.msf{prio_gorgo}, len: 243
prion.msf{prio_pantr}, len: 243
prion.msf{prio_ponpy}, len: 243
prion.msf{prio_prefr}, len: 243
prion.msf{prio_macfa}, len: 243
prion.msf{prio_mansp}, len: 243
prion.msf{prio_colgu}, len: 243
prion.msf{prio_calmo}, len: 243
prion.msf{prio_calja}, len: 243
prion.msf{prio_aottr}, len: 243
prion.msf{prio_cebap}, len: 243
Start (* 1 *) ? <rtn>
End (* 243 *) ? <rtn> What should I call the output file (* prio_trast.phylip *) ? prion.phyLoad the new file into the pico editor to take a look at the new format. PHYLIP requires a very simple format with the size of the data matrix specified on the top line. While the file is in the editor make the "?" change described in the introduction. As discussed earlier, the ToPHYLIP routine will replace gaps in the data with deletion symbols, "-," which is logically correct in most cases. However, gaps at the ends and beginnings of sequences should not have hyphens unless you really know that a deletion/insertion is responsible for the length discrepancy. Therefore, replace leading and trailing hyphens in your alignment with question marks or the unknown amino acid symbol "x" depending on which is more appropriate. Be very careful when changing these characters so that the alignment does not get shifted out of phase. This is also an excellent point at which to verify that the sequence names are exactly as you wish them to appear in the final resultant plots. Phylip sequence names can contain punctuation and capitalization, and up to ten characters or less. My corrected ToPHYLIP output follows:
23 243
prio_bovin ?KRPKPGGGW NTGGSRYPGQ GSPGGNRYPP QGGGGWGQPH GGGWGQPHGG
prio_trast ?KRPKPGGGW NTGGSRYPGQ GSPGGNRYPS QGGGGWGQPH GGGWGQPHGG
prio_saisc KKRPKPG-GW NTGGSRYPGQ GSPGGNRYPP Q-GGGWGQPH GGGWGQPHGG
prio_sheep KKRPKPGGGW NTGGSRYPGQ GSPGGNRYPP QG-------- GGGWGQPHGG
prio_odohe KKRPKPGGGW NTGGSRYPGQ GSPGGNRYPP QG-------- GGGWGQPHGG
prio_musvi KKRPKPGGGW NTGGSRYPGQ GSPGGNRYPP QG-------- GGGWGQPHGG
prio_mesau KKRPKPG-GW NTGGSRYPGQ GSPGGNRYPP QG-------- GGTWGQPHGG
prio_rat ??????G-GW NTGGSRYPGQ GSPGGNRYPP QS-------- GGTWGQPHGG
prio_mouse KKRPKPG-GW NTGGSRYPGQ GSPGGNRYPP Q--------- GGTWGQPHGG
prio_cerae KKRPKPG-GW NTGGSRYPGQ GSPGGNRYPP QG-------- --------GG
prio_atege ?????PG-GW NTGGSRYPGQ GSPGGNRYPP Q--------- --------GG
prio_human KKRPKPG-GW NTGGSRYPGQ GSPGGNRYPP QG-------- GGGWGQPHGG
prio_gorgo KKRPKPG-GW NTGGSRYPGQ GSPGGNRYPP QG-------- GGGWGQPHGG
prio_pantr KKRPKPG-GW NTGGSRYPGQ GSPGGNRYPP QG-------- GGGWGQPHGG
prio_ponpy KKRPKPG-GW NTGGSRYPGQ GSPGGNRYPP QG-------- GGGWGQPHGG
prio_prefr KKRPKPG-GW NTGGSRYPGQ GSPGGNRYPP QG-------- GGGWGQPHGG
prio_macfa KKRPKPG-GW NTGGSRYPGQ GSPGGNRYPP QG-------- GGGWGQPHGG
prio_mansp KKRPKPG-GW NTGGSRYPGQ GSPGGNRYPP QG-------- GGGWGQPHGG
prio_colgu KKRPKPG-GW NTGGSRYPGQ GSPGGNRYPP QG-------- GGGWGQPHGG
prio_calmo KKRPKPG-GW NTGGSRYPGQ GSPGGNRYPP QG-------- GGSWGQPHGG
prio_calja KKRPKPG-GW NTGGSRYPGQ GSPGGNRYPP Q--------- GGGWGQPHGG
prio_aottr KKRPKPG-GW NTGGSRYPGQ SSPGGNRYPP QS-------- GG-WGQPHGG
prio_cebap KKRPKPG-GW NTGGSRYPGQ GSPGGNLYPP Q--------- GGGWGQPHGG
GWGQPHGGGW GQPHGGGWGQ PHGGGGWGQ- GGTHGQWNKP SKPKTNMKHV
GWGQPHGGGW GQPHGGGWGQ PHGGGGWGQ- GGTHGQWNKP SKPKTNMKHV
GWGQPHGGGW GQPHGGGWGQ PH-GGGWGQG GGTHNQWNKP SKPKTNMKHM
GWGQPHGGGW GQPHGGGWGQ PHGGGGWGQ- GGSHSQWNKP SKPKTNMKHV
GWGQPHGGGW GQPHGGGWGQ PHGGGGWGQ- GGTHSQWNKP SKPKTNMKHV
GWGQPHGGGW GQPHGGGWGQ PHGGGGWGQG GGSHGQWGKP SKPKTNMKHV
GWGQPHGGGW GQPHGGGWGQ PH-GGGWGQG GGTHNQWNKP SKPKTNMKHM
GWGQPHGGGW GQPHGGGWGQ PH-GGGWSQG GGTHNQWNKP SKPKTNLKHV
GWGQPHGGSW GQPHGGSWGQ PH-GGGWGQG GGTHNQWNKP SKPKTNLKHV
GWGQPHGGGW GQPHGGGWGQ PH-GGGWGQG GGTHNQWHKP SKPKTSMKHM
GWGQPHGGGW GQPHGGGWGQ PH-GGGWGQG GGTHNQWNKP SKPKTNMKHM
GWGQPHGGGW GQPHGGGWGQ PH-GGGWGQG GGTHSQWNKP SKPKTNMKHM
GWGQPHGGGW GQPHGGGWGQ PH-GGGWGQG GGTHSQWNKP SKPKTNMKHM
GWGQPHGGGW GQPHGGGWGQ PH-GGGWGQG GGTHSQWNKP SKPKTNMKHM
GWGQPHGGGW GQPHGGGWGQ PH-GGGWGQG GGTHSQWNKP SKPKTNMKHM
GWGQPHGGGW GQPHGGGWGQ PH-GGGWGQG GGTHSQWNKP SKPKSNMKHM
GWGQPHGGGW GQPHGGGWGQ PH-GGGWGQG GGTHNQWHKP SKPKTSMKHM
GWGQPHGGGW GQPHGGGWGQ PH-GGGWGQG GGTHNQWHKP NKPKTSMKHM
GWGQPHGGGW GQPHGGGWGQ PH-GGGWGQG GGTHSQWNKP SKPKTSMKHM
GWGQPHGGGW GQPHGGGWGQ PH-GGGWGQG GGTHNQWNKP SKPKTNMKHV
GWGQPHGGGW GQPHGGGWGQ PH-GGGWGQG GGTHSQWNKP SKPKTNMKHV
GWGQPHGGGW GQPHGGGWGQ PH-GGGWGQG GGTHNQWNKP SKPKTNMKHM
GWGQPHGGGW GQPHGGSWGQ PH-GGGWGQG GGTHNQWNKP SKPKTSMKHV
AGAAAAGAVV GGLGGYMLGS AMSRPLIHFG SDYEDRYYRE NMHRYPNQVY
AGAAAAGAVV GGLGGYMLGS AMSRPLIHFG SDYEDRYYRE NMYRYPNQVY
AGAAAAGAVV GGLGGYMLGS AMSRPLIHFG NDYEDRYYRE NMYRYPSQVY
AGAAAAGAVV GGLGGYMLGS AMSRPLIHFG NDYEDRYYRE NMYRYPNQVY
AGAAAAGAVV GGLGGYMLGS AMNRPLIHFG NDYEDRYYRE NMYRYPNQVY
AGAAAAGAVV GGLGGYMLGS AMSRPLIHFG NDYEDRYYRE NMYRYPNQVY
AGAAAAGAVV GGLGGYMLGS AMSRPMMHFG NDWEDRYYRE NMNRYPNQVY
AGAAAAGAVV GGLGGYMLGS AMSRPMLHFG NDWEDRYYRE NMYRYPNQVY
AGAAAAGAVV GGLGGYMLGS AMSRPMIHFG NDWEDRYYRE NMYRYPNQVY
AGAAAAGAVV GGLGGYMLGS AMSRPLIHFG NDYEDRYYRE NMYRYPNQVY
AGAAAAGAVV GGLGGYMLGS AMSRPLIHFG NDYEDRYYRE NMYRYPNQVY
AGAAAAGAVV GGLGGYMLGS AMSRPIIHFG SDYEDRYYRE NMHRYPNQVY
AGAAAAGAVV GGLGGYMLGS AMSRPIIHFG SDYEDRYYRE NMHRYPNQVY
AGAAAAGAVV GGLGGYMLGS AMSRPIIHFG SDYEDRYYRE NMHRYPNQVY
AGAAAAGAVV GGLGGYMLGS AMSRPIIHFG NDYEDRYYRE NMYRYPNQVY
AGAAAAGAVV GGLGGYMLGS AMSRPLIHFG NDYEDRYYRE NMYRYPNQVY
AGAAAAGAVV GGLGGYMLGS AMSRPLIHFG NDYEDRYYRE NMYRYPNQVY
AGAAAAGAVV GGLGGYMLGS AMSRPLIHFG NDYEDRYYRE NMYRYPNQVY
AGAAAAGAVV GGLGGYMLGS AMSRPLIHFG NDYEDRYYRE NMYRYPNQVY
AGAAAAGAVV GGLGGYMLGS AMSRPLIHFG NDYEDRYYRE NMYRYPNQVY
AGAAAAGAVV GGLGGYMLGS AMSRPLIHFG NDYEDRYYRE NMYRYPNQVY
AGAAAAGAVV GGLGGYMLGS AMSRPLIHFG NDYEDRYYRE NMYRYPNQVY
AGAAAAGAVV GGLGGYMLGS AMSRPLIHFG NDYEDRYYRE NMYRYPNQVY
YRPVDQYSNQ NNFVHDCVNI TVKEHTVTTT TKGENFTETD IKMMERVVEQ
YRPVDQYSNQ NNFVHDCVNI TVKQHTVTTT TKGENFTETD IKMMERVVEQ
YRPVDQYSNQ NNFVHDCVNV TIKQHTVTTT TKGENFTETD VKMMERVVEQ
YRPVDRYSNQ NNFVHDCVNI TVKQHTVTTT TKGENFTETD IKIMERVVEQ
YRPVDQYNNQ NTFVHDCVNI TVKQHTVTTT TKGENFTETD IKMMERVVEQ
YKPVDQYSNQ NNFVHDCVNI TVKQHTVTTT TKGENFTETD MKIMERVVEQ
YRPVDQYNNQ NNFVHDCVNI TIKQHTVTTT TKGENFTETD IKIMERVVEQ
YRPVDQYSNQ NNFVHDCVNI TIKQHTVTTT TKGENFTETD VKMMERVVEQ
YRPVDQYSNQ NNFVHDCVNI TIKQHTVTTT TKGENFTETD VKMMERVVEQ
YRPVDQYSNQ NNFVHDCVNI TIKQHTVTTT TKGENFTETD VKMMERVVEQ
YRPVDQYNNQ NNFVHDCVNI TIKQHTVTTT TKGENFTETD VKMMERVVEQ
YRPMDEYSNQ NNFVHDCVNI TIKQHTVTTT TKGENFTETD VKMMERVVEQ
YRPMDQYSNQ NNFVHDCVNI TIKQHTVTTT TKGENFTETD VKMMERVVEQ
YRPMDQYSSQ NNFVHDCVNI TIKQHTVTTT TKGENFTETD VKMMERVVEQ
YRPVDQYSNQ NNFVHDCVNI TIKQHTVTTT TKGENFTETD VKMMERVVEQ
YRPVDQYSNQ NNFVHDCVNI TIKQHTVTTT TKGENFTETD VKMMERVVEQ
YRPVDQYSNQ NNFVHDCVNI TIKQHTVTTT TKGENFTETD VKMMERVVEQ
YRPVDQYSNQ NNFVHDCVNI TIKQHTVTTT TKGENFTETD VKMMERVVEQ
YRPVDQYSNQ NNFVHDCVNI TIKQHTVTTT TKGENFTETD VKMMERVVEQ
YRPVDQYSNQ NNFVHDCVNI TIKQHTVTTT TKGENFTETD VKMMERVVEQ
YRPVDQYNNQ NNFVHDCVNI TIKQHTVTTT TKGENFTETD VKMMERVVEQ
YRPVDQYSNQ NNFVHDCVNI TIKQHTVTTT TKGENFTETD VKIMERVVEQ
YRPVDQYSNQ NNFVHDCVNI TIKQHTVTTT TKGENFTETD VKMMERVVEQ
MCITQYQRES QAYY--QRGA SVILFSSPPV ILLISFLIFL IVG
MCITQYQRES EAYY--QRGA SVILFSSPPV ILLISFLIFL IVG
MCITQYEKES QAYY--QRGS SMVLFSSPPV ILLISFLIFL IVG
MCITQYQRES QAYY--QRGA SVILFSSPPV ILLISFLIFL IVG
MCITQYQRES QAYY--QRGA SVILFSSPPV ILLISFLIFL IVG
MCVTQYQRES EAYY--QRGA SAILFSPPPV ILLISLLILL IVG
MCTTQYQKES QAYYDGRRSS A-VLFSSPPV ILLISFLIFL MVG
MCVTQYQKES QAYYDGRRSS A-VLFSSPPV ILLISFLIFL IVG
MCVTQYQKES QAYYDGRRSS STVLFSSPPV ILLISFLIFL IVG
MCITQYEKES QAYY--QRGS SMVLFSSPPV ILLISFLIFL IVG
MCITQYERES QAYY--QRGS SMVLFSSPPV ILLISFLI?? ???
MCITQYERES QAYY--QRGS SMVLFSSPPV ILLISFLIFL IVG
MCITQYERES QAYY--QRGS SMVLFSSPPV ILLISFLIFL IVG
MCITQYERES QAYY--QRGS SMVLFSSPPV ILLISFLIFL IVG
MCITQYERES QAYY--QRGS SMVLFSSPPV ILLISFLIFL IVG
MCITQYEKES QAYY--QRGS SMVFFSSPPV ILLISFLIFL IVG
MCITQYEKES QAYY--QRGS SMVLFSSPPV ILLISFLIFL IVG
MCITQYEKES QAYY--QRGS SMVLFSSPPV ILLISFLI?? ???
MCITQYEKES QAYY--QRGS SMVLFSSPPV ILLISFLIFL IVG
MCITQYEKES QAYY--QRGS SMVLFSSPPV ILLISFLI?? ???
MCITQYEKES QAYY--QRGS SMVLFSSPPV ILLISFLIFL IVG
MCITQYEKES QAYY--QRGS SMVLFSSPPV ILLISFL??? ???
MCITQYERES QAYY--QRGS SMVLFSSPPV ILLISFLIFL IVGDon Gilbert's program ReadSeq can also be used to change GCG MSF format into something acceptable to PHYLIP use. However, as stated above, ReadSeq does not allow you to only choose a portion of an alignment, nor does it automatically convert dots to hyphens. Therefore, I will only show its use, but not require you to use it. ReadSeq runs a bit backward from what you are used to. It first prompts you for an appropriate output file name, then you choose 12 off of its menu for the current PHYLIP format, then you designate an input sequence (Do not use the GCG {*} designator; this is not a GCG program.), and finally you specify "All" the sequences. When the program again asks for an input sequence, press return, and let it do its thing. A sample screen trace is shown below:
% readseq
readSeq (1Feb93), multi-format molbio sequence reader.
Name of output file (?=help, defaults to display): prion.phy
1. IG/Stanford 10. Olsen (in-only)
2. GenBank/GB 11. Phylip3.2
3. NBRF 12. Phylip
4. EMBL 13. Plain/Raw
5. GCG 14. PIR/CODATA
6. DNAStrider 15. MSF
7. Fitch 16. ASN.1
8. Pearson/Fasta 17. PAUP/NEXUS
9. Zuker (in-only) 18. Pretty (out-only)
Choose an output format (name or #): 12
Name an input sequence or -option: prion.msf
Sequences in prion.msf (format is 15. MSF)
1) prio_bovin Len: 243 Check: 4745 Weight: 1.00
2) prio_trast Len: 243 Check: 4884 Weight: 1.00
3) prio_saisc Len: 243 Check: 5750 Weight: 1.00
4) prio_sheep Len: 243 Check: 6835 Weight: 1.00
5) prio_odohe Len: 243 Check: 6930 Weight: 1.00
6) prio_musvi Len: 243 Check: 5440 Weight: 1.00
7) prio_mesau Len: 243 Check: 7778 Weight: 1.00
8) prio_rat Len: 243 Check: 8967 Weight: 1.00
9) prio_mouse Len: 243 Check: 1154 Weight: 1.00
10) prio_cerae Len: 243 Check: 6484 Weight: 1.00
11) prio_atege Len: 243 Check: 3328 Weight: 1.00
12) prio_human Len: 243 Check: 6101 Weight: 1.00
13) prio_gorgo Len: 243 Check: 6605 Weight: 1.00
14) prio_pantr Len: 243 Check: 6830 Weight: 1.00
15) prio_ponpy Len: 243 Check: 7373 Weight: 1.00
16) prio_prefr Len: 243 Check: 6794 Weight: 1.00
17) prio_macfa Len: 243 Check: 7019 Weight: 1.00
18) prio_mansp Len: 243 Check: 4939 Weight: 1.00
19) prio_colgu Len: 243 Check: 7345 Weight: 1.00
20) prio_calmo Len: 243 Check: 6003 Weight: 1.00
21) prio_calja Len: 243 Check: 6517 Weight: 1.00
22) prio_aottr Len: 243 Check: 4303 Weight: 1.00
23) prio_cebap Len: 243 Check: 7009 Weight: 1.00
Choose a sequence (# or All): all
Name an input sequence or -option: <rtn>To repeat, if you use ReadSeq to convert from GCG MSF to PHYLIP, you must make a couple more essential changes before it is correct for PHYLIP. First of all, periods will not work to represent indels (gaps); they must all be changed to hyphens, Unlike ToPHYLIP, ReadSeq does not do this step for you. The following UNIX command works very well for this step from the command line:
% tr \. \- < filein > fileout
Secondly, just as we did with the ToPHYLIP output, you must evaluate the terminal ends of your data matrix; if any of the implied indels are uncertain (especially true if sequence lengths were different), then substitute question marks, for the new hyphens. The differences are quite subtle between the two conversion programs' outputs. Either will work exactly the same in PHYLIP (if you've trimmed the divergent ends first before using ReadSeq). ToPHYLIP will require much less editing work, but it will only do GCG to PHYLIP conversions. ReadSeq is more flexible but requires some major editing changes.
5) Phylogenetic tree estimation -- GCG and PHYLIP.
5.a.1. Distance methods: GCG's Distances and GrowTree.
With the data in a usable format, it is time to go on to evolutionary studies. First let's use the original GCG MSF file to check out GCG's programs. Presently GCG only includes distance methods for phylogenetic inference. The evolutionary distance between each sequence must first be calculated, compensating for multiple substitutions at homologous sites, and then those distances can be used in their neighbor-joining algorithm to estimate the best evolutionary tree for the data. The distance calculation program is appropriately named Distances. Please use GCG's online GenManual help system or the hard-copy program manual in your carrel drawers to read their complete description of the Molecular Evolution program Distances -- they describe problems and considerations for this whole area very well. Since Distances is a GCG program, the {*} designation is required. You can accept the default parameters for the program, however, I encourage you to learn about the various evolutionary models incorporated. Also, if you did not trim your divergent ends with SeqLab, you'll have to trim them here with optional -begin and -end constraints on the command line. Type distances to start the procedure. An abridged screen trace of the process follows:
% distances
Distances creates a table of the pairwise distances within a group of
aligned sequences.
DISTANCES for what aligned sequences ? prion.msf{*}
Reading sequences...
prio_bovin: 243 total, 243 read
prio_trast: 243 total, 243 read
prio_saisc: 243 total, 243 read
prio_sheep: 243 total, 243 read
prio_odohe: 243 total, 243 read
prio_musvi: 243 total, 243 read
prio_mesau: 243 total, 243 read
prio_rat: 243 total, 243 read
prio_mouse: 243 total, 243 read
prio_cerae: 243 total, 243 read
prio_atege: 243 total, 243 read
prio_human: 243 total, 243 read
prio_gorgo: 243 total, 243 read
prio_pantr: 243 total, 243 read
prio_ponpy: 243 total, 243 read
prio_prefr: 243 total, 243 read
prio_macfa: 243 total, 243 read
prio_mansp: 243 total, 243 read
prio_colgu: 243 total, 243 read
prio_calmo: 243 total, 243 read
prio_calja: 243 total, 243 read
prio_aottr: 243 total, 243 read
prio_cebap: 243 total, 243 read
Distances will be computed for 23 protein sequences.
Which distance correction method to use ?
1 Uncorrected distance
2 Jukes-Cantor distance
3 Kimura protein distance
Choose the method to use: (* 3 *) <rtn>
What should I call the distance matrix file (* prion.distances *) ?
<rtn>
Computing distances using Kimura method...
1 x 2: 1.69 1 x 3: 6.19
1 x 4: 3.10 1 x 5: 3.10
1 x 6: 6.33 1 x 7: 9.35
1 x 8: 9.57 1 x 9: 8.85
1 x 10: 6.63 1 x 11: 5.92
1 x 12: 5.44 1 x 13: 4.97
1 x 14: 5.44 1 x 15: 5.44
1 x 16: 6.39 1 x 17: 6.39
/////////////////////////////////
19 x 21: 1.32 19 x 22: 2.27
19 x 23: 2.21 20 x 21: 1.35
20 x 22: 1.81 20 x 23: 2.26
21 x 22: 2.28 21 x 23: 2.66
22 x 23: 3.21If your sequences are too divergent, then you will get a warning message at the conclusion of the run stating how many of your sequence pairs have a distance greater than 70 with the Kimura protein correction model. The Kimura model overestimates distances when they are greater than 70 substitutions per 100 residues. This may lead to interpretive problems with the most diverged sequences in the alignment. Reliablity can be increased in these cases by further editing of the initial alignment to exclude the most diverse regions and/or sequences. The GCG Distances program excludes positions with gaps when using Kimura protein corrections and only exact matches contribute to the match score. It is not the most robust of methods; several PHYLIP alternatives to the GCG programs will also be explored this week.
To infer the neighbor-joining tree of the distance data run GCG's GrowTree program. Again, all defaults can be used for this example. Whatever you do, do not use UPGMA. UPGMA imposes an absolutely uniform molecular clock across all of your data. This is seldom, if ever, the situation. A screen trace follows:
% growtree
GrowTree creates a phylogenetic tree from a distance matrix created by
Distances using either the UPGMA or neighbor-joining method. You can create
a text or graphics output file.
What is the distance matrix ? prion.distances
Which method to use ?
1 Neighbor-joining
2 UPGMA
Choose the method to use: (* 1 *) <rtn>
What should I call the trees file (* prion.trees *) ? <rtn>
21 internal, 23 terminal nodes
The minimum density for a one-page plot is 16.0 taxa/100 platen units.
What density do you want (* 16.0 *) ? <rtn>
That will take 1 page. Is this all right (* yes *) ? <rtn>
When your VERSATERM-TEK4105 attached to tty is ready, press <Return>.
<rtn>

Vertical branch lengths are proportional to evolutionary divergence and have the unit of substitutions per 100 characters. The exact numbers can be seen in the program's output .trees file. This tree file is written in a nested parenthetical connotation known as the Newick standard. Many phylogeny programs, including PHYLIP, can work with this format of data. I recognize some problems with the GCG inference. Mainly Mink (prio_musvi) should not be included in the Artiodactyl clade and the branching order in the primate clade seems a bit odd. This may be because all of the primate sequences are so similar to each other; in fact, it show Gorillas and Orangutans (prio_gorgo and prio_ponpy) having negative branch lengths.
5.a.2. Another estimate of the distance matrix: PHYLIP's ProtDist.
Let's use the PHYLIP package now to see how its results differ. First I will show the use of a couple of the distance methods in the package. All the distance methods require that a distance matrix first be prepared from the alignment matrix. The GCG distance matrix is not suitable as input to PHYLIP. The programs are all triggered by entering their name. If you have a file in the present directory named "infile," the program will automatically read that, otherwise it will request an input file name. As discussed in the introduction, the way that PHYLIP is usually installed prevents you from specifying the names of its output files. Once in the program a menu comes up allowing you to pick the desired options. Several will be taken advantage of in this exercise. When the program has all the desired changes made to the default settings, the user responds with y to the prompt about the settings being correct, and off the process goes.
PHYLIP's protein distance matrix program, ProtDist, allows the use of three alternate multiple substitution correction models. Refer to the documentation for an explanation of the models. The default PAM model is much more robust than the Kimura correction previously used. However, it takes a while longer to run. It should be tolerable with the size of datasets that we are using though. When the program asks you for an input file, specify your converted alignment. At the "Are these settings correct?" prompt type "y" for yes. Issue the following command line to produce the screen trace below:
% protdist protdist: can't read infile Please enter a new filename>prion.phy Protein distance algorithm, version 3.572c Settings for this run: P Use PAM, Kimura or categories model? Dayhoff PAM matrix M Analyze multiple data sets? No I Input sequences interleaved? Yes 0 Terminal type (IBM PC, VT52, ANSI)? ANSI 1 Print out the data at start of run No 2 Print indications of progress of run Yes Are these settings correct? (type Y or the letter for one to change) y Computing distances: prio_bovin prio_trast . prio_saisc .. prio_sheep ... prio_odohe .... prio_musvi ..... prio_mesau ...... prio_rat ....... prio_mouse ........ prio_cerae ......... prio_atege .......... prio_human ........... prio_gorgo ............ prio_pantr ............. prio_ponpy .............. prio_prefr ............... prio_macfa ................ prio_mansp ................. prio_colgu .................. prio_calmo ................... prio_calja .................... prio_aottr ..................... prio_cebap ...................... Output written to output file
The distance matrix output file is named outfile by default. Immediately rename this file so that the next program does not overwrite it!
% mv outfile prion.phydist
The distance matrix is probably wider than your screen so it may not look very good when displayed but you're welcome to take a look.
5.a.2.1. Fit the best tree to the distance data: PHYLIP's Fitch.
Next we can pass the distance matrix to one of several distance programs in PHYLIP. I'll start with Fitch, a least-squares fit algorithm. Least-square fit is probably the most powerful way of estimating a tree from distance data but it is fairly computationally intense. Notice the time it takes to run in the following session. When asked for an input file you need to specify the input distance matrix. Note the options I choose below. One option that we will be using, if it is appropriate in your dataset, is the designation of an outgroup; specify o and designate the number of your outgroup sequence based on its order in the input file. This is often the most diverged sequence in your dataset and is used to root the inferred tree. Its designation should be based on external knowledge, i.e. if you know, based on other studies, that a particular sequence is not of the same clade as the rest of your sequences , then you can designate it as an outgroup. In all cases, your designated outgroup should be as close to your ingroup as possible without actually being a part of it. In my example dataset I'm not sure what to choose as an appropriate outgroup but I do know that the Mink sequence is a member of Carnivora and no other members of the dataset belong to this family; therefore, I will designate it as the outgroup.
You should always improve tree reliability by taking advantage of the global rearrangement and jumble (randomize) sequence order options. The global rearangement option adds the extra step of breaking up and recombining the tree found after all species have been added in an attempt to find an even better tree. This optimization procedure helps prevent you from getting caught on a local "tree-space" optima and roughly triples the run time of each pass through the algortithm. The jumble option is also quite powerful and should usually be taken advantage of since the results of most of the tree construction algorithms are dependent on the order of sequence input. It is a very good idea to jumble multiple times, at least ten; however, here, in the interest of time, we will only jumble once. Supply jumble with any odd number as a random number seed. The Fitch screen trace follows:
% fitch fitch: can't read infile Please enter a new filename>prion.phydist Fitch-Margoliash method version 3.572c Settings for this run: U Search for best tree? Yes P Power? 2.00000 - Negative branch lengths allowed? No O Outgroup root? No, use as outgroup species 1 L Lower-triangular data matrix? No R Upper-triangular data matrix? No S Subreplicates? No G Global rearrangements? No J Randomize input order of species? No. Use input order M Analyze multiple data sets? No 0 Terminal type (IBM PC, VT52, ANSI)? ANSI 1 Print out the data at start of run No 2 Print indications of progress of run Yes 3 Print out tree Yes 4 Write out trees onto tree file? Yes Are these settings correct? (type Y or the letter for one to change) o Type number of the outgroup: 6 Fitch-Margoliash method version 3.572c Settings for this run: U Search for best tree? Yes P Power? 2.00000 - Negative branch lengths allowed? No O Outgroup root? Yes, at species number 6 L Lower-triangular data matrix? No R Upper-triangular data matrix? No S Subreplicates? No G Global rearrangements? No J Randomize input order of species? No. Use input order M Analyze multiple data sets? No 0 Terminal type (IBM PC, VT52, ANSI)? ANSI 1 Print out the data at start of run No 2 Print indications of progress of run Yes 3 Print out tree Yes 4 Write out trees onto tree file? Yes Are these settings correct? (type Y or the letter for one to change) g Fitch-Margoliash method version 3.572c Settings for this run: U Search for best tree? Yes P Power? 2.00000 - Negative branch lengths allowed? No O Outgroup root? Yes, at species number 6 L Lower-triangular data matrix? No R Upper-triangular data matrix? No S Subreplicates? No G Global rearrangements? Yes J Randomize input order of species? No. Use input order M Analyze multiple data sets? No 0 Terminal type (IBM PC, VT52, ANSI)? ANSI 1 Print out the data at start of run No 2 Print indications of progress of run Yes 3 Print out tree Yes 4 Write out trees onto tree file? Yes Are these settings correct? (type Y or the letter for one to change) j Random number seed (must be odd)? 321 Number of times to jumble? 1 Fitch-Margoliash method version 3.572c Settings for this run: U Search for best tree? Yes P Power? 2.00000 - Negative branch lengths allowed? No O Outgroup root? Yes, at species number 6 L Lower-triangular data matrix? No R Upper-triangular data matrix? No S Subreplicates? No G Global rearrangements? Yes J Randomize input order of species? Yes (seed = 321, 1 times) M Analyze multiple data sets? No 0 Terminal type (IBM PC, VT52, ANSI)? ANSI 1 Print out the data at start of run No 2 Print indications of progress of run Yes 3 Print out tree Yes 4 Write out trees onto tree file? Yes Are these settings correct? (type Y or the letter for one to change) y Adding species: prio_musvi prio_rat prio_colgu prio_cebap prio_aottr prio_saisc prio_cerae prio_bovin prio_mouse prio_atege prio_ponpy prio_trast prio_gorgo prio_pantr prio_prefr prio_odohe prio_calja prio_sheep prio_calmo prio_mansp prio_mesau prio_human prio_macfa Doing global rearrangements !---------------------! ............................................................................. ..... Output written to output file Tree also written onto file
Fitch produces both a tree file called treefile and an output text file called outfile. Again, immediately rename them. Use names that make sense and that identify the process used to create them; it's very important to keep track of what all the files are.
% mv outfile prion.fitch % mv treefile prion.fitchtree
The output text file shows the least-squares fit formula used, an ASCII representation of the tree, and it lists all branch lengths. The text output from this Fitch run is shown below:
23 Populations
Fitch-Margoliash method version 3.572c
__ __ 2
\ \ (Obs - Exp)
Sum of squares = /_ /_ ------------
2
i j Obs
Negative branch lengths not allowed
global optimization
+prio_sheep
!
! +prio_odohe
! +-14
! +-10 +prio_trast
! ! !
! ! +prio_bovin
! !
! ! +prio_calmo
! ! +-17
! ! ! +prio_calja
! ! !
! ! ! +prio_aottr
! ! ! +--3
! ! ! ! +prio_saisc
! ! +--9 !
! ! ! ! ! +prio_macfa
! ! ! ! +-15 +-21
-16--6 ! ! ! ! +-18 +prio_mansp
! ! ! ! ! ! ! !
! ! ! ! ! ! +--5 +prio_cerae
! ! +--1 +--8 ! ! !
! ! ! ! ! +-13 +prio_colgu
! ! ! ! ! !
! ! ! ! ! +prio_cebap
! ! ! ! !
! ! +--4 ! +prio_prefr
! ! ! ! !
! ! ! ! +prio_atege
! ! ! !
! ! ! ! +-prio_mesau
! ! ! +-19
! ! ! ! +prio_mouse
! +--2 +--7
! ! +prio_rat
! !
! ! +prio_human
! ! +-20
! ! +-12 +prio_pantr
! ! ! !
! +-11 +prio_gorgo
! !
! +prio_ponpy
!
+-prio_musvi
remember: (although rooted by outgroup) this is an unrooted tree!
Sum of squares = 4.80510
Average percent standard deviation = 9.76419
examined 4447 trees
Between And Length
------- --- ------
16 prio_sheep 0.01045
16 6 0.00491
6 10 0.01086
10 14 0.00165
14 prio_odohe 0.00000
14 prio_trast 0.00570
10 prio_bovin 0.00432
6 2 0.02787
2 4 0.00340
4 1 0.00041
1 9 0.00146
9 17 0.00145
17 prio_calmo 0.00550
17 prio_calja 0.00641
9 8 0.00073
8 15 0.00076
15 3 0.00270
3 prio_aottr 0.01072
3 prio_saisc 0.00785
15 13 0.00114
13 5 0.00207
5 18 0.00705
18 21 0.00000
21 prio_macfa 0.00000
21 prio_mansp 0.00473
18 prio_cerae 0.00000
5 prio_colgu 0.00206
13 prio_cebap 0.01526
8 prio_prefr 0.01069
1 prio_atege 0.00630
4 19 0.02846
19 prio_mesau 0.02857
19 7 0.01742
7 prio_mouse 0.00389
7 prio_rat 0.00882
2 11 0.00491
11 12 0.01381
12 20 0.00012
20 prio_human 0.00434
20 prio_pantr 0.00437
12 prio_gorgo 0.00000
11 prio_ponpy 0.00000
16 prio_musvi 0.04012Notice that the Mink sequence is forced to lie to the outside of everything else by its designation as the outgroup, although it is still strongly associated with the Artiodactyls. Surprisingly, the rodents are grouped within the primate clade. Weird phylogenies can be the result of several factors: bad alignments, insufficient data, or horizontal gene transfer. Take a look at the tree file to get a feel for the Newick format. The corresponding Fitch tree file follows:
(prio_sheep:0.01045,(((prio_odohe:0.00000,prio_trast:0.00570):0.00165, prio_bovin:0.00432):0.01086,(((((prio_calmo:0.00550,prio_calja:0.00641) :0.00145,(((prio_aottr:0.01072,prio_saisc:0.00785):0.00270,((((prio_macfa: 0.00000,prio_mansp:0.00473):0.00000,prio_cerae:0.00000):0.00705, prio_colgu:0.00206):0.00207,prio_cebap:0.01526):0.00114):0.00076, prio_prefr:0.01069):0.00073):0.00146,prio_atege:0.00630):0.00041, (prio_mesau:0.02857,(prio_mouse:0.00389,prio_rat:0.00882):0.01742):0.02846): 0.00340,(((prio_human:0.00434,prio_pantr:0.00437):0.00012,prio_gorgo:0.00000): 0.01381,prio_ponpy:0.00000):0.00491):0.02787):0.00491,prio_musvi:0.04012);
Many PHYLIP programs generate tree file data for an evolutionary tree. Two PHYLIP drawing routines, DrawTree and DrawGram, can produce graphical tree representations from tree files. However, we will delay using them until we've run through all the PHYLIP tree inference programs that I want to introduce.
5.a.2.2. Bootstrapping neighbor-joining techniques.
Next I will show how to run a bootstrapped neighbor-joining analysis. Bootstrapping is a statistical method for ascertaining input data reliability. It randomly selects column subsets of your alignment data matrix up to the same overall size as the original to create as many test sets as specified. A consensus is then made of all the resultant analyses; the longest branch lengths are those best resolved. Normally bootstrapping is done with a minimum of 100 replicates. Because of this, the neighbor-joining distance based method is often used to estimate the best tree of the results because it is so much faster than all other methods even though it can also be less reliable. All methods accept the multiple dataset input option -- they all have tradeoffs, one way or another. Use SeqBoot to bootstrap your original PHYLIP compatible input data set using the following command line and screen trace as a guide:
% seqboot seqboot: can't read infile Please enter a new filename>prion.phy Random number seed (must be odd)? 963 Bootstrapped sequences algorithm, version 3.572c Settings for this run: D Sequence, Morph, Rest., Gene Freqs? Molecular sequences J Bootstrap, Jackknife, or Permute? Bootstrap R How many replicates? 100 I Input sequences interleaved? Yes 0 Terminal type (IBM PC, VT52, ANSI)? ANSI 1 Print out the data at start of run No 2 Print indications of progress of run Yes Are these settings correct? (type Y or the letter for one to change) y completed replicate number 10 completed replicate number 20 completed replicate number 30 completed replicate number 40 completed replicate number 50 completed replicate number 60 completed replicate number 70 completed replicate number 80 completed replicate number 90 completed replicate number 100 Output written to output file
Rename the outfile before doing anything else!
% mv outfile prion.seqboot
The output alignment data matrix now contains 100 randomly selected sequence data matrices. To generate 100 distance matrices, launch ProtDist, being very careful to specify multiple datasets with the m option. This time I will also take advantage of the speed of Kimura's approximation model, although, as stated previously, Dayhoff's model often seems to offer better estimates of protein divergence. Therefore, depending on how much time you want to wait, either change the correction model to Kimura by specifying p at the option menu or leave it with the PAM model. Issue the following command line, to see the accompanying, much abridged screen trace (This will take a while to run, especially if you use the PAM correction model!):
% protdist protdist: can't read infile Please enter a new filename>prion.seqboot Protein distance algorithm, version 3.572c Settings for this run: P Use PAM, Kimura or categories model? Dayhoff PAM matrix M Analyze multiple data sets? No I Input sequences interleaved? Yes 0 Terminal type (IBM PC, VT52, ANSI)? ANSI 1 Print out the data at start of run No 2 Print indications of progress of run Yes Are these settings correct? (type Y or the letter for one to change) p Protein distance algorithm, version 3.572c Settings for this run: P Use PAM, Kimura or categories model? Kimura formula M Analyze multiple data sets? No I Input sequences interleaved? Yes 0 Terminal type (IBM PC, VT52, ANSI)? ANSI 1 Print out the data at start of run No 2 Print indications of progress of run Yes Are these settings correct? (type Y or the letter for one to change) m How many data sets? 100 Protein distance algorithm, version 3.572c Settings for this run: P Use PAM, Kimura or categories model? Kimura formula M Analyze multiple data sets? Yes, 100 sets I Input sequences interleaved? Yes 0 Terminal type (IBM PC, VT52, ANSI)? ANSI 1 Print out the data at start of run No 2 Print indications of progress of run Yes Are these settings correct? (type Y or the letter for one to change) y Data set # 1: Computing distances: prio_bovin prio_trast . prio_saisc .. prio_sheep ... prio_odohe .... prio_musvi ..... prio_mesau ...... prio_rat ....... prio_mouse ........ prio_cerae ......... prio_atege .......... prio_human ........... prio_gorgo ............ prio_pantr ............. prio_ponpy .............. prio_prefr ............... prio_macfa ................ prio_mansp ................. prio_colgu .................. prio_calmo ................... prio_calja .................... prio_aottr ..................... prio_cebap ...................... Output written to output file Data set # 2: Computing distances: prio_bovin prio_trast . prio_saisc .. prio_sheep ... prio_odohe .... prio_musvi ..... prio_mesau ...... prio_rat ....... prio_mouse ........ prio_cerae ......... prio_atege .......... prio_human ........... prio_gorgo ............ prio_pantr ............. prio_ponpy .............. prio_prefr ............... prio_macfa ................ prio_mansp ................. prio_colgu .................. prio_calmo ................... prio_calja .................... prio_aottr ..................... prio_cebap ...................... Output written to output file ////////////////////////////////////////////////////////////////////////////// Data set # 100: Computing distances: prio_bovin prio_trast . prio_saisc .. prio_sheep ... prio_odohe .... prio_musvi ..... prio_mesau ...... prio_rat ....... prio_mouse ........ prio_cerae ......... prio_atege .......... prio_human ........... prio_gorgo ............ prio_pantr ............. prio_ponpy .............. prio_prefr ............... prio_macfa ................ prio_mansp ................. prio_colgu .................. prio_calmo ................... prio_calja .................... prio_aottr ..................... prio_cebap ...................... Output written to output file
Rename the outfile:
% mv outfile prion.bootdist
Now submit the 100 matrix distance file to Neighbor. The program will produce a huge output text file and an output tree file, each containing 100 trees. Be sure to use the multiple option and specify 100 multiple datasets. Always randomize the input order with the jumble option and if it really matters make multiple runs with different seed numbers and compare results. Also specify your outgroup if you have one. Issue the following command line to see the drastically shortened screen trace below:
% neighbor neighbor: can't read infile Please enter a new filename>prion.bootdist Neighbor-Joining/UPGMA method version 3.5 Settings for this run: N Neighbor-joining or UPGMA tree? Neighbor-joining O Outgroup root? No, use as outgroup species 1 L Lower-triangular data matrix? No R Upper-triangular data matrix? No S Subreplicates? No J Randomize input order of species? No. Use input order M Analyze multiple data sets? No 0 Terminal type (IBM PC, VT52, ANSI)? ANSI 1 Print out the data at start of run No 2 Print indications of progress of run Yes 3 Print out tree Yes 4 Write out trees onto tree file? Yes Are these settings correct? (type Y or the letter for one to change) o Type number of the outgroup: 6 Neighbor-Joining/UPGMA method version 3.5 Settings for this run: N Neighbor-joining or UPGMA tree? Neighbor-joining O Outgroup root? Yes, at species number 6 L Lower-triangular data matrix? No R Upper-triangular data matrix? No S Subreplicates? No J Randomize input order of species? No. Use input order M Analyze multiple data sets? No 0 Terminal type (IBM PC, VT52, ANSI)? ANSI 1 Print out the data at start of run No 2 Print indications of progress of run Yes 3 Print out tree Yes 4 Write out trees onto tree file? Yes Are these settings correct? (type Y or the letter for one to change) j Random number seed (must be odd)? 9731 Neighbor-Joining/UPGMA method version 3.5 Settings for this run: N Neighbor-joining or UPGMA tree? Neighbor-joining O Outgroup root? Yes, at species number 23 L Lower-triangular data matrix? No R Upper-triangular data matrix? No S Subreplicates? No J Randomize input order of species? Yes (random number seed = 9731) M Analyze multiple data sets? No 0 Terminal type (IBM PC, VT52, ANSI)? ANSI 1 Print out the data at start of run No 2 Print indications of progress of run Yes 3 Print out tree Yes 4 Write out trees onto tree file? Yes Are these settings correct? (type Y or the letter for one to change) m How many data sets? 100 Neighbor-Joining/UPGMA method version 3.5 Settings for this run: N Neighbor-joining or UPGMA tree? Neighbor-joining O Outgroup root? Yes, at species number 23 L Lower-triangular data matrix? No R Upper-triangular data matrix? No S Subreplicates? No J Randomize input order of species? Yes (random number seed = 9731) M Analyze multiple data sets? Yes, 100 sets 0 Terminal type (IBM PC, VT52, ANSI)? ANSI 1 Print out the data at start of run No 2 Print indications of progress of run Yes 3 Print out tree Yes 4 Write out trees onto tree file? Yes Are these settings correct? (type Y or the letter for one to change) y /////////////////////////////////////////////////////////////////////////////// Data set # 100: CYCLE 20: OTU 3 ( 0.01649) JOINS OTU 4 ( 0.02873) CYCLE 19: OTU 1 ( 0.00991) JOINS OTU 2 ( 0.00564) CYCLE 18: NODE 3 ( 0.01693) JOINS NODE 1 ( 0.01299) CYCLE 17: OTU 20 ( 0.03288) JOINS OTU 23 ( 0.80175) CYCLE 16: OTU 22 ( 0.03273) JOINS OTU 21 ( -0.00863) CYCLE 15: OTU 5 ( 0.00835) JOINS OTU 7 ( 0.00405) CYCLE 14: NODE 5 ( 0.00018) JOINS OTU 6 ( -0.00021) CYCLE 13: NODE 22 ( 0.02280) JOINS NODE 20 ( 0.03775) CYCLE 12: NODE 22 ( 0.01337) JOINS NODE 3 ( 0.05477) CYCLE 11: NODE 5 ( 0.02100) JOINS OTU 8 ( 0.00413) CYCLE 10: NODE 22 ( 0.00908) JOINS OTU 19 ( 0.00891) CYCLE 9: NODE 5 ( 0.00832) JOINS OTU 17 ( 0.02166) CYCLE 8: NODE 5 ( 0.00329) JOINS NODE 22 ( 0.00354) CYCLE 7: NODE 5 ( 0.00947) JOINS OTU 15 ( 0.02311) CYCLE 6: OTU 9 ( 0.00846) JOINS OTU 12 ( -0.00022) CYCLE 5: NODE 5 ( 0.00185) JOINS OTU 13 ( 0.00109) CYCLE 4: NODE 5 ( 0.00252) JOINS NODE 9 ( 0.00311) CYCLE 3: OTU 18 ( 0.00871) JOINS OTU 14 ( 0.00401) CYCLE 2: NODE 18 ( 0.00019) JOINS OTU 10 ( -0.00027) CYCLE 1: NODE 18 ( 0.00025) JOINS OTU 11 ( -0.00010) LAST CYCLE: NODE 5 ( 0.00108) JOINS OTU 16 ( 0.02195) JOINS NODE 18 ( 0.00006) Output written on output file Tree written on tree file
The output text file is huge. You may want to look at it, but then delete it as it takes up an awful lot of space in your account. While deleting things it's a good idea to get rid of the SeqBoot'ed alignment and distance matrices also since they are also very large and they will not be needed any more in the exercise. Be sure not to delete the tree file output from this last run of Neighbor; instead, rename it immediately:
% mv treefile prion.neighbor_boottree
To condense all this information down into a single consensus tree with branch lengths proportional to bootstrap confidence levels run the PHYLIP program Consense. Launch the following command line; specify the input tree file and designate the outgroup if you have one. Be careful with this designation. If your outgroup was not the last member of the dataset in your initial input file, as it was in my case, then it probably changed its order after the above neighbor run because you specified it as an outgroup there. Most likely it is now either the first or the last member of the dataset. Check the output treefile from the above step to be sure. Each tree is separated from the next by a semicolon, ";". Use the following abridged screen trace as a guide:
% consense consense: can't read infile Please enter a new filename>prion.neighbor_boottree Majority-rule and strict consensus tree program, version 3.572c Settings for this run: O Outgroup root? No, use as outgroup species 1 R Trees to be treated as Rooted? No 0 Terminal type (IBM PC, VT52, ANSI)? ANSI 1 Print out the sets of species Yes 2 Print indications of progress of run Yes 3 Print out tree Yes 4 Write out trees onto tree file? Yes Are these settings correct? (type Y or the letter for one to change) o Type number of the outgroup: 23 Majority-rule and strict consensus tree program, version 3.572c Settings for this run: O Outgroup root? Yes, at species number 23 R Trees to be treated as Rooted? No 0 Terminal type (IBM PC, VT52, ANSI)? ANSI 1 Print out the sets of species Yes 2 Print indications of progress of run Yes 3 Print out tree Yes 4 Write out trees onto tree file? Yes Are these settings correct? (type Y or the letter for one to change) y Output written to output file Tree also written onto file
Rename both output files:
% mv outfile prion.neighbor_bootconsense
% mv treefile prion.neighbor_bootconsensetree
Check out the text file output from the Consense program. The numbers at the branch forks indicate the bootstrap value. The following abridged display show the results of this Consense run:
Majority-rule and strict consensus tree program, version 3.572c
Species in order:
prio aottr
prio calmo
prio calja
prio atege
prio mansp
prio macfa
prio cerae
prio colgu
prio cebap
prio saisc
prio prefr
prio human
prio pantr
prio gorgo
prio ponpy
prio mesau
prio mouse
prio rat
prio odohe
prio trast
prio bovin
prio sheep
prio musvi
Sets included in the consensus tree
Set (species in order) How many times out of 100.00
********** ********.. ... 97.00
.......... .....***.. ... 96.00
.......... .***...... ... 95.00
.......... ......**.. ... 90.00
.......... .****..... ... 86.00
....***... .......... ... 83.00
.......... .........* *.. 77.00
.......... .**....... ... 66.00
********** ********** *.. 55.00
....*.*... .......... ... 53.00
....****.. .......... ... 44.00
********** *****..... ... 44.00
....*****. .......... ... 36.00
*........* .......... ... 34.00
********** *********. ... 32.00
.......... *****..... ... 30.00
..**...... .......... ... 23.00
**..****** *****..... ... 11.00
*...****** *****..... ... 4.00
....*****. *****..... ... 4.00
Sets NOT included in consensus tree:
Set (species in order) How many times out of 100.00
....**.... .......... ... 34.00
.......... ........*. .*. 33.00
*......... .....***.. ... 29.00
.......... ........** *.. 26.00
....***.*. .......... ... 23.00
********** *********. .*. 21.00
.......*.. *......... ... 21.00
********** *....***.. ... 19.00
**.******* ********.. ... 17.00
.......... .*.*...... ... 17.00
.......... ........** **. 17.00
//////////////////////////////////
...**.**.. .......... ... 1.00
.......... ***.*..... ... 1.00
*.*.****** *****..... ... 1.00
.*..*****. *......... ... 1.00
CONSENSUS TREE:
the numbers at the forks indicate the number
of times the group consisting of the species
which are to the right of that fork occurred
among the trees, out of 100.00 trees
+-------------------prio cebap
!
! +----prio cerae
+-36.0 +-53.0
! ! +-83.0 +----prio mansp
! ! ! !
! +-44.0 +---------prio macfa
! !
! +--------------prio colgu
+--4.0
! ! +--------------prio ponpy
! ! !
! ! +-86.0 +----prio human
! ! ! ! +-66.0
! ! ! +-95.0 +----prio pantr
+--4.0 +-30.0 !
! ! ! +---------prio gorgo
! ! !
! ! +-------------------prio prefr
+-11.0 !
! ! ! +----prio saisc
! ! +---------------------34.0
! ! +----prio aottr
+-44.0 !
! ! +----------------------------------prio calmo
! !
! ! +----prio calja
+-97.0 +-------------------------------23.0
! ! +----prio atege
! !
! ! +---------prio mesau
+-32.0 +-------------------------------96.0
! ! ! +----prio rat
! ! +-90.0
! ! +----prio mouse
+-55.0 !
! ! +-------------------------------------------------prio odohe
! !
! ! +----prio bovin
! +----------------------------------------------77.0
! +----prio trast
!
+-----------------------------------------------------------prio musvi
!
+-----------------------------------------------------------prio sheep
remember: (though rerooted by outgroup) this is an unrooted tree!Notice some nodes are very well resolved with bootstrap values as high as 97%. This is a great indication of the reliability of the phylogenetic inference of those nodes based on the prepared dataset. However, several others, in particular those that have had trouble with other techniques, all reflect very low confidence values with numbers ranging from about 50% on down. In general bootstrap values above about 60% tend to underrepresent the actual confidence level of the branching pattern, whereas numbers below 40% overrepresent confidence levels.
The output tree file follows. Notice that the branch lengths are now representing bootstrap values, not evolutionary distances. We will plot this data later on.
(((((((((prio_cebap:100.0,(((prio_cerae:100.0,prio_mansp:100.0):53.0, prio_macfa:100.0):83.0,prio_colgu:100.0):44.0):36.0,((prio_ponpy:100.0, ((prio_human:100.0,prio_pantr:100.0):66.0,prio_gorgo:100.0):95.0):86.0, prio_prefr:100.0):30.0):4.0,(prio_saisc:100.0,prio_aottr:100.0):34.0): 4.0,prio_calmo:100.0):11.0,(prio_calja:100.0,prio_atege:100.0):23.0): 44.0,(prio_mesau:100.0,(prio_rat:100.0,prio_mouse:100.0):90.0):96.0): 97.0,prio_odohe:100.0):32.0,(prio_bovin:100.0,prio_trast:100.0):77.0): 55.0,prio_musvi:100.0,prio_sheep:100.0);
You should notice that those nodes of low bootstrap confidence correspond to those that seem to bounce about depending on the inference method used. One possible solution is to exclude them from your analysis.
5.b. Parsimony methods: PHYLIP's ProtPars.
Finally I will show how to run one of PHYLIP's parsimony programs, ProtPars. Parsimony analysis directly uses the aligned sequence data matrix as input. The program explores all 'tree space' in an attempt to answer the question: "Which trees lead to the observed alignment with the least number of symbol changes?" ProtPars uses an approximate, heuristic, tree-rearrangement solution, although a much slower branch-and-bound, exact solution program, DNAPenny, is also available in PHYLIP. Begin the program with the command line below; again take advantage of the jumble option. I'll illustrate with only one jumble, but in actual research you should probably jumble a minimum of ten times. Also designate your outgroup, if you have one. Another option available in parsimony is the reconstruction of ancestral node sequences; choose option 5 to designate, if this interests you. This analysis takes a bit longer to run than the previous -- have patience. My interactive screen trace follows:
% protpars protpars: can't read infile Please enter a new filename>prion.phy Protein parsimony algorithm, version 3.572c Setting for this run: U Search for best tree? Yes J Randomize input order of sequences? No. Use input order O Outgroup root? No, use as outgroup species 1 T Use Threshold parsimony? No, use ordinary parsimony M Analyze multiple data sets? No I Input sequences interleaved? Yes 0 Terminal type (IBM PC, VT52, ANSI)? ANSI 1 Print out the data at start of run No 2 Print indications of progress of run Yes 3 Print out tree Yes 4 Print out steps in each site No 5 Print sequences at all nodes of tree No 6 Write out trees onto tree file? Yes Are these settings correct? (type Y or the letter for one to change) j Random number seed (must be odd)? 875 Number of times to jumble? 1 Protein parsimony algorithm, version 3.572c Setting for this run: U Search for best tree? Yes J Randomize input order of sequences? Yes (seed = 875, 1 times) O Outgroup root? No, use as outgroup species 1 T Use Threshold parsimony? No, use ordinary parsimony M Analyze multiple data sets? No I Input sequences interleaved? Yes 0 Terminal type (IBM PC, VT52, ANSI)? ANSI 1 Print out the data at start of run No 2 Print indications of progress of run Yes 3 Print out tree Yes 4 Print out steps in each site No 5 Print sequences at all nodes of tree No 6 Write out trees onto tree file? Yes Are these settings correct? (type Y or the letter for one to change) o Type number of the outgroup: 6 Protein parsimony algorithm, version 3.572c Setting for this run: U Search for best tree? Yes J Randomize input order of sequences? Yes (seed = 875, 1 times) O Outgroup root? Yes, at sequence number 6 T Use Threshold parsimony? No, use ordinary parsimony M Analyze multiple data sets? No I Input sequences interleaved? Yes 0 Terminal type (IBM PC, VT52, ANSI)? ANSI 1 Print out the data at start of run No 2 Print indications of progress of run Yes 3 Print out tree Yes 4 Print out steps in each site No 5 Print sequences at all nodes of tree No 6 Write out trees onto tree file? Yes Are these settings correct? (type Y or the letter for one to change) y Adding species: prio_macfa prio_mouse prio_sheep prio_mesau prio_human prio_musvi prio_bovin prio_ponpy prio_saisc prio_odohe prio_cebap prio_trast prio_mansp prio_calmo prio_colgu prio_pantr prio_calja prio_cerae prio_gorgo prio_aottr prio_atege prio_rat prio_prefr Doing global rearrangements !---------------------------------------------! ............................................. ............................................. Output written to output file Trees also written onto file
Don't forget to rename the output files:
% mv outfile prion.pars
% mv treefile prion.parstree
PHYLIP's parsimony programs do not calculate branch lengths. Dr. Felsenstein does not feel that parsimony does an adequate job of reconstructing branch lengths and so he does not incorporate it into his implementations. The very first part of the output text follows below; notice that a whole bunch of 'shortest' trees were found:
Protein parsimony algorithm, version 3.572c<
50 trees in all found
+--------------------------------------------------------------prio_musvi
!
! +-----prio_odohe
--5 +-----------------------------------------------------4
! ! +-----prio_sheep
! !
! ! +-----prio_cebap
+--3 +----------------------------------------------22
! ! +-----prio_calja
! !
! ! +-----------prio_ponpy
! ! +----------------------------------14
+----20 ! ! +--------prio_pantr
! ! +-13
! ! ! +-----prio_gorgo
! ! +-12
! ! +-----prio_human
+----11
! +-----prio_colgu
! +----------------------------18
! ! +-----prio_prefr
! !
! ! +----------prio_calmo
! ! +-----------------19
+----------15 ! ! +-------prio_mouse
! ! +--8
! ! ! +----prio_rat
! ! +--7
! ! +----prio_mesau
+-----6
! +-----prio_aottr
! +---------21
! ! ! +--prio_atege
! ! +-10
! ! +--prio_cerae
+-----------9
! +--prio_mansp
! +---17
! ! +--prio_macfa
+-------16
! +----prio_saisc
+--2
! +--prio_trast
+-1
+--prio_bovin
remember: (although rooted by outgroup) this is an unrooted tree!
requires a total of 185.000
///////////////////////////////////////////////////////////////////////////
The output tree file is quite simple, 50 trees in the nested Newick parenthetical form. We will use DrawGram to plot the Consense tree graphic of these trees later on. If fed a file containing more than one tree, the drawing programs in the PHYLIP system will only use the first tree given in such a file. If you really want to draw graphics of all the other trees, you will need to copy the file and create a series of tree files with editing so that you can produce output plots of each possible tree that was found. In all cases though, the PHYLIP program Consense can be run to generate the consensus tree from a tree file with multiple trees.
Run Consense on your ProtPars output tree file to find the consensus tree of all the shortest parsimony trees. Don't forget to designate your outgroup, if appropriate. However, again be careful; your outgroup likely changed its order in the file so check the input tree file. Mine switched from being entry six, to being the first entry, 1:
% consense consense: can't read infile Please enter a new filename>prion.parstree Majority-rule and strict consensus tree program, version 3.572c Settings for this run: O Outgroup root? No, use as outgroup species 1 R Trees to be treated as Rooted? No 0 Terminal type (IBM PC, VT52, ANSI)? ANSI 1 Print out the sets of species Yes 2 Print indications of progress of run Yes 3 Print out tree Yes 4 Write out trees onto tree file? Yes Are these settings correct? (type Y or the letter for one to change) o Type number of the outgroup: 1 Majority-rule and strict consensus tree program, version 3.572c Settings for this run: O Outgroup root? Yes, at species number 1 R Trees to be treated as Rooted? No 0 Terminal type (IBM PC, VT52, ANSI)? ANSI 1 Print out the sets of species Yes 2 Print indications of progress of run Yes 3 Print out tree Yes 4 Write out trees onto tree file? Yes Are these settings correct? (type Y or the letter for one to change) y Output written to output file Tree also written onto file
As always, rename the output files:
% mv outfile prion.pars_consense
% mv treefile prion.pars_consensetree
The abridged Consense output text file follows below. Notice that the contribution of each of the 50 input trees is normalized to equal a total of 1.00. This makes the branch lengths a decimal form of percentage, i.e. 0.8 means 80% of the trees had this node. Many of the branch lengths are now 1.0. This means that the same branching pattern about that node appeared in all input trees.
Majority-rule and strict consensus tree program, version 3.572c
Species in order:
prio musvi
prio odohe
prio sheep
prio cebap
prio calja
prio ponpy
prio pantr
prio gorgo
prio human
prio colgu
prio prefr
prio calmo
prio mouse
prio rat
prio mesau
prio aottr
prio atege
prio cerae
prio mansp
prio macfa
prio saisc
prio trast
prio bovin
Sets included in the consensus tree
Set (species in order) How many times out of 1.00
.......... .......... *** 1.00
.......... .****..... ... 1.00
.....****. .......... ... 1.00
.......... ..***..... ... 1.00
.......... ...**..... ... 1.00
.......... ........** ... 1.00
......***. .......... ... 1.00
.......... .......... .** 1.00
.......... .....***.. ... 1.00
.......... ......**.. ... 1.00
.**....... .......... ... 0.92
...**..... .......... ... 0.92
.......**. .......... ... 0.84
.......... .....***** ... 0.70
...******* ********** *** 0.64
.........* *......... ... 0.52
...**..... .****..... ... 0.50
...******* ********** ... 0.46
...**..... .********* ... 0.46
...**....* ********** ... 0.40
Sets NOT included in consensus tree:
Set (species in order) How many times out of 1.00
.......... .********* ... 0.34
.**....... .......... *** 0.30
.........* ********** ... 0.28
.....***** ********** *** 0.28
...**..... ********** ... 0.26
//////////////////////////////////
.......... ........** *** 0.02
...**....* .****...** ... 0.02
.......... .....***.. *** 0.02
CONSENSUS TREE:
the numbers at the forks indicate the number
of times the group consisting of the species
which are to the right of that fork occurred
among the trees, out of 1.00 trees
+---------prio saisc
+---------------------------1.0
! ! +----prio trast
! +--1.0
! +----prio bovin
!
! +----prio rat
! +--1.0
! +--1.0 +----prio mesau
! ! !
! +--1.0 +---------prio mouse
! ! !
! +--0.5 +--------------prio calmo
! ! !
+--0.6 ! ! +----prio calja
! ! ! +------------0.9
! ! +--0.5 +----prio cebap
! ! ! !
! ! ! ! +----prio macfa
! ! ! ! +-------1.0
! ! ! ! ! +----prio mansp
! ! ! +-------0.7
! ! +--0.4 ! +---------prio aottr
! ! ! ! +--1.0
! ! ! ! ! +----prio cerae
! ! ! ! +--1.0
! ! ! ! +----prio atege
+--1.0 ! ! !
! ! +--0.5 ! +----prio prefr
! ! ! +----------------------0.5
! ! ! +----prio colgu
! ! !
! ! ! +---------prio pantr
! ! ! +--1.0
! ! ! ! ! +----prio gorgo
! ! +-----------------1.0 +--0.8
! ! ! +----prio human
! ! !
! ! +--------------prio ponpy
! !
! ! +----prio sheep
! +-------------------------------------0.9
! +----prio odohe
!
+-------------------------------------------------prio musvi
remember: (though rerooted by outgroup) this is an unrooted tree!As seen here, parsimony analysis often finds more than one most parsimonious tree. The PHYLIP output is a list of all equally the most parsimonious trees, not an ordered ranking of the most likely trees. Bootstrapping can help you get a feel for this. Lineages that evolve much faster than others tend to confound many of the algorithms. This is known as "the long branch effect" and leads to the dreaded "Felsenstein Zone" where many algorithms are guaranteed to always find the wrong answer! I am always tempted to exclude sequences that appear to be causing this problem due to their extreme divergence but it is very hard to recognize. Other common problems in phylogenetic inference include sequencing and/or initial alignment mistakes, the occurence of active lateral transfer of genes between the groups being examined, and the fact that some of the organisms may have improper traditional taxonomies leading to a discrepancy between your inferred phylogeny and the normally accepted one.
An algorithm that I am not going to demonstrate, but that I heartily recommend is maximum likelihood. Maximum likelihood is a probabilistic statistical procedure that combines the best features of distance and parsimony methods. Unfortunately it is very slow. Data sets the size that we are dealing with could take many, many hours to process so I am not going to teach it. The PHYLIP maximum likelihood program is DNAML; ProtML is supplied as an adjunct to the package by J. Adachi and M. Hasegawa. These algorithms are arguably the most powerful available for ascertaining phylogenies from sequence data, however, they are also computationally the most involved. They can take a prohibitive amount of time to run depending on the size of your dataset. Dr. Gary Olsen has recoded DNAML to produce fastDNAML; this is also available on ribozyme and is somewhat faster, considerably speeding up the algorithm. The protein implementation, ProtML is particularly challenging because it is not a part of the regular release of PHYLIP; rather it is from different authors and is included with PHYLIP as an unsupported program, only provided as a supplement to the package. It runs somewhat differently than the normal PHYLIP programs and expects slightly different input format. Refer to the online documentation for assistance with this program. The DNA implementation, DNAML, is likewise slow, but it is a standard PHYLIP program and so will run in the normal manner. Gary Olsen's fastDNAML operates quite similarily, but has its own pecularities. If you have need to use any of these program, contact Steve Thompson for assistance.
Another somewhat quicker utilization of maximum likelihood statistics is to test a user defined tree. Within PHYLIP many programs will accept a user defined tree for evaluation and branch length calculations by appending your tree data to the input data. See the documentation files for the particular details of achieving this in different programs. The 'nonclock' parsimony programs require rooted trees while the 'nonclock' maximum likelihood and Fitch programs require their user defined trees to be unrooted. PHYLIP's ReTree program can be used to switch these formats back and forth.
6) Running PHYLIP in the background.
Unlike those GCG programs in which batch mode can help make your life easier, PHYLIP does not write its own command files and submit them automatically. As discussed in the introduction, we have prepared some simple script files for you to modify in order to make the process a little easier. More complicated scripts and documentation are also available for passing the output of one program on to the input of another. However, if the data set is small enough, you can run most PHYLIP programs interactively. If not, you should use some type of script file to run the programs in the background.
An example of one that could be used for a Fitch run is given below. This file must contain all the same parameters that the program would ask you, if you were running it interactively. It uses many of the same options as given in the previous interactive sessions, however, it runs ten jumbles on the dataset. The user needs to change the lines in the file to reflect their own data filename and chosen options. Pound signs, "#," indicate comments. Issue the following command to pull a copy of this simple script from $DOC_DIR to your directory:
% cp $DOC_DIR/phylip/scripts/fitch_parameter.file .
Now edit this file with the pico editor to customize the indicated lines.
prion.phydist o # designate outgroup option 6 # number of outgroup based on order in input file g # global rearangement option turned on j # jumble (randomize) order option turned on 8765 # supply odd random number seed 10 # how many jumbles? y # Are these all the options you want? Yes.
Change the file name given on line one to that for your desired input file. Next, options are indicated in the body of the file in the same order as they would be answered interactively. In this example lines two through six give the option change information just as they were entered in the above interactive session. Line seven changes from one to ten jumbles. If you had wanted to just use the default parameters, then only the y for "yes" is necessary. You must have at least the name of your input file and the y prompt for the program to run.
To run your script in the background, launch the name of the PHYLIP program that you want to use followed by appropriate file redirection and an ampersand, "&." All responses to the usual interactive menu need to be in your input fitch_parameter.file and you need to tell the program to use that script as input. When your script is in the shape you want, test it out using the following command line. The programs still produce their standard outfile and treefile containing the usual program output. These can cause problems if you are running other PHYLIP jobs in the same directory since the file names will overwrite one-another. Therefore, be sure to devote separate directories to each concurrent PHYLIP job that you run. The corresponding terminal output file will list potential problems if the job bombs.
% fitch < fitch_parameter.file > terminal &
This process will immediately launch the job in the background and you can go about whatever other computing tasks you would like. You can even log out of your current session and the background job will continue to run to completion. A handy UNIX command to check on the status of all of your currently running jobs is a variation of the process command that searches for your particular user name:
% ps -ef | grep your_account_name
After you submit the job take a look at the terminal file created by the run. If it is very quickly filling up with a statement about not being able to read the input file, then you have not specified the input file correctly in your parameter file. Imm ediately kill the process using the process number (PID) obtained in the above ps command with the following UNIX command:
% kill -9 PID
Also delete the huge terminal file in your account. You can easily completely fill up your entire quota with one of these mistakes so that nothing else will work in your account. Go back and fix your parameter file until it works correctly. Resubmit yo ur job and then check the new terminal file to see that you have at least gotten into the program and that it was loading in sequences. If it wasn't doing that, something is still wrong. It is hard to have accurate ideas about how long a PHYLIP task might take. Some processes can run for days depending on the parameters selected, the size of the data set, and the user load on the computer. Experience will help.
Do not wait for this job to finish, it may take an hour or so if everything is running properly, -- go on with the rest of the exercise. When the job does finish, the output outfile and treefile will end up in the same directory that the job was started in. Remember to rename them. Be sure to delete the terminal file after everything has gone well. They are only for troubleshooting and serve no other purpose. There's no need to fill up your account with unnecessary files. Take a look at the output. This may not be until later on in the week. It should be very similar to what you saw when you ran the program interactively. Differences can be attributed to the increased jumble option parameter and should reflect a more confident answer.
7) Plotting phylogenetic trees with PHYLIP.
With your generated Newick tree files in hand, it's time to create plots of the results. There are two PHYLIP programs that draw plots of Newick style tree file data, DrawTree and DrawGram. DrawTree produces an unrooted 'network' representation; DrawGram produces several varieties of rooted style 'grams.' In both cases, branch lengths will be proportional to the evolutionary distance separating the entries or the level of consensus if it is a Consense tree. Both programs require that the user provide information designating the input file and the font set to be used. The PHYLIP system has five fonts to work with. These are the so-called Hershey fonts; they have the file names font1 through font5. These fonts must either be copied into the directory that the drawing programs are run in or you must specify the full path name to them each time that you specify one. On ribozyme:
/disk2/usr/local/soft/genetics/phylip2/font*
To simplify the following procedures, copy one of the fonts to your current directory at this point. You only need one, I like font2, but chose any one of them for your own use.
Since the package is meant to be graphics device independent, it doesn't use text characters in plots by default; rather it gives detailed drawing instructions for each character used. This makes changing the names on the generated plots very challenging, so the names should be changed, if desired, at earlier points in the work. If you haven't gotten the names in your initial data matrix the way that you want them to end up, then you can either change them with a text editor in the Newick tree file or use PHYLIP's ReTree program to do it interactively.
DrawTree and DrawGram operate in a similar fashion. They can typically be run in two different manners -- one to directly create a PostScript file of the plot and the other to create a graphics file that can be visualized on the screen using GCG Tektronix emulation.
7.a. On screen visualization of PHYLIP graphics.
I'll illustrate the onscreen visualization first. To do this in an environment typical of many VADMS users, that of a Tektronix color 4105 or 4107 emulator, we can take advantage of PHYLIP's Hewlett-Packard graphics language output option as an intermediate and the Homegrown software phylip_plot. When either drawing program, DrawTree or DrawGram, is launched it looks for a file in the current directory entitled "treefile," if it doesn't find one, it returns a request for the required input file and then it asks for a font file. It then prompts you for the type of output device to be used. To designate the Hewlett-Packard graphics language choose h. Often you will not have the proper display device available for previewing the results, so respond with n to the next question posed. A number of options are then listed governing the actual plot itself. There are a whole slew of these. To begin, try the defaults first, if they don't give you what you want, repeat the process and select different ones at that point. I suggest running them many, many times until you see what you prefer.
Follow the sample screen trace below using the results from the first Fitch tree of the prion data set to get a feel for running DrawGram in this fashion. Notice that you need to specify the name of the input tree and then the font file. Be sure to specify h for Hewlett-Packard plotter language and do not preview by choosing n when asked. For this first pass, I will use all default drawing parameters. Follow the screen trace:
% drawgram
drawgram: can't read treefile
Please enter a new filename>prion.fitchtree
DRAWGRAM from PHYLIP version 3.572c
Reading tree ...
Tree has been read.
Loading the font ....
drawgram: can't read fontfile
Please enter a new filename>font2
Font loaded.
Which plotter or printer will the tree be drawn on?
(many other brands or models are compatible with these)
type: to choose one compatible with:
L Apple Laserwriter (with Postscript)
M MacDraw PICT format
R Rayshade 3D rendering program file
J Hewlett-Packard Laserjet
K TeKtronix 4010 graphics terminal
H Hewlett-Packard 7470 plotter
D DEC ReGIS graphics (VT240 terminal)
B Houston Instruments plotter
E Epson MX-80 dot-matrix printer
C Prowriter/Imagewriter dot-matrix printer
O Okidata dot-matrix printer
T Toshiba 24-pin dot-matrix printer
P PC Paintbrush monochrome PCX file format
X X Bitmap format
F FIG 2.0 format
U other: one you have inserted code for
Choose one:
h
Which type of screen will it be previewed on?
type: to choose one compatible with:
N will not be previewed
K TeKtronix 4010 graphics terminal
D DEC ReGIS graphics (VT240 terminal)
U other: one you have inserted code for
Choose one:
n
Here are the settings:
(1) Tree grows: Vertically
(2) Style of tree: Cladogram
(3) Use branch lengths: Yes
(4) Angle of labels: 45.0
(5) Horizontal margins: 1.92 cm
(5) Vertical margins: 1.44 cm
(6) Scale of branch length: Automatically rescaled
(7) Depth/Breadth of tree: 0.53
(8) Stem-length/tree-depth: 0.05
(9) Character ht / tip space: 0.3333
(10) Ancestral nodes: Intermediate
Do you want to accept these? (Yes or No)
Type Y or N or the number (1-10) of the one to change:
y
Writing plot file ...
Finished.
End of run.Rename the output, this time called plotfile. Use an extension that identifies the content as Hewlett-Packard graphics:
% mv plotfile prion_fitchgram.hpgl
The result of this process is a text file containing the plotting instructions for the desired plot in Hewlett-Packard graphics language, HPGL. You can take a look at it if you want, but it is very long and quite meaningless to most human beings. Some comments on the distinctive format and contents of this file should be mentioned regardless. The more complex the plot is, the longer the generated plotting instruction file will be. The PHYLIP draw programs display the structure of the plot in black (white on the black background of standard Tek 4105/4107 graphics terminals) and the names of the sequences being worked with in green. HPGL commands for pen color are SPx, where SP is the pen color command and the x denotes the color to be used, 1 is for black and 2 for green. Since the instructions were written for a plotter, there are mainly PU for pen up and PD for pen down instructions in this file. The PU commands are similar in function to GCG Figure program move commands in that they position the pen over a spot on the page. PD commands put the pen down on the page and draw a line to the location given in the instruction. The command IN starts a plot and SP ends one.
The HPGL files generated from these programs can then be plotted to your screen with the locally developed piece of software, phylip_plot. Since this software was written with GCG graphics subroutines, GCG needs to be activated and the graphics device type set in order to run the program. Follow the screen trace below to run phylip_plot:
% phylip_plot PHYLIP plotting program Please enter the filename.ext prion_fitchgram.hpgl When your VERSATERM-TEK4105 attached to tty is ready, press <rtn>. <rtn>

The results can be very disappointing on screen. However, because it was written with GCG subroutines, phylip_plot responds to all the command switches that are possible in all GCG graphics programs. This enables you to change the size of the resulting output to meet your needs. The data set can be expanded, moved around on the page, and turned 90 degrees from the original orientation to fit the page better. If your plot is higher than it is wide, use portrait to rotate it to fit better on the long direction of the screen. Use scale to control the actual size of the plot on the page. Xpan and ypan move the plot around on the page. A negative value in xpan moves the picture to the left and a negative value in ypan moves the picture down on the page. However, if you use the portrait switch this exchanges the two pan and scale x and y functions and tends to confuse the user. A listing of several of these command switches is given in Appendix B. The best way to get a good picture is just to try various things until you like what you see. An example of a phylip_plot command line with some of these GCG options and the resultant plot is given below:
% phylip_plot -xscale=1.5 -yscale=2.25 -xpan=40 -ypan=30

This considerably improves the readability of the plot. However, it is obvious that we should not have accepted the default 45 degree label angle when running DrawGram. If playing this game, you should experiment with various values until you find the best for your graphic. Once you've determined the optimal, 'magical' settings, it is best to write them down somewhere as they will be lost if you proceed with the next step. The Figure file picture generated in the next step is the same size as that of the original data you started with prior to all the adjustments, even if you add all your magic conditions to the command line. To produce a Figure file of your graphic, issue a variation of the following phylip_plot command line:
% phylip_plot -figure=prion_fitchgram.figure
Writing the output to a Figure file instead of directly to a plot with the figure option allows greater flexibility. You can edit the Figure file to modify the information in it for publication or posters. Figure legends and other types of annotation can be incorporated in this process.
After modifying the Figure file the way you want it, the results of your efforts can be seen on the screen using the Figure program. Use the same magic conditions derived earlier when running the phylip_plot program to get the desired output, because the production of the Figure file saved your data in its original size and orientation. An example of a Figure command line that corresponds to the above phylip_plot command line is given below:
% figure -xscale=1.5 -yscale=2.25 -xpan=40 -ypan=30 prion_fitchgram.figure
When satified with your results, you might want to use the GCG graphics switch nomark to suppress the time stamp that the Figure program puts on its wares to produce another better version of your plot. You can also use the speed option to slow down the plotting procedure to improve resolution, if you are using an HP plotter as your graphics device.
7.b. Publication quality PHYLIP PostScript graphics.
You are welcome to play with phylip_plot as much as you like to visualize the trees produced in this week's exercise on your terminal screen. However, to produce publication quality output I usually let PHYLIP directly produce PostScript plots. PHYLIP's internal PostScript driver does a very good job of sizing the tree to the page and producing a legible final graphic. To produce a PostScript graphic representation of the tree, launch either DrawTree or DrawGram and specify that you want to use "Apple Laserwriter (with Postscript)" as your graphics device. Again, do not preview the graphic. Just as before, you need to specify the input tree file and a Hershey font logical name. However, because we're directly using PostScript we can change to a non-bitmapped, PostScript supported, font once within the program. Be sure that the PostScript printer that you are using recognizes the font that you specify. PostScript font names are case sensitive. Most printers can generate a test page of all their supported fonts to help you choose.
Follow the next screen trace to see how to produce PostScript trees directly out of PHYLIP. For this example, I will draw a phenogram of the second tree that we estimated in this week's exercise, the bootstrapped neighbor-joining tree. I'll illustrate with some of my favorite option combinations. Check out all the options that I change to get just the effect that I prefer below:
% drawgram
drawgram: can't read treefile
Please enter a new filename>prion.neighbor_bootconsensetree
DRAWGRAM from PHYLIP version 3.572c
Reading tree ...
Tree has been read.
Loading the font ....
drawgram: can't read fontfile
Please enter a new filename>font2
Font loaded.
Which plotter or printer will the tree be drawn on?
(many other brands or models are compatible with these)
type: to choose one compatible with:
L Apple Laserwriter (with Postscript)
M MacDraw PICT format
R Rayshade 3D rendering program file
J Hewlett-Packard Laserjet
K TeKtronix 4010 graphics terminal
H Hewlett-Packard 7470 plotter
D DEC ReGIS graphics (VT240 terminal)
B Houston Instruments plotter
E Epson MX-80 dot-matrix printer
C Prowriter/Imagewriter dot-matrix printer
O Okidata dot-matrix printer
T Toshiba 24-pin dot-matrix printer
P PC Paintbrush monochrome PCX file format
X X Bitmap format
F FIG 2.0 format
U other: one you have inserted code for
Choose one:
l
Which type of screen will it be previewed on?
type: to choose one compatible with:
N will not be previewed
K TeKtronix 4010 graphics terminal
D DEC ReGIS graphics (VT240 terminal)
U other: one you have inserted code for
Choose one:
n
Here are the settings:
(1) Tree grows: Vertically
(2) Style of tree: Cladogram
(3) Use branch lengths: Yes
(4) Angle of labels: 45.0
(5) Horizontal margins: 1.73 cm
(5) Vertical margins: 2.24 cm
(6) Scale of branch length: Automatically rescaled
(7) Depth/Breadth of tree: 0.53
(8) Stem-length/tree-depth: 0.05
(9) Character ht / tip space: 0.3333
(10) Ancestral nodes: Intermediate
(11) Font : Hershey
Do you want to accept these? (Yes or No)
Type Y or N or the number (1-11) of the one to change:
1
Here are the settings:
(1) Tree grows: Horizontally
(2) Style of tree: Cladogram
(3) Use branch lengths: Yes
(4) Angle of labels: 45.0
(5) Horizontal margins: 1.73 cm
(5) Vertical margins: 2.24 cm
(6) Scale of branch length: Automatically rescaled
(7) Depth/Breadth of tree: 0.53
(8) Stem-length/tree-depth: 0.05
(9) Character ht / tip space: 0.3333
(10) Ancestral nodes: Intermediate
(11) Font : Hershey
Do you want to accept these? (Yes or No)
Type Y or N or the number (1-11) of the one to change:
2
What style tree is this to be:
Cladogram, Phenogram, curVogram, Eurogram, or Swoopogram
(C, P, V, E, or S)
Choose one:
p
Here are the settings:
(1) Tree grows: Horizontally
(2) Style of tree: Phenogram
(3) Use branch lengths: Yes
(4) Angle of labels: 45.0
(5) Horizontal margins: 1.73 cm
(5) Vertical margins: 2.24 cm
(6) Scale of branch length: Automatically rescaled
(7) Depth/Breadth of tree: 0.53
(8) Stem-length/tree-depth: 0.05
(9) Character ht / tip space: 0.3333
(10) Ancestral nodes: Intermediate
(11) Font : Hershey
Do you want to accept these? (Yes or No)
Type Y or N or the number (1-11) of the one to change:
4
(Considering the tree as if it "grew" vertically:)
Are the labels to be plotted vertically (90),
horizontally (0), or at a 45-degree angle?
Choose an angle in degrees from 90 to 0:
90
Here are the settings:
(1) Tree grows: Horizontally
(2) Style of tree: Phenogram
(3) Use branch lengths: Yes
(4) Angle of labels: 90.0
(5) Horizontal margins: 1.73 cm
(5) Vertical margins: 2.24 cm
(6) Scale of branch length: Automatically rescaled
(7) Depth/Breadth of tree: 0.53
(8) Stem-length/tree-depth: 0.05
(9) Character ht / tip space: 0.3333
(10) Ancestral nodes: Intermediate
(11) Font : Hershey
Do you want to accept these? (Yes or No)
Type Y or N or the number (1-11) of the one to change:
8
New value of stem length as fraction of tree depth?
0
Here are the settings:
(1) Tree grows: Horizontally
(2) Style of tree: Phenogram
(3) Use branch lengths: Yes
(4) Angle of labels: 90.0
(5) Horizontal margins: 1.73 cm
(5) Vertical margins: 2.24 cm
(6) Scale of branch length: Automatically rescaled
(7) Depth/Breadth of tree: 0.53
(8) Stem-length/tree-depth: 0.00
(9) Character ht / tip space: 0.3333
(10) Ancestral nodes: Intermediate
(11) Font : Hershey
Do you want to accept these? (Yes or No)
Type Y or N or the number (1-11) of the one to change:
11
Enter font name or "Hershey" for the default font
Helvetica-Bold
Here are the settings:
(1) Tree grows: Horizontally
(2) Style of tree: Phenogram
(3) Use branch lengths: Yes
(4) Angle of labels: 90.0
(5) Horizontal margins: 1.73 cm
(5) Vertical margins: 2.24 cm
(6) Scale of branch length: Automatically rescaled
(7) Depth/Breadth of tree: 0.53
(8) Stem-length/tree-depth: 0.00
(9) Character ht / tip space: 0.3333
(10) Ancestral nodes: Intermediate
(11) Font : Helvetica-Bold
Do you want to accept these? (Yes or No)
Type Y or N or the number (1-11) of the one to change:
y
Writing plot file ...
Finished.
End of run.Some explanation is required. I first flip the graphic on its side by specifying that I want it horizontal. Then I designate the phenogram representation with 90 degree labels to align the labels with the graphic. I also change option 8 "Stem-length/tree-depth" to zero; this suppresses the drawing of an implied root on the tree. Finally I specify a PostScript font; notice that this specification is case sensitive and must match a font that your printer has installed.
As always when running PHYLIP programs, immediately rename the output file:
% mv plotfile prion.neighbor_bootconsensetree_ps
The output PostScript file can now be sent to any printer that interprets PostScript. This may involve 'ftping' the file to a Mac or PC network for printing. If you have access to the VADMS Teaching Lab, then you can use the lpr command to print the PostScript to the default printer located there. However, ribozyme requires a special queue for printing these, so use the following variation of the lpr command:
% lpr -Plps PHYLIP_plot_filename_ps
A copy of this bootstrapped neighbor-joining tree as a phenogram graphic, as well as a curvogram of the consensus parsimony tree, and a Fitch eurogram PostScript output, is attached to the end of this exercise.
Finally I will prepare a consensus tree from the three previous trees. The three separate tree files need to be merged into one to do this. You can either use the UNIX cat command or a text editor to do this. A sample cat command line follows:
% cat prion.fitchtree prion.neighbor_bootconsensetree prion.pars_consensetree > combined.trees
After combining the files, run Consense on the combined file. I'm not going to show a screen trace of this Consense run as I have already shown you two. Don't forget to rename the output files afterward:
% mv outfile combined.consense
% mv treefile combined.consensetree
The final consensus tree will show how well the three methods -- Fitch least-squares distance fit, bootstrapped distance neighbor-joining, and parsimony -- agree with each other. Their branch lengths will be proportional to the level of agreement between the three methods. A handy thing to do with this type of tree is to impose actual branch lengths on the consensus topology. This can be done by taking advantage of PHYLIP's user tree option in many of the programs. Merely append the desired Newick tree topology below the expected input data file for the desired program along with a number identifier telling how many user trees you are providing and specify that you want to use the tree that you are supplying once in the program. Details for doing this are available in the PHYLIP documentation.
As a grand finale, I will take the above consensus tree and load it into PHYLIP's tree manipulation tool ReTree. This program enables one to change the names, root placement, appearance, branch lengths, and topology of trees. The appearance of a tree can be drastically changed without changing its topology by flipping and rerooting. Naturally, if you actually change the branch lengths or orders in your tree, then you are changing the topology and no longer accepting the conclusion of the inference software. I will use ReTree to change the names of the sequences to reflect the names of organisms and to change the appearance of the tree to more closely match our preconceived idea of what it should look like without changing the topology of the tree at all. A much abridged ReTree screen trace follows below; use the question mark key to see the help screen:
% retree
Tree Rearrangement, version 3.572c
Settings for this run:
U Initial tree (arbitrary, user, specify)? User tree from tree file
N Use the Nexus format to write out trees? No
0 Graphics type (IBM PC, VT52, ANSI)? ANSI
W Width of terminal screen, of plotting area? 80, 80
L Number of lines on screen? 24
Are these settings correct? (type Y or the letter for one to change)
y
Reading tree file ...
retree: can't read intree
Please enter a new filename>combined.consensetree
*********1:prio human
*****31
*********30 *********2:prio pantr
* *
********29 *********3:prio gorgo
* *
* **********4:prio ponpy
*
* **********5:prio calja
* **34
* * **********6:prio atege
* *
* * *********7:prio cerae
* * *****39
*****28 * * * *********8:prio mansp
* * * *****38 *****40
* * **33 * * *********9:prio macfa
* * * * ******37 *
* * * * * * ********10:prio colgu
* * * * * *
* * * * **36 ********11:prio cebap
** 25 lines below screen **
NEXT? (Options: R . = U W O M T F B N H J K L C + ? X Q) (? for Help)
?
R Rearrange a tree by moving a node or group
. Redisplay the same tree again
= Redisplay the same tree without lengths
U Undo the most recent rearrangement
W Write tree to a file
O select an Outgroup for the tree
M Midpoint root the tree
T Transpose immediate branches at a node
F Flip (rotate) subtree at a node
B Change or specify the length of a branch
N Change or specify the name(s) of tip(s)
H Move viewing window to the left
J Move viewing window downward
K Move viewing window upward
L Move viewing window to the right
C show only one Clade (subtree) (might be useful if tree is too big)
+ Read next tree from file
? Help (this screen)
Q (Quit) Exit from program
X Exit from program
TO CONTINUE, PRESS ON THE Return OR Enter KEY <rtn>
NEXT? (Options: R . = U W O M T F B N H J K L C + ? X Q) (? for Help)
j
** 12 lines above screen **
* * *********7:prio cerae
* * *****39
*****28 * * * *********8:prio mansp
* * * *****38 *****40
* * **33 * * *********9:prio macfa
* * * * ******37 *
* * * * * * ********10:prio colgu
* * * * * *
* * * * **36 ********11:prio cebap
* * * * * *
* * * * * * ********12:prio aottr
* * * **35 ******41
* **32 * ********13:prio saisc
**27 * *
* * * *********14:prio prefr
* * *
* * * ********15:prio rat
* * * *****44
* * * *********43 ********16:prio mouse
* * * * *
** 14 lines below screen **
NEXT? (Options: R . = U W O M T F B N H J K L C + ? X Q) (? for Help)
j
** 24 lines above screen **
* **32 * ********13:prio saisc
**27 * *
* * * *********14:prio prefr
* * *
* * * ********15:prio rat
* * * *****44
* * * *********43 ********16:prio mouse
* * * * *
*****26 * **42 ********17:prio mesau
* * * *
* * * *********18:prio calmo
* * *
* * ********19:prio odohe
**25 *
* * * ********20:prio bovin
* * *****45
*24 * ********21:prio trast
* *
* ********22:prio musvi
*
** 2 lines below screen **
NEXT? (Options: R . = U W O M T F B N H J K L C + ? X Q) (? for Help)
j
** 36 lines above screen **
* * ********19:prio odohe
**25 *
* * * ********20:prio bovin
* * *****45
*24 * ********21:prio trast
* *
* ********22:prio musvi
*
********23:prio sheepFirst I will reroot the tree by specifying node 43 as the outgroup. This is the rodent clade. I decided, based on the previous analyses, that the Mink sequence was not an appropriate outgroup because it kept associating with the Artiodactyls. Perhaps the Mink prion sequence is a result of horizontal gene transfer and, in fact, this actually could be true because of the pathogenic and extremely contagious state of the prion associated disease scrapie. A strange senario can be imagined. Perhaps the Mink sequence arose, because some ancestral mink critter, in the evolutionary past within a population bottleneck, ate a scrapie infected Sheep carcass, the gene somehow got incorporated into its germ line, and then it was passed on to its offspring before the host died. However, in the offspring the gene was not pathological and it became established in the lineage. It is interesting that there are no other carnivore prion sequences in the database other than in the mink family; only time will tell whether this is just because they have not yet been sampled or whether prions do not normally exist in carnivores. Do the same in your tree; specify whichever sequence or node is appropriate in your dataset. Outgrouping with an entire node is a very nice feature and not available within the other PHYLIP programs.
NEXT? (Options: R . = U W O M T F B N H J K L C + ? X Q) (? for Help)
o
Which node should be the new outgroup? 43
**********15:prio rat
*******44
*****43 **********16:prio mouse
* *
* **********17:prio mesau
*
* **********18:prio calmo
* *
*24 * ***********19:prio odohe
* * *
* * ******27 ***********20:prio bovin
* * * * *******45
* * * * * ***********21:prio trast
* * * ***26
* * * * ***********22:prio musvi
*****42 * *******25
* ***28 **************23:prio sheep
* * *
* * * ***********1:prio human
* * * *******31
* * * ***********30 ***********2:prio pantr
** 25 lines below screen **
NEXT? (Options: R . = U W O M T F B N H J K L C + ? X Q) (? for Help)
n
Specify name of which tip? (enter its number or 0 to quit): 15
The current name of tip 1 is "prio rat"
Enter new tip name: RatGo through and change the name of each tip to the common name of each organism. Above, I changed the name of the entry prio_rat in the rodent outgroup clade to Rat. After changing the names, experiment with flipping and transposing nodes. Neither of these operations change the topology of the tree, yet they can make a drastic difference in the relationships implied by the tree. The psychological effect of this can be enormous -- don't overlook the potential. After getting your tree into exactly the form that you want, exit the program and write out the new tree. If you have good faith in your outgroup, go ahead and write the tree out in a rooted format.
*************1:Rat
*******26
************25 *************2:Mouse
* *
* *************3:Golden Hamster
*
* *************4:Dusky Titi
* *
*24 * *************5:Marmoset
* * ***30
* * * *************6:Spider Monkey
* * *
* * * *************7:Macaque
* * * ******36
* * * ********35 *************8:Mandrill
**27 * * *
* ****29 *******34 ************9:Green Monkey
* * * * *
* * * ********33 ************10:Colobus
* * * * *
* * * ***32 ***********11:Capuchin
** 25 lines below screen **
NEXT? (Options: R . U W O T F B N H J K L C + ? X Q) (? for Help) x
Do you want to write out the tree to a file? (Y or N) y
Enter R if the tree is to be rooted
OR enter U if the tree is to be unrooted: r
Tree written to fileBe sure that you write out your new tree before quitting the program! The new file will have the name outtree. Rename it immediately:
% mv outtree combined_final.consensetree
This final tree can than have a PHYLIP PostScript graphic made of it with either DrawTree or DrawGram. When running DrawTree, I usually switch option 2, "Angel of Labels," in the settings menu to a for "Along" or r for "Radial;" there's less chance of them overwriting one another that way. A PostScript DrawTree graphic of this final consensus tree is attached to the end of the exercise.
8) Finishing up and some concluding remarks.
To receive credit for this week's laboratory exercise I want you to prepare four different PostScript tree files. I do not care what options or tree representation form you choose, as long as the sequence names are legible in the output. Use the same naming convention that we've been using all semester, (your_last_name).(program_name)_tree_ps, for all four PostScript files.
Also copy over and fill out a report form for the exercise using the same procedure as before; rename it to have your last name keeping the week10s extension. In addition to the report form, I also want to be able to print out all of the PostScript trees. Therefore, be sure that all the required files follow the proper naming convention, then send them to the teacher account. Since all required files will have the same filename (i.e. your lastname) but unique, identifying extensions you can rcp them all at once to teacher by utilizing a wildcard.
One vital point that can't be repeated often enough, is the importance of your multiple sequence alignments. All subsequent analyses are absolutely dependent upon them. Another point that needs to be considered in phylogenetic inference is do not base an organism's phylogeny on just one gene. There can be many complicating factors that make interpretation difficult. Use several genes -- the Ribosomal Database Project (RDP) (http://rdp.life.uiuc.edu/) provides a good, largely accepted, alignment and phylogenetic framework with which other phylogenies can be compared. The complete RDP is installed on ribozyme in aligned GCG format and can be used in the same manner as the sequences explored in this execise. Anytime the phylogenies of organisms based on two different genes do not agree, there is either some type of problem with the analysis or you have found a case of lateral transfer of genetic material. Gene phylogenies are another story altogether and should be based, if at all possible, on sequences all from the same organism. Please contact Steve Thompson for further information about using VADMS implementation of RDP and other tools for molecular phylogenetic inference not covered in this week's exercise.
Felsenstein, J. (1993) PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Dept. of Genetics, University of Washington, Seattle, WA, USA. See http://evolution.genetics.washington.edu/phylip.html
I am indebted to Dr. Gary Olsen, whose lectures provided the gist and heart of this tutorial. Thank you Gary. SMT
Appendix A: The GCG Figure program.
Figure files can be edited to change and enhance GCG graphics. A Figure input file contains one instruction per line. Each type of instruction has a special code. Those that you most likely will want to change or use as reference points are listed below. There are, of course, many others; refer to the GCG Program Manual for more information on the Figure program and its complete instruction set.
.d xx.xx yyy.yy draw to location Draws a line to the location on the device given by xx.xx yyy.yy, the platten coordinates. .m xx.xx yyy.yy move to location Move the pen to the location on the device given by xx.xx yyy.yy, the platten coordinates. .nc x new color Changes the pen color or sets it to a new value. .pt string plot text string The command to plot the text string shown on the line
Another instruction that you won't often find is that to designate the linestyle for drawn lines. This may be something that you might want to change to reflect a particular branch of the tree that you feel is very important or shakey. It is also very helpful for differentiating between lines in multiple parameter plots without the use of color, e.g. black and white PostScript.
.ls x x.xx linestyle Changes the nature of the line to be drawn. A linestyle line of .ls 1 0.10 would produce a solid line of width .1 platten units. A line with .ls 2 0.50 would produce a dotted line with the dots being .5 platten units apart.
Other instructions that will be useful when you modify these files are those which control text font, orientation, character size, and text angle.
.fo x font to be used Out of 22 available fonts listed in the GCG Program Manual. .to x text orientation Determines the relationship of the text to the location given in the previous move command. A value of 2 means that the text starts just to the right of the location. .ch x.x character height Sets the height of the characters used. .td xx.x text angle Set the angle at which the text is drawn. The default is horizontal.
Finally, to add entire figure legends you may want to experiment with the paragraph commands.
.wd x and .sp y These commands specify the width and spacing of your paragraph.
Appendix A: The GCG Figure program (continued).
An example of part of a Figure output file from DrawTree data is given below. It shows what new pen, move, and draw instructions look like. Although this type of file may be long and appear awkward at first, using the Figure file allows you complete editorial control of your plot files. The data is in a format that can be worked with by itself, or be included within other GCG figure files. Its uses are only up to your imagination.
! .! Figure output from "" October 22, 1992 11:41 .! .vp 0 150 0 100 !(establishes the window for the plot) .wn 0 150 0 100 .nc 1 !(sets the color to black) .vp 0 150 0 100 .nc 1 !(sets the color to black) .m 60.81 31.61 .d 55.27 37.15 .m 55.27 37.15 .d 47.43 37.15 .m 47.43 37.15 .d 39.59 37.15 .m 39.59 37.15 .d 34.04 31.61 .m 39.59 37.15 .d 31.74 37.15 .m 47.43 37.15 .d 47.43 44.99 .m 55.27 37.15 .d 63.11 37.15 .m 60.81 31.61 .d 66.36 26.06 .nc 2 !(sets the color to green) .m 15.85 31.58 !(this is the beginning of the character .d 14.75 28.7 ! drawing portion of the file) .m 15.85 31.58 .d 16.94 28.7 .m 15.16 29.66 .d 16.53 29.66 .m 17.08 28.7 .m 19 30.21 .d 18.87 30.48 .d 18.46 30.62 .d 18.04 30.62 .d 17.63 30.48 .d 17.49 30.21 .d 17.63 29.93 .d 17.91 29.79 .d 18.59 29.66 .d 18.87 29.52 .d 19 29.24 .d 19 29.11 .d 18.87 28.83 //////////////////////////////////////////////////////////////////////////// .! .! End of Page 1 .!
Appendix B: GCG graphics command switches
The following is a listing of some of the command switches that all GCG graphics routines including phylip_plot will respond to. A brief description of what happens to the data is given with each command switch. The required part of the command is shown in UPPER-CASE bold type. "I" denotes an integer number and "x" a real one.
-COLor=I The entire structure is drawn in the color of pen I for the graphics device regardless of the designations within the file being worked with. -FIGure=yyyyy.yyy Writes the contents of the plot into a text file called yyyyy.yyy that can be used as an input file for the program Figure. Figure files are editable and device independent. See the Program Manual for more information. -FONT=I Draws all the text characters in the plot with font I. GCG software will work with 22 different character sets. But the name labels within PHYLIP plots are not text characters unless you use PHYLIP's PostScript driver and specify PostScript fonts. -PLOt=yyyyy.yyy Redirects the graphics output to file yyyyy.yyy using whatever graphics language you have previously initialized. -PORtrait Rotates the plot 90 degrees. -SCAle=x Scales the size of the plot by the factor x. The factor can either expand or contract the size of the plot depending on the value of the number used, i.e. x=2.0 would twice as big, x=.5 would be half as small. -XSCAle=x Scales the size of the x axis by thefactor x. -YSCAle=y Scales the size of the y axis by the factor y. -XPAN=xx.xx Moves the plot to the right along the x axis xx.xx platen units. -YPAN=yy.yy Moves the plot to up along the y axis yy.yy platen units -SPEed=x Controls the speed of the pen, the default is 10.0 and is quite fast. A value of 1.0 or 2.0 slows down the plotting process and produces better lines on a HP pen plotter for a higher quality plot. This only applies to HPGL plotters. -DENsity=x Not available on all GCG graphics programs, however, on those where it is available, this function controls how many pages the graphic will be drawn on. By default graphics are calculated to best fit one page. If you want to superimpose the graphics from sequences of different length, this parameter allows the scale to remain identical. Another use is to zoom in on a portion of a graphic without losing the rest of the information -- it will merely be spreadout onto more pages.
The Dayhoff matrix, PAM distances, Fitch derived tree drawn by DrawGram's eurogram option. Vertical distance is meaningless in this representation; horizontal distance is directly proportional to evolutionary divergence in the number of substituted amino acids per 100.

The bootstrapped neighbor-joining tree (distances by Kimura) phenogram. Here the vertical component of the line is again meaningless. The horizontal component of the line is now proportional to the amount of agreement between the 100 neighbor-joining trees found from the 100 bootstrapped datasets fed to the procedure, i.e. the bootstrap value.

The consensus of the fifty most parsimonious trees drawn as a curvogram. Again the vertical distance component is meaningless and the horizontal component signifies agreement. Here that agreement is between the fifty most parsimonious trees found from the dataset. There are some distinct differences between this tree and the two previous distance based methods' trees.

The final consensus tree of the three methods used in the exercise as presented by DrawTree. Now overall branch length indicates the amount of agreement between the methods used. Three different relative branch lengths can be seen: one of three, where all three methods disagree; two of three, where there's some agreement; and three of three, where all methods consistently gave the same answer. For instance, the rodent and great ape clades are consistently outside of all others while the branching order in the middle of the tree is pretty arbitrary.
