'96 BC/BP 378

Week 12

This week's exercise will introduce the techniques available through VADMS for inferring molecular phylogenies. Based on your multiple sequence alignment from last week, you will learn how to use GCG's and several of PHYLIP's molecular evolutionary inference programs. These programs analyze aligned sequence datasets to ascertain phylogenies using several different algorithms. The suitability and limitations of the algorithms will be discussed and explored while you learn how they run.

Author:

Steven M. Thompson and Susan Jean Johns

Phylogenetic inference -- a brief introduction

The inscription on the inner cover of Evolution by Dobzhansky, Ayala, Stebbins and Valentine reads "Nothing in biology makes sense except in the light of evolution." (Dobzhansky, 1973). These words ring true in my ears. To me, evolution provides the single, unifying, cohesive force that allows all of life to be explained. It is to the life sciences, what the missing holy grail of the unified field theory is to astrophysics.

The concepts of common ancestry and homology extend beyond anatomy, physiology, and genes down to the smallest heritable unit of life, the individual base pairs in DNA. Thus, even after homologous molecules are identified, it is necessary to establish the correspondence between individual sequence positions. A multiple sequence alignment is a hypothesis about evolutionary history. Therefore, use everything available to insure that you have prepared a satisfying one! Remember the old adage: "garbage in -- garbage out!" Some general guidelines to remember include the following:

(1) If the homology of a region is in doubt, then throw that data out.

(2) Avoid the most diverged parts of molecules; they are the greatest source of systematic error.

(3) Do not include sequences that are more diverged than necessary for the analysis at hand.

(4) In pairwise distance, use only one `out-group' species, if any. In parsimony and maximum likelihood, consider using a few, if you are using them at all, to subdivide the long out-group branch. In all cases, if you are using an out-group in order to root your tree, try to make your out-group as close to your in-groups as possible without actually becoming an in-group.

(5) To help deal with random errors consider testing with bootstrap techniques.

Bootstrap resampling provides a confidence limit on the groups that are separated by a given tree branch by randomly selecting characters from the original dataset up to the same number of characters as the original dataset and reperforming the inference algorithm.

(6) When all else fails, use more Data! Often the most practical solution to uncertain relationships is to use more sequence data, i.e. more molecules, and/or longer sequences.

Using any moleclular evolution inference program effectively depends upon your understanding of the operation of the software packages and the quality of the sequence alignments that are used. Quality alignments mean everything for obtaining meaningful results from phylogenetic inference algorithms.

Devoting considerable time and energy toward developing the most satisfying multiple sequence alignment possible is invaluable in evolutionary inference. Use as many types of analyses as available to insure that your alignment is as good as it can be. Use all of your understanding of the sequences involved to insure that the alignment makes biological sense. Make sure that known enzymatic, regulatory and structural elements all align, for the results of your inference are absolutely dependent upon your alignment. All of the molecular sequence phylogenetic inference programs make the validity of your input alignment their first and most critical assumption.

The available software can process any alignment that you feed it of the proper format. Whether or not they are appropriate and should be used, is up for you to determine. Beware of comparing "apples and oranges!" Make sure that the family of sequences that you align are in fact related and that the alignment is meaningful. The programs will work with almost any input sequences but only make sense if they actually do belong to the same family. Either make paralogous ('pair-wise' evolution, i.e. gene duplication) or orthologous (within one ancestral loci) comparisons, try not to mix them up. Extremely misleading interpretations can result otherwise. Also be wary of trying to align genomic sequences with cDNA sequences when working with DNA. Similarly, don't try and align mature and precursor proteins; even though they do align, it doesn't make evolutionary sense as one is not evolved from the other, rather one is the other. These are all easy mistakes to make and can cause all sorts of problems.

After you have obtained a satisfying alignment, the next step is to feed it to some program for evolutionary reconstruction. GCG has a phylogenetic inference program section. These programs, Distances and GrowTree, offer a neighbor-joining distance matrix solution using several different methods for the correction of multiple substitutions at homologous sites. PAUP (Phylogenetic Analysis Using Parsimony) by David Swofford of the Smithsonian Institute will be incorporated into GCG in a future release; licensing and documentation problems prevented its planned release in version 8.0. This week's session will also introduce the University of Washingtion's Dr. Joseph Felsenstein's PHYLIP (PHYLogenentic Inference Package) programs. PHYLIP is a comprenisive suite of 31 inference programs that can handle both molecular sequence and morphological character data.

PHYLIP's molecular sequence programs have versions that accept either amino acid or nucleotide sequence data. Again, I strongly encourage all searching and alignment to be performed on an amino acid basis if dealing with coding sequence. If you want to perform your phylogenetic inference on nucleotide sequence data, which may be desirable particularly with very similar sequences, it should be converted to nucleotides after alignment. In addition to the more easily achieved alignment this also insures that gaps are not placed within codons. Unfortunately the VADMS suite does not contain an easily used utility to perform this task. Another general guideline to keep in mind when using any phylogenetic inference algorithm is to never impose a molecular clock. The existence of a good clock for your system can be evaluated after the phylogeny has been inferred. Don't base your entire analysis on this often untrue principle; the universal application of a clock is a hotly disputed issue. This is the main reason why I never recommend using UPGMA analysis -- it mandates a perfect clock. I will only illustrate those programs in this exercise that do not impose a clock and that accept proteins.

One of the biggest problems in this field is that of sequence format. Each suite of programs requires different sequence format. We have so far introduced GCG sequence format both as single and multiple sequences and PIR format. Now we have to deal with still more types. PHYLIP has its own unique input data format requirements. A couple of different programs are available to allow us to convert the alignment into a format usable with the PHYLIP package. One interface, EGCG's ToPHYLIP, converts alignments generated by GCG into usable PHYLIP format. Another program on the VADMS system, ReadSeq by Dr. Don Gilbert at Indiana University, allows for the back and forth conversion between several different formats.

Acess to PHYLIP is always available on ribozyme. All paths and environment variables are active in all VADMS accounts after you have logged on. All of the currently available software and documentation on the PHYLIP package is assessible to the user, including the EGCG interface program, ToPHYLIP. Since this interface program and the plotting program for displaying results to a Tek 4100 series window were written using GCG and EGCG subroutines, both the GCG and EGCG package need to be activated to have a fully functional system. Review the available PHYLIP documentation found through the environment variable $DOC_DIR in the /phylip subdirectory to aid you in selecting the appropiate software to fit your needs. Pay particular atttention to the main documentation, main.doc, that describes the options available within the package. Some of them should definitely be taken advantage of since they can make your analysis much more robust. Note that documentation for each PHYLIP program is available with the file names program_name.doc. All PHYLIP programs require special input files; some accept sequence data, others are meant to work on the output data created by another program in the package. Be sure you are familiar with the requirements for running each program you wish to use prior to trying it.

Many programs for phylogenetic inference are best run in batch mode when faced with very large data sets. If the data set is small enough, you can run most PHYLIP programs interactively. If not, you need to use the batch queues. Determining when to do this is based on both the program and the size of your dataset -- it's largely a matter of experience. Unlike those GCG programs in which batch mode can help make your life easier, PHYLIP does not write its own command files and submit them to the queue automatically. Batch mode documentation and special template files have been created to assist you in operating this way. These instructions are available in $DOC_DIR/phylip under the subdirectory /scripts. A template file has been created for two pieces of the package; use them as a model to create others. Review this information prior to running any of these programs in batch mode.

Many of the programs also generate information that can be used to create plots of the results. The PHYLIP package can output graphics to many different devices. However most are not readily amiable to displaying terminal images through a network connection. Therefore, a locally produced program, phylip_plot, allows you to take advantage of HP plotter text output from PHYLIP. This text file is processed by phylip_plot to produce GCG style graphics. Because this plotting software was written with GCG subroutines, the size and appearance of the plotted data can be manipulated with the usual GCG plotting command switches.

I reiterate, the most important factor in inferring reliable phylogenies is the accuracy of the multiple sequence alignment. The interpretation of your results is utterly dependent on the quality of your input. Many experts advice against using any parts of the sequence data that are at all questionable. Only analyze those portions which assuredly do align. If any portions of the alignment are in doubt, throw them out. This usually means trimming down the alignment's terminal ends and may require internal trimming as well. Another possibility is to exclude portions with PHYLIP's Weight option. Many of the PHYLIP programs allow the user to differentially weight different parts of their alignment to reflect their confidence in it. This can be a handy trick with some data sets, especially those with both highly conserved and highly variable regions.

Understanding the packages

It helps to understand a little of the philosophies behind the two software packages used in this week's exercise. GCG was created with the user in mind. In spite of common conception, it is quite user friendly, expects little in the way of background from its users, and provides excellent documentation for its software. GCG provides default parmeters for the programs which, while not always ideal, in each case will produce meaningful results. PHYLIP, on the other hand, was developed with the accomplished evolutionary biologist in mind. It expects a great deal from its users in the way of background knowledge and experience. It expects the user to know enough about the subject matter to be able to effectivley make decisions regarding the options available in each program. While documentation is provided, it is written with these expectations in mind. The use of alignments created by GCG in the PHYLIP programs means that the user needs to understand the characteristics of both packages in regard to displaying the names of sequences, the interpertion of symbols used, and file handling conventions.

The PHYLIP system has a limitation of ten characters for naming purposes. GCG, while it will truncate names, does not cut them back quite so far, rather it uses the name itself as the determining factor for the length of the name field used. CGC will truncate the name field so that fifty bases or amino acids will be shown on the data line. There is also a difference in the manner in which the two systems treat names. PHYLIP only lists a sequence name once in a data set. GCG displays the name of the sequence as often as it can to avoid user confusion. PHYLIP's ten character name limitation can be gotten around, but it is a bit of a bother. In general, try to begin the process with the names that you will want on the final graphic.

Because GCG and PHYLIP were created independently they use alignment symbols differently. PHYLIP's symbol usage needs to be clarified because of this. GCG alignment programs insert periods, "."'s, to represent gaps in the alignment. However, periods mean "the same symbol as the above sequence" to PHYLIP, therefore, ToPhylip converts periods to minuses, "-"'s, which mean deletions to PHYLIP. ReadSeq does not do this automatically and it must be done afterwards. However, not all gaps in sequences should be interpreted as deletions. Interior gaps are probably okay to represent this way, as regardless of whether a deletion, insertion or a duplication event created the gap, logically they will be treated the same by the algorithms. These are called indels. However, end gaps should not be represented as indels because a lack of information beyond the length of a given sequence may not be due to a deletion or insertion event; it may have nothing to do with the particular stretch being analyzed at all. It may just not have been sequenced! Therefore, you need to manually edit the output from any alignment format converter to change leading and trailing minus symbols to either "x"'s or "?"'s depending on the situation. These symbols mean "unknown amino acid (or base)" and "unknown amino acid (or base) or deletion" respectively to PHYLIP. This will assure that the progams do not make incorrect assumptions about your sequences.

The two systems also treat files differently. PHYLIP does not allow you to assign output file names; it uses a set of standard output file names of the format outfile, treefile, and plotfile. Because of this, and UNIX's lack of a version number, if you run concurrent PHYLIP analyses, you must run them in separate directories, otherwise the output from one program will overwrite the output of any other. Therefore, if you want to use a particular input file in a number of different analysis processes at the same time, it is best to create copies of the original data in separate directories so that all the processes have their own set of data to work with. Furthermore, even when running PHYLIP programs sequentially, be sure to rename every output file as it is produced or the same `clobbering' effect will destroy your results when the next program is run. This process can be very annoying and bothersome but it is vitally important to keep track of!

The length of time PHYLIP takes to process any given data set depends on the nature of the data being used and the program doing the processing. Some of the PHYLIP programs are just very slow by the nature of what they are trying to do. Others only slow down when a large number of sequences are being worked with that are relatively long (i.e. over 500 bases or amino acids). The individual program documentation is quite good about letting the user know if the program is slow or not. Read through some level of documentation on a program prior to running it. If you are familiar with the ideas behind evolutionary analyses, you may need to only read the main.doc information on how to run the programs. Otherwise, read the documentation on each program being used to get the entire breadth available.

Exercise for week 12:

This week's exercise will explore several methods for inferring molecular phylogenies using the sequence alignment that you prepared last week. I will illustrate the techniques with the alignment that I prepared from the Elongation Factor 1/Tu representative.list. You will go through the procedures with your own data to determine the most reasonable phylogeny for the members of your favorite branch off the tree of life subset alignment that you prepared last week. For extra credit you can estimate the best evolutionary tree from last week's extra credit EF-2/G alignment.

1) Activate the machine, connect to ribozyme and log into your account.

By this time you should know how to activate the machine you want to use, make connections with ribozyme, and log into your account. If you still need help with these functions, refer to the beginning of the exercises for weeks 2 and 3 for step by step instructions.

2) Move to this week's subdirectory and copy the necessary files to it.

% cd twelve

Now move over all the files needed to do this week's exercise. They are located in the directory location $UGRAD_DIR/week12.

% cp $UGRAD_DIR/week12/* .

3) Run the demo that describes this weeks activities.

This week's demo shows some sample representations of various types of evolutionary trees produced by the various packages. The dataset is the same human TM7 G-protein couple receptor set that you saw in last week's demo. To launch this week's demo issue the following command:

% demo12

Activate gcg and use setplot to designate your graphics configuration before attempting to perform any sequence analysis functions in the following exercise.

4) Reevaluate your alignment!

Copy your MSF file from your week 11 directory to your week 12 directory:

% cp ../eleven/*.msf .

Remember last week you determined where the quality of your alignment degraded near the amino and carboxy termini for profile analysis. Well this week we are going to use the same two pieces of information for phylogenetic study. If you don't remember or can't find your notes as to where those coordinates should be, rerun the similarity visualization analyses to determine them. Here, in the interest of time, we will only trim out the jagged ends of the alignment. Even though the interior of your alignment should look pretty good, in a research situation it should be carefully evaluated and adjusted as necessary, nonetheless. As I stated in Exercise #11, many alignment editors are available for cleaning up multiple sequence alignments. However, here in the interest of time, I will merely show how the EGCG program ToPHYLIP can be used to extract and convert the portion of interest from an MSF file.

If you were to change any GCG MSF alignment with a non-GCG compatible editor, remember that any time any changes are made in any GCG sequence data file with any non-GCG savvy editor, it must be reformatted. However, reformatting MSF files requires a couple of tricks. If this step is not done exactly correct, you can get very weird results. You must use the -msf option and you must specify all the sequences within your MSF file, {*}:

% reformat -msf ef.msf{*}

You should not need to use this command in this exercise but I wanted to show you how it is used. If used, the new MSF file will follow all GCG conventions, with updated format, numbering, and checksums. Even though it's not required here, be sure to check the updated checksums after reformatting. If any of the sequence's checksums are now the same, they should be excluded from the rest of the analysis because they are now identical. Since the sequences would be identical it doesn't make any sense to ascertain their relationship. (If this were to happen, then you would need to comment out all but one of each identical sequence by placing an exclamation mark in front of their checksum lines and then running reformat -msf again to remove them from the file so that they would be ignored in subsequent steps.)

5) Format conversion: ToPHYLIP and ReadSeq.

Once a user understands the possible problems associated with converting data from GCG format into PHYLIP format, it is time to try it out. Data conversion can be done with either the ToPHYLIP or the ReadSeq program. I will show ToPHYLIP's use, as it has a couple of important advantages. It allows the user to select a range within an MSF file and it automatically converts all of the GCG period indel markers to hyphens. To use tophylip, supply the name of the MSF file to be used (using GCG's {*} designation), the beginning and ending coordinates, and a name for the generated output file. My example follows; repeat with your own dataset:

% tophylip

TOPHYLIP writes GCG sequences into a single file in PHYLIP format.

 TOPHYLIP uses any sequences

 TOPHYLIP of what sequence(s)  ? ef.msf{*}

             ef.msf{Eftu_Ecoli}, len: 488
             ef.msf{Eftu_Myctu}, len: 488
             ef.msf{Eftu_Anani}, len: 488
             ef.msf{Eftu_Theaq}, len: 488
             ef.msf{Eftu_Mycga}, len: 488
             ef.msf{Eftu_Thema}, len: 488
             ef.msf{Eftu_Chltr}, len: 488
             ef.msf{Ef1a_Arath}, len: 488
             ef.msf{Ef1a_Wheat}, len: 488
             ef.msf{Ef1a_Euggr}, len: 488
             ef.msf{Ef10_Xenla}, len: 488
             ef.msf{Ef11_Human}, len: 488
             ef.msf{Ef11_Drome}, len: 488
             ef.msf{Ef1a_Oncvo}, len: 488
             ef.msf{Ef1a_Yeast}, len: 488
             ef.msf{Ef1a_Enthi}, len: 488
             ef.msf{Ef1a_Tetpy}, len: 488
             ef.msf{Ef1a_Dicdi}, len: 488
             ef.msf{Ef1a_Plafk}, len: 488
             ef.msf{Ef1a_Giala}, len: 488
             ef.msf{Ef1a_Pyrwo}, len: 488
             ef.msf{Ef1a_Theac}, len: 488
             ef.msf{Eftu_Halma}, len: 488
             ef.msf{Eftu_Metva}, len: 488
             ef.msf{Ef1a_Sulso}, len: 488

                   Start (* 1 *) ?  8
                   End (* 488 *) ?  466

 What should I call the output file (* eftu_myctu.phylip *) ? ef.phy

Load your new file into the pico editor to take a look at the new format and review the data that was converted. PHYLIP requires a very simple format with the size of the data matrix specified on the top line, in my case 25 by 459. While the file is in the editor make the "?" change described in the introduction. As discussed earlier, the ToPHYLIP routine will replace gaps in the data with deletion symbols, "-," which is logically correct in most cases. However, gaps at the ends and beginnings of sequences should not have hyphens unless you really know that a deletion-insertion is responsible for the length discrepancy. Therefore, edit the ToPHYLIP output file to replace leading and trailing deletion symbols in any of your sequences that meet this criteria with the ambiquity symbol "x" or "?" depending on which is more appropriate. In my example the only sequence needing this adjustment was that from Giardia. This is also a good point at which to verify that the sequence names are exactly as you wish them to appear in the final resultant plots. PHYLIP sequence names can contain punctuation and capitalization, and up to ten characters or less. An abridged version of my corrected ToPHYLIP output follows:

 25 459
Eftu_Ecoli  RTKPHVNVGT IGHVDHGKTT LTAAITTVLA ---------- ---KTYG--G
Eftu_Myctu  RTKPHVNIGT IGHVDHGKTT LTAAITKVLH ---------- ---DKFPDLN
Eftu_Anani  RTKPHANIGT IGHVDHGKTT LTAAITTVLA ---------- ---KA-G-MA
Eftu_Theaq  RTKPHVNVGT IGHVDHGKTT LTAALTYVAA ---------- ---AE-NPNV
Eftu_Mycga  RSKPHVNIGT IGHIDHGKTT LTAAICTVLS ---------- ---KA--GTS
Eftu_Thema  RTKPHVNVGT IGHIDHGKST LTAAITKYLS ---------- ---LKV--LA
Eftu_Chltr  RNKPHINIGA IGHVDHGRTT LTAAITRTLS ---------- -----GDGLA
Ef1a_Arath  KEKFHINIVV IGHVDSGKST TTGHLIYKLG GIDKRVIERF EKEAAEMNKR
Ef1a_Wheat  KEKTHINIVV IGHVDSGKST TTGHLIYKLG GIDKRVIERF EKEAAEMNKR
Ef1a_Euggr  KEKVHISLVV IGHVDSGKST TTGHLIYKCG GIDKRTIEKF EKEASEMGKG
Ef10_Xenla  KEKTHINIVV IGHVDSGKST TTGHLIYKCG GIDKRTIEKF EKEAAEMGKG
Ef11_Human  KEKTHINIVV IGHVDSGKST TTGHLIYKCG GIDKRTIEKF EKEAAEMGKG
Ef11_Drome  KEKIHINIVV IGHVDSGKST TTGHLIYKCG GIDKRTIEKF EKEAQEMGKG
Ef1a_Oncvo  KEKTHINIVV IGHVDSGKST TTGHLIYKCG GIDKRTIEKF EKEAQEMGKG
Ef1a_Yeast  KEKSHINVVV IGHVDSGKST TTGHLIYKCG GIDKRTIEKF EKEAAELGKG
Ef1a_Enthi  KEKTHINIVV IGHVDSGKST TTGHLIYKCG GIDQRTIEKF EKESAEMGKG
Ef1a_Tetpy  GDKVHINLVV IGHVDSGKST TTGHLIYKCG GIDKRVIEKF EKESAEQGKG
Ef1a_Dicdi  SEKTHINIVV IGHVDAGKST TTGHLIYKCG GIDKRVIEKY EKEASEMGKQ
Ef1a_Plafk  KEKTHINLVV IGHVDSGKST TTGHIIYKLG GIDRRTIEKF EKESAEMGKG
Ef1a_Giala  ?????????? ????????ST LTGHLIYKCG GIDQRTIDEY EKRATEMGKG
Ef1a_Pyrwo  KDKPHVNIVF IGHVDHGKST TIGRLLYDTG NIPEQIIKKF -EEMGEKGK-
Ef1a_Theac  SQKPHLNLIT IGHVDHGKST LVGRLLYEHG EIPAHIIEEY RKEAEQKGKA
Eftu_Halma  SDEQHQNLAI IGHVDHGKST LVGRLLYETG SVPEHVIEQH KEEAEEKGKG
Eftu_Metva  KTKPILNVAF IGHVDAGKST TVGRLLLDGG AIDPQLIVRL RKEAEEKGKA
Ef1a_Sulso  SQKPHLNLIV IGQVDHGKST LVGRLLMDRG FIDEKTVKEA EEAAKKLGKE

            AARAFDQIDN APEEKARGIT INTSHVEYDT PTRHYAHVDC PGHADYVKNM
            ETKAFDQIDN APEERQRGIT INIAHVEYQT DKRHYAHVDA PGHADYIKNM
            KARAYADIDA APEEKARGIT INTAHVEYET GNRHYAHVDC PGHADYVKNM
            EVKDYGDIDK APEERARGIT INTAHVEYET AKRHYSHVDC PGHADYIKNM
            EAKKYDEIDA APEEKARGIT INTAHVEYAT QNRHYAHVDC PGHADYVKNM
            QYIPYDQIDK APEEKARGIT INITHVEYET EKRHYAHIDC PGHADYIKNM
            DFRDYSSIDN TPEEKARGIP INASHVEYET ANRHYAHVDC PCHADYVKNM
            SFKYAWVLDK LKAERERGIT IDIALWKFET TKYYCTVIDA PGHRDFIKNM
            SFKYAWVLDK LKAERERGIT IDIALWKFET TKYYCTVIDA PGHRDFIKNM
            SFKYAWVLDK LKAERERCIT IDIALWKFET AKSVFTIIDA PGHRDFIKNM
            SFKYAWVLDK LKAERERGIT IDISLWKFET SKYYVTIIDA PGHRDFIKNM
            SFKYAWVLDK LKAERERGIT IDISLWKFET SKYYVTIIDA PGHRDFIKNM
            SFKYAWVLDK LKAERERGIT IDIALWKFET AKYYVTIIDA PGHRDFIKNM
            SFKYAWVLDK LKAERERGIQ IDIALWKFET PKYYITIIDA PGHRDFIKNM
            SFKYAWVLDK LKAERERGIT IDIALWKFET PKYQVTVIDA PGHRDFIKNM
            SFKYAWVLDN LKAERERGIT IDISLWKFET SKYYFTIIDA PGHRDFIKNM
            SFKYAWVLDK LKAERERGIT IDISLWKFET AKYHFTIIDA PGHRDFIKNM
            SFKYAWVMDK LKAERERGIT IDIALWKFET SKYYFTIIDA PGHRDFIKNM
            SFKYAWVLDK LKAERERGIT IDIALWKFET PRYFFTVIDA PGHKDFIKNM
            SFKYAWVLDQ LKDERERGIT INIALWKFET KKYIVTIIDA PGHRDFIKNM
            SFKFAWVMDR LREERERGIT IDVAHTKFET PHRYITIIDA PGHRDFVKNM
            TFEFAWVMDR FKEERERGVT IDLAHRKFET DKYYFTLIDA PGHRDFVKNM
            GFEFAYVMDN LAEERERGVT IDIAHQEFST DTYDFTIVDC PGHRDFVKNM
            GFEFAYVMDG LKEERERGVT IDVAHKKFPT AKYEVTIVDC PGHRDFIKNM
            SEKFAFLLDR LKEERERGVT INLTFMRFET KKYFFTIIDA PGHRDFVKNM

//////////////////////////////////////////////////////////////////

            ---PEGVEMV MPGDNIKMVV TLIHPIAMDD GL------RF AIREGGRTVG
            ---PEGTEMV MPGDNTNISV KLIQPVAMDE GL------RF AIREGGRTVG
            ---GSAAEMV IPGDRIKMTV ELINPIAIEQ GM------RF AIREGGRTIG
            ---PQGVEMV MPGDNVTFTV ELIKPVALEE GL------RF AIREGGRTVG
            ---KEGTEMV MPGDNTEIIV ELISSIACEK GS------KF SIREGGRTVG
            ---PEGVEMV MPGDHVEMEI ELIYPVAIEK GQ------RF AVREGGRTVG
            ---PEGVEMV MPGDNVEFEV QLISPVALEE GM------RF AIREGGRTIG
            --EKE-PKFL KNGDAGMVKM TPTKPMVVET FSEYPPLGRF AVRDMRQTVA
            --EAL-PKFL KNGDAGIVKM IPTKPMVVET FATYPPLGRF AVRDMRQTVA
            --EAE-PKFI KSGDAAIVLM KPQKPMCVES FTDYPPLG-V SCGDMRQTVA
            --E-DNPKFL KSGDAAIVDM IPGKPMCVES FSDYPPLGRF AVRDMRQTVA
            --E-DGPKFL KSGDAAIVDM VPGKPMCVES FSDYPPLGRF AVRDMRQTVA
            --E-ENPKFI KSGDAAIVNL VPSKPLCVEA FQEFPPLGRF AVRDMRQTVA
            --E-DNPKSL KSGDAGIIDL IPTKPLCVET FTEYPPLGRF AVRDMRQTVA
            --E-DHPKFL KSGDAALVKF VPSKPMCVEA FSEYPPLGRF AVRDMRQTVA
            --EGGEPEYI KNGDSALVKI VPTKPLCVEE FAKFPPLGRF AVRDMKQTVA
            --E-ENPKFI KNGDAALVTL IPTKALCVEV FQEYPPLGRY AVRDMKQTVA
            AKEGTAAVVL KNGDAAMVEL TPSRPMCVES FTEYPPLGRF AVRDMRQTVA
            E---ENPKAI KSGDSALVSL EPKKPMVVET FTEYPPLGRF AIRDMRQTIA
            KPEMENPPDA GRGDCIIVKM VPQKPLCCET FNDYAPLGPF AVR???????
            E---ENPQFI KTGDAAIVIL RPMKPVVLEP VKEIPQLGRF AIRDMGMTIA
            K---EKPDFI KNGDVAIVKV IPDKPLVIEK VSEIPQLGRF AVLDMGQTVA
            E---ENPDFI QNGDAAVVTV RPQKPLSIEP SSEIPELGSF AIRDMGQTIA
            E---ENPDFL KAGDAAIVKL IPTKPMVIES VKEIPQLGRF AIRDMGMTVA
            E---KNPQFL KQGDVAIVKF KPIKPLCVEK YNEFPPLGRF AMRDMGKTVG

            AGVVAKVLS
            AGRVTKIIK
            AGVVSKILQ
            AGVVTKILE
            AGTVVEVLE
            AGVVTEVIE
            AGTISKIIA
            VGVIKSVDK
            VGVIKGVEK
            VGVIKSVNK
            VGVIKAVEK
            VGVIKAVDK
            VGVIKAVNF
            VGVIKNVD-
            VGVIKSVD-
            VGVVKAVTP
            VGVIKKVEK
            VGVIKSTVK
            VGIINQLKR
            ?????????
            AGMVISIQR
            AGQCIDLEK
            AGKVLGVNE
            AGMAIQVTA
            VGIIVDVKP

Don Gilbert's program ReadSeq can also be used to change the format into something acceptable to PHYLIP use; however it does not allow you to select a range within the alignment nor does it automatically convert periods to hyphens. It is more flexible in that it can convert between many different file formats. If you ever do use ReadSeq, note that it runs a bit backward from what you are used to. It asks for an appropriate output file name first, and then you choose between the offered formats, 12 for the current PHYLIP format. Finally, the program asks you to designate an input sequence (Do not use the GCG {*} designator; this is not a GCG program.), and then you specify All the sequences. When the program again asks for an input sequence, you press return, and let it do its thing. A screen trace of my session is shown below, but you should not need to use this program at this point:

% readseq

readSeq (1Feb93), multi-format molbio sequence reader.

Name of output file (?=help, defaults to display): ef.phylip
         1. IG-Stanford           10. Olsen (in-only)
         2. GenBank-GB            11. Phylip3.2
         3. NBRF                  12. Phylip
         4. EMBL                  13. Plain-Raw
         5. GCG                   14. PIR-CODATA
         6. DNAStrider            15. MSF
         7. Fitch                 16. ASN.1
         8. Pearson-Fasta         17. PAUP-NEXUS
         9. Zuker (in-only)       18. Pretty (out-only)

Choose an output format (name or #):
12

Name an input sequence or -option:
ef.msf
Sequences in ef_trimmed.msf  (format is 15. MSF)
 1)  Eftu_Ecoli       Len:   458  Check: 3692  Weight:  1.00
 2)  Eftu_Myctu       Len:   458  Check: 6588  Weight:  1.00
 3)  Eftu_Anani       Len:   458  Check: 1340  Weight:  1.00
 4)  Eftu_Theaq       Len:   458  Check: 6646  Weight:  1.00
 5)  Eftu_Mycga       Len:   458  Check: 8215  Weight:  1.00
 6)  Eftu_Thema       Len:   458  Check: 6282  Weight:  1.00
 7)  Eftu_Chltr       Len:   458  Check:  593  Weight:  1.00
 8)  Ef1a_Arath       Len:   458  Check: 3369  Weight:  1.00
 9)  Ef1a_Wheat       Len:   458  Check: 2977  Weight:  1.00
 10)  Ef1a_Euggr       Len:   458  Check:  449  Weight:  1.00
 11)  Ef10_Xenla       Len:   458  Check: 1707  Weight:  1.00
 12)  Ef11_Human       Len:   458  Check: 1966  Weight:  1.00
 13)  Ef11_Drome       Len:   458  Check: 6824  Weight:  1.00
 14)  Ef1a_Oncvo       Len:   458  Check: 8500  Weight:  1.00
 15)  Ef1a_Yeast       Len:   458  Check: 5533  Weight:  1.00
 16)  Ef1a_Enthi       Len:   458  Check: 1857  Weight:  1.00
 17)  Ef1a_Tetpy       Len:   458  Check:  351  Weight:  1.00
 18)  Ef1a_Dicdi       Len:   458  Check: 5016  Weight:  1.00
 19)  Ef1a_Plafk       Len:   458  Check: 1002  Weight:  1.00
 20)  Ef1a_Giala       Len:   458  Check: 2365  Weight:  1.00
 21)  Ef1a_Pyrwo       Len:   458  Check:  995  Weight:  1.00
 22)  Ef1a_Theac       Len:   458  Check: 2700  Weight:  1.00
 23)  Eftu_Halma       Len:   458  Check: 8074  Weight:  1.00
 24)  Eftu_Metva       Len:   458  Check: 7851  Weight:  1.00
 25)  Ef1a_Sulso       Len:   458  Check: 5115  Weight:  1.00

Choose a sequence (# or All):
all

Name an input sequence or -option:  <rtn>

Again, if you were to use ReadSeq, you would need to make a couple more essential changes before it is correct for PHYLIP. First of all, as described above, periods, ".," will not work to represent indels (gaps); they must all be changed to hyphens, "-." Unlike ToPHYLIP ReadSeq does not do this step for you. Secondly, just as we did with the ToPHYLIP output, you would need to evaluate the terminal ends of your data matrix; if any of the implied indels were uncertain (especially true if sequence lengths were different), then you would substitute question marks, "?" for the new hyphens. Output from either conversion program should work exactly the same in PHYLIP. ToPHYLIP will require less editing work afterwards, but it will only do GCG to PHYLIP conversions. ReadSeq is more flexible but requires some major changes to its output.

6) Phylogenetic tree estimation -- GCG and PHYLIP.

6.a.1. Distance methods: GCG's Distances and GrowTree.

With the data in a usable format, it is time to go on to evolutionary studies. First let's step back and use the GCG MSF file to check out GCG's programs. Presently GCG only includes a couple of distance methods for phylogenetic inference. The evolutionary distance between each sequence is first calculated, compensating for multiple substitutions at homologous sites, and then those distances are used in a neighbor-joining algorithm to estimate the best evolutionary tree for the data. The distance calculation program is appropriately named distances. Please use GCG's online genmanual help system to read their complete description of the Molecular Evolution program Distances -- they describe problems and considerations for this whole area very well. Since Distances is a GCG program, the {*} designation is required. Please restrict your analysis to the same range that you determined earlier. You can accept the default parameters for the program, however, I encourage you to learn about the various evolutionary models incorporated. A screen trace of the process for my example follows; use your own dataset in your run:

% distance ef.msf{*}

DISTANCES creates a table of the pairwise distances within a group of
aligned sequences.

                   Start (* 1 *) ?  8
                   End (* 488 *) ?  466

 Reading sequences...
                Eftu_Ecoli: 459 total, 459 read
                Eftu_Myctu: 459 total, 459 read
                Eftu_Anani: 459 total, 459 read
                Eftu_Theaq: 459 total, 459 read
                Eftu_Mycga: 459 total, 459 read
                Eftu_Thema: 459 total, 459 read
                Eftu_Chltr: 459 total, 459 read
                Ef1a_Arath: 459 total, 459 read
                Ef1a_Wheat: 459 total, 459 read
                Ef1a_Euggr: 459 total, 459 read
                Ef10_Xenla: 459 total, 459 read
                Ef11_Human: 459 total, 459 read
                Ef11_Drome: 459 total, 459 read
                Ef1a_Oncvo: 459 total, 459 read
                Ef1a_Yeast: 459 total, 459 read
                Ef1a_Enthi: 459 total, 459 read
                Ef1a_Tetpy: 459 total, 459 read
                Ef1a_Dicdi: 459 total, 459 read
                Ef1a_Plafk: 459 total, 459 read
                Ef1a_Giala: 459 total, 459 read
                Ef1a_Pyrwo: 459 total, 459 read
                Ef1a_Theac: 459 total, 459 read
                Eftu_Halma: 459 total, 459 read
                Eftu_Metva: 459 total, 459 read
                Ef1a_Sulso: 459 total, 459 read

 Which distance correction method to use ?

      1 Uncorrected distance
      2 Jukes-Cantor distance
      3 Kimura protein distance

 Choose the method to use: (* 3 *)  <rtn>


 What should I call the distance matrix file (* Ef.distances *) ? <rtn>

Computing distances using Kimura method...

 1 x  2:  29.58    1 x  3:  29.58
 1 x  4:  31.62    1 x  5:  35.07
 1 x  6:  36.62    1 x  7:  41.61
 1 x  8: 137.12    1 x  9: 138.51
 1 x 10: 145.30    1 x 11: 140.88
 1 x 12: 140.88    1 x 13: 145.15
 1 x 14: 138.13    1 x 15: 137.67
 1 x 16: 126.24    1 x 17: 130.47
 1 x 18: 137.59    1 x 19: 133.48
 1 x 20: 150.80    1 x 21: 116.45
 1 x 22: 127.63    1 x 23: 122.34
 1 x 24: 131.62    1 x 25: 137.57
 2 x  3:  40.65    2 x  4:  32.21
 2 x  5:  40.35    2 x  6:  43.03
 2 x  7:  44.73    2 x  8: 134.46
 2 x  9: 133.14    2 x 10: 140.88
 2 x 11: 135.51    2 x 12: 135.51
 2 x 13: 140.88    2 x 14: 140.88
 2 x 15: 136.37    2 x 16: 135.33
 2 x 17: 137.16    2 x 18: 137.62
 2 x 19: 136.22    2 x 20: 141.92
 2 x 21: 116.20    2 x 22: 122.34
 2 x 23: 116.27    2 x 24: 132.55
 2 x 25: 130.47    3 x  4:  33.13
//////////////////////////////////////////////////////////////////
20 x 22:  75.07   20 x 23:  80.20
20 x 24:  77.45   20 x 25:  72.60
21 x 22:  47.41   21 x 23:  57.46
21 x 24:  51.95   21 x 25:  60.29
22 x 23:  57.96   22 x 24:  48.87
22 x 25:  55.50   23 x 24:  53.12
23 x 25:  71.61   24 x 25:  66.64

Statistics on pairwise distances:

 171 of 300 pairs have distances exceeding 70.0.


*** If the distance between many pairs of sequences    ***
*** exceeds 70.0, the tree will not be reliable.       ***

Notice the warning message at the conclusion of my run. My dataset may be too diverse to attempt GCG's distance based neighbor-joining analysis with; yours should be much better. In general the GCG programs work best with a much more cohesive dataset. Things that could be done to increase reliablity include further editing of the initial alignment to exclude the most diverse regions. The GCG Distance program excludes positions with gaps when using Kimura protein corrections and only exact matches contribute to the match score. It is not the most robust of methods; several PHYLIP alternatives to the GCG programs will be explored this week.

To infer the neighbor-joining tree of the distance data run GCG's GrowTree program. Again, all defaults can be used for this example. Whatever you do, do not use UPGMA. UPGMA imposes an absolutely uniform molecular clock across all of your data. This is seldom, if ever, the situation. A screen trace of my example follows:

% growtree

GrowTree creates a phylogenetic tree from a distance matrix created by
Distances using either the UPGMA or neighbor-joining method. You can create
a text or graphics output file.

 What is the distance matrix ?  ef.distances

 Which method to use ?

      1 Neighbor-joining
      2 UPGMA

 Choose the method to use: (* 1 *)  <rtn>


 What should I call the trees file (* ef.trees *) ? <rtn>

 23 internal, 25 terminal nodes

 When your TEK4107 attached to tty is ready, press <Return>. <rtn>

Branch lengths are proportional to evolutionary divergence and have the unit of substitutions per 100 characters. The exact numbers can be seen in the output .trees file from the program. This tree file is written in a nested parenthetical connotation known as the Newick standard. Many phylogeny programs, including PHYLIP, can work with this format data. I recognize some immediate problems with the GCG interpretation. Based on the sequence names I know, some of the relationships found are incorrect! In particular EF1a_Oncvo is EF-1[[proportional]] from Onchocerca volvulus, a Spirurian nematode that parasitizes human beings. This represents an anciently diverged group of metazoans -- but not the implied out-group to all of life! Several problems can also be seen in some of the protista assignments, but these critters' taxonomy is a bit controversial in the literature also.

6.a.2. Estimate the distance matrix: PHYLIP's ProtDist.

Let's use the PHYLIP package now to see how its results differ. First I will show the use of a couple of the distance methods in the package. All the distance methods require that a distance matrix first be prepared from the alignment matrix. The GCG distance matrix is not suitable as input to PHYLIP. As PHYLIP is installed on ribozyme it uses standardized file input and output names depending on the program. The programs are all triggered by entering their name. Unless you have a file named infile in your directroy, the first thing that you see is an error message about not being able to read the input file; supply the program with your input file's name. Once in the program a menu comes up allowing you to pick the desired options. An option that I will not be showing, but that you should use with your subset alignment, is the designation of an out-group. Since the example data matrix is a survey of all cellular life, there is no appropriate outgroup within it. Can you think of a way to use an out-group to establish the root of such a tree? The answer to this question will garner some extra-credit points for you. When you've chosen all your desired options, respond with y to the prompt about the settings being correct, and off the process goes.

PHYLIP's protein distance matrix program, protdist, allows the use of three alternate multiple substitution correction models. Refer to the documentation for an explanation of the models. The default PAM model is much more robust than the Kimura correction previously used. However, it takes a while to run with the size data set I am showing. Therefore, I will toggle to PHYLIP's version of the Kimura correction model by typing p at the "Are these settings correct?" prompt. For your run with the subset alignment use the default PAM correction since it should be a much smaller data set. For my run I specify ef.phy as the input alignment matrix. Follow the screen trace below, substituting your data for mine and accepting the Dayhoff coorection model:

% protdist

Please enter a new filename>ef.phy

Protein distance algorithm, version 3.55c

Settings for this run:
  P  Use PAM, Kimura or categories model?  Dayhoff PAM matrix
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes

Are these settings correct? (type Y or the letter for one to change)
p

(Do not switch to Kimura with your data!)

Protein distance algorithm, version 3.55c

  P  Use PAM, Kimura or categories model?  Kimura formula
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes

Are these settings correct? (type Y or the letter for one to change)
y

Computing distances:
  Eftu_Ecoli
  Eftu_Myctu   .
  Eftu_Anani   ..
  Eftu_Theaq   ...
  Eftu_Mycga   ....
  Eftu_Thema   .....
  Eftu_Chltr   ......
  Ef1a_Arath   .......
  Ef1a_Wheat   ........
  Ef1a_Euggr   .........
  Ef10_Xenla   ..........
  Ef11_Human   ...........
  Ef11_Drome   ............
  Ef1a_Oncvo   .............
  Ef1a_Yeast   ..............
  Ef1a_Enthi   ...............
  Ef1a_Tetpy   ................
  Ef1a_Dicdi   .................
  Ef1a_Plafk   ..................
  Ef1a_Giala   ...................
  Ef1a_Pyrwo   ....................
  Ef1a_Theac   .....................
  Eftu_Halma   ......................
  Eftu_Metva   .......................
  Ef1a_Sulso   ........................

Output written to output file

Whenever you run any PHYLIP program, immediately rename the output so that subsequent runs will not overwrite your files. In my case ProtDist produced an output distance matrix file named outfile so I rename it to ef.protdist:

% mv outfile ef.protdist

These matrices can be wider than your screen so they don't look very good when displayed but you're welcome to take a look.

6.a.2.1. Fit the best tree to the distance data: PHYLIP's Fitch.

Next we can pass the new distance matrix to one of several distance programs in PHYLIP. I'll start with Fitch, a least-squares fit algorithm. Least-square fit is probably the most powerful way of estimating a tree from distance data but it is fairly computationally intense. Notice the time it takes to run in the following session. You need to specify the input distance matrix; the program will produce an output text file and an output tree file. Note the options I choose in the example below. You should always improve tree reliability by taking advantage of the Global rearrangement and Jumble (Randomize) sequence order options. The jumble option is quite powerful and should usually be taken advantage of since the results of most of the tree construction algorithms are dependent on the order of sequence input. It is a very good idea to jumble multiple times, at least ten; however, here, in the interest of time, I will only jumble once. You may want to multiply jumble with your own dataset since it is smaller than mine. Also realize that the inferred tree produced by my example below will not likely be as realiable as possible since I only used the Kimura correction model and we already saw with GCG that the Kimura model does not work very well with this data set. My Fitch screen trace follows:

% fitch

fitch:  can't read infile
Please enter a new filename>ef.protdist

Fitch-Margoliash method version 3.53c

Settings for this run:
  U                 Search for best tree?  Yes
  P                                Power?  2.00000
  -      Negative branch lengths allowed?  No
  O                        Outgroup root?  No, use as outgroup species 1
  L         Lower-triangular data matrix?  No
  R         Upper-triangular data matrix?  No
  S                        Subreplicates?  No
  G                Global rearrangements?  No
  J     Randomize input order of species?  No. Use input order
  M           Analyze multiple data sets?  No
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4       Write out trees onto tree file?  Yes

Are these settings correct? (type Y or the letter for one to change)
g

Fitch-Margoliash method version 3.53c

Settings for this run:
  U                 Search for best tree?  Yes
  P                                Power?  2.00000
  -      Negative branch lengths allowed?  No
  O                        Outgroup root?  No, use as outgroup species 1
  L         Lower-triangular data matrix?  No
  R         Upper-triangular data matrix?  No
  S                        Subreplicates?  No
  G                Global rearrangements?  Yes
  J     Randomize input order of species?  No. Use input order
  M           Analyze multiple data sets?  No
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4       Write out trees onto tree file?  Yes

Are these settings correct? (type Y or the letter for one to change)
j
Random number seed (must be odd)?
123
Number of times to jumble?
1

Fitch-Margoliash method version 3.53c

Settings for this run:
  U                 Search for best tree?  Yes
  P                                Power?  2.00000
  -      Negative branch lengths allowed?  No
  O                        Outgroup root?  No, use as outgroup species 1
  L         Lower-triangular data matrix?  No
  R         Upper-triangular data matrix?  No
  S                        Subreplicates?  No
  G                Global rearrangements?  Yes
  J     Randomize input order of species?  Yes (seed =     123,  1 times)
  M           Analyze multiple data sets?  No
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4       Write out trees onto tree file?  Yes

Are these settings correct? (type Y or the letter for one to change)
y

Adding species:
   Ef1a_Tetpy
   Ef1a_Giala
   Ef1a_Wheat
   Ef11_Drome
   Ef1a_Sulso
   Ef1a_Arath
   Eftu_Myctu
   Ef1a_Dicdi
   Ef1a_Enthi
   Eftu_Theaq
   Eftu_Anani
   Eftu_Mycga
   Ef1a_Oncvo
   Ef1a_Euggr
   Eftu_Thema
   Ef1a_Yeast
   Ef10_Xenla
   Ef1a_Theac
   Ef1a_Plafk
   Eftu_Ecoli
   Eftu_Metva
   Ef11_Human
   Ef1a_Pyrwo
   Eftu_Chltr
   Eftu_Halma
Doing global rearrangements
  !-----------------------!
   ......................
Output written to output file

Tree also written onto file

Don't forget to immediately rename the output files.

% mv outfile ef_kim_protdist.fitch

% mv treefile ef_kim_protdist.fitch_tree

I apologize for the long ugly file names, but it's very important to keep track of what all the files are. Fitch produces both a tree file and text file. The text file shows the least-squares fit formula used, an ASCII representation of the tree, and it lists all branch lengths. The text output from this Fitch run is shown below:

% more ef_kim_protdist.fitch

  25 Populations

Fitch-Margoliash method version 3.53c

                  __ __             2
                  \  \   (Obs - Exp)
Sum of squares =  -_ -_  ------------
                                2
                   i  j      Obs

Negative branch lengths not allowed

global optimization

  +-----Eftu_Anani
  !
  !     +----Eftu_Myctu
  !     !
  !     !     +---Eftu_Theaq
  !     !     !
  !     !     !                       +------Ef1a_Pyrwo
  !     !     !                       !
  !     !     !                       !        +--------Eftu_Halma
  !     !     !  +--------------------5     +-23
  !     !     !  !                    !  +-19  +------Eftu_Metva
  !     !     !  !                    !  !  !
  !     !     !  !                    !  !  +------Ef1a_Theac
  !     !     !  !                    +-21
  !     !  +--9  !                       !  +---------Ef1a_Sulso
  !  +-10  !  !  !                       !  !
  !  !  !  !  !  !                       +-16     +-------Ef1a_Giala
  !  !  !  !  !  !                          !     !
  !  !  !  !  !  !                          +-----3  +-----Ef1a_Plafk
  !  !  !  !  !  !                                !  !
  !  !  !  !  !  !                                +--7  +---Ef1a_Enthi
  !  !  !  !  !  !                                   !  !
  !  !  !  !  !  !                                   !  !  +---Ef1a_Tetpy
  !  !  !  !  !  !                                   +-17  !
  !  !  !  !  !  !                                      !  !     +---Ef1a_Dicdi
  !  !  !  !  +-13                                      !  !     !
  !  !  !  !     !                                      +--2  +--6  +---Ef1a_Yeast
  !  !  !  !     !                                         !  !  !  !
  !  !  !  !     !                                         !  !  +-14  +---Ef1a_Oncvo
  !  !  +--8     !                                         !  !     !  !
-18-22     !     !                                         !  !     +-11     +Ef 11_Human
  !  !     !     !                                         +--1        !  +-20
  !  !     !     !                                            !        +-15  +Ef 10_Xenla
  !  !     !     !                                            !           !
  !  !     !     !                                            !           +--Ef11_Drome
  !  !     !     !                                            !
  !  !     !     !                                            !  +--Ef1a_Euggr
  !  !     !     !                                            +-12
  !  !     !     !                                               !  +Ef1a_Arath
  !  !     !     !                                               +--4
  !  !     !     !                                                  +Ef1a_Wheat
  !  !     !     !
  !  !     !     +----Eftu_Thema
  !  !     !
  !  !     +-----Eftu_Mycga
  !  !
  !  +-------Eftu_Chltr
  !
  +---Eftu_Ecoli


remember: this is an unrooted tree!

Sum of squares =     1.08505

Average percent standard deviation =     4.25965

examined 2465 trees

Between        And            Length
-------        ---            ------
  18          Eftu_Anani        0.18009
  18            22              0.01715
  22            10              0.01323
  10          Eftu_Myctu        0.16693
  10             8              0.00967
   8             9              0.02307
   9          Eftu_Theaq        0.12224
   9            13              0.04593
  13             5              0.71041
   5          Ef1a_Pyrwo        0.22071
   5            21              0.01252
  21            19              0.01659
  19            23              0.03278
  23          Eftu_Halma        0.29008
  23          Eftu_Metva        0.24108
  19          Ef1a_Theac        0.23225
  21            16              0.03238
  16          Ef1a_Sulso        0.32437
  16             3              0.18183
   3          Ef1a_Giala        0.25953
   3             7              0.04840
   7          Ef1a_Plafk        0.19324
   7            17              0.01526
  17          Ef1a_Enthi        0.14175
  17             2              0.01758
   2          Ef1a_Tetpy        0.14047
   2             1              0.02012
   1             6              0.00469
   6          Ef1a_Dicdi        0.13864
   6            14              0.01109
  14          Ef1a_Yeast        0.12813
  14            11              0.01120
  11          Ef1a_Oncvo        0.10671
  11            15              0.00514
  15            20              0.05389
  20          Ef11_Human        0.01744
  20          Ef10_Xenla        0.01988
  15          Ef11_Drome        0.09499
   1            12              0.01663
  12          Ef1a_Euggr        0.11003
  12             4              0.09289
   4          Ef1a_Arath        0.02677
   4          Ef1a_Wheat        0.02154
  13          Eftu_Thema        0.15424
   8          Eftu_Mycga        0.19458
  22          Eftu_Chltr        0.26297
  18          Eftu_Ecoli        0.11573

Some of the lines wrap making the tree a little funny looking, but you get the idea. At least the nematode sequence is now in with the rest of the animals! Take a look at the tree file to get a feel for the Newick format. The corresponding Fitch tree file follows:

% type ef_kim_protdist.fitch_tree

(Eftu_Anani:0.18009,((Eftu_Myctu:0.16693,((Eftu_Theaq:0.12224,
((Ef1a_Pyrwo:0.22071,(((Eftu_Halma:0.29008,Eftu_Metva:0.24108):0.03278,
Ef1a_Theac:0.23225):0.01659,(Ef1a_Sulso:0.32437,(Ef1a_Giala:0.25953,
(Ef1a_Plafk:0.19324,(Ef1a_Enthi:0.14175,(Ef1a_Tetpy:0.14047,
((Ef1a_Dicdi:0.13864,(Ef1a_Yeast:0.12813,(Ef1a_Oncvo:0.10671,
((Ef11_Human:0.01744,Ef10_Xenla:0.01988):0.05389,Ef11_Drome:0.09499):0.00514:0.
01120):0.01109):0.00469,
(Ef1a_Euggr:0.11003,(Ef1a_Arath:0.02677,Ef1a_Wheat:0.02154):0.09289):0.01663:0.
02012):0.01758):0.01526):0.04840):0.18183):0.03238):0.01252):0.71041,
Eftu_Thema:0.15424):0.04593):0.02307,Eftu_Mycga:0.19458):0.00967):0.01323,
Eftu_Chltr:0.26297):0.01715,Eftu_Ecoli:0.11573);

Many PHYLIP programs generate tree file data in Newick format. Two PHYLIP drawing routines, DrawTree and DrawGram, can produce graphical tree representations from tree files. However, we will delay using them until we've run through all the PHYLIP tree inference programs that I want to introduce.

6.a.2.2. Bootstrapping neighbor-joining techniques.

Next I will show how to run bootstrapped neighbor-joining analysis. Bootstrapping is a statistical method for ascertaining input data reliability. It randomly selects column subsets of your alignment data matrix up to the same overall size as the original to create as many test sets as specified. A consensus is then made of all the resultant analyses; the longest branch lengths are those best resolved. Normally bootstrapping is done with a minimum of 100 replicates. Because of this neighbor-joining is often used to estimate the best tree of the results because it is so much faster than all other methods even though it is also the least reliable. All methods accept the multiple dataset input option necessary for bootstrapping -- they all have tradeoffs, one way or another. Use seqboot to bootstrap your original PHYLIP compatible input data set using the following screen trace as an example:

% seqboot

seqboot:  can't read infile
Please enter a new filename>ef.phy

Random number seed (must be odd)?
431

Bootstrapped sequences algorithm, version 3.53c

Settings for this run:
  D   Sequence, Morph, Rest., Gene Freqs?  Molecular sequences
  J     Bootstrap, Jackknife, or Permute?  Bootstrap
  R                  How many replicates?  100
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes

Are these settings correct? (type Y or the letter for one to change)
y

completed replicate number   10
completed replicate number   20
completed replicate number   30
completed replicate number   40
completed replicate number   50
completed replicate number   60
completed replicate number   70
completed replicate number   80
completed replicate number   90
completed replicate number  100

Output written to output file

Rename the output file:

% mv outfile ef.seqboot

The output alignment data matrix now contains 100 randomly selected sequence data matrices. To generate 100 distance matrices, launch protdist, being very careful to specify multiple datasets with the M option. Here you may want to take advantage of the speed of Kimura's approximation model, although, as stated previously, Dayhoff's model often seems to offer better estimates of protein divergence. Issue the following command, to see the accompanying, much abridged screen trace (This will take a while to run!):

% protdist

protdist:  can't read infile
Please enter a new filename>ef.seqboot

Protein distance algorithm, version 3.53c

Settings for this run:
  P  Use PAM, Kimura or categories model?  Dayhoff PAM matrix
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes

Are these settings correct? (type Y or the letter for one to change)
p

Protein distance algorithm, version 3.53c

Settings for this run:
  P  Use PAM, Kimura or categories model?  Kimura formula
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes

Are these settings correct? (type Y or the letter for one to change)
m
How many data sets?
100

Protein distance algorithm, version 3.53c

Settings for this run:
  P  Use PAM, Kimura or categories model?  Kimura formula
  M           Analyze multiple data sets?  Yes, 100 sets
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes

Are these settings correct? (type Y or the letter for one to change)
y

////////////////////////////////////////////////////////////////////

Data set # 32:

Computing distances:
  Eftu_Ecoli
  Eftu_Myctu   .
  Eftu_Anani   ..
  Eftu_Theaq   ...
  Eftu_Mycga   ....
  Eftu_Thema   .....
  Eftu_Chltr   ......
  Ef1a_Arath   .......
  Ef1a_Wheat   ........
  Ef1a_Euggr   .........
  Ef10_Xenla   ..........
  Ef11_Human   ...........
  Ef11_Drome   ............
  Ef1a_Oncvo   .............
  Ef1a_Yeast   ..............
  Ef1a_Enthi   ...............
  Ef1a_Tetpy   ................
  Ef1a_Dicdi   .................
  Ef1a_Plafk   ..................
  Ef1a_Giala   ...................
  Ef1a_Pyrwo   ....................
  Ef1a_Theac   .....................
  Eftu_Halma   ......................
  Eftu_Metva   .......................
  Ef1a_Sulso   ........................

////////////////////////////////////////////////////////////////////

Output written to output file

Don't forget to rename the output:

% mv outfile ef_seqboot.kim_protdist

Now submit the 100 matrix distance file to neighbor. This program reads the input distance matrix and outputs an outfile and treefile. Be sure to use the Multiple option and specify 100 multiple datasets. Always randomize the input order with the Jumble option and if it really matters make multiple runs with different seed numbers and compare results. Issue the following command to see the shortened screen trace below:

% neighbor

neighbor:  can't read infile
Please enter a new filename>ef_seqboot.kim_protdist

Neighbor-Joining-UPGMA method version 3.5

Settings for this run:
  N       Neighbor-joining or UPGMA tree?  Neighbor-joining
  O                        Outgroup root?  No, use as outgroup species  1
  L         Lower-triangular data matrix?  No
  R         Upper-triangular data matrix?  No
  S                        Subreplicates?  No
  J     Randomize input order of species?  No. Use input order
  M           Analyze multiple data sets?  No
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4       Write out trees onto tree file?  Yes

Are these settings correct? (type Y or the letter for one to change)
j
Random number seed (must be odd)?
531
Neighbor-Joining-UPGMA method version 3.5

Settings for this run:
  N       Neighbor-joining or UPGMA tree?  Neighbor-joining
  O                        Outgroup root?  No, use as outgroup species  1
  L         Lower-triangular data matrix?  No
  R         Upper-triangular data matrix?  No
  S                        Subreplicates?  No
  J     Randomize input order of species?  Yes (random number seed =     531)
  M           Analyze multiple data sets?  No
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4       Write out trees onto tree file?  Yes

Are these settings correct? (type Y or the letter for one to change)
m
How many data sets?
100

Neighbor-Joining-UPGMA method version 3.5

Settings for this run:
  N       Neighbor-joining or UPGMA tree?  Neighbor-joining
  O                        Outgroup root?  No, use as outgroup species  1
  L         Lower-triangular data matrix?  No
  R         Upper-triangular data matrix?  No
  S                        Subreplicates?  No
  J     Randomize input order of species?  Yes (random number seed =    531)
  M           Analyze multiple data sets?  Yes, 100 sets
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4       Write out trees onto tree file?  Yes

Are these settings correct? (type Y or the letter for one to change)
y

////////////////////////////////////////////////////////////////////

Data set # 100:

CYCLE  22: OTU   3 (   0.19325) JOINS OTU   7 (   0.21724)
CYCLE  21: NODE  3 (   0.04934) JOINS OTU   1 (   0.09480)
CYCLE  20: NODE  3 (   0.02367) JOINS OTU   2 (   0.12767)
CYCLE  19: NODE  3 (   0.03237) JOINS OTU   4 (   0.13317)
CYCLE  18: NODE  3 (   0.03404) JOINS OTU   5 (   0.17077)
CYCLE  17: NODE  3 (   0.04335) JOINS OTU   6 (   0.13085)
CYCLE  16: OTU  24 (   0.21758) JOINS OTU  23 (   0.25517)
CYCLE  15: NODE 24 (   0.06488) JOINS OTU  22 (   0.18498)
CYCLE  14: OTU  21 (   0.23252) JOINS NODE  3 (   0.74259)
CYCLE  13: NODE 24 (   0.02977) JOINS NODE 21 (   0.02573)
CYCLE  12: NODE 24 (   0.01256) JOINS OTU  25 (   0.30032)
CYCLE  11: NODE 24 (   0.13793) JOINS OTU  20 (   0.23718)
CYCLE  10: OTU   9 (   0.02849) JOINS OTU   8 (   0.02749)
CYCLE   9: OTU  12 (   0.02101) JOINS OTU  11 (   0.01657)
CYCLE   8: OTU  13 (   0.07494) JOINS NODE 12 (   0.04168)
CYCLE   7: NODE 24 (   0.09052) JOINS OTU  19 (   0.23978)
CYCLE   6: NODE 24 (   0.00888) JOINS OTU  16 (   0.15109)
CYCLE   5: OTU  18 (   0.12845) JOINS OTU  15 (   0.09703)
CYCLE   4: NODE 24 (   0.01419) JOINS OTU  17 (   0.13658)
CYCLE   3: NODE  9 (   0.07550) JOINS OTU  10 (   0.10234)
CYCLE   2: NODE 24 (   0.01430) JOINS NODE  9 (   0.01234)
CYCLE   1: NODE 24 (   0.00717) JOINS NODE 18 (   0.01696)
LAST CYCLE:
 NODE 13  (   0.01920) JOINS OTU  14  (   0.11079) JOINS NODE 24  (0.00712)

Output written on output file

Tree written on tree file

Rename the output files immediately:

% mv outfile ef_seqboot_kim_protdist.neighbor

% mv treefile ef_seqboot_kim_protdist.neighbor_tree

The output text file is huge. You may want to look it over, but then delete it as it takes up an awful lot of space in your account. While deleting things it's a good idea to get rid of the SeqBooted alignment and distance matrices also since they are also very large and will not be needed any more in the exercise. Be sure not to delete the tree file output from this last run of Neighbor. Above I named it ef_seqboot_kim_protdist.neighbor_tree.

To condense all this information down into a single consensus tree with branch lengths proportional to bootstrap confidence levels run the PHYLIP program consense. Launch the program and specify the input tree file. The program will produce an output text file and the output consensus tree. Be sure to designate the outgroup in your dataset. Use the following abridged screen trace as a guide:

% consense

consense:  can't read infile
Please enter a new filename>ef_seqboot_kim_protdist.neighbor_tree

Majority-rule and strict consensus tree program, version 3.53c

Settings for this run:
  O                        Outgroup root?  No, use as outgroup species  1
  R        Trees to be treated as Rooted?  No
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1         Print out the sets of species  Yes
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4       Write out trees onto tree file?  Yes

Are these settings correct? (type Y or the letter for one to change)
y

Output written to output file

Tree also written onto file

Again, rename the output:

% mv outfile seqboot_neighbor.consense

% mv seqboot_neighbor.consense_tree

Check out the text file output from the Consense program. Again some of the tree's branches have been wrapped confusing the issue but you can get a feel for what happened. The numbers at the branch forks indicate the bootstrap value. The following abridged display show the results of this Consense run:

% more seqboot_neighbor.consense

Majority-rule and strict consensus tree program, version 3.53c

Species in order:

  Ef1a Oncvo
  Ef1a Dicdi
  Eftu Ecoli
  Eftu Chltr
  Eftu Anani
  Eftu Myctu
  Eftu Mycga
  Eftu Theaq
  Eftu Thema
  Eftu Halma
  Eftu Metva
  Ef1a Theac
  Ef1a Pyrwo
  Ef1a Sulso
  Ef1a Giala
  Ef1a Enthi
  Ef1a Plafk
  Ef1a Yeast
  Ef1a Tetpy
  Ef1a Wheat
  Ef1a Arath
  Ef1a Euggr
  Ef11 Drome
  Ef10 Xenla
  Ef11 Human

////////////////////////////////////////////////////////////////

CONSENSUS TREE:
the numbers at the forks indicate the number
of times the group consisting of the species
which are to the right of that fork occurred
among the trees, out of 100.00 trees

 
                                                                                  +----Eftu Anani  
                                                                             +-44.0
                                                                             !    +----Eftu Ecoli
                                                                        +-47.0
                                                                        !    !
                                                                   +-36.0    +---------Eftu Chltr
                                                                   !    !
                                                              +-56.0    +--------------Eftu Mycga
                                                              !    !
                                                         +-74.0    +-------------------Eftu Myctu
                                                         !    !
                                                    +100.0    +------------------------Eftu Theaq
                                                    !    !
                                                    !    +-----------------------------Eftu Thema
                                                    !
                                               +-51.0                             +----Eftu Metva
                                               !    !                        +-68.0
                                               !    !                   +-41.0    +----Eftu Halma
                                               !    !                   !    !
                                          +100.0    +----------------33.0    +---------Ef1a Theac
                                          !    !                        !
                                          !    !                        +--------------Ef1a Pyrwo
                                     +100.0    !
                                     !    !    +---------------------------------------Ef1a Sulso
                                +-44.0    !
                                !    !    +--------------------------------------------Ef1a Giala
                           +-70.0    !
                           !    !    +-------------------------------------------------Ef1a Enthi
                      +-65.0    !
                      !    !    +------------------------------------------------------Ef1a Plafk
                      !    !
                 +-65.0    +-----------------------------------------------------------Ef1a Tetpy
                 !    !
                 !    !                                                      +---------Ef1a Euggr
                 !    +---------------------------------------------------71.0
            +-43.0                                                           !    +----Ef1a Arath
            !    !                                                           +100.0 
            !    !                                                                +----Ef1a Wheat
       +-61.0    !
       !    !    +---------------------------------------------------------------------Ef1a Dicdi
       !    !
  +-82.0    +--------------------------------------------------------------------------Ef1a Yeast
  !    !
  !    !                                                                          +----Ef11 Human
  !    !                                                                     +100.0
  !    +------------------------------------------------------------------48.0    +----Ef10 Xenla
  !                                                                          !
  !                                                                          +---------Ef11 Drome
  !
  +------------------------------------------------------------------------------------Ef1a Oncvo

  remember: this is an unrooted tree!

Notice several nodes are very well resolved with some having bootstrap values as high as 100%. This is a great indication of the reliability of the phylogenetic inference of those nodes based on the prepared dataset. However, several others, in particular those that have had trouble with other techniques, all reflect very low confidence values with numbers ranging from about 50% on down. In general bootstrap values much above about 50% tend to underrepresent the actual confidence level of the branching pattern, whereas numbers below 50% overrepresent confidence levels.

The output tree file follows. Notice that the branch lengths are now representing bootstrap values, not evolutionary distances. We will plot this data later on.

% more seqboot_neighbor.consense_tree

(((((((((((((((((Eftu_Anani:100.0,Eftu_Ecoli:100.0):44.0,Eftu_Chltr:100.0):4.0,
Eftu_Mycga:100.0):36.0,Eftu_Myctu:100.0):56.0,Eftu_Theaq:100.0):74.0,Eftu_Thma:
100.0):100.0,
(((Eftu_Metva:100.0,Eftu_Halma:100.0):68.0,Ef1a_Theac:100.0):41.0,Ef1a_Pyrwo100
.0):33.0):51.0,
Ef1a_Sulso:100.0):100.0,Ef1a_Giala:100.0):100.0,Ef1a_Enthi:100.0):44.0,Ef1a_laf
k:100.0):70.0,
Ef1a_Tetpy:100.0):65.0,(Ef1a_Euggr:100.0,(Ef1a_Arath:100.0,Ef1a_Wheat:100.0)100
.0):71.0):65.0,
Ef1a_Dicdi:100.0):43.0,Ef1a_Yeast:100.0):61.0,((Ef11_Human:100.0,Ef10_Xenla:00.
0):100.0,
Ef11_Drome:100.0):48.0):82.0,Ef1a_Oncvo:100.0);

Just as in the GCG implementation of neighbor-joining with Kimura distances, the nematode sequence is found to lie furthest out on this phylogeny. And the bootstrapping even gives it a 100% confidence value. At least it is grouped with the other animals. This dramatically drives home the point that just because a program can be run, it most suredly does not guarantee it to find the correct solution! Nematodes are not the ancestor to all of life! Distances by Dayhoff's PAM correction model as implemented in PHYLIP would undoubtedly give a more satisfying solution, as did the previous Fitch tree even though we used Kimura distances in its solution.

6.b. Parsimony methods: PHYLIP's ProtPars.

Finally I will show how to run one of PHYLIP's parsimony programs, protpars. Parsimony analysis directly uses the aligned sequence data matrix as input. The program explores all 'tree space' in an attempt to answer the question: "Which trees lead to the observed alignment with the least number of symbol changes?" ProtPars uses an approximate heuristic, tree-rearrangement solution, although a much slower branch-and-bound, exact solution program, DNAPenny, is also available in PHYLIP. Begin the program using the screen trace below as an example; again take advantage of the jumble options. I'll illustrate with only one jumble, but in actual research you should probably jumble a minimum of ten times. Another option available in parsimony is the reconstruction of ancestral node sequences; choose 5 to designate this, if it interests you. My analysis took a while to run; your datasets should go much faster.

% protpars

protpars:  can't read infile
Please enter a new filename>ef.phy

Protein parsimony algorithm, version 3.53c

Setting for this run:
  U                 Search for best tree?  Yes
  J   Randomize input order of sequences?  No. Use input order
  O                        Outgroup root?  No, use as outgroup species  1
  T              Use Threshold parsimony?  No, use ordinary parsimony
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4          Print out steps in each site  No
  5  Print sequences at all nodes of tree  No
  6       Write out trees onto tree file?  Yes

Are these settings correct? (type Y or the letter for one to change)
j
Random number seed (must be odd)?
753
Number of times to jumble?
1

Protein parsimony algorithm, version 3.53c

Setting for this run:
  U                 Search for best tree?  Yes
  J   Randomize input order of sequences?  Yes (seed =     753,  1 times)
  O                        Outgroup root?  No, use as outgroup species  1
  T              Use Threshold parsimony?  No, use ordinary parsimony
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4          Print out steps in each site  No
  5  Print sequences at all nodes of tree  No
  6       Write out trees onto tree file?  Yes

Are these settings correct? (type Y or the letter for one to change)
y

Adding species:
   Eftu_Thema
   Ef1a_Wheat
   Ef1a_Yeast
   Ef1a_Theac
   Eftu_Theaq
   Ef1a_Arath
   Ef1a_Euggr
   Eftu_Mycga
   Ef11_Human
   Ef1a_Plafk
   Ef11_Drome
   Ef1a_Oncvo
   Eftu_Chltr
   Ef1a_Dicdi
   Eftu_Metva
   Ef1a_Sulso
   Eftu_Ecoli
   Eftu_Halma
   Ef1a_Giala
   Ef1a_Pyrwo
   Eftu_Myctu
   Eftu_Anani
   Ef10_Xenla
   Ef1a_Enthi
   Ef1a_Tetpy

Doing global rearrangements
  !-------------------------------------------------!
   .................................................
   .................................................

Output written to output file

Trees also written onto file

Rename the two output files:

% mv outfile ef.protpars

% mv treefile ef_protpars.tree

PHYLIP's parsimony programs do not calculate branch lengths. Dr. Felsenstein does not feel that parsimony does an adequate job of reconstructing branch lengths and so he does not incorporate it into his implementations. The output text, with annoyingly wrapped lines, follows below:

% more ef.protpars

Protein parsimony algorithm, version 3.53c

   2 trees in all found

         +-----------------------------------------------------------------Eftu_Mycga
         !
         !  +--------------------------------------------------------------Eftu_Thema
         !  !
         !  !  +-----------------------------------------------------------Eftu_Theaq
      +--4  !  !
      !  !  !  !        +--------------------------------------------------Ef1a_Pyrwo
      !  !  !  !        !
      !  !  !  !     +-20                                               +--Eftu_Metva
      !  !  !  !     !  !  +-------------------------------------------23
      !  +--5  !     !  !  !                                            +--Eftu_Halma
      !     !  !     !  +-22
      !     !  !     !     !     +-----------------------------------------Ef1a_Theac
      !     !  !     !     !     !
      !     !  !     !     +----21  +--------------------------------------Ef1a_Sulso
      !     !  !     !           !  !
      !     !  !     !           +-24  +-----------------------------------Ef1a_Giala
      !     !  !     !              !  !
      !     !  !     !              +-19  +--------------------------------Ef1a_Plafk
      !     +--3     !                 !  !
      !        !     !                 +-18  +-----------------------------Ef1a_Enthi
      !        !     !                    !  !
      !        !     !                    !  !  +--------------------------Ef1a_Tetpy
      !        !     !                    +-15  !
      !        !     !                       !  !              +-----------Ef1a_Yeast
   +--2        !  +--7                       !  !  +----------14
   !  !        !  !  !                       +-16  !           !  +--------Ef11_Drome
   !  !        !  !  !                          !  !           +-12
   !  !        !  !  !                          !  !              !  +-----Ef1a_Oncvo 
   !  !        !  !  !                          !  !              +-13
   !  !        !  !  !                          +-10                 !  +--Ef11_Human 
   !  !        !  !  !                             !                 +-11
   !  !        !  !  !                             !                    +--Ef10_Xenla 
   !  !        !  !  !                             !
   !  !        +--6  !                             !              +--------Ef1a_Dicdi
   !  !           !  !                             +-------------17
 --1  !           !  !                                            !  +-----Ef1a_Euggr
   !  !           !  !                                            +--9
   !  !           !  !                                               !  +--Ef1a_Wheat 
   !  !           !  !                                               +--8
   !  !           !  !                                                  +--Ef1a_Arath 
   !  !           !  !
   !  !           !  +-----------------------------------------------------Eftu_Chltr
   !  !           !
   !  !           +--------------------------------------------------------Eftu_Anani
   !  !
   !  +--------------------------------------------------------------------Eftu_Myctu
   ! 
   +-----------------------------------------------------------------------Eftu_Ecoli

  remember: this is an unrooted tree!

requires a total of   3148.000


        +-----------------------------------------------------------------Eftu_Mycga
        !
        !  +--------------------------------------------------------------Eftu_Thema
        !  !
        !  !  +-----------------------------------------------------------Eftu_Theaq
     +--4  !  !
     !  !  !  !        +--------------------------------------------------Ef1a_Pyrwo
     !  !  !  !        !
     !  !  !  !     +-20                                               +--Eftu_Metva
     !  !  !  !     !  !  +-------------------------------------------23
     !  +--5  !     !  !  !                                            +--Eftu_Halma 
     !     !  !     !  +-22
     !     !  !     !     !     +-----------------------------------------Ef1a_Theac
     !     !  !     !     !     !
     !     !  !     !     +----21  +--------------------------------------Ef1a_Sulso
     !     !  !     !           !  !
     !     !  !     !           +-24   +-----------------------------------Ef1a_Giala
     !     !  !     !              !  !
     !     !  !     !              +-19  +--------------------------------Ef1a_Plafk
     !     +--3     !                 !  !
     !        !     !                 +-18  +-----------------------------Ef1a_Enthi
     !        !     !                    !  !
     !        !     !                    !  !  +--------------------------Ef1a_Tetpy
     !        !     !                    +-15  !
     !        !     !                       !  !              +-----------Ef1a_Oncvo
  +--2        !  +--7                       !  !              !
  !  !        !  !  !                       +-16  +----------13        +--Ef1a_Yeast
  !  !        !  !  !                          !  !           !  +----14
  !  !        !  !  !                          !  !           !  !     +--Ef11_Drome
  !  !        !  !  !                          !  !           +-12
  !  !        !  !  !                          +-10              !     +--Ef11_Human 
  !  !        !  !  !                             !              +----11
  !  !        !  !  !                             !                    +--Ef10_Xenla
  !  !        !  !  !                             !
  !  !        +--6  !                             !              +--------Ef1a_Dicdi
  !  !           !  !                             +-------------17
--1  !           !  !                                            !  +-----Ef1a_Euggr
  !  !           !  !                                            +--9
  !  !           !  !                                               !  +--Ef1a_Wheat
  !  !           !  !                                               +--8
  !  !           !  !                                                  +--Ef1a_Arath
  !  !           !  !
  !  !           !  +-----------------------------------------------------Eftu_Chltr
  !  !           !
  !  !           +--------------------------------------------------------Eftu_Anani
  !  !
  !  +--------------------------------------------------------------------Eftu_Myctu
  !
  +-----------------------------------------------------------------------Eftu_Ecoli

  remember: this is an unrooted tree!

requires a total of   3148.000

The output tree file is quite simple. We will use DrawGram to plot the Consense tree graphic of these trees later on. The tree file from this run of ProtPars is shown below:

% type ef_protpars.tree

(((Eftu_Mycga,(Eftu_Thema,(Eftu_Theaq,(((Ef1a_Pyrwo,((Eftu_Metva,Eftu_Halma)
(Ef1a_Theac,(Ef1a_Sulso,(Ef1a_Giala,(Ef1a_Plafk,(Ef1a_Enthi,(Ef1a_Tetpy,
((Ef1a_Yeast,(Ef11_Drome,(Ef1a_Oncvo,(Ef11_Human,Ef10_Xenla)))),(Ef1a_Dicdi,
(Ef1a_Euggr,(Ef1a_Wheat,Ef1a_Arath)))))))))))),Eftu_Chltr),Eftu_Anani)))),
Eftu_Myctu),Eftu_Ecoli)[0.5000];
(((Eftu_Mycga,(Eftu_Thema,(Eftu_Theaq,(((Ef1a_Pyrwo,((Eftu_Metva,Eftu_Halma)
(Ef1a_Theac,(Ef1a_Sulso,(Ef1a_Giala,(Ef1a_Plafk,(Ef1a_Enthi,(Ef1a_Tetpy,
((Ef1a_Oncvo,((Ef1a_Yeast,Ef11_Drome),(Ef11_Human,Ef10_Xenla))),(Ef1a_Dicdi,
(Ef1a_Euggr,(Ef1a_Wheat,Ef1a_Arath)))))))))))),Eftu_Chltr),Eftu_Anani)))),
Eftu_Myctu),Eftu_Ecoli)[0.5000];

As seen here, parsimony analysis may find more than one most parsimonious tree. The PHYLIP system does not rank these trees; they are all equally the most parsimonious. The output is a list of all possible best trees, not an ordered ranking of the most likely tree. Bootstrapping can help you get a feel for this. The drawing programs in the PHYLIP system will only use the first tree given in such a file. If you really want to draw graphics of all the other trees, you will need to copy the file and create a series of tree files with editing so that you can produce output plots of each possible tree that was found.

In all cases though, Consense can be run to generate the consensus tree from a tree file with multiple trees. Use PHYLIP's Consense program on your ProtPars output, if more than one tree was found. I showed you a Consense run earlier when I illustrated bootstrapping; therefore, I will not repeat the screen trace and expect you to do it on your own. The Newick tree that results from the Consense run of the two ProtPars trees that I found follows below:

((((((((((((((((((Ef1a_Wheat:1.0,Ef1a_Arath:1.0):1.0,Ef1a_Euggr:1.0):1.0,
Ef1a_Dicdi:1.0):1.0,(((Ef11_Human:1.0,Ef10_Xenla:1.0):1.0,(Ef11_Drome:1.0,
Ef1a_Yeast:1.0):0.5):0.5,Ef1a_Oncvo:1.0):1.0):1.0,Ef1a_Tetpy:1.0):1.0,Ef1a_Ethi
:1.0):1.0,
Ef1a_Plafk:1.0):1.0,Ef1a_Giala:1.0):1.0,Ef1a_Sulso:1.0):1.0,Ef1a_Theac:1.0):.0,
(Eftu_Metva:1.0,Eftu_Halma:1.0):1.0):1.0,Ef1a_Pyrwo:1.0):1.0,Eftu_Chltr:1.0)1.0
,
Eftu_Anani:1.0):1.0,Eftu_Theaq:1.0):1.0,Eftu_Thema:1.0):1.0,(Eftu_Ecoli:1.0,
Eftu_Myctu:1.0):1.0):1.0,Eftu_Mycga:1.0);

Notice that most all of the branch lengths are now one. This means that the same branching pattern appeared in both input trees. Only two nodes have discrepancies and both are probably caused by that pesky nematode sequence. I would be tempted to exclude that sequence if this was an actual research setting because it consistently causes problems in the programs due to its extreme divergence. This is actually a pretty common phenomena in parasites. They are often very highly evolved, existing as they do, in a very unique culture. A plot will be made of this tree in the next section.

Lineages that evolve much faster than others tend to confound many of the algorithms, This is known as "the long branch effect" and leads to the dreaded "Felsenstein Zone" where algorithms are guaranteed to always find the wrong answer! Other common problems in phylogenetic inference include sequencing and/or initial alignment mistakes, the occurence of active lateral transfer of genes between the groups being examined, and the fact that some of the organisms may have improper traditional taxonomies leading to a discrepancy between your inferred phylogeny and the normally accepted one.

An algorithm that I am not going to introduce, but that I heartily recommend is maximum likelihood. Maximum likelihood is a probabilistic statistical procedure that synthesizes the best features of distance and parsimony methods. Unfortunately it is very slow. Data sets the size that we are dealing with could takes days to process so I am not going to teach it. The PHYLIP maximum likelihood program is DNAML; ProtML is supplied as an adjunct to the package by J. Adachi and M. Hasegawa. These algorithms are undoubtedly among the most powerful available for ascertaining phylogenies from sequence data, however, they are also computationally the most involved. They can take a prohibitive amount of time to run depending on the size of your dataset. One somewhat quicker utilization of maximum likelihood statistics is to test a user defined tree. Within PHYLIP many programs will accept a user defined tree for evaluation and branch length calculation by appending your tree data to the input data. See the documentation files for the particular details of achieving this in different programs. The protein implementation, ProtML is particularly challenging because it is not a part of the regular release of PHYLIP; rather it is from different authors and is included in PHYLIP as an unsupported program, not officially supported and only provided as a supplement to the package. It has many pecularities and the user defined tree included in your data file must be unrooted versus rooted. ReTree can be used to switch these styles back and forth. If you have need to use this program, contact Steve Thompson for assistance. The DNA implementation, DNAML, is likewise slow, but it is a standard PHYLIP program and so will run in the normal manner.

7) Plotting phylogenetic trees with PHYLIP.

With your generated Newick tree files in hand, it is time to create plots of the results. Two PHYLIP programs that draw plots of Newick style tree file data are installed on ribozyme, DrawTree and DrawGram. DrawTree produces an unrooted `network' representation; DrawGram produces several varieties of rooted style `grams. In both cases branch lengths will be proportional to the evolutionary distance separating the entries if it is available or the level of consensus if it is a Consense tree. Both programs require that the user provide information designating the input file and the font set to be used; the output file generated will be named plotfile. The PHYLIP system has only five fonts to work with. These have the logical names font1 through font5. Since the package is meant to be graphics device independent, it doesn't use text characters in plots; rather it gives detailed drawing instructions for each character used. This makes changing the names on the generated plots very challenging, so the names should be changed, if desired, at earlier points in the work. If you haven't gotten the names in your initial data matrix the way that you want them to end up , then you can either change them with a text editor in the Newick tree file or use PHYLIP's ReTree program to do it interactively.

DrawTree and DrawGram operate in a similar fashion. They can typically be run in two different manners -- one to directly create a PostScript file of the plot and the other to create a graphics file that can be visualized on the screen using Tektronix emulation.

7.a. On screen visualization of PHYLIP graphics.

I'll illustrate the onscreen visualization first. To do this in an environment typical of many VADMS users, that of a Tektronix color 4105 or 4107 emulator, we can take advantage of PHYLIP's Hewlett-Packard graphics language output option as an intermediate and the Homegrown software phylip_plot. When either drawing program, DrawTree or DrawGram, is launched it requests an input file and a font file and then it prompts for an output device. To use phylip_plot designate the Hewlett-Packard graphics language by choosing H. Previewing the results is usually not a good idea as most of the designated preview devices are inappropriate, so respond with N to the next question posed. A number of options are then listed governing the actual plot itself. There are a slew of options in these programs. Begin by trying the defaults, if they don't give you what you want, repeat the process and select different ones. I suggest running them many, many times to get what you prefer.

Follow the sample screen trace below using the results from our first Fitch tree of the EF 1-Tu representative data set to get a feel for running drawtree in this fashion. Use your own data in your run through the program. Notice that you need to specify the name of the input tree and then of the font file. Be sure to specify h for Hewlett-Packard plotter language and do not preview by choosing n when asked. When using DrawTree, I usually switch option 2, "Angel of Labels," in the settings menu to a for "Along;" there's less chance of them overwriting one another that way.

% drawtree fitch_tree.hpgl

DRAWTREE from PHYLIP version 3.53c

drawtree:  can't read infile
Please enter a new filename>ef_kim_protdist.fitch_tree
Reading tree ...
Tree has been read.
drawtree:  can't read fontfile
Please enter a new filename>font2
Loading the font ...
Font loaded.

Which plotter or printer will the tree be drawn on?
(many other brands or models are compatible with these)

   type:       to choose one compatible with:

        L         Apple Laserwriter (with Postscript)
        M         MacDraw PICT format
        R         Rayshade 3D rendering program file
        J         Hewlett-Packard Laserjet
        K         TeKtronix 4010 graphics terminal
        H         Hewlett-Packard 7470 plotter
        D         DEC ReGIS graphics (VT240 terminal)
        B         Houston Instruments plotter
        E         Epson MX-80 dot-matrix printer
        C         Prowriter-Imagewriter dot-matrix printer
        O         Okidata dot-matrix printer
        T         Toshiba 24-pin dot-matrix printer
        P         PC Paintbrush monochrome PCX file format
        X         X Bitmap format
        F         FIG 2.0 format
        U         other: one you have inserted code for
 Choose one:
h

Which type of screen will it be previewed on?

   type:       to choose one compatible with:

        N         will not be previewed
        K         TeKtronix 4010 graphics terminal
        D         DEC ReGIS graphics (VT240 terminal)
        U         other: one you have inserted code for
 Choose one:
n

Here are the settings:

 (1)        Use branch lengths:  Yes
 (2)           Angle of labels:  Fixed angle of  0.0 degrees
 (3)          Rotation of tree:  90.0
 (4)     Angle of arc for tree:  360.0
 (5)   Iterate to improve tree:  Yes
 (6)    Scale of branch length:  Automatically rescaled
 (7)        Horizontal margins:  1.92 cm
 (7)          Vertical margins:  1.44 cm
 (8) Relative character height:  0.3333
 (9)       Enthusiasm constant:  0.04000


 Do you want to accept these? (Yes or No)
 Type Y or N or the number (1- 9) of the one to change:
2

Do you want labels to be Fixed angle, Radial, or Along branches?
 Type F, R, or A
a

Here are the settings:

 (1)        Use branch lengths:  Yes
 (2)           Angle of labels:  Along branches
 (3)          Rotation of tree:  90.0
 (4)     Angle of arc for tree:  360.0
 (5)   Iterate to improve tree:  Yes
 (6)    Scale of branch length:  Automatically rescaled
 (7)        Horizontal margins:  1.92 cm
 (7)          Vertical margins:  1.44 cm
 (8) Relative character height:  0.3333
 (9)       Enthusiasm constant:  0.04000


 Do you want to accept these? (Yes or No)
 Type Y or N or the number (1- 9) of the one to change:
y
Writing plot file ...
Finished.
End of run.

The result of this process is a text file containing the plotting instructions for the desired plot in Hewlett-Packard graphics language, HPGL, named plotfile. Rename it before you proceed:

% mv plotfile fitch.tree_ps

You can take a look at it if you want with the more command. Some comments on the distinctive format and contents of this file should be mentioned regardless. The more complex the plot is, the longer the generated plotting instruction file will be. The PHYLIP draw programs display the structure of the plot in black (white on the black background of standard Tek 4105-4107 graphics terminals) and the names of the sequences being worked with in green. HPGL commands for pen color are SPx, where SP is the pen color command and the x denotes the color to be used, 1 is for black and 2 for green. Since the instructions were written for a plotter, there are mainly PU for pen up and PD for pen down instructions in this file. The PU commands are similar in function to GCG Figure program move commands in that they position the pen over a spot on the page. PD commands put the pen down on the page and draw a line to the location given in the instruction. The command IN starts a plot and SP ends one.

The HPGL files generated from these programs can then be plotted to your screen with the locally developed piece of software, phylip_plot. Since this software was written with GCG graphics subroutines, GCG needs to be activated and the graphics device type set in order to run the program. This program is not a regular part of the PHYLIP package. Follow the screen trace below to run phylip_plot:

% phylip_plot

PHYLIP plotting program

Please enter the filename.ext fitch_tree.hpgl

 When your TEK4107 attached to tty is ready, press <rtn>. <rtn>

The results are a bit disappointing on screen. However, because it was written with GCG subroutines, phylip_plot responds to all the command switches that are possible in all GCG graphics programs. This enables you to change the size of the resulting output to meet your needs. The data set can be expanded, moved around on the page and turned 90 degrees from the original orientation to fit the page better. If your plot is higher than it is wide, use -portrait to rotate it to fit better on the long direction of the screen. Use -scale= to control the actual size of the plot on the page. -xpan and -ypan move the plot around on the page. A negative value in xpan moves the picture to the left and a negative value in ypan moves the picture down on the page. When you use the -portrait switch this exchanges the two pan functions and really confuses the user. A listing of several of these command switches is given in Appendix B. The best way to get a good picture is just to try various things until you like what you see. An example of a phylip_plot command line with some GCG options is given below:

% phylip_plot -portrait -scale=1.70 -xpan=45.0 -ypan=25.0

If playing this game, you should play around with various values until you find the best for your graphic. Once you've determined the optimal, 'magical' settings, it is best to write them down somewhere as they will be lost if you proceed with the next step.

Writing the output to a Figure file instead of directly to a plot with the -figure= option allows you to go into the file with a text editor to modify the information in the plot that is incorrect or needs to be expanded for publication purposes. Figure legends and other types of annotation can even be incorporated into the plot. If you need to engage in this process contact Susan Johns or Steve Thompson for assistance.

After modifying the Figure file the way you want it, the results of your efforts can be seen on the screen using the Figure program. Use the same magic conditions derived earlier when running the phylip_plot program to get the desired output, because the production of the Figure file saved your data in its original size and orientation. A Figure file picture generated this way is the same as that of the original data you started with prior to all the adjustments. An example of a Figure command line that corresponds to the above phylip_plot command line is given below:

% figure -portrait -scale=1.70 -xpan=45.0 -ypan=25.0

When satified with your results, you might want to use the GCG graphics switch -nomark to suppress the time stamp that the Figure program puts on its plots to produce another better version of your plot. Use the -speed= switch to slow down the plotting procedure if you are using an HP plotter as your graphics device to improve resolution.

7.b. Publication quality PHYLIP PostScript graphics.

You are welcome to play with phylip_plot as much as you like to visualize the trees produced in this week's exercise on your terminal screen. However, to produce publication quality output I usually let PHYLIP directly produce PostScript plots. PHYLIP's internal PostScript driver does a very good job of sizing the tree to the page and producing a legible final graphic. To produce a PostScript graphic representation of the tree launch either DrawTree or DrawGram and specify that you want to use "Apple Laserwriter (with Postscript)" as your graphics device. Again, do not preview the graphic. Just as before, you need to specify the input tree file, a Hershey font logical name. However, because we're directly using PostScript we can change to a non-bitmapped, PostScript supported, font once within the program. Be sure that the PostScript printer that you are using recognizes the font that you specify.

Follow the next screen trace to see how to produce PostScript trees directly out of PHYLIP. For this example, I will draw a phenogram of the second tree that we estimated in this week's exercise, the bootstrapped neighbor-joining tree. I'll illustrate with some of my favorite option combinations. Check out all the options that I change to get just the effect that I prefer.

% drawgram

DRAWGRAM from PHYLIP version 3.53c

drawgram:  can't read infile
Please enter a new filename>seqboot_neighbor.consense_tree
Reading tree ...
Tree has been read.
drawgram:  can't read fontfile
Please enter a new filename>font2
Loading the font ...
Font loaded.

Which plotter or printer will the tree be drawn on?
(many other brands or models are compatible with these)

   type:       to choose one compatible with:

        L         Apple Laserwriter (with Postscript)
        M         MacDraw PICT format
        R         Rayshade 3D rendering program file
        J         Hewlett-Packard Laserjet
        K         TeKtronix 4010 graphics terminal
        H         Hewlett-Packard 7470 plotter
        D         DEC ReGIS graphics (VT240 terminal)
        B         Houston Instruments plotter
        E         Epson MX-80 dot-matrix printer
        C         Prowriter-Imagewriter dot-matrix printer
        O         Okidata dot-matrix printer
        T         Toshiba 24-pin dot-matrix printer
        P         PC Paintbrush monochrome PCX file format
        X         X Bitmap format
        F         FIG 2.0 format
        U         other: one you have inserted code for
 Choose one:
l

Which type of screen will it be previewed on?

   type:       to choose one compatible with:

        N         will not be previewed
        K         TeKtronix 4010 graphics terminal
        D         DEC ReGIS graphics (VT240 terminal)
        U         other: one you have inserted code for
 Choose one:
n

Here are the settings:

 (1)               Tree grows:  Vertically
 (2)            Style of tree:  Cladogram
 (3)       Use branch lengths:  Yes
 (4)          Angle of labels:  45.0
 (5)       Horizontal margins:  1.73 cm
 (5)         Vertical margins:  2.24 cm
 (6)   Scale of branch length:  Automatically rescaled
 (7)    Depth-Breadth of tree:  0.53
 (8)   Stem-length-tree-depth:  0.05
 (9) Character ht - tip space:  0.3333
(10)          Ancestral nodes:  Intermediate
(11)          Font           :  Hershey

 Do you want to accept these? (Yes or No)
 Type Y or N or the number (1-11) of the one to change:
1

Here are the settings:

 (1)               Tree grows:  Horizontally
 (2)            Style of tree:  Cladogram
 (3)       Use branch lengths:  Yes
 (4)          Angle of labels:  45.0
 (5)       Horizontal margins:  1.73 cm
 (5)         Vertical margins:  2.24 cm
 (6)   Scale of branch length:  Automatically rescaled
 (7)    Depth-Breadth of tree:  0.53
 (8)   Stem-length-tree-depth:  0.05
 (9) Character ht - tip space:  0.3333
(10)          Ancestral nodes:  Intermediate
(11)          Font           :  Hershey

 Do you want to accept these? (Yes or No)
 Type Y or N or the number (1-11) of the one to change:
2

What style tree is this to be:
   Cladogram, Phenogram, curVogram, Eurogram,  or Swoopogram
 (C, P, V, E, or S)
 Choose one:
p

Here are the settings:

 (1)               Tree grows:  Horizontally
 (2)            Style of tree:  Phenogram
 (3)       Use branch lengths:  Yes
 (4)          Angle of labels:  45.0
 (5)       Horizontal margins:  1.73 cm
 (5)         Vertical margins:  2.24 cm
 (6)   Scale of branch length:  Automatically rescaled
 (7)    Depth-Breadth of tree:  0.53
 (8)   Stem-length-tree-depth:  0.05
 (9) Character ht - tip space:  0.3333
(10)          Ancestral nodes:  Intermediate
(11)          Font           :  Hershey

 Do you want to accept these? (Yes or No)
 Type Y or N or the number (1-11) of the one to change:
4

(Considering the tree as if it "grew" vertically:)
Are the labels to be plotted vertically (90),
 horizontally (0), or at a 45-degree angle?
 Choose an angle in degrees from 90 to 0:
90

Here are the settings:

 (1)               Tree grows:  Horizontally
 (2)            Style of tree:  Phenogram
 (3)       Use branch lengths:  Yes
 (4)          Angle of labels:  90.0
 (5)       Horizontal margins:  1.73 cm
 (5)         Vertical margins:  2.24 cm
 (6)   Scale of branch length:  Automatically rescaled
 (7)    Depth-Breadth of tree:  0.53
 (8)   Stem-length-tree-depth:  0.05
 (9) Character ht - tip space:  0.3333
(10)          Ancestral nodes:  Intermediate
(11)          Font           :  Hershey

 Do you want to accept these? (Yes or No)
 Type Y or N or the number (1-11) of the one to change:
8
New value of stem length as fraction of tree depth?
0

Here are the settings:

 (1)               Tree grows:  Horizontally
 (2)            Style of tree:  Phenogram
 (3)       Use branch lengths:  Yes
 (4)          Angle of labels:  90.0
 (5)       Horizontal margins:  1.73 cm
 (5)         Vertical margins:  2.24 cm
 (6)   Scale of branch length:  Automatically rescaled
 (7)    Depth-Breadth of tree:  0.53
 (8)   Stem-length-tree-depth:  0.00
 (9) Character ht - tip space:  0.3333
(10)          Ancestral nodes:  Intermediate
(11)          Font           :  Hershey

 Do you want to accept these? (Yes or No)
 Type Y or N or the number (1-11) of the one to change:
11
Enter font name or "Hershey" for the default font
Helvetica-Bold

Here are the settings:

 (1)               Tree grows:  Horizontally
 (2)            Style of tree:  Phenogram
 (3)       Use branch lengths:  Yes
 (4)          Angle of labels:  90.0
 (5)       Horizontal margins:  1.73 cm
 (5)         Vertical margins:  2.24 cm
 (6)   Scale of branch length:  Automatically rescaled
 (7)    Depth-Breadth of tree:  0.53
 (8)   Stem-length-tree-depth:  0.00
 (9) Character ht - tip space:  0.3333
(10)          Ancestral nodes:  Intermediate
(11)          Font           :  Helvetica-Bold

 Do you want to accept these? (Yes or No)
 Type Y or N or the number (1-11) of the one to change:
y
Writing plot file ...
Finished.
End of run. 

Rename the output plot:

% mv plotfile seqboot_neighbor.tree_ps

Some explanation is required. I first flip the graphic on its side by specifying that I want it horizontal. Then I designate the phenogram representation with 90 degree labels to align the labels with the graphic. I change option 8 "Stem-length-tree-depth" to zero; this suppresses the drawing of an implied root on the tree. Finally I specify a PostScript font; notice that this specification is case sensitive and must match a font that your printer has installed.

The output PostScript file, seqboot_neighbor.tree_ps in this case, can now be sent to a printer that interprets PostScript. This may involve `ftping' the file to a Mac or PC network for printing. If you have access to the VADMS Teaching Lab, then you can use the lpr command to print the PostScript there. However, our printer there does not like to print PHYLIP PostScript in its normal print que. If sending PHYLIP PostScript to our HP4MV printer in the teaching lab, use the following variation of the lpr command:

% lpr -Plps filename.ps

A copy of this phenogram bootstrapped neighbor-joining tree graphic, as well as a cladogram of the consensus parsimony tree, and the Fitch DrawTree PostScript output, are all attached to the end of this exercise.

Finally I want you to prepare a consensus tree from the three previous methods' trees. The three separate tree files need to be merged into one to do this. You can either use the UNIX cat command or a text editor to do this. The following command line illustrates the use of cat to combine three files into a new combined output file:

% cat file_one file_two file_three > new_combined_file

After combining the files, run Consense on the combined file. The final consensus tree will show how well the three methods -- Fitch least-squares distance fit, bootstrapped distance neighbor-joining, and parsimony -- agree with each other. As a grand finale, take this tree and load it into the powerful PHYLIP tree manipulation tool ReTree. This program enables one to change the names, root placement, appearance, branch lengths, and topology of trees. The appearance of a tree can be drastically changed without changing its topology by flipping and rerooting. Naturally, if you actually change the branch lengths or topology of your tree, you are no longer accepting the conclusion of the inference software. I will use retree to change the names of the sequences to reflect the names of organisms and to change the appearance of the tree to more closely match our preconceived idea of what it should look like by designating a different node as the outgroup. This will not change the topology of the tree at all. A much abridged ReTree screen trace of my session follows below:

% retree

retree:  can't read infile
Please enter a new filename>final_consense.tree

Tree Rearrangement, version 3.53c

Settings for this run:
  U      Initial tree (arbitrary, user, specify)?  User tree from tree file
  N      Use the Nexus format to write out trees?  No
  0           Graphics type (IBM PC, VT52, ANSI)?  ANSI
  W   Width of terminal screen, of plotting area?  80, 80
  L                    Number of lines on screen?  24

Are these settings correct? (type Y or the letter for one to change)
y

Reading tree file ...
                                                     ****1:Ef1a Arath
                                                  **42
                                              ***41  ****2:Ef1a Wheat
                                              *   *
                                              *   ***3:Ef1a Euggr
                                              *
                                              *              ****4:Ef11 Human
                                              *           **47
                                           **40        **46  ****5:Ef10 Xenla
                                           *  *        *  *
                                           *  *     **45  ***6:Ef11 Drome
                                           *  *     *  *
                                           *  *  **44  ****7:Ef1a Oncvo
                                       ***39  *  *  *
                                       *   *  **43  ***8:Ef1a Yeast
                                       *   *     *
                                    **38   *     ***9:Ef1a Dicdi
                                    *  *   *
                                    *  *   **10:Ef1a Tetpy
                                ***37  *
                                *   *  ***11:Ef1a Enthi
** 29 lines below screen **

NEXT? (Options: R . = U W O M T F B N H J K L C + ? X Q) (? for Help) ?

R Rearrange a tree by moving a node or group
. Redisplay the same tree again
= Redisplay the same tree without lengths
U Undo the most recent rearrangement
W Write tree to a file
O select an Outgroup for the tree
M Midpoint root the tree
T Transpose immediate branches at a node
F Flip (rotate) subtree at a node
B Change or specify the length of a branch
N Change or specify the name(s) of tip(s)
H Move viewing window to the left
J Move viewing window downward
K Move viewing window upward
L Move viewing window to the right
C show only one Clade (subtree) (might be useful if tree is too big)
+ Read next tree from file
? Help (this screen)
Q (Quit) Exit from program
X Exit from program


TO CONTINUE, PRESS ON THE Return OR Enter KEY
                                                     ****1:Ef1a Arath
                                                  **42
                                              ***41  ****2:Ef1a Wheat
                                              *   *
                                              *   ***3:Ef1a Euggr
                                              *
                                              *              ****4:Ef11 Human
                                              *           **47
                                           **40        **46  ****5:Ef10 Xenla
                                           *  *        *  *
                                           *  *     **45  ***6:Ef11 Drome
                                           *  *     *  *
                                           *  *  **44  ****7:Ef1a Oncvo
                                       ***39  *  *  *
                                       *   *  **43  ***8:Ef1a Yeast
                                       *   *     *
                                    **38   *     ***9:Ef1a Dicdi
                                    *  *   *
                                    *  *   **10:Ef1a Tetpy
                                ***37  *
                                *   *  ***11:Ef1a Enthi
** 29 lines below screen **

NEXT? (Options: R . = U W O M T F B N H J K L C + ? X Q) (? for Help) n
Specify name of which tip? (enter its number or 0 to quit): 1
The current name of tip 1 is "Ef1a Arath"
Enter new tip name: Arabidopsis

                                                      ***1:Arabidopsis
                                                  ***42
                                               **41   ***2:Ef1a Wheat
                                               *  *
                                               *  ****3:Ef1a Euggr
                                               *
                                               *                ***4:Ef11 Human
                                               *            ***47
                                            **40         **46   ***5:Ef10 Xenla
                                            *  *         *  *
                                            *  *      **45  ****6:Ef11 Drome
                                            *  *      *  *
                                            *  *  ***44  ***7:Ef1a Oncvo
                                        ***39  *  *   *
                                        *   *  **43   ***8:Ef1a Yeast
                                        *   *     *
                                     **38   *     ****9:Ef1a Dicdi
                                     *  *   *
                                     *  *   **10:Ef1a Tetpy
                                  **37  *
                                  *  *  ***11:Ef1a Enthi
** 29 lines below screen **
Specify name of which tip? (enter its number or 0 to quit): n
Specify name of which tip? (enter its number or 0 to quit): 2
The current name of tip 2 is "Ef1a Wheat"
Enter new tip name: Wheat

////////////////////////////////////////////////////////////////////////////

TO CONTINUE, PRESS ON THE Return OR Enter KEY
** 24 lines above screen **
                           *  *   **13:Giardia
                           *  *
                        **34  ***14:Sulfolobus
                        *  *
                        *  *  ***15:Thermoplasma
                        *  **48
                    ***33     *   **16:Methanococcus
                    *   *     ***49
                    *   *         **17:Haloarcula
                 **32   *
                 *  *   **18:Pyrococcus
              **31  *
              *  *  ***19:Thermotoga
           **30  *
           *  *  **20:Thermus
        **29  *
        *  *  **21:Mycobacterium
     **28  *
     *  *  **22:Mycoplasma
  **27  *
** 6 lines below screen **

Now to change the way the tree is drawn and yet not change its topological relationships, I choose "O" to designate a new out-group between the Archaebacteria and Eubacteria lineages.

NEXT? (Options: R . = U W O M T F B N H J K L C + ? X Q) (? for Help) o 
Which node should be the new outgroup? 33
** 24 lines above screen **
         *    *     *****13:Giardia
         *    *
     ***34    *****14:Sulfolobus
     *   *
     *   *   ****15:Thermoplasma
     *   ***48
  **33       *    *****16:Methanococcus
  *  *       ****49
  *  *            *****17:Haloarcula
*26  *
  *  *****18:Pyrococcus
  *
  *  *****19:Thermotoga
  **32
     *   ****20:Thermus
     ***31
         *   ****21:Mycobacterium
         ***30
             *  *****22:Mycoplasma
             **29
** 6 lines below screen **

NEXT? (Options: R . = U W O M T F B N H J K L C + ? X Q) (? for Help) x
Do you want to write out the tree to a file? (Y or N) y
Enter R if the tree is to be rooted
OR enter U if the tree is to be unrooted: u
Tree written to file

Be sure to write out your new tree before quitting the program! The output tree will be called outtree; rename it immediately:

% mv outtree final.tree

A PostScript swoopogram of this tree is attached to the end of the exercise. It has pretty much the accepted phylogeny for life, except that it places the Archaebacterian Sulfolobus with the Eukaryotes rather that the rest of the Archaebacteria. This is actually similar to what James Lake asserts with his Eocyte tree. However, I would be tempted to go back and estimate all the distances used for these phylogenies with Dayhoff's PAM style correction rather than Kimura's model to see the difference it would have on the results.

8) Doing it on your own with last week's EF-1-Tu subset alignment.

Now that you have seen how to run the PHYLIP package by reading through my examples, use your EF-1-Tu favorite branch subset alignment from last week to infer the phylogeny of its members. You probably have been doing this as you went through with my example; that's great, otherwise go back to be sure that you didn't leave any steps out with your dataset. There should not be more than about a dozen sequences in these alignments at the most and, therefore, all programs should run decently interactively. Furthermore, because I told you to include one, most similar sequence not of the same phylogenetic lineage, you can take advantage of PHYLIP's outgroup designation. Therefore, in all analyses with your subset, be sure that you designate the appropriate out-group with the "o" option. To receive credit for this week's laboratory exercise I want you to prepare four different PostScript tree files. I do not care what options or tree representation form you choose as long as the sequence names are legible in the output. Use the naming convention (your last name).(program name)_tree_ps for all four PostScript files. The four different EF-1-Tu favorite branch subset alignment trees that I want you to prepare are:

1) A Fitch least-squares fit tree with distances by PAM estimates;

2) A bootstrapped neighbor-joining consensus tree with distances by Kimura;

3) A consensus of the most parsimonious trees from the alignment; and

4) A consensus of the above three trees with the sequence names changed to reflect a standard organism name of each.

Be sure to rcp all four of these trees to teacher in order to get credit for them!

9) Optional work for extra credit.

Repeat the same type of analysis with EF 2-G as I illustrated with the representative list of EF-1-Tu sequences. From your optional EF 2-G protein multiple sequence alignment of last week infer the evolutionary relationships between all of the members. Explore different algorithms for tree inference with your alignment. Pick the most satisfying estimates and prepare a consensus tree from them. Produce a file of that consensus tree using PostScript graphics. Name this final PostScript file (your lastname).EF2_tree_ps. If you do the extra credit portion of the exercise, be sure to send the EF 2-G Consense PostScript tree to TEACHER in order to recieve credit for the optional work.

10) Conclusions and evaluations; finishing up.

To get credit for this lab, you need to complete the exercise report form that was copied into your directory at the beginning of the tutorial. Rename it to have your last name with the mv command but leave the extension .week12, and then go into the file using the pico editor to fill in the report.

% mv week12.week12 (your lastname).week12

% pico (your lastname).week12

In addition to the report form, I also want to be able to print out all of the PostScript trees from your Elongation Factor subset alignment and the extra credit output if you chose to tackle it. Therefore, be sure that all the required PostScript files have your last name as their filename. All required files should now have the same filename (i.e. your lastname) but unique, identifying extensions. Because of this you can rcp them all at once to teacher by utilizing a wildcard:

% rcp (your lastname).* teacher@ribozyme:receive

This concludes your computing session for this week. Log off ribozyme, get out of the emulator and back to the overlapping windows screen.

% exit

Press the <alt> and <x> keys together. You will be asked if you really want to exit the program. Respond with <y> to get out of the teemtalk emulator and return to the overlapping windows screen.

One vital point that can't be repeated often enough, is the importance of your multiple sequence alignments. All subsequent analyses are absolutely dependent upon them. Another point that needs to be considered in phylogenetic inference is do not base an organism's phylogeny on just one gene. There can be complicating factors, such as the nematode sequence seen with the example Elongation Factor phylogeny, that make interpretation difficult. Use several genes -- the Ribosomal Database Project (RDP) provides a good, largely accepted, alignment and phylogenetic framework with which other phylogenies can be compared. Please contact Steve Thompson for further information about using VADMS implementation of RDP and other tools for molecular phylogenetic inference not covered in this week's exercise.

Reference

Felsenstein, J. (1993) PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Dept. of Genetics, University of Washington, Seattle, WA, USA.

Appendix A: The GCG Figure program.

Figure files can be edited to change GCG graphics into what you want. A figure input file contains one instruction per line. Each type of instruction has a special code. Those that you most likely will want to change or use as reference points are listed below. There are, of course, others; refer to the GCG Program Manual for more information on the Figure program and its complete instruction set.


.d xx.xx xxx.xx   draw to location   draws a line to the location on the device
					given by xx.xx xxx.xx, the platten coordinates.

.m xx.xx xxx.xx   move to location   move the pen to the location on the device
					given by xx.xx xxx.xx the platten coordinates.

.nc x               new color         changes the pen color or sets it to a 
                                        new value.

.pt string         plot text string    the command to plot the text string 
                                        shown on the line.

Another instruction that you won't often find is that to designate the linestyle for drawn lines. This may be something that you might want to change to reflect a particular branch of the tree that you feel is very important or particularly shakey.


.ls x x.xx        linestyle            changes the nature of the line to be drawn. 
                                         A linestyle line of .ls 1 0.10 would produce 
                                         a solid line of width .1 platten units.  
                                         A line with .ls 2 0.50 would produce a
                                         dotted line with the dots being .5 platten
                                         units apart.

Other instructions that will be useful when you modify these files are those which control text orientation, character size, and text angle.


.to x              text orientation     determines the relationship of the text 
                                         to the location given in the previous move 
                                         command. A value of 2 means that the text 
                                         starts just to the right of the location.
.ch x.x            character height     sets the height of the characters used.
.td xx.x           text angle           set the angle at which the text is drawn. 
                                         The default on this is horizontal.

Appendix A: The GCG Figure program (continued). An example of part of a Figure output file from DrawTree data is given below. It shows what new pen, move and draw instructions look like. Although this type of file may be long and appears awkward at first, using the Figure file allows you complete editoral control of your plot files. The data is in a format that can be worked with by itself, or be included within other GCG figure files. Its uses are only up to your imagation.

!
.! Figure output from ""  October 22, 1992  11:41
.!
.vp 0 150 0 100	                      !(establishes the window for the plot) 
.wn 0 150 0 100
.nc 1                                 !(sets the color to black)
.vp 0 150 0 100
.nc 1                                 !(sets the color to black) 
.m 60.81 31.61
.d 55.27 37.15
.m 55.27 37.15
.d 47.43 37.15
.m 47.43 37.15
.d 39.59 37.15
.m 39.59 37.15
.d 34.04 31.61
.m 39.59 37.15
.d 31.74 37.15
.m 47.43 37.15
.d 47.43 44.99
.m 55.27 37.15
.d 63.11 37.15
.m 60.81 31.61
.d 66.36 26.06
.nc 2                                 !(sets the color to green)
.m 15.85 31.58                        !(this is the beginning of the character)
.d 14.75 28.7                         !drawing portion of the file)
.m 15.85 31.58
.d 16.94 28.7
.m 15.16 29.66
.d 16.53 29.66
.m 17.08 28.7
.m 19 30.21
.d 18.87 30.48
.d 18.46 30.62
.d 18.04 30.62
.d 17.63 30.48
.d 17.49 30.21
.d 17.63 29.93
.d 17.91 29.79
.d 18.59 29.66
.d 18.87 29.52
.d 19 29.24
.d 19 29.11
.d 18.87 28.83

////////////////////////////////////////////////////////////////////////////

.!
.! End of Page 1
.!

Appendix B: GCG graphics command switches

The following listing describes command switches that GCG graphics routines and phylip_plot will respond to. A brief description of each's effect is given. The required part of the command is shown in upper-case bold type. "I" denotes an integer number and "x" a real one.


-COLor=I             the entire structure is drawn with the graphics
                      device's pen I color regardless of the color designations 
                      within the file.

-FIGure=yyyyy.yyy   writes the contents of the plot into a text 
                      file called yyyyy.yyy that can be used as an input file 
                      for the program Figure.  Figure files can be understood 
                      and changed if desired. See the Program Manual for more information on this program.

-PLOT=yyyyyy.yyy   writes the plot to the designated filename using whichever 
                      graphics language has been preconfigured by the user.

-FONT=I              draws all the text characters in the plot with font I.
                      GCG software will work with 22 different character sets. 
                     But, remember, the name labels within PHYLIP plots 
                      are not text characters.

-PORtrait            rotates the plot 90 degrees.

-SCAle=x             scales the size of the plot by the factor x. The
                      factor can either expand or contract the size of the plot 
                      depending on the value of the number used, i.e. x=2.0 
                      would be bigger, x=.5 would be smaller.

-XSCAle=x            scales the size of the x axis by the factor x.

-YSCAle=x            scales the size of the y axis by the factor x.

-XPAN=xx.xx          moves the plot to the right along the x axis xx.xx 
                      platen units

-YPAN=xx.xx          moves the plot to up along the y axis up xx.xx 
                      platen units

-SPEed=x             controls the speed of the pen, the default is 10.0 and
                      is quite fast. A value of 1.0 or 2.0 slows down the 
                      plotting process and produces better lines on a HP pen 
                      plotter for a higher quality plot.

-NOMARk              suppresses the file identifier stamp in lower-right 
                      corner of plot.

Appendix C: Tree graphics from the exercise

The Kimura distances, Fitch derived tree drawn by DrawTree.

The bootstrapped neighbor-joining tree (distances by Kimura) phenogram.

The consensus of the two most parsimonious trees drawn as a cladogram.

The final consensus tree of the three methods used in the exercise presented as a Swoopogram: