The main feature of the old package was the ability to carry out reliable multiple alignments of many sequences. The sensitivity of the program is as good as from any other program we have tried, with the exception of the programs of Vingron and Argos (1991), while it works in reasonable time on a microcomputer. The programs of Vingron and Argos are specialised for finding distant similarities between proteins but require mainframes or workstations and are more difficult to use.
The main new features are: profile alignments (alignments of old alignments); phylogenetic trees (Neighbor Joining trees calculated after multiple alignment with a bootstrapping option); better sequence input (automatically recognise and read NBRF/PIR, Pearson (Fasta) or EMBL/SwissProt formats); flexible alignment output (choose one of: old Clustal format, NBRF/PIR, GCG msf format or Phylip format); full command line interface (everything that you can do interactively can be specified on the command line).
In version 7 of the GCG package, there is a program called PILEUP which uses a very similar algorithm to the one in ClustalV. There are 2 main differences between the programs: 1) the metric used to compare the sequences for the initial "guide tree" uses a full global, optimal alignment in PILEUP instead of the fast, approximate ones in ClustalV. This makes PILEUP much slower for the comparison of long sequences. In principle, the distances calculated from PILEUP will be more sensitive than ours, but in practice it will not make much difference, except in difficult cases. 2) During the multiple alignment, terminal gaps are penalised in ClustalV but not in PILEUP. This will make the PILEUP alignments better when the sequences are of very different lengths (has no effect if there are no large terminal gaps).
This software may be distributed and used freely,
provided that you do not modify it or this documentation in any way without the permission of the authors.
If you wish to refer to ClustalV, please cite: Higgins,D.G. Bleasby,A.J. and Fuchs,R. (1991) CLUSTAL V: improved software for multiple sequence alignment. CABIOS, vol .8, 189-191.
The overall multiple alignment algorithm was described in: Higgins,D.G. and Sharp,P.M. (1989). Fast and sensitive multiple sequence alignments on a microcomputer. CABIOS, vol. 5, 151-153.
ACKNOWLEDGEMENTS
D.H.
would particularly like to thank Paul Sharp,
in whose lab.
this work originated.
We also thank Manolo Gouy,
Gene Myers,
Peter Rice and Martin Vingron for suggestions,
bug-fixes and help.
Des Higgins and Rainer Fuchs,
EMBL Data Library,
Heidelberg,
Germany.
Alan Bleasby,
Daresbury,
UK.
JUNE 1991
Interactive usage of Clustal V is completely menu driven. On-line help is provided, defaults are offered for all parameters and file names. With a little effort it should be completely self explanatory. The main menu, which appears when you run the programs is shown below. Each item brings you to a sub menu.
Main menu for Clustal V: 1. Sequence Input From Disc 2. Multiple Alignments 3. Profile Alignments 4. Phylogenetic trees S. Execute a system command H. HELP X. EXIT (leave program) Your choice:The options S and H appear on all the main menus. H will provide help and if you type S you will be asked to enter a command, such as DIR or LS, which will be sent to the system (does not work on Mac's). Before carrying out an alignment, you must use option 1 (sequence input); the format for sequences is explained below. Under menu item 2 you will be able to automatically align your sequences to each other. Menu item 3 allows you to do profile alignments. These are alignments of old alignments. This allows you to build up a multiple alignment in stages or add a new sequence to an old alignment. You can calculate phylogenetic trees from alignments using menu item 4.
To do a complete multiple alignment, we need to know the approximate relationships of the sequences to each other (which ones are most similar to each other). We do this by calculating a crude phylogenetic tree which we call a dendrogram (to distinguish it from the more sensitive trees available under the phylogenetic tree menu). This dendrogram is used as a guide to align bigger and bigger groups of sequences during the multiple alignment. The dendrogram is calculated in 2 stages: 1) all pairs of sequence are compared using the fast/approximate method of Wilbur and Lipman (1983); the result of each comparison is a similarity score. 2) the similarity scores are used to construct the dendrogram using the UPGMA cluster analysis method of Sneath and Sokal (1973).
The construction of the dendrogram can be very time consuming if you wish to align many sequences (e.g. for 100 sequences you need to carry out 100x99/2 sequence comparisons = 4950). During every multiple alignment, a dendrogram is constructed and saved to a file (something.dnd). These can be reused later.
******Multiple*Alignment*Menu****** 1. Do complete multiple alignment now 2. Produce dendrogram file only 3. Use old dendrogram file 4. Pairwise alignment parameters 5. Multiple alignment parameters 6. Output format options S. Execute a system command H. HELP or press [RETURN] to go back to main menu Your choice:So, if in doubt, and you have already loaded some sequences from the main menu, just try option 1 and press the <Return> key in response to any questions. You will be prompted for 2 file names e.g. if the sequence input file was called DRINK.PEP, you will be offered DRINK.ALN as the file to contain the alignment and DRINK.DND for the dendrogram.
If you wish to repeat a multiple alignment (e.g. to experiment with different gap penalties) but do not wish to make a dendrogram all over again use menu item 3 (providing you are using the same sequences). Similarly, menu item 2 allows you to produce the dendrogram file only.
********* WILBUR/LIPMAN PAIRWISE ALIGNMENT PARAMETERS ********* 1. Toggle Scoring Method :Percentage 2. Gap Penalty :3 3. K-tuple :1 4. No. of top diagonals :5 5. Window size :5 H. HELP Enter number (or [RETURN] to exit):The similarity scores are calculated from fast alignments generated by the method of Wilbur and Lipman (1983). These are 'hash' or 'word' or 'k-tuple' alignments carried out in 3 stages.
First you mark the positions of every fragment of sequence, K-tuple long (for proteins, the default length is 1 residue, for DNA it is 2 bases) in both sequences. Then you locate all k-tuple matches between the 2 sequences. At this stage you have to imagine a dot- matrix plot between the 2 sequences with each k-tuple match as a dot. You find those diagonals in the plot with most matches (you take the "No. of top diagonals" best ones) and mark all diagonals within Window size" of each top diagonal. This process will define diagonal bands in the plot where you hope the most likely regions of similarity will lie.
The final alignment stage is to find that head to tail arrangement of k-tuple matches from these diagonal regions that will give the highest score. The score is calculated as the number of exactly matching residues in this alignment minus a "gap penalty" for every gap that was introduced. When you toggle "Scoring method" you choose between expressing these similarity scores as raw scores or expressed as a percentage of the shorter sequence length.
K-TUPLE SIZE: Can be 1 or 2 for proteins; 1 to 4 for DNA. Increase this to increase speed; decrease to improve sensitivity.
GAP PENALTY: The number of matching residues that must be found in order to introduce a gap. This should be larger than K-Tuple Size. This has little effect on speed or sensitivity.
NO. OF TOP DIAGONALS: The number of best diagonals in the imaginary dot-matrix plot that are considered. Decrease (must be greater than zero) to increase speed; increase to improve sensitivity.
WINDOW SIZE: The number of diagonals around each "top" diagonal that are considered. Decrease for speed; increase for greater sensitivity.
SCORING METHOD: The similarity scores may be expressed as raw scores (number of identical residues minus a "gap penalty" for each gap) or as percentage scores. If the sequences are of very different lengths, percentage scores make more sense.
The alignments are carried out in a small amount of memory. Occasionally (it is hard to predict), you will run out of memory while doing these alignments; when this happens, it will say on the screen: "Sequences (a,b) partially aligned" (instead of "Sequences (a,b) aligned"). This means that the alignment score for these sequences will be approximate; it is not a problem unless many of the alignments do this. It can be fixed by using less sensitive parameters or increasing parameter FSIZE in clustalv.h .
91.0 0 0 2 012000 ! seq 2 joins seq 3 at 91% ID. 72.0 1 0 3 011200 ! seq 4 joins seqs 2,3 at 72% 71.1 0 0 2 000012 ! seq 5 joins seq 6 at 71% 35.5 0 2 4 122200 ! seq 1 joins seqs 2,3,4 21.7 4 3 6 111122 ! seqs 1,2,3,4 join seqs 5,6This LOOKS complicated but you do not normally need to care what is in here. Anyway, each row represents the joining together of 2 or more sequences. You progress from the top down, joining more and more sequences until all are joined together; for N sequences you have N-1 groupings hence there are 5 rows in the above file (there were 6 sequences). In each row, the first number is the similarity score of this grouping; ignore the next three columns for the moment; the last 6 digits in the line show which sequences are grouped; there is one digit for each sequence (the first digit is for the first sequence). The rule is: in each row, all of the "1"s join all of the "2"s; the zero's do nothing.
Hence, in the first row, sequence 2 joins sequence 3 at a similarity level of 91% identity; next, sequence 4 joins the previous grouping of 2 plus 3 at a level of 72% etc. This is shown diagrammatically below. Before leaving the dendrogram format, the other 3 columns of numbers are: a pointer to the row from which the "1" sequences were last joined (or zero if only one of them); a pointer to the row in which the "2"s were last joined; the total number of sequences joined in this line.
I------ 2
I------I
I I------ 3 Diagram of the sequence similarity
I----I
I I------------- 4 relationships shown in the above
I--I
I I------------------ 1 dendrogram file (branch lengths are
----I
I I------------- 5 not to scale).
I-------I
I------------- 6
********* MULTIPLE ALIGNMENT PARAMETERS ********* 1. Fixed Gap Penalty :10 2. Floating Gap Penalty :10 3. Toggle Transitions (DNA):Weighted 4. Protein weight matrix :PAM 250 H. HELP Enter number (or [RETURN] to exit):FIXED GAP PENALTY: Reduce this to encourage gaps of all sizes; increase it to discourage them. Terminal gaps are penalised same as all others. BEWARE of making this too small (approx 5 or so); if the penalty is too small, the program may prefer to align each sequence opposite one long gap.
FLOATING GAP PENALTY: Reduce this to encourage longer gaps; increase it to shorten them. Terminal gaps are penalised same as all others. BEWARE of making this too small (approx 5 or so); if the penalty is too small, the program may prefer to align each sequence opposite one long gap.
DNA TRANSITIONS = WEIGHTED or UNWEIGHTED: By default,
transitions (A versus G;
C versus T)
are weighted more strongly than transversions (an A aligned with a G will be preferred to an A aligned with a C or a T).
You can make all pairs of nucleotide equally weighted with this option.
PROTEIN WEIGHT MATRIX: For protein comparisons, a weight matrix is used to differentially weight different pairs of aligned amino acids. The default is the well known Dayhoff PAM 250 matrix. We also offer a PAM 100 matrix, an identity matrix (all weights are the same for exact matches) or allow you to give the name of a file with your own matrix. The weight matrices used by Clustal V are shown in full in the Algorithms and References section of this documentation.
If you input a matrix from a file, it must be in the following format. Use a 20x20 matrix only (entries for the 20 "normal" amino acids only; no ambiguity codes etc.). Input the lower left triangle of the matrix, INCLUDING the diagonal. The order of the amino acids (rows and columns) must be: CSTPAGNDEQHRKMILVFYW. The values can be in free format seperated by spaces (not commas). The PAM 250 matrix is shown below in this format.
12
0 2
-2 1 3
-3 1 0 6
-2 1 1 1 2
-3 1 0 -1 1 5
-4 1 0 -1 0 0 2
-5 0 0 -1 0 1 2 4
-5 0 0 -1 0 0 1 3 4
-5 -1 -1 0 0 -1 1 2 2 4
-3 -1 -1 0 -1 -2 2 1 1 3 6
-4 0 -1 0 -2 -3 0 -1 -1 1 2 6
-5 0 0 -1 -1 -2 1 0 0 1 0 3 5
-5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2 0 0 6
-2 -1 0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2 2 5
-6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3 4 2 6
-2 -1 0 -1 0 -1 -2 -2 -2 -2 -2 -2 -2 2 4 2 4
-4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5 0 1 2 -1 9
0 -3 -3 -5 -3 -5 -2 -4 -4 -4 0 -4 -4 -2 -1 -1 -2 7 10
-8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3 2 -3 -4 -5 -2 -6 0 0 17
Values must be integers and can be all positive or positive and negative as above.
These are SIMILARITY values.
*** We draw your attention to NBRF/PIR format in particular. This format is EXACTLY the same as one of the input formats. Therefore, alignments written in this format can be used again as input (to the profile alignments or phylogenetic trees). ***
********* Format of Alignment Output ********* 1. Toggle CLUSTAL format output = ON 2. Toggle NBRF/PIR format output = OFF 3. Toggle GCG format output = OFF 4. Toggle PHYLIP format output = OFF 5. Create alignment output file(s) now? H. HELP Enter number (or [RETURN] to exit):CLUSTAL FORMAT: This is a self explanatory alignment. The alignment is written out in blocks. Identities are highlighted and (if you use a PAM 250 matrix) positions in the alignment where all of the residues are "similar" to each other (PAM 250 score of 8 or more) are indicated.
NBRF/PIR FORMAT: This is the usual NBRF/PIR format with gaps indicated by hyphens ("-"). AS we have stressed before, this format is EXACTLY compatible with the sequence input format. Therefore you can read in these alignments again for profile alignments or for calculating phylogenetic trees.
GCG FORMAT: In version 7 of the Wisconsin GCG package, a new multiple sequence format was introduced. This is the MSF (Multiple Sequence Format) format. It can be used as input to the GCG sequence editor or any of the GCG programs that make use of multiple alignments. THIS FORMAT IS ONLY SUPPORTED IN VERSION 7 OF THE GCG PACKAGE OR LATER.
PHYLIP FORMAT: This format can be used by the Phylip package of Joe Felsenstein (see the references/algorithms section for details of how to get it). Phylip allows you to do a huge range of phylogenetic analyses (we just offer one method in this program) and is probably the most widely used set of programs for drawing trees. It also works on just about every computer you can think of, providing you have a decent Pascal compiler.
This menu is for taking two old alignments (or single sequences) and aligning them with each other. The result is one bigger alignment. The menu is very similar to the multiple alignment menu except that there is no mention of dendrograms here (they are not needed) and you need to input two sets of sequences. The menu looks like this:
******Profile*Alignment*Menu****** 1. Input 1st. profile/sequence 2. Input 2nd. profile/sequence 3. Do alignment now 4. Alignment parameters 5. Output format options S. Execute a system command H. HELP or press [RETURN] to go back to main menu Your choice:You must input profile number 1 first. When both profiles are loaded, use item 3 (Do alignment now) and the 2 profiles will be aligned. Items 4 and 5 (parameters and output options) are identical to the equivalent options on the multiple alignment menu.
The same input routines that are used for general input are used here i.e. sequences must be in NBRF/PIR, EMBL/SwissProt or FASTA format, with gaps indicated by hyphens ("-"). This is why we have continualy drawn your attention to the NBRF/PIR format as a useful output format.
Either profile can consist of just one sequence. Therefore, if you have a favourite alignment of sequences that you are working on and wish to add a new sequence, you can use this menu, provided the alignment is in the correct format.
The total number of sequences in the two profiles must be less less than or equal to the MAXN parameter set in the clustalv.h header file.
******Phylogenetic*tree*Menu****** 1. Input an alignment 2. Exclude positions with gaps? = OFF 3. Correct for multiple substitutions? = OFF 4. Draw tree now 5. Bootstrap tree S. Execute a system command H. HELP or press [RETURN] to go back to main menu Your choice:The same input routine that is used for general input is used here i.e. sequences must be in NBRF/PIR, EMBL/SwissProt or FASTA format, with gaps indicated by hyphens ("-"). This is why we have continualy drawn your attention to the NBRF/PIR format as a useful output format.
If you have input an alignment, then just use item 4 to draw a tree. The method used is the Neighbor Joining method of Saitou and Nei (1987). This is a "distance method". First, percent divergence figures are calculated between all pairs of sequence. These divergence figures are then used by the NJ method to give the tree. Example trees will be shown below.
There are two options which can be used to control the way the distances are calculated. These are set by options 2 and 3 in the menu.
EXCLUDE POSITIONS WITH GAPS? This option allows you to ignore all alignment positions (columns) where there is a gap in ANY sequence. This guarantees that "like" is compared with "like" in all distances i.e. the same positions are used to calculate all distances. It also means that the distances will be "metric". The disadvantage of using this option is that you throw away much of the data if there are many gaps. If the total number of gaps is small, it has little effect.
CORRECT FOR MULTIPLE SUBSTITUTIONS? As sequences diverge, substitutions accumulate. It becomes increasingly likely that more than one substitution (as a result of a mutation) will have happened at a site where you observe just one difference now. This option allows you to use formulae developed by Motoo Kimura to correct for this effect. It has the effect of stretching long branches in tres while leaving short ones relatively untouched. The desired effect is to try and make distances proportional to time since divergence.
The tree is sent to a file called BLAH.NJ, where BLAH.SEQ is the name of the input, alignment file. An example is shown below for 6 globin sequences.
DIST = percentage divergence (/100)
Length = number of sites used in comparison
1 vs. 2 DIST = 0.5683; length = 139
1 vs. 3 DIST = 0.5540; length = 139
1 vs. 4 DIST = 0.5315; length = 111
1 vs. 5 DIST = 0.7447; length = 141
1 vs. 6 DIST = 0.7571; length = 140
2 vs. 3 DIST = 0.0897; length = 145
2 vs. 4 DIST = 0.1391; length = 115
2 vs. 5 DIST = 0.7517; length = 145
2 vs. 6 DIST = 0.7431; length = 144
3 vs. 4 DIST = 0.0957; length = 115
3 vs. 5 DIST = 0.7379; length = 145
3 vs. 6 DIST = 0.7361; length = 144
4 vs. 5 DIST = 0.7304; length = 115
4 vs. 6 DIST = 0.7368; length = 114
5 vs. 6 DIST = 0.2697; length = 152
Neighbor-joining Method
Saitou, N. and Nei, M. (1987) The Neighbor-joining Method:
A New Method for Reconstructing Phylogenetic Trees.
Mol. Biol. Evol., 4(4), 406-425
This is an UNROOTED tree
Numbers in parentheses are branch lengths
Cycle 1 = SEQ: 5 ( 0.13382) joins SEQ: 6 ( 0.13592)
Cycle 2 = SEQ: 1 ( 0.28142) joins Node: 5 ( 0.33462)
Cycle 3 = SEQ: 2 ( 0.05879) joins SEQ: 3 ( 0.03086)
Cycle 4 (Last cycle, trichotomy):
Node: 1 ( 0.20798) joins
Node: 2 ( 0.02341) joins
SEQ: 4 ( 0.04915)
The output file first shows the percent divergence (distance)
figures between each pair of sequence.
Then a description of a NJ tree is given.
This description shows which sequences (SEQ:)
or which groups of sequences (NODE: ,
a node is numbered using the lowest sequence that belongs to it)
join at each level of the tree.
This is an unrooted tree!! This means that the direction of evolution through the tree is not shown. This can only be inferred in one of two ways: 1) assume a degree of constancy in the molecular clock and place the root (bottom of the tree; the point where all the sequences radiate from) half way along the longest branch. **OR** 2) use an "outgroup", a sequence from an organism that you "know" must be outside of the rest of the sequences i.e. root the tree manually, on biological grounds.
The above tree can be represented diagramatically as follows:
SEQ 1 SEQ 4
I I
13.6 I 28.1 I 4.9 5.9
SEQ 6 ----------I I I I--------- SEQ 2
I I I I
I--------I-----------I----------I
13.4 I 33.5 20.8 2.3 I 3.1
SEQ 5 ----------I I--------- SEQ 3
The figures along each branch are percent divergences along that branch.
If you root the tree by placing the root along the longest branch (33.5%)
then you can draw it again as follows,
this time rooted:
13.6
I-------------------- SEQ 6
I---------I 13.4
I I-------------------- SEQ 5
I 33.5
-----I 28.1
I I-------------------- SEQ 1
I I
I---------I 4.9
I 20.8 I----------- SEQ 4
I--------I
I 5.9
I 2.3 I----- SEQ 2
I-----I 3.1
I----- SEQ 3
The longest branch (33.5% between 5,6 and 1,2,3,4)
is split between the 2 bottom branches of the tree.
As it happens in this particular case,
sequences 5 and 6 are myoglobins while sequences 1,2,3 and 4 are alpha and beta globins,
so you could also justify the above rooting on biological grounds.
If you do not have any particular need or evidence for the position of the root,
then LEAVE THE TREE UNROOTED.
Unrooted trees do not look as pretty as rooted ones but it is uaual to leave them unrooted if you do not have any evidence for the position of the root.
BOTSTRAPPING: Different sets of sequences and different tree drawing methods may give different topologies (branching orders)
for parts of a tree that are weakly supported by the data.
It is useful to have an indication of the degree of error in the tree.
There are several ways of doing this,
some of them rather technical.
We provide one general purpose method in this program,
which makes use of a technique called bootstrapping (see Felsenstein,
1985).
In the case of sequence alignments, bootstrapping involves taking random samples of positions from the alignment. If the alignment has N positions, each bootstrap sample consists of a random sample of N positions, taken WITH REPLACEMENT i.e. in any given sample, some sites may be sampled several times, others not at all. Then, with each sample of sites, you calculate a distance matrix as usual and draw a tree. If the data very strongly support just one tree then the sample trees will be very similar to each other and to the original tree, drawn without bootstrapping. However, if parts of the tree are not well supported, then the sample trees will vary considerably in how they represent these parts.
In practice, you should use a very large number of bootstrap replicates (1000 is recommended, even if it means running the program for an hour on a slow microcomputer; on a workstation it will be MUCH faster). For each grouping on the tree, you record the number of times this grouping occurs in the sample trees. For a group to be considered "significant" at the 95% level (or P <= 0.05 in statistical terms) you expect the grouping to show up in >= 95% of the sample trees. If this happens, then you can say that the grouping is significant, given the data set and the method used to draw the tree.
So, when you use the bootstrap option, a NJ tree is drawn as before and then you are asked to say how many bootstrap samples you want (1000 is the default) and you are asked to give a seed number for the random number generator. If you give the same seed number in future, you will get the same results (we hope). Remember to give different seed numbers if you wish to carry out genuinely different bootstrap sampling experiments. Below is the output file from using the same data for the 6 globin sequences as used before. The output file has the same name as the input fike with the extension ".njb".
// STUFF DELETED .... same as for the ordinary NJ output // Bootstrap Confidence Limits Random number generator seed = 99 Number of bootstrap trials = 1000 Diagrammatic representation of the above tree: Each row represents 1 tree cycle; defining 2 groups. Each column is 1 sequence; the stars in each line show 1 group; the dots show the other Numbers show occurences in bootstrap samples. ****.. 1000 .***.. 1000 <- This is the answer!! *..*** 812 122311For an unrooted tree with N sequences, there are actually only N-3 genuinely different groupings that we can test (this is the number of "internal branches"; each internal branch splits the sequences into 2 groups). In this example, we have 6 sequences with 3 internal branches in the reference tree. In the bootstrap resampling, we count how often each of these internal branches occur. Here, we find that the branch which splits 1,2,3 and 4 versus 5 and 6 occurs in all 1000 samples; the branch which splits 2,3 and 4 versus 1,5 and 6 occurs in 1000; the branch which splits 2 and 3 versus 1,4,5 and 6 occurs in 812/1000 samples. We can put these figures on to the diagrammatic representation we made earlier of our unrooted NJ tree as follows:
SEQ 1 SEQ 4
I I
I I
SEQ 6 ----------I I I I--------- SEQ 2
I 1000 I 1000 I 812 I
I--------I-----------I----------I
I I
SEQ 5 ----------I I--------- SEQ 3
You can equally put these confidence figures on the rooted tree (in fact the interpretation is simpler with rooted trees).
With the unrooted tree,
the grouping of sequence 5 with 6 is significant (as is the grouping of sequences 1,2,3 and 4).
Equally the grouping of sequences 1,5 and 6 is significant (the same as saying that 2,3 and 4 group significantly).
However,
the grouping of 2 and 3 is not significant,
although it is relatively strongly supported.
Unfortunately, there is a small complication in the interpretation of these results. In statistical hypothesis testing, it is not valid to make multiple simultaneous tests and to treat the result of each test completely independantly. In the above case, if you have one particular test (grouping) that you wish to make in advance, it is valid to test IT ALONE and to simply show the other bootstrap figures for reference. If you do not have any particular test in mind before you do the bootstrapping, you can just show all of the figures and use the 95% level as an ARBITRARY cut off to show those groups that are very strongly supported; but not mention anything about SIGNIFICANCE testing. In the literature, it is common practice to simply show the figures with a tree; they frequently speak for themselves.
The positions of gaps that are generated in early alignments remain through later stages. This can be justified because gaps that arise from the comparison of closely related sequences should not be moved because of later alignment with more distantly related sequences. At each alignment stage, you align two groups of already aligned sequences. This is done using a dynamic programming algorithm where one allows the residues that occur in every sequence at each alignment position to contribute to the alignment score. A Dayhoff (1978) PAM matrix is used in protein comparisons.
The details of the algorithm used in ClustalV have been published in Higgins and Sharp (1989). This was an improved version of an earlier algorithm published in Higgins and Sharp (1988). First, you calculate a crude similarity measure between every pair of sequence. This is done using the fast, approximate alignment algorithm of Wilbur and Lipman (1983). Then, these scores are used to calculate a "guide tree" or dendrogram, which will tell the multiple alignment stage in which order to align the sequences for the final multiple alignment. This "guide tree" is calculated using the UPGMA method of Sneath and Sokal (1973). UPGMA is a fancy name for one type of average linkage cluster analysis, invented by Sokal and Michener (1958).
Having calculated the dendrogram, the sequences are aligned in larger and larger groups. At each alignment stage, we use the algorithm of Myers and Miller (1988) for the optimal alignments. This algorithm is a very memory efficient variation of Gotoh's algorithm (Gotoh, 1982). It is because of this algorithm that ClustalV can work on microcomputers. Each of these alignments consists of aligning 2 alignments, using what we call "profile alignments".
Profile alignments are a simple extension of 2 sequence alignments in that you can treat each of the two input alignments as single sequences but you calculate the score at aligned positions as the average weight matrix score of all the residues in one alignment versus all those in the other e.g. if you have 2 alignments with I and J sequences respectively; the score at any position is the average of all the I times J scores of the residues compared seperately. Any gaps that are introduced are placed in all of the sequences of an alignment at the same position. The profile alignments offered in the "profile alignment menu" are also calculated in this way.
PAM 250 C 12 S 0 2 T -2 1 3 P -3 1 0 6 A -2 1 1 1 2 G -3 1 0 -1 1 5 N -4 1 0 -1 0 0 2 D -5 0 0 -1 0 1 2 4 E -5 0 0 -1 0 0 1 3 4 Q -5 -1 -1 0 0 -1 1 2 2 4 H -3 -1 -1 0 -1 -2 2 1 1 3 6 R -4 0 -1 0 -2 -3 0 -1 -1 1 2 6 K -5 0 0 -1 -1 -2 1 0 0 1 0 3 5 M -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2 0 0 6 I -2 -1 0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2 2 5 L -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3 4 2 6 V -2 -1 0 -1 0 -1 -2 -2 -2 -2 -2 -2 -2 2 4 2 4 F -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5 0 1 2 -1 9 Y 0 -3 -3 -5 -3 -5 -2 -4 -4 -4 0 -4 -4 -2 -1 -1 -2 7 10 W -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3 2 -3 -4 -5 -2 -6 0 0 17 ---------------------------------------------------------------- C S T P A G N D E Q H R K M I L V F Y W IDENTITY MATRIX 10 0 10 0 0 10 0 0 0 10 0 0 0 0 10 0 0 0 0 1 10 0 0 0 0 0 0 10 0 0 0 0 0 0 0 10 0 0 0 0 0 0 0 0 10 0 0 0 0 0 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 PAM 100 14 -1 6 -5 2 7 -6 1 -1 10 -5 2 2 1 6 -8 1 -3 -3 1 8 -8 2 0 -3 -1 -1 7 -11 -1 -2 -4 -1 -1 4 8 -11 -2 -3 -3 0 -2 1 5 8 -11 -3 -3 -1 -2 -5 -1 1 4 9 -6 -4 -5 -2 -5 -7 2 -1 -2 4 11 -6 -1 -4, -2 -5 -8 -3 -6 -5 1 1 10 -11 -2 -1 -4 -4 -5 1 -2 -2 -1 -3 3 8 -11 -4 -2 -6 -3 -8 -5 -8 -6 -2 -7 -2 1 13 -5 -4 -1 -6 -3 -7 -4 -6 -5 -5 -7 -4 4 2 9 -12 -7 -5 -5 -5 -8 -6 -9 -7 -3 -5 -7 -6 4 2 9 -4 -4 -1 -4 0 -4 -5 -6 -5 -5 -6 -6 -6 1 5 1 8 -10 -5 -6 -9 -7 -8 -6 -11 -11 -10 -4 -7-11 -2 0 0 -5 12 -2 -6 -6 -11 -6 -11 -3 -9 -7 -9 -1-10-10 -8 -4 -5 -6 6 13 -13 -4 -10 -11 -11 -13 -8 -13 -14 -11 -7 1 -9-11-12 -7-14 -2 -2 19
There is a constant debate in the literature as to the merits of different methods but unfortunately, a lot of what is said is incomprehensible or inaccurate. It is also a field that is prone to having highly opinionated schools of thought. This is a pity because it prevents rational discussion of the pro's and con's of the different methods. The approach adopted in ClustalV is to supply just one method and to produce alignments in a format that can be used by Phylip. In simple cases, the trees produced will be as "good" (reliable, robust) as those from ANY other method. In more complicated cases, there is no single magic recipe that we can supply that will work well in even most situations.
The method we provide is the Neighbor Joining method (NJ) of Saitou and Nei (1987) which is a distance method. We use this for three reasons: it is conceptually and computationally simple; it is fast; it gives "good" trees in simple cases. It is difficult to prove that one tree is "better" than another if you do not know the true phylogeny; the few systematic surveys of methods show it to work more or less as well as any other method ON AVERAGE. Another reason for using the NJ method is that it is very commonly used; THIS IS A BAD REASON SCIENTIFICALLY but at least you will not feel lonely if you use it.
The NJ method works on a matrix of distances (the distance matrix) between all pairs of sequence to be analysed. These distances are related to the degree of divergence between the sequences. It is normal to calculate the distances from the sequences after they are multiply aligned. If you calculate them from seperate alignments (as done for the dendrograms in another part of this program), you may increase the error considerably.
This measure of distance is perfectly adequate (with some further modification described below) for rRNA sequences. However it treats all residues identically e.g. all amino acid substitutions are equally weighted. It also treats all positions identically e.g. it does not take account of different rates of substitution in different positions of different codons in protein coding DNA sequences; see Li et al (1985) for a distance measure that does. Despite these shortcomings, these percent identity distances do work well in practice in a wide variety of situations.
In a simple world, you would like a distance to be proportional to the time since the sequences diverged. If this were EXACTLY true, then the calculation of the tree would be a simple matter of algebra (UPGMA does this for you) and the branch lengths will be nice and meaningful (times). In practice this OBVIOUSLY depends on the existence and quality of the "molecular clock", a subject of on- going debate. However, even if there is a good clock, there is a further problem with estimating divergences. As sequences diverge, they become "saturated" with mutations. Sites can have substitutions more than once. Calculated distances will underestimate actual divergence times; the greater the divergence, the greater the discrepancy. There are various methods for dealing with this and we provide two commonly used ones, both due to Motoo Kimura; one for proteins and one for DNA.
For distance K (percent divergence /100 )
...
Correction for Protein distances: (Kimura, 1983).
Corrected K = -ln(1.0 - K - (K * k/5.0))
Correction for nucleotide distances: Kimura's 2-parameter method (Kimura,
1980).
Corrected K = 0.5*ln(a) + 0.25*ln(b) where a = 1/(1 - 2*P - Q) and b = 1/(1 - 2*Q)P and Q are the proportions of transitions (A<-->G, C<-->T) and transversions occuring between the sequences.
One paradoxical effect of these corrections,
is that distances can be corrected to have more than 100% divergence.
That is because,
for very highly diverged sequences of length N,
you can estimate that more than N substitutions have occured by correcting the observed distance in the above ways.
Don't panic!
As explained elsewhere in the documentation, you can only root the tree by one of two methods:
1) assume a degree of constancy in the molecular clock and place the root along the longest branch (internal or external). Methods that appear to produce rooted trees automatically are often just doing this without letting you know; this is true of UPGMA.
2) root the tree on biological grounds. The usual method is to include an "outgroup", a sequence that you are certain will branch to the outside of the tree.
The method works by taking random samples of data from the complete data set. You compute the test statistic (tree in this case) on each sample. Variation in the statistic computed from the samples gives a measure of variation in the statistic which can be used to calculate confidence intervals. Each random sample is the same size as the complete data set and is taken WITH REPLACEMENT i.e. a data point can be selected more than once (or not at all) in any given sample.
In the case of an alignment N residues long, each random sample is a random selection of N sites form the alignment. For each sample, we calculate a distance matrix and tree in the usual way. Variation in the sample trees compared to a tree calculated from the full data set gives an indication of how well supported the tree is by the data. If the sample trees are very similar to each other and to the full tree, then the tree is "strongly" supported; if the sample trees show great variation, then the tree will be weakly supported. In practice, you usually find some parts of a tree well supported, others weakly. This can be seen by counting how often each monophyletic group in the full tree occurs in the sample trees.
For a particular grouping, one considers it to be significant at the 95% level (P <= 0.05) if it occurs in 95% of the bootstrap samples. If a grouping is significant, it is significant with respect to the particular data set and method used for drawing the tree. Biological "significance" is another matter.
================= PHYLIP information sheet =====================
PHYLIP - Phylogeny Inference Package (version 3.3)
This is a FREE package of programs for inferring phylogenies and
carrying out certain related tasks. At present it contains 28
programs, which carry out different algorithms on different kinds of
data. The programs in the package are:
---------- Programs for molecular sequence data ----------
PROTPARS Protein parsimony
DNAPARS Parsimony method for DNA
DNAMOVE Interactive DNA parsimony
DNAPENNY Branch and bound for DNA
DNABOOT Bootstraps DNA parsimony
DNACOMP Compatibility for DNA
DNAINVAR Phylogenetic invariants
DNAML Maximum likelihood method
DNAMLK DNAML with molecular clock
DNADIST Distances from sequences
RESTML ML for restriction sites
----------- Programs for distance matrix data ------------
FITCH Fitch-Margoliash and least-squares methods
KITSCH Fitch-Margoliash and least squares methods with
evolutionary clock
--- Programs for gene frequencies and continuous characters --
CONTML Maximum likelihood method
GENDIST Computes genetic distances
------------- Programs for discrete state data -----------
MIX Wagner, Camin-Sokal, and mixed parsimony criteria
MOVE Interactive Wagner, C-S, mixed parsimony program
PENNY Finds all most parsimonious trees by branch-and-bound
BOOT Bootstrap confidence interval on mixed parsimony methods
DOLLOP, DOLMOVE, DOLPENNY, DOLBOOT same as preceding four
programs, but for the Dollo and polymorphism parsimony
criteria
CLIQUE Compatibility method
FACTOR recode multistate characters
---- Programs for plotting trees and consensus trees ----
DRAWGRAM Draws cladograms and phenograms on screens, plotters and
printers
DRAWTREE Draws unrooted phylogenies on screens, plotters and
printers
CONSENSE Majority-rule and strict consensus trees
The package includes extensive documentation files that provide the
information necessary to use and modify the programs.
COMPATIBILITY: The programs are written in a very standard subset of
Pascal, a language that is available on most computers (including
microcomputers). The programs require only trivial modifications to
run on most machines: for example they work with only minor
modifications with Turbo Pascal, and without modifications on VAX
VMS Pascal. Pascal source code is distributed in the regular version
of PHYLIP: compiled object code is not. To use that version, you
must have a Pascal compiler.
DISKETTE DISTRIBUTION: The package is distributed in a variety of
microcomputer diskette formats. You should send FORMATTED
diskettes, which I will return with the package written on them.
Unfortunately, I cannot write any Apple formats. See below for how
many diskettes to send. The programs on the magnetic tape or
electronic network versions may of course also be moved to
microcomputers using a terminal program.
PRECOMPILED VERSIONS: Precompiled executable programs for PCDOS
systems are available from me. Specify the "PCDOS executable
version" and send the number of extra diskettes indicated below.
An Apple Macintosh version with precompiled code is available from
Willem Ellis, Instituut voor Taxonomische Zoologie, Zoologisch
Museum, Universiteit van Amsterdam, Plantage Middenlaan 64, 1018DH
Amsterdam, Netherlands, who asks that you send 5 800K diskettes.
HOW MANY DISKETTES TO SEND: The following table shows for different
PCDOS formats how many diskettes to send, and how many extra
diskettes to send for the PCDOS executable version:
Diskette size Density For source code For executables, send
in addition
3.5 inch 1.44 Mb 2 1
5.25 inch 1.2 Mb 2 2
3.5 inch 720 Kb 4 2
5.25 inch 360 Kb 7 4
Some other formats are also available. You MUST tell me EXACTLY
which of these formats you need. The diskettes MUST be formatted by
you before being sent to me. Sending an extra diskette may be
helpful.
NETWORK DISTRIBUTION: The package is also available by distribution
of the files directly over electronic networks, and by anonymous ftp
from evolution.genetics.washington.edu. Contact me by electronic
mail for details.
TAPE DISTRIBUTION: The programs are also distributed on a magnetic
tape provided by you (which should be a small tape and need only be
able to hold two megabytes) in the following format: 9-track, ASCII,
odd parity, unlabelled, 6250 bpi (unless otherwise indicated).
Logical record: 80 bytes, physical record: 3200 bytes (i.e. blocking
factor 40). There are a total of 71 files. The first one describes
the contents of the package.
POLICIES: The package is distributed free. I do not make it
available or support it in South Africa. The package will be
written on the diskettes or tape, which will be mailed back. They
can be sent to:
Joe Felsenstein
Electronic mail addresses: Department of Genetics SK-50
Internet: joe@genetics.washington.edu University of Washington
Bitnet/EARN: felsenst@uwavm Seattle, Washington 98195
UUCP: uw-beaver!evolution.genetics!joe U.S.A.
===================== End of Phylip Info. Sheet ====================
Felsenstein, J. (1985) Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39, 783-791.
Feng, D.-F. and Doolittle, R.F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J.Mol.Evol. 25, 351-360.
Gotoh, O. (1982) An improved algorithm for matching biological sequences. J.Mol.Biol. 162, 705-708.
Gribskov, M., McLachlan, A.D. and Eisenberg, D. (1987) Profile analysis: detection of distantly related proteins. PNAS USA 84, 4355-4358.
Higgins, D.G. and Sharp, P.M. (1988) CLUSTAL: a package for performing multiple sequence alignments on a microcomputer. Gene 73, 237-244.
Higgins, D.G. and Sharp, P.M. (1989) Fast and sensitive multiple sequence alignments on a microcomputer. CABIOS 5, 151-153.
Kimura, M. (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16, 111-120.
Kimura, M. (1983) The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge, England.
Li, W.-H., Wu, C.-I. and Luo, C.-C. (1985) A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol.Biol.Evol. 2, 150-174.
Myers, E.W. and Miller, W. (1988) Optimal alignments in linear space. CABIOS 4, 11-17.
Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence comparison. PNAS USA 85, 2444-2448.
Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol.Biol.Evol. 4, 406-425.
Sneath, P.H.A. and Sokal, R.R. (1973) Numerical Taxonomy. Freeman, San Francisco.
Sokal, R.R. and Michener, C.D. (1958) A statistical method for evaluating systematic relationships. Univ.Kansas Sci.Bull. 38, 1409-1438.
Vingron, M. and Argos, P. (1991) Motif recognition and alignment for many sequences by comparison of dot matrices. J.Mol.Biol. 218, 33-43.
Wilbur, W.J. and Lipman, D.J. (1983) Rapid similarity searches of nucleic acid and protein data banks. PNAS USA 80, 726-730.