'96 BC/BP 378

Week 11

Multiple Sequence Analysis: How to work with more than just two sequences at a time. The power and sensitivity of sequence based computational methods dramatically increases with the addition of more data. The GCG program Pileup will be used to create a multiple sequence alignment and then that alignment will be analyzed with the Profile suite.

Author:

Steven M. Thompson

Introduction

The power and sensitivity of sequence-based computational methods dramatically increases with the addition of more data. As you've seen in pair-wise comparisons, those areas most resistant to change are functionally the most important to the molecule. With increased dataset size, the patterns of conservation become ever more clear. But how does one work with more than just two sequences at a time? You could painstakingly align all your sequences manually using some type of editor, and many people do just that, but some type of automated solution is desirable, at least as a starting point for manual alignment. However, solving the dynamic programming algorithm for more than just two sequences rapidly becomes intractable, because computational needs increase exponentially with the size of the dataset (complexity = [sequence length]^[number of sequences]). Mathematically this is an N-dimensional matrix, quite complex indeed. One program, MSA (version 2.0, 1995), does attempt to globally solve this equation; however, the algorithm's complexity precludes its use in most situations. MSA is available through the Internet at a few supercomputer centers, but it will not be explored in this week's exercise.
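
To get a feel for how quickly the full N-dimensional problem grows, the following minimal sketch simply counts matrix cells as [sequence length] raised to the number of sequences. The function name and the choice of 400 residues are illustrative assumptions, not part of any GCG program.

```python
def dp_cells(sequence_length: int, n_sequences: int) -> int:
    """Cells in a full N-dimensional dynamic programming matrix,
    using the handout's estimate: length ** number of sequences."""
    return sequence_length ** n_sequences


# Assumed example: sequences of about 400 residues, as for EF-1/Tu.
for n in (2, 3, 5, 25):
    print(n, "sequences:", f"{dp_cells(400, n):.2e}", "cells")
```

Two sequences need 160,000 cells; every added sequence multiplies that count by another factor of 400, which is why the heuristics described next are needed.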

Several heuristics have been employed over the years to simplify the complexity of the problem. One way to still globally solve the algorithm and yet reduce its complexity is to restrict the search space to only the most conserved 'local' portions of all the sequences involved. This approach is used by the program PIMA (version 1.4, 1995). This program is also available on the Internet; however, we will not be using it this week either. The most commonly used approach to the problem is known as the pairwise, progressive dynamic programming solution. This variation of the dynamic programming algorithm generates a global alignment, but restricts its search space at any one time to a local neighborhood of the full length of two sequences. The pairwise, progressive solution is implemented in several programs including Dr. Des Higgins' ClustalW (1994) and the GCG program PileUp. Both programs insert gaps to align the full length of a sequence set to produce a multiple sequence alignment.

As you saw with pair-wise alignments and searching, all of this is much easier with protein sequences than with nucleotides. If you are forced to align nucleotides the whole process becomes much more difficult. Therefore, just like in database searching, if you are dealing with coding sequences, translate the nucleotide sequences to their protein counterparts before performing further analyses, including multiple sequence alignment. If one is required to align nucleotides because the region does not code for a protein, then automated methods may be able to help as a starting point, but are certainly not guaranteed to come up with a biologically correct alignment. The resulting alignment will probably have to be extensively edited, if it works at all. To help assure the reliability of nucleotide alignments always use comparative approaches. Look for conserved structural and functional sites to help guide your judgement. In ribosomal RNA alignments researchers have successfully used the conservation of covarying sites to assist in this process. That is, as one base in a stem structure changes, the corresponding Watson-Crick paired base will change in a corresponding manner. This process has been used extensively by the Ribosomal Database Project at the University of Illinois, Urbana Campus, to help guide the construction of their rRNA alignments and structures (http://rdp.life.uiuc.edu).
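
The idea of covarying sites can be sketched in a few lines. This is only an illustration under stated assumptions: the three RNA fragments are invented, the function name is hypothetical, and G-U wobble pairs are ignored for simplicity.

```python
# Strict Watson-Crick pairs only (assumption: wobble pairs are not counted).
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def covarying(rows, i, j):
    """True if columns i and j (0-based) form a Watson-Crick pair in every
    sequence AND more than one kind of pair is seen across the alignment."""
    pairs = [(row[i], row[j]) for row in rows]
    return all(p in WATSON_CRICK for p in pairs) and len(set(pairs)) > 1


# Invented aligned RNA fragments, not real rRNA data.
rows = ["GGAUCC", "GCAUGC", "GAAUUC"]
print(covarying(rows, 1, 4))  # True: G-C, C-G and A-U all keep the pairing
print(covarying(rows, 0, 5))  # False: always G-C, conserved but not covarying
```

Columns that pass this kind of test support a proposed stem, and so give independent evidence that the two columns are aligned correctly.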

Another powerful approach that should be utilized if at all possible is the Profile suite (Gribskov, et al., 1987). This strategy works best when one has prepared and refined a multiple sequence alignment of significantly similar sequences or regions within sequences. Profile searching involves forming a "profile" from an alignment of related sequences and then searching the databases with that profile. Profile searching is tremendously powerful and should be pursued whenever possible. A very appropriate strategy is to find genes similar to a newly sequenced gene using traditional database searching techniques and then align all of the significantly similar proteins or protein domains. The aligned sequences can then be run through the Profile package to generate a profile of the family. Often Profile analysis can show features that are not obvious from individual members. A distinct advantage is that in further manipulations and database searches, evolutionary issues are considered by virtue of the Profile algorithms. Gaps are penalized more heavily in conserved areas than they are in variable regions, and the more highly conserved a residue is, the more important it becomes. Furthermore, any generated consensus sequences are not based merely on the positional frequency of particular residues but rather utilize the evolutionary conservation of substitutions based on the Dayhoff PAM table (by default; other substitution matrices can be specified). Therefore, the resultant consensus residues are the most evolutionarily conserved rather than just statistically the most frequent. This can mean much more to us than an ordinary consensus and is especially appropriate in the design of hybridization and PCR probes for unknown sequences where data is available in related species.
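
The difference between a frequency consensus and a substitution-based consensus can be illustrated with a toy example. This is a sketch only: the scoring function below is an invented stand-in for the Dayhoff table (1.5 for identity, 0.8 within a chemical group, -0.5 otherwise), the function names are hypothetical, gaps are not handled, and the real Profile programs do considerably more.

```python
from collections import Counter

# Toy similarity groups standing in for a real substitution table (assumption).
GROUPS = [set("ILVM"), set("DE"), set("KR"), set("ST"), set("FYW")]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a: str, b: str) -> float:
    """Hypothetical scores: identity 1.5, same group 0.8, otherwise -0.5."""
    if a == b:
        return 1.5
    if any(a in g and b in g for g in GROUPS):
        return 0.8
    return -0.5


def frequency_consensus(column: str) -> str:
    """Most common residue in one alignment column."""
    return Counter(column).most_common(1)[0][0]


def profile_consensus(column: str) -> str:
    """Residue with the best average similarity to every residue in the column."""
    def average(residue: str) -> float:
        return sum(toy_score(residue, c) for c in column) / len(column)
    return max(ALPHABET, key=average)


column = "KKKILLVM"  # invented column: three lysines, five hydrophobic residues
print(frequency_consensus(column))  # K
print(profile_consensus(column))    # L
```

Lysine is the single most frequent residue, but most of the column is hydrophobic, so the substitution-weighted consensus picks leucine instead.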

The GCG program PlotSimilarity can be used to assist in probe design by allowing you to visualize the most important, conserved regions of an alignment. It is invaluable for designing phylogenetically specific probes as it clearly localizes areas of high conservation and variability in an alignment. Depending on the dataset that you analyze, any level of phylogenetic specificity can be achieved. To differentiate between universal and specific potential probe sequences, pick areas of high variability in the overall dataset that correspond to areas of high conservation in subset datasets drawn from a single phylogenetic category. One can then use the GCG program Prime, as you saw in Exercise #8, to further test potential probes for common PCR conditions and problems.
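
The kind of curve such a plot shows can be approximated with a sliding window. The sketch below is an assumption-laden simplification: it scores each column by simple pairwise identity rather than a comparison table, the function names are hypothetical, and it is not the PlotSimilarity algorithm itself. The three rows are columns 11 through 30 of the alignment shown later in this handout.

```python
from itertools import combinations


def column_identity(column: str) -> float:
    """Fraction of sequence pairs with the same residue in this column.
    Gap characters ('.') never count as a match. Needs at least two rows."""
    pairs = list(combinations(column, 2))
    same = sum(1 for a, b in pairs if a == b and a != ".")
    return same / len(pairs)


def windowed_similarity(rows: list[str], window: int = 5) -> list[float]:
    """Running mean of per-column identity along the alignment."""
    columns = ["".join(chars) for chars in zip(*rows)]
    scores = [column_identity(c) for c in columns]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


rows = [
    "PHVNVGTIGHVDHGKTTLTA",  # Eftu_Ecoli, columns 11-30
    "THINIVVIGHVDSGKSTTTG",  # Ef11_Human, columns 11-30
    "PHLNLIVIGQVDHGKSTLVG",  # Ef1a_Sulso, columns 11-30
]
curve = windowed_similarity(rows)
best = max(range(len(curve)), key=curve.__getitem__)
print("most conserved window starts at offset", best, "of this fragment")
```

Peaks in the curve mark candidate universal probe regions; running the same function on a subset of rows from a single branch shows where that branch alone is conserved.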

Exercise for week 11:

In this week's exercise you will use the GCG program PileUp to align a data set containing a protein with representatives in all the branches of cellular life. This ubiquitous protein is the alpha subunit of Elongation Factor-1 (EF-1) in Eukaryota and Archaebacteria, known as Elongation Factor Tu in Eubacteria. It is crucial to the universal process of protein biosynthesis and promotes the GTP-dependent binding of aminoacyl-tRNA to the A-site of the intact ribosome. GTP is hydrolyzed to GDP in the process. Elongation Factor 1/Tu has guanine nucleotide, ribosome, and aminoacyl-tRNA binding sites. There are three distinct types of elongation factors that all work together to help perform the vital function of protein biosynthesis. In Eubacteria and Eukaryota they have the following names (the nomenclature in Archaebacteria has not been completely worked out and is often contradictory):


  Eukaryota    Eubacteria   Function
  ---------------------------------------------------------------------------
  EF-1 alpha   EF-Tu        Binds GTP and an aminoacyl-tRNA; delivers the
                            latter to the A site of ribosomes.
  EF-1 beta    EF-Ts        Interacts with EF-1 alpha/EF-Tu to displace GDP
                            and thus allows the regeneration of
                            GTP-EF-1 alpha.
  EF-2         EF-G         Binds GTP and peptidyl-tRNA and translocates
                            the latter from the A site to the P site.

In EF-1, a specific region is involved in a conformational change mediated by the hydrolysis of GTP to GDP. This region is conserved in both EF-1/EF-Tu and EF-2/EF-G and seems to be typical of GTP-dependent proteins which bind non-initiator tRNAs to the ribosome.

In E. coli, EF-Tu is encoded by duplicated loci, tufA and tufB, located about 15 minutes apart on the chromosome at positions 74.92 and 90.02 (ECDC). In humans at least twenty loci on seven different chromosomes demonstrate homology to the gene; however, only two of them are potentially active, and the remainder appear to be retropseudogenes (Madsen, et al. 1990). It is encoded in the nuclear, mitochondrial, and chloroplast genomes in eukaryotes and is a globular, cytoplasmic enzyme in all life forms.

Partial E. coli structures have been resolved and deposited in the Protein Data Bank (1EFM and 1ETU) and the complete Thermus aquaticus structure has been determined (1EFT). All three structures show the protein in complex with its nucleotide ligand. The Thermus aquaticus structure is shown below. Notice that half of the protein has well defined alpha helices and the rest is rather unordered coils. GTP/GDP fits right down in amongst all the helices in the pocket.

Using the E. coli numbering scheme the guanine nucleotide binding site involves the following regions: residues 18 to 25, residues 80 to 84, and residues 135 to 138. Residue 8 is associated with aminoacyl-tRNA binding. The six defined helices occur from residue 24 through 39, 84 through 92, 113 through 125, 143 through 160, 174 through 179, and 183 through 198.
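
Residue numbers like these refer to the ungapped E. coli sequence, so they have to be translated into alignment columns once gaps are inserted. The following sketch does that for the Eftu_Ecoli row of the PileUp output shown later in this handout (columns 1 through 100). The function names are hypothetical, and it assumes residue 1 is the first non-gap symbol of the row.

```python
# Eftu_Ecoli row, alignment columns 1-100, copied from the PileUp output.
ECOLI_ROW = (
    ".SKEKFERTK" "PHVNVGTIGH" "VDHGKTTLTA" "AITTVLA..." ".........."
    "KTYG..GAAR" "AFDQIDNAPE" "EKARGITINT" "SHVEYDTPTR" "HYAHVDCPGH"
)


def residue_to_column(row: str, residue_number: int, gap: str = ".") -> int:
    """Return the 1-based alignment column that holds the given 1-based residue."""
    seen = 0
    for column, symbol in enumerate(row, start=1):
        if symbol != gap:
            seen += 1
            if seen == residue_number:
                return column
    raise ValueError("residue number lies beyond the end of this row")


def region(row: str, first: int, last: int):
    """Columns and aligned text (gaps included) spanning residues first..last."""
    start = residue_to_column(row, first)
    end = residue_to_column(row, last)
    return start, end, row[start - 1:end]


print(region(ECOLI_ROW, 18, 25))  # (19, 26, 'GHVDHGKT')
print(region(ECOLI_ROW, 80, 84))  # (96, 100, 'DCPGH')
```

Reading down the same columns in the other rows of the alignment shows how well each part of the guanine nucleotide binding site is conserved across the dataset.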

Given a particular sequence of interest, one can use any text search tool, such as GCG's StringSearch, or NCBI's Entrez, or tools on the World Wide Web, to find that entry's name in a sequence database. After the entry has been identified a natural next step is to use some type of a similarity matching program, such as FastA or BLAST to help prepare a list of sequences to be aligned. One of the more difficult aspects of multiple alignment is knowing what sequences you should attempt it with. Any list from any program run will need to be edited to include only those things that actually should be aligned in a set. You do not want to try to align "apples and oranges." To begin this exercise I will illustrate the techniques with a list file prepared by me that points to the appropriate entries in SwissProtein that will be aligned. I have tried to get as broad a representation across all of cellular life as possible while still keeping it within the practical limits of an afternoon session on the computer. You are to read and follow along with my example, but do not run the analyses with the representative list in your own account -- it is for use as an example only! You will be running the alignment portion of the exercise after you've seen how it works with the most similar EF-1/Tu sequences that you can find to your favorite branch off the tree of life. And then for the really gung-ho of you, you can use the Elongation Factor 2/G similarly for extra credit.

After an alignment is prepared, what are the types of things that can be done with it? In this week's exercise you will prepare a profile using your EF-1/Tu subset alignment. As described in the introduction, profile analysis can provide the most sensitive, albeit extremely computationally intensive, database similarity search possible. Profile searching with EF-1/Tu should allow us to find all other guanine nucleotide binding sequences in the protein database. It will also point out several other interesting comparisons.

We can visualize the areas of an alignment that profile searching puts the most emphasis on. These are the most conserved areas of an alignment, and thus functionally the most important. Realize that in addition to the primary sequence conservation seen in these regions, structure and function are also conserved. We will use the GCG program PlotSimilarity and the Extended GCG (EGCG) program PrettyPlot to see these crucial regions of our alignment.

And finally, in next week's exercise, we can use multiple sequence alignments to infer phylogeny. Based on the explicit assertion of homologous positions in an alignment, several available algorithms can estimate the most reasonable evolutionary tree for that alignment.

1) Activate the machine, connect to ribozyme and log into your account.

By this time you should know how to activate the machine you want to use, make connections with ribozyme, and log into your account. If you still need help with these functions, refer to the beginning of the exercises for weeks 2 and 3 for step by step instructions.

2) Move to this week's subdirectory and copy the necessary files to it.

% cd eleven

Now move over all the files needed to do this week's exercise. They are located in the directory location $UGRAD_DIR/week11.

% cp $UGRAD_DIR/week11/* .

3) Run the demo that describes this week's activities.

This week's demo illustrates multiple sequence alignment analysis with a very large group of extremely important proteins. These are the G-Protein coupled, transmembrane receptors. These are also known as TM7 receptors because they all have seven alpha helices that cross back and forth across the cell membrane. They are vital to many regulatory and sensory pathways including several hormonal systems and sight and smell. This demo will show a multiple sequence alignment of a subset of these sequences and illustrate several of the powerful types of analyses that can be done with them. Pay particular attention to how much more robust structural analyses such as secondary structure prediction can be when more than just one sequence is analyzed. To launch this week's demo issue the following command:

% demo11

Several of the routines explored in this week's exercise require graphics configuration. Therefore, be sure to use setplot to designate your graphics configuration after activating gcg and before beginning the exercise.

4) Multiple Sequence Alignment -- GCG's PileUp.

First let's look at the list file that I have prepared as an example. Use the more command:

% more representative.list

This is a list of 25 representative Elongation Factor 1 Alpha (Tu in
Eubacteria) protein sequences.  This list spans all of cellular life and
attempts to collect sequences from as broad a phylogenetic spectrum as
available in SwissProtein version 30.     ..

SwissProtein:EF10_XENLA
SwissProtein:EF11_DROME
SwissProtein:EF11_HUMAN
SwissProtein:EF1A_ARATH
SwissProtein:EF1A_DICDI
SwissProtein:EF1A_ENTHI
SwissProtein:EF1A_EUGGR
SwissProtein:EF1A_GIALA
SwissProtein:EF1A_ONCVO
SwissProtein:EF1A_PLAFK
SwissProtein:EF1A_PYRWO
SwissProtein:EF1A_SULSO
SwissProtein:EF1A_TETPY
SwissProtein:EF1A_THEAC
SwissProtein:EF1A_WHEAT
SwissProtein:EF1A_YEAST
SwissProtein:EFTU_ANANI
SwissProtein:EFTU_CHLTR
SwissProtein:EFTU_ECOLI
SwissProtein:EFTU_HALMA
SwissProtein:EFTU_METVA
SwissProtein:EFTU_MYCGA
SwissProtein:EFTU_MYCTU
SwissProtein:EFTU_THEAQ
SwissProtein:EFTU_THEMA
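
The layout of this list file is simple: free text up to a line ending in two periods, then one entry specification per line. A minimal sketch of reading such a file outside of GCG follows. The function name is hypothetical, and the parsing rules are inferred only from the listing above, not from the GCG documentation.

```python
def read_list_file(text: str) -> list[str]:
    """Return the entry specifications that follow the '..' divider."""
    entries = []
    past_divider = False
    for line in text.splitlines():
        stripped = line.strip()
        if not past_divider:
            # Everything up to and including the line ending in '..' is a comment.
            past_divider = stripped.endswith("..")
            continue
        if stripped:
            entries.append(stripped.split()[0])
    return entries


example = """This list spans all of cellular life.     ..

SwissProtein:EF1A_YEAST
SwissProtein:EFTU_ECOLI
"""
print(read_list_file(example))
# ['SwissProtein:EF1A_YEAST', 'SwissProtein:EFTU_ECOLI']
```

A helper like this makes it easy to pull out just the entries from one phylogenetic branch when you build your own subset list.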

Before reading through my example below, take a look at the members of representative.list. Investigate any that you are curious about by using the GCG command typedata -reference to read the documentation for that entry. Pick one that you would like to work with further. Submit that entry to FastA using batch mode to find the sequences most similar to it in the SwissProtein database. Use the following options: -noincrease, -optall, -noalign, -nohistogram, and -batch. Limit your output list to twenty (20). Of course this search should find all EF-1/Tu sequences. The point is you want to prepare an alignment of only the most similar sequences within one of the major phylogenetic branches (Archaebacteria, Eubacteria, Protist, Plant, Fungus, or Animal) represented in representative.list, not all the branches together as I do in the example below. Therefore, follow the screen trace to submit your favorite entry from representative.list to FastA (here I will use the yeast entry as an example of something from the Fungus branch; pick your own favorite):

% fasta -noincrease -optall -noalign -nohistogram -batch

FASTA does a Pearson and Lipman search for similarity between a query
sequence and any group of sequences.  For nucleotide database searches,
FASTA is more sensitive than BLAST.  This program has a known bug
which produces long output files when queried with a relatively common
sequence.  Use the command -noalign to restrict the length of the output
file in these cases when running in batch.

 FASTA with what query sequence ?  sw:ef1a_yeast

                  Begin (* 1 *) ?  <rtn>
                End (*   458 *) ?  <rtn>

 Search for query in what sequence(s) (* SwissProt:* *) ? <rtn>

 What word size (* 2 *) ?  <rtn>

 List how many best scores (* 40 *) ?  20

 What should I call the output file (* ef1a_yeast.fasta *) ? <rtn>

 ** fasta will run as a batch or at job.

 ** fasta was submitted using the command:
    "  SUBMIT-NOPRINT-NOTIFY-QUEUE  "

Job FASTA_112556 (queue RIBOZYME, entry 395) started on RIBOZYME

After this job has been submitted, run FastA with your favorite EF-1/Tu sequence from representative.list against the NRL_3D database. Because NRL_3D is quite small, you can do this search interactively. Limit your output to the top ten sequences this time. Remember this is a database of all of the sequences available in Brookhaven's three-dimensional PDB. Significantly similar sequences here can help guide homology modelling efforts and may lead you into whole new areas of investigation. I am only going to show a screen trace of the command line for this process and expect you to pull it off on your own. The exercise report form will ask about the results of this search so I do not suggest blowing it off.

% fasta -optall sw:ef1a_yeast nrl_3d:*

Now read through my example of PileUp below with representative.list to see the progressive, pairwise nature of this implementation of dynamic programming. The program is launched with the command pileup -check to read about and use the available options. The batch option is available in PileUp so that one doesn't have to wait for the program to finish before going on. Other options I often recommend include the optional symbol comparison matrix GenMoreData:Blosum62.Cmp and a more appropriate name for the program's cluster dendrogram than Pileup.Figure. If you use an alternate symbol comparison matrix, then be sure to correspondingly change the gap imposition and gap extension penalties. In the following case, since the default PAM matrix in GCG uses an identity value of 1.5 and the Blosum62 matrix uses a range of identities from 4 to 11, I increase the penalties by a factor of three to four. The Blosum62 matrix seems to work particularly well when the sequences cover a broad phylogenetic spectrum as they do here.
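
The penalty adjustment is plain arithmetic, shown here as a small sketch. The function name is hypothetical; the defaults (3.0 and 0.1) and the factor of four are taken from the PileUp session in this handout.

```python
def scale_gap_penalties(gap_weight: float, length_weight: float, factor: float):
    """Multiply both default gap penalties by the same factor when moving
    to a comparison table with larger scores."""
    return round(gap_weight * factor, 2), round(length_weight * factor, 2)


# PileUp defaults for the PAM-based table, scaled by four for Blosum62.
print(scale_gap_penalties(3.0, 0.1, 4))  # (12.0, 0.4)
print(scale_gap_penalties(3.0, 0.1, 3))  # (9.0, 0.3)
```

The first pair, 12.0 and 0.40, is what is entered at the gap creation and gap extension prompts in the example session.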

% pileup -check

PileUp creates a multiple sequence alignment from a group of related
sequences using progressive, pairwise alignments.  It can also plot a
tree showing the clustering relationships used to create the alignment.

Minimal Syntax: % PILEUP -[INfile=]@Hsp70.List -Default

Prompted Parameters:

-GAPweight=3.0          gap creation penalty
-LENgthweight=0.1       gap extension penalty
-DENsity=20.0           number of sequences per 100 pu in the dendrogram
[-OUTfile1=]Hsp70.MSF   output file for multiple sequence alignment

Local Data Files:-DATa=PileUpPep.Cmp  scoring matrix for peptides
                 -DATa=PileUpDNA.Cmp  scoring matrix for nucleic acids

Optional Parameters:

-BEGin=1     sets beginning position for every sequence to be aligned
-END=100     sets ending position for every sequence to be aligned
-ENDWeight   penalizes end gaps like other gaps
-HIGhroad    selects "top" alignment path for equally optimal gaps
-LOWroad     selects "bottom" alignment path for equally optimal gaps
 Press q to quit or <Return> for more:
-MAXSeg=5000   sets maximum segment length for every input sequence
-MAXGap=2000   sets maximum combined length of all gaps added to a sequence
-NOSORt        presents output sequences in the same order as input
-LINesize=50   sets the number of sequence symbols per line
-BLOcksize=10  sets the number of sequence symbols per block
-DEGap         removes gap characters ('.') from the input sequences
-NOPLOt        suppresses plot of clustering relationships
-NOMONitor     suppresses screen trace of each alignment
-NOSUMmary     suppresses screen summary at the end of the program
-BATch         submits program to the batch queue

All GCG graphics programs accept these and other switches. See the Using
Graphics chapter of the USERS GUIDE for descriptions.

-FIGure[=FileName]  stores plot in a file for later input to FIGURE
-FONT=3             draws all text on the plot using font 3
-COLor=1            draws entire plot with pen in stall 1
-SCAle=1.2          enlarges the plot by 20 percent (zoom in)
-XPAN=10.0          moves plot to the right 10 platen units (pan right)
-YPAN=10.0          moves plot up 10 platen units (pan up)
-PORtrait           rotates plot 90 degrees

 Add what to the command line ?  -data=genmoredata:blosum62.cmp -figure=ef.pileup -batch

 PileUp of what sequences ?  @representative.list

I answer @representative.list to indicate that I want PileUp to work on a listing of file names called representative.list rather than individual sequences within my directory. The @ symbol is necessary anytime you specify a list file to a GCG program. Next the program finds the sequences in the database and brings them over to work with. Then it asks for the gap parameter values. Be careful here. If you specify an alternate data matrix, as I do here, you need to adjust the defaults to reflect the new table's scoring system as discussed above. I then give the output an appropriate filename. Finally, PileUp compares each sequence to every other sequence and then aligns them in the order determined by that similarity. The process is shown below for my example:

   1      Ef10_Xenla   462 aa
   2      Ef11_Drome   463 aa
   3      Ef11_Human   462 aa
   4      Ef1a_Arath   449 aa
   5      Ef1a_Dicdi   456 aa
   6      Ef1a_Enthi   430 aa
   7      Ef1a_Euggr   445 aa
   8      Ef1a_Giala   396 aa
   9      Ef1a_Oncvo   464 aa
  10      Ef1a_Plafk   443 aa
  11      Ef1a_Pyrwo   430 aa
  12      Ef1a_Sulso   435 aa
  13      Ef1a_Tetpy   435 aa
  14      Ef1a_Theac   424 aa
  15      Ef1a_Wheat   447 aa
  16      Ef1a_Yeast   458 aa
  17      Eftu_Anani   409 aa
  18      Eftu_Chltr   393 aa
  19      Eftu_Ecoli   393 aa
  20      Eftu_Halma   420 aa
  21      Eftu_Metva   428 aa
  22      Eftu_Mycga   394 aa
  23      Eftu_Myctu   396 aa
  24      Eftu_Theaq   405 aa
  25      Eftu_Thema   400 aa

 *** I read your local data file "Genmoredata:blosum62.cmp". ***

 What is the gap creation penalty (* 3.00 *) ?  12.0

 What is the gap extension penalty (* 0.10 *) ?  0.40

 The minimum density for a one-page plot is 18.0 sequences-100 platen units.
 What density do you want (* 18.0 *) ?  <rtn>

 What should I call the output file name (* representative.msf *) ? ef.msf

If you run the program in batch mode, then the following screen trace is produced:

 ** pileup will run as a batch or at job.

 ** pileup was submitted using the command:
    "  SUBMIT-NOPRINT-NOTIFY-QUEUE  "

Job PILEUP_102723 (queue RIBOZYME, entry 390) started on RIBOZYME

Otherwise, in an interactive (nonbatch) process, you'll get the following, shown in an abridged fashion:

 Determining pairwise similarity scores...

   1   x     2       4.45
   1   x     3       5.06
   1   x     4       4.03
   1   x     5       3.99
   1   x     6       4.09
   1   x     7       4.09
   1   x     8       3.43
   1   x     9       4.41
   1   x    10       3.73
   1   x    11       2.74
   1   x    12       2.83
   1   x    13       4.03
   1   x    14       2.95
   1   x    15       4.09
   1   x    16       4.28
   1   x    17       1.03
   1   x    18       0.91
   1   x    19       1.03
   1   x    20       2.61
   1   x    21       2.73
   1   x    22       1.07
   1   x    23       1.12
   1   x    24       1.18
   1   x    25       1.20
///////////////////////////////////////////////////////////////////////
  18   x    19       3.46
  18   x    20       1.32
  18   x    21       1.23
  18   x    22       3.25
  18   x    23       3.27
  18   x    24       3.42
  18   x    25       3.32
  19   x    20       1.46
  19   x    21       1.24
  19   x    22       3.81
  19   x    23       3.90
  19   x    24       3.85
  19   x    25       3.72
  20   x    21       3.27
  20   x    22       1.49
  20   x    23       1.52
  20   x    24       1.49
  20   x    25       1.52
  21   x    22       1.29
  21   x    23       1.27
  21   x    24       1.35
  21   x    25       1.34
  22   x    23       3.53
  22   x    24       3.71
  22   x    25       3.60
  23   x    24       3.74
  23   x    25       3.45
  24   x    25       3.77

 Aligning...

   1     .......................-...
   2     ......................-...
   3     .......................-.
         .......................-...
   4     .......................-.
         .......................-...
   5     ......................-.
         ......................-...
   6     ......................-.
         ......................-...
   7     .....................-...
   8     .....................-.
         .....................-...
   9     .......................-.
         .......................-...
  10     ...................-..
  11     ......................-.
         ......................-...
  12     ....................-...
  13     ......................-.
         ......................-...
  14     ....................-.
         ....................-..
  15     ...................-.
         ...................-..
  16     ...................-.
         ...................-..
  17     ...................-..
  18     ...................-.
         ...................-..
  19     .....................-...
  20     .....................-...
  21     .....................-.
         .....................-...
  22     .....................-.
         .....................-...
  23     ......................-.
         ......................-...
  24     .......................-...

 FIGURE instructions are now being written into ef.pileup.

        Total sequences:         25
       Alignment length:        488
               CPU time:   09:37.35

            Output file:disk3/thompson/bc378/week_11/ef.msf

Notice how the program first compares every sequence with every other one. This is the pairwise nature of the program. It then progressively merges the sequences into the alignment in the order of determined similarity, from most to least.
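
How pairwise scores can be turned into a merging order is sketched below with simple average-linkage clustering, using ten of the scores from the screen trace above. This is an illustration under assumptions: the function names are hypothetical, and PileUp's own clustering may differ in detail.

```python
# Pairwise similarity scores copied from the screen trace
# (18 = Eftu_Chltr, 19 = Eftu_Ecoli, 20 = Eftu_Halma,
#  21 = Eftu_Metva, 22 = Eftu_Mycga).
SCORES = {
    (18, 19): 3.46, (18, 20): 1.32, (18, 21): 1.23, (18, 22): 3.25,
    (19, 20): 1.46, (19, 21): 1.24, (19, 22): 3.81,
    (20, 21): 3.27, (20, 22): 1.49, (21, 22): 1.29,
}


def similarity(a: int, b: int) -> float:
    return SCORES[(min(a, b), max(a, b))]


def average_linkage(cluster_a: tuple, cluster_b: tuple) -> float:
    """Mean similarity over all cross-cluster sequence pairs."""
    pairs = [(a, b) for a in cluster_a for b in cluster_b]
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def merge_order(members):
    """Repeatedly join the two most similar clusters; return the joins in order."""
    clusters = [(m,) for m in members]
    merges = []
    while len(clusters) > 1:
        best = max(
            ((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
            key=lambda pair: average_linkage(*pair),
        )
        clusters = [c for c in clusters if c not in best]
        clusters.append(best[0] + best[1])
        merges.append(best)
    return merges


for left, right in merge_order([18, 19, 20, 21, 22]):
    print(left, "+", right)
```

With these scores the E. coli and Mycoplasma sequences join first (3.81), Chlamydia joins them next, and the two archaebacterial sequences (Halma and Metva) pair with each other before the two groups are finally merged.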

The abridged output file from the representative.list alignment follows below. Notice the interleaved character of the sequences, yet they all have unique identities, addressable by using their MSF filename together with their own name in braces, {}.
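
Because the sequences are interleaved in blocks, pulling one complete aligned sequence out of an MSF file by hand is tedious. The following sketch reassembles them. It is based only on the layout visible in this handout (a `//` divider followed by name-prefixed blocks under a numeric ruler); the function name is hypothetical, and the sample is a cut-down piece shaped like the real output, with narrower blocks.

```python
def parse_msf(text: str) -> dict[str, str]:
    """Join the interleaved blocks after the '//' divider into one string per name."""
    sequences: dict[str, str] = {}
    in_alignment = False
    for line in text.splitlines():
        if line.strip() == "//":
            in_alignment = True
            continue
        fields = line.split()
        # Skip the header, blank lines, and the numeric ruler above each block.
        if not in_alignment or len(fields) < 2 or fields[0].isdigit():
            continue
        name, chunks = fields[0], fields[1:]
        sequences[name] = sequences.get(name, "") + "".join(chunks)
    return sequences


sample = """
 Name: Eftu_Ecoli       Len:   488  Check:  910  Weight:  1.00
//
            1                   20
Eftu_Ecoli  .SKEKFERTK PHVNVGTIGH
Ef11_Human  .....MGKEK THINIVVIGH

            21                  40
Eftu_Ecoli  VDHGKTTLTA AITTVLA...
Ef11_Human  VDSGKSTTTG HLIYKCGGID
"""
aligned = parse_msf(sample)
print(aligned["Eftu_Ecoli"])  # .SKEKFERTKPHVNVGTIGHVDHGKTTLTAAITTVLA...
```

Every reassembled string has the same length, gaps included, so column i of one sequence lines up with column i of all the others.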

% more ef.msf

PileUp of: @representative.list

 Symbol comparison table: Genmoredata:Blosum62.Cmp  CompCheck: 6430

                   GapWeight: 12.000
             GapLengthWeight: 0.400

 Ef.Msf  MSF: 488  Type: P  October 11, 1995 09:45  Check: 5176 ..

 Name: Eftu_Ecoli       Len:   488  Check:  910  Weight:  1.00
 Name: Eftu_Myctu       Len:   488  Check:  474  Weight:  1.00
 Name: Eftu_Anani       Len:   488  Check: 7907  Weight:  1.00
 Name: Eftu_Theaq       Len:   488  Check: 2261  Weight:  1.00
 Name: Eftu_Mycga       Len:   488  Check: 2294  Weight:  1.00
 Name: Eftu_Thema       Len:   488  Check: 8510  Weight:  1.00
 Name: Eftu_Chltr       Len:   488  Check: 7702  Weight:  1.00
 Name: Ef1a_Arath       Len:   488  Check: 1012  Weight:  1.00
 Name: Ef1a_Wheat       Len:   488  Check: 8651  Weight:  1.00
 Name: Ef1a_Euggr       Len:   488  Check: 4712  Weight:  1.00
 Name: Ef10_Xenla       Len:   488  Check: 5877  Weight:  1.00
 Name: Ef11_Human       Len:   488  Check: 6029  Weight:  1.00
 Name: Ef11_Drome       Len:   488  Check: 4282  Weight:  1.00
 Name: Ef1a_Oncvo       Len:   488  Check: 4819  Weight:  1.00
 Name: Ef1a_Yeast       Len:   488  Check: 9637  Weight:  1.00
 Name: Ef1a_Enthi       Len:   488  Check: 6304  Weight:  1.00
 Name: Ef1a_Tetpy       Len:   488  Check: 5600  Weight:  1.00
 Name: Ef1a_Dicdi       Len:   488  Check: 1674  Weight:  1.00
 Name: Ef1a_Plafk       Len:   488  Check: 2609  Weight:  1.00
 Name: Ef1a_Giala       Len:   488  Check: 5443  Weight:  1.00
 Name: Ef1a_Pyrwo       Len:   488  Check: 2595  Weight:  1.00
 Name: Ef1a_Theac       Len:   488  Check: 5839  Weight:  1.00
 Name: Eftu_Halma       Len:   488  Check: 6928  Weight:  1.00
 Name: Eftu_Metva       Len:   488  Check: 2017  Weight:  1.00
 Name: Ef1a_Sulso       Len:   488  Check: 1090  Weight:  1.00

//

            1                                                   50
Eftu_Ecoli  .SKEKFERTK PHVNVGTIGH VDHGKTTLTA AITTVLA... ..........
Eftu_Myctu  MAKAKFQRTK PHVNIGTIGH VDHGKTTLTA AITKVLH... ..........
Eftu_Anani  MARAKFERTK PHANIGTIGH VDHGKTTLTA AITTVLA... ..........
Eftu_Theaq  .AKGEFIRTK PHVNVGTIGH VDHGKTTLTA ALTYVAA... ..........
Eftu_Mycga  MAKERFDRSK PHVNIGTIGH IDHGKTTLTA AICTVLS... ..........
Eftu_Thema  MAKEKFVRTK PHVNVGTIGH IDHGKSTLTA AITKYLS... ..........
Eftu_Chltr  .SKETFQRNK PHINIGAIGH VDHGRTTLTA AITRTLS... ..........
Ef1a_Arath  .....MGKEK FHINIVVIGH VDSGKSTTTG HLIYKLGGID KRVIERFEKE
Ef1a_Wheat  .....MGKEK THINIVVIGH VDSGKSTTTG HLIYKLGGID KRVIERFEKE
Ef1a_Euggr  .....MGKEK VHISLVVIGH VDSGKSTTTG HLIYKCGGID KRTIEKFEKE
Ef10_Xenla  .....MGKEK THINIVVIGH VDSGKSTTTG HLIYKCGGID KRTIEKFEKE
Ef11_Human  .....MGKEK THINIVVIGH VDSGKSTTTG HLIYKCGGID KRTIEKFEKE
Ef11_Drome  .....MGKEK IHINIVVIGH VDSGKSTTTG HLIYKCGGID KRTIEKFEKE
Ef1a_Oncvo  .....MGKEK THINIVVIGH VDSGKSTTTG HLIYKCGGID KRTIEKFEKE
Ef1a_Yeast  .....MGKEK SHINVVVIGH VDSGKSTTTG HLIYKCGGID KRTIEKFEKE
Ef1a_Enthi  .....MPKEK THINIVVIGH VDSGKSTTTG HLIYKCGGID QRTIEKFEKE
Ef1a_Tetpy  ....MARGDK VHINLVVIGH VDSGKSTTTG HLIYKCGGID KRVIEKFEKE
Ef1a_Dicdi  ..MEFPESEK THINIVVIGH VDAGKSTTTG HLIYKCGGID KRVIEKYEKE
Ef1a_Plafk  .....MGKEK THINLVVIGH VDSGKSTTTG HIIYKLGGID RRTIEKFEKE
Ef1a_Giala  .......... .......... .....STLTG HLIYKCGGID QRTIDEYEKR
Ef1a_Pyrwo  ...MKMPKDK PHVNIVFIGH VDHGKSTTIG RLLYDTGNIP EQIIKKF.EE
Ef1a_Theac  .....MASQK PHLNLITIGH VDHGKSTLVG RLLYEHGEIP AHIIEEYRKE
Eftu_Halma  .......SDE QHQNLAIIGH VDHGKSTLVG RLLYETGSVP EHVIEQHKEE
Eftu_Metva  .....MAKTK PILNVAFIGH VDAGKSTTVG RLLLDGGAID PQLIVRLRKE
Ef1a_Sulso  ......MSQK PHLNLIVIGQ VDHGKSTLVG RLLMDRGFID EKTVKEAEEA

            51                                                 100
Eftu_Ecoli  KTYG..GAAR AFDQIDNAPE EKARGITINT SHVEYDTPTR HYAHVDCPGH
Eftu_Myctu  DKFPDLNETK AFDQIDNAPE ERQRGITINI AHVEYQTDKR HYAHVDAPGH
Eftu_Anani  KA.G.MAKAR AYADIDAAPE EKARGITINT AHVEYETGNR HYAHVDCPGH
Eftu_Theaq  AE.NPNVEVK DYGDIDKAPE ERARGITINT AHVEYETAKR HYSHVDCPGH
Eftu_Mycga  KA..GTSEAK KYDEIDAAPE EKARGITINT AHVEYATQNR HYAHVDCPGH
Eftu_Thema  LKV..LAQYI PYDQIDKAPE EKARGITINI THVEYETEKR HYAHIDCPGH
Eftu_Chltr  ..GDGLADFR DYSSIDNTPE EKARGIPINA SHVEYETANR HYAHVDCPCH
Ef1a_Arath  AAEMNKRSFK YAWVLDKLKA ERERGITIDI ALWKFETTKY YCTVIDAPGH
Ef1a_Wheat  AAEMNKRSFK YAWVLDKLKA ERERGITIDI ALWKFETTKY YCTVIDAPGH
Ef1a_Euggr  ASEMGKGSFK YAWVLDKLKA ERERCITIDI ALWKFETAKS VFTIIDAPGH
Ef10_Xenla  AAEMGKGSFK YAWVLDKLKA ERERGITIDI SLWKFETSKY YVTIIDAPGH
Ef11_Human  AAEMGKGSFK YAWVLDKLKA ERERGITIDI SLWKFETSKY YVTIIDAPGH
Ef11_Drome  AQEMGKGSFK YAWVLDKLKA ERERGITIDI ALWKFETAKY YVTIIDAPGH
Ef1a_Oncvo  AQEMGKGSFK YAWVLDKLKA ERERGIQIDI ALWKFETPKY YITIIDAPGH
Ef1a_Yeast  AAELGKGSFK YAWVLDKLKA ERERGITIDI ALWKFETPKY QVTVIDAPGH
Ef1a_Enthi  SAEMGKGSFK YAWVLDNLKA ERERGITIDI SLWKFETSKY YFTIIDAPGH
Ef1a_Tetpy  SAEQGKGSFK YAWVLDKLKA ERERGITIDI SLWKFETAKY HFTIIDAPGH
Ef1a_Dicdi  ASEMGKQSFK YAWVMDKLKA ERERGITIDI ALWKFETSKY YFTIIDAPGH
Ef1a_Plafk  SAEMGKGSFK YAWVLDKLKA ERERGITIDI ALWKFETPRY FFTVIDAPGH
Ef1a_Giala  ATEMGKGSFK YAWVLDQLKD ERERGITINI ALWKFETKKY IVTIIDAPGH
Ef1a_Pyrwo  MGEKGK.SFK FAWVMDRLRE ERERGITIDV AHTKFETPHR YITIIDAPGH
Ef1a_Theac  AEQKGKATFE FAWVMDRFKE ERERGVTIDL AHRKFETDKY YFTLIDAPGH
Eftu_Halma  AEEKGKGGFE FAYVMDNLAE ERERGVTIDI AHQEFSTDTY DFTIVDCPGH
Eftu_Metva  AEEKGKAGFE FAYVMDGLKE ERERGVTIDV AHKKFPTAKY EVTIVDCPGH
Ef1a_Sulso  AKKLGKESEK FAFLLDRLKE ERERGVTINL TFMRFETKKY FFTIIDAPGH

//////////////////////////////////////////////////////////////////

            401                                                450
Eftu_Ecoli  .......... PEGVEMVMPG DNIKMVVTLI HPIAMDDGL. .....RFAIR
Eftu_Myctu  .......... PEGTEMVMPG DNTNISVKLI QPVAMDEGL. .....RFAIR
Eftu_Anani  .......... GSAAEMVIPG DRIKMTVELI NPIAIEQGM. .....RFAIR
Eftu_Theaq  .......... PQGVEMVMPG DNVTFTVELI KPVALEEGL. .....RFAIR
Eftu_Mycga  .......... KEGTEMVMPG DNTEIIVELI SSIACEKGS. .....KFSIR
Eftu_Thema  .......... PEGVEMVMPG DHVEMEIELI YPVAIEKGQ. .....RFAVR
Eftu_Chltr  .......... PEGVEMVMPG DNVEFEVQLI SPVALEEGM. .....RFAIR
Ef1a_Arath  RRSGKEI..E KE.PKFLKNG DAGMVKMTPT KPMVVETFSE YPPLGRFAVR
Ef1a_Wheat  RRSGKEL..E AL.PKFLKNG DAGIVKMIPT KPMVVETFAT YPPLGRFAVR
Ef1a_Euggr  RRSGKEL..E AE.PKFIKSG DAAIVLMKPQ KPMCVESFTD YPPLG.VSCG
Ef10_Xenla  RRSGKKL..E .DNPKFLKSG DAAIVDMIPG KPMCVESFSD YPPLGRFAVR
Ef11_Human  RRSGKKL..E .DGPKFLKSG DAAIVDMVPG KPMCVESFSD YPPLGRFAVR
Ef11_Drome  RRSGKTT..E .ENPKFIKSG DAAIVNLVPS KPLCVEAFQE FPPLGRFAVR
Ef1a_Oncvo  RRSGKKV..E .DNPKSLKSG DAGIIDLIPT KPLCVETFTE YPPLGRFAVR
Ef1a_Yeast  RRSGKKL..E .DHPKFLKSG DAALVKFVPS KPMCVEAFSE YPPLGRFAVR
Ef1a_Enthi  RRTGKSM..E GGEPEYIKNG DSALVKIVPT KPLCVEEFAK FPPLGRFAVR
Ef1a_Tetpy  RRTGKSQ..E .ENPKFIKNG DAALVTLIPT KALCVEVFQE YPPLGRYAVR
Ef1a_Dicdi  RRTGAVVAKE GTAAVVLKNG DAAMVELTPS RPMCVESFTE YPPLGRFAVR
Ef1a_Plafk  KRSGKVVE.. .ENPKAIKSG DSALVSLEPK KPMVVETFTE YPPLGRFAIR
Ef1a_Giala  KRT...LKPE MENPPDAGRG DCIIVKMVPQ KPLCCETFND YAPLGPFAVR
Ef1a_Pyrwo  PKTGNIVE.. .ENPQFIKTG DAAIVILRPM KPVVLEPVKE IPQLGRFAIR
Ef1a_Theac  PKDGTTLK.. .EKPDFIKNG DVAIVKVIPD KPLVIEKVSE IPQLGRFAVL
Eftu_Halma  PSSGEVAE.. .ENPDFIQNG DAAVVTVRPQ KPLSIEPSSE IPELGSFAIR
Eftu_Metva  PATGEVLE.. .ENPDFLKAG DAAIVKLIPT KPMVIESVKE IPQLGRFAIR
Ef1a_Sulso  PRTGQEAE.. .KNPQFLKQG DVAIVKFKPI KPLCVEKYNE FPPLGRFAMR

            451                                   488
Eftu_Ecoli  EGGRTVGAGV VAKVLS.... .......... ........
Eftu_Myctu  EGGRTVGAGR VTKIIK.... .......... ........
Eftu_Anani  EGGRTIGAGV VSKILQ.... .......... ........
Eftu_Theaq  EGGRTVGAGV VTKILE.... .......... ........
Eftu_Mycga  EGGRTVGAGT VVEVLE.... .......... ........
Eftu_Thema  EGGRTVGAGV VTEVIE.... .......... ........
Eftu_Chltr  EGGRTIGAGT ISKIIA.... .......... ........
Ef1a_Arath  DMRQTVAVGV IKSVDKKDPT GAKVTKAAVK KGAK....
Ef1a_Wheat  DMRQTVAVGV IKGVEKKDPT GAKVTKAAIK KK......
Ef1a_Euggr  DMRQTVAVGV IKSVNKKENT G.KVTKAAQK KK......
Ef10_Xenla  DMRQTVAVGV IKAVEKKAAG SGKVTKSAQK AAKTK...
Ef11_Human  DMRQTVAVGV IKAVDKKAAG AGKVTKSAQK AQKAK...
Ef11_Drome  DMRQTVAVGV IKAVNFKDAS GGKVTKAAEK ATKGKK..
Ef1a_Oncvo  DMRQTVAVGV IKNVD.KSEG VGKVQKAAQK AGVGGKKK
Ef1a_Yeast  DMRQTVAVGV IKSVD.KTEK AAKVTKAAQK AAKK....
Ef1a_Enthi  DMKQTVAVGV VKAVTP.... .......... ........
Ef1a_Tetpy  DMKQTVAVGV IKKVEKKDK. .......... ........
Ef1a_Dicdi  DMRQTVAVGV IKSTVKKAPG KAGDKKGAAA PSKKK...
Ef1a_Plafk  DMRQTIAVGI INQLKRKNLG AVTAKAPAKK ........
Ef1a_Giala  .......... .......... .......... ........
Ef1a_Pyrwo  DMGMTIAAGM VISIQRGE.. .......... ........
Ef1a_Theac  DMGQTVAAGQ CIDLEKR... .......... ........
Eftu_Halma  DMGQTIAAGK VLGVNER... .......... ........
Eftu_Metva  DMGMTVAAGM AIQVTAKNK. .......... ........
Ef1a_Sulso  DMGKTVGVGI IVDVKPAKVE IK........ ........

Notice the listing of sequence names near the top of the file. This listing contains an important number called the checksum. All GCG sequence programs utilize this number as a unique sequence identifier. There is a checksum line for the whole alignment as well as individual checksum lines for each member of the alignment. If any two of the individual checksum numbers are the same, then those sequences are almost certainly identical. If they are, an editor can be used to place an exclamation point, "!", at the start of the checksum line in which the duplicate sequence occurs. GCG interprets exclamation points as remark delimiters, so the duplicate sequence will be ignored in subsequent programs. Another important number on the individual checksum lines needs to be pointed out. The "Weight" designation determines how much each sequence contributes to a profile made from the alignment. Sometimes it is worthwhile to adjust these values so that the contribution of a collection of very similar sequences does not overwhelm the signal from a few more divergent sequences. However, we will not bother with weights here.
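
The duplicate check described above can be sketched in a few lines of Python. This is only an illustration and not part of GCG: it assumes the individual checksum lines look like "Name: ... Len: ... Check: ... Weight: ...", and the names and checksum values in the sample are invented.

```python
import re

# Assumed layout of an individual checksum line in an MSF header.
NAME_LINE = re.compile(
    r"^\s*Name:\s*(\S+)\s+Len:\s*(\d+)\s+Check:\s*(\d+)\s+Weight:\s*([\d.]+)"
)

def comment_out_duplicates(header_lines):
    """Prefix '!' to every checksum line whose checksum was already seen."""
    seen = {}
    result = []
    for line in header_lines:
        match = NAME_LINE.match(line)
        if match:
            name, check = match.group(1), match.group(3)
            if check in seen:
                # Same checksum as an earlier entry: mark it as a remark.
                result.append("!" + line)
                continue
            seen[check] = name
        result.append(line)
    return result

# Invented sample lines; SeqC repeats the checksum of SeqA.
sample = [
    " Name: SeqA  Len: 488  Check: 1234  Weight: 1.00",
    " Name: SeqB  Len: 488  Check: 5678  Weight: 1.00",
    " Name: SeqC  Len: 488  Check: 1234  Weight: 1.00",
]
for line in comment_out_duplicates(sample):
    print(line)
```

Only the later occurrence is commented out, so one copy of each sequence stays in the alignment.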

Additionally, I created a dendrogram of the similarity clustering relationships between the sequences when I ran PileUp. The resultant dendrogram shows the clustering process used to create the alignment. It is not a phylogenetic tree and should not be presented as one, although if the rates of evolution for each lineage are exactly the same, which is seldom the case in nature, it can be the same as one. Be sure to initialize GCG graphics with the setplot command before trying to draw any GCG graphics on the screen. In order to see the dendrogram I use GCG's program Figure to plot the dendrogram to the screen:

% figure ef.pileup

Figure makes figures and posters by drawing graphics and text
together. You can include output from other GCG graphics programs as
part of a figure.

  Process set to plot with TEK4107 attached to TERM:
  using the TEKTRONIX graphic interface.

 When your TEK4107 attached to tty is ready, press <Return>. <rtn>

The length of the vertical lines is proportional to the similarity difference between the sequences. Remember this tree is not an evolutionary tree. No evolutionary statistics or methods for correction of unequal rates of divergence are used in its construction. It merely indicates the relative similarity of the sequences. This dendrogram can assist, however, in determining the proper weighting factors to assign each sequence in order to even out the contribution of each to a profile. I type clearplot at the UNIX prompt to get rid of the plot.
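
To make the idea of similarity clustering concrete, here is a minimal Python sketch of average-linkage clustering. I am assuming this is close in spirit to what PileUp does; it is not PileUp's code, and the four sequence names and their pairwise scores are invented. It joins the two most similar clusters over and over, which is exactly the order of joins that a dendrogram displays.

```python
from itertools import combinations

def average_linkage(names, similarity):
    """Repeatedly join the two most similar clusters; return the join order.

    similarity maps frozenset({a, b}) to a pairwise score for every pair.
    """
    clusters = [(name,) for name in names]
    joins = []
    while len(clusters) > 1:
        def mean_similarity(pair):
            # Average score over all cross-cluster sequence pairs.
            left, right = pair
            scores = [similarity[frozenset((a, b))] for a in left for b in right]
            return sum(scores) / len(scores)

        left, right = max(combinations(clusters, 2), key=mean_similarity)
        joins.append((left, right, mean_similarity((left, right))))
        clusters = [c for c in clusters if c not in (left, right)]
        clusters.append(left + right)
    return joins

# Invented pairwise similarities for four hypothetical sequences.
names = ["A", "B", "C", "D"]
scores = {
    frozenset(("A", "B")): 0.9,
    frozenset(("A", "C")): 0.4,
    frozenset(("A", "D")): 0.3,
    frozenset(("B", "C")): 0.5,
    frozenset(("B", "D")): 0.2,
    frozenset(("C", "D")): 0.8,
}
for left, right, score in average_linkage(names, scores):
    print(left, "+", right, "at similarity", round(score, 3))
```

Nothing in this procedure corrects for unequal rates of change, which is why the result shows relative similarity only and should not be read as a phylogeny.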

5) Visualizing conservation in multiple sequence alignments.

Profile analysis puts the most emphasis on the most conserved portions of an alignment. To easily visualize these most conserved portions of a multiple sequence alignment we can utilize the GCG graphics program PlotSimilarity. This program draws a graph of the running average similarity along a group of aligned sequences (or of a profile with the Profile switch).

While your FastA jobs are running, read through my following example of how alignments can be visualized. After your FastA job is done and your alignment has been created, go through the process with your own sequences. Type plotsimilarity -expand to begin PlotSimilarity and run it with its Expand option. This blows up the plot, scaling it between the maximum and minimum similarity values observed, so that the entire graph is used rather than just the portion of the Y axis that your alignment happens to occupy; this makes the whole thing easier to read. Remember that the symbol comparison matrix that I specified, Blosum62, begins its identity value range at 4.0. The Y axis of the resulting plot uses the similarity values from whichever symbol comparison matrix was used to create your alignment, unless you specify an alternative. Reply to the "...what sequence(s)" query with the GCG designation for all the sequences in an MSF file, i.e. YourAlignment.MSF{*}. I accept all the suggested default parameters in this example. For the Elongation Factor example the following screen appears:

% plotsimilarity -expand

PlotSimilarity plots the running average of the similarity among the
sequences in a multiple sequence alignment.

  Process set to plot with VERSATERM-TEK4105 attached to TERM:
  using the TEKTRONIX graphic interface.

 PLOTSIMILARITY between what sequence(s) ? ef.msf{*}

                  Begin (* 1 *) ?  <rtn>
                End (*   488 *) ?  <rtn>

 *** I read your local data file "Genmoredata:Blosum62.Cmp". ***

      ef.msf{Eftu_Ecoli}
      ef.msf{Eftu_Myctu}
      ef.msf{Eftu_Anani}
      ef.msf{Eftu_Theaq}
      ef.msf{Eftu_Mycga}
      ef.msf{Eftu_Thema}
      ef.msf{Eftu_Chltr}
      ef.msf{Ef1a_Arath}
      ef.msf{Ef1a_Wheat}
      ef.msf{Ef1a_Euggr}
      ef.msf{Ef10_Xenla}
      ef.msf{Ef11_Human}
      ef.msf{Ef11_Drome}
      ef.msf{Ef1a_Oncvo}
      ef.msf{Ef1a_Yeast}
      ef.msf{Ef1a_Enthi}
      ef.msf{Ef1a_Tetpy}
      ef.msf{Ef1a_Dicdi}
      ef.msf{Ef1a_Plafk}
      ef.msf{Ef1a_Giala}
      ef.msf{Ef1a_Pyrwo}
      ef.msf{Ef1a_Theac}
      ef.msf{Eftu_Halma}
      ef.msf{Eftu_Metva}
      ef.msf{Ef1a_Sulso}

 What window to average (* 10 *) ?  <rtn>

 The minimum density for this plot is  424.3 residues/100 platen units.
 What density do you want (* 424.3 *) ?  <rtn>

 When your TEK4107 attached to tty is ready, press <Return>.
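
The running average that PlotSimilarity draws can be sketched as follows. This is an illustration only, under assumptions of my own: identity scoring (1 for a match, 0 otherwise) stands in for the Blosum62 matrix, gaps (".") are simply skipped, and the three short sequences are invented. How PlotSimilarity itself treats gaps is not described here.

```python
from itertools import combinations

def column_similarity(column, score):
    """Mean pairwise score among the non-gap symbols in one alignment column."""
    residues = [symbol for symbol in column if symbol != "."]
    pairs = list(combinations(residues, 2))
    if not pairs:
        return 0.0
    return sum(score(a, b) for a, b in pairs) / len(pairs)

def running_similarity(sequences, window=10,
                       score=lambda a, b: 1.0 if a == b else 0.0):
    """Average the per-column similarity over a sliding window of columns."""
    per_column = [column_similarity(col, score) for col in zip(*sequences)]
    averages = []
    for start in range(len(per_column) - window + 1):
        chunk = per_column[start:start + window]
        averages.append(sum(chunk) / window)
    return averages

# Invented toy alignment; a window of 2 keeps the example readable.
toy = ["ACDE", "ACDF", "AC.E"]
print(running_similarity(toy, window=2))
```

A larger window smooths the curve; the default of 10 matches the "What window to average" prompt in the trace above.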

Two EGCG programs, PrettyBox and PrettyPlot, can also help to visualize multiple sequence alignments. They both display the alignment as an interleaved sequence file, boxing areas of conservation. PrettyBox uses different gray tones to differentiate various levels of conservation; PrettyPlot encloses conserved residues in an open box, basing its conservation decisions on what you specify at runtime. PrettyBox only generates PostScript output, which must be redirected to an output file; PrettyPlot uses standard GCG graphics, so its output can be displayed to any supported emulation, including Encapsulated PostScript, or redirected to a Figure file for further manipulation. I'll show how PrettyPlot runs below with my example. Run it on your own alignment after it is finished. Be sure to initialize Extended GCG first with the EGCG command. A screen trace follows:

% egcg

EGCG (UNIX) Programs prepared:
   GelPicture, GelStatus, GelFigure, GelAnalyze [ Fragment Assembly extensions ]
   SigCleave                                       [ signal peptide prediction ]
   PrettyPlot                    [ version of Pretty with boxed alignment plot ]
   PlotAlign       [ plots conserved parameters for protein multiple alignment ]
   FastaCheck                    [ reports significant hits in (T)FASTA output ]
   HelixTurnHelix                               [ Dodd and Egan HTH prediction ]
   PepCoil                                      [ predicts coiled-coil regions ]
   Antigenic                     [ one prediction scheme for antigenic regions ]
   PepWindow                                             [ Kyte-Doolittle plot ]
   PepStats                                [ peptide properties and statistics ]
   ToText                                [ simple GCG to ascii text conversion ]
   MapSelect                 [Build a subset of the GCG restriction enzyme file]
   PepAllWindow                     [PepWindow for multiple sequence alignments]
   TProfileSearch, TProfileSegments   [ Profile vs. DNA databases, new options ]

For more information, use the new command EGENHELP (works just like GENHELP)

Now I'll launch prettyplot on my ef.msf{*} sequences; I accept all of the defaults:

% prettyplot

 PRETTYPLOT format what sequence(s) ?  ef.msf{*}

             ef.msf{Eftu_Ecoli}, len: 488
             ef.msf{Eftu_Myctu}, len: 488
             ef.msf{Eftu_Anani}, len: 488
             ef.msf{Eftu_Theaq}, len: 488
             ef.msf{Eftu_Mycga}, len: 488
             ef.msf{Eftu_Thema}, len: 488
             ef.msf{Eftu_Chltr}, len: 488
             ef.msf{Ef1a_Arath}, len: 488
             ef.msf{Ef1a_Wheat}, len: 488
             ef.msf{Ef1a_Euggr}, len: 488
             ef.msf{Ef10_Xenla}, len: 488
             ef.msf{Ef11_Human}, len: 488
             ef.msf{Ef11_Drome}, len: 488
             ef.msf{Ef1a_Oncvo}, len: 488
             ef.msf{Ef1a_Yeast}, len: 488
             ef.msf{Ef1a_Enthi}, len: 488
             ef.msf{Ef1a_Tetpy}, len: 488
             ef.msf{Ef1a_Dicdi}, len: 488
             ef.msf{Ef1a_Plafk}, len: 488
             ef.msf{Ef1a_Giala}, len: 488
             ef.msf{Ef1a_Pyrwo}, len: 488
             ef.msf{Ef1a_Theac}, len: 488
             ef.msf{Eftu_Halma}, len: 488
             ef.msf{Eftu_Metva}, len: 488
             ef.msf{Ef1a_Sulso}, len: 488

                  Begin (* 1 *) ?  <rtn>
                End (*   488 *) ?  <rtn>

 Find consensus to what minimum plurality (* 13.00 *) ? <rtn>

 When your TEK4107 attached to tty is ready, press <rtn>. <rtn> 

This plot goes on for several pages; I will only show the first.
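
The "minimum plurality" question in the trace above can be pictured with a small Python sketch. It assumes plain vote counting per column, which may be simpler than what PrettyPlot really does, and the four short sequences are invented. A residue is reported as the consensus only when at least the stated number of sequences agree on it.

```python
from collections import Counter

def plurality_consensus(sequences, plurality):
    """Return the consensus string; '-' marks columns below the plurality."""
    consensus = []
    for column in zip(*sequences):
        counts = Counter(symbol for symbol in column if symbol != ".")
        if counts:
            residue, votes = counts.most_common(1)[0]
            consensus.append(residue if votes >= plurality else "-")
        else:
            # Column contains only gaps.
            consensus.append("-")
    return "".join(consensus)

# Invented toy alignment of four sequences; require three to agree.
rows = ["GHVD", "GHVE", "GQLE", "AHLD"]
print(plurality_consensus(rows, plurality=3))
```

With 25 sequences in my alignment, the default plurality of 13.00 amounts to asking for a simple majority.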

6) Doing it on your own: a detailed study of your favorite branch off of the "Tree of Life" from the Elongation Factor 1-Tu alignment.

By now the FastA batch job submitted first thing in the exercise should be finished. Go into that output file with the pico editor and comment out with exclamation points all those sequences that do not obviously belong to the same major phylogenetic division as your sequence of interest. Leave the one most similar out-group sequence. In my example below with the yeast sequence, I left all the fungal sequences and the closest non-fungal representative, a Xenopus (the common laboratory African Clawed Frog) sequence. Use typedata -reference to find out more about any of the sequence names that you do not recognize. My edited output file example follows below:

(Peptide) FASTA of: Ef1a_Yeast  from: 1 to: 458  October 11, 1995 12:17

P1;EF1A_YEAST - ELONGATION FACTOR 1-ALPHA (EF-1-ALPHA)
ID   EF1A_YEAST     STANDARD;      PRT;   458 AA.
AC   P02994;
DT   21-JUL-1986 (REL. 01, CREATED)
DT   21-JUL-1986 (REL. 01, LAST SEQUENCE UPDATE)
DT   01-OCT-1994 (REL. 30, LAST ANNOTATION UPDATE) . . .


 TO: SwissProt:*  Sequences:     40,292  Symbols: 14,147,368  Word Size: 2
 Scoring matrix: GenRunData:Fastapep.Cmp
 Variable pamfactor used
 Gap creation penalty: 12.0      Gap extension penalty: 4.0

Mean score calculations exclude scores greater than 74
 mean initn score:  27.7 (s.d. 7.64)
 mean init1 score:  27.0 (s.d. 6.01)
 1677 scores better than 53 saved
 joining threshold: 29, optimization threshold: 20

The best scores are:                                        init1 initn  opt..

Swissprotein:Ef1a_Yeast  ELONGATION FACTOR 1-ALPHA (EF-1-...2212  2212  2212
Swissprotein:Ef1a_Canal  ELONGATION FACTOR 1-ALPHA (EF-1-...2104  2104  2104
Swissprotein:Ef1a_Absgl  ELONGATION FACTOR 1-ALPHA (EF-1-...2015  2015  2021
Swissprotein:Ef12_Rhira  ELONGATION FACTOR 1-ALPHA (EF-1-...2006  2006  2006
Swissprotein:Ef11_Rhira  ELONGATION FACTOR 1-ALPHA (EF-1-...2001  2001  2001
Swissprotein:Ef13_Rhira  ELONGATION FACTOR 1-ALPHA (EF-1-...1701  1981  1989
Swissprotein:Ef1a_Pucgr  ELONGATION FACTOR 1-ALPHA (EF-1-...1174  1174  1971
Swissprotein:Ef1a_Trire  ELONGATION FACTOR 1-ALPHA (EF-1-...1899  1899  1940
Swissprotein:Ef13_Xenla  ELONGATION FACTOR 1-ALPHA, OOCYT...1138  1877  1885
!Swissprotein:Ef10_Xenla  ELONGATION FACTOR 1-ALPHA, SOMAT...1113  1850  1885
!Swissprotein:Ef12_Xenla  ELONGATION FACTOR 1-ALPHA, OOCYT...1140  1874  1883
!Swissprotein:Ef12_Human  ELONGATION FACTOR 1-ALPHA 2 (EF-...1109  1838  1881
!Swissprotein:Ef1a_Crigr  ELONGATION FACTOR 1-ALPHA CHAIN    1120  1850  1878
!Swissprotein:Ef11_Human  ELONGATION FACTOR 1-ALPHA 1 (EF-...1119  1849  1877
!Swissprotein:Sttn_Rat  STATIN S1                            1093  1822  1865
!Swissprotein:Ef1a_Artsa  ELONGATION FACTOR 1-ALPHA (EF-1-...1100  1831  1849
!Swissprotein:Ef1a_Mouse  ELONGATION FACTOR 1-ALPHA (EF-1-... 860  1786  1835
!Swissprotein:Ef1a_Bommo  ELONGATION FACTOR 1-ALPHA (EF-1-...1081  1807  1834
!Swissprotein:Ef11_Drome  ELONGATION FACTOR 1-ALPHA (EF-1-...1084  1790  1833
!Swissprotein:Ef1a_Apime  ELONGATION FACTOR 1-ALPHA (EF-1-...1117  1829  1832

! CPU time used:
!        Database scan:  0:47:51.1
! Post-scan processing:  0:00:02.5
!       Total CPU time:  0:47:55.2
! Output File: Ef1a_Yeast.Fasta

In the fungal case, that leaves nine sequences to be analyzed further. Depending on the division you choose, Archaebacteria, Eubacteria, Protist, Plant, Fungus, or Animal, your list size will be bigger or smaller. Be sure not to include more than about a dozen sequences!

Now run pileup on your FastA output list. Because all the members of this sequence set will be very similar, it is not necessary to use the Blosum62 matrix. Since you can use the default PAM matrix, you can accept the default gap penalties. Since your set should not be any larger than about a dozen sequences, it is not necessary to run it in batch mode either. I will not provide a screen trace for this; refer to the previous example for guidance. The exercise report form will ask several questions concerning your subset alignment and you will be using it next week to infer evolutionary relationships between the member species. Rename the resulting dendrogram from this PileUp run into something that identifies you, such as (your lastname).ef_pileup.

Load your MSF file into an editor for further evaluation. Many specialized editors exist for manipulating multiple sequence data. In the VADMS system there is GCG's LineUp and EGCG's ELineUp. Steve Smith's GDE (the Genetic Data Environment) is also available for X-Window environments and is incorporated into GCG's version 9.0 release. However, for the purpose of this exercise I am merely going to use the UNIX text editor pico. Several features need to be pointed out. Notice that the extreme amino and carboxy ends of the alignment are jagged and uncertain. This is fairly common in multiple sequence alignments, and subsequent analyses should probably not include these regions. Take notes of where you think the reliability of your alignment degrades at the ends. Overall, things to look for include strongly conserved residues such as tryptophans, cysteines, and histidines, important structural amino acids such as prolines, tyrosines and phenylalanines, and the conserved isoleucine, leucine, valine triumvirate; make sure they all align. Depending on the quality of the alignment, you may want to rerun PileUp with different parameters or other comparison matrices. Often a particular alignment will have to be rerun and manually adjusted several times before it is acceptable. All subsequent analyses are absolutely dependent on the assertion that this is truly a biologically meaningful alignment. Each column of symbols must actually contain homologous characters.
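
If you would like a rough numerical hint about where the jagged ends stop, the following Python sketch may help. It is not a GCG feature and is no substitute for looking at the alignment yourself. It assumes gaps are written as "." and uses an arbitrary occupancy threshold of my own choosing; the four short sequences are invented.

```python
def solid_region(sequences, min_occupancy=0.9):
    """Return 1-based (begin, end) of the first and last well-filled columns.

    A column counts as well filled when the fraction of sequences that have
    a residue there (not a gap) is at least min_occupancy.
    """
    occupancy = [
        sum(symbol != "." for symbol in column) / len(column)
        for column in zip(*sequences)
    ]
    filled = [i for i, value in enumerate(occupancy) if value >= min_occupancy]
    if not filled:
        return None
    return filled[0] + 1, filled[-1] + 1

# Invented toy alignment with ragged ends.
ragged = ["..MSKEK..", ".GMAKEKF.", "...AKEKFV", "MGMSKEKF."]
print(solid_region(ragged, min_occupancy=0.75))
print(solid_region(ragged, min_occupancy=1.0))
```

The two numbers returned are candidates for the beginning and ending positions you are asked to note down.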

For the purpose of this exercise you will not make any adjustments to the alignment; accept it as is. I do need to state, though, that if any changes are made to the alignment with a non-GCG compatible editor, such as pico, then you must reformat that alignment to restore compatibility with the GCG programs. The way Reformat is used with multiple sequence alignments differs from what you are used to. To reformat an MSF file, issue the following command:

% reformat -msf myalignment.msf{*}

You should not have to use this command in this exercise but I needed to be sure that you know how to use it for those cases when it will be needed.

After you have studied your alignment with the pico editor, be sure to use the visualization techniques discussed earlier to prepare plots of your analyses. When you run PlotSimilarity on your alignment, display the plot to the terminal and then repeat the command with the extra option -figure=(your lastname).plotsim on the command line to generate a Figure file to be sent to the teacher for evaluation.

7) Profile Analysis: How to use ProfileMake to create a weighted matrix of the alignment and ProfileSearch to scan the database with that profile.

Dr. Gribskov et al. have assembled an elegant package for associating distantly related proteins and discovering structural motifs with the Profile analysis suite. John Devereux of GCG has written an excellent overview essay of the method in the GCG program manual; please take the time to read this section!

The Profile method enables the researcher to recognize features that may otherwise be invisible. The greatly enhanced information content of a Profile over individual sequences has the potential to find similar motifs in sequences which may be only distantly related and that will not be found by any other search algorithms. Even though ProfileSearches do require some work to set up and run -- a meaningful multiple sequence alignment must be assembled, ProfileMake needs to be run, and the search job itself takes quite a long time to run -- it is well worth the bother.

A profile should usually be refined to only include the most highly conserved area of an alignment and its members should be appropriately weighted. This refinement procedure, including repeatedly searching the databases and including or excluding members as the case may be, is known as validating the profile. If using Profile analysis in your own research, following the validation procedures outlined in the GCG Program Manual in the ProfileScan description is a very prudent idea, but we do not have the time for that now. However, we will restrict the length of our profile to exclude those jagged ends in our alignment.

A profile, and its inherent consensus, is created with the GCG program ProfileMake. Type profilemake with the option -seqout=ef.cons to generate a normal sequence file of the consensus in addition to the profile file. You also want to restrict the length of your profile so that it excludes the jagged ends of our alignment. Therefore, use the -begin= and -end= options, supplying the locations that you previously noted. In response to "PROFILEMAKE of what aligned sequences ?" answer with your MSF filename followed by an asterisk enclosed by a pair of braces, {*}, to indicate that you want to make a profile of all of the sequences included in your particular MSF file. A screen trace of my example follows (use your own MSF file's name and beginning and ending constraints):

% profilemake -seqout=ef.cons -begin=8 -end=466

 PROFILEMAKE  Version 4.40     October 11, 1995 21:12

ProfileMake creates a position-specific scoring table, called a
profile, that quantitatively represents the information from a
group of aligned sequences.  The profile can then be used for database
searching (ProfileSearch) or sequence alignment (ProfileGap).

 PROFILEMAKE of what aligned sequences ? ef.msf{*}

   ef.msf{Eftu_Ecoli}, begin:     8  end:   466  len:    459  weight: 1.00
   ef.msf{Eftu_Myctu}, begin:     8  end:   466  len:    459  weight: 1.00
   ef.msf{Eftu_Anani}, begin:     8  end:   466  len:    459  weight: 1.00
   ef.msf{Eftu_Theaq}, begin:     8  end:   466  len:    459  weight: 1.00
   ef.msf{Eftu_Mycga}, begin:     8  end:   466  len:    459  weight: 1.00
   ef.msf{Eftu_Thema}, begin:     8  end:   466  len:    459  weight: 1.00
   ef.msf{Eftu_Chltr}, begin:     8  end:   466  len:    459  weight: 1.00
   ef.msf{Ef1a_Arath}, begin:     8  end:   466  len:    459  weight: 1.00
   ef.msf{Ef1a_Wheat}, begin:     8  end:   466  len:    459  weight: 1.00
   ef.msf{Ef1a_Euggr}, begin:     8  end:   466  len:    459  weight: 1.00
   ef.msf{Ef10_Xenla}, begin:     8  end:   466  len:    459  weight: 1.00
   ef.msf{Ef11_Human}, begin:     8  end:   466  len:    459  weight: 1.00
   ef.msf{Ef11_Drome}, begin:     8  end:   466  len:    459  weight: 1.00
   ef.msf{Ef1a_Oncvo}, begin:     8  end:   466  len:    459  weight: 1.00
   ef.msf{Ef1a_Yeast}, begin:     8  end:   466  len:    459  weight: 1.00
   ef.msf{Ef1a_Enthi}, begin:     8  end:   466  len:    459  weight: 1.00
   ef.msf{Ef1a_Tetpy}, begin:     8  end:   466  len:    459  weight: 1.00
   ef.msf{Ef1a_Dicdi}, begin:     8  end:   466  len:    459  weight: 1.00
   ef.msf{Ef1a_Plafk}, begin:     8  end:   466  len:    459  weight: 1.00
   ef.msf{Ef1a_Giala}, begin:     8  end:   466  len:    459  weight: 1.00
   ef.msf{Ef1a_Pyrwo}, begin:     8  end:   466  len:    459  weight: 1.00
   ef.msf{Ef1a_Theac}, begin:     8  end:   466  len:    459  weight: 1.00
   ef.msf{Eftu_Halma}, begin:     8  end:   466  len:    459  weight: 1.00
   ef.msf{Eftu_Metva}, begin:     8  end:   466  len:    459  weight: 1.00
   ef.msf{Ef1a_Sulso}, begin:     8  end:   466  len:    459  weight: 1.00 

 What should I call the output profile (* ef.prf *) ? <rtn>

If we had "commented out" any sequences with an exclamation point in the MSF file, then those sequences would not be used in the formation of the profile. Take a look at the resultant consensus sequence and notice that all positions are filled; there are no gaps. This is because the Profile algorithm will decide on the most conserved residue for each position, regardless. Also notice that the header contains information relating to the sequence's creation through ProfileMake; this can be valuable. The abridged EF-1-Tu profile consensus sequence follows:

% more ef.cons

(Consensus) (Peptide) PROFILEMAKE v4.40 of: ef.msf{*}  Length: 459
  Sequences: 25  MaxScore: 225.58  October 11, 1995 21:13

     Gap: 1.00      Len: 1.00
GapRatio: 0.33 LenRatio: 0.10

   ef.msf{Eftu_Ecoli}  From:     8    To:   466     Weight:  1.00
   ef.msf{Eftu_Myctu}  From:     8    To:   466     Weight:  1.00
   ef.msf{Eftu_Anani}  From:     8    To:   466     Weight:  1.00
   ef.msf{Eftu_Theaq}  From:     8    To:   466     Weight:  1.00
///////////////////////////////////////////////////////////////////////////////
   ef.msf{Ef1a_Giala}  From:     8    To:   466     Weight:  1.00
   ef.msf{Ef1a_Pyrwo}  From:     8    To:   466     Weight:  1.00
   ef.msf{Ef1a_Theac}  From:     8    To:   466     Weight:  1.00
   ef.msf{Eftu_Halma}  From:     8    To:   466     Weight:  1.00
   ef.msf{Eftu_Metva}  From:     8    To:   466     Weight:  1.00
   ef.msf{Ef1a_Sulso}  From:     8    To:   466     Weight:  1.00

Symbol comparison table: Profilepep.Cmp  FileCheck: 4886

Relaxed treatment of non-observed characters
Exponential weighting of characters

 Length: 459  October 11, 1995 21:13  Type: P  Check: 813  ..

       1  KEKPHINIVV IGHVDSGKST TTGHLIYKYG GIDKRTIEKF EKEAAEMGKG
      51  SFKYAWVLDK LKEERERGIT IDIALRKFET AKWYFTIIDA PGHRDFIKNM
     101  ITGTSQADGA ILVVAATDGE FEAGISKDGQ TREHALLAWT LGVKQLIVAV
     151  NKMDMVEPDY SEEWFEEIKK EVSDFLKKVG YLNPDKVPFV PISGFNGDNM
     201  LEPSDNMPWY KGSDAEWKEG ILEGPTLLEA LDAYIPPPER PTDKPLWLPL
     251  QDVYKIGGIG TVPVGRVETG VLKPGEVVTF APAGVTTRKT VQGEVKSVEM
     301  HHEALDEAVP GDNVGFNVRG VSVKDIKRGN VAGDSKNDPG SAKGAAKFTA
     351  QVIVLNKEEG GHPGQITNGY TPVLDCHTAH IACKFAEILE KLDRRSGKEL
     401  EKEPENPKLI KSGDAAIVKL IPTKPLCVET FSEFPPLGRF AVRDMGQTVA
     451  VGVIKKVEK

You are also welcome to take a look at your resultant .prf file. It is a huge table of numbers that doesn't make a whole lot of sense to us mere mortals, but it is a tremendously powerful tool in subsequent analysis steps. As described in the Introduction, other programs can read and interpret all of those numbers to perform very sensitive database searches and alignments by utilizing the information within the matrix which penalizes misalignments in phylogenetically conserved areas more than in variable regions.
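
To give a feeling for what those numbers represent, here is a deliberately simplified Python sketch of a position-specific scoring table. It is not ProfileMake: it assumes plain linear averaging rather than the exponential weighting reported in my trace, it leaves out the position-specific gap penalties, and it uses an invented toy comparison score (1.0 for a match, -0.5 otherwise) in place of Profilepep.Cmp. The three short sequences are invented as well.

```python
from collections import defaultdict

def make_profile(sequences, weights, alphabet, score):
    """Build one row of scores per alignment column.

    Each row maps every residue in `alphabet` to the weighted average of its
    comparison score against the residues observed in that column.
    """
    total = sum(weights)
    profile = []
    for column in zip(*sequences):
        frequency = defaultdict(float)
        for symbol, weight in zip(column, weights):
            if symbol != ".":
                frequency[symbol] += weight / total
        row = {
            residue: sum(freq * score(residue, seen)
                         for seen, freq in frequency.items())
            for residue in alphabet
        }
        profile.append(row)
    return profile

def toy_score(a, b):
    # Hypothetical comparison table: reward identity, penalize anything else.
    return 1.0 if a == b else -0.5

aligned = ["GKS", "GKT", "GRS"]
profile = make_profile(aligned, [1.0, 1.0, 1.0], "GKRST", toy_score)
consensus = "".join(max(row, key=row.get) for row in profile)
print(consensus)
for row in profile:
    print({residue: round(value, 2) for residue, value in row.items()})
```

In this toy case the fully conserved first column rewards its residue with 1.0, while the best residue in each variable column earns only 0.5, so a mismatch costs the most where the alignment is most conserved. Changing the weights list shows how the Weight values shift each sequence's contribution.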

ProfileSearches take a very long time to run so be sure to submit your search as early as possible. To submit your profilesearch against SwissProtein follow the screen trace below. Be sure to use the -minlist=4 and -batch options. MinList sets a list Z score cut-off value -- a very handy way to limit your output list size. Accept all defaults and give appropriate responses when queried.

% profilesearch -minlist=4 -batch

ProfileSearch uses a profile (representing a group of aligned
sequences) as a query to search the database for new sequences with
similarity to the group.  The profile is created with the program
ProfileMake.

 PROFILESEARCH  version 4.40     October 11, 1995 22:25

 PROFILESEARCH with what query profile ?  ef.prf

 "ef.prf" is a profile of length: 459

 Search for query in what sequences(s) (* SwissProt:* *) ? <rtn>

 What is the gap creation penalty (* 4.50 *) ?  <rtn>

 What is the gap extension penalty (* 0.05 *) ?  <rtn>

 What should I call the output file (* Ef.pfs *) ? <rtn>

 ** profilesearch will run as a batch or at job.

 ** profilesearch was submitted using the command:
    "  SUBMIT  "

Job PROFILESEARCH_222532 (queue RIBOZYME, entry 591) started on RIBOZYME

ProfileSearch Z scores are normalized and reflect the significance of the results. Rather than randomizing sequences to evaluate the Z score, as is done in the Monte Carlo approach with the Randomization option of Gap and BestFit, ProfileSearch calculates Z scores from all of the nonsimilar sequences in the database search, similar to the way that BLAST calculates its Probability scores. As with Monte Carlo approaches, Z scores below 3 are probably not worth considering, from 4 to 7 may be interesting, and above 7 are most probably significant.
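
The normalization can be checked by hand. The short Python sketch below plugs the constants from the Normalization section of my ProfileSearch output into the two equations. I am reading the Z score equation as the ratio of Score to Calc_Score, minus the mean, divided by the standard deviation; that reading is my assumption. The constants apply to my search only, and yours will differ.

```python
import math

def calc_score(seq_len, a=50.74, b=0.0026, c=0.0849):
    """Score expected for an unrelated sequence of the given length."""
    return a * (1.0 - math.exp(-b * seq_len - c))

def z_score(score, seq_len, mean=0.997, sd=0.093):
    """Normalize a raw score by the length-dependent expected score."""
    return (score / calc_score(seq_len) - mean) / sd

# Top hit in my output: EF1A_ENTHI, original score 207.65, length 430.
print(round(z_score(207.65, 430), 2))
# Another entry: STTN_RAT, original score 207.89, length 463.
print(round(z_score(207.89, 463), 2))
```

Both values come out within about 0.1 of the Z scores that the program reports for these entries (52.28 and 50.19); the small difference is what I would expect from the rounded constants printed in the output.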

When you return to your completed ProfileSearch in a couple of days, take a look at the output. Pay particular attention to the reported Z scores. Notice that the program is finding all of the Elongation Factors and several other interesting nucleotide binding proteins. The nucleotide binding motif of our profile is the most conserved portion of the alignment, and hence more importance is placed on it in the search; therefore, the other proteins with similar domains are all found. An abridged screen trace of my ProfileSearch output follows below. I've excluded a majority of the entries that we would expect and left many of the surprises.

% more -page ef.pfs

(Peptide) PROFILESEARCH of: ef.prf Length: 459
to: Swissprot:*

         Scores are corrected for composition effects

                 Gap Weight: 4.50
          Gap Length Weight: 0.05
         Sequences Examined: 38106
         CPU time (seconds): 65238
*    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *
Profile information:
(Peptide) PROFILEMAKE v4.40 of: ef.msf{*}  Length: 459
  Sequences: 25  MaxScore: 225.58  October 11, 1995 21:13
                          Gap: 1.00              Len: 1.00
                     GapRatio: 0.33         LenRatio: 0.10
            ef.msf{Eftu_Ecoli}  From: 8         To: 466       Weight: 1.00
            ef.msf{Eftu_Myctu}  From: 8         To: 466       Weight: 1.00 . . .

*    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *
Normalization:                                  October 12, 1995 17:52

         Curve fit using 40 length pools
         0 of 40 pools were rejected

         Normalization equation:

                 Calc_Score = 50.74 * ( 1.0 - exp(-0.0026*SeqLen - 0.0849) )

         Correlation for curve fit: 0.948

         Z score calculation:
         Average and standard deviation calculated using 37977 scores
         129 of 38106 scores were rejected

                 Z_Score = ( Score/Calc_Score - 0.997 ) / 0.093

          Sequence  Strd ZScore   Orig Length ! Documentation  ..
SWISSPROTEIN:EF1A_ENTHI  +   52.28 207.65    430 ! ELONGATION FACTOR 1-ALPHA (EF-1-ALPHA)
SWISSPROTEIN:EF1A_TETPY  +   51.96 207.72    435 ! ELONGATION FACTOR 1-ALPHA (EF-1-ALPHA) (14 NM FILAMENT-ASSOCIATED
SWISSPROTEIN:EF10_XENLA  +   51.46 212.00    462 ! ELONGATION FACTOR 1-ALPHA, SOMATIC FORM (EF-1-ALPHA-S)
SWISSPROTEIN:EF1A_CRIGR  +   51.44 211.95    462 ! ELONGATION FACTOR 1-ALPHA CHAIN
SWISSPROTEIN:EF11_HUMAN  +   51.44 211.95    462 ! ELONGATION FACTOR 1-ALPHA 1 (EF-1-ALPHA-1) (EF-TU)
SWISSPROTEIN:EF13_XENLA  +   51.10 210.58    461 ! ELONGATION FACTOR 1-ALPHA, OOCYTE FORM (EF-1-ALPHA-O1) (EF-1AO1)
SWISSPROTEIN:EF12_XENLA  +   51.05 210.40    461 ! ELONGATION FACTOR 1-ALPHA, OOCYTE FORM (EF-1-ALPHA-O) (EF-1AO)
SWISSPROTEIN:EF1A_WHEAT  +   51.01 207.24    447 ! ELONGATION FACTOR 1-ALPHA (EF-1-ALPHA)
///////////////////////////////////////////////////////////////////////////////
SWISSPROTEIN:STTN_RAT  +   50.19 207.89    463 ! STATIN S1
///////////////////////////////////////////////////////////////////////////////
SWISSPROTEIN:EFTU_MYCPN  +   32.48 138.02    404 ! ELONGATION FACTOR TU (EF-TU)
SWISSPROTEIN:EFTU_COLOB  +   29.19 129.21    415 ! ELONGATION FACTOR TU (EF-TU)
SWISSPROTEIN:GST1_HUMAN  +   27.05 133.26    499 ! GST1-HS GTP-BINDING PROTEIN
SWISSPROTEIN:SUP2_YEAST  +   21.64 128.98    685 ! OMNIPOTENT SUPPRESSOR PROTEIN
 2 (GST1 PROTEIN)
SWISSPROTEIN:SUP2_PICPI  +   19.99 125.46    741 ! OMNIPOTENT SUPPRESSOR PROTEIN
 2
SWISSPROTEIN:HBS1_YEAST  +   18.27 110.98    611 ! ELONGATION FACTOR 1 ALPHA-LIK
E PROTEIN
SWISSPROTEIN:CYSN_ECOLI  +   15.24  89.53    475 ! SULFATE ADENYLATE TRANSFERASE
 SUBUNIT 1 (EC 2.7.7.4) (ATP-
SWISSPROTEIN:NODQ_RHIME  +   15.16 100.77    641 ! NODULATION PROTEIN Q
SWISSPROTEIN:NODQ_AZOBR  +   11.92  87.07    620 ! NODULATION PROTEIN Q
SWISSPROTEIN:LEPA_BACSU  +   11.01  62.01    327 ! GTP-BINDING PROTEIN LEPA HOMO
LOG (FRAGMENT)
SWISSPROTEIN:EFGC_PEA  +   10.25  35.75    141 ! ELONGATION FACTOR G, CHLOROPLAS
T (EF-G) (FRAGMENT)
SWISSPROTEIN:SELB_ECOLI  +    9.45  77.55    620 ! SELB TRANSLATION FACTOR
SWISSPROTEIN:YIHK_ECOLI  +    8.50  72.57    591 ! HYPOTHETICAL 65.4 KD PROTEIN 
IN GLNA-FDHE INTERGENIC REGION (O591)
SWISSPROTEIN:TETO_CAMCO  +    6.73  67.82    639 ! TETRACYCLINE RESISTANCEPROTE
IN TETO
SWISSPROTEIN:TETO_STRMU  +    6.61  67.34    639 ! TETRACYCLINE RESISTANCE PROTE
IN TETO
SWISSPROTEIN:TET5_ENTFA  +    6.42  66.58    639 ! TETRACYCLINE RESISTANCE PROTE
IN TETM (TRANSPOSON TN1545)
SWISSPROTEIN:TETO_CAMJE  +    6.33  66.15    637 ! TETRACYCLINE RESISTANCE PROTE
IN TETO
SWISSPROTEIN:TETM_UREUR  +    6.28  66.03    639 ! TETRACYCLINE RESISTANCE PROTE
IN TETM
SWISSPROTEIN:LEPA_ECOLI  +    6.16  64.01    598 ! GTP-BINDING PROTEIN LEPA
SWISSPROTEIN:IF2G_YEAST  +    6.09  60.60    527 ! TRANSLATIONAL INITIATION FACT
OR GAMMA SUBUNIT (EIF-2-GAMMA)
SWISSPROTEIN:TET9_ENTFA  +    6.05  65.14    639 ! TETRACYCLINE RESISTANCE PROTE
IN TETM (TRANSPOSON TN916)
SWISSPROTEIN:RF3_ECOLI  +    6.04  60.47    528 ! PEPTIDE CHAIN RELEASE FACTOR 3
 (RF-3)
SWISSPROTEIN:EF2_SULAC  +    5.99  68.03    736 ! ELONGATION FACTOR 2 (EF-2)
SWISSPROTEIN:ARFL_CAEEL  +    5.47  31.90    177 ! GTP-BINDING ADP-RIBOSYLATION 
FACTOR HOMOLOG PROTEIN ZK632.8
////////////////////////////////////////////////////////////////////////////////
SWISSPROTEIN:EF2_MESAU  +    4.47  64.47    858 ! ELONGATION FACTOR 2 (EF-2)
SWISSPROTEIN:EF2_RAT  +    4.46  64.45    858 ! ELONGATION FACTOR 2 (EF-2)
SWISSPROTEIN:RAN_HUMAN  +    4.44  33.83    216 ! GTP-BINDING NUCLEAR PROTEIN RA
N (TC4)
SWISSPROTEIN:EF2_HALHA  +    4.43  61.44    728 ! ELONGATION FACTOR 2 (EF-2)
SWISSPROTEIN:EF2_HUMAN  +    4.43  64.31    858 ! ELONGATION FACTOR 2 (EF-2)
SWISSPROTEIN:RAN_CANFA  +    4.42  33.80    216 ! GTP-BINDING NUCLEAR PROTEIN RA
N (TC4)
SWISSPROTEIN:RAB7_PEA  +    4.34  32.64    206 ! RAB7-RELATED GTP-BINDING PROTEI
N
SWISSPROTEIN:YO81_CAEEL  +    4.28  56.22    581 ! HYPOTHETICAL 65.0 KD PROTEIN 
ZK1236.1 IN CHROMOSOME III
SWISSPROTEIN:EFGM_YEAST  +    4.22  61.39    761 ! ELONGATION FACTOR G, MITOCHON
DRIAL PRECURSOR (MEF-G)
SWISSPROTEIN:RAB4_HUMAN  +    4.18  32.97    213 ! RAS-RELATED PROTEIN RAB-4
SWISSPROTEIN:VHED_BPIKE  +    4.17  18.80     88 ! HELIX-DESTABILIZING PROTEIN (
DNA-BINDING PROTEIN)
SWISSPROTEIN:ATPI_EUGGR  +    4.13  36.32    251 ! ATP SYNTHASE A CHAIN PRECURSO
R (EC 3.6.1.34) (SUBUNIT IV)
SWISSPROTEIN:FRXB_WHEAT  +    4.09  29.07    176 ! FRXB PROTEIN
SWISSPROTEIN:RAN2_LYCES  +    4.09  33.53    221 ! GTP-BINDING NUCLEAR PROTEIN R
AN2
SWISSPROTEIN:RAS2_PHYPO  +    4.01  30.63    193 ! RAS-LIKE PROTEIN 2
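
The normalization reported at the top of this listing can be checked by hand. The following is a minimal Python sketch, not part of GCG, that uses the coefficients printed in the listing and assumes the Z score has the ratio form (Score/Calc_Score - 0.997) / 0.093. The function names calc_score and z_score are made up for illustration.

```python
import math

def calc_score(seq_len):
    """Length-expected score from the curve fit printed by ProfileSearch."""
    return 50.74 * (1.0 - math.exp(-0.0026 * seq_len - 0.0849))

def z_score(orig_score, seq_len):
    """Assumed ratio form of the Z score: (Score/Calc_Score - 0.997) / 0.093."""
    return (orig_score / calc_score(seq_len) - 0.997) / 0.093

# (name, Orig score, Length) copied from the ProfileSearch listing
hits = [
    ("EF1A_ENTHI", 207.65, 430),
    ("STTN_RAT", 207.89, 463),
    ("RAS2_PHYPO", 30.63, 193),
]

for name, orig, length in hits:
    print(f"{name:12s} expected {calc_score(length):6.2f}  Z {z_score(orig, length):6.2f}")
```

The computed values should land close to the ZScore column for these three entries (52.28, 50.19 and 4.01). A small residual difference is to be expected because the coefficients in the listing are printed with only a few significant digits.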

The program ProfileSegments makes BestFit-style alignments of the results of a ProfileSearch. The importance of the conserved portions of your alignment, as reflected in the content of your profile, is fully utilized in this alignment procedure. When you've checked out your ProfileSearch output, edit it with the pico editor to place exclamation points in front of all sequences that you expected to be found by the search, i.e. all of the EF-1-Tu's; this comments them out so that they are not aligned. You may want to leave one top-scoring EF-1-Tu sequence as a positive control. Do not comment out any of the sequences that are not obviously EF-1-Tu. Next, run profilesegments on the modified ProfileSearch output file to make the alignments; accept all of the defaults. An abridged example of my session follows:

% profilesegments

ProfileSegments makes optimal alignments showing the segments of
similarity found by ProfileSearch.

 PROFILESEGMENTS  version 4.40     October 17, 1995 21:08

 (Local) PROFILESEGMENTS from what PROFILESEARCH output file ? ef.pfs

 Stop after how many alignments (* 15 *) ?  <rtn>

 What should I call the paired output display file (* Ef.pairs *) ? <rtn>

        The following levels will be marked in the alignments:
                   Bar: 0.33
                 Colon: 0.20
                   Dot: 0.10

 Aligning ......................-......................
 Swissprotein:Ef1a_Enthi

          Gaps:     8
       Quality: 207.7
 Quality Ratio: 0.485
        Length:   459


 Aligning ......................-.......................
 Swissprotein:Sttn_Rat

          Gaps:     6
       Quality: 207.9
 Quality Ratio: 0.471
        Length:   459


 Aligning ......................-......................
 Swissprotein:Gst1_Human

          Gaps:     7
       Quality: 133.3
 Quality Ratio: 0.313
        Length:   459


///////////////////////////////////////////////////////////////////////

 Aligning ......................-......................
 Swissprotein:Teto_Strmu

          Gaps:    21
       Quality:  67.3
 Quality Ratio: 0.147
        Length:   660


 Aligning ......................-......................
 Swissprotein:Tet5_Entfa

          Gaps:    19
       Quality:  66.6
 Quality Ratio: 0.146
        Length:   656
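
The hand-editing step that precedes this session, commenting out the expected hits with exclamation points, could also be scripted. The following is a minimal Python sketch under several assumptions: each hit occupies one line of the ProfileSearch output file, the expected hits can be recognized by keywords in their documentation, and the function name comment_out_expected and its parameters are hypothetical. It is not a GCG tool, and the pico route described above works just as well.

```python
def comment_out_expected(lines, keywords, keep_first=True):
    """Prefix expected hits with '!' so that they are treated as comments.

    lines:      entry lines from a ProfileSearch output file, one hit per line
    keywords:   upper-case substrings that mark an expected hit (an assumption)
    keep_first: leave the first expected hit active as a positive control
    """
    out = []
    kept = False
    for line in lines:
        if line.startswith("!") or not any(k in line.upper() for k in keywords):
            out.append(line)        # already a comment, or not an expected hit
        elif keep_first and not kept:
            kept = True
            out.append(line)        # positive control stays active
        else:
            out.append("!" + line)  # comment out the expected hit
    return out

# Four entries taken from the ProfileSearch listing, each on a single line
entries = [
    "SWISSPROTEIN:EF1A_ENTHI  +   52.28 207.65    430 ! ELONGATION FACTOR 1-ALPHA (EF-1-ALPHA)",
    "SWISSPROTEIN:STTN_RAT    +   50.19 207.89    463 ! STATIN S1",
    "SWISSPROTEIN:EFTU_MYCPN  +   32.48 138.02    404 ! ELONGATION FACTOR TU (EF-TU)",
    "SWISSPROTEIN:GST1_HUMAN  +   27.05 133.26    499 ! GST1-HS GTP-BINDING PROTEIN",
]
edited = comment_out_expected(
    entries, ["ELONGATION FACTOR 1-ALPHA", "ELONGATION FACTOR TU"]
)
print("\n".join(edited))
```

Keyword matching is only a convenience. Inspect the edited list before running profilesegments, since an unexpected hit whose documentation happens to contain a keyword would be commented out as well.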

Now use more on the results of your ProfileSegments run and notice how different the alignments are, after the obvious first one, from the examples seen in Exercise #10. See how the conserved portions of the profile do not allow the corresponding portions of the alignment to gap. "Clustering" is much more critical to Profile analyses than to any other method. An abridged Elongation Factor ProfileSegments example output file follows:

% more ef.pairs

 (Local) PROFILESEGMENTS of: Ef1a_Enthi  check: 9365  from: 1  to: 430

P1;EF1A_ENTHI - ELONGATION FACTOR 1-ALPHA (EF-1-ALPHA)
ID   EF1A_ENTHI     STANDARD;      PRT;   430 AA.
AC   P31018;
DT   01-JUL-1993 (REL. 26, CREATED)
DT   01-JUL-1993 (REL. 26, LAST SEQUENCE UPDATE)
DT   01-JUL-1993 (REL. 26, LAST ANNOTATION UPDATE) . . .

 to: Ef.Prf  check: 813  from: 1  to: 459

(Peptide) PROFILEMAKE v4.40 of: ef.msf{*}  Length: 459
  Sequences: 25  MaxScore: 225.58  October 11, 1995 21:13
                          Gap: 1.00              Len: 1.00
                     GapRatio: 0.33         LenRatio: 0.10
            ef.msf{Eftu_Ecoli}  From: 8         To: 466       Weight: 1.00
            ef.msf{Eftu_Myctu}  From: 8         To: 466       Weight: 1.00 . . .

         Gap Weight:  4.500      Average Match:  0.200
      Length Weight:  0.050   Average Mismatch: -0.152

            Quality: 207.65             Length:    459
              Ratio:   0.49               Gaps:      8

 Ef1a_Enthi x Ef.Prf       October 17, 1995 21:08  ..

                  .         .         .         .         .
S      3 KEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDQRTIEKFEKESAEMGKG 52
         |||:|||||||||||:|||||||||||:.||||.|:||:||||.:|:|||
P      1 KEKPHINIVVIGHVDSGKSTTTGHLIYKYGGIDKRTIEKFEKEAAEMGKG 50
                  .         .         .         .         .
S     53 SFKYAWVLDNLKAERERGITIDISLWKFETSKYYFTIIDAPGHRDFIKNM 102
         |||||||||:|||||||||||||:::||||.||:||||||||||||||||
P     51 SFKYAWVLDKLKEERERGITIDIALRKFETAKWYFTIIDAPGHRDFIKNM 100
                  .         .         .         .         .
S    103 ITGTSQADVAILIVAAGTGEFEAGISKNGQTREHILLSYTLGVKQMIVGV 152
         ||||||||.|||||||:.|:||||:|:.||||||:||:.|||||||||.|
P    101 ITGTSQADGAILVVAATDGEFEAGISKDGQTREHALLAWTLGVKQLIVAV 150
                  .         .         .         .         .
S    153 NKMDAI..QYKQERYEEIKKEISAFLKKTGY.NPDKIPFVPISGFQGDNM 199
         ||||..  .|.::|||||::||:.:||| || |:|.::||||||:.|||:
P    151 NKMDMVEPDYSEEWFEEIKKEVSDFLKKVGYLNPDKVPFVPISGFNGDNM 200
                  .         .         .         .         .
S    200 IEPSTNMPWYK............GPTLIGALDS.VTPPERPVDKPLRLPL 236
         :|.|.|:||||            |:||::|||. | :|:||:||||||||
P    201 LEPSDNMPWYKGSDAEWKEGILEGPTLLEALDAYIPPPERPTDKPLWLPL 250
                  .         .         .         .         .
S    237 QDVYKISGIGTVPVGRVETGILKPGTIVQFAPSGVSS......ECKSIEM 280
         ||||||:|||||||||||||:|||| .| |:| ::..      |.|||||
P    251 QDVYKIGGIGTVPVGRVETGVLKPGEVVTFAPAGVTTRKTVQGEVKSVEM 300
                  .         .         .         .         .
S    281 HHTALAQAIPGDNVGFNVRNLTVKDIKRGNVASDAKNQP..AVGCEDFTA 328
         ||..|.:|.||||||||||||.:||||||.|:::.:|.|  : : ..|||
P    301 HHEALDEAVPGDNVGFNVRGVSVKDIKRGNVAGDSKNDPGSAKGAAKFTA 350
                  .         .         .         .         .
S    329 QVIVMN.....HPGQIRKGYTPVLDCHTSHIACKFEELLSKIDRRTGKSM 373
         ||||||     |||:| .||:|||:|||:||||:|.::..|||:|.|: :
P    351 QVIVLNKEEGGHPGQITNGYTPVLDCHTAHIACKFAEILEKLDRRSGKEL 400
                  .         .         .         .         .
S    374 ..EGGEPEYIKNGDSALVKIVPTKPLCVEEFAKFPPLGRFAVRDMKQTVA 421
           : ..|::||.||.|:|::.|.|||:||.:. ||||||||||||.||||
P    401 EKEPENPKLIKSGDAAIVKLIPTKPLCVETFSEFPPLGRFAVRDMGQTVA 450

S    422 VGVVKAVTP 430
         ||||:.|
P    451 VGVIKKVEK 459


 (Local) PROFILESEGMENTS of: Sttn_Rat  check: 6798  from: 1  to: 463

P1;STTN_RAT   - STATIN S1
ID   STTN_RAT       STANDARD;      PRT;   463 AA.
AC   P27706;
DT   01-AUG-1992 (REL. 23, CREATED)
DT   01-AUG-1992 (REL. 23, LAST SEQUENCE UPDATE)
DT   01-FEB-1994 (REL. 28, LAST ANNOTATION UPDATE) . . .

 to: Ef.Prf  check: 813  from: 1  to: 459

(Peptide) PROFILEMAKE v4.40 of: ef.msf{*}  Length: 459
  Sequences: 25  MaxScore: 225.58  October 11, 1995 21:13
                          Gap: 1.00              Len: 1.00
                     GapRatio: 0.33         LenRatio: 0.10
            ef.msf{Eftu_Ecoli}  From: 8         To: 466       Weight: 1.00
            ef.msf{Eftu_Myctu}  From: 8         To: 466       Weight: 1.00 . . .

         Gap Weight:  4.500      Average Match:  0.200
      Length Weight:  0.050   Average Mismatch: -0.152

            Quality: 207.89             Length:    459
              Ratio:   0.47               Gaps:      6

 Sttn_Rat x Ef.Prf         October 17, 1995 21:09  ..

                  .         .         .         .         .
S      3 KEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEKEAAEMGKG 52
         |||:|||||||||||:|||||||||||:.||||:|:||:|||||:|:|||
P      1 KEKPHINIVVIGHVDSGKSTTTGHLIYKYGGIDKRTIEKFEKEAAEMGKG 50
                  .         .         .         .         .
S     53 SFKYAWVLDKLKAERERGITIDISLWKFETTKYYITIIDAPGHRDFIKNM 102
         |||||||||||||||||||||||:::||||.||::|||||||||||||||
P     51 SFKYAWVLDKLKEERERGITIDIALRKFETAKWYFTIIDAPGHRDFIKNM 100
                  .         .         .         .         .
S    103 ITGTSQADCAVLIVAAGVGEFEAGISKNGQTREHALLAYTLGVKQLIVGV 152
         ||||||||:|||||||: |:||||:|:.||||||||||.|||||||||.|
P    101 ITGTSQADGAILVVAATDGEFEAGISKDGQTREHALLAWTLGVKQLIVAV 150
                  .         .         .         .         .
S    153 NKMDSTEPAYSEKRYDEIVKEVSAYIKKIGY.NPATVPFVPISGWHGDNM 201
         ||||.... |||.|||||.:||:.:|||:|| |:..::||||||:.|||:
P    151 NKMDMVEPDYSEEWFEEIKKEVSDFLKKVGYLNPDKVPFVPISGFNGDNM 200
                  .         .         .         .         .
S    202 LEPSPNMPWFKGWKVERKEGNASGVSLLEALDT.ILPPTRPTDKPLRLPL 250
         ||.| |:||||.   .. ..   |  ||||||  | :| ||:||||||||
P    201 LEPSDNMPWYKGSDAEWKEGILEGPTLLEALDAYIPPPERPTDKPLWLPL 250
                  .         .         .         .         .
S    251 QDVYKIGGIGTVPVGRVETGILRPGMVVTFAPVNITT......EVKSVEM 294
         ||||||||||||||||||||:||||.:|:|:|...:|      |||||||
P    251 QDVYKIGGIGTVPVGRVETGVLKPGEVVTFAPAGVTTRKTVQGEVKSVEM 300
                  .         .         .         .         .
S    295 HHEALSEALPGDNVGFNVKNVSVKDIRRGNVCGDSKADP..PQEAAQFTS 342
         |||.| ||.||||||||||||::|||:||.| |:.: :|  . :.:.||:
P    301 HHEALDEAVPGDNVGFNVRGVSVKDIKRGNVAGDSKNDPGSAKGAAKFTA 350
                  .         .         .         .         .
S    343 QVIILN.....HPGQISAGYSPVIDCHTAHIACKFAELKEKIDRRSGKKL 387
         ||||||     |||:| .||.||::||||||||:|.:: .|||:|:|: :
P    351 QVIVLNKEEGGHPGQITNGYTPVLDCHTAHIACKFAEILEKLDRRSGKEL 400
                  .         .         .         .         .
S    388 E...DNPKSLKSGDAAIVEMVPGKPMCVESFSQYPPLGRFAVADTRQTVA 434
         .   |:|: ||:||||:|.|.| |||:||.:..:|||||||| | .||||
P    401 EKEPENPKLIKSGDAAIVKLIPTKPLCVETFSEFPPLGRFAVRDMGQTVA 450

S    435 VGVIKNVEK 443
         ||||:.|.:
P    451 VGVIKKVEK 459


 (Local) PROFILESEGMENTS of: Gst1_Human  check: 7837  from: 1  to: 499

P1;GST1_HUMAN - GST1-HS GTP-BINDING PROTEIN
ID   GST1_HUMAN     STANDARD;      PRT;   499 AA.
AC   P15170;
DT   01-APR-1990 (REL. 14, CREATED)
DT   01-APR-1990 (REL. 14, LAST SEQUENCE UPDATE)
DT   01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE) . . .

 to: Ef.Prf  check: 813  from: 1  to: 459

(Peptide) PROFILEMAKE v4.40 of: ef.msf{*}  Length: 459
  Sequences: 25  MaxScore: 225.58  October 11, 1995 21:13
                          Gap: 1.00              Len: 1.00
                     GapRatio: 0.33         LenRatio: 0.10
            ef.msf{Eftu_Ecoli}  From: 8         To: 466       Weight: 1.00
            ef.msf{Eftu_Myctu}  From: 8         To: 466       Weight: 1.00 . . .

         Gap Weight:  4.500      Average Match:  0.200
      Length Weight:  0.050   Average Mismatch: -0.152

            Quality: 133.26             Length:    458
              Ratio:   0.31               Gaps:      7

 Gst1_Human x Ef.Prf       October 17, 1995 21:09  ..

                  .         .         .         .         .
S     71 KKEHVNVVFIGHVDAGKSTIGGQIMYLTGMVDKRTLEKYEREAKEKNRET 120
         .| ||||| ||||| ||||..|.|:|  | ||:|::|:||:|| |..:..
P      2 EKPHINIVVIGHVDSGKSTTTGHLIYKYGGIDKRTIEKFEKEAAEMGKGS 51
                  .         .         .         .         .
S    121 WYLSWALDTNQEERDKGKTVEVGRAYFETEKKHFTILDAPGHKSFVPNMI 170
         | ..| ||  .|||||| ||||:   |||.| :|||||||||::||.|||
P     52 FKYAWVLDKLKEERERGITIDIALRKFETAKWYFTIIDAPGHRDFIKNMI 101
                  .         .         .         .         .
S    171 GGASQADLAVLVISARKGEFETGFEKGGQTREHAMLAKTAGVKHLIVLIN 220
         ||||||| |||||:|  |:||.|. :.|||||||||| | |||:||| :|
P    102 TGTSQADGAILVVAATDGEFEAGISKDGQTREHALLAWTLGVKQLIVAVN 151
                  .         .         .         .         .
S    221 KMDDPTVNWSNERYEECKEKLVPFLKKVGFNPKKDIHFMPCSGLTGANLK 270
         |||  ...||.:|||| : .|  :|||:||   ..: |.| ||. | |:
P    152 KMDMVEPDYSEEWFEEIKKEVSDFLKKVGYLNPDKVPFVPISGFNGDNML 201
                  .         .         .         .         .
S    271 EQSDFCPWY............IGLPFIPYLDN.LPNFNRSVDGPIRLPIV 307
         | |.  |||             |  |:  ||. |:  .|.:| ||||||
P    202 EPSDNMPWYKGSDAEWKEGILEGPTLLEALDAYIPPPERPTDKPLWLPLQ 251
                  .         .         .         .         .
S    308 DKYK..DMGTVVLGKLESGSICKGQQLVMMPNKHNV......EVLGILSD 349
         | ||  |.|||.:||||.| |  |. |.: | .         || ||
P    252 DVYKIGGIGTVPVGRVETGVLKPGEVVTFAPAGVTTRKTVQGEVKSVEMH 301
                  .         .         .         .         .
S    350 DVETDTVAPGENLKIRLKGIEEEEILPGFILCDPNNLCHSGR...TFDAQ 396
         . . : . ||||| | ||||. .|| :| |..:..| . ...   .|.||
P    302 HEALDEAVPGDNVGFNVRGVSVKDIKRGNVAGDSKNDPGSAKGAAKFTAQ 351
                  .         .         .         .         .
S    397 IVIIE.....HKSIICPGYNAVLHIHTCIEEVEITALICLVDKKSGEK.. 439
         ||||.     | : | .|| |||:.||.  .. |  :.  ||.::|.
P    352 VIVLNKEEGGHPGQITNGYTPVLDCHTAHIACKFAEILEKLDRRSGKELE 401
                  .         .         .         .         .
S    440 .SKTRPRFVKQDQVCIARLRTAGTICLETFKDFPQMGRFTLRDEGKTIAI 488
              |.||| || .:  | .  ::::|.: :||.||||::|| :.|||:
P    402 KEPENPKLIKSGDAAIVKLIPTKPLCVETFSEFPPLGRFAVRDMGQTVAV 451

S    489 GKVLKLVP 496
         | | .:
P    452 GVIKKVEK 459

///////////////////////////////////////////////////////////////////////

 (Local) PROFILESEGMENTS of: Sup2_Picpi  check: 4346  from: 1  to: 741

P1;SUP2_PICPI - OMNIPOTENT SUPPRESSOR PROTEIN 2
ID   SUP2_PICPI     STANDARD;      PRT;   741 AA.
AC   P23637;
DT   01-NOV-1991 (REL. 20, CREATED)
DT   01-NOV-1991 (REL. 20, LAST SEQUENCE UPDATE)
DT   01-MAR-1992 (REL. 21, LAST ANNOTATION UPDATE) . . .

 to: Ef.Prf  check: 813  from: 1  to: 459

(Peptide) PROFILEMAKE v4.40 of: ef.msf{*}  Length: 459
  Sequences: 25  MaxScore: 225.58  October 11, 1995 21:13
                          Gap: 1.00              Len: 1.00
                     GapRatio: 0.33         LenRatio: 0.10
            ef.msf{Eftu_Ecoli}  From: 8         To: 466       Weight: 1.00
            ef.msf{Eftu_Myctu}  From: 8         To: 466       Weight: 1.00 . . .

         Gap Weight:  4.500      Average Match:  0.200
      Length Weight:  0.050   Average Mismatch: -0.152

            Quality: 125.46             Length:    460
              Ratio:   0.29               Gaps:     11

 Sup2_Picpi x Ef.Prf       October 17, 1995 21:09  ..

                  .         .         .         .         .
S    314 GGKDHMSIIFMGHVDAGKSTMGGNLLFLTGAVDKRTVEKYEREAKDAGRQ 363
          .| |::|: ||||| ||||:.|.|:|  |.||:|:||:||:|| : |:
P      1 KEKPHINIVVIGHVDSGKSTTTGHLIYKYGGIDKRTIEKFEKEAAEMGKG 50
                  .         .         .         .         .
S    364 GWYLSWIMDTNKEERNDGKTIEVGKSYFETDKRRYTILDAPGHKLYISEM 413
         :| ..||||  ||||: | ||||:   |||.|  :||||||||: ||:||
P     51 SFKYAWVLDKLKEERERGITIDIALRKFETAKWYFTIIDAPGHRDFIKNM 100
                  .         .         .         .         .
S    414 IGGASQADVGVLVISSRKGEYEAGFERGGQSREHAILAKTQGVNKLVVVI 463
         ||||||||.|||||::  |::|||. ..||:||||||| | ||..||| :
P    101 ITGTSQADGAILVVAATDGEFEAGISKDGQTREHALLAWTLGVKQLIVAV 150
                  .         .         .         .         .
S    464 NKMDDPTVNWSKERYEECTTKLAMYLKGVGYQKGD.VLFMPVSGYTGAGL 512
         ||||  ...||.:||||   .|  :|| :|| ..| : |.||||: | .:
P    151 NKMDMVEPDYSEEWFEEIKKEVSDFLKKVGYLNPDKVPFVPISGFNGDNM 200
                  .         .         .         .         .
S    513 KERVSQKDAPWYN............GPSLLEYLDS.MPLAVRKINDPFML 549
          |. |..  |||.            |: ||| ||. :: | | :|.||:|
P    201 LEP.SDN.MPWYKGSDAEWKEGILEGPTLLEALDAYIPPPERPTDKPLWL 248
                  .         .         .         .         .
S    550 PISS..KMKDLGTVIEGKIESGHVKKGQNLLVMPNKTQVEV.....TTIY 592
         || .  || |.|||  ||||.| || |. | . | .           .
P    249 PLQDVYKIGGIGTVPVGRVETGVLKPGEVVTFAPAGVTTRKTVQGEVKSV 298
                  .         .         .         .         .
S    593 NETEAEADSAFCGEQVRLRLRGIEEEDLSAGYVLS.SINHP..VKTVTRF 639
         |  ... : |  |||| | ||||. .||  | |.: . | |  .:. ..|
P    299 EMHHEALDEAVPGDNVGFNVRGVSVKDIKRGNVAGDSKNDPGSAKGAAKF 348
                  .         .         .         .         .
S    640 EAQIAIVEL.....KSILSTGFSCVMHVHTAIEEVTFTQLLHNLQKGTNR 684
         .||| ||.       : | .||..||:.|||  ...| .:. .|:. ...
P    349 TAQVIVLNKEEGGHPGQITNGYTPVLDCHTAHIACKFAEILEKLDRRSGK 398
                  .         .         .         .         .
S    685 R...SKKAPAFAKQGMKIIAVLETTEPVCIESYDDYPQLGRFTLRDQGQT 731
               ..| | | |   :  |....|::||.: ::|.||||::|| :||
P    399 ELEKEPENPKLIKSGDAAIVKLIPTKPLCVETFSEFPPLGRFAVRDMGQT 448
                  .
S    732 IAIGKVTKLL 741
         ||:| |..:
P    449 VAVGVIKKVE 458

///////////////////////////////////////////////////////////////////////

 (Local) PROFILESEGMENTS of: Nodq_Azobr  check: 555  from: 1  to: 620

P1;NODQ_AZOBR - NODULATION PROTEIN Q
ID   NODQ_AZOBR     STANDARD;      PRT;   620 AA.
AC   P28604;
DT   01-DEC-1992 (REL. 24, CREATED)
DT   01-DEC-1992 (REL. 24, LAST SEQUENCE UPDATE)
DT   01-DEC-1992 (REL. 24, LAST ANNOTATION UPDATE) . . .

 to: Ef.Prf  check: 813  from: 1  to: 459

(Peptide) PROFILEMAKE v4.40 of: ef.msf{*}  Length: 459
  Sequences: 25  MaxScore: 225.58  October 11, 1995 21:13
                          Gap: 1.00              Len: 1.00
                     GapRatio: 0.33         LenRatio: 0.10
            ef.msf{Eftu_Ecoli}  From: 8         To: 466       Weight: 1.00
            ef.msf{Eftu_Myctu}  From: 8         To: 466       Weight: 1.00 . . .

         Gap Weight:  4.500      Average Match:  0.200
      Length Weight:  0.050   Average Mismatch: -0.152

            Quality:  87.07             Length:    463
              Ratio:   0.21               Gaps:     11

 Nodq_Azobr x Ef.Prf       October 17, 1995 21:10  ..

                  .         .         .         .         .
S      5 TGRGLLRFLTCGSVDDGKSTLIGRLLHDAGLISDDQLEQARRDSRGRAEE 54
          .|  : :...| ||.||||:.|.|:.. | | .  :|.  ::. . :
P      1 KEKPHINIVVIGHVDSGKSTTTGHLIYKYGGIDKRTIEKFEKEAAEMG.. 48
                  .         .         .         .         .
S     55 DGGIDFSLLVDGLEAEREQSITIDVAYRYFATDRRSFIVADAPGHEQYTR 104
         .|::.|..::| |.||||||||||||.: |.|.:  | : ||||| || |
P     49 KGSFKYAWVLDKLKEERERGITIDIALRKFETAKWYFTIIDAPGHRDFIK 98
                  .         .         .         .         .
S    105 NMATAASGRSLAVLLVDARKGLL.......TQTRRHAIVASLMGIRHVVL 147
         || ||||: : |||||.|  | |       .||| |||||  |||:::||
P     99 NMITGTSQADGAILVVAATDGEFEAGISKDGQTREHALLAWTLGVKQLIV 148
                  .         .         .         .         .
S    148 AVNKMDLVED..GETVFAAIRQAFTVFSAPLGF...RSVTAIPLSARHGD 192
         :|||||.:.   :|  |. | ...  :   .||     :  ||||: .||
P    149 AVNKMDMVEPDYSEEWFEEIKKEVSDFLKKVGYLNPDKVPFVPISGFNGD 198
                  .         .         .         .         .
S    193 NVVHRSAAMPWHH............GPTLLGHLETAAAEDDPTEDGPLRF 230
         |:::.|. :||              |:|||: ||   ...... | ||||
P    199 NMLEPSDNMPWYKGSDAEWKEGILEGPTLLEALDAYIPPPERPTDKPLWL 248
                  .         .         .         .         .
S    231 LVEWVNRPNLDFRGLSGTLLSGSLETGGAVTVWPSGRSAR......IARI 274
          || | . .    :  | | .| |..|  |:. | : ..       |  |
P    249 PLQDVYKIGGIGTVPVGRVETGVLKPGEVVTFAPAGVTTRKTVQGEVKSV 298
                  .         .         .         .         .
S    275 VTFDGDVTQARAGDAVTVTLDAAV..DAGRGDLLSGPDGAPEVA...DQF 319
            ...| :| |||:||. | :    |  ||.|.:.....| .:   ..|
P    299 EMHHEALDEAVPGDNVGFNVRGVSVKDIKRGNVAGDSKNDPGSAKGAAKF 348
                  .         .         .         .         .
S    320 AAHLLWMA......EEPLIPGRSYLLRAGARWVPATVTALRHAVNVET.. 361
         .|:|: |        ..| .| . :|   |  |:..:  :   |:  .
P    349 TAQVIVLNKEEGGHPGQITNGYTPVLDCHTAHIACKFAEILEKLDRRSGK 398
                  .         .         .         .         .
S    362 ...LEHGAASVLGLNAVGLCNLSTAAPLAFDPYEASRHTGSFILVDRFSN 408
             : ..: .|  |: :: .| .  ||:.|.:      | | : |   .
P    399 ELEKEPENPKLIKSGDAAIVKLIPTKPLCVETFSEFPPLGRFAVRDM..G 446
                  .
S    409 RTVGAGMIRHPLR 421
         :|||:|:|.   .
P    447 QTVAVGVIKKVEK 459

///////////////////////////////////////////////////////////////////////

 (Local) PROFILESEGMENTS of: Selb_Ecoli  check: 8381  from: 1  to: 620

P1;SELB_ECOLI - SELB TRANSLATION FACTOR
ID   SELB_ECOLI     STANDARD;      PRT;   620 AA.
AC   P14081;
DT   01-JAN-1990 (REL. 13, CREATED)
DT   01-OCT-1994 (REL. 30, LAST SEQUENCE UPDATE)
DT   01-OCT-1994 (REL. 30, LAST ANNOTATION UPDATE) . . .

 to: Ef.Prf  check: 813  from: 1  to: 459

(Peptide) PROFILEMAKE v4.40 of: ef.msf{*}  Length: 459
  Sequences: 25  MaxScore: 225.58  October 11, 1995 21:13
                          Gap: 1.00              Len: 1.00
                     GapRatio: 0.33         LenRatio: 0.10
            ef.msf{Eftu_Ecoli}  From: 8         To: 466       Weight: 1.00
            ef.msf{Eftu_Myctu}  From: 8         To: 466       Weight: 1.00 . . .

         Gap Weight:  4.500      Average Match:  0.200
      Length Weight:  0.050   Average Mismatch: -0.152

            Quality:  77.55             Length:    610
              Ratio:   0.17               Gaps:     15

 Selb_Ecoli x Ef.Prf       October 17, 1995 21:10  ..

                  .         .         .         .         .
S      1 MIIATAGHVDHGKTTLLQAITGVNA......................... 25
         .:|   ||||:||:|:  .|.   :
P      9 VVI...GHVDSGKSTTTGHLIYKYGGIDKRTIEKFEKEAAEMGKGSFKYA 55
                  .         .         .         .         .
S     26 ...DRLPEEKKRGMTIDLGYAYWPQPDGRVPGFIDVPGHEKFLSNMLAGV 72
            |:|.|||.||||||::.  | |:.    :.|| ||| :||:|||||
P     56 WVLDKLKEERERGITIDIALRKF.ETAKWYFTIIDAPGHRDFIKNMITGT 104
                  .         .         .         .         .
S     73 GGIDHALLVVACDDGVM.......AQTREHLAILQLTGNPMLTVALTKAD 115
         :: | |:||||...| :       :|||||. |    | : | |:::| |
P    105 SQADGAILVVAATDGEFEAGISKDGQTREHALLAWTLGVKQLIVAVNKMD 154
                  .         .         .         .         .
S    116 RV....DEARVDEVERQVKEVLREYGFAEAKLFITAATEGRGMDALREH. 160
          :    :|:| |||..||.: |.. || ... .   |. | . | : |
P    155 MVEPDYSEEWFEEIKKEVSDFLKKVGYLNPDKVPFVPISGFNGDNMLEPS 204
                  .         .         .         .         .
S    161 ......................LLQ.....LPEREHASQHSFRLAIDRAF 183
                               ||:     |:.:::: | :|||||| .|
P    205 DNMPWYKGSDAEWKEGILEGPTLLEALDAYIPPPERPTDKPLWLPLQDVY 254
                  .         .         .         .         .
S    184 TVKGAGLVVTGTALSGEVKVGDSLWLTGVNKPMRVRALHAQNQPTETANA 233
         .| | | |.:| . .| ||:|. | |.:..                   .
P    255 KIGGIGTVPVGRVETGVLKPGEVVTFAPAGVTTR..........KTVQGE 294
                  .         .         .         .         .
S    234 GQRIALNIAGDAEKEQINRGDWLLADVPPEPFTRVIVELQTHTPLTQWQP 283
          . |:|. .. .|  . |. || | :|. . : |: |  ..
P    295 VKSVEMHHEALDEAVPGDNVGFNVRGVSVKDIKRGNVAGDSK........ 336
                                  .
                                  .
                                  .
                  .         .         .         .         .
S    334 GARVVMLNPPRRGKRKPEYLQWLASLARAQSDADALSVHLERGAVNLADF 383
                |.|  .:   .|   |  |
P    337 .......NDPGSAKGAAKFTAQVIVLNKEEG................... 360
                  .         .         .         .         .
S    384 AWARQLNGEGMRELLQQPGYIQAGYSLLNAPVAARWQRKILDTLATYHEQ 433
                         :|| | .||. :    ||.   :| : ...   :
P    361 ...............GHPGQITNGYTPVLDCHTAHIACKFAEILEKL..D 393
                  .         .         .         .         .
S    434 HRDEPGPGRERLRRMALPMEDEALVLLLIEKMRESGDIH..SHHGWLHLP 481
         .| .     :      |  || |:| |   |    |..   .  ||| ::
P    394 RRSGKELEKEPENPKLIKSGDAAIVKLIPTKPLCVETFSEFPPLGRFAVR 443
                  .         .         .         .         .
S    482 DHKAGFSEEQQAIWQKAEPLFGDEPWWVRDLAKETGTDEQAMRLTLRQAA 531
         |                                        |. ||  |:
P    444 D........................................MGQTV..AV 451
                  .
S    532 QQGIITAIVK 541
           |||..| :
P    452 ..GVIKKVEK 459

///////////////////////////////////////////////////////////////////////
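
The Ratio printed in each alignment header above appears to be the Quality divided by the number of residues in the shorter of the two aligned segments. The following minimal Python sketch checks that reading against four of the alignments, taking the segment ends from the first and last coordinate lines of each one. The function name quality_ratio and the tuple layout are illustrative assumptions, not GCG output.

```python
def quality_ratio(quality, seq_from, seq_to, prf_from, prf_to):
    """Quality divided by the length of the shorter aligned segment."""
    seq_len = seq_to - seq_from + 1   # residues of the sequence (S lines)
    prf_len = prf_to - prf_from + 1   # positions of the profile (P lines)
    return quality / min(seq_len, prf_len)

# name: (Quality, S start, S end, P start, P end), read from the alignments
alignments = {
    "Ef1a_Enthi": (207.65, 3, 430, 1, 459),
    "Sttn_Rat":   (207.89, 3, 443, 1, 459),
    "Gst1_Human": (133.26, 71, 496, 2, 459),
    "Selb_Ecoli": (77.55, 1, 541, 9, 459),
}

for name, fields in alignments.items():
    print(f"{name:12s} ratio {quality_ratio(*fields):.3f}")
```

For the first three entries this should agree with the Quality Ratio values shown in the profilesegments session (0.485, 0.471 and 0.313). For Selb_Ecoli the profile segment is the shorter one, which is consistent with the 0.17 in its header.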

8) Optional work for extra credit.

Repeat the same type of analysis with EF 2-G that I illustrated with the representative list of EF-1-Tu sequences. Find an EF 2-G protein sequence name, run a quick BLAST search with it, and prepare a representative list of ten to twenty protein sequences from the BLAST output. Prepare a multiple sequence alignment of them and create a profile consensus sequence from the alignment. Name the consensus sequence (your lastname).ef2_cons. However, do not submit a ProfileSearch with the EF 2-G alignment. If you do the extra credit portion of the exercise, be sure to send the EF 2-G consensus sequence to the teacher account in order to receive credit for the optional work.

9) Conclusions and evaluations; finishing up.

To get credit for this lab, you need to complete the exercise report form that was copied into your directory at the beginning of the tutorial. Use the mv command to rename it with your last name, keeping the extension .week11, and then open the file with the pico editor to fill in the report.

% mv week11.week11 (your lastname).week11

% pico (your lastname).week11

In addition to the report form, I also want to see the Figure files from your Elongation Factor subset alignment, the similarity dendrogram and the similarity plot, and the extra credit output if you chose to tackle it. All required files should have the same filename (i.e. your lastname) but unique, identifying extensions. Because of this, you can rcp them all at once with a wildcard. Remote copy the files to teacher:

% rcp (your lastname).* teacher@ribozyme:receive

This concludes your computing session for this week. Log off ribozyme, get out of the emulator and back to the overlapping windows screen.

% exit

Press the <alt> and <x> keys together. You will be asked if you really want to exit the program. Respond with <y> to get out of the teemtalk emulator and return to the overlapping windows screen.

References

ECDC. The E.coli Data Collection. http://susi.bio.uni-giessen.de/usr/local/www/html/ecdc.html.

Gribskov, M., McLachlan, A.D., and Eisenberg, D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. U.S.A. 84, 4355-4358.

Gupta, S.K., Kececioglu, J., and Schaffer, A.A. (1995) Making the Shortest-Paths Approach to Sum-of-Pairs Multiple Sequence Alignment More Space Efficient in Practice, to appear in Proc. 6th Annual Combinatorial Pattern Matching conference (CPM '95).

Madsen, H.O., Poulsen, K., Dahl, O., Clark, B.F., and Hjorth, J.P. (1990) Retropseudogenes constitute the major part of the human elongation factor 1 alpha gene family. Nucleic Acids Research 18, 1513-1516.

Smith, R.F. and Smith, T.F. (1992) Pattern-Induced Multi-sequence Alignment (PIMA) algorithm employing secondary structure-dependent gap penalties for comparative protein modelling. Protein Engineering 5, 35-41.

Thompson, J.D., Higgins, D.G., and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22, 4673-4680.