'97 BC/BP 578

Week 8

Sequence Series

Advanced Database Similarity Searching: What's available, the methods and the algorithms -- motifs, hashing techniques and heuristics, dot matrix analysis, multiple sequence alignments, and profiles.

Author:

Steven M. Thompson

Introduction

I've got a sequence: now what? Search the databases -- what shows up?

What do database searches tell us and what can we gain from them? Why even bother? If our new sequence, or portions of it, falls into a preexisting group then we can gain knowledge of its function and possibly even its structure, plus sometimes a confirmation of the sequence's validity. Database searches can often provide valuable insights into enzymatic mechanism and even evolution. Are there any "families" that your newly discovered sequence falls into? Even if no similarity can be found, the very fact that your sequence is new and different could be very important. Granted, it's going to be a lot more difficult to discover functional and structural data about it, but in the long run its characterization might prove very rewarding.

Most database searching programs use a couple of tricks to make things happen faster. These tricks fall into two main categories, that of hashing and approximation. Hashing is the process of breaking your query sequence into small "words" or "ktuples" of a set size and creating a "look-up" table with those words keyed to numbers. Then when any of the words match part of an entry in the database, that match is saved. In general, hashing reduces the complexity of the search problem from N2 for dynamic programming to N, the length of all the sequences in the database. Approximation techniques are collectively known as "heuristics." Webster's defines heuristic as "serving to guide, discover, or reveal; . . . but unproved or incapable of proof." In database searching techniques the heuristic usually restricts the necessary search space by calculating some sort of a statistic that allows the program to decide whether further scrutiny of a particular match should be pursued. This statistic may miss things depending on the parameters set -- that's the heuristic.

Database searching is far more sensitive at the amino acid level than at the DNA level because proteins have twenty match criteria versus DNA's four. This drastically cuts down the "noise" level of the search. Therefore, if you are dealing with coding sequence, it is usually prudent to search at the protein level. However, sometimes a preliminary search of the DNA databases with a DNA query will yield valuable information such as the important questions: "Am I working on something that's already been done; is it worth continuing?" or, "Am I doing something totally unique -- or maybe, just `junk' DNA?" An early nucleotide search can often point out the necessity of continued research or the contemplation of abandonment, before any translational analysis has been completed. This could prove to be a tremendous time saver.

Even though protein searching is more sensitive, the DNA databases are much larger. This drawback has been partly overcome with programs which can take a protein query and compare it to translated nucleotide databases, but one still needs to know if the translation is "real." So, even though there are advantages and disadvantages to both types of searching, the general rule is to query with a peptide sequence, if at all possible, and screen whichever database you choose.

One small database that you should screen just as a matter of course is PROSITES. The GCG program Motifs performs this search. Motifs searches for recognized structural, regulatory and enzymatic consensus sequences in the PROSITE Dictionary of Protein Sites and Patterns (Bairoch, 1992). This approach is wonderful for trying to ascertain function in an unidentified peptide sequence, but keep in mind the inherent problems of consensus style searches explored last week. The program can tolerate mismatches with a mismatch option and it displays an abstract with selected references for each motif signature found. Another small database that should not be ignored is NRL_3D. This database contains all the sequences from the three-dimensional coordinate database PDB and, thus, can serve as a link between structural and sequence based methods.

A very powerful approach that should always be utilized, if at all possible, is the Profile suite (Gribskov, et al., 1987). This strategy works best when one has prepared and refined a multiple sequence alignment of significantly similar sequences or, even better, regions within sequences. PROFILE searching involves forming a "profile" from this set of related sequences and then searching the databases with that profile. Profile searching is tremendously powerful and should be pursued whenever possible. A very appropriate strategy would be to find similar genes to a newly sequenced gene using traditional database searching techniques, and then align all of the significantly similar proteins or protein domains using the multiple sequence alignment program PileUp. One could then run the aligned sequences through the Profile package to generate a profile of the family.

Often Profile analysis can show features not obvious to individual members. A distinct advantage is evolutionary issues are considered by virtue of the Profile algorithms in further manipulations and database searches. Gaps are penalized more heavily in conserved areas than they are in variable regions and the more highly conserved a residue is, the more important it becomes. Generated consensus sequences are not based merely on the positional frequency of particular residues but rather utilize the evolutionary conservation of substitutions based on the BLOSUM62 (Henikoff and Henikoff, 1992) table (by default, other substitution matrices can also be specified).

A big question and a very common mistake that is made in this whole area of searching and alignment is the concept of homology versus similarity: There is a huge difference! Similarity is merely a statistical parameter which describes how much two sequences, or portions of them, are alike according to some set scoring criteria. Homology, by definition, implies an evolutionary relationship -- more than just the fact that we've all evolved from the same old pond scum. You need to be able to demonstrate some type of lineage between the organisms or genes of interest in order to claim homology. Even better, be able show some experimental evidence, morphological or genetic, that corroborates your assertion. There is really no such thing as percent homology; something is either homologous or it is not. Do not make the mistake of calling any old sequence similarity homology, it will get you in trouble with a lot of scientists, especially the evolutionists of the world.

How do you tell if a similarity is significant or is truly homologous? Many of the programs generate percent similarity scores, however these really don't mean a whole lot. The "quality" score means a lot more but is hard to interpret. At least it takes the length of similarity, all of the necessary gaps introduced, and the matching of symbols all into account. To get a better handle on what these various scores mean, read the algorithm section of the GCG Program Manual for the various methods -- statistics are always confusing but the descriptions do help. A couple of the programs generate histograms of score distributions, but again, they can be confusing. One of the best ways of deciding significance is to take advantage of the Monte Carlo randomizations option available in the two comparison programs BestFit and Gap. To utilize this strategy, pull the sequence you wish to evaluate from your search output list and compare it to your query using the appropriate algorithm, either Gap or BestFit depending on whether you're trying to compare the entire length of each sequence or only the best regions of similarity to each, respectively, and specify -randomizations=100 on the command line. This option jumbles the second sequence of the comparison 100 times and then generates scores and a standard deviation based on the jumbled matches. You can then compare the random scores to the unjumbled score using a "Z score" calculation to help decide significance. The BLAST (Altschul, et al., 1990) and ProfileSearch algorithms use a similar approach but base their statistics on a comparison to all the rest, "insignificantly similar," members of the database being searched.

Another powerful method that should always be considered in similarity analysis is the dot matrix procedure. In dot matrix analysis one sequence is plotted on the vertical axis against another on the horizontal axis using a very simple approach; wherever they match according to some scoring system that you specify a dot is generated. Dot matrix analysis can point out areas of similarity between two sequences that all other methods might miss. This is because most other methods align either the overall length of two sequences or just the "best" parts of each to achieve one optimal alignment. Dot matrix methods enable the operator to visualize the entirety of both sequences; if you will, it allows the "Gestalt" of the alignment to be seen. However, their interpretation is entirely up to the user -- you must know what the plots mean and how to successfully filter out extraneous background noise. Using this method correctly you can identify areas within sequences that happen to have significant matches that no other method would ever notice. Again, this is a method which you would perform after running initial searches as it only compares one sequence against another, not the entire database.

Many database searches work best when submitted as a batch process. This is because of the sizes of the databases of interest. We currently have over a million sequences in our nucleotide databases. In spite of the fast hashing style algorithms incorporated, most programs take quite a while to search through that much data. There is no way you want to wait in front of a unusable terminal while the computer cranks away at work comparing your query to that many sequences, therefore, take advantage of batch capablities. Most of the GCG database searches accept a batch option which is really handy. An exception to the standard "submit the search and wait" style of most database searching is the BLAST program. This program uses an extremely fast heuristic statistical hashing algorithm on a large parallel computer located at the National Center for Biotechnological Information. Typical searches run in just a minute or two, after you get through the waiting queue, however, realize the limitations of this algorithm to finding only ungapped areas of similarity and of not being optimized for nucleic acid sequences.

Exercise Series #8: A "Real-Life" Project Oriented Approach.

Database Searching: How and where do we start?

I will return to the prion example for this exercise. As in the previous two exercises, use your own Selected Molecule off the list to complete all of the following steps. Log on to your account and initialize GCG. Create a new directory for this exercise and move down into it. Next, copy your translated output file from the actual database entry containing your gene of interest from last week's gene finding session into this new directory. This file should have the .pep extension and it should be identical to its counterpart in the protein sequence databases.

1) Searching the PROSITES database.

A Quick and Dirty Method--MOTIFS search of PROSITES.

One of the quickest and easiest databases to search with your newly translated protein is Dr. Amos Bairoch's Protein signature database. This can lead to immediate answers and routes of investigation. It should always be utilized -- it's just too fast and simple to ignore.

Start the Motifs program by typing motifs; specify your .pep sequence's name. The prion example follows below:

% motifs humprp.pep

Motifs looks for sequence motifs by searching through proteins for the
patterns defined in the PROSITE Dictionary of Protein Sites and
Patterns.  Motifs can display an abstract of the current literature on
each of the motifs it finds.

 What should I call the output file (* humprp.motifs *) ?  <rtn>

          humprp.pep len:        231 .........................

             Total finds:          2
            Total length:        231
         Total sequences:          1
          CPU time (sec):      02.79

             Output file: "/disk2/usr/local/people/thompson/BC578/EX8/humprp.motifs"

% more humprp.motifs

 MOTIFS from: humprp.pep

 Mismatches: 0                February  8, 1996 17:08  ..


          humprp.pep  Check: 414   Length: 231   ! TRANSLATE of: humprp.genbank
check: 4378 from: 116 to: 808

______________________________________________________________________________

Prion_1               AGAAAAGAVVGGLGGY
            91: NMKHM AGAAAAGAVVGGLGGY MLGSA
****************************
* Prion protein signatures *
****************************

Prion protein (PrP) [1,2,3] is a  small glycoprotein found in high quantity in
the brains of  humans  or animals  infected  with  a  number  of  degenerative
neurological diseases such as  Kuru, Creutzfeldt-Jacob disease (CJD),  scrapie
or bovine spongiform encephalopathy (BSE).   PrP is encoded in the host genome
and expressed  both  in  normal  and  infected  cells.  It  has  a tendency to
aggregate yielding polymers called rods.

Structurally, PrP is  a  protein consisting  of  a signal peptide, followed by
an N-terminal  domain  that contains tandem repeats of a short motif (PHGGGWGQ
in mammals, PHNPGY in chicken), itself followed  by  a highly conserved domain
of about  140  residues  that  contains a  disulfide bond.  Finally comes a C-
terminal hydrophobic domain post-translationally  removed when PrP is attached
to the extracellular side of the cell membrane by a GPI-anchor.  The structure
of PrP is shown in the following schematic representation:

 +---+----------------+-******-------------------****-----+-----+
 |Sig| Tandem repeats |                    C        C    S|     |
 +---+----------------+--------------------|--------|----|+-----+
                                           +--------+    |
                                                         GPI

'C': conserved cysteine involved in a disulfide bond.
'*': position of the patterns.

As signature pattern for PrP,  we  selected a perfectly conserved alanine- and
glycine-rich region of 16 residues as well  as a region centered on the second
cysteine involved in the disulfide bond.

-Consensus pattern: A-G-A-A-A-A-G-A-V-V-G-G-L-G-G-Y
-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in SWISS-PROT: NONE.

-Consensus pattern: E-x-[ED]-x-K-[LIVM](2)-x-[KR]-[LIVM](2)-x-[QE]-M-C-x(2)-
                    Q-Y
                    [C is involved in a disulfide bond]
-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in SWISS-PROT: NONE.

-Last update: December 1992 / Patterns and text revised.

[ 1] Stahl N., Prusiner S.B.
     FASEB J. 5:2799-2807(1991).
[ 2] Brunori M., Chiara Silvestrini M., Pocchiari M.
     Trends Biochem. Sci. 13:309-313(1988).
[ 3] Prusiner S.B.
     Annu. Rev. Microbiol. 43:345-374(1989).
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Prion_2               Ex(E,D)xK(L,I,V,M)2x(K,R)(L,I,V,M)2x(Q,E)MCx2QY
                            Ex(D)xK(M){2}x(R)(V){2}x(Q)MCx{2}QY
           178: GENFT               ETDVKMMERVVEQMCITQY               ERESQ

Find reference above under sequence: humprp.pep, pattern: Prion_2.
Extensive abstract and reference lists follow the identified sequence locations for each site. This information can save anybody a tremendous amount of work! The sites themselves are shown with their sequence locations below each consensus pattern. More sites will be listed if you specify the frequent option. Notice in my example that Dr. Bairoch included the prion signature in his database; the characteristic repeat is only found in the prion protein.

2) Traditional database searching: FastA and Word approaches -- running the algorithms.

Two different symbol matching algorithms have traditionally been utilized in database searching. These two algorithms (see the GCG Program Manual for details) are incorporated into GCG's FastA (Pearson and Lipman, 1988) and WordSearch (Wilbur and Lipman, 1983) programs Since the algorithms differ, the output results will also differ. As in all types of computerized molecular biology analyses, the prudent will run as many strategies as practical and try to interpret the results in light of this. Most of the following programs accept GCG's batch option; pay attention to the necessity of running these programs in the background -- otherwise you will be stuck at an unusable terminal!

A good solution: TFASTA -- takes advantage of the sensitivity of a protein query and the size of the nucleic acid databases.

TFastA is one of the most robust of the searching programs onsite. It compares your peptide sequence against translations of the DNA database. This way you can take advantage of the multitude of DNA sequences that never make it to the protein databases and yet still retain the sensitivity of protein searches.

Type tfasta -check -batch to start the Translation FastA program in batch mode and see all of the available options. At the command option prompt add -noalign to suppress all the alignments since most are redundant for our purposes and we will be investigating the more interesting ones later anyway. Some of the other options can be very helpful depending on your specific situation and should be explored in your own research. The optall option in particular is very useful and is now the program default. FastA sorts its output based on a normalized derivative of the optimum score, the result of the final dynamic programming pass in FastA, rather than the initn score, the longest combined word score. Specify your peptide's sequence name at the program's sequence prompt. Accept all of the program defaults including the Expectation Function cutoff value. This should insure that the listing goes beyond those entries which we would expect for these well studied protein molecules. The prion example session follows below:

% tfasta -check -batch

TFastA does a Pearson and Lipman search for similarity between a
query peptide sequence and any group of nucleotide sequences. TFastA
translates the nucleotide sequences in all six reading frames before
performing the comparison.  It is designed to answer the question, "What
implied peptide sequences in a nucleotide sequence database are similar
to my peptide sequence?"

Minimal Syntax: % tfasta [-INfile1=]ggamma.pep -Default

Prompted Parameters:

[-INfile2=]GenEMBL:*           search set (all of GenEMBL)
[-OUTfile=]ggamma.tfasta       output file name
-BEGin=1 -END=148              range of interest
-WORdsize=2                    word size
-EXPect=2.0                    lists scores until E() value reaches 2.0

Local Data Files:

-MATRix=blosum50.cmp           scoring matrix for peptides

Optional Parameters:

 Press q to quit or <Return> for more:  <rtn>
-GAPweight=16.0    gap creation penalty
-LENgthweight=4.0  gap extension penalty
-SINce=6.90        limits search to sequences dated on or after June 1990
-THREEFrames       translates and searches only the three forward reading
                     frames
-FRAme=1           translates and searches only the frame specified.
-NOPAMfactor       uses a constant factor to calculate initial diagonal scores
-LIStsize=40       shows the best 40 scores (overrides EXPect)
-NOATTRibutes      suppresses writing the Begin, End, and Strand
                     list attributes to the list of best scores
-ALIgn=20          shows the best 20 alignments
-NOALIgn           suppresses sequence alignments
-OPTall=20         immediately computes opt score when the initn score is 20
                     or higher; sorts on opt score
-NOOPTall          doesn't compute opt score during search; sorts on initn
-SWalign           does final alignment as Smith-Waterman
-SHOWall           shows complete sequences in alignment, not just overlaps
-MARKx=3           determines the alignment display mode
-NOHIStogram       suppresses printing the histogram
-LINEsize=60       number of sequence symbols per line of the alignment
-NODOCLines        suppresses sequence documentation in the alignment
-NOMONitor         suppresses the screen trace for each search set sequence
-BATch             submits the program to run in the batch queue
 Press q to quit or <Return> for more:  <rtn>
-MINLength=1000    searches only sequences of 1000 or more residues
-MAXLength=5000    searches only sequences of 5000 or fewer residues

 Add what to the command line ?  -noalign

 TFASTA with what query sequence ?  humprp.pep

                  Begin (* 1 *) ?  <rtn>
                End (*   231 *) ?  <rtn>

 Search for query in what sequence(s) (* GenEMBL:* *) ?  <rtn>

 What word size (* 2 *) ?  <rtn>

 Don't show scores whose E() value exceeds: (* 10.0 *):  <rtn>

 What should I call the output file (* humprp.tfasta *) ?  <rtn>

 ** tfasta will run as a batch or at job.

 ** tfasta was submitted using the command:
    "  batch  "

warning: commands will be executed using /bin/sh
job 848620194.b at Thu Nov 21 15:49:54 1996
The job will submit itself and record a log file of any problems encountered during its run. This log file will be e-mailed to you by the system; it can prove handy with debugging attempts if anything goes wrong. You should proceed with the rest of the exercise at this point.

Contrast with normal FASTA.

Now run the normal FastA program searching the SwissProt:* logical sequence specification. Start the program by typing fasta -batch -noalign (the options are the same as in TFastA) to again run the program in batch mode and suppress alignments. Specify your sequence's filename.ext and accept the program's SwissProt database search. Accept the rest of the defaults again. The prion example is illustrated below:

% fasta -batch -noalign

FastA does a Pearson and Lipman search for similarity between a query
sequence and a group of sequences of the same type (nucleic acid or
protein). For nucleotide searches, FastA may be more sensitive than BLAST.

 FASTA with what query sequence ?  humprp.pep

                  Begin (* 1 *) ?  <rtn>
                End (*   231 *) ?  <rtn>

 Search for query in what sequence(s) (* SwissProt:* *) ?  <rtn>

 What word size (* 2 *) ?  <rtn>

 Don't show scores whose E() value exceeds: (* 10.0 *):  <rtn>

 What should I call the output file (* humprp.fasta *) ?  <rtn>

 ** fasta will run as a batch or at job.

 ** fasta was submitted using the command:
    "  batch  "

warning: commands will be executed using /bin/sh
job 848620634.b at Thu Nov 21 15:57:14 1996
Again the batch option automatically submits the program to the background and you should proceed with the exercise.

Repeat the FastA search specifying NRL_3D as the database to search. You do not need to run this search in batch because NRL_3D is relatively small. Be sure not to use the default output file name as this would overwrite your previous FastA search. This search should pull out the corresponding sequence from the 3-dimensional structural coordinate database. Remember NRL_3D is a "link" to PDB which can be used to specify these sequences in GCG programs and point to whole new vistas of investigation.

Using WORDSEARCH.

The last traditional database search we will submit is called WordSearch. Unlike FastA, WordSearch can find more than one region of similarity between two sequences. Just like the previous two, it will accept automatic batch submission. Type wordsearch -batch to begin the program; additionally, take advantage of this program's graphics capability by specifying -plot and -figure on the command line. Give the program the sequence's name and accept the logical SwissProt:* as the sequences to be searched. Specify a list size of 500. An option this program will accept, but that we will not be using this time, is simplify. This option simplifies the input sequences according to a scheme which generalizes amino acids into broad classes based on their chemical characteristics. Another helpful option available is mask. This option allows you to search with only certain characters of your sequence such as only the first and second codon positions in coding DNA. The example session with the prion protein follows below:

% wordsearch -plot -batch -figure

WordSearch identifies sequences similar to a query sequence using a
Wilbur and Lipman search.  WordSearch answers the question,
"What sequences in the database are similar to my sequence?"  The output
is a list of significant diagonals whose alignments can be displayed
with Segments.

  Process set to plot with VERSATERM-TEK4105 attached to term
  using the tekd graphic interface.

 WORDSEARCH with what query sequence ?  humprp.pep

                  Begin (* 1 *) ?  <rtn>
                End (*   231 *) ?  <rtn>

 Search for query in what sequence(s) (* SwissProt:* *) ?  <rtn>

 What word size (* 2 *) ?  <rtn>

 List how many best diagonals (* 50 *) ?  500

 Integrate how many adjacent diagonals (* 3 *) ?  <rtn>
/
 What should I call the output file (* humprp.word *) ?  <rtn>

 ** wordsearch will run as a batch or at job.

 ** wordsearch was submitted using the command:
    "  batch  "

warning: commands will be executed using /bin/sh
job 848621491.b at Thu Nov 21 16:11:31 1996job 848621491.b at Thu Nov 21 16:11:31 1996
3) Running ProfileSearch.

The Profile Method: PROFILESEARCH.

Dr. Gribskov et al. (see references in GCG program manual) have assembled an elegant package for associating distantly related proteins and discovering structural motifs in Profile analysis. John Devereux has written an excellent overview essay of the method in the GCG program manual; please take the time to read this section!

We should actually be running a Profile analysis of a family assembled from the significant results of the previous searches; however, we just don't have the time. Therefore, we will take advantage of the Profile we assembled back in Exercise 5 when we constructed our probes/primers. Copy your .prf file from your Exercise 5 directory into the present directory. My copy example follows (yours will be different):

% cp ~/courses/BC578/EX5/prion.prf .

This program also accepts automatic batch submission so it can be launched in a similar manner as the previous three. Type profilesearch -batch -check to launch the program in batch mode and see all the available options. Answer the prompts as above, however, take advantage of the minlist option. This option allows you to cut-off the search list at a particular Z score; in this case set it equal to 4.0. We will again run the search against the SwissProt database. My screen trace follows:

% profilesearch -batch -check

ProfileSearch uses a profile (representing a group of aligned
sequences) as a query to search the database for new sequences with
similarity to the group.  The profile is created with the program
ProfileMake.

Minimal Syntax: % profilesearch [-INfile1=]hsp70.prf -Default

Prompted Parameters:

[-INfile2=]SW:*          the search set
-GAPweight=4.50          maximum position-specific gap creation penalty
-LENgthweight=0.05       maximum position-specific gap extension penalty
[-OUTfile=]hsp70.pfs     output file name

Local Data Files: None

Optional Parameters:

-LIStsize=100           sets a limit to the size of the output list
                             (60,000 is the default)
-SEQLimit=100000         maximum number of sequences to search
                             (can't be set higher than 100,000)
-MINList=2.5            lowest Z score to report in output list.
 Press q to quit or <Return> for more:  <rtn>
-MINSeq=50              minimum length sequence to examine in the search
-MINNorm=50             minimum length sequence to use in normalization
-NONORmalize            turns off normalization of comparison scores
-NOAVErage              does not adjust quality scores for sequence
                             composition
-NOSEArch               normalizes a pre-existing file of search scores
-FITfile[=hsp70.fit]    output file with curve fitting information for
                             normalized scores
-CPUlimit=1000          limits the search to 1,000 seconds (default
                             is 86,400)
-SINce=6.90             limits search to sequences dated on or after
                             June 1990 (does not work for PIR databases)
-NOMONitor              suppresses the screen trace for the search set
                             sequences
-BATch                  submits program to the batch queue

 Add what to the command line ?  -minlist=4.0

 PROFILESEARCH  version 4.50     November 21, 1996 16:17

 PROFILESEARCH with what query profile ?  prion.prf

 "prion.prf" is a profile of length: 251

 Search for query in what sequences(s) (* SwissProt:* *) ?  <rtn>

 What is the gap creation penalty (* 33.0 *) ?  <rtn>

 What is the gap extension penalty (* 0.37 *) ?  <rtn>

 What should I call the output file (* prion.pfs *) ?  <rtn>

 ** profilesearch will run as a batch or at job.

 ** profilesearch was submitted using the command:
    "  batch  "

warning: commands will be executed using /bin/sh
job 848621889.b at Thu Nov 21 16:18:09 1996
We did refine our Profile to only include the more conserved portion of the alignment by trimming the divergent amino and carboxy termini back in Exercise 5 which is good; however, its members should be appropriately weighted to even out their contributions. The Profile refinement procedure, including repeatedly searching the databases and including or excluding members and adjusting their weights as well as adjusting the Profile's length, is known as validating the Profile. If using Profile analysis in your own research, following the validation procedures outlined in the GCG Program Manual in the ProfileScan description is a very prudent idea, but we do not have the time for that now. As it is, much of the exercise is going to have to wait for the batch jobs to finish anyway.

The ProfileSearch batch job may not be done for a couple days under ribozyme's present load. Don't worry about it until next week and then we can take a look at the results. Proceed with the rest of the exercise at this point.

4) Revisit BLAST; Internet similarity searching.

The BLAST server at NCBI can provide the most up to date and quickest database search available. BLAST is a heuristic algorithm for searching sequence databases developed by the National Center for Biotechnology Information at the National Library of Medicine. The acronym stands for Basic Local Alignment Search Tool. BLAST by default runs on an eight processor parallel computer system at NCBI. It only looks for ungapped segments; however, it ranks matches statistically and provides probability values for each to help evaluate significance. As such, it is best for identifying shorter regions of high similarity -- exactly what you might want with a sequence of unknown function. It is very fast, at least an entire order of magnitude over traditional sequence similarity database searching, yet maintains all of the sensitivity of the old methods for local similarity regions with protein sequences! You'll get your reply in a manner of minutes in most cases. Another advantage with BLAST is it not only shows you the best alignment for each similar sequence found (as in the BestFit type alignments of FastA) but also shows the next best alignments for each up to a certain preset cutoff point. This combines the power of dot-matrix type analyses and the interpretative ease of traditional sequence alignments. One can fine-tune BLAST by altering its operating parameters and taking advantage of the many options available in it; however, BLAST is not appropriate for comparing nucleotide sequences against the nucleotide database. When you are forced to perform this type of nucleotide-to-nucleotide search it is usually best to use FastA instead.

BLAST accesses the latest (GenBank updates every night) database by default, nucleotide or protein. The GCG implementation of BLAST by default runs in a remote client-server mode such that NCBI's database and computer perform the analysis; alternatively you can run it locally if you have custom assembled databases that you wish to screen.

For help in interpreting BLAST results refer to the GCG BLAST documentation, the BLAST HELP file obtained off the web or by sending the single word "HELP" to BLAST@ncbi.nlm.nih.gov (leave the subject line blank), and the suggested text book for the course. You ran a sample BLAST search back in Exercise 4; run blast -check on your selected protein now. Specify the -filter=xs option. This is very powerful for screening out low complexity and repeat sequences from your query to minimize confusion due to random noise. The prion protein is very rich in these types of confounding sequence and so is a very appropriate candidate for this option. Also specify -segments=0, to suppress segment alignments and hence reduce the size of the output file. The standard output file is very long because BLAST automatically aligns the best 100 matches. Accept the default database choice, 1) nr p which stands for "NonRedundant Protein," to search all of the unique sequences from the most current versions of all of the protein databases. Set the listsize parameter to 500. BLAST has recently become quite busy; because of this, you may have to wait for a few moments in a queue. A screen trace of a BLAST session with the prion protein follows:

% blast -check

BLAST searches for sequences similar to a query sequence. The query and the
database searched can be either peptide or nucleic acid in any combination.
BLAST can search databases on your own computer or databases maintained at
the National Center for Biotechnology Information (NCBI) in Bethesda,
Maryland, USA.

Minimal Syntax: % blast [-INfile1=]SW:Zea2_Maize  -Default

Prompted Parameters:

[-INfile2=]swissprot     database to search
-EXPect=10.0             ignores scores that would occur by chance
                           more than 10 times
-LIStsize=250            maximum number of sequences listed in the output
[-OUTfile=]zea2.blastp   output file name

Local Data Files:

[-DATa1=BLAST.rDBs]      list of available remote databases
[-DATa2=BLAST.lDBs]      list of available local databases
[-DATa3=BLAST.sDBs]      list of available site-specific databases

Optional Parameters:
 Press q to quit or <Return> for more:  <rtn>

-TOPstrand                   searches only the top strand of nuc. query
-BOTtomstrand                searches only the bottom strand of nuc. query
-DBTOPstrand                 translates only top strand of nuc. database
-DBBOTtomstrand              translates only bottom strand of nuc. database
-FILter=xs                   filters low complexity segments out of query
-MATrix=PAM120[,PAM250...]   specifies one [or more] protein scoring matrix
-TBLASTX                     if query and database are both nucleotide,
                               translates both and does protein comparisons
-TRANSlate=1                 genetic code for translating query
-DBTRANSlate=1               genetic code for translating database
-REMoteonly                  searches only remote databases (simplifies menu)
-LOCalonly                   searches only local databases   (   "        " )
-DBNucleotideonly            searches only nucleic databases (   "        " )
-DBProteinonly               searches only protein databases (   "        " )
-SERver=cruncher.nlm.nih.gov runs remote search on this particular host
-WORdsize=7                  sets word size (primarily for nuc-nuc searches)
-SYNonym=10                  sets minimum score for equivalent words
-MATCH=5                     scoring matrix value for nucleotide matches
-SEGments=100                displays segment pairs from top 100 sequences
-HIStogram                   shows histogram of hits versus expectation level
-APPend="string"             appends "string" to pass-through command line
-BATch                       submits program to batch queue

 Add what to the command line ?  -filter=xs -segments=0

 BLAST search with what query sequence?  humprp.pep

 Search for query in what sequence database:

   1) nr          p Non-redundant GenBank CDS translations+PDB+SwissProt+PIR
   2)   pdb       p PDB protein sequences
   3)   swissprot p SwissProt sequences
   4) yeast       p Saccharomyces cerevisiae protein sequences
   5) kabat       p Kabat Sequences of Proteins of Immunological Interest
   6) alu         p Translations of Select Alu Repeats from REPBASE
   7) month       p All new or revised GenBank CDS translation+PDB+SwissProt+PI
   8) nr          n Non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST's
   9)   pdb       n PDB nucleotide sequences
  10)   vector    n Vector subset of GenBank
  11) yeast       n Saccharomyces cerevisiae genomic nucleotide sequences
  12) est         n Non-redundant Database of GenBank+EMBL+DDBJ EST Division
  13) sts         n Non-redundant Database of GenBank+EMBL+DDBJ STS Division
  14) gss         n Genome Survey Sequences
  15) mito        n Database of mitochondrial sequences, Rel. 1.0, July 1995
  16) kabat       n Kabat Sequences of Nucleic Acid of Immunological Interest
  17) epd         n Eukaryotic Promotor Database
  18) alu         n Select Alu Repeats from REPBASE
  19) month       n All new or revised GenBank+EMBL+DDBJ+PDB sequences released

 Please choose one (* 1 *):  <rtn>

 Ignore hits expected to occur by chance more than (* 10.0 *) times?  <rtn>

 Limit the number of sequences in my output to (* 250 *) ?  500

 What should I call the output file (* humprp.blastp *) ?  <rtn>


 Trying cruncher.nlm.nih.gov (130.14.25.175)

 Connected to cruncher.nlm.nih.gov
Waiting for 2 of 5 other BLAST requests on the system to finish.
Still waiting
.
 Search in progress on the network server.

 ................................
WARNING:  HSPs invo.lving 67 database sequences were not reported due to the
          limiting value of parameter B = 1.

...

 Done!
The complete prion BLAST search reply follows:
% more humprp.blastp

National Center for Biotechnology Information (NCBI)

Experimental GENINFO(R) BLAST Network Service (Cruncher)

Fri Nov 22 13:20:44 EST 1996, Up 10 days, 3 mins, load: 13.38, 11.72, 11.63

  If results of this search are reported or published, please mention that
  the computation was performed at the NCBI using the BLAST network service.
  Problems with the service should be reported to a local system administrator.

PEPTIDE SEQUENCE DATABASES

 nr        Non-redundant GenBank CDS translations+PDB+SwissProt+PIR
 pdb       PDB protein sequences
 kabat     Kabat Sequences of Proteins of Immunological Interest
 alu *     Translations of Select Alu Repeats from REPBASE
 month     All new or revised GenBank CDS translation+PDB+SwissProt+PIR
           sequences released in the last 30 days
 swissprot SwissProt sequences

NUCLEOTIDE SEQUENCE DATABASES

 nr       Non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST's or
          STS's)
 est +    Non-redundant Database of GenBank+EMBL+DDBJ EST Division
 sts +    Non-redundant Database of GenBank+EMBL+DDBJ STS Division
 pdb      PDB nucleotide sequences
 vector   Vector subset of GenBank
 mito *   Database of mitochondrial sequences, Rel. 1.0, July 1995
 kabat    Kabat Sequences of Nucleic Acid of Immunological Interest
 epd      Eukaryotic Promotor Database
 alu *+   Select Alu Repeats from REPBASE
 month    All new or revised GenBank+EMBL+DDBJ+PDB sequences released in
          the last 30 days

  * Databases that are not accessible through the NCBI Retrieve E-mail server.
  + The TBLASTX program is restricted to searching these databases.
==============================================================================
 You can obtain the BLAST documentation files, send a message consisting of
 just the word ''help'' (without the quotes) to: blast@ncbi.nlm.nih.gov
 Last modification dates: August 10th 95 for the E-mail server help, January
 19th 94 for the BLAST manual and February 7th 95 for the BLAST FAQ.
==============================================================================
 For a free subscription to "NCBI News", the NCBI newsletter, send a request
 along with your name and postal mailing address to: info@ncbi.nlm.nih.gov
==============================================================================
 A new GenBank sequence submission tool, called BankIt, is now available
 through the NCBI's home page on the World Wide Web. The URL is
 http://www.ncbi.nlm.nih.gov/
==============================================================================

Waiting for 2 of 5 other BLAST requests on the system to finish.
Still waiting
BLASTP 1.4.9MP [26-March-1996] [Build 14:27:01 Apr  1 1996]

Reference:  Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers,
and David J. Lipman (1990).  Basic local alignment search tool.  J. Mol. Biol.
215:403-10.

Query=  TITLE humprp.pep
        (231 letters)
>lcl|TITLE humprp.pep
KKRPKPGGWNTGGSRYPGQGSPGGNRYXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXTHSQXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXMLGSAMSRPIIHFG
SDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCVNITIKQHXXXXXXXXXXXXXXD
VKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSXXXXXXXXXXXXXXVG

Database:  Non-redundant GenBank CDS translations+PDB+SwissProt+SPupdate+PIR
           230,895 sequences; 65,241,596 total letters.
Searching..................................................done

                                                                     Smallest
                                                                       Sum
                                                              High  Probability
Sequences producing High-scoring Segment Pairs:              Score  P(N)      N

sp|P04156|PRIO_HUMAN MAJOR PRION PROTEIN PRECURSOR (PRP) ...   506  2.1e-87   2
pir||A05017          major prion PrP27-30 protein precurs...   506  2.2e-87   2
pir||A40372          major prion protein PrP variant 1 - ...   506  2.2e-87   2
gi|190520            (M81930) prion protein [Homo sapiens]     506  2.3e-87   2
gi|190518            (M81929) prion protein [Homo sapiens]     506  2.4e-87   2
sp|P40252|PRIO_GORGO MAJOR PRION PROTEIN PRECURSOR (PRP) ...   503  5.4e-87   2
gi|563209            (U15166) major prion protein precurs...   503  5.4e-87   2
sp|P40253|PRIO_PANTR MAJOR PRION PROTEIN PRECURSOR (PRP) ...   498  2.7e-86   2
sp|P40256|PRIO_PONPY MAJOR PRION PROTEIN PRECURSOR (PRP) ...   490  3.4e-85   2
sp|P40251|PRIO_COLGU MAJOR PRION PROTEIN PRECURSOR (PRP) ...   485  1.7e-84   2
sp|P40254|PRIO_MACFA MAJOR PRION PROTEIN PRECURSOR (PRP) ...   485  1.7e-84   2
gi|563213            (U15164) major prion protein precurs...   485  1.7e-84   2
gi|1658469           (U75389) prion protein [Colobus guer...   485  1.7e-84   2
gi|1658471           (U75386) prion protein [Cercopithecu...   485  1.7e-84   2
gi|1658475           (U75388) prion protein [Cercopithecu...   485  1.7e-84   2
sp|P40250|PRIO_CERAE MAJOR PRION PROTEIN PRECURSOR (PRP) ...   485  1.7e-84   2
sp|P40255|PRIO_MANSP MAJOR PRION PROTEIN PRECURSOR (PRP) ...   485  1.8e-84   2
sp|P40248|PRIO_CALMO MAJOR PRION PROTEIN PRECURSOR (PRP) ...   485  1.8e-84   2
gi|1658467           (U75384) prion protein [Cercocebus a...   485  1.8e-84   2
sp|P40246|PRIO_ATEGE MAJOR PRION PROTEIN PRECURSOR (PRP) ...   485  1.8e-84   2
sp|P40247|PRIO_CALJA MAJOR PRION PROTEIN PRECURSOR (PRP) ...   482  4.4e-84   2
sp|P40257|PRIO_PREFR MAJOR PRION PROTEIN PRECURSOR (PRP) ...   481  6.0e-84   2
gi|1396068           (U08302) prion protein [Presbytis fr...   481  6.0e-84   2
sp|P40249|PRIO_CEBAP MAJOR PRION PROTEIN PRECURSOR (PRP) ...   488  6.0e-84   2
gi|1658477           (U75385) prion protein [Cercocebus t...   481  6.2e-84   2
sp|P40258|PRIO_SAISC MAJOR PRION PROTEIN PRECURSOR (PRP) ...   479  1.1e-83   2
gi|1658481           (U75383) prion protein [Theropithecu...   478  1.7e-83   2
gi|595853            (U15165) major prion protein precurs...   476  3.0e-83   2
sp|P40245|PRIO_AOTTR MAJOR PRION PROTEIN PRECURSOR (PRP) ...   481  4.3e-83   2
sp|P04925|PRIO_MOUSE MAJOR PRION PROTEIN PRECURSOR (PRP) ...   424  1.5e-80   3
gi|200531            (M18071) prion protein [Mus musculus]     423  2.1e-80   3
gi|200527            (M13685) prion protein [Mus musculus]     420  5.4e-80   3
pir||A34759          prion protein - Chinese hamster >gi|...   419  1.4e-79   3
pir||B34759          prion protein - golden hamster >gi|1...   419  1.4e-79   3
sp|P10279|PRIO_BOVIN MAJOR PRION PROTEIN 1 PRECURSOR (PRP...   481  2.1e-79   2
sp|Q01880|PRIP_BOVIN MAJOR PRION PROTEIN 2 PRECURSOR (PRP...   478  5.6e-79   2
gi|217594            (D10612) prion protein [Bos taurus]       477  7.5e-79   2
sp|P40242|PRIO_TRAST MAJOR PRION PROTEIN 1 PRECURSOR (PRP...   475  1.4e-78   2
prf||1921132B        PrP protein [Tragelaphus strepsiceros]    475  1.5e-78   2
sp|P40243|PRP2_TRAST MAJOR PRION PROTEIN 2 PRECURSOR (PRP...   475  1.5e-78   2
pir||A23545          major prion PrP27-30 protein - hamster    409  3.3e-78   3
sp|P04273|PRIO_MESAU MAJOR PRION PROTEIN PRECURSOR (PRP) ...   409  3.3e-78   3
gi|191423            (M37381) scrapie prion [Mesocricetus...   409  3.5e-78   3
pir||UJHYIH          major prion PrP27-30 protein - golde...   409  3.5e-78   3
pir||A54281          prion protein precursor - sheep >gi|...   471  5.2e-78   2
sp|P23907|PRIO_SHEEP MAJOR PRION PROTEIN PRECURSOR (PRP)....   469  9.9e-78   2
pir||S37149          prion protein - goat >gi|400443 (X74...   467  1.9e-77   2
gi|1149617           (X91999) prion protein [Capra hircus]     464  4.9e-77   2
sp|P47852|PRIO_ODOHE MAJOR PRION PROTEIN PRECURSOR (PRP)....   463  6.7e-77   2
sp|P40244|PRIO_MUSVI MAJOR PRION PROTEIN PRECURSOR (PRP) ...   456  6.2e-76   2
sp|P13852|PRIO_RAT   MAJOR PRION PROTEIN (PRP). >pir||A53...   423  2.1e-75   3
gi|556349            (L07623) prion protein [Sus scrofa]       449  5.7e-75   2
gi|619652            (U08952) prion protein [Mustela puto...   446  1.5e-74   2
gi|546665            (S69654) prion protein, PrP [zitter ...   381  2.3e-74   3
gi|1490413           (U28334) major prion protein [Orycto...   441  5.0e-73   2
gi|642126            (L38993) prion protein [Trichosurus ...   400  1.1e-57   2
gi|191432            (K02234) PrP 27-30 protein [Mesocric...   408  2.4e-52   1
gi|200533            (M30384) prion protein PrP27-30 [Mus...   188  1.7e-21   1
pir||A22315          Scrapie prion - mouse (fragment) >pr...   187  2.3e-21   1
sp|P27177|PRIO_CHICK MAJOR PRION PROTEIN HOMOLOG PRECURSO...   125  2.2e-19   3
pir||A37372          prion protein homolog precursor - ch...   120  9.9e-19   3
pir||UJCH            major prion protein homolog precurso...   120  9.9e-19   3
pir||S02520          Major prion PrP, scrapie, protein - ...   149  1.5e-14   1
pir||S14078          amyloid protein, 11K - human              115  1.1e-09   1
sp|Q01157|GRW1_LYCES GLYCINE-RICH CELL WALL STRUCTURAL PR...    55  0.55      1
gnl|PID|e153043      (X55691) glycine-rich protein [Solan...    64  0.59      1
pir||S14978          glycine-rich protein (clone wN) - to...    64  0.65      1
pir||S14977          glycine-rich protein (clone wM) - to...    64  0.67      1
pir||A36019          Scrapie prion - golden hamster (frag...    58  0.72      1
sp|P17816|GRP_HORVU  GLYCINE-RICH CELL WALL STRUCTURAL PR...    64  0.74      1
sp|P21750|SALA_DROME PROTEIN SPALT-ACCESSORY. >pir||S0026...    47  0.91      2
gnl|PID|e156085      (X55690) glycine-rich protein [Solan...    52  0.92      1
pir||S53051          glycine rich protein - barley (fragm...    59  0.93      1
gi|1673657           (AE000002) Mycoplasma pneumoniae, E0...    59  0.999     1
sp||P100_HUMAN_1     [Segment 1 of 2] DNA-BINDING P52/P10...    32  0.9992    2
sp|P21749|SALA_DROSI PROTEIN SPALT-ACCESSORY. >pir||B3391...    43  0.9998    2

Parameters:
  V=500
  B=0
  -filter=xnu+seg
  -echofilter
  P=4

  -ctxfactor=1.00
  E=10

  Query                        -----  As Used  -----    -----  Computed  ----
  Frame  MatID Matrix name     Lambda    K       H      Lambda    K       H
   +0      0   BLOSUM62        0.320   0.135   0.418    same    same    same

  Query
  Frame  MatID  Length  Eff.Length   E    S W   T  X     E2  S2
   +0      0      231       128      10. 59 3  11 22    0.19 31

Statistics:
  Query          Expected         Observed           HSPs       HSPs
  Frame  MatID  High Score       High Score       Reportable  Reported
   +0      0    63 (29.1 bits)  506 (233.6 bits)      461          0

  Query         Neighborhd  Word      Excluded    Failed   Successful  Overlaps
  Frame  MatID   Words      Hits        Hits    Extensions Extensions  Excluded
   +0      0      3821    19292048     4710041    14542333    39674       495

  Database:  Non-redundant GenBank CDS translations+PDB+SwissProt+SPupdate+PIR
    Release date:  November 22, 1996
    Posted date:  6:19 AM EST Nov 22, 1996
  # of letters in database:  65,241,596
  # of sequences in database:  230,895
  # of database sequences satisfying E:  76
  No. of states in DFA:  539 (53 KB)
  Total size of DFA:  95 KB (128 KB)
  Time to generate neighborhood:  0.01u 0.01s 0.02t  Real: 00:00:00
  No. of processors used:  4
  Time to search database:  34.62u 0.33s 34.95t  Real: 00:00:10
  Total cpu time:  34.68u 0.37s 35.05t  Real: 00:00:10
<END
Some very important features of BLAST's output need to be pointed out. First notice that the -filter=xs switch causes the troublesome portions of the query sequence to be ignored in the search. Especially pay attention to the Poisson distribution scores. These are the likelihoods that the observed matches could be due to chance. Therefore, the smaller the number, the more significant. You should be able to see a fairly clear demarcation where the scores quickly drop off between the significant hits and the background noise. Also realize that, if we had chosen to display alignments, more than one segment per sequence can be presented by the BLAST program. Another important point to notice is, unlike all other GCG programs, the list generated by this search is not appropriate as input to other GCG analyses. This output list is actually generated by NCBI's computer which doesn't know about GCG format requirements. Therefore, if you want to use the results of a BLAST search in another GCG program you must manually edit the list changing the database names to reflect the logicals that GCG understands. For example, change sp|P40250|PRIO_CERAE to SW:P40250 PRIO_CERAE (either insert a blank space or ! between the access code and sequence name). The "gi" designation is all the translations from GenBank's CDS references. This database, GenPept, is not installed on VADMS' systems. Therefore, if you want to include any in subsequent analyses, you must individually translate them from GenBank. For instance, "gi|190518 (M81929) prion protein [Homo sapiens]" is the CDS translation from GenBank accession code M81929. This does make life a bit more complicated but can be worked around.

5) What Next? Comparisons, interpretations, and further analyses.

I hate to do this to you, but this exercise must be performed in two parts! In order to proceed you must have the output files from all of the previous searches, excluding the ProfileSearch, available. I do apologize.

When the output from the previous "traditional" batch jobs is available take a look at the files. They will be located within your Exercise 8 subdirectory. Also all batch jobs mail their log files to your account -- you'll want to delete these if all jobs ran to completion. You may want to print out your search results using the lpr command for reference in the remainder of the exercise although it is not necessary. The abridged prion example output files follow below. Naturally, the topmost "hits" are prion proteins; it's the ones just below the expected hits that may prove interesting:

!!SEQUENCE_LIST 1.0

(Peptide) TFASTA of: humprp.pep  from: 1 to: 231  November 21, 1996 19:57

TRANSLATE of: humprp.genbank check: 4378 from: 116 to: 808
(mature protein only: PrP 33-35-C)
generated symbols 1 to: 231.
LOCUS       HUMPRP       2415 bp ss-mRNA            PRI       20-MAY-1987
DEFINITION  Human prion protein (PrP) mRNA, complete cds.
ACCESSION   M13899 . . .

 TO: GenEMBL:*  Sequences:    299,711  Symbols: 390,626,675  Word Size: 2

 Sequences too short to analyze: 36 (146 symbols)
 Databases searched:
   gb_ba, Release 97.0, Released on 0Oct96, Formatted on 0Oct96
   gb_in, Release 97.0, Released on 0Oct96, Formatted on 0Oct96
///////////////////////////////////////////////////////////////////
   em_sy, Release 43.0, Released on 0May1995, Formatted on 0Mar1996
   em_un, Release 43.0, Released on 0May1995, Formatted on 0Mar1996
   em_vi, Release 43.0, Released on 0May1995, Formatted on 0Mar1996

 Searching all six frames.
 Scoring matrix: GenRunData:blosum50.cmp
 Variable pamfactor used
 Gap creation penalty: 16      Gap extension penalty: 4

Histogram Key:
 Each histogram symbol represents 3538 search set sequences
 Each inset symbol represents 39 search set sequences
 z-scores computed from opt scores

z-score obs    exp
        (=)    (*)

< 20 112558      0 :*===============================
  22      1      0 :*
  24      2      2 :*
  26     39     35 :*
  28    154    376 :*
  30    544   2286 :*
  32   1406   8838 := *
  34   3872  23968 :==    *
  36  11128  49226 :====         *
  38  30025  81352 :=========             *
  40  62970 113478 :==================              *
  42 111653 138713 :================================       *
  44 161831 153014 :===========================================*==
  46 202943 155848 :============================================*=============
  48 212249 149207 :==========================================*=================
  50 194974 136152 :======================================*=================
  52 154066 119700 :=================================*==========
  54 119534 102245 :============================*=====
  56  89439  85406 :========================*=
  58  64612  70117 :===================*
  60  47274  56799 :==============  *
  62  34419  45536 :==========  *
  64  27565  36214 :========  *
  66  22429  28623 :======= *
  68  18245  22514 :======*
  70  15137  17643 :====*
  72  12621  13787 :===*
  74  10690  10749 :===*
  76   8597   8366 :==*
  78   7044   6503 :=*
  80   5928   5049 :=*
  82   4916   3863 :=*
  84   4054   3060 :*=
  86   3296   2367 :*
  88   2586   1832 :*
  90   2414   1417 :*
  92   1912   1097 :*         :============================*===========
  94   1455    849 :*         :=====================*================
  96   1271    657 :*         :================*================
  98   1074    508 :*         :=============*==============
 100    951    393 :*         :==========*==============
 102    724    304 :*         :=======*===========
 104    631    235 :*         :======*==========
 106    486    182 :*         :====*========
 108    431    141 :*         :===*========
 110    350    109 :*         :==*======
 112    327     84 :*         :==*======
 114    270     65 :*         :=*=====
 116    306     51 :*         :=*======
 118    203     39 :*         :*=====
>120   1686     30 :*         :*=======================================

 Results sorted and z-values calculated from opt score
 1505 scores better than 121 saved
 359022 optimizations performed
 Joining threshold: 36, optimization threshold: 36, opt. width: 16

The best scores are:                  frame init1 initn   opt    z-sc E(1659129) ..

gb_pr:ggu15166    Begin: 23  End:  253
! U15166 Gorilla gorilla major prion ...(1)  1686  1686  1686  1670.1       0
gb_pr:ggu08300    Begin: 23  End:  253
! U08300 Gorilla gorilla prion protei...(1)  1686  1686  1686  1670.1       0
gb_pr:humprp    Begin: 39  End:  269
! M13899 Human prion protein (PrP) mR...(2)  1690  1690  1690  1667.2       0
gb_pr:ssu08308    Begin: 23  End:  253
! U08308 Symphalangus syndactylus pri...(1)  1680  1680  1680  1664.2       0
gb_pr:ptu08296    Begin: 23  End:  253
! U08296 Pan troglodytes prion protei...(1)  1680  1680  1680  1664.2       0
gb_pr:hlu08299    Begin: 23  End:  253
! U08299 Hylobates lar prion protein ...(1)  1680  1680  1680  1664.2       0
gb_pr:ptu15039    Begin: 23  End:  253
! U15039 Pan troglodytes major prion ...(1)  1680  1680  1680  1664.2       0
gb_pr:ppu08305    Begin: 23  End:  253
! U08305 Pongo pygmaeus prion protein...(1)  1668  1668  1668  1652.5       0
gb_pr:humprp0a    Begin: 41  End:  270
! M13667 Human prion protein 27-30 mR...(2)   940   940  1669  1646.7       0
gb_pr:humprion    Begin: 41  End:  270
! D00015 Human prion protein mRNA, hu...(2)   940   940  1669  1646.7       0
gb_pr:cgu08297    Begin: 23  End:  253
! U08297 Colobus guereza prion protei...(1)  1655  1655  1655  1639.7       0
gb_pr:pfu08302    Begin: 23  End:  253
! U08302 Presbytis francoisi prion pr...(1)  1651  1651  1651  1635.8       0

/////////////////////////////////////////////////////////////////////////////

! U28334 Oryctolagus cuniculus major ...(1)   997   997  1510  1496.5       0
gb_pr:humprpra    Begin: 15  End:  224
! M81929 Human prion protein gene, 5'...(2)  1306  1306  1498  1486.7       0
gb_pr:agu08309    Begin: 21  End:  232
! U08309 Ateles geoffroyi prion prote...(1)  1327  1327  1375  1366.1       0
gb_om:tioprp    Begin: 121  End:  353
! L38993 Trichosurus vulpecular prion...(3)   942   942  1174  1165.4       0
gb_ro:musprpc    Begin: 1  End:  78
! M30384 Mouse prion protein (PrP27-3...(2)   523   523   523   538.2  5.9e-22
gb_ov:chkprion    Begin: 40  End:  219
! M95404 Chicken prion protein gene, ...(1)   330   357   439   447.7  6.6e-17
gb_ov:chkprlp    Begin: 85  End:  258
! M61145 Gallus gallus prion-like pro...(1)   324   324   434   438.0  2.3e-16
gb_pr:s80539    Begin: 9  End:  94
! S80539 PRNP=PrP amyloid [human, Gen...(1)   384   384   399   415.5  4.1e-15
gb_pr:s79978    Begin: 5  End:  86
! S79978 prion protein {144-base pair...(1)   384   384   398   415.1  4.3e-15
gb_pr:hspglap3    Begin: 5  End:  68
! Z48604 H.sapiens prion protein gene...(1)   384   384   389   407.4  1.2e-14
gb_pr:hspglap2    Begin: 1  End:  44
! Z48048 H.sapiens prion protein gene...(1)   375   375   379   400.0   3e-14
gb_in:celrbp1    Begin: 268  End:  372
! D10877 C.elegans mRNA for hnRNP lik...(1)   172   172   236   246.4  1.1e-05
gb_pl:s47405    Begin: 101  End:  198
! S47405 glycine-rich protein {clone ...(2)   197   197   202   217.2  0.00045
gb_pl:bo15h11    Begin: 38  End:  136
! Z74892 B.oleracea mRNA for glycine-...(1)   201   201   201   216.0  0.00053
gb_pl:hvgrp3    Begin: 110  End:  174
! Z48624 H.vulgare mRNA for glycine-r...(3)   201   201   202   216.0  0.00053
gb_pl:bngrp22g    Begin: 167  End:  260
! Z15045 B.napus GRP22 gene encoding ...(1)   194   194   200   212.6  0.00082
gb_pl:pvgrp18    Begin: 328  End:  434
! X13596 Bean DNA for glycine-rich ce...(1)   180   180   203   212.6  0.00082
gb_pl:tau32310    Begin: 131  End:  196
! U32310 Triticum aestivum single-str...(2)   181   181   198   211.9  0.00089
gb_pr:hskerat9    Begin: 520  End:  624
! Z29074 H.sapiens mRNA for cytokerat...(1)   168   168   202   210.6  0.0011
gb_pl:vurnagrp    Begin: 129  End:  235

/////////////////////////////////////////////////////////////////////////////

! J05230 S.cerevisiae fibrillarin (NO...(1)   103   103   130   139.4     9.7
gb_in:sau32943    Begin: 113  End:  212
! U32943 Schistocerca americana Anten...(3)   112   112   131   139.4     9.8
gb_pl:bnamrnaa    Begin: 258  End:  338
! L33282 Brassica napus clone STA 41-...(2)   102   102   127   139.3     9.8
gb_pr:hsl60g9b    Begin: 619  End:  683
! Z69363 Human DNA sequence from cosm...(5)    97    97   141   139.3     9.9
gb_pr:hummel18    Begin: 347  End:  409
! D13969 Human mRNA for Mel-18 protei...(5)    83   101   129   139.3     9.9
gb_pr:s79325    Begin: 7  End:  72
! S79325 SYT...SSX1 {translocation br...(1)   102   102   121   139.3     9.9
gb_pr:s79332    Begin: 7  End:  72
! S79332 SYT...SSX2 {translocation br...(1)   102   102   121   139.3     9.9
gb_om:shpcrfa    Begin: 379  End:  454
! M22853 Sheep corticotropin releasin...(4)    95    95   131   139.2      10
\\End of List

! CPU time used:
!        Database scan:  0:40:49.2
! Post-scan processing:  0:00:20.1
!       Total CPU time:  0:41:09.5
! Output File: humprp.tfasta
Had we chosen to not suppress aligning the results, the TFastA output would also show the sequence alignment at each location. The beginning and ending alignment points along with the translation frames listed in the output can be used to go back to the original nucleotide entry to check whether the match- ups correspond to actually translated areas. Notice that the output file is an acceptable GCG list file that can serve as input to other programs such as PileUp. A histogram of the score distribution is also displayed in the FastA outputs. This can be helpful to get a feeling for the statistical significance of the search and in ascertaining whether you ran your search list large enough. The histogram can be suppressed with the nohistogram option if desired. Another thing to notice in the output is that the entries are sorted by a "z" score parameter based on a normalization of the opt scores and their distribution from the rest of the database. This z- score is a bit different than the more traditional Monte Carlo style distribution Z score that I will introduce below. Here it is calculated from a simple linear regression against the natural log of the search set sequence length. Either type can be very helpful as they help describe the statistical significance of an alignment. Sometimes initial extended word scores, initn's, are greatly improved after the opt dynamic programming and normalization pass. A good example is shown in the above output under the entries HumPrion and HumPrP0A. Both of these entries scored relatively poor initn scores of 940 versus their final z-score of 1646.7. This point underscores the importance of using mutliple algorithms.

The Expectation function, E(), is the most important column. It is very similar to the Poisson value in BLAST reports and describes the number of search set sequences that would be needed to obtain a z-score greater than or equal to the z-score obtained in any particular search purely by chance; in-other-words, just like with BLAST P-values, the smaller the number, the better. As a rule-of-thumb, for a search against a protein database of about 10,000 sequences, as long as optimization is not turned off, E() scores of less than 0.01 are almost certainly homologous, and scores between 1 to 10 may be, although these guidelines can be skewed by compositional biases.

Sometimes the sequences found by TFastA will not show up in any other searches. This could be valuable information, especially if sometime during your peptide's evolutionary history it incorporated (was "infected" by) any type of mobile DNA element. This comes to mind in my example because of frequent hits within the herpes virus family.

Next the abridged example FastA output file:

% more humprp.fasta

!!SEQUENCE_LIST 1.0

(Peptide) FASTA of: humprp.pep  from: 1 to: 231  November 21, 1996 19:14

TRANSLATE of: humprp.genbank check: 4378 from: 116 to: 808
(mature protein only: PrP 33-35-C)
generated symbols 1 to: 231.
LOCUS       HUMPRP       2415 bp ss-mRNA            PRI       20-MAY-1987
DEFINITION  Human prion protein (PrP) mRNA, complete cds.
ACCESSION   M13899 . . .

 TO: SwissProt:*  Sequences:     52,205  Symbols: 18,531,385  Word Size: 2

 Databases searched:
   swissprot, Release 33.0, Released on 0Feb1996, Formatted on 0Jul1996

 Scoring matrix: GenRunData:blosum50.cmp
 Variable pamfactor used
 Gap creation penalty: 12      Gap extension penalty: 2

Histogram Key:
 Each histogram symbol represents 82 search set sequences
 Each inset symbol represents 6 search set sequences
 z-scores computed from opt scores

z-score obs    exp
        (=)    (*)

< 20    169      0 :*==
  22      0      0 :*
  24      1      0 :*
  26      0      1 :*
  28      1     12 :*
  30     21     71 :*
  32    171    276 :===*
  34    605    748 :======== *
  36   1562   1536 :==================*=
  38   3025   2538 :==============================*======
  40   4317   3540 :===========================================*=========
  42   4680   4327 :====================================================*=====
  44   4888   4773 :==========================================================*=
  46   4879   4862 :===========================================================*
  48   4830   4655 :========================================================*==
  50   4266   4247 :===================================================*=
  52   3507   3734 :===========================================  *
  54   2938   3190 :====================================  *
  56   2464   2664 :=============================== *
  58   2052   2187 :==========================*
  60   1600   1772 :==================== *
  62   1214   1421 :===============  *
  64    930   1130 :============ *
  66    723    893 :========= *
  68    593    702 :========*
  70    478    550 :======*
  72    362    430 :=====*
  74    308    335 :====*
  76    224    261 :===*
  78    171    203 :==*
  80    147    158 :=*
  82    125    121 :=*
  84     86     95 :=*
  86     73     74 :*
  88     63     57 :*
  90     44     44 :*
  92     51     34 :*         :=====*===
  94     55     26 :*         :====*=====
  96     48     20 :*         :===*====
  98     31     16 :*         :==*===
 100     19     12 :*         :=*==
 102     25      9 :*         :=*===
 104     36      7 :*         :=*====
 106     28      6 :*         :*====
 108     21      4 :*         :*===
 110     22      3 :*         :*===
 112     17      3 :*         :*==
 114     15      2 :*         :*==
 116     21      2 :*         :*===
 118     14      1 :*         :*==
>120    271      1 :*===      :*=======================================

 Results sorted and z-values calculated from opt score
 1985 scores better than 73 saved
 41992 optimizations performed
 Joining threshold: 36, optimization threshold: 24, opt. width: 16

The best scores are:                    init1 initn   opt    z-sc E(51759)..

sw:prio_human    Begin: 23  End:  253
! P04156 homo sapiens (human). major ... 1690  1690  1690  1682.2       0
sw:prio_gorgo    Begin: 23  End:  253
! P40252 gorilla gorilla gorilla (low... 1686  1686  1686  1678.3       0
sw:prio_pantr    Begin: 23  End:  253
! P40253 pan troglodytes (chimpanzee)... 1680  1680  1680  1672.4       0
sw:prio_ponpy    Begin: 23  End:  253
! P40256 pongo pygmaeus (orangutan). ... 1668  1668  1668  1660.5       0

/////////////////////////////////////////////////////////////////////////

sw:prio_trast    Begin: 26  End:  264
! P40242 tragelaphus strepsiceros (gr... 1352  1352  1546  1539.9       0
sw:prio_cerae    Begin: 23  End:  245
! P40250 cercopithecus aethiops (gree... 1335  1335  1545  1539.4       0
sw:prio_rat    Begin: 1  End:  226
! P13852 rattus norvegicus (rat). maj... 1396  1396  1524  1519.1       0
sw:prio_atege    Begin: 21  End:  232
! P40246 ateles geoffroyi (black-hand... 1327  1327  1381  1377.9       0
sw:prio_chick    Begin: 28  End:  270
! P27177 gallus gallus (chicken). maj...  330   363   488   495.8  4.3e-21
sw:rb87_drome    Begin: 245  End:  356
! P48810 drosophila melanogaster (fru...  140   140   217   226.3  4.4e-06
sw:k1ci_human    Begin: 497  End:  601
! P35527 homo sapiens (human). kerati...  168   168   218   224.4  5.6e-06
sw:grp1_orysa    Begin: 60  End:  158
! P25074 oryza sativa (rice). glycine...  168   168   206   220.6  9.1e-06
sw:grp2_orysa    Begin: 31  End:  130
! P29834 oryza sativa (rice). glycine...  181   181   206   220.0  9.9e-06
sw:grp1_phavu    Begin: 118  End:  217
! P10495 phaseolus vulgaris (kidney b...  181   181   207   219.0  1.1e-05
sw:grp2_phavu    Begin: 177  End:  283
! P10496 phaseolus vulgaris (kidney b...  180   180   209   217.2  1.4e-05
sw:grp_arath    Begin: 140  End:  244
! P27483 arabidopsis thaliana (mouse-...  200   200   206   216.2  1.6e-05
sw:grp_horvu    Begin: 80  End:  154
! P17816 hordeum vulgare (barley). gl...  168   168   197   210.5  3.3e-05
sw:cora_medsa    Begin: 58  End:  167
! Q07202 medicago sativa (alfalfa). c...  147   147   195   208.5  4.3e-05
sw:grp2_sinal    Begin: 96  End:  166
! P49311 sinapis alba (white mustard)...  147   147   193   207.6  4.8e-05
sw:k22e_human    Begin: 536  End:  641
! P35908 homo sapiens (human). kerati...  166   166   200   206.4  5.7e-05
sw:hmpr_human    Begin: 50  End:  123
! Q01851 homo sapiens (human). homeob...  139   139   192   202.5  9.3e-05
sw:k1cj_human    Begin: 460  End:  556
! P13645 homo sapiens (human). kerati...  159   159   195   201.9  0.0001
sw:grpa_medfa    Begin: 39  End:  130
! Q09134 medicago falcata (sickle med...  146   146   186   201.1  0.00011

/////////////////////////////////////////////////////////////////////////

! P27214 bos taurus (bovine). annexin...   80   124   104   113.2     8.8
sw:antf_sarpe    Begin: 23  End:  84
! Q08617 sarcophaga peregrina (flesh ...   78    78    93   113.1     8.8
sw:hxb3_human    Begin: 148  End:  181
! P14651 homo sapiens (human). homeob...   97    97   103   113.1     8.9
sw:rog_human    Begin: 92  End:  224
! P38159 homo sapiens (human). hetero...   59    59   103   113.0       9
sw:hra1_xancv    Begin: 211  End:  300
! P80151 xanthomonas campestris (pv. ...   74    74   105   113.0       9
sw:lmp1_ebv    Begin: 310  End:  378
! P03230 epstein-barr virus (strain b...   72    95   102   112.8     9.2
sw:dip_drome    Begin: 35  End:  87
! P24492 drosophila melanogaster (fru...   54    54    94   112.8     9.3
sw:vasa_drome    Begin: 47  End:  117
! P09052 drosophila melanogaster (fru...   74    74   105   112.5     9.6
sw:prp2_human    Begin: 159  End:  235
! P02812 homo sapiens (human). saliva...   88    88    99   112.5     9.6
sw:hmux_drops    Begin: 104  End:  175
! P20822 drosophila pseudoobscura (fr...   74    74    99   112.3     9.8
\\End of List

! CPU time used:
!        Database scan:  0:02:29.7
! Post-scan processing:  0:00:07.4
!       Total CPU time:  0:02:37.2
! Output File: humprp.fasta
Both FastA output files show a histogram of the score distribution, then a sorted list of the top scores and finally, if alignments are not suppressed, BestFit style alignments of the score list. They both use the same E() significance statistic in their reports. Next, the abridged prion example WordSearch output:
% more humprp.word

!!SEQUENCE_LIST 1.0
 (Peptide) WORDSEARCH of: /disk2/usr/local/people/thompson/courses/BC578/EX8/hum
prp.pep  check: 414 from: 1  to: 231

TRANSLATE of: humprp.genbank check: 4378 from: 116 to: 808
(mature protein only: PrP 33-35-C)
generated symbols 1 to: 231.
LOCUS       HUMPRP       2415 bp ss-mRNA            PRI       20-MAY-1987
DEFINITION  Human prion protein (PrP) mRNA, complete cds.
ACCESSION   M13899 . . .

 TO: SwissProt:*  Sequences: 52,205 Total-length: 18,531,385  November 21, 1996 19:12

 Database Release Information:
        swissprot, Release 33.0, Released on 0Feb1996, Formatted on 0Jul1996

 Word-size: 2  Words: 74263061  Diagonals: 14,481,967 Total-diagonals: 30,538,535
 Integral-width: 3  Alphabet: 20 List-size: 300  CPU minutes: 1.81

Sequence                   Strd   Diag Score Width Documentation ..

SW:PRIO_HUMAN                 +     22  269   25   P04156 homo sapiens (human).
major prion protein precursor (prp) (prp27-30) (prp
SW:PRIO_GORGO                 +     22  268   25   P40252 gorilla gorilla gorill
a (lowland gorilla). major prion protein precursor
SW:PRIO_PANTR                 +     22  267   25   P40253 pan troglodytes (chimp
anzee), hylobates lar (common gibbon), and symphala
SW:PRIO_PONPY                 +     22  264   25   P40256 pongo pygmaeus (orangu
tan). major prion protein precursor (prp) (prp27-30
SW:PRIO_COLGU                 +     22  261   25   P40251 colobus guereza. major
 prion protein precursor (prp) (prp27-30) (prp33-35
SW:PRIO_PREFR                 +     22  260   25   P40257 presbytis francoisi. m
ajor prion protein precursor (prp) (prp27-30) (prp3

//////////////////////////////////////////////////////////////////////////////

SW:PRIO_TRAST                 +     33  226   15   P40242 tragelaphus strepsicer
os (greater kudu). major prion protein 1 precursor
SW:PRIO_CERAE                 +     14  221   14   P40250 cercopithecus aethiops
 (green monkey) (grivet), and cercopithecus dianae.
SW:PRIO_ATEGE                 +      6  216   15   P40246 ateles geoffroyi (blac
k-handed spider monkey). major prion protein precur
SW:GRP1_PETHY                 +     68   72   30   P09789 petunia hybrida (petun
ia). glycine-rich cell wall structural protein 1 pr
SW:GRP1_PETHY                 +    198   71   36   P09789 petunia hybrida (petun
ia). glycine-rich cell wall structural protein 1 pr
SW:GRP_ARATH                  +    137   69    2   P27483 arabidopsis thaliana (
mouse-ear cress). glycine-rich cell wall structural
SW:GRP_ARATH                  +    133   69    3   P27483 arabidopsis thaliana (
mouse-ear cress). glycine-rich cell wall structural
SW:GRP1_PHAVU                 +     66   68    7   P10495 phaseolus vulgaris (ki
dney bean) (french bean). glycine-rich cell wall st
SW:GRP_ARATH                  +    104   68    8   P27483 arabidopsis thaliana (
mouse-ear cress). glycine-rich cell wall structural
SW:PRIO_CHICK                 +     35   67   35   P27177 gallus gallus (chicken
). major prion protein homolog precursor (pr-lp) (a
SW:GRP1_PETHY                 +    109   66   33   P09789 petunia hybrida (petun
ia). glycine-rich cell wall structural protein 1 pr
SW:GRP_ARATH                  +    181   66    6   P27483 arabidopsis thaliana (
mouse-ear cress). glycine-rich cell wall structural
SW:GRP2_PHAVU                 +    140   65   12   P10496 phaseolus vulgaris (ki
dney bean) (french bean). glycine-rich cell wall st
SW:GRP2_PHAVU                 +     72   65    4   P10496 phaseolus vulgaris (ki
dney bean) (french bean). glycine-rich cell wall st
SW:GRP_HORVU                  +     47   65   54   P17816 hordeum vulgare (barle
y). glycine-rich cell wall structural protein precu
SW:GRP_ARATH                  +    153   65    7   P27483 arabidopsis thaliana (
mouse-ear cress). glycine-rich cell wall structural
SW:GRP_ARATH                  +    113   65    2   P27483 arabidopsis thaliana (
mouse-ear cress). glycine-rich cell wall structural

//////////////////////////////////////////////////////////////////////////////

SW:K1CJ_BOVIN                 +     35   45   91   P06394 bos taurus (bovine). k
eratin, type i cytoskeletal vib (cytokeratin vib).
SW:LORI_MOUSE                 +    293   45    2   P18165 mus musculus (mouse).
loricrin. 2/96
SW:LORI_MOUSE                 +    240   45    1   P18165 mus musculus (mouse).
loricrin. 2/96
SW:LORI_MOUSE                 +    149   45    5   P18165 mus musculus (mouse).
loricrin. 2/96
SW:PRIO_CEBAP                 +      5   45    4   P40249 cebus apella (brown-ca
pped capuchin). major prion protein precursor (prp)
SW:PRIO_CALJA                 +      5   45    4   P40247 callithrix jacchus (co
mmon marmoset). major prion protein precursor (prp)
SW:YKR3_CAEEL                 +      4   45   67   P34309 caenorhabditis elegans
. hypothetical 11.3 kd protein c06g4.3 in chromosom
SW:SPD1_NEPCL                 +    314   45   10   P19837 nephila clavipes (orb
spider). spidroin 1 (dragline silk fibroin 1) (frag
SW:SPD1_NEPCL                 +    246   45   19   P19837 nephila clavipes (orb
spider). spidroin 1 (dragline silk fibroin 1) (frag
SW:SPD1_NEPCL                 +     18   45    7   P19837 nephila clavipes (orb
spider). spidroin 1 (dragline silk fibroin 1) (frag
SW:EGG1_SCHJA                 +     40   44    4   P19470 schistosoma japonicum
(blood fluke). eggshell protein 1 precursor. 5/91
SW:FBRL_LEIMA                 +      2   44   71   P35549 leishmania major. fibr
illarin. 11/95
Notice that the various lists have entries in common but also have substantial differences, in particular WordSearch found tons of glycine-rich protein segments, some with scores higher than prion segments themselves. This is another reason as to why it is always a good idea to run as many different types of analyses as practical. We ran WordSearch with the plot option to generate a graphic of the score distribution of that procedure. Be sure to initialize your graphics configuration first (setplot) and then type figure wordsearch.figure to draw the plot:
% figure wordsearch.figure

Figure makes figures and posters by drawing graphics and text
together. You can include output from other GCG graphics programs as
part of a figure.

  Process set to plot with VERSATERM-TEK4105 attached to term
  using the tekd graphic interface.

 When your VERSATERM-TEK4105 attached to tty is ready, press <Return>.  <rtn>

The asterisk on the abscissa indicates your lowest reported score and helps you to decide if you ran your list large enough. Normally you want your list size big enough to include some of the population of random low scores to help you ascertain the significance of the alignments. In my case above the list was plenty large to catch the upward slope of the random scores. The inset shows a blowup of the high score end of the graph -- these are the best alignments found by the program, not the worst!

Unlike the FastA and BLAST programs WordSearch can not illustrate its alignments with sequence pairs. The separate program Segments can do that. We know how your sequence aligns to itself, so first edit your .word file by removing all of the Swiss Protein entry list lines near the top except one or two positive control sequences that refer to your particular protein family. Do not remove any of the reference header section of the file nor those sequences which are somewhat similar to your query. These are the interesting sequences which are obviously not from the same gene family as your selected molecule yet have fairly high scores. Do remove all the sequences below this point all the way to the bottom of the list. These are all the sequences that are obviously totally dissimilar. We are mainly interested here in how some of the not so similar, "twilight zone," sequences align. In order to visualize the alignments run the program Segments; provide it with your WordSearch output file's name; give the new output file the extension .wordpairs. This process is shown in an abridged fashion for the prion example on the next page:

% segments humprp.word

Segments aligns and displays the segments of similarity found by
WordSearch.

 What should I call the output file (* humprp.pairs *) ?  humprp.w
ordpairs

 Aligning ...........-..
 SW:PRIO_ATEGE    232 aa  Gaps:  1  Quality: 315.4 / Length: 226
 Aligning ...........-.
 SW:GRP1_PETHY    384 aa  Gaps:  6  Quality:  89.5 / Length: 223
 Aligning ...........-.
 SW:GRP1_PETHY    384 aa  Gaps:  7  Quality:  80.7 / Length: 216
 Aligning ..........-.
 SW:GRP_ARATH    338 aa  Gaps:  2  Quality:  84.9 / Length: 201
 Aligning ..........-.
 SW:GRP_ARATH    338 aa  Gaps:  3  Quality:  86.8 / Length: 210
 Aligning .........-.
 SW:GRP1_PHAVU    252 aa  Gaps:  5  Quality:  78.5 / Length: 191
 Aligning ...........-..
 SW:GRP_ARATH    338 aa  Gaps:  5  Quality:  88.9 / Length: 222
 Aligning ...........-..
 SW:PRIO_CHICK    273 aa  Gaps: 11  Quality: 150.4 / Length: 245
 Aligning ...........-..
 SW:GRP1_PETHY    384 aa  Gaps:  7  Quality:  89.9 / Length: 266

/////////////////////////////////////////////////////////////////

 Aligning ........-.
 SW:LORI_MOUSE    481 aa  Gaps:  4  Quality:  83.3 / Length: 218
Display the generated .wordpairs file. Again, the prion example follows in an abridged fashion:
% more humprp.wordpairs

 (BestFit) SEGMENTS from: humprp.word  February 14, 1996 09:58

 (Peptide) WORDSEARCH of: /disk2/usr/local/people/thompson/BC578/EX8/humprp.pep
 check: 414 from: 1  to: 231
TRANSLATE of: humprp.genbank check: 4378 from: 116 to: 808
(mature protein only: PrP 33-35-C)
generated symbols 1 to: 231.
LOCUS       HUMPRP       2415 bp ss-mRNA            PRI       20-MAY-1987
DEFINITION  Human prion protein (PrP) mRNA, complete cds. . . .

 AvMatch: 0.54  AvMisMatch: -0.40  GapWeight: 2.00  LengthWeight: 0.10   ..

humprp.pep                check: 414   from: 1      to: 231
SW:PRIO_ATEGE             check: 1948  from: 6      to: 232
     P40246 ateles geoffroyi (black-handed spider monkey). major prion protein p
recur
 Gaps: 1  Quality: 315.4  Ratio: 1.453  Score: 216  Width: 15  Limits: +/-16
                  .         .         .         .         .
       1 KKRPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGG 50
         ||||||||||||||||||||||||||||||         |||||||||||
      16 KKRPKPGGWNTGGSRYPGQGSPGGNRYPPQ.........GGGWGQPHGGG 56
                  .         .         .         .         .
      51 WGQPHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGAVV 100
         ||||||||||||||||||||||||.|||||||||||||||||||||||||
      57 WGQPHGGGWGQPHGGGWGQGGGTHNQWNKPSKPKTNMKHMAGAAAAGAVV 106
                  .         .         .         .         .
     101 GGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQ 150
         |||||||||||||||:||||.|||||||||||.||||||||||:|:|.||
     107 GGLGGYMLGSAMSRPLIHFGNDYEDRYYRENMYRYPNQVYYRPVDQYNNQ 156
                  .         .         .         .         .
     151 NNFVHDCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERES 200
         ||||||||||||||||||||||||||||||||||||||||||||||||||
     157 NNFVHDCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERES 206
                  .         .
     201 QAYYQRGSSMVLFSSPPVILLISFLI 226
         ||||||||||||||||||||||||||
     207 QAYYQRGSSMVLFSSPPVILLISFLI 232

humprp.pep                check: 414   from: 1      to: 231
SW:GRP1_PETHY             check: 1852  from: 68     to: 384
     P09789 petunia hybrida (petunia). glycine-rich cell wall structural protein
 1 pr
 Gaps: 6  Quality: 89.5  Ratio: 0.441  Score: 72  Width: 30  Limits: +/-31
                  .         .         .         .         .
       4 PKPGGWNTGGSRYPGQGSPGGNRYPPQG..GGGWGQPHGGGWGQPHGGGW 51
         . .||:..||:  .|.|..||. ....|  |||:|.. ||| |.. |||
      78 GAGGGLGGGGGLGGGGGAGGGGGLGGGGGAGGGFGGGAGGGAGGGLGGGG 127
                  .         .         .         .         .
      52 GQPHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGAVVG 101
         | . ||| |.. ||| |.|:|. :.:. .:        ::|:|:||:.||
     128 GLGGGGGGGAGGGGGVGGGAGSGGGFGAGGG.......VGGGAGAGGGVG 170
                  .         .         .         .         .
     102 GLGGYMLGSAMSRPIIHFGSDYEDRY.YRENMHRYPNQVYYRPMDEYSNQ 150
         | ||:  |:: :   :  ||:.:: :   :.:   :....  .::: :..
     171 GGGGFGGGGGGG...VGGGSGHGGGFGAGGGVGGGAGGGLGGGVGGGGGG 217
                  .         .         .         .         .
     151 NNFVHDCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERES 200
         .. . :.:.        .....|:.|...:.  :: .|:. . .. : ::
     218 GSGGGGGIG........GGSGHGGGFGAGGG..VGGGVGGGAAGGGGGGG 257
                  .         .
     201 QAYYQRGSSMVLFSSPPVILLIS 223
         .:  . |:::.  |:... :  :
     258 GGGGGGGGGLGGGSGHGGGFGAG 280

humprp.pep                check: 414   from: 1      to: 231
SW:GRP1_PETHY             check: 1852  from: 198    to: 384
     P09789 petunia hybrida (petunia). glycine-rich cell wall structural protein
 1 pr
 Gaps: 7  Quality: 80.7  Ratio: 0.453  Score: 71  Width: 36  Limits: +/-37
                  .         .         .         .         .
       4 PKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQ 53
         . .||:..|.:  .|.||.||      || | | .||||:|.  ||| |.
     202 GAGGGLGGGVGGGGGGGSGGG......GGIGGGSGHGGGFGA..GGGVGG 243
                  .         .         .         .         .
      54 PHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAG.AAAAGAVVGG 102
         . ||| :.. ||| |.|||. :.:. .| . ..:   :| :::|::.|||
     244 GVGGGAAGGGGGGGGGGGGGGGGLGGGSGHGGGFGAGGGVGGGAAGGVGG 293
                  .         .         .         .         .
     103 LGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNN 152
          ||:  |:: :   :  ||:.:: :                     ....
     294 GGGFGGGGGGG...VGGGSGHGGGF.....................GAGG 319
                  .         .         .         .         .
     153 FVHDCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERESQA 202
          | :... .:   ...... |:.:.:...  :: .|:    ...: : .|
     320 GVGGGAGGGLG..GGGGAGGGGGIGGGHGGGFGVGVG....IGIGVGVGA 363
                  .
     203 YYQRGSSMVLFSSPPV 218
           .:| ::.  |:...
     364 GAGHGVGVGSGSGSGG 379

humprp.pep                check: 414   from: 1      to: 231
SW:GRP_ARATH              check: 6121  from: 137    to: 338
     P27483 arabidopsis thaliana (mouse-ear cress). glycine-rich cell wall struc
tural
 Gaps: 2  Quality: 84.9  Ratio: 0.433  Score: 69  Width: 2  Limits: +/-3
                  .         .         .         .         .
       4 PKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQ 53
         . .|| ..||:  .|:|:. |.  ...:|||:|..|||| |.. |||.|.
     137 GAGGGAGGGGGLGGGHGGGIGGGAGGGAGGGLGGGHGGGIGGGAGGGSGG 186
                  .         .         .         .         .
      54 PHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMA....GAAAAGAV 99
         . ||| |.. ||| |.|||. :. . .: . ..:   |    |::|:|:.
     187 GLGGGIGGGAGGGAGGGGGAGGGGGLGGGHGGGFGGGAGGGLGGGAGGGT 236
                  .         .         .         .         .
     100 VGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSN 149
         .||:||   |:| : :   ||:: ::    :     ....   : ::::.
     237 GGGFGGGAGGGAGGGAGGGFGGGAGGGAGGGFGGGAGGGAGGGAGGGFGG 286
                  .         .         .         .         .
     150 QNNFVH.DCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYER 198
         ... .| :.|. .: . .... . |.. ...:.   : ..:. . ..::
     287 GAGGGHGGGVGGGFGGGSGGGFGGGAGGGAGGGAGGGFGGGGGAGGGFGG 336

     199 E 199
         :
     337 G 337

humprp.pep                check: 414   from: 1      to: 231
SW:GRP_ARATH              check: 6121  from: 133    to: 338
     P27483 arabidopsis thaliana (mouse-ear cress). glycine-rich cell wall struc
tural
 Gaps: 3  Quality: 86.8  Ratio: 0.430  Score: 69  Width: 3  Limits: +/-4
                  .         .         .         .         .
       4 PKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQ 53
         . .|| ..||:  .|:|:. |.  ...:|||:|..|||| |.. |||.|.
     137 GAGGGAGGGGGLGGGHGGGIGGGAGGGAGGGLGGGHGGGIGGGAGGGSGG 186
                  .         .         .         .         .
      54 PHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGAVVGGL 103
         . ||| |.. ||| |.|||. :. . .: . ..:   ||:: :|:..||
     187 GLGGGIGGGAGGGAGGGGGAGGGGGLGGGHGGGFGGGAGGGLGGGAGGGT 236
                  .         .         .         .         .
     104 GGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNF 153
         || : |:| : :    |:::::    :.   :...    : :: :....
     237 GGGFGGGAGGGAGGGAGGGFGGGAGGGAGGGFGGG....AGGGAGGGAGG 282
                  .         .         .         .         .
     154 VHDCVNITIKQHTVT...TTTKGENFTETDVKMMERVVEQMCITQYERES 200
         . :... . .. .|.   ....|:.|.:..    : ..:. . ..:: ::
     283 GFGGGAGGGHGGGVGGGFGGGSGGGFGGGA....GGGAGGGAGGGFGGGG 328
                  .
     201 QAYYQRGSSM 210
         .|  . |:::
     329 GAGGGFGGGF 338

humprp.pep                check: 414   from: 1      to: 231
SW:GRP1_PHAVU             check: 4254  from: 66     to: 252
     P10495 phaseolus vulgaris (kidney bean) (french bean). glycine-rich cell wa
ll st
 Gaps: 5  Quality: 78.5  Ratio: 0.444  Score: 68  Width: 7  Limits: +/-8
                  .         .         .         .         .
       4 PKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQ 53
         . .:|:..:|: ..|.|: ||.  ...:|||:|.. | | |:..||  |.
      76 GAGAGYGAAGGGHGGGGGNGGGGGGGADGGGYGGGAGKGGGEGYGG..GG 123
                  .         .         .         .         .
      54 PHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGAVVGGL 103
         ::|||:|.  |||.|.|||. .. . .|   ..    ||:: :|| .||
     124 ANGGGYGG..GGGSGGGGGGGAG.GAGSGYGGGEGSGAGGGYGGANGGGG 170
                  .         .         .         .         .
     104 GGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNF 153
         ||   |:: :..  | |:. ::    |.  . :...|  . .: :....
     171 GGNGGGGGGGSGGAHGGGAAGG...GEGAGQGAGGGYGGGAAGGGGRGSG 217
                  .         .         .         .
     154 VHDCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCIT 194
         . :.....      ....:|..:.:.:..  : . :. :..
     218 GGGGGGYG......GGGARGSGYGGGGGSGEGGGHGGGYYP 252

humprp.pep                check: 414   from: 1      to: 231
SW:GRP_ARATH              check: 6121  from: 104    to: 338
     P27483 arabidopsis thaliana (mouse-ear cress). glycine-rich cell wall struc
tural
 Gaps: 5  Quality: 88.9  Ratio: 0.415  Score: 68  Width: 8  Limits: +/-9
                  .         .         .         .         .
       6 PGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGW.... 51
         .|| ..| :  :|.||.||   . .||:| |.. |||:|..||||
     111 GGGHGGGIGGGAGGGSGGGLGGGIGGGAGGGAGGGGGLGGGHGGGIGGGA 160
                  .         .         .         .         .
      52 GQPHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGAVVG 101
         |.. |||:|..|||| |.|:|. |. . .:   ..    ||::::::..|
     161 GGGAGGGLGGGHGGGIGGGAGGGSGGGLGGGIGGGAGGGAGGGGGAGGGG 210
                  .         .         .         .         .
     102 GLG...GYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYS 148
         |||   |  :|:: : .:   ::: .:  : :.    :...   : ::::
     211 GLGGGHGGGFGGGAGGGLGGGAGGGTGGGFGGGAGGGAGGG...AGGGFG 257
                  .         .         .         .         .
     149 NQNNFVHDCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYER 198
         .... . ....:.  . ...... |:.|.:...   : .|:    ..::
     258 GGAG.GGAGGGFGGGAGGGAGGGAGGGFGGGAGGGHGGGVG....GGFGG 302
                  .         .
     199 ESQAYYQRGSSMVLFSSPPVIL 220
         :|.: :. |.: .  :::.. :
     303 GSGGGFGGGAGGGAGGGAGGGF 324

humprp.pep                check: 414   from: 1      to: 231
SW:PRIO_CHICK             check: 9640  from: 35     to: 273
     P27177 gallus gallus (chicken). major prion protein homolog precursor (pr-l
p) (a
 Gaps: 11  Quality: 150.4  Ratio: 0.674  Score: 67  Width: 35  Limits: +/-36
                  .         .         .         .         .
       8 GWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQ....PHGGGWGQ 53
         ||..|: | |: ....| .. |. . . | ||..|:.:    ||..|:.|
      35 GWGAGSHRQPSYPRQPGYPHNPGYPHNPGYPHNPGYPHNPGYPHNPGYPQ 84
                  .         .         .         .         .
      54 ....PHGGGWGQPHGGGWGQGGGTHSQWNKPSK.PKTNMKHMAGAAAAGA 98
             ||..|: .. |.|:..::|.  : .||.| ||||:||:||||||||
      85 NPGYPHNPGY.PGWGQGYNPSSGGSYHNQKPWKPPKTNFKHVAGAAAAGA 133
                  .         .         .         .         .
      99 VVGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYS 148
         |||||||| :|..||   .||:|. | |::.||  ||||.||||   :||
     134 VVGGLGGYAMGRVMSGMNYHFDSPDEYRWWSENSARYPNRVYYR...DYS 180
                  .         .         .         .         .
     149 N...QNNFVHDCVNITIKQHTVTTTTK........GENFTETDV..KMME 185
         .   |: || ||.|||:.:..:....|        :.| ||.::  |::.
     181 SPVPQDVFVADCFNITVTEYSIGPAAKKNTSEAVAAANQTEVEMENKVVT 230
                  .         .         .         .
     186 RVVEQMCITQYERESQAYYQRGSSMVLFSSPPVILLISFLIFLIV 230
         :|: :||: || ||    |. :|:: |   |: .:|  :|::|..
     231 KVIREMCVQQY.RE....YRLASGIQLH..PADTWLAVLLLLLTT 268

///////////////////////////////////////////////////////////////////////////

humprp.pep                check: 414   from: 1      to: 231
SW:LORI_MOUSE             check: 8318  from: 165    to: 481
     P18165 mus musculus (mouse). loricrin. 5/92
 Gaps: 4  Quality: 83.3  Ratio: 0.397  Score: 60  Width: 3  Limits: +/-4
                  .         .         .         .         .
       4 PKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQ 53
         ...|| ..||:.|.|.:|.||.. ...||:| |.  ||| | . |||::.
     169 SSCGGGSGGGGSYCGGSSGGGSSGGCGGGSGGGKYSGGGGGSSCGGGYSG 218
                  .         .         .         .         .
      54 PHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGAVVGGL 103
         . |:: | . |||::.|||. :. . .: . .. .  :|:..:|:..|:
     219 GGGSSGGSSCGGGYSGGGGSSCG.GGGGYSGGGGTSCGGGSSGGGGGGSS 267
                  .         .         .         .         .
     104 GGYM...LGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQ 150
         ..|    .|:: |      ||.::: |  :.  .:...  |.. :: | .
     268 QQYQCQSYGGGSSG.....GSSCGGGYSGGGGSSCGGG..YSGGGGSSCG 310
                  .         .         .         .         .
     151 NNFVHDCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERES 200
         ..   :. ... .. .. ... |:. .:..    :   .   ..| . ..
     311 GGSSGGGSSCGGSGGGGYSGGGGGSCGGGSSGGGGGYYSSQQTSQTSCAP 360
                  .
     201 QAYYQRGSSMVLFSSPPV 218
         |. |. ||| .  |:...
     361 QQSYGGGSSGGGGSCGGG 378
It's interesting to note that most of the non-prion hits in the WordSearch output are glycine-rich cell wall proteins, whereas the other approaches found this protein plus several other unrelated ones. WordSearch does have problems with the low-complexity, repeat types of sequence that we were able to filter out with BLAST, here that predominance of glycine repeats. Also notice that multiple segments for many sequences are found. WordSearch has largely been superseded by FastA and BLAST approaches but does remain online as an alternative. As stated earlier, it is good to gather as much different data as you can. One situation where WordSearch does excell is in finding difficult DNA alignments such as aligning cDNA to genomic DNA. Peter Rice, formerly at EMBL, Heidelberg, and now at the Sanger Centre, U.K., wrote the following concerning this aspect of WordSearch:
To: info-gcg@net.bio.net
From: rice@embl-heidelberg.de (Peter Rice)
Subject: Re: Way to gap around a 100 bp insertion?
Date: 11 Sep 93 13:57:19 GMT

In article <CD3st6.B28@watserv2.uwaterloo.ca>, kowalski@sciborg.uwaterloo.ca
(Paul Kowalski) writes:
> Is there some clever method to gap "around" a 100 bp insertion in genomic
> DNA? Lowering weights doesn't seem to help.

Sure. There is a program that does just that. Remember WORDSEARCH? The program
everyone used to use before FASTA . . . came along for GCG.

WORDSEARCH will find any decent sized exon (18 bases or so will get lost in
the noise of course). SEGMENTS is simply a BESTFIT run tied to each of
the hits. . . .

Use WORDSEARCH with a list size of 2 if you only have 2 exons. Use the
cDNA as the search sequence, and the genomic sequence as the "database".
I use a higher list size for safety, then edit out the lower scores from the
.WORD output file.

Use SEGMENTS to do the alignments.

WORDSEARCH is also wonderful for understanding fragments or contigs that
fail to overlap in Fragment Assembly.

I still wouldn't recommend WORDSEARCH for database searches, but for just a few
sequences (fragment assembly) or just one with several hits (as above) it
beats any other program.

You can lower the gap weights in SEGMENTS if you like. I often do for
Fragment Assembly matching, where you can expect a large number of 1-base
gaps (insertions or deletions) so you need a very low gap weight.

Interpreting Results--what is significant.

Now take three completely different, relatively high scoring sequences, one each, from each of the programs BLAST, FastA, and WordSearch, which do not obviously belong to the same gene family as your protein. These should be those sequences that score the highest of all the dissimilar matches but not be the same thing as your selected molecule. Be sure that each is a different type of entry from each of the searches. There are no "correct" answers here, we just want to see some interesting comparisons. Use typedata -reference to read about each sequence, if you are interested in what they are. We are going to experiment with various methods of analyzing similarity significance with these sequences. Relevant lines from my three example search files follow.

From the BLAST search, the top-most, non-obious match:

pir||S14078 amyloid protein, 11K - human 115 1.1e-09 1

This one actually does turn out to be a misnamed prion protein, but I will leave it in the list in order to show the difference between significant and insignificant match-ups.

From the FastA search, something completely different:

sw:rb87_drome Begin: 245 End: 356
! P48810 drosophila melanogaster (fru... 140 140 217 226.3 4.4e-06

And from the WordSearch output another totally different protein:

SW:GRP1_PETHY + 68 72 30 P09789 petunia hybrida (petunia). glycine-rich cell wall structural protein 1 pr

6) Dot matrix methods.

COMPARE and DOTPLOT: the GCG implementation of dot matrix analysis.

Dot matrix analysis is one of the few ways to identify other elements beyond what dynamic programming algorithms show to be similar between two sequences. GCG implements dot matrix methods with two programs. Compare generates the data that serves as input to DotPlot which actually draws the matrix. Compare your query sequence to each of its three near neighbors (as described above) using these methods. Start the program by typing compare; supply it with the appropriate sequence names. In general, put the longer sequence along the horizontal axis. Just as in all windowing algorithms, you want to use a window size of approximately the same size as the feature that you're trying to recognize. Leave the window at its default setting of 30 for these runs, unless one of your sequence is too short for this size of a window to find much. Rerun the program increasing the stringency of the comparisons until the number of points generated is of the same order of magnitude as the length of the longest sequence being compared. Below I found that the default stringency score of 33 out of a window of 30 resulted in 516 points -- the same magnitude as the sequences' lengths being compared. In this case, wherever the average of match scores within the window is equal to or exceeds 33, a point will be drawn at the middle of the window, then the window will be slid over one position at which point the process is repeated. You will be saving your three best output files from this program -- one comparing your protein sequence to an interesting non-family member found by BLAST, one to an analogous find by FastA, and finally the best non-Selected Molecule match found by WordSearch. I will illustrate two examples:

% compare

Compare compares two protein or nucleic acid sequences and creates a
file of the points of similarity between them for plotting with
DotPlot. Compare finds the points using either a
window/stringency or a word match criterion. The word comparison
is 1,000 times faster than the window/stringency comparison, but
somewhat less sensitive.

 COMPARE what horizontal sequence ?  humprp.pep

                  Begin (* 1 *) ?  <rtn>
                End (*   231 *) ?  <rtn>

 to what vertical sequence (* humprp.pep *) ?  pir:s14078

                  Begin (* 1 *) ?  <rtn>
                End (*   253 *) ?  <rtn>

 What comparison window size (* 30 *) ?  <rtn>

 What stringency (* 33 *) ?  <rtn>

 What should I call the output file (* humprp.pnt *) ?  humprp-s14708.pnt

 Number of points: 516

 Writing ..
After you have worked out appropriate stringency parameters, provide the .pnt file as input to DotPlot. Type dotplot and provide the correct filename; accept all program defaults. One exception would be if you are comparing a very long sequence to a very short one, then decreasing the plotting density will spread the graphic onto more than one page allowing you to 'zoom in' on the section of interest. The prion cross human amyloid protein comparison example follows:
% dotplot

DotPlot makes a dot-plot with the output file from Compare,
FoldRNA, or StemLoop.

  Process set to plot with VERSATERM-TEK4105 attached to term:
  using the tekd graphic interface.

 DOTPLOT what point file ?  humprp-s14708.pnt

 humprp-s14078.pnt contains COMPARE results of

    Axis                 Name   Check   Start     End   Dir

 Horizontal        humprp.pep     414       1     231   for
  Vertical             s14078    8781       1     253   for

             Window . . . . . . . . . 30
             Stringency . . . . . . . 33.0
             Number of points . . . . 516
             Percent of possible  . . 0.883

 The minimum density for a one-page plot is
 289.8 bases/100 platen units on each axis.

 What point density would you like (* 289.8 *) ?  <rtn>

 DOTPLOT will take 1 pages.  Would you like to:  <rtn>

       P)lot the points
       D)ifferent density
       G)et another point file to plot

       Q)uit

 Please select one (* P *):  <rtn>

 When your VERSATERM-TEK4105 attached to tty is ready, press <Return>.  <rtn>
The human prion cross human amyloid protein dotplot follows:

       P)lot the points
       D)ifferent density
       G)et another point file to plot

       Q)uit

 Please select one (* Q *):  <rtn>
Notice that running the comparison at an appropriate stringency, in this case 33 in a window of 30, produces a very clean plot with very little confusing noise. Still, sometimes interpreting a dotplot can be a major accomplishment in itself -- just remember that diagonals are regions of similarity between the two sequences and that any diagonal off the main center line is indicative of regions which do not correspond in linear placement between the two sequences yet are still similar. The regions of similarity between the human prion protein and the human amyloid protein are very clear in the dotplot. In fact, the amyloid protein turns out to be a variant prion molecule; the extensive full-length similarity confirms this. Also notice the strong direct repeat region; a sequence located near the beginning of the proteins occurs at least six times in each sequence. Columns and rows of diagonals always mean direct repeat sequences.

To contrast the extensive similarity seen above, here's the dotplot of the best non-prion match-up from FastA, the ribonucleoprotein RB87 from Drosophila. I ran this comparison at a stringency of 40 with the default window size of 30 to generate 719 points:

Here similarity is restricted to an area of direct repeats -- a doublet section near the beginning of the prion repeats itself several times in the carboxy half of RB87. When running all the dotplots, take notes of those particular regions in the proteins that appear significantly similar. For example, as noted in the above plot, the region of prion sequence from about residue 20 through 60 is quite similar to and repeats itself in RB87 sequence from around residue 210 through 320. We will need these numbers in the next section. Rerun all three of your DotPlots with appropriate stingencies, however, add the option -figure=(your name).dotplot1, 2, or 3 respectively to the command line of each.

7) The dynamic programming alignment algorithms: use the right one for the right job -- Gap and BestFit.

You need to understand the difference between these two algorithms! Gap is more a 'global' alignment scheme and BestFit more 'local.' Using one versus the other implies that you are looking for distinctly different relationships. Know what they mean. If you already know that the full length of two sequences are pretty close, that they probably belong to the same family, then Gap is the program for you; if you only suspect an area of one is similar to another, then you should use BestFit. To force BestFit to be even more local, try specifying a more stringent alternative symbol comparison table, such as pam250.cmp or blosum100.cmp located in the logical directory GenMoreData. Both programs can generate 'gapped' output files in standard sequence formats; this can be very handy as direct input to other GCG routines -- particularly multiple sequence analysis programs.

BESTFIT and GAP: how to use them to estimate significance.

What is significant? GCG provides a way to estimate significance in these two programs. When running them specify the option randomizations on the command line and the second input sequence will be jumbled a set number of times after the initial alignment is produced. Comparing the quality scores of the randomized alignments to the initial alignment can help you get a feeling for the relative meaning of the scores. An old rule-of-thumb that people often use is, if the actual score is much more than three standard deviations above the mean of the randomized scores, the analysis may be significant, if it is more than five, then it most probably is significant, if it's above around nine, then certainly so. This distance above the mean is often known as a "Z score" and can be calculated with the following formula:

Z score =   [ ( actual score ) - ( mean of randomized scores ) ]
( standard deviation of randomized score distribution )
This type of significance analysis is known as a Monte Carlo approach; it has many implicit problems, however, few practical alternatives exist. I will use randomizations with BestFit to illustrate with the prion cross amyloid and RB87 proteins to illustrate. Before beginning though, study your dotplot notes from before. This approach works best when applied to areas where you already know some similarity exists and you wish to further test that similarity. Therefore, restrict your analyses to those regions identified by the dotplots. However, remember that dotplots show us all the regions that are similar, whereas dynamic programming only gives us one optimal solution. Type bestfit -random=100 and supply the appropriate sequence names to run the program. Be sure to restrict the analyses to those regions of each sequence identified by your dotplots. Accept the rest of the program's defaults; however, give the output file names that make sense to you while keeping the .pair extension. The process follows below for the human prion and amyloid protein comparison:
% bestfit -random=100

BestFit makes an optimal alignment of the best segment of similarity
between two sequences. Optimal alignments are found by inserting gaps to
maximize the number of matches using the local homology algorithm of
Smith and Waterman.

 BESTFIT of what sequence 1 ?  humprp.pep

                  Begin (* 1 *) ?  10
                End (*   231 *) ?  220

 to what sequence 2 (* humprp.pep *) ?  pir:s14078

                  Begin (* 1 *) ?  30
                End (*   253 *) ?  240

 What is the gap creation penalty (* 12 *) ?  <rtn>

 What is the gap extension penalty (* 4 *) ?  <rtn>

 What should I call the paired output display file (* humprp.pair *) ?  humprp-s
14078.pair

 Aligning ..........-.
 Aligning ..........-..

          Gaps:     0
       Quality:  1195
 Quality Ratio: 5.718
  % Similarity: 100.000
        Length:   209

Randomized alignment              Quality
           1                           46
           2                           49
           3                           41
           4                           45
           5                           31
           6                           54
           7                           60
           8                           54
           9                           36
          10                           43
          11                           36
          12                           32
          13                           49

/////////////////////////////////////////

          96                           52
          97                           43
          98                           47
          99                           53
         100                           55

 Average quality based on 100 randomizations: 44.2 +/- 7.7
The output file is shown below. In this example's case the sequences are identical within the stretch being compared and the randomized quality scores are all much lower than the actual score. The Z score comes out at almost 150; therefore, the interpretation is that the similarity is extremely significant.
 BESTFIT of: humprp.pep  check: 414  from: 10  to: 220

TRANSLATE of: humprp.genbank check: 4378 from: 116 to: 808
(mature protein only: PrP 33-35-C)
generated symbols 1 to: 231.
LOCUS       HUMPRP       2415 bp ss-mRNA            PRI       20-MAY-1987
DEFINITION  Human prion protein (PrP) mRNA, complete cds.
ACCESSION   M13899 . . .

 to: s14078  check: 8781  from: 30  to: 240

P1;UJHU - major prion protein precursor - human
N;Alternate names: 11K amyloid protein; 27-30K sialoglycoprotein; PrP 27-30;
 PrP 33-35C; scrapie prion protein
C;Species: Homo sapiens (man)
C;Date: 25-Oct-1987 #sequence_revision 12-Apr-1996 #text_change 12-Apr-1996
C;Accession: A24173; A40372; A05017; S14078
R;Kretzschmar, H.A.; Stowring, L.E.; Westaway, D.; Stubblebine, W.H.; Prusiner,
 S.B.; Dearmond, S.J

 Symbol comparison table: /disk2/usr/local/soft/seq/gcgf9/gcgcore/data/rundata/b
losum62.cmp
 CompCheck: 6430

         Gap Weight:     12      Average Match:  2.912
      Length Weight:      4   Average Mismatch: -2.003

            Quality:   1195             Length:    209
              Ratio:  5.718               Gaps:      0
 Percent Similarity: 100.000   Percent Identity: 100.000

 Average quality based on 100 randomizations: 44.2 +/- 7.7

        Match display thresholds for the alignment(s):
                    | = IDENTITY
                    : =   2
                    . =   1

 humprp.pep x s14078       November 25, 1996 13:28  ..

                  .         .         .         .         .
      10 NTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQPHGGGW 59
         ||||||||||||||||||||||||||||||||||||||||||||||||||
      32 NTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQPHGGGW 81
                  .         .         .         .         .
      60 GQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGAVVGGLGGYMLG 109
         ||||||||||||||||||||||||||||||||||||||||||||||||||
      82 GQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGAVVGGLGGYMLG 131
                  .         .         .         .         .
     110 SAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCVN 159
         ||||||||||||||||||||||||||||||||||||||||||||||||||
     132 SAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCVN 181
                  .         .         .         .         .
     160 ITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERESQAYYQRGSS 209
         ||||||||||||||||||||||||||||||||||||||||||||||||||
     182 ITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERESQAYYQRGSS 231

     210 MVLFSSPPV 218
         |||||||||
     232 MVLFSSPPV 240
It turns out that the amyloid protein that we were testing is an aberrant prion particle, in spite of it's name, most often associated with Alzheimer's disease associated proteins, and this is why the significance level is so high. This underescores the importance of investigating any entries that appear interesting. The weird thing is how far down the BLAST output list S14078 was.

However, often a seemingly good alignment will not be significant upon further inspection -- do not blindly accept the output of any computer program! Always investigate further for similarities can be strictly artifactual. The second comparison that I chose, sw:rb87_drome against the human prion protein, turned out to be entirely insignificant. From the previous DotPlot analysis, I saw that the region of humprp.pep from about residue 20 through 60 repeats strongly in RB87 around residue 210 through residue 320. A Monte Carlo analysis of this comparison is shown in an abridged fashion below. The Z score is -0.3:

% bestfit -random=100 humprp.pep sw:rb87_drome

BestFit makes an optimal alignment of the best segment of similarity
between two sequences. Optimal alignments are found by inserting gaps to
maximize the number of matches using the local homology algorithm of
Smith and Waterman.

                  Begin (* 1 *) ?  20
                End (*   231 *) ?  60

                  Begin (* 1 *) ?  210
                End (*   386 *) ?  320

 What is the gap creation penalty (* 12 *) ?  <rtn>

 What is the gap extension penalty (* 4 *) ?  <rtn>

 What should I call the paired output display file (* humprp.pair *) ?  humprp-r
b87.pair

 Aligning .....-.
 Aligning .....-.


 BESTFIT of: humprp.pep  check: 414  from: 20  to: 60
 to: rb87_drome  check: 2689  from: 210  to: 320

ID   RB87_DROME     STANDARD;      PRT;   386 AA.
DE   HETEROGENEOUS NUCLEAR RIBONUCLEOPROTEIN 87F (HRP36.1 PROTEIN) (P11 . . .

 Symbol comparison table: /disk2/usr/local/soft/seq/gcgf9/gcgcore/data/rundata/b
losum62.cmp  CompCheck: 6430

         Gap Weight:     12      Average Match:  2.912
      Length Weight:      4   Average Mismatch: -2.003

            Quality:     78             Length:     41
              Ratio:  1.902               Gaps:      0
 Percent Similarity: 41.463   Percent Identity: 41.463

 Average quality based on 100 randomizations: 80.3 +/- 8.0

        Match display thresholds for the alignment(s):
                    | = IDENTITY
                    : =   2
                    . =   1

 humprp.pep x rb87_drome   November 25, 1996 13:55  ..

                  .         .         .         .
      20 GSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQPHGGGWG 60
         |   | .    ||| ||   |||     || .|   ||| |
     223 GGGWGGQNRQNGGGNWGGAGGGGGFGNSGGNFGGGQGGGSG 263
The extreme glycine richness of both sequences is what the programs found, yet the Monte Carlo test suggests that this similarity is not at all significant, it is merely the result of compositional bias. The program Seg is now available outside of BLAST for prefiltering your sequences. This may be particularly prudent in situations such as the prion molecule where you know that a lot of repeat and low complexity sequence compostion has the potential to confound search algorithms.

8) And finally -- interpreting Profile analysis -- why even bother?

The Profile method enables the researcher to recognize features that may be otherwise invisible. The greatly enhanced information content of a Profile over individual sequences has the potential to find similar motifs in sequences which may be only distantly related. Even though ProfileSearches do require some work to setup and run -- a meaningful multiple sequence alignment must be assembled and refined, ProfileMake needs to be run, and the search job itself takes quite a long time to run -- it is worth it.

What PROFILE analysis can show us.

When your ProfileSearch finally finishes, take a look at the output. There is a strong chance that some of the sequences in it were not found by other search algorithms. The abridged output file from the example ProfileSearch is shown below:

(Peptide) PROFILESEARCH of: /disk2/usr/local/people/thompson/courses/BC578/EX8/p
rion.prf Length: 251 to: SwissProt:*

         Scores are corrected for composition effects

                 Gap Weight: 33.00
          Gap Length Weight: 0.37
         Sequences Examined: 49571
         CPU time (seconds): 9826
*    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *
Profile information:
(Peptide) PROFILEMAKE v4.50 of: prion.msf{*}  Length: 251
  Sequences: 23  MaxScore: 1416.57  November 20, 1996 11:38
                          Gap: 1.00              Len: 1.00
                     GapRatio: 0.33         LenRatio: 0.10
         prion.msf{PRIO_BOVIN}  From: 11        To: 261       Weight: 1.00
         prion.msf{PRIO_TRAST}  From: 11        To: 261       Weight: 1.00 . . .

*    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *
Normalization:                                  November 21, 1996 19:09

         Curve fit using 42 length pools
         0 of 42 pools were rejected

         Normalization equation:

                 Calc_Score = 276.50 * ( 1.0 - exp(-0.0103*SeqLen + 0.4126) )

         Correlation for curve fit: 0.902

         Z score calculation:
         Average and standard deviation calculated using 49534 scores
         37 of 49571 scores were rejected

                 Z_Score = ( Score/Calc_Score - 1.017 ) / 0.175

          Sequence  Strd ZScore   Orig Length ! Documentation  ..
SW:PRIO_CALMO         +   26.89 1378.13    241 ! P40248 callicebus moloch (dusky
 titi). major prion protein precursor (prp) (prp27-3
SW:PRIO_MANSP         +   26.79 1373.70    241 ! P40255 mandrillus sphinx (mandr
ill) (papio sphinx). major prion protein precursor (
SW:PRIO_AOTTR         +   26.68 1364.99    239 ! P40245 aotus trivirgatus (night
 monkey) (douroucouli). major prion protein precurso
SW:PRIO_COLGU         +   26.60 1388.79    253 ! P40251 colobus guereza. major p
rion protein precursor (prp) (prp27-30) (prp33-35c).
////////////////////////////////////////////////////////////////////////////////
SW:PRP2_TRAST         +   25.94 1366.14    256 ! P40243 tragelaphus strepsiceros
 (greater kudu). major prion protein 2 precursor (pr
SW:PRIO_TRAST         +   25.89 1377.09    264 ! P40242 tragelaphus strepsiceros
 (greater kudu). major prion protein 1 precursor (pr
SW:PRIO_MESAU         +   25.84 1358.16    254 ! P04273 mesocricetus auratus (go
lden hamster). major prion protein precursor (prp) (
SW:PRIO_RAT           +   24.99 1266.69    226 ! P13852 rattus norvegicus (rat).
 major prion protein (prp) (fragment). 2/95
SW:YACG_ECOLI         +   12.96  86.72     50 ! P36681 escherichia coli. hypothe
tical 5.8 kd protein in mutt-guac intergenic region
SW:POR_GLUOX          +   11.67  80.79     50 ! P80354 gluconobacter oxydans. po
lyol:nadp oxidoreductase (ec 1.1.1.-) (fragment). 1
SW:INS_GADCA          +   11.64  88.42     51 ! P01336 gadus callarias (baltic c
od). insulin. 2/95
SW:YKQ2_CAEEL         +   11.52  80.07     50 ! P34297 caenorhabditis elegans. h
ypothetical 5.0 kd protein c06e1.2 in chromosome ii
SW:YKQ5_CAEEL         +   11.35  79.31     50 ! P34300 caenorhabditis elegans. h
ypothetical 5.0 kd protein c06e1.5 in chromosome ii
SW:YP55_ECOLI         +   11.26  94.04     52 ! P14505 escherichia coli. hypothe
tical 5.5 kd protein in replication origin region.
SW:INS2_BATSP         +   11.16  78.40     50 ! P01338 batrachoididae sp. (toadf
ish). insulin 2. 2/95
SW:YCUA_LACCU         +   11.15  85.94     51 ! P35923 lactobacillus curvatus. h
ypothetical 6.1 kd protein in cura 3'region (orfb).
SW:MYX4_CRODU         +   10.90  84.67     51 ! P24334 crotalus durissus terrifi
cus (south american rattlesnake). myotoxin 4 precur
SW:INS_MYOSC          +   10.79  76.70     50 ! P07453 myoxocephalus scorpius (s
horthorn sculpin) (daddy sculpin). insulin. 2/95
SW:YF69_HAEIN         +   10.77  76.62     50 ! P44258 haemophilus influenzae. h
ypothetical protein hi1569. 11/95
SW:YIAZ_ECOLI         +   10.68  76.21     50 ! P37305 escherichia coli. hypothe
tical 6.0 kd protein in cspa-glys intergenic region
SW:PND2_ECOLI         +   10.67  76.17     50 ! P16477 escherichia coli. pnd pro
tein. 8/91
SW:PND1_ECOLI         +   10.62  75.93     50 ! P11902 escherichia coli. pnd pro
tein. 11/95
SW:VG38_BPML5         +   10.57  75.69     50 ! Q05248 mycobacteriophage l5. gen
e 38 protein (gp38). 2/94
SW:INS_ONCGO          +   10.46  75.18     50 ! P23187 oncorhynchus gorbuscha (p
ink salmon) (humpback salmon), and oncorhynchus kis
//////////////////////////////////////////////////////////////////////////////
ProfileSearch Z scores are normalized and reflect the significance of the results. Here, rather than randomizing sequences to evaluate the Z score as is done in the above Monte Carlo method, it is calculated, as in BLAST, based on all of the nonsimilar sequences of the database search. As before, generally Z scores below 3 are not worth considering, those from 4 to 7 may be interesting, and those above 7 should definitely be checked out further. The program ProfileSegments makes BestFit style alignments of the results of a ProfileSearch. A great new option in ProfileSegments allows you to prepare a multiple sequence alignment of the ProfileSearch segments. Take advantage of the -msf option in your run below. The full information content of the profile is utilized in these alignment procedures. When your ProfileSearch has finished running, use pico to edit its output file the same as you did the .word file (remember -- keep a couple of reference sequences and several "interesting, twilight zone" comparisons and get rid of everything else, except the reference header material). Next, make alignments off of it with ProfileSegments. Be sure to create enough alignments to cover your revised list size. Give the output pairs file the extension .profilepairs and the output msf file an appropriate name. The abridged prion example follows:
% profilesegments -check

ProfileSegments makes optimal alignments showing the segments of
similarity found by ProfileSearch.

Minimal Syntax:  % profilesegments [-INfile=]hsp70.pfs -Default

Prompted Parameters:

-SEQLimit=15            limit the number of alignments
[-OUTfile=]hsp70.pairs  output file for the alignments

Local Data Files: None

Optional Parameters:

-LOCal                 aligns the best segment of similarity between
                         the sequence and profile (local alignment is
                         the default)
-GLObal                aligns the whole sequence and profile (global
                         alignment)
-NOAVErage             does not adjust alignment scores for sequence
                         composition
-ENDWeight             penalizes end gaps like other gaps
-LIMit1=20             lets you set a gap shift limit for the sequence
 Press q to quit or <Return> for more:
-LIMit2=20             lets you set a gap shift limit for the profile
-OUTfile2=hs7t_rat.gap new file for sequence 1 with gaps added
-OUTfile3=hsp70.gap    new file for the profile consensus with gaps added
-MSF[=hsp70.msf]       new MSF file containing alignment of all the
                           sequences with the profile consensus
-PAIr=1.0,0.5,0.1      thresholds for displaying '|', ':', and '.'
-NOMONitor             suppresses the screen summary for each alignment

 Add what to the command line ?  -msf

 PROFILESEGMENTS  version 4.50     November 26, 1996 11:11

 (Local) PROFILESEGMENTS of what PROFILESEARCH output file ? prion.pfs

 Stop after how many alignments (* 15 *) ? 50

 What should I call the paired output display file (* prion.pairs *) ?  prion.pr
ofilepairs

 What should I call the multiple sequence output file (* prion.msf *) ?  prion-p
rofile.msf

        The following levels will be marked in the alignments:
                   Bar: 3.21
                 Colon: 2.34
                   Dot: 1.47

 Aligning ............-............
 Sequence:    1 prio_calmo

 SW:PRIO_CALMO

                           Gaps:      4

                               Quality: 1378.12

                       Quality Ratio:   5.77

                              Length:    251

 Aligning ............-..
 Sequence:    2 yacg_ecoli

 SW:YACG_ECOLI


                           Gaps:      2

                               Quality:  86.72

                       Quality Ratio:   3.10

                              Length:     34

/////////////////////////////////////////////////////////////////////
Display the results of your ProfileSegments run and notice how much different the alignments, after the obvious first ones, are from the previous examples. In my case, the proline, glutamine richness of certain areas of the UV15 protein was found similar to the prion, and the string of alanines in certain chorion proteins is considered important by the algorithm. No previous program recognized either of these similarities. "Clustering" is also much more critical to Profile analyses than any other method. This is because of profile analysis' variable gap penalties where conserved areas are not allowed to gap and variable regions are free to. The abridged prion ProfileSegments example output file follows:
% more prion.profilepairs

 (Local) PROFILESEGMENTS of: PRIO_CALMO  check: 8361  from: 1  to: 241

//////////////////////////////////////////////////////////////////////////////

 Profile: prion.prf

         Gap Weight: 33.000      Average Match:  2.340
      Length Weight:  0.370   Average Mismatch: -1.705

            Quality: 1378.12             Length:    251
              Ratio:   5.77               Gaps:      4

 prio_calmo x prion.prf    November 26, 1996 11:11  ..

                  .         .         .         .         .
S      2 LVLFVATWSDLGLCKKRPKP.GGWNTGGSRYPGQGSPGGNRYPPQG.... 46
         :.:|::.|.| |:||||||| ||||||||||||||||||||||||:
P      1 LVLFVATWSBLGLCKKRPKPGGGWNTGGSRYPGQGSPGGNRYPPQGGGGW 50
                  .         .         .         .         .
S     47 ....GGSWGQPHGGGWGQPHGGGWGQPHGGGWGQPH.GGGWGQGGGTHNQ 91
             || ||||||||||||||||||||||||||||| ||||||:|||| |
P     51 GQPHGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQPHGGGGWGQGGGTHNQ 100
                  .         .         .         .         .
S     92 WNKPSKPKTNMKHVAGAAAAGAVVGGLGGYMLGSAMSRPLIHFGNDYEDR 141
         |:||:||||:||| ||||||||||||||||||||||:||.:|||:|||||
P    101 WNKPSKPKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPLIHFGNBYEBR 150
                  .         .         .         .         .
S    142 YYRENMYRYPNQVYYRPVDQYSNQNNFVHDCVNITIKQHTVTTTTKGENF 191
         ||||||:||||||||||.|||.|||||||||||:|.||||||||||||||
P    151 YYRENMYRYPNQVYYRPVBQYSNQNNFVHBCVNITIKQHTVTTTTKGENF 200
                  .         .         .         .         .
S    192 TETDVKMMERVVEQMCITQYEKESQAYY..QRGSSMVLFSSPPVILLISF 239
         ||||.|:|||||||||.|||..||||||  :||.:..:||:|||||||||
P    201 TETBVKMMERVVEQMCITQYEKESQAYYBGQRGSSMVLFSSPPVILLISF 250

S    240 L 240
         |
P    251 L 251

 (Local) PROFILESEGMENTS of: YACG_ECOLI  check: 6061  from: 1  to: 50

//////////////////////////////////////////////////////////////////////////////

DE   HYPOTHETICAL 5.8 KD PROTEIN IN MUTT-GUAC INTERGENIC REGION. . . 

//////////////////////////////////////////////////////////////////////////////

 Profile: prion.prf

         Gap Weight: 33.000      Average Match:  2.340
      Length Weight:  0.370   Average Mismatch: -1.705

            Quality:  86.72             Length:     34
              Ratio:   3.10               Gaps:      2

 yacg_ecoli x prion.prf    November 26, 1996 11:11  ..

                  .         .         .
S      3 WGEISPFRPFCSKRCQLIDLGEWAAEEKRIPSSG 36
         |         | ||      | |     | |  |
P      8 WSBLG....LCKKRPK..PGGGWNTGGSRYPGQG 35

//////////////////////////////////////////////////////////////////////////////

 (Local) PROFILESEGMENTS of: MYX4_CRODU  check: 9780  from: 1  to: 51

//////////////////////////////////////////////////////////////////////////////

DE   MYOTOXIN 4 PRECURSOR (CROTAMINE 4) (FRAGMENT). . . .

//////////////////////////////////////////////////////////////////////////////

 Profile: prion.prf

         Gap Weight: 33.000      Average Match:  2.340
      Length Weight:  0.370   Average Mismatch: -1.705

            Quality:  84.64             Length:     67
              Ratio:   2.02               Gaps:      4

 myx4_crodu x prion.prf    November 26, 1996 11:11  ..

                  .         .         .         .         .
S      1 FLS............EP.......GNAYKQCHKKGGHCFPKEKICIPPSS 31
         |               |       |  |      ||  :| .
P      4 FVATWSBLGLCKKRPKPGGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQP 53
                  .
S     32 DFGKMDCRWR......W 42
           |     |       |
P     54 HGGG....WGQPHGGGW 66


 (Local) PROFILESEGMENTS of: INS_MYOSC  check: 6776  from: 1  to: 50

//////////////////////////////////////////////////////////////////////////////

DE   INSULIN. . . .

//////////////////////////////////////////////////////////////////////////////

 Profile: prion.prf

         Gap Weight: 33.000      Average Match:  2.340
      Length Weight:  0.370   Average Mismatch: -1.705

            Quality:  76.70             Length:      6
              Ratio:  12.78               Gaps:      0

 ins_myosc x prion.prf     November 26, 1996 11:11  ..

S     31 IVEQCC 36
         :||| |
P    211 VVEQMC 216

 (Local) PROFILESEGMENTS of: YF69_HAEIN  check: 6428  from: 1  to: 50

//////////////////////////////////////////////////////////////////////////////

DE   HYPOTHETICAL PROTEIN HI1569. . . .

//////////////////////////////////////////////////////////////////////////////

 Profile: prion.prf

         Gap Weight: 33.000      Average Match:  2.340
      Length Weight:  0.370   Average Mismatch: -1.705

            Quality:  76.62             Length:     24
              Ratio:   3.19               Gaps:      0

 yf69_haein x prion.prf    November 26, 1996 11:11  ..

                  .         .
S      8 YTFIRQGKMLVVKILPKEIFVLSF 31
         |     |   .    |  |  .||
P    227 YYBGQRGSSMVLFSSPPVILLISF 250

 (Local) PROFILESEGMENTS of: YIAZ_ECOLI  check: 4447  from: 1  to: 50

//////////////////////////////////////////////////////////////////////////////

DE   HYPOTHETICAL 6.0 KD PROTEIN IN CSPA-GLYS INTERGENIC REGION. . . .

//////////////////////////////////////////////////////////////////////////////

 Profile: prion.prf

         Gap Weight: 33.000      Average Match:  2.340
      Length Weight:  0.370   Average Mismatch: -1.705

            Quality:  76.21             Length:     22
              Ratio:   4.48               Gaps:      4

 yiaz_ecoli x prion.prf    November 26, 1996 11:11  ..

                  .         .
S     17 LLFF..TWMIRD.SLCELHIKQ 35
         :  |  .|   |  :|    |
P      1 LVLFVATW..SBLGLCK...KR 17

 (Local) PROFILESEGMENTS of: PND2_ECOLI  check: 9184  from: 1  to: 50

//////////////////////////////////////////////////////////////////////////////

DE   PND PROTEIN. . . .

//////////////////////////////////////////////////////////////////////////////

 Profile: prion.prf

         Gap Weight: 33.000      Average Match:  2.340
      Length Weight:  0.370   Average Mismatch: -1.705

            Quality:  76.16             Length:     35
              Ratio:   2.93               Gaps:      5

 pnd2_ecoli x prion.prf    November 26, 1996 11:11  ..

                  .         .         .
S     10 LIVICVTILCFV..WMVRD.SLCGLRLQQ...GNT 38
         :         |:  |   |  :|  |       ||
P      1 LVL.......FVATW..SBLGLCKKRPKPGGGWNT 26

//////////////////////////////////////////////////////////////////////////////

 (Local) PROFILESEGMENTS of: VG38_BPML5  check: 8056  from: 1  to: 50

//////////////////////////////////////////////////////////////////////////////

DE   GENE 38 PROTEIN (GP38). . . .

//////////////////////////////////////////////////////////////////////////////

 Profile: prion.prf

         Gap Weight: 33.000      Average Match:  2.340
      Length Weight:  0.370   Average Mismatch: -1.705

            Quality:  75.64             Length:     72
              Ratio:   1.76               Gaps:      4

 vg38_bpml5 x prion.prf    November 26, 1996 11:11  ..

                  .         .         .         .         .
S      1 MKKTIAA.AIIGLAATLSLSAC...................EGGTGSSDY 30
         .   ::     |:        |                   .|  |   |
P      1 LVLFVATWSBLGL........CKKRPKPGGGWNTGGSRYPGQGSPGGNRY 42
                  .         .
S     31 TPSVPVFI.........QMPGG 43
          |               |  ||
P     43 PPQGGGGWGQPHGGGWGQPHGG 64

//////////////////////////////////////////////////////////////////////////////
All of the non-prion segment entries that I retained on my .pfs file had ProfileSearch Z scrores between 10 and 13. Are any of these seven biologically significant? It may be worth further investigation; some of the motifs look quite interesting. The multiple sequence alignment output that we created with the msf option follows below in abridged form:
% more prion-profile.msf

!!AA_MULTIPLE_ALIGNMENT 1.0
Multiple alignment of prion.prf and
          prio_calmo          yacg_ecoli           por_gluox           ins_gadca
          ykq2_caeel          ykq5_caeel          yp55_ecoli          ins2_batsp
          ycua_laccu          myx4_crodu           ins_myosc          yf69_haein
          yiaz_ecoli          pnd2_ecoli          pnd1_ecoli          vg38_bpml5
           ins_oncgo          lha6_rhoac          disi_erima          ins1_batsp
          y476_haein          p100_human          yftr_metth            atpe_rat
           ins_progu          orne_plaor          znfp_lycvp           ins_katpe

 prion-profile.msf  MSF: 360  Type: P  November 26, 1996 11:11  Check: 6266 ..

 Name: prio_calmo       Len:   360  Check: 5192  Weight:  1.00
 Name: yacg_ecoli       Len:   360  Check: 2335  Weight:  1.00
 Name: por_gluox        Len:   360  Check: 9283  Weight:  1.00
 Name: ins_gadca        Len:   360  Check: 7065  Weight:  1.00
 Name: ykq2_caeel       Len:   360  Check: 9842  Weight:  1.00
 Name: ykq5_caeel       Len:   360  Check: 6224  Weight:  1.00
 Name: yp55_ecoli       Len:   360  Check: 8947  Weight:  1.00
 Name: ins2_batsp       Len:   360  Check: 3627  Weight:  1.00
 Name: ycua_laccu       Len:   360  Check: 9583  Weight:  1.00
 Name: myx4_crodu       Len:   360  Check:   71  Weight:  1.00
 Name: ins_myosc        Len:   360  Check:  319  Weight:  1.00
 Name: yf69_haein       Len:   360  Check:  997  Weight:  1.00
 Name: yiaz_ecoli       Len:   360  Check: 8336  Weight:  1.00
 Name: pnd2_ecoli       Len:   360  Check:  927  Weight:  1.00
 Name: pnd1_ecoli       Len:   360  Check: 6285  Weight:  1.00
 Name: vg38_bpml5       Len:   360  Check: 6103  Weight:  1.00
 Name: ins_oncgo        Len:   360  Check:  319  Weight:  1.00
 Name: lha6_rhoac       Len:   360  Check: 5467  Weight:  1.00
 Name: disi_erima       Len:   360  Check: 3404  Weight:  1.00
 Name: ins1_batsp       Len:   360  Check: 7387  Weight:  1.00
 Name: y476_haein       Len:   360  Check: 1904  Weight:  1.00
 Name: p100_human       Len:   360  Check: 6781  Weight:  1.00
 Name: yftr_metth       Len:   360  Check: 8238  Weight:  1.00
 Name: atpe_rat         Len:   360  Check: 8540  Weight:  1.00
 Name: ins_progu        Len:   360  Check: 1891  Weight:  1.00
 Name: orne_plaor       Len:   360  Check: 5069  Weight:  1.00
 Name: znfp_lycvp       Len:   360  Check: 7948  Weight:  1.00
 Name: ins_katpe        Len:   360  Check: 8166  Weight:  1.00
 Name: prion.prf        Len:   360  Check: 6016  Weight:  1.00

//

            1                                                   50
prio_calmo  LV..L..... ..F....... .VAT...... W......... ..........
yacg_ecoli  .......... .......... .......... W......... ..........
 por_gluox  .......... .......... .......... .......... ..........
 ins_gadca  .......... .......... .......... .......... ..........
ykq2_caeel  .......... .......... .......... .......... ..........
ykq5_caeel  .......... .......... .......... .......... ..........
yp55_ecoli  .......... .......... .......... .......... ..........
ins2_batsp  .......... .......... .......... .......... ..........
ycua_laccu  LI..IILR.. ..F....... .LIV...... W......... ..........
myx4_crodu  .......... ..F....... .LS....... .......... ..........
 ins_myosc  .......... .......... .......... .......... ..........
yf69_haein  .......... .......... .......... .......... ..........
yiaz_ecoli  LL..F..... ..F....... ...T...... WMI....... ..........
pnd2_ecoli  LI..VICVTI LCF....... .V........ WMV....... ..........
pnd1_ecoli  LI..VVCVTI LCF....... .V........ WMV....... ..........
vg38_bpml5  MK..K..... ..T....... .IAA...... .......... ..........
 ins_oncgo  .......... .......... .......... .......... ..........
lha6_rhoac  LI..IH.... ..F....... .IALSTDRFN W......... ..........
disi_erima  RC..K..... ..FKRAGKVC RVARGD.... W......... ..........
ins1_batsp  .......... .......... .......... .......... ..........
y476_haein  .......... .......... .......... .......... ..........
p100_human  .......... .......... .......... .......... ..........
yftr_metth  .......... .......... .......... .......... ..........
  atpe_rat  .......... .......... .VAY...... W......... ..........
 ins_progu  .......... .......... .......... .......... ..........
orne_plaor  .L..L..... ..Y....... .CGE...... F......... ..........
znfp_lycvp  PL..N..... ..C....... .KSC...... WQKFDSLVRC HDHYLCRHCL
 ins_katpe  LVEAL..... ..Y....... .L........ .......... ..........
 prion.prf  LV..L..... ..F....... .VAT...... W......... ..........

//////////////////////////////////////////////////////////////////////////////

            301                                                350
prio_calmo  TTTTKGENFT ETDVKMMERV VEQMCITQYE KESQAYY..Q RGSSMVLFSS
yacg_ecoli  .......... .......... .......... .......... ..........
 por_gluox  .......... .......... .......... .......... ..........
 ins_gadca  .......... .......... .......... .......... ..........
ykq2_caeel  .......... .......... .......... .......... ..........
ykq5_caeel  .......... .......... .......... .......... ..........
yp55_ecoli  .......... .......... .......... .......... ..........
ins2_batsp  .......... .......... .......... .......... ..........
ycua_laccu  .......... .......... .......... .......... ..........
myx4_crodu  .......... .......... .......... .......... ..........
 ins_myosc  .......... .........I VEQCC..... .......... ..........
yf69_haein  .......... .......... .......... .....YTFIR QGKMLVVKIL
yiaz_ecoli  .......... .......... .......... .......... ..........
pnd2_ecoli  .......... .......... .......... .......... ..........
pnd1_ecoli  .......... .......... .......... .......... ..........
vg38_bpml5  .......... .......... .......... .......... ..........
 ins_oncgo  .......... .........I VEQCC..... .......... ..........
lha6_rhoac  .......... .......... .......... .......... ..........
disi_erima  .......... .......... .......... .......... ..........
ins1_batsp  .......... .......... .......... .......... ..........
y476_haein  .......... .......... .......... .......... ..........
p100_human  .......... .......... .......... .......... ..........
yftr_metth  .......... .......... .......... .......... ..........
  atpe_rat  .......... .......... .......... .......... ..........
 ins_progu  .......... .......EGI VDQCC..... .......... ..........
orne_plaor  .......... .......... .......... .......... ..........
znfp_lycvp  .......... .......... .......... .......... ..........
 ins_katpe  .......... .......... .......... .......... ..........
 prion.prf  TTTTKGENFT ETBVKMMERV VEQMCITQYE KESQAYYBGQ RGSSMVLFSS

            351
prio_calmo  PPVILLISFL
yacg_ecoli  ..........
 por_gluox  ..........
 ins_gadca  ..........
ykq2_caeel  ..........
ykq5_caeel  ..........
yp55_ecoli  ..........
ins2_batsp  ..........
ycua_laccu  ..........
myx4_crodu  ..........
 ins_myosc  ..........
yf69_haein  PKEIFVLSF.
yiaz_ecoli  ..........
pnd2_ecoli  ..........
pnd1_ecoli  ..........
vg38_bpml5  ..........
 ins_oncgo  ..........
lha6_rhoac  ..........
disi_erima  ..........
ins1_batsp  ..........
y476_haein  ..........
p100_human  ..........
yftr_metth  ..........
  atpe_rat  ..........
 ins_progu  ..........
orne_plaor  ..........
znfp_lycvp  ..........
 ins_katpe  ..........
 prion.prf  PPVILLISFL
Notice the extreme 'gappiness' of the alignment due to the profile method used. In this case, I doubt whether it is a biologically meaningful alignment, but you should be able to see the power of the technique illustrated, had the sequences actually belonged together. This can be a very handy way to pregap sequences in order to introduce them into existing alignments.

8) Conclusions -- do you have any?

This has been a long one; I do apologize. However, database searching analysis is one of the most commonly misunderstood areas in computational molecular biology and bioinformatics today. There is a tremendous amount of confusion in this field and anything that can be done to try and clear up some of the mess is entirely worthwhile. One point that still needs to be made is that the previous techniques were performed largely using GCG's suggested defaults. This usually will work for you, but it is a good idea to think about what these default values imply and adjust them accordingly, especially if the results seem inappropriate after running through a first pass with the default parameters intact.

The files and the renaming conventions for each that you are to rcp to the teacher account on ribozyme follow (files are copied from $GRAD_DIR/week8 and copied into teacher@ribozyme:receive just as previously):

% cp $GRAD_DIR/week8s/* .

  1. Edit week8s.week8s, then rename it to (your last name).week8s
  2. Your wordsearch.figure file renamed to (your last name).wordfigure
  3. The three dotplot figure files renamed to (your last name).dotplot1
    and (your last name).dotplot2
    and (your last name).dotplot3
  4. All three bestfit -randomizations .pair files renamed to (your lastname).pair1
    and (your last name).pair2
    and (your last name).pair3
  5. The ProfileSegments output renamed to (your lastname).profilepairs

% rcp lastname.* teacher@ribozyme:receive

This concludes your computing session for the week. Log off of ribozyme. Get out of the emulator and back to the Desktop. Do not turn off the computer.

Next week we will enter the controversial world of protein secondary structure prediction. This is where computer analysis "goes way out on a limb" and becomes quite subjective. Realize that most prediction schemes fair 60-75% at best and work better on some proteins than others. However, valuable insights can often be gained -- the techniques are worth learning and like all computer methods, the better you understand them, the more of them that you run, and the more you practice, the better they can work for you!

References

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) Basic Local Alignment Tool. Journal of Molecular Biology 215, 403-410.

Bairoch A. (1992) PROSITE: A Dictionary of Sites and Patterns in Proteins. Nucleic Acids Research 20, 2013-2018.

Gribskov, M. and Devereux, J., editors (1992) Sequence Analysis Primer. W.H. Freeman and Company, New York,. NY

Henikoff, S. and Henikoff, J.G. (1992) Amino Acid Substitution Matrices from Protein Blocks. Proceedings of the National Academy of Sciences USA 89, 10915-10919.

Needleman, S.B. and Wunsch, C.D. (1970) A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular Biology 48, 443-453.

Pearson, P., Francomano, C., Foster, P., Bocchini, C., Li, P., and McKusick, V. (1994) The Status of Online Mendelian Inheritance in Man (OMIM) medio 1994. Nucleic Acids Research 22, 3470-3473.

Pearson, W.R. and Lipman, D.J. (1988) Improved Tools for Biological Sequence Analysis. Proceedings of the National Academy of Sciences USA 85, 2444-2448.

Schwartz, R.M. and Dayhoff, M.O. (1979) Matrices for Detecting Distant Relationships. In Atlas of Protein Sequences and Structure, (M.O. Dayhoff editor) 5, Suppl. 3, 353-358, National Biomedical Research Foundation, Washington D.C.

Smith, T.F. and Waterman, M.S. (1981) Comparison of Bio-Sequences. Advances in Applied Mathematics 2, 482-489.

Wilbur, W.J. and Lipman, D.J. (1983) Rapid Similarity Searches of Nucleic Acid and Protein Data Banks. Proceedings of the National Academy of Sciences USA 80, 726-730.