'96 BC/BP 378

Week 10

The dynamic programming algorithm, symbol comparison tables, dot-matrix analysis, hashing techniques, and heuristics: the methods and programs for pairwise sequence comparison and database similarity searching.

Author:

Steven M. Thompson

Introduction

Given the nucleotide or amino acid sequence of a biological molecule, what can we know about that molecule? As we've seen in earlier exercises, we can find biologically relevant information in it by searching for particular patterns that reflect some function of the molecule. These can be catalogued motifs, secondary structure predictions and physical attributes such as hydrophobicity, or even the content of the DNA itself as in some of the gene finding techniques. But what about comparisons with other sequences? Can we learn about one molecule by comparing it to another? Yes, naturally we can; inference through homology is fundamental to all the biological sciences.

By comparing the conserved portions of sequence amongst a set, all of the sensitivity and power of the computational techniques is magnified. The basic assumption is that those portions of sequence of crucial functional value are most constrained against evolutionary change. They will not tolerate many mutations. Not that mutations don't happen in these portions, just that most mutations in the region are lethal so we never see them. Other areas of sequence are able to drift more readily and are subject to less evolutionary pressure. Therefore, the sequence ends up a mosaic of quickly and slowly changing regions over evolutionary time. However, in order to learn anything by comparing sequences, we need to know how to compare them. We can use those constrained portions as 'anchors' to create a sequence alignment so that we can compare them. But this brings up the alignment problem and 'similarity.' It is easy to see that two sequences are aligned when they have identical symbols at identical positions, but what happens when symbols are not identical or the sequences are not the same length? How can we know that the most similar portions of our sequences are aligned, when is an alignment optimal, and does optimal mean biologically correct? Part of the solution to this problem is known as the dynamic programming algorithm.

The dynamic programming examples in the suggested text (Gribskov and Devereux, 1992) are excellent. Please carefully read pages 125 through 134 to see how David States and Mark Boguski explain the algorithm. In its simplest implementation we will consider matching symbols to be worth one point and will not consider gapping at all. The solution occurs in two stages. The first begins very much like the dot matrix methods described further below; the second is totally different. I will use the same example as the text with one simplification. Instead of calculating the "score matrix" on the fly as we proceed through the graph, I like to completely fill in an original "match matrix" then add points to those positions which produce favorable alignments. Follow the example below in concert with your text example.

a) A completed match matrix using one point for matching and zero points for mismatching:

      A    A    T    G   C
A     1    1    0    0    0      
G     0    0    0    1    0      
G     0    0    0    1    0      
C     0    0    0    0    1      

b) Now begin to add points based on the best path through the matrix, always working diagonally, left to right and top to bottom. The second row is completed here:

      A    A    T    G   C
A     1    1    0    0    0      
G     0    0+1  0+1  1+1  0+1    
G     0    0    0    1    0      
C     0    0    0    0    1      

c) Continue adding points based on the best previous path through the matrix. The third row is completed here:

      A    A    T    G   C
A     1    1    0    0    0      
G     0    1    1    2    1      
G     0    0+1  0+1  1+1  0+2    
C     0    0    0    0    1      

d) The score matrix is now complete:

      A    A    T    G   C
A     1    1    0    0    0      
G     0    1    1    2    1      
G     0    1    1    2    2      
C     0    0+1  0+1  0+1  1+2    

e) Now pick the bottom, right-most, highest scores in the matrix and work your way back through it. This is called the traceback stage and the matrix is now referred to as the path graph. In this case that highest score is in the right-hand corner, but it need not be:

      A    A    T    G   C
A     1    1    0    0    0      
G     0    1    1    2    1      
G     0    1    1    2    2      
C     0    1    1    1    3      

f) The completed traceback is shown with bold characters; these are all optimal alignments:

      A    A    T    G   C
A     1    1    0    0    0      
G     0    1    1    2    1      
G     0    1    1    2    2      
C     0    1    1    1    3      

The following alignments are all generated from the above path graph (f). All five have three matches. Gap penalties would eliminate the last two of them; however, that still leaves three:

	AG.GC	A.GGC	.AGGC	A..GGC	.A.GGC
	|  ||	|  ||	 | ||	|  | |	 | | |
	AATGC	AATGC	AATGC	AATG.C	AATG.C

The software will arbitrarily choose only one of these to report as the optimal. This decision can be partly controlled in the GCG programs BestFit and Gap with the HighRoad/LowRoad options. This brings up an important point. Just because dynamic programming is guaranteed to find an optimal alignment, it is not necessarily the only optimal alignment. Furthermore, the optimal alignment is not necessarily the 'right' or biologically relevant alignment. As always, question the results of any computerized solution based on what you know about the biology of the system. The above example illustrates the Needleman and Wunsch (1970) global solution. Later refinements (Smith and Waterman, 1981) demonstrated how dynamic programming can also be used to find optimal local alignments.

A further complication occurs in protein sequences. Certain amino acids are very much alike, structurally, chemically and genetically. How can we take advantage of the similarity of amino acids in our alignments? People have been struggling with this problem since the late 1960's. Margaret Dayhoff (Schwartz and Dayhoff, 1979) unambiguously aligned closely related datasets (no more than 15% difference) available at that point in time and noticed that certain residues, if they mutate at all, are prone to change into certain other residues. As it works out, these propensities for change fell into the same categories that chemists had known for years -- those same chemical and structural classes mentioned above, conserved through the evolutionary contraints of natural selection. However, Dayhoff's empirical observation quantified these changes. Based on the alignments that she created, the assumption that estimated mutation rates in closely related proteins can be extrapolated to more distant relationships, and fancy matrix and logarithmic mathematics that smooth out the statistics of the system, she was able to specify the probabilities at which different residues mutated into other residues through evolutionary history. This is the basis of the famous PAM (corrupted acronym of accepted point mutation) 250 (meaning that the matrix has been multiplied by itself 250 times) table. Since Dayhoff's time other biomathematicians (e.g. see Henikoff and Henikoff's [1992] BLOSUM 62 table) have created newer tables with more or less success than Dayhoff's original but the concept remains the same and Dayhoff's original PAM 250 table remains the most widely used one. Collectively these types of tables are known as symbol comparison tables or scoring matrices and they are fundamental to all sequence comparison techniques.

Another powerful method that should always be considered in similarity analysis is the dot matrix procedure. In dot matrix analysis one sequence is plotted on the vertical axis against another on the horizontal axis using a very simple approach; wherever they match according to some scoring system that you specify, a dot is generated. Dot matrix analysis can point out areas of similarity between two sequences that all other methods might miss. This is because most other methods align either the overall sequences or just the 'best' parts of each to achieve one optimal alignment. Dot matrix methods enable the operator to visualize the entirety of both sequences; if you will, it allows the 'Gestalt' of the alignment to be seen. However, their interpretation is entirely up to the user -- you must know what the plots mean and how to successfully filter out extraneous background noise when running them. Using this method correctly you can identify areas within sequences that happen to have significant matches that no other method would ever notice.

Once these problems are understood we can screen the database to look for sequences to compare ours to. When a sequence is found to fall into a preexisting group we can infer function, mechanism, evolution, and possibly even structure based on homology with its neighbors. Most database searching programs use several elements of the concepts discussed above; however, they use tricks to make things happen faster. These tricks fall into two main categories, that of hashing and that of approximation. Hashing is the process of breaking your query sequence into small "words" or "ktuples" of a set size and creating a "look-up" table with those words keyed to numbers. Then when any of the words match part of an entry in the database, that match is saved. In general, hashing reduces the complexity of the search problem from N2 for dynamic programming to N, the length of all the sequences in the database. Approximation techniques are collectively kwon as "heuristics." Webster's defines heuristic as "serving to guide, discover, or reveal; . . . but unproved or incapable of proof." In database searching techniques the heuristic usually restricts the necessary search space by calculating some sort of a statistic that allows the program to decide whether further scrutiny of a particular match should be pursued. This statistic may miss things depending on the parameters set -- that's what makes it heuristic.

Database searching is far more sensitive at the amino acid level than at the DNA level because proteins have twenty match criteria versus DNA's four. This drastically cuts down the 'noise' level of the search. Therefore, whenever dealing with coding sequence, it is always prudent to search at the protein level. Even though protein searching is more sensitive, the DNA databases are much larger. This drawback has been partly overcome with programs which take a protein query and compare it to translated nucleotide databases, but one still needs to know if the translation is 'real.' So, even though there are advantages and disadvantages to both types of searching, the general rule is to query with a peptide sequence, if at all possible, and screen whichever database you choose.

One small database that should always be screened when working with protein sequence is PROSITES. The GCG program Motifs performs this search. Motifs searches for recognized structural, regulatory and enzymatic consensus sequences in the PROSITE Dictionary of Protein Sites and Patterns (Bairoch, 1992). You ran Motifs back in Week 4 and so you will not be repeating it here but its importance needs to be mentioned again. Another small database that should not be ignored is NRL_3D. Remember, this database contains all the sequences from the three-dimensional coordinate database PDB and, thus, can serve as a link between structural and sequence based methods.

A big question and a very common mistake that is made in this whole area of searching and alignment is the concept of homology versus similarity: There is a huge difference! Similarity is merely a statistical parameter which describes how much two sequences, or portions of them, are alike by some scoring table. Homology, by definition, implies an evolutionary relationship -- more than just the fact that we've all evolved from the same old pond scum. There is really no such thing as percent homology; something is either homologous or it is not. You need to be able to demonstrate some type of direct lineage between the organisms or genes of interest in order to claim homology. Even better, be able to show some type of experimental evidence, morphological or genetic, that corroborates your assertion. Do not make the common mistake of calling any old sequence similarity homology.

How do you tell if a similarity is significant or is truly homologous? Many of the programs generate percent similarity scores, however these really don't mean a whole lot. The "quality" score means a lot more but it is hard to interpret. They are only relevant within the context of a particular comparison or search. At least they take the length of similarity, all of the necessary gaps introduced, and the matching of symbols all into account. To get a better handle on what these various scores mean, read the algorithm section of the GCG Program Manual for the various methods -- statistics are always confusing but the descriptions do help. A few of the programs can generate histograms of score distributions, but again, they can be confusing.

One of the best ways of deciding significance is to take advantage of the Randomizations option available in the two GCG dynamic programming algorithms BestFit and Gap. To utilize this strategy, compare two sequences using the appropriate program, either Gap or BestFit depending on whether you're trying to compare the entire length of each sequence or only the best regions of similarity to each, respectively, and specify the Randomizations option on the command line. This option jumbles the second sequence of the comparison a set number of times and then generates scores and a standard deviation based on the jumbled matches. You can then compare the random scores to the unjumbled score using a "Z score" calculation to help decide significance. BLAST (Altschul, et al., 1990) uses a similar approach but bases its statistics on a comparison to all the rest, "insignificantly similar," members of the database being searched.

Many database searches work best when submitted as a batch process. This is because of the sizes of the databases of interest. Ribozyme currently has almost 750,000 sequences in its nucleotide databases. In spite of the fast hashing style algorithms incorporated, most programs can take quite a while to search through that much data. Most of the GCG database searches accept a Batch option which is really handy. An exception to the standard "submit the search and wait" style of most database searching is the BLAST program. This program uses an extremely fast heuristic statistical hashing algorithm on a parallel computer located at the National Center for Biotechnological Information. Typical searches run in just a minute of two; however, realize the limitations of this algorithm to finding only ungapped areas of similarity and the fact that it is optimized for protein comparisons only.

Exercise for week 10:

This week's exercise will explore methods for comparing and searching with sequences. First the databases will be searched for interesting comparisons and then methods for analyzing those comparisons will be learned. The two protein molecules used in this week's exercise are both responsible for debilitating disease and yet both are encoded by the organism's own DNA and both genes are expressed in both normal and afflicted cells. In both cases large amounts of proteinaceous plaques aggregate and are deposited in the brains of afflicted animals.

I will illustrate the use of all the programs by running through them first with the human prion protein as an example. You should not repeat the prion analyses in your own accounts; you will use the other protein sequence in your directory to run the analyses yourself. The prion protein has an unknown natural function but is found in high quantities in the brain of animals infected with the degenerative neurological diseases scrapie and Bovine Spongiform Encephalopathy, in wild stock, and kuru, Creutzfeldt-Jacob Disease, or Gerstmann-Straussler Syndrome in humans. It is also involved in Fatal Familial Insomnia and recently gained notoriety as the harbinger of "Mad-Cow Disease." In human beings the gene maps to position 20p12-pter and the disease can be inherited in an autosomal dominant fashion. Seventeen pathologic allelic variants are listed in the current OMIM (1995). One of the most peculiar aspects of the prion is no infective nucleotide entity has ever been found, yet the protein particle itself is highly infectious. Somehow the infectious protein particle induces a posttranslational, pathological change in the host's normal protein to convert it to the aberrant isoform. The primary amino acid sequence is not changed, only the structural conformation of the protein is different.

The other protein in this exercise is responsible for Alzheimer's Disease. It is a small portion of a nexin II type protease present in normal organisms. The intact protein is a neuronal receptor which couples to intracellular signaling pathways through the GTP-binding protein G(0). Some isoforms also have clathrin-binding and BPTI/Kunitz type protease inhibitor domains. However, for unknown reasons in individuals with Alzheimer's and in some older Down's syndrome cases, aberrant catabolism of the precursor creates the small insoluble, pathological protein known as the beta-amyloid protein. This small protein product comes from part of the transmembrane and extracellular domains of the intact protein. Several different mutations have been identified in the gene of afflicted individuals, however, this is only the case in a small fraction of all Alzheimer's patients. It is a very complex system as six different alternatively spliced forms of the protein are naturally found, all coded by 19 exons located at map position 21q21.2 in human beings. In this week's exercise you will perform all of the same analyses with the beta-amyloid protein, as those that I demonstrate using the prion protein.

l) Activate the machine, connect to ribozyme and log into your account.

By this time you should know how to activate the machine you want to use, make connections with ribozyme, and log into your account. If you still need help with these functions, refer to the beginning of the exercises for weeks 2 and 3 for step by step instructions.

2) Move to this week's subdirectory and copy the necessary files to it.

% cd ten

Now move over all the files needed to do this week's exercise. They are located in the directory location $UGRAD_DIR/week10.

% cp $UGRAD_DIR/week10/* .

3) Run the demo that describes this weeks activities.

This week's demo illustrates some of the concepts explored in the exercise. However, here you will see it used to compare a sequence to itself. It particularly attempts to show how to interpret the dot plots. The effect of stringency and window sizes on these plots will be illustrated. This technique can easily be used to help visualize internal repeats in structures. The prion protein used has a distinct repeat region identified in PROSITES that will be illustrated in the demo. Type the following command to launch this week's demo:

% demo10

Be sure to activate gcg and use setplot to designate your graphics configuration manually before attempting to perform any sequence analysis functions in the exercise.

4) Database searching programs: heuristics and "hashing" algorithms.

4.a. Traditional database searches.

Two different hashing style symbol matching algorithms have traditionally been utilized in database searching. These two algorithms are incorporated into GCG's FastA (Pearson and Lipman, 1988) and WordSearch (Wilbur and Lipman, 1983) programs. Since the algorithms differ, the output results may also differ. As in all types of computerized molecular biology analyses, it is best to run as many strategies as practical and try to interpret the results in light of their combined output. Most of the following programs accept GCG's Batch option; pay attention to the necessity of running many of these programs in the background.

4.a.1. TFastA.

TFastA takes advantage of the sensitivity of a protein query and the size of the nucleic acid database. It is the most robust of the single sequence searching programs onsite. It compares a peptide sequence against all six translation frames of each sequence in the DNA database. This way you can take advantage of the multitude of DNA sequences that never make it to the protein databases and yet still retain the sensitivity of protein searches.

Type tfasta -check -batch to start the Translation FastA program in batch mode and see all of the available options. At the command option prompt add -noalign to suppress all the alignments since most are redundant and we will be exploring the more interesting ones later on anyway. Some of the other options can be very helpful depending on your specific situation. -optall in particular is very useful; take advantage of it also. It takes longer to run but sorts your output list based on the optimum score, the result of the final dynamic programming pass in FastA, not the initn score, the longest combined word score. Specify the peptide's sequence name at the program's sequence prompt. Accept the defaults up to list size=, and specify 100. The prion example session follows below (Remember you are to use the beta-amyloid protein, not the prion molecule in your runs!):

% tfasta -check -batch

TFastA does a Pearson and Lipman search for similarity between a
query peptide sequence and any group of nucleotide sequences. TFastA
translates the nucleotide sequences in all six reading frames before
performing the comparison.  It is designed to answer the question, "What
implied peptide sequences in a nucleotide sequence database are similar
to my peptide sequence?"

Minimal Syntax: % TFASTA [-INfile1=]GGamma.Pep -Default

Prompted Parameters:

[-INfile2=]GenEMBL:*           search set (all of GenEMBL)
[-OUTfile=]GGammaCod.TFasta    output file name
-BEGin=1 -END=148              range of interest
-WORdsize=2                    word size
-LIStsize=40                   number of scores and alignments to show

Local Data Files:

-DATa=FastAPep.Cmp             scoring matrix for peptides

Optional Parameters:

 Press q to quit or <Return> for more:
-GAPweight=12.0    gap creation penalty
-LENgthweight=4.0  gap extension penalty
-SINce=6.90        limits search to sequences dated on or after June 1990
-THREEFrames       translates and searches only the three forward reading
                     frames
-FRAme=1           translates and searches only the frame specified.
-NOPAMfactor       uses a constant factor to calculate initial diagonal scores
-OPTall[=20]       immediately computes opt score if initn above threshold
-SCAle             scales scores by ln(n0) divided by ln(n1)
-SHOWall           shows complete sequences in alignment, not just overlaps
-MARKx=3           determines the alignment display mode
-NOALIGN           suppresses alignments
-NOHIStogram       suppresses printing the histogram
-LINEsize=60       number of sequence symbols per line of the alignment
-NODOCLines        suppresses sequence documentation in the alignment
-NOMONitor         suppresses the screen trace for each search set sequence 
-NOINCrease        suppresses increase to LIStsize when not interactive
-BATch             submits the program to run in the batch queue

 Add what to the command line ?  -noalign -opt

 TFASTA with what query sequence ?  humprp.pep

                  Begin (* 1 *) ?  <rtn>
                End (*   253 *) ?  <rtn>

 Search for query in what sequence(s) (* GenEMBL:* *) ? <rtn>

 What word size (* 2 *) ?  <rtn>

 List how many best scores (* 40 *) ?  100

 What should I call the output file (* humprp.tfasta *) ? <rtn>

 ** TFASTA will run as a batch or at job.

 ** TFASTA was submitted using the command:
    "  SUBMIT-NOPRINT-NOTIFY-QUEUE  "

Job TFASTA_152926 (queue RIBOZYME, entry 259) started on RIBOZYME

The job will submit itself to the batch queue and you can proceed with the rest of the exercise.

4.a.2. Regular FastA.

Now run the normal FastA program; search the SwissProt:* logical sequence specification. Start the program by typing fasta -batch -noalign -opt (the options are the same as in TFastA) to run the program in batch mode and again suppress alignments and sort by optimum score. Specify the sequence's filename.ext and accept the default search set, the SwissProt database. Accept the rest of the defaults, however, use a list size of 100 again. The prion example is illustrated below:

% fasta -batch -noalign -opt

FastA does a Pearson and Lipman search for similarity between a query
sequence and any group of sequences.  For nucleotide database searches,
FastA is more sensitive than BLAST.

 FASTA with what query sequence ?  humprp.pep

                  Begin (* 1 *) ?  <rtn>
                End (*   253 *) ?  <rtn>

 Search for query in what sequence(s) (* SwissProt:* *) ? <rtn>

 What word size (* 2 *) ?  <rtn>

 List how many best scores (* 40 *) ?  100

 What should I call the output file (* Humprp.fasta *) ? <rtn>

 ** fasta will run as a batch or at job.

 ** fasta was submitted using the command:
    "  SUBMIT-NOPRINT-NOTIFY-QUEUE  "

Job FASTA_154303 (entry 263) pending

Again the Batch option automatically submits the program to the batch queue and you should proceed with the exercise.

4.a.3. WordSearch.

The last traditional database search we will submit is called WordSearch. Unlike FastA, WordSearch can find more than one region of similarity between two sequences. Just like the previous two, it will accept automatic batch submission. Type wordsearch -batch to begin the program; additionally, take advantage of this program's graphics capability by specifying -plot on the command line. Give the program the sequence's name and accept the logical SwissProt:* as the sequences to be searched. Again, specify a list size of 100. An option this program will accept, but that we will not be using this time, is Simplify. This option simplifies the input sequences according to a scheme which generalizes amino acids into broad classes based on their chemical characteristics. Another helpful option available is Mask. This option allows you to search with only certain characters of your sequence such as only the first and second codon positions in coding DNA. The example session with the prion protein follows below:

% wordsearch -batch -plot

WordSearch identifies sequences similar to a query sequence using a
Wilbur and Lipman search.  WordSearch answers the question,
"What sequences in the database are similar to my sequence?"  The output
is a list of significant diagonals whose alignments can be displayed
with Segments.

  Process set to plot with TEK4107 attached to TERM:
  using the TEKTRONIX graphic interface.

 WORDSEARCH with what query sequence ?  humprp.pep

                  Begin (* 1 *) ?  <rtn>
                End (*   253 *) ?  <rtn>

 Search for query in what sequence(s) (* SwissProt:* *) ? <rtn>

 What word size (* 2 *) ?  <rtn>

 List how many best diagonals (* 50 *) ?  100

 Integrate how many adjacent diagonals (* 3 *) ?  <rtn>

 What should I call the output file (* Humprp.word *) ? <rtn>

 ** wordsearch will run as a batch or at job.

 ** wordsearch was submitted using the command:
    "  SUBMIT-NOPRINT-NOTIFY-QUEUE  "

Job WORDSEARCH_154655 (entry 265) pending

4.b. BLAST; heuristic Internet similarity searching.

The BLAST server at NCBI can provide the most up to date and quickest database search available. BLAST is a heuristic algorithm for searching sequence databases developed by the National Center for Biotechnology Information at the National Library of Medicine. The acronym stands for Basic Local Alignment Search Tool. BLAST by default runs on an eight processor parallel computer system at NCBI. It only looks for ungapped segments; however, it ranks matches statistically and provides probability values for each to help evaluate significance. As such, it is best for identifying shorter regions of high similarity -- exactly what you might want with a sequence of unknown function. It is very fast, at least an entire order of magnitude over traditional sequence similarity database searching, yet maintains all of the sensitivity of the old methods for local similarity regions with protein sequences! You'll get your reply in a manner of minutes in most cases. Another advantage with BLAST is it not only shows you the best alignment for each similar sequence found (as in the BestFit type alignments of FastA) but also shows the next best alignments for each up to a certain preset cutoff point. This combines the power of dot-matrix type analyses and the interpretative ease of traditional sequence alignments. One can fine-tune BLAST by altering its operating parameters and by taking advantage of the many options available in it; however, BLAST is really not appropriate for comparing nucleotide sequences against the nucleotide database. When you are forced to perform nucleotide-to-nucleotide searches, it is usually best to use FastA instead.

BLAST accesses the latest database (NCBI updates GenBank every night) by default, nucleotide or protein. Also by default, the GCG implementation of BLAST runs in a remote client-server mode such that NCBI's database and computer perform the analysis; alternatively you can run it locally if you have custom assembled databases that you wish to screen.

For help in interpreting BLAST results refer to the GCG BLAST documentation, the BLAST HELP file obtained by sending the single word "HELP" to BLAST@ncbi.nlm.nih.gov (leave the subject line blank), and the text book for the course. Run blast -check on the prion protein. Specify the -filter=xs option. This is very powerful for screening out low complexity and repeat sequences from your query to minimize confusion due to random noise. The prion protein is very rich in these types of confounding sequence and so is a very appropriate example for showing this option. Also specify -segments=0 to suppress the alignment section and reduce the size of the output file. The standard output file is very long because BLAST automatically aligns the best 100 matches. Accept the default database choice, 1) nr p which stands for "NonRedundant Protein," to search all of the unique sequences from the most current versions of all of the protein databases. Set the listsize parameter to 100, as in the previous searches. A screen trace of a BLAST session with the prion protein follows:

% blast -check

BLAST searches a specified database to find sequences that are similar to a
query sequence.  BLAST either can search databases on your own computer or it 
can search the databases maintained at the National Center for Biotechnology
Information (NCBI) in Bethesda, Maryland, USA.

Minimal Syntax: % BLAST [-INfile1=]SW:Zea2_Maize  -Default

Prompted Parameters:

[-INfile2=]swissprot     database to search
-EXPect=10.0             ignores scores that would occur by chance more than 10 times
-LIStsize=250            maximum number of sequences listed in the output
[-OUTfile=]Zea2.Blastp   output file name

Local Data Files:

[-DATa1=BLAST.rDBs]      list of available remote databases
[-DATa2=BLAST.lDBs]      list of available local databases
[-DATa3=BLAST.sDBs]      list of available site-specific databases

Optional Parameters:

 Press q to quit or <Return> for more:  <rtn>
-ONEstrand                   only translates the top strand
-FILter=xs                   filters low complexity segments out of query
-MATrix=PAM120               uses alternative protein scoring matrix
-TRANSlate=0                 selects among alternative genetic codes
-REMoteonly                  searches only remote databases (simplifies menu)
-LOCalonly                   searches only local databases   (   "   )
-DBNucleotideonly            searches only nucleic databases (   "   )
-DBProteinonly               searches only protein databases (   "   )
-SERver=cruncher.nlm.nih.gov runs remote search on this particular host
-CUToffscore=56              score below which MSPs are ignored
-WORdsize=3                  word size
-SYNonym=10                  threshold for words to be treated as equivalent

-SEGments=100                the number of segment pairs displayed in output
-APPend="string"             appends "string" to pass-through command line
-BATch                       submits program to batch queue

 Add what to the command line ?  -filter=xs -segments=0

 BLAST search with what query sequence?  humprp.pep

 Search for query in what sequence database:

   1) nr          p Non-redundant updated protein database PDB+SwissProt+PIR
   2)   pdb       p Brookhaven Protein Data Bank, current release
   3)   swissprot p SWISS-PROT current release
   4)   pir       p PIR, current release
   5)   spupdate  p SWISS-PROT cumulative weekly update
   6)   genpept   p translations of coding sequences in GenBank
   7)   gpupdate  p cumulative daily updates of GenPept
   8) kabatpro    p Kabat Proteins of immunological interest
   9) tfd         p TFD transcription factor
  10) palu        p Six-frame translations of representative human ALU repeats
  11) nr          n Non-redundant updated nucleotide database PDB+GenBank+EMBL
  12)   pdb       n Brookhaven Protein Data Bank
  13)   genbank   n GenBank, current release
  14)   gbupdate  n GenBank cumulative daily update
  15)   embl      n EMBL current release
  16)   emblu     n EMBL cumulative daily update
  17) vector      n Vector subset of GenBank
  18) repbase     n Primate Alu repeats
  19) kabatnuc    n Kabat nucleotide sequences of immunological interest
  20) epd         n Eukaryotic Promoter Database
  21) dbest       n Expressed sequence tags
 Press q to quit or <Return> for more:  <rtn>

 Please choose one (* 1 *):  <rtn>

 Ignore hits expected to occur by chance more than (* 10.0 *) times? <rtn> 

Limit the number of sequences in my output to (* 250 *) ? 100 What should I call the output file (* humprp.blastp *) ? <rtn> Trying cruncher.nlm.nih.gov (130.14.25.175) Connected to cruncher.nlm.nih.gov Search in progress on the network server. ............................ Retrieving results. ... WARNING: HSPs involving 69 database sequences were not reported due to the limiting value of parameter B = 0. . Done!

The prion BLAST search reply follows:

% more humprp.blastp

National Center for Biotechnology Information (NCBI)

Experimental GENINFO(R) BLAST Network Service (Cruncher)

Tue Sep 26 19:21:59 EDT 1995, Up 18 days, 10:27, load: 19.50, 24.50, 27.76

  If results of this search are reported or published, please mention that
  the computation was performed at the NCBI using the BLAST network service.
  Problems with the service should be reported to a local system administrator. 

PEPTIDE SEQUENCE DATABASES
 nr  Non-redundant PDB+SwissProt+PIR+SPUpdate+GenPept+GPUpdate, updated daily
     for efficient, complete searches of the five component databases:
   pdb        Brookhaven Protein Data Bank, April 1994 Release
   swissprot  SWISS-PROT Release 31.0, March 1995
   pir        PIR Release 45.0 (complete), June 30, 1995
   spupdate   SWISS-PROT cumulative weekly update to the major release
   genpept    CDS translations from GenBank(R) Release 90, August 15, 1995
   gpupdate   cumulative daily updates to the major release of genpept
 kabatpro Kabat Sequences of Proteins of Immunological Interest, June 1995
 tfd      TFD transcription factor (protein) database Release 7.0, June 1993
 alu *    Translations of select Alu repeats from REPBASE

NUCLEOTIDE SEQUENCE DATABASES
 nr    Non-redundant PDB+GBUpdate+GenBank+EmblUpdate+EMBL, updated daily
       for efficient, complete searches of the four component databases:
   pdb        Brookhaven Protein Data Bank, April 1994 Release
   genbank    GenBank(R) Release 90 (no daily updates), August 15, 1995
   gbupdate   GenBank(R) cumulative daily updates to the major release
   embl       EMBL Data Library, Release 43.0, June 1995
   emblu      EMBL Data Library cumulative daily updates to the major release
 vector     Vector subset of GenBank(R), February 3rd, 1995
 alu *+     Select Alu repeats from REPBASE
 kabatnuc Kabat Sequences of Nucleic Acid of Immunological Interest, June 1995
 epd      Eukaryotic Promoter Database Release 40, September 1994
 dbest +  Database of Expressed Sequence Tags (cumulative daily update)
 dbsts +  Database of Sequence Tagged Sites Release 1.5, October 26, 1994

  * Databases that are not accessible through the NCBI Retrieve E-mail server.
  + The TBLASTX program is restricted to searching these databases.
=============================================================================
 You can obtain the BLAST documentation files, send a message consisting of
 just the word ''help'' (without the quotes) to: blast@ncbi.nlm.nih.gov
 Last modification dates: August 10th 95 for the E-mail server help, January
 19th 94 for the BLAST manual and March 18th 95 for the BLAST FAQ.
=============================================================================
 For a free subscription to "NCBI News", the NCBI newsletter, send a request
 along with your name and postal mailing address to: info@ncbi.nlm.nih.gov
=============================================================================
 A new GenBank sequence submission tool, called BankIt, is now available
 through the NCBI's home page on the World Wide Web. The URL is
 http:--www.ncbi.nlm.nih.gov-
=============================================================================

BLASTP 1.4.8MP [20-June-1995] [Build 16:33:21 Sep  5 1995]

Reference:  Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers,
and David J. Lipman (1990).  Basic local alignment search tool.  J. Mol. Biol.  215:403-10. 

Query=  TITLE humprp.pep
        (253 letters)
>lcl|TITLE humprp.pep
MANLGCWMLVLFVATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXTHSQXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCV
NITIKQHXXXXXXXXXXXXXXDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSXXX
XXXXXXXXXXXVG

Database:  Non-redundant PDB+SwissProt+SPupdate+PIR+GenPept+GPupdate
           157,254 sequences; 46,130,256 total letters.
Searching..................................................done

                                                                    Smallest
                                                                      Sum
                                                             High  Probability
Sequences producing High-scoring Segment Pairs:             Score  P(N)      N 

pir|A40372|A40372     major prion protein PrP variant 1 -...  506  4.6e-106  2
sp|P40250|PRIO_CERAE  MAJOR PRION PROTEIN PRECURSOR (PRP)...  485  1.0e-102  2
sp|P40249|PRIO_CEBAP  MAJOR PRION PROTEIN PRECURSOR (PRP)...  488  1.4e-102  2
sp|P40258|PRIO_SAISC  MAJOR PRION PROTEIN PRECURSOR (PRP)...  479  2.5e-102  2
sp|P40247|PRIO_CALJA  MAJOR PRION PROTEIN PRECURSOR (PRP)...  482  5.0e-102  2
gp|U15164|APU15164_1  major prion protein precursor [Atel...  485  1.3e-101  2
gp|U15165|SSU15165_1  major prion protein precursor [Saim...  476  2.4e-100  2
pir|A05017|A05017     major prion PrP27-30 protein precur...  506  8.7e-100  2
gp|M81929|HUMPRPRA_1  prion protein [Homo sapiens]            506  4.8e-99   2
sp|P04156|PRIO_HUMAN  MAJOR PRION PROTEIN PRECURSOR (PRP)...  733  9.2e-99   1
sp|P40252|PRIO_GORGO  MAJOR PRION PROTEIN PRECURSOR (PRP)...  730  2.4e-98   1
sp|P40253|PRIO_PANTR  MAJOR PRION PROTEIN PRECURSOR (PRP)...  725  1.2e-97   1
sp|P40246|PRIO_ATEGE  MAJOR PRION PROTEIN PRECURSOR (PRP)...  485  7.9e-97   2
gp|U15166|GGU15166_1  major prion protein precursor [Gori...  719  8.4e-97   1
sp|P40251|PRIO_COLGU  MAJOR PRION PROTEIN PRECURSOR (PRP)...  713  5.9e-96   1
sp|P40256|PRIO_PONPY  MAJOR PRION PROTEIN PRECURSOR (PRP)...  712  8.1e-96   1
sp|P40254|PRIO_MACFA  MAJOR PRION PROTEIN PRECURSOR (PRP)...  710  1.5e-95   1
sp|P40245|PRIO_AOTTR  MAJOR PRION PROTEIN PRECURSOR (PRP)...  481  1.9e-95   2
sp|P40257|PRIO_PREFR  MAJOR PRION PROTEIN PRECURSOR (PRP)...  708  2.9e-95   1
sp|P04925|PRIO_MOUSE  MAJOR PRION PROTEIN PRECURSOR (PRP)...  424  8.0e-95   3
gp|M18071|MUSPRPB_1   Mouse (with long incubation period)...  423  1.1e-94   3
gp|M13685|MUSPRP_1    Mouse prion protein (PrP) mRNA, com...  420  2.9e-94   3
gp|M81930|HUMPRPRB_1  prion protein [Homo sapiens]            688  2.0e-92   1
pir|A23545|A23545     major prion PrP27-30 protein - hamster  409  3.4e-92   3
sp|Q01880|PRIP_BOVIN  MAJOR PRION PROTEIN 2 PRECURSOR (PR...  543  6.8e-90   2
sp|P40248|PRIO_CALMO  MAJOR PRION PROTEIN PRECURSOR (PRP)...  665  3.3e-89   1
sp|P40255|PRIO_MANSP  MAJOR PRION PROTEIN PRECURSOR (PRP)...  664  4.5e-89   1
pir|A54281|A54281     prion protein precursor - sheep >gp...  536  6.4e-89   2
sp|P10279|PRIO_BOVIN  MAJOR PRION PROTEIN 1 PRECURSOR (PR...  481  1.0e-88   3
sp|P40243|PRP2_TRAST  MAJOR PRION PROTEIN 2 PRECURSOR (PR...  540  1.2e-88   2
sp|P23907|PRIO_SHEEP  MAJOR PRION PROTEIN PRECURSOR (PRP)...  534  1.2e-88   2
pir|S37149|S37149     prion protein - goat >gp|X74758|CHP...  532  2.3e-88   2
gp|U25965|OHU25965_1  prion protein precursor [Odocoileus...  532  2.3e-88   2
gp|D10612|BOVPRP1_1   prion protein [Bos taurus]              477  3.6e-88   3
sp|P40242|PRIO_TRAST  MAJOR PRION PROTEIN 1 PRECURSOR (PR...  475  2.4e-87   3
pir|A34759|A34759     prion protein - Chinese hamster >gp...  614  5.5e-86   2
pir|B34759|B34759     prion protein - golden hamster >gp|...  614  5.5e-86   2
sp|P40244|PRIO_MUSVI  MAJOR PRION PROTEIN PRECURSOR (PRP)...  456  8.5e-86   3
gp|U08952|MPU08952_1  prion protein [Mustela putorius]        446  2.1e-84   3
gp|L07623|PIGPRPG_1   prion protein [Sus scrofa]              449  2.9e-84   3
sp|P04273|PRIO_MESAU  MAJOR PRION PROTEIN PRECURSOR (PRP)...  598  9.4e-84   2
gp|M37381|HAMPRP_1    Hamster scrapie prion (Prp 27-30) m...  560  2.0e-78   2
pir|A03133|UJHYIH     major prion PrP27-30 protein - gold...  547  1.3e-76   2
gp|S69654|S69654_1    prion protein [Unknown.]                544  3.2e-76   2
sp|P13852|PRIO_RAT    MAJOR PRION PROTEIN (PRP) (FRAGMENT...  491  6.4e-75   3
gp|L38993|TIOPRP_1    prion protein [Trichosurus vulpecular]  400  1.3e-72   3
gp|K02234|HAMPRQ_2    PrP 27-30 gene product [Mesocricetu...  408  5.7e-53   1
gp|M30384|MUSPRPC_1   Mouse prion protein (PrP27-30) mRNA...  188  8.0e-21   1
pir|A22315|A22315     Scrapie prion - mouse (fragment)        187  1.2e-20   1
sp|P27177|PRIO_CHICK  MAJOR PRION PROTEIN HOMOLOG PRECURS...  125  3.1e-17   3
pir|A41280|UJCH       major prion protein homolog precurs...  120  2.6e-16   3
pir|A37372|A37372     prion protein homolog precursor - c...  120  2.6e-16   3
pir|S02520|S02520     Major prion PrP, scrapie, protein -...  149  2.0e-14   1
pir|S14078|S14078     amyloid protein, 11K - human            115  9.1e-10   1
gp|X79912|OAPRP3_1    PrP gene product [Ovis sp.]              68  0.018     1
gp|X55691|LEEXTEN11_1 glycine-rich protein [Lycopersicon ...   64  0.52      1
pir|S14978|S14978     glycine-rich protein (clone wN) - t...   64  0.57      1
gp|X55689|LEEXTEN9_1  glycine-rich protein [Lycopersicon ...   64  0.57      1
pir|S14977|S14977     glycine-rich protein (clone wM) - t...   64  0.59      1
pir|A36019|A36019     Scrapie prion - golden hamster (fra...   58  0.63      1
sp|P17816|GRP_HORVU   GLYCINE-RICH CELL WALL STRUCTURAL P...   64  0.65      1
sp|P21750|SALA_DROME  PROTEIN SPALT-ACCESSORY. >pir|A3391...   47  0.74      2
sp|Q01157|GRW1_LYCES  GLYCINE-RICH CELL WALL STRUCTURAL P...   55  0.78      1
gp|X55690|LEEXTEN10_1 glycine-rich protein [Lycopersicon ...   52  0.87      1
gp|Z48625|HVGRP2_1    glycine rich protein [Hordeum vulgare]   59  0.90      1
pir|A31994|A31994     keratin 10, type I, epidermal - hum...   42  0.97      3
sp|P21749|SALA_DROSI  PROTEIN SPALT-ACCESSORY. >pir|B3391...   43  0.992     2
sp|P10569|MYSC_ACACA  MYOSIN IC HEAVY CHAIN. >pir|A33891|...   60  0.993     1
sp|P20805|GLN2_FRASC  GLUTAMINE SYNTHETASE II (EC 6.3.1.2...   44  0.9995    2


WARNING:  HSPs involving 69 database sequences were not reported due to the
          limiting value of parameter B = 0.

Parameters:
  V=100
  B=0
  -filter=xnu+seg
  -echofilter

  -ctxfactor=1.00
  E=10

  Query                        -----  As Used  -----    -----  Computed  ----
  Frame  MatID Matrix name     Lambda    K       H      Lambda    K       H
   +0      0   BLOSUM62        0.323   0.137   0.437    same    same    same

  Query
  Frame  MatID  Length  Eff.Length   E    S W   T  X     E2  S2
   +0      0      253       150      10. 59 3  11 22    0.22 31

Statistics:
  Query          Expected         Observed           HSPs       HSPs
Frame  MatID  High Score       High Score       Reportable    Reported
 +0      0    62 (28.9 bits)  506 (235.6 bits)      244          7

  Query         Neighborhd  Word      Excluded    Failed   Successful  Overlaps
  Frame  MatID   Words      Hits        Hits    Extensions Extensions  Excluded
   +0      0      4844    15914728     3944497    11936634    33590       542

  Database:  Non-redundant PDB+SwissProt+SPupdate+PIR+GenPept+GPupdate
    Release date:  6:02 AM EDT Sep 26, 1995
    Posted date:  6:03 AM EDT Sep 26, 1995
  # of letters in database:  46,130,256
  # of sequences in database:  157,254
  # of database sequences satisfying E:  69
  No. of states in DFA:  558 (55 KB)
  Total size of DFA:  107 KB (128 KB)
  Time to generate neighborhood:  0.03u 0.01s 0.04t  Real: 00:00:00
  No. of processors used:  8
  Time to search database:  88.03u 3.25s 91.28t  Real: 00:00:15
  Total cpu time:  88.15u 3.42s 91.57t  Real: 00:00:16

WARNINGS ISSUED:  1
<END

Some very important features of BLAST's output need to be pointed out. First notice that the Filter=xs switch causes the troublesome portions of the query sequence to be ignored in the search. Pay attention to the Poisson distribution scores. These are the likelihoods that the observed matches could be due to chance. Therefore, the smaller the number, the more significant. You should be able to see a fairly clear demarcation where the scores quickly drop off between the significant hits and the background noise. Another important point to notice is, unlike all other GCG programs, the list generated by this search is not appropriate as input to other GCG analyses. This output list is actually generated by NCBI's computer which doesn't know about GCG format requirements. Therefore, if you want to use the results of a BLAST search in another GCG program you must manually edit the list changing the database names to reflect the logicals that GCG understands. For example, change sp|P04145|PRIO_HUMAN to SW:P04145 PRIO_HUMAN (either insert a blank space or ! between the access code and sequence name). The "gp" designation is GenPept, all the translations from GenBank's CDS references. GenPept is not installed on VADMS systems. Therefore, if you want to include any in subsequent analyses, you must individually translate them from GenBank. For instance, gp|U15164|APU15164_1 is the first CDS translation from GenBank entry name APU15164. This does make life a bit more complicated but can be worked around.

Now that you've read over and seen how database searching works with the prion protein, go back and do all four searches in exactly the same manner with the beta-amyloid protein. Your first session for the week is complete after all of your searches have been submitted. If you are done early, you may wish to work on this week's problem set. By the time you return for your second session of the week all batch jobs should be finished and you can continue with the rest of the exercise.

5) What Next? Comparisons, interpretations, and further analyses.

When the output from the previous "traditional" batch jobs is available, take a look at the files to check out the results of all your searches with the beta-amyloid protein. You'll need these results in order to complete the report form for this exercise. Remember, all output from this lab will be located in your week 10 directory. Also, all GCG batch jobs send log files to your mail system -- you'll want to delete these files if the batch jobs finished correctly. You may want to print out your search results using the lpr command for reference in the remainder of the exercise although it is not necessary. The abridged prion example output files follow the more commands below. Naturally, the topmost "hits" are prion proteins; it's the ones just below the expected hits that may prove interesting.

% more humprp.tfasta

(Peptide) TFASTA of: humprp.pep  from: 1 to: 253  September 27, 1995 07:29

P1;PRIO_HUMAN - MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C)
ID   PRIO_HUMAN     STANDARD;      PRT;   253 AA.
AC   P04156;
DT   01-NOV-1986 (REL. 03, CREATED)
DT   01-NOV-1986 (REL. 03, LAST SEQUENCE UPDATE)
DT   01-JUN-1994 (REL. 29, LAST ANNOTATION UPDATE) . . .


 TO: GenEMBL:*  Sequences:    197,612  Symbols: 235,990,291  Word Size: 2
 Scoring matrix: GenRunData:Fastapep.Cmp
 Variable pamfactor used
 Gap creation penalty: 12.0      Gap extension penalty: 4.0


Score Init1  Initn
        (-)    (+)
<  2  10080  10080:==================================================
   4   1986   1986:==================================================
   6   2489   2489:==================================================
   8  10330  10330:==================================================
  10  15801  15801:==================================================
  12  41835  41835:==================================================
  14  23697  23697:==================================================
  16  61735  61735:==================================================
  18  88756  88756:==================================================
/////////////////////////////////////////////////////////////////////
  62    623   4253:==================================================
  64    436   3563:==================================================
  66    325   3123:==================================================
  68    257   2436:==================================================
  70    164   2116:==================================================
  72    116   1731:==================================================
  74    105   1383:==================================================
  76     76   1271:======================================++++++++++++
  78     47    933:========================++++++++++++++++++++++++++
  80     67    795:==================================++++++++++++++++
> 80    299    3395:==================================================

Mean score calculations exclude scores greater than 73
 mean initn score:  26.5 (s.d. 11.28)
 mean init1 score:  25.1 (s.d. 8.61)
 1637 scores better than 87 saved
 joining threshold: 28, optimization threshold: 19

The best scores are:                                      frame init1 initn  opt 
..

gb:Humprp  Human prion protein (PrP) mRNA, complete cds   (2) 1375  1375  1375
gb:Ggu08300  Gorilla gorilla prion protein gene, comple...(1) 1373  1373  1373
gb:Ptu08296  Pan troglodytes prion protein gene, comple...(1) 1372  1372  1372
gb:Hlu08299  Hylobates lar prion protein gene, complete...(1) 1372  1372  1372
gb:Ssu08308  Symphalangus syndactylus prion protein gen...(1) 1372  1372  1372
gb:Ptu15039  Pan troglodytes major prion protein precur...(1) 1372  1372  1372
gb:Ggu15166  Gorilla gorilla amyloid precursor protein ...(1) 1361  1361  1361
gb:Ppu08305  Pongo pygmaeus prion protein gene, complet...(1) 1360  1360  1360
gb:Cgu08297  Colobus guereza prion protein gene, comple...(1) 1355  1355  1355
gb:Mnu08306  Macaca nemestrina prion protein gene, comp...(1) 1354  1354  1354
gb:Mau08311  Macaca arctoides prion protein gene, compl...(1) 1354  1354  1354
gb:Phu08294  Papio hamadryas prion protein gene, comple...(1) 1354  1354  1354
gb:Mmu15163  Macaca mulatta amyloid precursor protein g...(1) 1354  1354  1354
gb:Mmu08307  Macaca mulatta prion protein gene, complet...(1) 1354  1354  1354
gb:Mfu08298  Macaca fascicularis prion protein gene, co...(1) 1354  1354  1354
gb:Mfu08301  Macaca fuscata prion protein gene, complet...(1) 1354  1354  1354
gb:Pfu08302  Presbytis francoisi prion protein gene, co...(1) 1350  1350  1350
gb:Humprion  Human prion protein mRNA, human PrP 27-30 ...(2)  697   697  1345
gb:Humprp0a  Human prion protein 27-30 mRNA, complete cds (2)  697   697  1345
gb:Cju08304  Callithrix jacchus prion protein gene, com...(1) 1039  1039  1329
gb:Cau08295  Cebus apella prion protein gene, complete cds(1) 1037  1037  1323
gb:Ssu15165  Saimiri sciureus amyloid precursor protein...(1) 1038  1038  1321
gb:Ssu08310  Saimiri sciureus prion protein gene, compl...(1) 1041  1041  1317
gb:Apu15164  Ateles paniscus x Ateles fusciceps amyloid...(1) 1034  1034  1317
gb:Cruprpb  Armenian hamster prion protein (PrP) gene, ...(1) 1169  1268  1277
gb:Cruprpa  Chinese hamster prion protein (PrP) gene, c...(1) 1167  1266  1275
gb:Msu08303  Mandrillus sphinx prion protein gene, part...(1) 1274  1274  1274
gb:Bovprp3  Bovine PrP gene encoding prion protein        (1) 1127  1127  1273
gb:Cmu08312  Callicebus moloch prion protein gene, part...(1) 1268  1268  1268
gb:Shpprpa  Sheep PrP gene                                (3) 1119  1119  1265
gb:Oaprp3  O.aries PrP gene, exon 3 and flanking sequences(3) 1119  1119  1265
gb:Hamprp2  Hamster (Syrian golden) PrP gene, complete cds(1) 1159  1255  1264
gb:Chpripr  C.hircus gene for prion protein               (1) 1118  1118  1264
gb:Shpprp  Sheep prion protein (PrP) gene, complete cds   (3) 1116  1116  1262
gb:S55629  PrP=prion protein [cattle, mRNA Partial, 795...(1)  996  1101  1258
gb:Bovprp2  Bovine PrP gene encoding prion protein        (1)  996  1101  1258
gb:Btprp  Bovine PrP gene for a prion-protein             (1)  996  1101  1258
gb:Cdu08292  Cercopithecus dianae prion protein gene, c...(1)  993   993  1257
gb:Cau08291  Cercopithecus aethiops prion protein gene,...(1)  993   993  1257
gb:S46825  prion protein [mink, Genomic, 2446 nt]         (2)  788   893  1255
gb:Pigprpg  Sus scrofa prion protein (PrP) gene, comple...(1)  790   897  1254
gb:Bovprp1  Bovine mRNA for prion protein                 (3)  992  1097  1254
gb:Humprprb  Human prion protein gene, 5' end             (1) 1252  1252  1252
gb:Tspripr  T.streptsiceros gene for prion protein        (1) 1108  1108  1250
gb:Musprpa  Mouse (with short incubation period) prion ...(2)  874   973  1249
gb:Musprp  Mouse prion protein (PrP) mRNA, complete cds   (1)  870   969  1245
gb:U08952  Mustela putorius prion protein (PrP) gene, c...(1)  776   881  1243
gb:Tsprip  T.strepsiceros gene for prion protein          (1)  990  1091  1243
gb:S69654  Prn=prion protein [zitter rats, liver, Genom...(1) 1134  1233  1242
gb:Atu08293  Aotus trivirgatus prion protein gene, part...(1) 1000  1000  1242
gb:Musprpb  Mouse (with long incubation period) prion p...(2)  867   966  1242
gb:Hamprq  Syrian golden hamster scrapie (prion) protei...(2) 1109  1205  1214
gb:Hamprp  Hamster scrapie prion (Prp 27-30) mRNA, 3' end (2) 1109  1205  1214
gb:Agu08309  Ateles geoffroyi prion protein gene, parti...(1)  979   979  1176
gb:Humprpra  Human prion protein gene, 5' end             (2)  941   941  1153
gb:Ratprp  Rat prion-related protein (PrP) mRNA, 3' end   (1) 1008  1107  1116
gb:Tioprp  Trichosurus vulpecular prion protein (PrP) g...(3)  652   652   974
gb:Musprpc  Mouse prion protein (PrP27-30) mRNA, partia...(2)  381   381   381
gb:Chkprlp  Gallus gallus prion-like protein mRNA, comp...(1)  209   249   332
gb:Chkprion  Chicken prion protein gene, complete cds     (1)  211   251   323
gb:S80539  PRNP=PrP amyloid [human, Genomic Mutant, 291...(1)  270   270   278
gb:Dogbgall  Canis beta-galactosides-binding lectin (LG...(5)   60    60   161
gb:Atms1  A.thaliana minisatellite sequence 1, chromoso...(1)  104   104   155
gb:Celrbp1  C.elegans mRNA for hnRNP like protein         (1)   95    95   153
gb:Ggeap300  G.gallus EAP-300 mRNA                        (3)   58    58   153
gb:Dmp11  D.melanogaster gene for P11 protein, A1 relat...(3)   81    81   138
gb:Hse1s2  Equine Herpesvirus 1 Subtype 2,BamHI fragment  (1)  117   117   138
gb:Dmhrb87  Drosphila Hrb87F mRNA for a member of the A...(1)   81    81   137
gb:Dmhnrib2  D.melanogaster gene for heterogeneous nucl...(3)   77    77   137
gb:Saidrd4d  Saimiri boliviensis dopamine receptor D4 (...(4)   98    98   134
gb:Hse1s2  Equine Herpesvirus 1 Subtype 2,BamHI fragment  (2)  115   115   132
gb:Hshox51  Human HOX 5.1 gene for HOX 5.1 protein        (2)   68    89   129
gb:A16444  Xho-PstI fragment from THP-I cells             (6)   65    65   129
gb:Hse1s2  Equine Herpesvirus 1 Subtype 2,BamHI fragment  (3)  115   115   129
gb:Crewp6a  Chlamydomonas eugametos WP6 mRNA, complete cds(4)   62    62   128
gb:Ornreprc  Oreochromis niloticus DNA sequence, repeat...(1)  126   126   128
gb:Ornreprd  Oreochromis niloticus DNA sequence, repeat...(1)  126   126   128
gb:Saidrd4b  Saimiri boliviensis dopamine receptor D4 (...(4)   92    92   128
////////////////////////////////////////////////////////////////////////////////
gb:Ratcbva  Rattus norvegicus vesicla-associate calmodu...(5)  107   107   120
gb:Bovap  Bovine alkaline phosphatase (AP) mRNA, comple...(6)   52    52   120
gb:Ttnusainf  T.thermophilus nusA-infB operon DNA         (6)   64    76   120
gb:Tetrstel2  T.thermophila (type B) tandem repeat 'ccc...(1)   96    96   120
gb:Dyadhps  D.yakuba Adh pseudogene                       (6)   39    39   119

! CPU time used:
!        Database scan: 13:58:45.8
! Post-scan processing:  0:00:34.8
!       Total CPU time: 13:59:22.9
! Output File: Humprp.Tfasta

Had we chosen to not suppress aligning the results, in addition to showing the entry list and translation frames the TFastA output would show the sequence location for each alignment. This information can be used to go back to the original nucleotide entry and check to see whether they correspond to actually translated areas. Sometimes the sequences found by TFastA will not show up in any other searches. This could be valuable information, especially if sometime during your peptide's evolutionary history it incorporated (was 'infected' by) any type of mobile DNA element. This comes to mind in my example because of frequent hits within the herpes virus family.

Next the example FastA output file:

% more humprp.fasta

(Peptide) FASTA of: humprp.pep  from: 1 to: 253  September 26, 1995 17:36

P1;PRIO_HUMAN - MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C)
ID   PRIO_HUMAN     STANDARD;      PRT;   253 AA.
AC   P04156;
DT   01-NOV-1986 (REL. 03, CREATED)
DT   01-NOV-1986 (REL. 03, LAST SEQUENCE UPDATE)
DT   01-JUN-1994 (REL. 29, LAST ANNOTATION UPDATE) . . .


 TO: SwissProt:*  Sequences:     40,292  Symbols: 14,147,368  Word Size: 2
 Scoring matrix: GenRunData:Fastapep.Cmp
 Variable pamfactor used
 Gap creation penalty: 12.0      Gap extension penalty: 4.0


Score Init1  Initn
        (-)    (+)
<  2     13     13:=======
   4      0      0:
   6      7      7:====
   8     32     32:================
  10     87     87:============================================
  12    323    323:==================================================
  14    426    426:==================================================
  16   1374   1374:==================================================
  18   2603   2603:==================================================
  20   3562   3562:==================================================
/////////////////////////////////////////////////////////////////////
  48    112    306:==================================================
  50     86    257:===========================================+++++++
  52     43    195:======================++++++++++++++++++++++++++++
  54     36    164:==================++++++++++++++++++++++++++++++++
  56     25    132:=============+++++++++++++++++++++++++++++++++++++
  58     18     90:=========++++++++++++++++++++++++++++++++++++
  60     12     79:======++++++++++++++++++++++++++++++++++
  62     18     59:=========+++++++++++++++++++++
  64      9     57:=====++++++++++++++++++++++++
  66      7     45:====+++++++++++++++++++
  68      8     37:====+++++++++++++++
  70      6     21:===++++++++
  72      4     19:==++++++++
  74      3     15:==++++++
  76      3      6:==+
  78      2      9:=++++
  80      2     10:=++++
> 80     26     39:=============+++++++

Mean score calculations exclude scores greater than 73
 mean initn score:  26.5 (s.d. 8.47)
 mean init1 score:  25.7 (s.d. 6.76)
 1869 scores better than 50 saved
 joining threshold: 28, optimization threshold: 19

The best scores are:                                        init1 initn  opt..

Swissprotein:Prio_Human  MAJOR PRION PROTEIN PRECURSOR (P...1375  1375  1375
Swissprotein:Prip_Bovin  MAJOR PRION PROTEIN 2 PRECURSOR ...1127  1127  1273
Swissprotein:Prio_Mesau  MAJOR PRION PROTEIN PRECURSOR (P...1159  1255  1264
Swissprotein:Prio_Sheep  MAJOR PRION PROTEIN PRECURSOR (PRP)1116  1116  1262
Swissprotein:Prio_Bovin  MAJOR PRION PROTEIN 1 PRECURSOR ... 996  1101  1258
Swissprotein:Prio_Mouse  MAJOR PRION PROTEIN PRECURSOR (P... 874   973  1249
Swissprotein:Prio_Rat  MAJOR PRION PROTEIN (PRP) (FRAGMENT) 1008  1107  1116
Swissprotein:Prio_Chick  MAJOR PRION PROTEIN HOMOLOG PREC... 211   251   323
Swissprotein:Grp2_Orysa  GLYCINE-RICH CELL WALL STRUCTURA... 101   101   121
Swissprotein:Roa1_Drome  HETEROGENEOUS NUCLEAR RIBONUCLEO...  85    85   121
Swissprotein:K1ci_Human  KERATIN, TYPE I CYTOSKELETAL 9 (... 105   105   115
Swissprotein:Fbrl_Yeast  FIBRILLARIN (NUCLEOLAR PROTEIN 1)    61    61   112
Swissprotein:Fbrl_Leima  FIBRILLARIN                          93    93   111
Swissprotein:Roa1_Scham  HETEROGENEOUS NUCLEAR RIBONUCLEO...  81    88   110
Swissprotein:Grp_Arath  GLYCINE-RICH CELL WALL STRUCTURAL... 110   110   110
Swissprotein:K1c0_Xenla  KERATIN 3, TYPE I CYTOSKELETAL 5... 108   108   109
Swissprotein:K1c1_Mouse  KERATIN, TYPE I CYTOSKELETAL 59 KD   76    76   107
Swissprotein:Roab_Xenla  HETEROGENEOUS NUCLEAR RIBONUCLEO...  57    57   106
Swissprotein:Ebn1_Ebv  EBNA-1 NUCLEAR PROTEIN                103   103   106
Swissprotein:Roaa_Xenla  HETEROGENEOUS NUCLEAR RIBONUCLEO...  57    57   106
Swissprotein:Vg38_Bpox2  RECEPTOR RECOGNIZING PROTEIN (PR...  49    49   106
Swissprotein:Grp_Horvu  GLYCINE-RICH CELL WALL STRUCTURAL...  82    82   105
Swissprotein:K1cj_Human  KERATIN, TYPE I CYTOSKELETAL 10 ...  84    84   104
Swissprotein:Fbrl_Mouse  FIBRILLARIN (NUCLEOLAR PROTEIN 1)    66    66   104
Swissprotein:Mysb_Acaca  MYOSIN HEAVY CHAIN IB                76    76   103
Swissprotein:Fib1_Petma  FIBRINOGEN ALPHA-1 CHAIN PRECURS...  50    50   103
Swissprotein:K2ce_Human  KERATIN, TYPE II CYTOSKELETAL 2 ...  71    71   103
Swissprotein:Grp1_Phavu  GLYCINE-RICH CELL WALL STRUCTURA...  96    96   102
Swissprotein:Ded1_Yeast  PUTATIVE ATP-DEPENDENT RNA HELIC...  56    79   102
Swissprotein:Grp1_Orysa  GLYCINE-RICH CELL WALL STRUCTURA...  87    87   102
Swissprotein:K1cj_Bovin  KERATIN, TYPE I CYTOSKELETAL VIB...  81    81    99
Swissprotein:Gr10_Brana  GLYCINE-RICH RNA-BINDING PROTEIN 10  80    80    99
Swissprotein:Fibs_Nepcl  SILK FIBROIN (FRAGMENT)              79    79    99
Swissprotein:Oc3n_Mouse  NERVOUS-SYSTEM SPECIFIC OCTAMER-...  42    42    98
Swissprotein:Hmpr_Human  HOMEOBOX-POU DOMAIN PROTEIN RDC-1    67    67    98
Swissprotein:K1cj_Mouse  KERATIN, TYPE I CYTOSKELETAL 10 ...  67    67    98
Swissprotein:M130_Strpu  MESENCHYME-SPECIFIC CELL SURFACE...  89    89    98
Swissprotein:Grp8_Arath  GLYCINE-RICH RNA-BINDING PROTEIN 8   77    77    97
Swissprotein:Spt5_Yeast  TRANSCRIPTION INITIATION PROTEIN...  62    83    97
Swissprotein:Vnua_Prvka  PROBABLE NUCLEAR ANTIGEN             70    70    95
Swissprotein:Prlb_Achly  BETA-LYTIC METALLOENDOPEPTIDASE ...  37    37    95
Swissprotein:Oc3n_Human  NERVOUS-SYSTEM SPECIFIC OCTAMER-...  42    42    94
Swissprotein:Vasa_Drome  VASA PROTEIN                         43    43    94
Swissprotein:Grp2_Phavu  GLYCINE-RICH CELL WALL STRUCTURA...  90    90    94
Swissprotein:Grp2_Sorvu  GLYCINE-RICH RNA-BINDING PROTEIN 2   62    80    94
Swissprotein:Bind_Lytva  BINDIN PRECURSOR                     59    91    94
Swissprotein:K2c1_Mouse  KERATIN, TYPE II CYTOSKELETAL 67...  74    74    94
Swissprotein:Egg1_Schja  EGGSHELL PROTEIN 1 PRECURSOR         59    87    93
Swissprotein:Lori_Mouse  LORICRIN                             92    92    93
Swissprotein:Knh2_Bovin  KININOGEN, HMW II PRECURSOR (THI...  42    42    93
Swissprotein:Grp1_Pethy  GLYCINE-RICH CELL WALL STRUCTURA...  91    91    92
Swissprotein:Saa_Anapl  SERUM AMYLOID A PROTEIN (SAA)         56    56    92
Swissprotein:Grp1_Sorvu  GLYCINE-RICH RNA-BINDING PROTEIN...  78    78    92
Swissprotein:Glhr_Antel  PROBABLE GLYCOPROTEIN HORMONE G ...  60    60    92
Swissprotein:Sox4_Human  SOX-4 PROTEIN                        59    59    90
Swissprotein:Cea6_Ecoli  COLICIN E6 (EC 3.1.-.-) (RIBONUC...  62    62    90
////////////////////////////////////////////////////////////////////////////
Swissprotein:Fbrl_Human  FIBRILLARIN (34 KD NUCLEOLAR SCL...  53    53    80
Swissprotein:Agal_Cyate  ALPHA-GALACTOSIDASE PRECURSOR (E...  76    76    80
Swissprotein:Gar1_Schpo  GAR1 PROTEIN                         64    64    80
Swissprotein:Saa3_Mouse  SERUM AMYLOID A-3 PROTEIN PRECURSOR  40    40    80
Swissprotein:Coaa_Bppf1  POSSIBLE COAT PROTEIN A PRECURSOR    52    52    80
Swissprotein:Hmdf_Drome  HOMEOTIC DEFORMED PROTEIN            52    52    79
Swissprotein:K2c1_Human  KERATIN, TYPE II CYTOSKELETAL 1 ...  55    55    79
Swissprotein:Ca13_Mouse  PROCOLLAGEN ALPHA 1(III) CHAIN P...  37    50    79
Swissprotein:Lyg_Cygat  LYSOZYME G (EC 3.2.1.17) (1,4-BET...  32    32    79
Swissprotein:Xyna_Rumfl  BIFUNCTIONAL ENDO-1,4-BETA-XYLAN...  56    56    79
Swissprotein:K1cp_Human  KERATIN, TYPE I CYTOSKELETAL 16 ...  62    62    79
Swissprotein:Rog_Human  HETEROGENEOUS NUCLEAR RIBONUCLEOP...  36    36    79
Swissprotein:Mycl_Mouse  L-MYC PROTO-ONCOGENE PROTEIN         33    33    79
Swissprotein:P_Human  P PROTEIN                               27    27    79


! CPU time used:
!        Database scan:  0:33:02.8
! Post-scan processing:  0:00:11.5
!       Total CPU time:  0:33:16.2
! Output File: humprp.fasta

Both FastA output files show a histogram of the score distribution, then a sorted list of the top scores and finally, if alignments are not suppressed, BestFit style alignments of the score list. Next, an abridged version of the example WordSearch output:

% more humprp.word

 (Peptide) WORDSEARCH of: /disk3/usr/local/people/thompson/bc378/week10/humprp.pep  check: 8781
 from: 1  to: 253

P1;PRIO_HUMAN - MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C)
ID   PRIO_HUMAN     STANDARD;      PRT;   253 AA.
AC   P04156;
DT   01-NOV-1986 (REL. 03, CREATED)
DT   01-NOV-1986 (REL. 03, LAST SEQUENCE UPDATE)
DT   01-JUN-1994 (REL. 29, LAST ANNOTATION UPDATE) . . .

 TO: SwissProt:*  Sequences: 40,292 Total-length: 14,147,368  September 26, 1995
 17:02

 Word-size: 2  Words: 62867213  Diagonals: 11,864,381 Total-diagonals: 24,300,70 
0
 Integral-width: 3  Alphabet: 20 List-size: 100  CPU minutes: 11.93

Sequence                   Strd   Diag Score Width Documentation ..

Swissprotein:Prio_Human       +     -1  278   53   MAJOR PRION PROTEIN PRECURSOR 
 (PRP) (PRP27-30) (PRP33-35C)
Swissprotein:Prip_Bovin       +      3  265   17   MAJOR PRION PROTEIN 2 PRECURS 
OR (PRP) (MAJOR SCRAPIE-ASSOCIATED FIBRIL
Swissprotein:Prio_Sheep       +      3  264   25   MAJOR PRION PROTEIN PRECURSOR 
 (PRP)
Swissprotein:Prio_Mesau       +      0  255   15   MAJOR PRION PROTEIN PRECURSOR
 (PRP) (PRP27-30) (PRP33-35C)
Swissprotein:Prio_Mouse       +      0  240   21   MAJOR PRION PROTEIN PRECURSOR
 (PRP) (PRP27-30) (PRP33-35C)
Swissprotein:Prio_Rat         +    -28  239    5   MAJOR PRION PROTEIN (PRP) (FR
AGMENT)
Swissprotein:Prio_Bovin       +     11  226   15   MAJOR PRION PROTEIN 1 PRECURS
OR (PRP) (MAJOR SCRAPIE-ASSOCIATED FIBRIL
Swissprotein:Grp1_Pethy       +     46   72   30   GLYCINE-RICH CELL WALL STRUCT
URAL PROTEIN 1 PRECURSOR
Swissprotein:Grp1_Pethy       +    176   71   36   GLYCINE-RICH CELL WALL STRUCT
URAL PROTEIN 1 PRECURSOR
Swissprotein:Grp_Arath        +    111   71    3   GLYCINE-RICH CELL WALL STRUCT
URAL PROTEIN PRECURSOR
Swissprotein:Grp1_Phavu       +     44   70    7   GLYCINE-RICH CELL WALL STRUCT
URAL PROTEIN 1.0 PRECURSOR (GRP 1.0)
Swissprotein:Grp_Arath        +    115   69    2   GLYCINE-RICH CELL WALL STRUCT
URAL PROTEIN PRECURSOR
Swissprotein:Grp_Arath        +     82   68    7   GLYCINE-RICH CELL WALL STRUCT
URAL PROTEIN PRECURSOR
Swissprotein:Grp1_Pethy       +     82   67   33   GLYCINE-RICH CELL WALL STRUCT
URAL PROTEIN 1 PRECURSOR
Swissprotein:Grp_Arath        +     91   67    3   GLYCINE-RICH CELL WALL STRUCT
URAL PROTEIN PRECURSOR
Swissprotein:Prio_Chick       +     13   67   19   MAJOR PRION PROTEIN HOMOLOG P
RECURSOR (PR-LP) (ACETYLCHOLINE
///////////////////////////////////////////////////////////////////////////////
Swissprotein:Lori_Mouse       +     44   57   11   LORICRIN
Swissprotein:Lori_Human       +     45   57    5   LORICRIN
Swissprotein:Lori_Human       +    -31   57    4   LORICRIN
Swissprotein:Grp_Arath        +    135   57    2   GLYCINE-RICH CELL WALL STRUCT
URAL PROTEIN PRECURSOR
Swissprotein:Grp2_Phavu       +    180   56   21   GLYCINE-RICH CELL WALL STRUCT
URAL PROTEIN 1.8 PRECURSOR (GRP 1.8)
Swissprotein:Grp2_Phavu       +     99   56    6   GLYCINE-RICH CELL WALL STRUCT
URAL PROTEIN 1.8 PRECURSOR (GRP 1.8)
Swissprotein:Grp2_Phavu       +     90   56    4   GLYCINE-RICH CELL WALL STRUCT
URAL PROTEIN 1.8 PRECURSOR (GRP 1.8)

Notice that the various lists have entries in common but also have substantial differences. Also notice that in many cases WordSearch finds multiple segments. These are reasons why it is always a good idea to run as many different types of analyses as practical. We ran WordSearch with the Plot option to generate a graphic of the score distribution of that procedure. Make sure to initialize your graphics configuration first (remember setplot) and then type figure of wordsearch.figure to draw the plot:

% figure wordsearch.figure

FIGURE makes figures and posters by drawing graphics and text
together. You can include output from other GCG graphics programs as
part of a figure.

  Process set to plot with TEK4107 attached to TERM:
  using the TEKTRONIX graphic interface.

 When your TEK4107 attached to tty is ready, press <rtn>. <rtn> 

The asterisk on the abscissa at a score of about 56 indicates your lowest reported score and helps you to decide if you ran your list large enough. Normally you want your list size big enough to include some of the population of random low scores to help ascertain the significance of the higher scores. In my case above, the list was just big enough to catch the upward slope of the random scores; therefore, the list size was appropriate. The inset shows a blowup of the high score end of the graph -- these are the best alignments found by the program, not the worst!

WordSearch does not illustrate its alignments with sequence pairs. To see only the most interesting alignments found by WordSearch, first use pico to edit your .word file by removing all of the entries near the top that are clearly the same protein as your query. Do not edit out the header information in the file. It may be a good idea to leave the top-most hit as a reference sequence. Also remove the sequences near the bottom of the list which are obviously totally dissimilar. However, do not remove those sequences which are somewhat similar in the middle of the file; they are the interesting ones. We are mainly interested here in how these somewhat similar sequences align with your query. In order to visualize the alignments run the program segments, providing it with your edited WordSearch output file's name; accept the default output file name. This process is shown for the prion example below:

% segments humprp.word

Segments aligns and displays the segments of similarity found by
WordSearch.

 What should I call the output file (* humprp.pairs *) ? <rtn>

 Aligning ............-..
 Swissprotein:Prio_Human    253 aa  Gaps:  0  Quality: 379.5 - Length: 253
 Aligning ............-..
 Swissprotein:Grp1_Pethy    384 aa  Gaps:  8  Quality:  96.1 - Length: 250
 Aligning ............-.
 Swissprotein:Grp1_Pethy    384 aa  Gaps:  6  Quality:  87.6 - Length: 234
 Aligning ...........-..
 Swissprotein:Grp_Arath    338 aa  Gaps:  4  Quality:  91.2 - Length: 234
 Aligning ..........-.
 Swissprotein:Grp1_Phavu    252 aa  Gaps:  6  Quality:  84.8 - Length: 215
 Aligning ...........-..
 Swissprotein:Grp_Arath    338 aa  Gaps:  3  Quality:  89.3 - Length: 225
 Aligning ............-..
 Swissprotein:Grp_Arath    338 aa  Gaps:  6  Quality:  92.0 - Length: 248
 Aligning ............-..
 Swissprotein:Grp1_Pethy    384 aa  Gaps:  8  Quality:  96.3 - Length: 294
 Aligning ............-..
 Swissprotein:Grp_Arath    338 aa  Gaps:  5  Quality:  91.2 - Length: 251
 Aligning ............-..
 Swissprotein:Prio_Chick    273 aa  Gaps: 12  Quality: 165.1 - Length: 267
 Aligning ........-.
 Swissprotein:Grp_Arath    338 aa  Gaps:  2  Quality:  75.3 - Length: 177
 Aligning .........-.
 Swissprotein:Grp_Arath    338 aa  Gaps:  4  Quality:  82.6 - Length: 184
 Aligning .........-.
 Swissprotein:Grp_Arath    338 aa  Gaps:  6  Quality:  83.9 - Length: 201
 Aligning ............-..
 Swissprotein:Grp1_Pethy    384 aa  Gaps:  7  Quality:  87.1 - Length: 249
 Aligning ............-..
 Swissprotein:Grp2_Phavu    465 aa  Gaps:  5  Quality:  95.5 - Length: 247
 Aligning ............-..
 Swissprotein:Grp2_Phavu    465 aa  Gaps:  8  Quality:  83.4 - Length: 250
 Aligning ............-..
 Swissprotein:Grp_Arath    338 aa  Gaps:  3  Quality:  94.7 - Length: 244
 Aligning ............-.
 Swissprotein:Grp_Horvu    200 aa  Gaps: 11  Quality:  80.3 - Length: 194
 Aligning ..........-..
 Swissprotein:Grp_Arath    338 aa  Gaps:  5  Quality:  85.9 - Length: 215

Use more to display the generated file. The prion example follows in an abridged fashion:

% more humprp.pairs

 (BestFit) SEGMENTS from: Humprp.Word  September 27, 1995 11:16

(Peptide) WORDSEARCH of: Disk03:[Thompson.Bc378.Week_10]Humprp.Pep check:
 8781 from: 1  to: 253
P1;PRIO_HUMAN - MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C)
ID   PRIO_HUMAN     STANDARD;      PRT;   253 AA.
AC   P04156;
DT   01-NOV-1986 (REL. 03, CREATED)
DT   01-NOV-1986 (REL. 03, LAST SEQUENCE UPDATE) . . .

 AvMatch: 0.54  AvMisMatch: -0.40  GapWeight: 2.00  LengthWeight: 0.10   ..

Humprp.Pep                check: 8781  from: 1      to: 253
Swissprotein:Prio_Human   check: 8781  from: 1      to: 253
     MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C)
 Gaps: 0  Quality: 379.5  Ratio: 1.500  Score: 278  Width: 53  Limits: +/-54
                  .         .         .         .         .
       1 MANLGCWMLVLFVATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYP 50
         ||||||||||||||||||||||||||||||||||||||||||||||||||
       1 MANLGCWMLVLFVATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYP 50
                  .         .         .         .         .
      51 PQGGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTHSQWN 100
         ||||||||||||||||||||||||||||||||||||||||||||||||||
      51 PQGGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTHSQWN 100
                  .         .         .         .         .
     101 KPSKPKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPIIHFGSDYEDRYY 150
         ||||||||||||||||||||||||||||||||||||||||||||||||||
     101 KPSKPKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPIIHFGSDYEDRYY 150
                  .         .         .         .         .
     151 RENMHRYPNQVYYRPMDEYSNQNNFVHDCVNITIKQHTVTTTTKGENFTE 200
         ||||||||||||||||||||||||||||||||||||||||||||||||||
     151 RENMHRYPNQVYYRPMDEYSNQNNFVHDCVNITIKQHTVTTTTKGENFTE 200
                  .         .         .         .         .
     201 TDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSPPVILLISFLIFL 250
         ||||||||||||||||||||||||||||||||||||||||||||||||||
     201 TDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSPPVILLISFLIFL 250

     251 IVG 253
         |||
     251 IVG 253

Humprp.Pep                check: 8781  from: 1      to: 253
Swissprotein:Grp1_Pethy   check: 1852  from: 46     to: 384
     GLYCINE-RICH CELL WALL STRUCTURAL PROTEIN 1 PRECURSOR<
 Gaps: 8  Quality: 96.1  Ratio: 0.420  Score: 72  Width: 30  Limits: +/-31
                  .         .         .         .         .
       2 ANLGCWMLVLFVATW...SDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNR 48
         :.:|.:  . |. .:   :::| .   . .||:..||:  .|.|..||.
      52 GRFGGRG.PSFGRGRGAGGGFGGGAGGGAGGGLGGGGGLGGGGGAGGGGG 100
                  .         .         .         .         .
      49 YPPQG..GGGWGQPHGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTH 96
         ....|  |||:|.. ||| |.. ||| | . ||| |.. ||| |.|:|.
     101 LGGGGGAGGGFGGGAGGGAGGGLGGGGGLGGGGGGGAGGGGGVGGGAGSG 150
                  .         .         .         .         .
      97 SQWNKPSKPKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPIIHFGSDYE 146
         :.:. .:        ::|:|:||:.||| ||:  |:: :   :  ||:.:
     151 GGFGAGGG.......VGGGAGAGGGVGGGGGFGGGGGGG...VGGGSGHG 190
                  .         .         .         .         .
     147 DRY.YRENMHRYPNQVYYRPMDEYSNQNNFVHDCVNITIKQHTVTTTTKG 195
         : :   :.:   :....  .::: :.... . :.:.        .....|
     191 GGFGAGGGVGGGAGGGLGGGVGGGGGGGSGGGGGIG........GGSGHG 232
                  .         .         .         .         .
     196 ENFTETDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSPPVILLIS 245
         :.|...:.  :: .|:. . .. : ::.:  . |:::.  |:... :  :
     233 GGFGAGGG..VGGGVGGGAAGGGGGGGGGGGGGGGGLGGGSGHGGGFGAG 280

Humprp.Pep                check: 8781  from: 1      to: 253
Swissprotein:Grp1_Pethy   check: 1852  from: 176    to: 384
     GLYCINE-RICH CELL WALL STRUCTURAL PROTEIN 1 PRECURSOR
 Gaps: 6  Quality: 87.6  Ratio: 0.429  Score: 71  Width: 36  Limits: +/-37
                  .         .         .         .         .
      14 ATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGG 63
         :. :: |:.  ....||:..||:  .|.|:. |.  ...||||.|.. |
     176 GGGGGGGVGGGSGHGGGFGAGGGVGGGAGGGLGGGVGGGGGGGSGGGGGI 225
                  .         .         .         .         .
      64 GWGQPHGGGWGQPH......GGGWGQPHGGGWGQGGGTHSQWNKPSKPKT 107
         | | .||||:|..       ||| :.. ||| |.|||. :.:. .| . .
     226 GGGSGHGGGFGAGGGVGGGVGGGAAGGGGGGGGGGGGGGGGLGGGSGHGG 275
                  .         .         .         .         .
     108 NMKHMAG.AAAAGAVVGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHR 156
         .:   :| :::|::.||| ||:  |:: :   :  ||:.:: :
     276 GFGAGGGVGGGAAGGVGGGGGFGGGGGGG...VGGGSGHGGGF....... 315
                  .         .         .         .         .
     157 YPNQVYYRPMDEYSNQNNFVHDCVNITIKQHTVTTTTKGENFTETDVKMM 206
                       .... | :... .:   ...... |:.:.:...  :
     316 ..............GAGGGVGGGAGGGLG..GGGGAGGGGGIGGGHGGGF 349
                  .         .         .
     207 ERVVEQMCITQYERESQAYYQRGSSMVLFSSPPV 240
         : .|:    ...: : .|  .:| ::.  |:...
     350 GVGVG....IGIGVGVGAGAGHGVGVGSGSGSGG 379

Humprp.Pep                check: 8781  from: 1      to: 253
Swissprotein:Grp_Arath    check: 6121  from: 111    to: 338
     GLYCINE-RICH CELL WALL STRUCTURAL PROTEIN PRECURSOR 
 Gaps: 4  Quality: 91.2  Ratio: 0.409  Score: 71  Width: 3  Limits: +/-4
                  .         .         .         .         .
       2 ANLGCWMLVLFVATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYPP 51
         :.:|.   .  .:. ::|| .   . .|| ..||:  .|:|:. |.  ..
     116 GGIGG...GAGGGSGGGLGGGIGGGAGGGAGGGGGLGGGHGGGIGGGAGG 162
                  .         .         .         .         .
      52 QGGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTHSQWNK 101
         .:|||:|..|||| |.. |||.|.. ||| |.. ||| |.|||. :. .
     163 GAGGGLGGGHGGGIGGGAGGGSGGGLGGGIGGGAGGGAGGGGGAGGGGGL 212
                  .         .         .         .         .
     102 PSKPKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPIIHFGSDYEDRYYR 151
         .: . ..:   ||:: :|:..|| || : |:| : :    |:::::
     213 GGGHGGGFGGGAGGGLGGGAGGGTGGGFGGGAGGGAGGGAGGGFGGGAGG 262
                  .         .         .         .         .
     152 ENMHRYPNQVYYRPMDEYSNQNNFVHDCVNITIKQHTVT...TTTKGENF 198
         :.   :...    : :: :.... . :... . .. .|.   ....|:.|
     263 GAGGGFGGG....AGGGAGGGAGGGFGGGAGGGHGGGVGGGFGGGSGGGF 308
                  .         .         .
     199 TETDVKMMERVVEQMCITQYERESQAYYQRGSSM 232
         .:..    : ..:. . ..:: ::.|  . |:::
     309 GGGA....GGGAGGGAGGGFGGGGGAGGGFGGGF 338

///////////////////////////////////////////////////////////////////////////

Humprp.Pep                check: 8781  from: 1      to: 253
Swissprotein:Grp_Arath    check: 6121  from: 91     to: 338
     GLYCINE-RICH CELL WALL STRUCTURAL PROTEIN PRECURSOR
 Gaps: 5  Quality: 91.2  Ratio: 0.375  Score: 67  Width: 3  Limits: +/-4
                  .         .         .         .         .
       2 ANLGCWMLVLFVATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYPP 51
         :.:|.   .  .:. ::|| ..  . .|| ..|::   |.| .||.  .:
      96 GGIGG...GAGGGAGGGLGGGHGGGIGGGAGGGSGGGLGGGIGGGAGGGA 142
                  .         .         .         .         .
      52 QGGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTHSQWNK 101
         .||||:|..|||| |.. ||| |.. ||| |.. ||| |.|:|.  . .
     143 GGGGGLGGGHGGGIGGGAGGGAGGGLGGGHGGGIGGGAGGGSGGGLGGGI 192
                  .         .         .         .         .
     102 PSKPKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPIIHFGSDYEDRYYR 151
         .: : ..    :||:::|::.|| || : |:| :      |:: :: :
     193 GGGAGGGAGGGGGAGGGGGLGGGHGGGFGGGAGGGLGGGAGGGTGGGFGG 242
                  .         .         .         .         .
     152 ENMHRYPNQVYYRPMDEYSNQNNFVHDCVNITIKQHTVTTTTKGENFTET 201
         :.    ...    : ::::.... . ....:.  . ...... |:.|.:.
     243 GAGGGAGGG....AGGGFGGGAG.GGAGGGFGGGAGGGAGGGAGGGFGGG 287
                  .         .         .         .         .
     202 DVKMMERVV....EQMCITQYERESQAYYQRGSSMVL..FSSPPVILLIS 245
         ..   : .|    :. : ..:: :..:  . |.: .:   :::.. :  :
     288 AGGGHGGGVGGGFGGGSGGGFGGGAGGGAGGGAGGGFGGGGGAGGGFGGG 337

     246 F 246
         |
     338 F 338

Humprp.Pep                check: 8781  from: 1      to: 253
Swissprotein:Prio_Chick   check: 9640  from: 13     to: 273
     MAJOR PRION PROTEIN HOMOLOG PRECURSOR (PR-LP) (ACETYLCHOLINE
 Gaps: 12  Quality: 165.1  Ratio: 0.682  Score: 67  Width: 19  Limits: +/-20
                  .         .         .         .         .
      11 LFVATWSDLGLCKK...RPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGW 57
         |::|. .|::|:||   :|..|||..|: | |: ....| .. |. . .
      13 LLLAACTDVALSKKGKGKPSGGGWGAGSHRQPSYPRQPGYPHNPGYPHNP 62
                  .         .         .         .         .
      58 GQPHGGGWGQ....PHGGGWGQ....PHGGGWGQPHGGGWGQGGGTHSQW 99
         | ||..|:.:    ||..|:.|    ||..|: .. |.|:..::|.  :
      63 GYPHNPGYPHNPGYPHNPGYPQNPGYPHNPGY.PGWGQGYNPSSGGSYHN 111
                  .         .         .         .         .
     100 NKPSK.PKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPIIHFGSDYEDR 148
         .||.| ||||:||:|||||||||||||||| :|..||   .||:|. | |
     112 QKPWKPPKTNFKHVAGAAAAGAVVGGLGGYAMGRUNIXGMNYHFDSPDEYR 161
                  .         .         .         .         .
     149 YYRENMHRYPNQVYYRPMDEYSN...QNNFVHDCVNITIKQHTVTTTTK. 194
         ::.||  ||||.||||   :||.   |: || ||.|||:.:..:....|
     162 WWSENSARYPNRVYYR...DYSSPVPQDVFVADCFNITVTEYSIGPAAKK 208
                  .         .         .         .         .
     195 .......GENFTETDV..KMMERVVEQMCITQYERESQAYYQRGSSMVLF 235
                :.| ||.::  |::.:|: :||: || ||    |. :|:: |
     209 NTSEAVAAANQTEVEMENKVVTKVIREMCVQQY.RE....YRLASGIQLH 253
                  .
     236 SSPPVILLISFLIFLIV 252
           |: .:|  :|::|..
     254 ..PADTWLAVLLLLLTT 268

///////////////////////////////////////////////////////////////////////////

It's interesting to note that all of the non-prion hits in the WordSearch output are a glycine-rich cell wall proteins, whereas the FastA and BLAST approaches found these proteins plus some other unrelated ones. WordSearch seems to be particularly sensitive to the glycine repeat regions of the prion sequence. Also notice that multiple segments of the same glycine-rich protein were aligned to the prion protein. As stated earlier, it is good to gather as much different data as you can and not entirely rely on any one algorithm.

5.a. Interpreting Results -- what is significant.

Now find three different higher scoring sequences, one each, from each of the three programs BLAST, FastA, and WordSearch, which are not obviously the same sort of protein as your beta-amyloid query. Only do this for the beta-amyloid protein. We are going to experiment with various methods of analyzing similarity significance with these sequences. I will again illustrate with the prion protein as you follow along my examples. In your case, only do these procedures with the beta-amyloid protein nearest neighbors. There are no `correct' entries to choose, just try to get three completely different interesting ones from just below the obvious matches for each query. Relevant lines that show what I picked as interesting from the prion search examples follow.

From BLAST, the top-most, non-obvious match:

sp|P27177|PRIO_CHICK  MAJOR PRION PROTEIN HOMOLOG PRECURS...   125  3.1e-17   3

Even though the above sequence is a prion homolog it is quite a bit different than all the other prions based on their scores and, therefore, it will be an interesting comparison. From the FastA output we'll pick something totally different:

Swissprotein:Roa1_Drome  HETEROGENEOUS NUCLEAR RIBONUCLEO...  85    85    121

And from WordSearch another different protein:

Swissprotein:Grp1_Pethy       +     46   72   30   GLYCINE-RICH CELL WALL STRUCT

5.a.1. Dot matrix methods: Compare and DotPlot.

Dot matrix analysis is one of the few ways to identify other elements beyond what the dynamic programming algorithms show to be similar between two sequences.

GCG implements dot matrix methods with two programs. Compare generates the data that serves as input to DotPlot which actually draws the matrix. Compare your query sequences to each of its three near neighbors (as described above) using these methods. Start the program by typing compare; supply it with the appropriate sequence names and specify output names that will allow you to keep track of the files. Rerun the program increasing the stringency of the comparisons until the number of points generated is of the same magnitude as the length of the longest sequence being compared. In the first run below, I found that a stringency of 15 out of 30 resulted in 280 points -- very close to the sequences' lengths being compared. In this case, wherever the average of match scores within the window exceeds 15/30's, a point is drawn at the middle of the window; the window is then slid over one position and the process is repeated. Just as in all windowing algorithms, in general you want to use a window of approximately the same size as the feature that you're trying to recognize. For these runs you can leave the window at its default setting of 30; that will be appropriate as an example.

Save the three best output files from this program for the beta-amyloid comparisons. This is three files total: one each from the three programs BLAST, FastA, and WordSearch, all for the beta-amyloid protein query. I will illustrate with two different prion examples:

% compare

Compare compares two protein or nucleic acid sequences and creates a
file of the points of similarity between them for plotting with
DotPlot. Compare finds the points using either a
window-stringency or a word match criterion. The word comparison
is 1,000 times faster than the window-stringency comparison, but
somewhat less sensitive.

 COMPARE what horizontal sequence ?  humprp.pep

                  Begin (* 1 *) ?  <rtn>
                End (*   253 *) ?  <rtn>

 to what vertical sequence (* humprp.pep *) ?  sw:prio_chick

                  Begin (* 1 *) ?  <rtn>
                End (*   273 *) ?  <rtn>

 What comparison window size (* 30 *) ?  <rtn>

 What stringency (* 9.0 *) ?  15

 What should I call the output file (* humprp.pnt *) ? humprp-prio_chick.pnt

 Number of points: 280

 Writing . 

Next, provide this file as input to DotPlot. Type dotplot and provide the correct filename; accept all program defaults. The human prion cross chicken prion-homolog protein comparison example follows:

% dotplot

DOTPLOT makes a dot-plot with the output file from COMPARE,
FOLD, or STEMLOOP.

  Process set to plot with TEK4107 attached to TERM:
  using the TEKTRONIX graphic interface.

 DOTPLOT what point file ?  humprp-prio_chick.pnt

 humprp-prio_chick.pnt contains COMPARE results of

    Axis                 Name   Check   Start     End   Dir

 Horizontal        Humprp.Pep    8781       1     253   for
  Vertical         Prio_Chick    9640       1     273   for

             Window . . . . . . . . . 30
             Stringency . . . . . . . 15.0
             Number of points . . . . 280
             Percent of possible  . . 0.405

 The minimum density for a one-page plot is
 312.5 bases-100 platen units on each axis.

 What point density would you like (* 312.50 *) ? <rtn>

 DOTPLOT will take 1 pages.  Would you like to:

       P)lot the points
       D)ifferent density
       G)et another point file to plot

       Q)uit

 Please select one (* P *):  <rtn>

 When your TEK4107 attached to tty is ready, press <rtn>. <rtn>

The humprp-prio_chick dotplot follows:

Notice that running the comparison at an appropriate stringency, in this case 15 out of 30, produces a very clean plot with very little confusing noise. Still, sometimes interpreting a dotplot can be a major accomplishment in itself -- just remember that diagonals are regions of similarity between the two sequences and any diagonal off the main center line is indicative of regions which do not correspond in linear placement between the two sequences yet are still similar. The regions of similarity between the human prion protein and the homolog in chicken are very clear in the dotplot. Notice that most coincide in order although one chicken region is seen to repeat strongly four times in the human sequence.

The next example does not show such clear-cut similarity:

% compare humprp.pep sw:roa1_drome

Compare compares two protein or nucleic acid sequences and creates a
file of the points of similarity between them for plotting with
DotPlot. Compare finds the points using either a
window-stringency or a word match criterion. The word comparison
is 1,000 times faster than the window-stringency comparison, but
somewhat less sensitive.

                  Begin (* 1 *) ?  <rtn>
                End (*   253 *) ?  <rtn>

                  Begin (* 1 *) ?  <rtn>
                End (*   365 *) ?  <rtn>

 What comparison window size (* 30 *) ?  <rtn>

 What stringency (* 9.0 *) ?  17

 What should I call the output file (* humprp.pnt *) ? humprp-roa1_drome.pnt

 Number of points: 269

 Writing .

In this run I had to increase the stringency to 17 of 30 in order to filter appropriately. The resultant dotplot is shown below:

The strong column of diagonals in this example shows that one local region of Roa1_Drome repeats itself several times in the prion protein. These probably relate to the very high level of glycine repeat sequences in both proteins. Perform similar analyses with your beta-amyloid protein and its three interesting partners. When running all the dotplots, take notes of those particular regions in the proteins that appear significantly similar. For example, as noted in the above plot, the region of Roa1_Drome from about residue 210 through 250 repeats strongly in humprp from around residue 40 through residue 100. After performing dot matrix analysis on the three beta-amyloid protein comparisons, rerun the DotPlots with the same optimum parameters as you determined previously; however, add the option -figure=(your last name).dotplot1, 2 or 3 respectively, to the command line to create three Figure output files that will be sent to teacher for evaluation.

5.a.2 The dynamic programming alignment algorithms and significance estimation.

Gap and BestFit are GCG's implementation of pairwise dynamic programming -- use the right program for the job at hand. There is a difference between these two algorithms! Gap is a 'global' alignment scheme while BestFit is more 'local.' Using one versus the other implies that you are looking for distinctly different relationships. Know what this mean. If you already know that the full length of two sequences is pretty close, that they probably belong to the same gene family, then Gap is the program of choice; if you only suspect an area of one is similar to another, then you should use BestFit. To force BestFit to be even more local you can specify the more stringent alternative symbol comparison table pam250.cmp located in the GCG logical directory GenMoreData. Both programs can also generate "gapped" output files in standard sequence formats; this can be very handy as direct input to other GCG routines -- particularly multiple sequence analysis programs.

GCG provides a way to estimate alignment significance with BestFit and Gap. When running either program specify -randomizations=100 on the command line and the second input sequence will be jumbled 100 times after the initial alignment is produced. Comparing the quality scores of the randomized alignments to the initial alignment can help give a feeling for the relative meaning of the scores. An old rule-of-thumb that people often use is, if the actual score is much more than three standard deviations above the mean of the randomized scores, the analysis may be significant; if it is much more than five, than it probably is significant; and if it is above nine, than it definitely is significant. This distance above the mean is often called the "Z score" and can be calculated with the following formula:

     Z score  =  [ ( actual score ) - ( mean of randomized scores) ]
                 -------------------------------------------------------   
                 ( standard deviation of randomized score distribution )

This type of significance estimation is known as a Monte Carlo approach. It has many implicit statistical problems; however, few practical alternatives exist. I will use BestFit with Randomizations to first illustrate the prion and Roa1_Drome example. Before beginning with your own data though, study your dotplot notes from before. This approach works best when applied to areas where you already know some similarity exists and you wish to further test that similarity. Therefore, restrict your analyses to those regions identified by the dotplots. However, remember that dotplots show us all the regions that are similar, whereas dynamic programming only gives us one optimal solution. Type bestfit -random=100 and supply the appropriate beta-amyloid sequence comparison data names to run the program. Be sure to restrict the analyses to those regions of each sequence identified by your dotplots. Accept the rest of the program's defaults; however, give the output file names that make sense to you while keeping the .pair extension. The process follows below for the human prion and Drosophila ribonucleoprotein Roa1_Drome comparison:

% bestfit -random=100

BestFit makes an optimal alignment of the best segment of similarity
between two sequences. Optimal alignments are found by inserting gaps to
maximize the number of matches using the local homology algorithm of
Smith and Waterman.

 BESTFIT of what sequence 1 ?  humprp.pep

                  Begin (* 1 *) ?  40
                End (*   253 *) ?  100

 to what sequence 2 (* humprp.pep *) ?  sw:roa1_drome

                  Begin (* 1 *) ?  210
                End (*   365 *) ?  250

 What is the gap creation penalty (* 3.00 *) ?  <rtn>

 What is the gap extension penalty (* 0.10 *) ?  <rtn>

 What should I call the paired output display file (* humprp.pair *) ? humprp-roa1
.pair

 Aligning ..-.

          Gaps:     1
       Quality:  29.2
 Quality Ratio: 0.712
  % Similarity: 56.098
        Length:    46

Randomized alignment              Quality
           1                         24.0
           2                         24.9
           3                         23.8
           4                         28.7
           5                         25.0
////////////////////////////////////////////////////////////
          93                         25.3
          94                         24.4
          95                         27.5
          96                         25.3
          97                         26.0
          98                         23.6
          99                         26.1
         100                         24.2

 Average quality based on 100 randomizations: 25.2 +/- 1.9

The output file is shown below. In this example, the randomized quality scores are all fairly close to the actual score. The Z score calculates to be 2.1; therefore, the interpretation is that the similarity is not significant.

% more humprp-roa1.pair

 BESTFIT of: humprp.pep  check: 8781  from: 40  to: 100

P1;PRIO_HUMAN - MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C)
ID   PRIO_HUMAN     STANDARD;      PRT;   253 AA.
AC   P04156;
DT   01-NOV-1986 (REL. 03, CREATED)
DT   01-NOV-1986 (REL. 03, LAST SEQUENCE UPDATE)
DT   01-JUN-1994 (REL. 29, LAST ANNOTATION UPDATE) . . .

 to: roa1_drome  check: 8848  from: 210  to: 250

P1;ROA1_DROME - HETEROGENEOUS NUCLEAR RIBONUCLEOPROTEIN A1 (HNRNP CORE PROTEIN 
 A1-A)
ID   ROA1_DROME     STANDARD;      PRT;   365 AA.
AC   P07909;
DT   01-AUG-1988 (REL. 08, CREATED)
DT   01-AUG-1988 (REL. 08, LAST SEQUENCE UPDATE)
DT   01-OCT-1994 (REL. 30, LAST ANNOTATION UPDATE) . . .

 Symbol comparison table: Gencoredisk:Gcgcore/Data/Rundata/Swgappep.Cmp
 CompCheck: 1254

         Gap Weight:  3.000      Average Match:  0.540
      Length Weight:  0.100   Average Mismatch: -0.396

            Quality:   29.2             Length:     46
              Ratio:  0.712               Gaps:      1
 Percent Similarity: 56.098   Percent Identity: 41.463

 Average quality based on 100 randomizations: 25.2 +/- 1.9

 humprp.pep x roa1_drome   September 30, 1995 16:53  ..

      40 GQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQPH 85
         |.|:|||     .:||.:|.  ||.:|..:|||  .  |..||. :
     210 GRGGPGG.....RAGGNRGNMGGGNYGNQNGGGNWNNGGNNWGNNR 250

The extreme glycine richness of both sequences is what the program found, yet the Monte Carlo test suggests that this similarity is not significant -- do not blindly accept the output of any computer program! It may be strictly artifactual, as in this case.

Contrast the previous run with the output from the comparison between the human prion protein and the chicken homolog. Here the region of similarity was much more extensive; therefore, I extended the analysis from residue 15 through 220 in HumPrP and 15 through 240 in Prio_Chick:

% more prp-chick.pair

 BESTFIT of: humprp.pep  check: 8781  from: 15  to: 220

P1;PRIO_HUMAN - MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C)
ID   PRIO_HUMAN     STANDARD;      PRT;   253 AA.
AC   P04156;
DT   01-NOV-1986 (REL. 03, CREATED)
DT   01-NOV-1986 (REL. 03, LAST SEQUENCE UPDATE)
DT   01-JUN-1994 (REL. 29, LAST ANNOTATION UPDATE) . . .

 to: prio_chick  check: 9640  from: 15  to: 240

P1;PRIO_CHICK - MAJOR PRION PROTEIN HOMOLOG PRECURSOR (PR-LP) (ACETYLCHOLINE
ID   PRIO_CHICK     STANDARD;      PRT;   273 AA.
AC   P27177;
DT   01-AUG-1992 (REL. 23, CREATED)
DT   01-JUL-1993 (REL. 26, LAST SEQUENCE UPDATE)
DT   01-OCT-1993 (REL. 27, LAST ANNOTATION UPDATE) . . .

 Symbol comparison table: Gencoredisk:Gcgcore/Data/Rundata/Swgappep.Cmp
 CompCheck: 1254

         Gap Weight:  3.000      Average Match:  0.540
      Length Weight:  0.100   Average Mismatch: -0.396

            Quality:  139.3             Length:    223
              Ratio:  0.693               Gaps:      7
 Percent Similarity: 60.000   Percent Identity: 43.500

 Average quality based on 100 randomizations: 62.5 +/- 3.6

 humprp.pep x prio_chick   September 30, 1995 19:25  ..

                  .         .         .         .         .
      17 SDLGLCKK...RPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGG 63
         .|::|:||   :|..|||..|: | |: ....| .. |. . . | ||..
      19 TDVALSKKGKGKPSGGGWGAGSHRQPSYPRQPGYPHNPGYPHNPGYPHNP 68
                  .         .         .         .         .
      64 GWGQ....PHGGGWGQ....PHGGGWGQPHGGGWGQGGGTHSQWNKPSK. 104
         |:.:    ||..|:.|    ||..|: .. |.|:..::|.  : .||.|
      69 GYPHNPGYPHNPGYPQNPGYPHNPGY.PGWGQGYNPSSGGSYHNQKPWKP 117
                  .         .         .         .         .
     105 PKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENM 154
         ||||:||:|||||||||||||||| :|..||   .||:|. | |::.||
     118 PKTNFKHVAGAAAAGAVVGGLGGYAMGRUNIXGMNYHFDSPDEYRWWSENS 167
                  .         .         .         .         .
     155 HRYPNQVYYRPMDEYSNQNNFVHDCVNITIKQHTVTTTTK........GE 196
          ||||.||||. ..   |: || ||.|||:.:..:....|        :.
     168 ARYPNRVYYRDYSSPVPQDVFVADCFNITVTEYSIGPAAKKNTSEAVAAA 217
                  .         .
     197 NFTETDV..KMMERVVEQMCITQ 217
         | ||.::  |::.:|: :||: |
     218 NQTEVEMENKVVTKVIREMCVQQ 240

The Z score calculation for this comparison yields a value of 21.3, definite significance!

Run the bestfit -randomizations with all the beta-amyloid comparisons being careful to restrict your analyses to those regions that you determined were most similar from the dotplots. This should produce a total of three .pair Monte Carlo comparison files. Be sure all have identifying names that make sense to you. I will be asking about these results in the exercise report form.

5.b. Where to go next -- what to do with significant matches.

Further directions naturally depend on the focus of your research. If you are dealing with a protein of unknown function, then perhaps functions suggested by matches with other known proteins can be experimentally tested in the wet-laboratory. Often regions of similarity can be refined into a motif and used to search the databases for further matches, especially when more than one sequence can be used to create the motif. If areas turn out to be similar to regions of proteins that have had their three-dimensional structure solved, structural features may be inferred and whole new vistas of research may open. The most powerful analyses will come with multiple sequence alignments and profiles created from similar areas. You will be learning techniques for multiple sequence alignment in next week's exercise. And whenever a reliable multiple sequence alignment can be created, then evolutionary relationships, phylogenies, can be estimated.

6) For extra credit:

Devise a method for ascertaining whether there is any significant similarity between any regions of the human prion protein and any regions of the human beta-amyloid protein? Prepare a file and rcp it to teacher that shows me the most significant region you can find; explain how you found it and describe the significance of the similarity.

7) Finishing up and conclusions -- do you have any?

To get credit for this lab, you need to complete the exercise report form that was copied into your directory at the beginning of the tutorial. Rename it to have your last name with the mv command but leave the extension .week10, and then go into the file using the pico editor to fill in the report.

% mv week10.week10 (your lastname).week10

% pico (your lastname).week10

Remote copy the file to teacher with the rcp command:

% rcp (your lastname).week10 teacher@ribozyme:receive

You should also have three dotplot Figure files from the beta-amyloid comparisons. Be sure to rcp them to teacher also:

% rcp (your lastname).dotplot* teacher@ribozyme:receive

This concludes your computing session for this week. Log off ribozyme, get out of the emulator and back to the overlapping windows screen.

% exit

Press the <alt> and <x> keys together. You will be asked if you really want to exit the program. Respond with <y> to get out of the teemtalk emulator and return to the overlapping windows screen.

This has been a very long exercise; I do apologize. However, database searching techniques and their interpretation are among the most misunderstood of all sequence based biocomputing techniques. There is a tremendous amount of confusion and misinformation in this area and anything that can be done to try to clear up some of the mess is entirely worthwhile. One point that still needs to be made is that the previous techniques were performed largely using GCG's suggested defaults. This usually will work for you, but it is a good idea to think about what these default values imply and adjust them accordingly, especially if the results seem inappropriate after running through a first pass with the default parameters intact.

References

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) Basic Local Alignment Tool. Journal of Molecular Biology 215, 403-410.

Bairoch A. (1992) PROSITE: A Dictionary of Sites and Patterns in Proteins. Nucleic Acids Research 20, 2013-2018.

Gribskov, M. and Devereux, J., editors (1992) Sequence Analysis Primer. W.H. Freeman and Company, New York,. NY

Henikoff, S. and Henikoff, J.G. (1992) Amino Acid Substitution Matrices from Protein Blocks. Proceedings of the National Academy of Sciences USA 89, 10915-10919.

Needleman, S.B. and Wunsch, C.D. (1970) A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular Biology 48, 443-453.

Pearson, P., Francomano, C., Foster, P., Bocchini, C., Li, P., and McKusick, V. (1994) The Status of Online Mendelian Inheritance in Man (OMIM) medio 1994. Nucleic Acids Research 22, 3470-3473.

Pearson, W.R. and Lipman, D.J. (1988) Improved Tools for Biological Sequence Analysis. Proceedings of the National Academy of Sciences USA 85, 2444-2448.

Schwartz, R.M. and Dayhoff, M.O. (1979) Matrices for Detecting Distant Relationships. In Atlas of Protein Sequences and Structure, (M.O. Dayhoff editor) 5, Suppl. 3, 353-358, National Biomedical Research Foundation, Washington D.C.

Smith, T.F. and Waterman, M.S. (1981) Comparison of Bio-Sequences. Advances in Applied Mathematics 2, 482-489.

Wilbur, W.J. and Lipman, D.J. (1983) Rapid Similarity Searches of Nucleic Acid and Protein Data Banks. Proceedings of the National Academy of Sciences USA 80, 726-730.