The dynamic programming algorithm, symbol comparison tables, dot-matrix analysis, hashing techniques, and heuristics: the methods and programs for pairwise sequence comparison and database similarity searching.
Steven M. Thompson
Given the nucleotide or amino acid sequence of a biological molecule, what can we know about that molecule? As we've seen in earlier exercises, we can find biologically relevant information in it by searching for particular patterns that reflect some function of the molecule. These can be catalogued motifs, secondary structure predictions and physical attributes such as hydrophobicity, or even the content of the DNA itself as in some of the gene finding techniques. But what about comparisons with other sequences? Can we learn about one molecule by comparing it to another? Yes, naturally we can; inference through homology is fundamental to all the biological sciences.
By comparing the conserved portions of sequence amongst a set, all of the sensitivity and power of the computational techniques is magnified. The basic assumption is that those portions of sequence of crucial functional value are most constrained against evolutionary change. They will not tolerate many mutations. Not that mutations don't happen in these portions, just that most mutations in the region are lethal so we never see them. Other areas of sequence are able to drift more readily and are subject to less evolutionary pressure. Therefore, the sequence ends up a mosaic of quickly and slowly changing regions over evolutionary time. However, in order to learn anything by comparing sequences, we need to know how to compare them. We can use those constrained portions as 'anchors' to create a sequence alignment so that we can compare them. But this brings up the alignment problem and 'similarity.' It is easy to see that two sequences are aligned when they have identical symbols at identical positions, but what happens when symbols are not identical or the sequences are not the same length? How can we know that the most similar portions of our sequences are aligned, when is an alignment optimal, and does optimal mean biologically correct? Part of the solution to this problem is known as the dynamic programming algorithm.
The dynamic programming examples in the suggested text (Gribskov and Devereux, 1992) are excellent. Please carefully read pages 125 through 134 to see how David States and Mark Boguski explain the algorithm. In its simplest implementation we will consider matching symbols to be worth one point and will not consider gapping at all. The solution occurs in two stages. The first begins very much like the dot matrix methods described further below; the second is totally different. I will use the same example as the text with one simplification. Instead of calculating the "score matrix" on the fly as we proceed through the graph, I like to completely fill in an original "match matrix" then add points to those positions which produce favorable alignments. Follow the example below in concert with your text example.
a) A completed match matrix using one point for matching and zero points for mismatching:
A A T G C A 1 1 0 0 0 G 0 0 0 1 0 G 0 0 0 1 0 C 0 0 0 0 1
b) Now begin to add points based on the best path through the matrix, always working diagonally, left to right and top to bottom. The second row is completed here:
A A T G C A 1 1 0 0 0 G 0 0+1 0+1 1+1 0+1 G 0 0 0 1 0 C 0 0 0 0 1
c) Continue adding points based on the best previous path through the matrix. The third row is completed here:
A A T G C A 1 1 0 0 0 G 0 1 1 2 1 G 0 0+1 0+1 1+1 0+2 C 0 0 0 0 1
d) The score matrix is now complete:
A A T G C A 1 1 0 0 0 G 0 1 1 2 1 G 0 1 1 2 2 C 0 0+1 0+1 0+1 1+2
e) Now pick the bottom, right-most, highest scores in the matrix and work your way back through it. This is called the traceback stage and the matrix is now referred to as the path graph. In this case that highest score is in the right-hand corner, but it need not be:
A A T G C A 1 1 0 0 0 G 0 1 1 2 1 G 0 1 1 2 2 C 0 1 1 1 3
f) The completed traceback is shown with bold characters; these are all optimal alignments:
A A T G C A 1 1 0 0 0 G 0 1 1 2 1 G 0 1 1 2 2 C 0 1 1 1 3The following alignments are all generated from the above path graph (f). All five have three matches. Gap penalties would eliminate the last two of them; however, that still leaves three:
AG.GC A.GGC .AGGC A..GGC .A.GGC | || | || | || | | | | | | AATGC AATGC AATGC AATG.C AATG.CThe software will arbitrarily choose only one of these to report as the optimal. This decision can be partly controlled in the GCG programs BestFit and Gap with the HighRoad/LowRoad options. This brings up an important point. Just because dynamic programming is guaranteed to find an optimal alignment, it is not necessarily the only optimal alignment. Furthermore, the optimal alignment is not necessarily the 'right' or biologically relevant alignment. As always, question the results of any computerized solution based on what you know about the biology of the system. The above example illustrates the Needleman and Wunsch (1970) global solution. Later refinements (Smith and Waterman, 1981) demonstrated how dynamic programming can also be used to find optimal local alignments.
A further complication occurs in protein sequences. Certain amino acids are very much alike, structurally, chemically and genetically. How can we take advantage of the similarity of amino acids in our alignments? People have been struggling with this problem since the late 1960's. Margaret Dayhoff (Schwartz and Dayhoff, 1979) unambiguously aligned closely related datasets (no more than 15% difference) available at that point in time and noticed that certain residues, if they mutate at all, are prone to change into certain other residues. As it works out, these propensities for change fell into the same categories that chemists had known for years -- those same chemical and structural classes mentioned above, conserved through the evolutionary contraints of natural selection. However, Dayhoff's empirical observation quantified these changes. Based on the alignments that she created, the assumption that estimated mutation rates in closely related proteins can be extrapolated to more distant relationships, and fancy matrix and logarithmic mathematics that smooth out the statistics of the system, she was able to specify the probabilities at which different residues mutated into other residues through evolutionary history. This is the basis of the famous PAM (corrupted acronym of accepted point mutation) 250 (meaning that the matrix has been multiplied by itself 250 times) table. Since Dayhoff's time other biomathematicians (e.g. see Henikoff and Henikoff's [1992] BLOSUM 62 table) have created newer tables with more or less success than Dayhoff's original but the concept remains the same and Dayhoff's original PAM 250 table remains the most widely used one. Collectively these types of tables are known as symbol comparison tables or scoring matrices and they are fundamental to all sequence comparison techniques.
Another powerful method that should always be considered in similarity analysis is the dot matrix procedure. In dot matrix analysis one sequence is plotted on the vertical axis against another on the horizontal axis using a very simple approach; wherever they match according to some scoring system that you specify, a dot is generated. Dot matrix analysis can point out areas of similarity between two sequences that all other methods might miss. This is because most other methods align either the overall sequences or just the 'best' parts of each to achieve one optimal alignment. Dot matrix methods enable the operator to visualize the entirety of both sequences; if you will, it allows the 'Gestalt' of the alignment to be seen. However, their interpretation is entirely up to the user -- you must know what the plots mean and how to successfully filter out extraneous background noise when running them. Using this method correctly you can identify areas within sequences that happen to have significant matches that no other method would ever notice.
Once these problems are understood we can screen the database to look for sequences to compare ours to. When a sequence is found to fall into a preexisting group we can infer function, mechanism, evolution, and possibly even structure based on homology with its neighbors. Most database searching programs use several elements of the concepts discussed above; however, they use tricks to make things happen faster. These tricks fall into two main categories, that of hashing and that of approximation. Hashing is the process of breaking your query sequence into small "words" or "ktuples" of a set size and creating a "look-up" table with those words keyed to numbers. Then when any of the words match part of an entry in the database, that match is saved. In general, hashing reduces the complexity of the search problem from N2 for dynamic programming to N, the length of all the sequences in the database. Approximation techniques are collectively kwon as "heuristics." Webster's defines heuristic as "serving to guide, discover, or reveal; . . . but unproved or incapable of proof." In database searching techniques the heuristic usually restricts the necessary search space by calculating some sort of a statistic that allows the program to decide whether further scrutiny of a particular match should be pursued. This statistic may miss things depending on the parameters set -- that's what makes it heuristic.
Database searching is far more sensitive at the amino acid level than at the DNA level because proteins have twenty match criteria versus DNA's four. This drastically cuts down the 'noise' level of the search. Therefore, whenever dealing with coding sequence, it is always prudent to search at the protein level. Even though protein searching is more sensitive, the DNA databases are much larger. This drawback has been partly overcome with programs which take a protein query and compare it to translated nucleotide databases, but one still needs to know if the translation is 'real.' So, even though there are advantages and disadvantages to both types of searching, the general rule is to query with a peptide sequence, if at all possible, and screen whichever database you choose.
One small database that should always be screened when working with protein sequence is PROSITES. The GCG program Motifs performs this search. Motifs searches for recognized structural, regulatory and enzymatic consensus sequences in the PROSITE Dictionary of Protein Sites and Patterns (Bairoch, 1992). You ran Motifs back in Week 4 and so you will not be repeating it here but its importance needs to be mentioned again. Another small database that should not be ignored is NRL_3D. Remember, this database contains all the sequences from the three-dimensional coordinate database PDB and, thus, can serve as a link between structural and sequence based methods.
A big question and a very common mistake that is made in this whole area of searching and alignment is the concept of homology versus similarity: There is a huge difference! Similarity is merely a statistical parameter which describes how much two sequences, or portions of them, are alike by some scoring table. Homology, by definition, implies an evolutionary relationship -- more than just the fact that we've all evolved from the same old pond scum. There is really no such thing as percent homology; something is either homologous or it is not. You need to be able to demonstrate some type of direct lineage between the organisms or genes of interest in order to claim homology. Even better, be able to show some type of experimental evidence, morphological or genetic, that corroborates your assertion. Do not make the common mistake of calling any old sequence similarity homology.
How do you tell if a similarity is significant or is truly homologous? Many of the programs generate percent similarity scores, however these really don't mean a whole lot. The "quality" score means a lot more but it is hard to interpret. They are only relevant within the context of a particular comparison or search. At least they take the length of similarity, all of the necessary gaps introduced, and the matching of symbols all into account. To get a better handle on what these various scores mean, read the algorithm section of the GCG Program Manual for the various methods -- statistics are always confusing but the descriptions do help. A few of the programs can generate histograms of score distributions, but again, they can be confusing.
One of the best ways of deciding significance is to take advantage of the Randomizations option available in the two GCG dynamic programming algorithms BestFit and Gap. To utilize this strategy, compare two sequences using the appropriate program, either Gap or BestFit depending on whether you're trying to compare the entire length of each sequence or only the best regions of similarity to each, respectively, and specify the Randomizations option on the command line. This option jumbles the second sequence of the comparison a set number of times and then generates scores and a standard deviation based on the jumbled matches. You can then compare the random scores to the unjumbled score using a "Z score" calculation to help decide significance. BLAST (Altschul, et al., 1990) uses a similar approach but bases its statistics on a comparison to all the rest, "insignificantly similar," members of the database being searched.
Many database searches work best when submitted as a batch process. This is because of the sizes of the databases of interest. Ribozyme currently has almost 750,000 sequences in its nucleotide databases. In spite of the fast hashing style algorithms incorporated, most programs can take quite a while to search through that much data. Most of the GCG database searches accept a Batch option which is really handy. An exception to the standard "submit the search and wait" style of most database searching is the BLAST program. This program uses an extremely fast heuristic statistical hashing algorithm on a parallel computer located at the National Center for Biotechnological Information. Typical searches run in just a minute of two; however, realize the limitations of this algorithm to finding only ungapped areas of similarity and the fact that it is optimized for protein comparisons only.
This week's exercise will explore methods for comparing and searching with sequences. First the databases will be searched for interesting comparisons and then methods for analyzing those comparisons will be learned. The two protein molecules used in this week's exercise are both responsible for debilitating disease and yet both are encoded by the organism's own DNA and both genes are expressed in both normal and afflicted cells. In both cases large amounts of proteinaceous plaques aggregate and are deposited in the brains of afflicted animals.
I will illustrate the use of all the programs by running through them first with the human prion protein as an example. You should not repeat the prion analyses in your own accounts; you will use the other protein sequence in your directory to run the analyses yourself. The prion protein has an unknown natural function but is found in high quantities in the brain of animals infected with the degenerative neurological diseases scrapie and Bovine Spongiform Encephalopathy, in wild stock, and kuru, Creutzfeldt-Jacob Disease, or Gerstmann-Straussler Syndrome in humans. It is also involved in Fatal Familial Insomnia and recently gained notoriety as the harbinger of "Mad-Cow Disease." In human beings the gene maps to position 20p12-pter and the disease can be inherited in an autosomal dominant fashion. Seventeen pathologic allelic variants are listed in the current OMIM (1995). One of the most peculiar aspects of the prion is no infective nucleotide entity has ever been found, yet the protein particle itself is highly infectious. Somehow the infectious protein particle induces a posttranslational, pathological change in the host's normal protein to convert it to the aberrant isoform. The primary amino acid sequence is not changed, only the structural conformation of the protein is different.
The other protein in this exercise is responsible for Alzheimer's Disease. It is a small portion of a nexin II type protease present in normal organisms. The intact protein is a neuronal receptor which couples to intracellular signaling pathways through the GTP-binding protein G(0). Some isoforms also have clathrin-binding and BPTI/Kunitz type protease inhibitor domains. However, for unknown reasons in individuals with Alzheimer's and in some older Down's syndrome cases, aberrant catabolism of the precursor creates the small insoluble, pathological protein known as the beta-amyloid protein. This small protein product comes from part of the transmembrane and extracellular domains of the intact protein. Several different mutations have been identified in the gene of afflicted individuals, however, this is only the case in a small fraction of all Alzheimer's patients. It is a very complex system as six different alternatively spliced forms of the protein are naturally found, all coded by 19 exons located at map position 21q21.2 in human beings. In this week's exercise you will perform all of the same analyses with the beta-amyloid protein, as those that I demonstrate using the prion protein.
l) Activate the machine, connect to ribozyme and log into your account.
By this time you should know how to activate the machine you want to use, make connections with ribozyme, and log into your account. If you still need help with these functions, refer to the beginning of the exercises for weeks 2 and 3 for step by step instructions.
2) Move to this week's subdirectory and copy the necessary files to it.
% cd ten
Now move over all the files needed to do this week's exercise. They are located in the directory location $UGRAD_DIR/week10.
% cp $UGRAD_DIR/week10/* .
3) Run the demo that describes this weeks activities.
This week's demo illustrates some of the concepts explored in the exercise. However, here you will see it used to compare a sequence to itself. It particularly attempts to show how to interpret the dot plots. The effect of stringency and window sizes on these plots will be illustrated. This technique can easily be used to help visualize internal repeats in structures. The prion protein used has a distinct repeat region identified in PROSITES that will be illustrated in the demo. Type the following command to launch this week's demo:
% demo10
Be sure to activate gcg and use setplot to designate your graphics configuration manually before attempting to perform any sequence analysis functions in the exercise.
4) Database searching programs: heuristics and "hashing" algorithms.
4.a. Traditional database searches.
Two different hashing style symbol matching algorithms have traditionally been utilized in database searching. These two algorithms are incorporated into GCG's FastA (Pearson and Lipman, 1988) and WordSearch (Wilbur and Lipman, 1983) programs. Since the algorithms differ, the output results may also differ. As in all types of computerized molecular biology analyses, it is best to run as many strategies as practical and try to interpret the results in light of their combined output. Most of the following programs accept GCG's Batch option; pay attention to the necessity of running many of these programs in the background.
4.a.1. TFastA.
TFastA takes advantage of the sensitivity of a protein query and the size of the nucleic acid database. It is the most robust of the single sequence searching programs onsite. It compares a peptide sequence against all six translation frames of each sequence in the DNA database. This way you can take advantage of the multitude of DNA sequences that never make it to the protein databases and yet still retain the sensitivity of protein searches.
Type tfasta -check -batch to start the Translation FastA program in batch mode and see all of the available options. At the command option prompt add -noalign to suppress all the alignments since most are redundant and we will be exploring the more interesting ones later on anyway. Some of the other options can be very helpful depending on your specific situation. -optall in particular is very useful; take advantage of it also. It takes longer to run but sorts your output list based on the optimum score, the result of the final dynamic programming pass in FastA, not the initn score, the longest combined word score. Specify the peptide's sequence name at the program's sequence prompt. Accept the defaults up to list size=, and specify 100. The prion example session follows below (Remember you are to use the beta-amyloid protein, not the prion molecule in your runs!):
% tfasta -check -batch
TFastA does a Pearson and Lipman search for similarity between a
query peptide sequence and any group of nucleotide sequences. TFastA
translates the nucleotide sequences in all six reading frames before
performing the comparison. It is designed to answer the question, "What
implied peptide sequences in a nucleotide sequence database are similar
to my peptide sequence?"
Minimal Syntax: % TFASTA [-INfile1=]GGamma.Pep -Default
Prompted Parameters:
[-INfile2=]GenEMBL:* search set (all of GenEMBL)
[-OUTfile=]GGammaCod.TFasta output file name
-BEGin=1 -END=148 range of interest
-WORdsize=2 word size
-LIStsize=40 number of scores and alignments to show
Local Data Files:
-DATa=FastAPep.Cmp scoring matrix for peptides
Optional Parameters:
Press q to quit or <Return> for more:
-GAPweight=12.0 gap creation penalty
-LENgthweight=4.0 gap extension penalty
-SINce=6.90 limits search to sequences dated on or after June 1990
-THREEFrames translates and searches only the three forward reading
frames
-FRAme=1 translates and searches only the frame specified.
-NOPAMfactor uses a constant factor to calculate initial diagonal scores
-OPTall[=20] immediately computes opt score if initn above threshold
-SCAle scales scores by ln(n0) divided by ln(n1)
-SHOWall shows complete sequences in alignment, not just overlaps
-MARKx=3 determines the alignment display mode
-NOALIGN suppresses alignments
-NOHIStogram suppresses printing the histogram
-LINEsize=60 number of sequence symbols per line of the alignment
-NODOCLines suppresses sequence documentation in the alignment
-NOMONitor suppresses the screen trace for each search set sequence
-NOINCrease suppresses increase to LIStsize when not interactive
-BATch submits the program to run in the batch queue
Add what to the command line ? -noalign -opt
TFASTA with what query sequence ? humprp.pep
Begin (* 1 *) ? <rtn>
End (* 253 *) ? <rtn>
Search for query in what sequence(s) (* GenEMBL:* *) ? <rtn>
What word size (* 2 *) ? <rtn>
List how many best scores (* 40 *) ? 100
What should I call the output file (* humprp.tfasta *) ? <rtn>
** TFASTA will run as a batch or at job.
** TFASTA was submitted using the command:
" SUBMIT-NOPRINT-NOTIFY-QUEUE "
Job TFASTA_152926 (queue RIBOZYME, entry 259) started on RIBOZYME
The job will submit itself to the batch queue and you can proceed with the rest of the exercise.
4.a.2. Regular FastA.
Now run the normal FastA program; search the SwissProt:* logical sequence specification. Start the program by typing fasta -batch -noalign -opt (the options are the same as in TFastA) to run the program in batch mode and again suppress alignments and sort by optimum score. Specify the sequence's filename.ext and accept the default search set, the SwissProt database. Accept the rest of the defaults, however, use a list size of 100 again. The prion example is illustrated below:
% fasta -batch -noalign -opt
FastA does a Pearson and Lipman search for similarity between a query
sequence and any group of sequences. For nucleotide database searches,
FastA is more sensitive than BLAST.
FASTA with what query sequence ? humprp.pep
Begin (* 1 *) ? <rtn>
End (* 253 *) ? <rtn>
Search for query in what sequence(s) (* SwissProt:* *) ? <rtn>
What word size (* 2 *) ? <rtn>
List how many best scores (* 40 *) ? 100
What should I call the output file (* Humprp.fasta *) ? <rtn>
** fasta will run as a batch or at job.
** fasta was submitted using the command:
" SUBMIT-NOPRINT-NOTIFY-QUEUE "
Job FASTA_154303 (entry 263) pending
Again the Batch option automatically submits the program to the batch queue and you should proceed with the exercise.
4.a.3. WordSearch.
The last traditional database search we will submit is called WordSearch. Unlike FastA, WordSearch can find more than one region of similarity between two sequences. Just like the previous two, it will accept automatic batch submission. Type wordsearch -batch to begin the program; additionally, take advantage of this program's graphics capability by specifying -plot on the command line. Give the program the sequence's name and accept the logical SwissProt:* as the sequences to be searched. Again, specify a list size of 100. An option this program will accept, but that we will not be using this time, is Simplify. This option simplifies the input sequences according to a scheme which generalizes amino acids into broad classes based on their chemical characteristics. Another helpful option available is Mask. This option allows you to search with only certain characters of your sequence such as only the first and second codon positions in coding DNA. The example session with the prion protein follows below:
% wordsearch -batch -plot
WordSearch identifies sequences similar to a query sequence using a
Wilbur and Lipman search. WordSearch answers the question,
"What sequences in the database are similar to my sequence?" The output
is a list of significant diagonals whose alignments can be displayed
with Segments.
Process set to plot with TEK4107 attached to TERM:
using the TEKTRONIX graphic interface.
WORDSEARCH with what query sequence ? humprp.pep
Begin (* 1 *) ? <rtn>
End (* 253 *) ? <rtn>
Search for query in what sequence(s) (* SwissProt:* *) ? <rtn>
What word size (* 2 *) ? <rtn>
List how many best diagonals (* 50 *) ? 100
Integrate how many adjacent diagonals (* 3 *) ? <rtn>
What should I call the output file (* Humprp.word *) ? <rtn>
** wordsearch will run as a batch or at job.
** wordsearch was submitted using the command:
" SUBMIT-NOPRINT-NOTIFY-QUEUE "
Job WORDSEARCH_154655 (entry 265) pending4.b. BLAST; heuristic Internet similarity searching.
The BLAST server at NCBI can provide the most up to date and quickest database search available. BLAST is a heuristic algorithm for searching sequence databases developed by the National Center for Biotechnology Information at the National Library of Medicine. The acronym stands for Basic Local Alignment Search Tool. BLAST by default runs on an eight processor parallel computer system at NCBI. It only looks for ungapped segments; however, it ranks matches statistically and provides probability values for each to help evaluate significance. As such, it is best for identifying shorter regions of high similarity -- exactly what you might want with a sequence of unknown function. It is very fast, at least an entire order of magnitude over traditional sequence similarity database searching, yet maintains all of the sensitivity of the old methods for local similarity regions with protein sequences! You'll get your reply in a manner of minutes in most cases. Another advantage with BLAST is it not only shows you the best alignment for each similar sequence found (as in the BestFit type alignments of FastA) but also shows the next best alignments for each up to a certain preset cutoff point. This combines the power of dot-matrix type analyses and the interpretative ease of traditional sequence alignments. One can fine-tune BLAST by altering its operating parameters and by taking advantage of the many options available in it; however, BLAST is really not appropriate for comparing nucleotide sequences against the nucleotide database. When you are forced to perform nucleotide-to-nucleotide searches, it is usually best to use FastA instead.
BLAST accesses the latest database (NCBI updates GenBank every night) by default, nucleotide or protein. Also by default, the GCG implementation of BLAST runs in a remote client-server mode such that NCBI's database and computer perform the analysis; alternatively you can run it locally if you have custom assembled databases that you wish to screen.
For help in interpreting BLAST results refer to the GCG BLAST documentation, the BLAST HELP file obtained by sending the single word "HELP" to BLAST@ncbi.nlm.nih.gov (leave the subject line blank), and the text book for the course. Run blast -check on the prion protein. Specify the -filter=xs option. This is very powerful for screening out low complexity and repeat sequences from your query to minimize confusion due to random noise. The prion protein is very rich in these types of confounding sequence and so is a very appropriate example for showing this option. Also specify -segments=0 to suppress the alignment section and reduce the size of the output file. The standard output file is very long because BLAST automatically aligns the best 100 matches. Accept the default database choice, 1) nr p which stands for "NonRedundant Protein," to search all of the unique sequences from the most current versions of all of the protein databases. Set the listsize parameter to 100, as in the previous searches. A screen trace of a BLAST session with the prion protein follows:
% blast -check BLAST searches a specified database to find sequences that are similar to a query sequence. BLAST either can search databases on your own computer or it can search the databases maintained at the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland, USA. Minimal Syntax: % BLAST [-INfile1=]SW:Zea2_Maize -Default Prompted Parameters: [-INfile2=]swissprot database to search -EXPect=10.0 ignores scores that would occur by chance more than 10 times -LIStsize=250 maximum number of sequences listed in the output [-OUTfile=]Zea2.Blastp output file name Local Data Files: [-DATa1=BLAST.rDBs] list of available remote databases [-DATa2=BLAST.lDBs] list of available local databases [-DATa3=BLAST.sDBs] list of available site-specific databases Optional Parameters: Press q to quit or <Return> for more: <rtn> -ONEstrand only translates the top strand -FILter=xs filters low complexity segments out of query -MATrix=PAM120 uses alternative protein scoring matrix -TRANSlate=0 selects among alternative genetic codes -REMoteonly searches only remote databases (simplifies menu) -LOCalonly searches only local databases ( " ) -DBNucleotideonly searches only nucleic databases ( " ) -DBProteinonly searches only protein databases ( " ) -SERver=cruncher.nlm.nih.gov runs remote search on this particular host -CUToffscore=56 score below which MSPs are ignored -WORdsize=3 word size -SYNonym=10 threshold for words to be treated as equivalent -SEGments=100 the number of segment pairs displayed in output -APPend="string" appends "string" to pass-through command line -BATch submits program to batch queue Add what to the command line ? -filter=xs -segments=0 BLAST search with what query sequence? humprp.pep Search for query in what sequence database: 1) nr p Non-redundant updated protein database PDB+SwissProt+PIR 2) pdb p Brookhaven Protein Data Bank, current release 3) swissprot p SWISS-PROT current release 4) pir p PIR, current release 5) spupdate p SWISS-PROT cumulative weekly update 6) genpept p translations of coding sequences in GenBank 7) gpupdate p cumulative daily updates of GenPept 8) kabatpro p Kabat Proteins of immunological interest 9) tfd p TFD transcription factor 10) palu p Six-frame translations of representative human ALU repeats 11) nr n Non-redundant updated nucleotide database PDB+GenBank+EMBL 12) pdb n Brookhaven Protein Data Bank 13) genbank n GenBank, current release 14) gbupdate n GenBank cumulative daily update 15) embl n EMBL current release 16) emblu n EMBL cumulative daily update 17) vector n Vector subset of GenBank 18) repbase n Primate Alu repeats 19) kabatnuc n Kabat nucleotide sequences of immunological interest 20) epd n Eukaryotic Promoter Database 21) dbest n Expressed sequence tags Press q to quit or <Return> for more: <rtn> Please choose one (* 1 *): <rtn> Ignore hits expected to occur by chance more than (* 10.0 *) times? <rtn>Limit the number of sequences in my output to (* 250 *) ? 100 What should I call the output file (* humprp.blastp *) ? <rtn> Trying cruncher.nlm.nih.gov (130.14.25.175) Connected to cruncher.nlm.nih.gov Search in progress on the network server. ............................ Retrieving results. ... WARNING: HSPs involving 69 database sequences were not reported due to the limiting value of parameter B = 0. . Done!
The prion BLAST search reply follows:
% more humprp.blastp
National Center for Biotechnology Information (NCBI)
Experimental GENINFO(R) BLAST Network Service (Cruncher)
Tue Sep 26 19:21:59 EDT 1995, Up 18 days, 10:27, load: 19.50, 24.50, 27.76
If results of this search are reported or published, please mention that
the computation was performed at the NCBI using the BLAST network service.
Problems with the service should be reported to a local system administrator.
PEPTIDE SEQUENCE DATABASES
nr Non-redundant PDB+SwissProt+PIR+SPUpdate+GenPept+GPUpdate, updated daily
for efficient, complete searches of the five component databases:
pdb Brookhaven Protein Data Bank, April 1994 Release
swissprot SWISS-PROT Release 31.0, March 1995
pir PIR Release 45.0 (complete), June 30, 1995
spupdate SWISS-PROT cumulative weekly update to the major release
genpept CDS translations from GenBank(R) Release 90, August 15, 1995
gpupdate cumulative daily updates to the major release of genpept
kabatpro Kabat Sequences of Proteins of Immunological Interest, June 1995
tfd TFD transcription factor (protein) database Release 7.0, June 1993
alu * Translations of select Alu repeats from REPBASE
NUCLEOTIDE SEQUENCE DATABASES
nr Non-redundant PDB+GBUpdate+GenBank+EmblUpdate+EMBL, updated daily
for efficient, complete searches of the four component databases:
pdb Brookhaven Protein Data Bank, April 1994 Release
genbank GenBank(R) Release 90 (no daily updates), August 15, 1995
gbupdate GenBank(R) cumulative daily updates to the major release
embl EMBL Data Library, Release 43.0, June 1995
emblu EMBL Data Library cumulative daily updates to the major release
vector Vector subset of GenBank(R), February 3rd, 1995
alu *+ Select Alu repeats from REPBASE
kabatnuc Kabat Sequences of Nucleic Acid of Immunological Interest, June 1995
epd Eukaryotic Promoter Database Release 40, September 1994
dbest + Database of Expressed Sequence Tags (cumulative daily update)
dbsts + Database of Sequence Tagged Sites Release 1.5, October 26, 1994
* Databases that are not accessible through the NCBI Retrieve E-mail server.
+ The TBLASTX program is restricted to searching these databases.
=============================================================================
You can obtain the BLAST documentation files, send a message consisting of
just the word ''help'' (without the quotes) to: blast@ncbi.nlm.nih.gov
Last modification dates: August 10th 95 for the E-mail server help, January
19th 94 for the BLAST manual and March 18th 95 for the BLAST FAQ.
=============================================================================
For a free subscription to "NCBI News", the NCBI newsletter, send a request
along with your name and postal mailing address to: info@ncbi.nlm.nih.gov
=============================================================================
A new GenBank sequence submission tool, called BankIt, is now available
through the NCBI's home page on the World Wide Web. The URL is
http:--www.ncbi.nlm.nih.gov-
=============================================================================
BLASTP 1.4.8MP [20-June-1995] [Build 16:33:21 Sep 5 1995]
Reference: Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers,
and David J. Lipman (1990). Basic local alignment search tool. J. Mol. Biol. 215:403-10.
Query= TITLE humprp.pep
(253 letters)
>lcl|TITLE humprp.pep
MANLGCWMLVLFVATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXTHSQXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCV
NITIKQHXXXXXXXXXXXXXXDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSXXX
XXXXXXXXXXXVG
Database: Non-redundant PDB+SwissProt+SPupdate+PIR+GenPept+GPupdate
157,254 sequences; 46,130,256 total letters.
Searching..................................................done
Smallest
Sum
High Probability
Sequences producing High-scoring Segment Pairs: Score P(N) N
pir|A40372|A40372 major prion protein PrP variant 1 -... 506 4.6e-106 2
sp|P40250|PRIO_CERAE MAJOR PRION PROTEIN PRECURSOR (PRP)... 485 1.0e-102 2
sp|P40249|PRIO_CEBAP MAJOR PRION PROTEIN PRECURSOR (PRP)... 488 1.4e-102 2
sp|P40258|PRIO_SAISC MAJOR PRION PROTEIN PRECURSOR (PRP)... 479 2.5e-102 2
sp|P40247|PRIO_CALJA MAJOR PRION PROTEIN PRECURSOR (PRP)... 482 5.0e-102 2
gp|U15164|APU15164_1 major prion protein precursor [Atel... 485 1.3e-101 2
gp|U15165|SSU15165_1 major prion protein precursor [Saim... 476 2.4e-100 2
pir|A05017|A05017 major prion PrP27-30 protein precur... 506 8.7e-100 2
gp|M81929|HUMPRPRA_1 prion protein [Homo sapiens] 506 4.8e-99 2
sp|P04156|PRIO_HUMAN MAJOR PRION PROTEIN PRECURSOR (PRP)... 733 9.2e-99 1
sp|P40252|PRIO_GORGO MAJOR PRION PROTEIN PRECURSOR (PRP)... 730 2.4e-98 1
sp|P40253|PRIO_PANTR MAJOR PRION PROTEIN PRECURSOR (PRP)... 725 1.2e-97 1
sp|P40246|PRIO_ATEGE MAJOR PRION PROTEIN PRECURSOR (PRP)... 485 7.9e-97 2
gp|U15166|GGU15166_1 major prion protein precursor [Gori... 719 8.4e-97 1
sp|P40251|PRIO_COLGU MAJOR PRION PROTEIN PRECURSOR (PRP)... 713 5.9e-96 1
sp|P40256|PRIO_PONPY MAJOR PRION PROTEIN PRECURSOR (PRP)... 712 8.1e-96 1
sp|P40254|PRIO_MACFA MAJOR PRION PROTEIN PRECURSOR (PRP)... 710 1.5e-95 1
sp|P40245|PRIO_AOTTR MAJOR PRION PROTEIN PRECURSOR (PRP)... 481 1.9e-95 2
sp|P40257|PRIO_PREFR MAJOR PRION PROTEIN PRECURSOR (PRP)... 708 2.9e-95 1
sp|P04925|PRIO_MOUSE MAJOR PRION PROTEIN PRECURSOR (PRP)... 424 8.0e-95 3
gp|M18071|MUSPRPB_1 Mouse (with long incubation period)... 423 1.1e-94 3
gp|M13685|MUSPRP_1 Mouse prion protein (PrP) mRNA, com... 420 2.9e-94 3
gp|M81930|HUMPRPRB_1 prion protein [Homo sapiens] 688 2.0e-92 1
pir|A23545|A23545 major prion PrP27-30 protein - hamster 409 3.4e-92 3
sp|Q01880|PRIP_BOVIN MAJOR PRION PROTEIN 2 PRECURSOR (PR... 543 6.8e-90 2
sp|P40248|PRIO_CALMO MAJOR PRION PROTEIN PRECURSOR (PRP)... 665 3.3e-89 1
sp|P40255|PRIO_MANSP MAJOR PRION PROTEIN PRECURSOR (PRP)... 664 4.5e-89 1
pir|A54281|A54281 prion protein precursor - sheep >gp... 536 6.4e-89 2
sp|P10279|PRIO_BOVIN MAJOR PRION PROTEIN 1 PRECURSOR (PR... 481 1.0e-88 3
sp|P40243|PRP2_TRAST MAJOR PRION PROTEIN 2 PRECURSOR (PR... 540 1.2e-88 2
sp|P23907|PRIO_SHEEP MAJOR PRION PROTEIN PRECURSOR (PRP)... 534 1.2e-88 2
pir|S37149|S37149 prion protein - goat >gp|X74758|CHP... 532 2.3e-88 2
gp|U25965|OHU25965_1 prion protein precursor [Odocoileus... 532 2.3e-88 2
gp|D10612|BOVPRP1_1 prion protein [Bos taurus] 477 3.6e-88 3
sp|P40242|PRIO_TRAST MAJOR PRION PROTEIN 1 PRECURSOR (PR... 475 2.4e-87 3
pir|A34759|A34759 prion protein - Chinese hamster >gp... 614 5.5e-86 2
pir|B34759|B34759 prion protein - golden hamster >gp|... 614 5.5e-86 2
sp|P40244|PRIO_MUSVI MAJOR PRION PROTEIN PRECURSOR (PRP)... 456 8.5e-86 3
gp|U08952|MPU08952_1 prion protein [Mustela putorius] 446 2.1e-84 3
gp|L07623|PIGPRPG_1 prion protein [Sus scrofa] 449 2.9e-84 3
sp|P04273|PRIO_MESAU MAJOR PRION PROTEIN PRECURSOR (PRP)... 598 9.4e-84 2
gp|M37381|HAMPRP_1 Hamster scrapie prion (Prp 27-30) m... 560 2.0e-78 2
pir|A03133|UJHYIH major prion PrP27-30 protein - gold... 547 1.3e-76 2
gp|S69654|S69654_1 prion protein [Unknown.] 544 3.2e-76 2
sp|P13852|PRIO_RAT MAJOR PRION PROTEIN (PRP) (FRAGMENT... 491 6.4e-75 3
gp|L38993|TIOPRP_1 prion protein [Trichosurus vulpecular] 400 1.3e-72 3
gp|K02234|HAMPRQ_2 PrP 27-30 gene product [Mesocricetu... 408 5.7e-53 1
gp|M30384|MUSPRPC_1 Mouse prion protein (PrP27-30) mRNA... 188 8.0e-21 1
pir|A22315|A22315 Scrapie prion - mouse (fragment) 187 1.2e-20 1
sp|P27177|PRIO_CHICK MAJOR PRION PROTEIN HOMOLOG PRECURS... 125 3.1e-17 3
pir|A41280|UJCH major prion protein homolog precurs... 120 2.6e-16 3
pir|A37372|A37372 prion protein homolog precursor - c... 120 2.6e-16 3
pir|S02520|S02520 Major prion PrP, scrapie, protein -... 149 2.0e-14 1
pir|S14078|S14078 amyloid protein, 11K - human 115 9.1e-10 1
gp|X79912|OAPRP3_1 PrP gene product [Ovis sp.] 68 0.018 1
gp|X55691|LEEXTEN11_1 glycine-rich protein [Lycopersicon ... 64 0.52 1
pir|S14978|S14978 glycine-rich protein (clone wN) - t... 64 0.57 1
gp|X55689|LEEXTEN9_1 glycine-rich protein [Lycopersicon ... 64 0.57 1
pir|S14977|S14977 glycine-rich protein (clone wM) - t... 64 0.59 1
pir|A36019|A36019 Scrapie prion - golden hamster (fra... 58 0.63 1
sp|P17816|GRP_HORVU GLYCINE-RICH CELL WALL STRUCTURAL P... 64 0.65 1
sp|P21750|SALA_DROME PROTEIN SPALT-ACCESSORY. >pir|A3391... 47 0.74 2
sp|Q01157|GRW1_LYCES GLYCINE-RICH CELL WALL STRUCTURAL P... 55 0.78 1
gp|X55690|LEEXTEN10_1 glycine-rich protein [Lycopersicon ... 52 0.87 1
gp|Z48625|HVGRP2_1 glycine rich protein [Hordeum vulgare] 59 0.90 1
pir|A31994|A31994 keratin 10, type I, epidermal - hum... 42 0.97 3
sp|P21749|SALA_DROSI PROTEIN SPALT-ACCESSORY. >pir|B3391... 43 0.992 2
sp|P10569|MYSC_ACACA MYOSIN IC HEAVY CHAIN. >pir|A33891|... 60 0.993 1
sp|P20805|GLN2_FRASC GLUTAMINE SYNTHETASE II (EC 6.3.1.2... 44 0.9995 2
WARNING: HSPs involving 69 database sequences were not reported due to the
limiting value of parameter B = 0.
Parameters:
V=100
B=0
-filter=xnu+seg
-echofilter
-ctxfactor=1.00
E=10
Query ----- As Used ----- ----- Computed ----
Frame MatID Matrix name Lambda K H Lambda K H
+0 0 BLOSUM62 0.323 0.137 0.437 same same same
Query
Frame MatID Length Eff.Length E S W T X E2 S2
+0 0 253 150 10. 59 3 11 22 0.22 31
Statistics:
Query Expected Observed HSPs HSPs
Frame MatID High Score High Score Reportable Reported
+0 0 62 (28.9 bits) 506 (235.6 bits) 244 7
Query Neighborhd Word Excluded Failed Successful Overlaps
Frame MatID Words Hits Hits Extensions Extensions Excluded
+0 0 4844 15914728 3944497 11936634 33590 542
Database: Non-redundant PDB+SwissProt+SPupdate+PIR+GenPept+GPupdate
Release date: 6:02 AM EDT Sep 26, 1995
Posted date: 6:03 AM EDT Sep 26, 1995
# of letters in database: 46,130,256
# of sequences in database: 157,254
# of database sequences satisfying E: 69
No. of states in DFA: 558 (55 KB)
Total size of DFA: 107 KB (128 KB)
Time to generate neighborhood: 0.03u 0.01s 0.04t Real: 00:00:00
No. of processors used: 8
Time to search database: 88.03u 3.25s 91.28t Real: 00:00:15
Total cpu time: 88.15u 3.42s 91.57t Real: 00:00:16
WARNINGS ISSUED: 1
<ENDSome very important features of BLAST's output need to be pointed out. First notice that the Filter=xs switch causes the troublesome portions of the query sequence to be ignored in the search. Pay attention to the Poisson distribution scores. These are the likelihoods that the observed matches could be due to chance. Therefore, the smaller the number, the more significant. You should be able to see a fairly clear demarcation where the scores quickly drop off between the significant hits and the background noise. Another important point to notice is, unlike all other GCG programs, the list generated by this search is not appropriate as input to other GCG analyses. This output list is actually generated by NCBI's computer which doesn't know about GCG format requirements. Therefore, if you want to use the results of a BLAST search in another GCG program you must manually edit the list changing the database names to reflect the logicals that GCG understands. For example, change sp|P04145|PRIO_HUMAN to SW:P04145 PRIO_HUMAN (either insert a blank space or ! between the access code and sequence name). The "gp" designation is GenPept, all the translations from GenBank's CDS references. GenPept is not installed on VADMS systems. Therefore, if you want to include any in subsequent analyses, you must individually translate them from GenBank. For instance, gp|U15164|APU15164_1 is the first CDS translation from GenBank entry name APU15164. This does make life a bit more complicated but can be worked around.
Now that you've read over and seen how database searching works with the prion protein, go back and do all four searches in exactly the same manner with the beta-amyloid protein. Your first session for the week is complete after all of your searches have been submitted. If you are done early, you may wish to work on this week's problem set. By the time you return for your second session of the week all batch jobs should be finished and you can continue with the rest of the exercise.
5) What Next? Comparisons, interpretations, and further analyses.
When the output from the previous "traditional" batch jobs is available, take a look at the files to check out the results of all your searches with the beta-amyloid protein. You'll need these results in order to complete the report form for this exercise. Remember, all output from this lab will be located in your week 10 directory. Also, all GCG batch jobs send log files to your mail system -- you'll want to delete these files if the batch jobs finished correctly. You may want to print out your search results using the lpr command for reference in the remainder of the exercise although it is not necessary. The abridged prion example output files follow the more commands below. Naturally, the topmost "hits" are prion proteins; it's the ones just below the expected hits that may prove interesting.
% more humprp.tfasta
(Peptide) TFASTA of: humprp.pep from: 1 to: 253 September 27, 1995 07:29
P1;PRIO_HUMAN - MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C)
ID PRIO_HUMAN STANDARD; PRT; 253 AA.
AC P04156;
DT 01-NOV-1986 (REL. 03, CREATED)
DT 01-NOV-1986 (REL. 03, LAST SEQUENCE UPDATE)
DT 01-JUN-1994 (REL. 29, LAST ANNOTATION UPDATE) . . .
TO: GenEMBL:* Sequences: 197,612 Symbols: 235,990,291 Word Size: 2
Scoring matrix: GenRunData:Fastapep.Cmp
Variable pamfactor used
Gap creation penalty: 12.0 Gap extension penalty: 4.0
Score Init1 Initn
(-) (+)
< 2 10080 10080:==================================================
4 1986 1986:==================================================
6 2489 2489:==================================================
8 10330 10330:==================================================
10 15801 15801:==================================================
12 41835 41835:==================================================
14 23697 23697:==================================================
16 61735 61735:==================================================
18 88756 88756:==================================================
/////////////////////////////////////////////////////////////////////
62 623 4253:==================================================
64 436 3563:==================================================
66 325 3123:==================================================
68 257 2436:==================================================
70 164 2116:==================================================
72 116 1731:==================================================
74 105 1383:==================================================
76 76 1271:======================================++++++++++++
78 47 933:========================++++++++++++++++++++++++++
80 67 795:==================================++++++++++++++++
> 80 299 3395:==================================================
Mean score calculations exclude scores greater than 73
mean initn score: 26.5 (s.d. 11.28)
mean init1 score: 25.1 (s.d. 8.61)
1637 scores better than 87 saved
joining threshold: 28, optimization threshold: 19
The best scores are: frame init1 initn opt
..
gb:Humprp Human prion protein (PrP) mRNA, complete cds (2) 1375 1375 1375
gb:Ggu08300 Gorilla gorilla prion protein gene, comple...(1) 1373 1373 1373
gb:Ptu08296 Pan troglodytes prion protein gene, comple...(1) 1372 1372 1372
gb:Hlu08299 Hylobates lar prion protein gene, complete...(1) 1372 1372 1372
gb:Ssu08308 Symphalangus syndactylus prion protein gen...(1) 1372 1372 1372
gb:Ptu15039 Pan troglodytes major prion protein precur...(1) 1372 1372 1372
gb:Ggu15166 Gorilla gorilla amyloid precursor protein ...(1) 1361 1361 1361
gb:Ppu08305 Pongo pygmaeus prion protein gene, complet...(1) 1360 1360 1360
gb:Cgu08297 Colobus guereza prion protein gene, comple...(1) 1355 1355 1355
gb:Mnu08306 Macaca nemestrina prion protein gene, comp...(1) 1354 1354 1354
gb:Mau08311 Macaca arctoides prion protein gene, compl...(1) 1354 1354 1354
gb:Phu08294 Papio hamadryas prion protein gene, comple...(1) 1354 1354 1354
gb:Mmu15163 Macaca mulatta amyloid precursor protein g...(1) 1354 1354 1354
gb:Mmu08307 Macaca mulatta prion protein gene, complet...(1) 1354 1354 1354
gb:Mfu08298 Macaca fascicularis prion protein gene, co...(1) 1354 1354 1354
gb:Mfu08301 Macaca fuscata prion protein gene, complet...(1) 1354 1354 1354
gb:Pfu08302 Presbytis francoisi prion protein gene, co...(1) 1350 1350 1350
gb:Humprion Human prion protein mRNA, human PrP 27-30 ...(2) 697 697 1345
gb:Humprp0a Human prion protein 27-30 mRNA, complete cds (2) 697 697 1345
gb:Cju08304 Callithrix jacchus prion protein gene, com...(1) 1039 1039 1329
gb:Cau08295 Cebus apella prion protein gene, complete cds(1) 1037 1037 1323
gb:Ssu15165 Saimiri sciureus amyloid precursor protein...(1) 1038 1038 1321
gb:Ssu08310 Saimiri sciureus prion protein gene, compl...(1) 1041 1041 1317
gb:Apu15164 Ateles paniscus x Ateles fusciceps amyloid...(1) 1034 1034 1317
gb:Cruprpb Armenian hamster prion protein (PrP) gene, ...(1) 1169 1268 1277
gb:Cruprpa Chinese hamster prion protein (PrP) gene, c...(1) 1167 1266 1275
gb:Msu08303 Mandrillus sphinx prion protein gene, part...(1) 1274 1274 1274
gb:Bovprp3 Bovine PrP gene encoding prion protein (1) 1127 1127 1273
gb:Cmu08312 Callicebus moloch prion protein gene, part...(1) 1268 1268 1268
gb:Shpprpa Sheep PrP gene (3) 1119 1119 1265
gb:Oaprp3 O.aries PrP gene, exon 3 and flanking sequences(3) 1119 1119 1265
gb:Hamprp2 Hamster (Syrian golden) PrP gene, complete cds(1) 1159 1255 1264
gb:Chpripr C.hircus gene for prion protein (1) 1118 1118 1264
gb:Shpprp Sheep prion protein (PrP) gene, complete cds (3) 1116 1116 1262
gb:S55629 PrP=prion protein [cattle, mRNA Partial, 795...(1) 996 1101 1258
gb:Bovprp2 Bovine PrP gene encoding prion protein (1) 996 1101 1258
gb:Btprp Bovine PrP gene for a prion-protein (1) 996 1101 1258
gb:Cdu08292 Cercopithecus dianae prion protein gene, c...(1) 993 993 1257
gb:Cau08291 Cercopithecus aethiops prion protein gene,...(1) 993 993 1257
gb:S46825 prion protein [mink, Genomic, 2446 nt] (2) 788 893 1255
gb:Pigprpg Sus scrofa prion protein (PrP) gene, comple...(1) 790 897 1254
gb:Bovprp1 Bovine mRNA for prion protein (3) 992 1097 1254
gb:Humprprb Human prion protein gene, 5' end (1) 1252 1252 1252
gb:Tspripr T.streptsiceros gene for prion protein (1) 1108 1108 1250
gb:Musprpa Mouse (with short incubation period) prion ...(2) 874 973 1249
gb:Musprp Mouse prion protein (PrP) mRNA, complete cds (1) 870 969 1245
gb:U08952 Mustela putorius prion protein (PrP) gene, c...(1) 776 881 1243
gb:Tsprip T.strepsiceros gene for prion protein (1) 990 1091 1243
gb:S69654 Prn=prion protein [zitter rats, liver, Genom...(1) 1134 1233 1242
gb:Atu08293 Aotus trivirgatus prion protein gene, part...(1) 1000 1000 1242
gb:Musprpb Mouse (with long incubation period) prion p...(2) 867 966 1242
gb:Hamprq Syrian golden hamster scrapie (prion) protei...(2) 1109 1205 1214
gb:Hamprp Hamster scrapie prion (Prp 27-30) mRNA, 3' end (2) 1109 1205 1214
gb:Agu08309 Ateles geoffroyi prion protein gene, parti...(1) 979 979 1176
gb:Humprpra Human prion protein gene, 5' end (2) 941 941 1153
gb:Ratprp Rat prion-related protein (PrP) mRNA, 3' end (1) 1008 1107 1116
gb:Tioprp Trichosurus vulpecular prion protein (PrP) g...(3) 652 652 974
gb:Musprpc Mouse prion protein (PrP27-30) mRNA, partia...(2) 381 381 381
gb:Chkprlp Gallus gallus prion-like protein mRNA, comp...(1) 209 249 332
gb:Chkprion Chicken prion protein gene, complete cds (1) 211 251 323
gb:S80539 PRNP=PrP amyloid [human, Genomic Mutant, 291...(1) 270 270 278
gb:Dogbgall Canis beta-galactosides-binding lectin (LG...(5) 60 60 161
gb:Atms1 A.thaliana minisatellite sequence 1, chromoso...(1) 104 104 155
gb:Celrbp1 C.elegans mRNA for hnRNP like protein (1) 95 95 153
gb:Ggeap300 G.gallus EAP-300 mRNA (3) 58 58 153
gb:Dmp11 D.melanogaster gene for P11 protein, A1 relat...(3) 81 81 138
gb:Hse1s2 Equine Herpesvirus 1 Subtype 2,BamHI fragment (1) 117 117 138
gb:Dmhrb87 Drosphila Hrb87F mRNA for a member of the A...(1) 81 81 137
gb:Dmhnrib2 D.melanogaster gene for heterogeneous nucl...(3) 77 77 137
gb:Saidrd4d Saimiri boliviensis dopamine receptor D4 (...(4) 98 98 134
gb:Hse1s2 Equine Herpesvirus 1 Subtype 2,BamHI fragment (2) 115 115 132
gb:Hshox51 Human HOX 5.1 gene for HOX 5.1 protein (2) 68 89 129
gb:A16444 Xho-PstI fragment from THP-I cells (6) 65 65 129
gb:Hse1s2 Equine Herpesvirus 1 Subtype 2,BamHI fragment (3) 115 115 129
gb:Crewp6a Chlamydomonas eugametos WP6 mRNA, complete cds(4) 62 62 128
gb:Ornreprc Oreochromis niloticus DNA sequence, repeat...(1) 126 126 128
gb:Ornreprd Oreochromis niloticus DNA sequence, repeat...(1) 126 126 128
gb:Saidrd4b Saimiri boliviensis dopamine receptor D4 (...(4) 92 92 128
////////////////////////////////////////////////////////////////////////////////
gb:Ratcbva Rattus norvegicus vesicla-associate calmodu...(5) 107 107 120
gb:Bovap Bovine alkaline phosphatase (AP) mRNA, comple...(6) 52 52 120
gb:Ttnusainf T.thermophilus nusA-infB operon DNA (6) 64 76 120
gb:Tetrstel2 T.thermophila (type B) tandem repeat 'ccc...(1) 96 96 120
gb:Dyadhps D.yakuba Adh pseudogene (6) 39 39 119
! CPU time used:
! Database scan: 13:58:45.8
! Post-scan processing: 0:00:34.8
! Total CPU time: 13:59:22.9
! Output File: Humprp.TfastaHad we chosen to not suppress aligning the results, in addition to showing the entry list and translation frames the TFastA output would show the sequence location for each alignment. This information can be used to go back to the original nucleotide entry and check to see whether they correspond to actually translated areas. Sometimes the sequences found by TFastA will not show up in any other searches. This could be valuable information, especially if sometime during your peptide's evolutionary history it incorporated (was 'infected' by) any type of mobile DNA element. This comes to mind in my example because of frequent hits within the herpes virus family.
Next the example FastA output file:
% more humprp.fasta
(Peptide) FASTA of: humprp.pep from: 1 to: 253 September 26, 1995 17:36
P1;PRIO_HUMAN - MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C)
ID PRIO_HUMAN STANDARD; PRT; 253 AA.
AC P04156;
DT 01-NOV-1986 (REL. 03, CREATED)
DT 01-NOV-1986 (REL. 03, LAST SEQUENCE UPDATE)
DT 01-JUN-1994 (REL. 29, LAST ANNOTATION UPDATE) . . .
TO: SwissProt:* Sequences: 40,292 Symbols: 14,147,368 Word Size: 2
Scoring matrix: GenRunData:Fastapep.Cmp
Variable pamfactor used
Gap creation penalty: 12.0 Gap extension penalty: 4.0
Score Init1 Initn
(-) (+)
< 2 13 13:=======
4 0 0:
6 7 7:====
8 32 32:================
10 87 87:============================================
12 323 323:==================================================
14 426 426:==================================================
16 1374 1374:==================================================
18 2603 2603:==================================================
20 3562 3562:==================================================
/////////////////////////////////////////////////////////////////////
48 112 306:==================================================
50 86 257:===========================================+++++++
52 43 195:======================++++++++++++++++++++++++++++
54 36 164:==================++++++++++++++++++++++++++++++++
56 25 132:=============+++++++++++++++++++++++++++++++++++++
58 18 90:=========++++++++++++++++++++++++++++++++++++
60 12 79:======++++++++++++++++++++++++++++++++++
62 18 59:=========+++++++++++++++++++++
64 9 57:=====++++++++++++++++++++++++
66 7 45:====+++++++++++++++++++
68 8 37:====+++++++++++++++
70 6 21:===++++++++
72 4 19:==++++++++
74 3 15:==++++++
76 3 6:==+
78 2 9:=++++
80 2 10:=++++
> 80 26 39:=============+++++++
Mean score calculations exclude scores greater than 73
mean initn score: 26.5 (s.d. 8.47)
mean init1 score: 25.7 (s.d. 6.76)
1869 scores better than 50 saved
joining threshold: 28, optimization threshold: 19
The best scores are: init1 initn opt..
Swissprotein:Prio_Human MAJOR PRION PROTEIN PRECURSOR (P...1375 1375 1375
Swissprotein:Prip_Bovin MAJOR PRION PROTEIN 2 PRECURSOR ...1127 1127 1273
Swissprotein:Prio_Mesau MAJOR PRION PROTEIN PRECURSOR (P...1159 1255 1264
Swissprotein:Prio_Sheep MAJOR PRION PROTEIN PRECURSOR (PRP)1116 1116 1262
Swissprotein:Prio_Bovin MAJOR PRION PROTEIN 1 PRECURSOR ... 996 1101 1258
Swissprotein:Prio_Mouse MAJOR PRION PROTEIN PRECURSOR (P... 874 973 1249
Swissprotein:Prio_Rat MAJOR PRION PROTEIN (PRP) (FRAGMENT) 1008 1107 1116
Swissprotein:Prio_Chick MAJOR PRION PROTEIN HOMOLOG PREC... 211 251 323
Swissprotein:Grp2_Orysa GLYCINE-RICH CELL WALL STRUCTURA... 101 101 121
Swissprotein:Roa1_Drome HETEROGENEOUS NUCLEAR RIBONUCLEO... 85 85 121
Swissprotein:K1ci_Human KERATIN, TYPE I CYTOSKELETAL 9 (... 105 105 115
Swissprotein:Fbrl_Yeast FIBRILLARIN (NUCLEOLAR PROTEIN 1) 61 61 112
Swissprotein:Fbrl_Leima FIBRILLARIN 93 93 111
Swissprotein:Roa1_Scham HETEROGENEOUS NUCLEAR RIBONUCLEO... 81 88 110
Swissprotein:Grp_Arath GLYCINE-RICH CELL WALL STRUCTURAL... 110 110 110
Swissprotein:K1c0_Xenla KERATIN 3, TYPE I CYTOSKELETAL 5... 108 108 109
Swissprotein:K1c1_Mouse KERATIN, TYPE I CYTOSKELETAL 59 KD 76 76 107
Swissprotein:Roab_Xenla HETEROGENEOUS NUCLEAR RIBONUCLEO... 57 57 106
Swissprotein:Ebn1_Ebv EBNA-1 NUCLEAR PROTEIN 103 103 106
Swissprotein:Roaa_Xenla HETEROGENEOUS NUCLEAR RIBONUCLEO... 57 57 106
Swissprotein:Vg38_Bpox2 RECEPTOR RECOGNIZING PROTEIN (PR... 49 49 106
Swissprotein:Grp_Horvu GLYCINE-RICH CELL WALL STRUCTURAL... 82 82 105
Swissprotein:K1cj_Human KERATIN, TYPE I CYTOSKELETAL 10 ... 84 84 104
Swissprotein:Fbrl_Mouse FIBRILLARIN (NUCLEOLAR PROTEIN 1) 66 66 104
Swissprotein:Mysb_Acaca MYOSIN HEAVY CHAIN IB 76 76 103
Swissprotein:Fib1_Petma FIBRINOGEN ALPHA-1 CHAIN PRECURS... 50 50 103
Swissprotein:K2ce_Human KERATIN, TYPE II CYTOSKELETAL 2 ... 71 71 103
Swissprotein:Grp1_Phavu GLYCINE-RICH CELL WALL STRUCTURA... 96 96 102
Swissprotein:Ded1_Yeast PUTATIVE ATP-DEPENDENT RNA HELIC... 56 79 102
Swissprotein:Grp1_Orysa GLYCINE-RICH CELL WALL STRUCTURA... 87 87 102
Swissprotein:K1cj_Bovin KERATIN, TYPE I CYTOSKELETAL VIB... 81 81 99
Swissprotein:Gr10_Brana GLYCINE-RICH RNA-BINDING PROTEIN 10 80 80 99
Swissprotein:Fibs_Nepcl SILK FIBROIN (FRAGMENT) 79 79 99
Swissprotein:Oc3n_Mouse NERVOUS-SYSTEM SPECIFIC OCTAMER-... 42 42 98
Swissprotein:Hmpr_Human HOMEOBOX-POU DOMAIN PROTEIN RDC-1 67 67 98
Swissprotein:K1cj_Mouse KERATIN, TYPE I CYTOSKELETAL 10 ... 67 67 98
Swissprotein:M130_Strpu MESENCHYME-SPECIFIC CELL SURFACE... 89 89 98
Swissprotein:Grp8_Arath GLYCINE-RICH RNA-BINDING PROTEIN 8 77 77 97
Swissprotein:Spt5_Yeast TRANSCRIPTION INITIATION PROTEIN... 62 83 97
Swissprotein:Vnua_Prvka PROBABLE NUCLEAR ANTIGEN 70 70 95
Swissprotein:Prlb_Achly BETA-LYTIC METALLOENDOPEPTIDASE ... 37 37 95
Swissprotein:Oc3n_Human NERVOUS-SYSTEM SPECIFIC OCTAMER-... 42 42 94
Swissprotein:Vasa_Drome VASA PROTEIN 43 43 94
Swissprotein:Grp2_Phavu GLYCINE-RICH CELL WALL STRUCTURA... 90 90 94
Swissprotein:Grp2_Sorvu GLYCINE-RICH RNA-BINDING PROTEIN 2 62 80 94
Swissprotein:Bind_Lytva BINDIN PRECURSOR 59 91 94
Swissprotein:K2c1_Mouse KERATIN, TYPE II CYTOSKELETAL 67... 74 74 94
Swissprotein:Egg1_Schja EGGSHELL PROTEIN 1 PRECURSOR 59 87 93
Swissprotein:Lori_Mouse LORICRIN 92 92 93
Swissprotein:Knh2_Bovin KININOGEN, HMW II PRECURSOR (THI... 42 42 93
Swissprotein:Grp1_Pethy GLYCINE-RICH CELL WALL STRUCTURA... 91 91 92
Swissprotein:Saa_Anapl SERUM AMYLOID A PROTEIN (SAA) 56 56 92
Swissprotein:Grp1_Sorvu GLYCINE-RICH RNA-BINDING PROTEIN... 78 78 92
Swissprotein:Glhr_Antel PROBABLE GLYCOPROTEIN HORMONE G ... 60 60 92
Swissprotein:Sox4_Human SOX-4 PROTEIN 59 59 90
Swissprotein:Cea6_Ecoli COLICIN E6 (EC 3.1.-.-) (RIBONUC... 62 62 90
////////////////////////////////////////////////////////////////////////////
Swissprotein:Fbrl_Human FIBRILLARIN (34 KD NUCLEOLAR SCL... 53 53 80
Swissprotein:Agal_Cyate ALPHA-GALACTOSIDASE PRECURSOR (E... 76 76 80
Swissprotein:Gar1_Schpo GAR1 PROTEIN 64 64 80
Swissprotein:Saa3_Mouse SERUM AMYLOID A-3 PROTEIN PRECURSOR 40 40 80
Swissprotein:Coaa_Bppf1 POSSIBLE COAT PROTEIN A PRECURSOR 52 52 80
Swissprotein:Hmdf_Drome HOMEOTIC DEFORMED PROTEIN 52 52 79
Swissprotein:K2c1_Human KERATIN, TYPE II CYTOSKELETAL 1 ... 55 55 79
Swissprotein:Ca13_Mouse PROCOLLAGEN ALPHA 1(III) CHAIN P... 37 50 79
Swissprotein:Lyg_Cygat LYSOZYME G (EC 3.2.1.17) (1,4-BET... 32 32 79
Swissprotein:Xyna_Rumfl BIFUNCTIONAL ENDO-1,4-BETA-XYLAN... 56 56 79
Swissprotein:K1cp_Human KERATIN, TYPE I CYTOSKELETAL 16 ... 62 62 79
Swissprotein:Rog_Human HETEROGENEOUS NUCLEAR RIBONUCLEOP... 36 36 79
Swissprotein:Mycl_Mouse L-MYC PROTO-ONCOGENE PROTEIN 33 33 79
Swissprotein:P_Human P PROTEIN 27 27 79
! CPU time used:
! Database scan: 0:33:02.8
! Post-scan processing: 0:00:11.5
! Total CPU time: 0:33:16.2
! Output File: humprp.fastaBoth FastA output files show a histogram of the score distribution, then a sorted list of the top scores and finally, if alignments are not suppressed, BestFit style alignments of the score list. Next, an abridged version of the example WordSearch output:
% more humprp.word (Peptide) WORDSEARCH of: /disk3/usr/local/people/thompson/bc378/week10/humprp.pep check: 8781 from: 1 to: 253 P1;PRIO_HUMAN - MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C) ID PRIO_HUMAN STANDARD; PRT; 253 AA. AC P04156; DT 01-NOV-1986 (REL. 03, CREATED) DT 01-NOV-1986 (REL. 03, LAST SEQUENCE UPDATE) DT 01-JUN-1994 (REL. 29, LAST ANNOTATION UPDATE) . . . TO: SwissProt:* Sequences: 40,292 Total-length: 14,147,368 September 26, 1995 17:02 Word-size: 2 Words: 62867213 Diagonals: 11,864,381 Total-diagonals: 24,300,70 0 Integral-width: 3 Alphabet: 20 List-size: 100 CPU minutes: 11.93 Sequence Strd Diag Score Width Documentation .. Swissprotein:Prio_Human + -1 278 53 MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C) Swissprotein:Prip_Bovin + 3 265 17 MAJOR PRION PROTEIN 2 PRECURS OR (PRP) (MAJOR SCRAPIE-ASSOCIATED FIBRIL Swissprotein:Prio_Sheep + 3 264 25 MAJOR PRION PROTEIN PRECURSOR (PRP) Swissprotein:Prio_Mesau + 0 255 15 MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C) Swissprotein:Prio_Mouse + 0 240 21 MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C) Swissprotein:Prio_Rat + -28 239 5 MAJOR PRION PROTEIN (PRP) (FR AGMENT) Swissprotein:Prio_Bovin + 11 226 15 MAJOR PRION PROTEIN 1 PRECURS OR (PRP) (MAJOR SCRAPIE-ASSOCIATED FIBRIL Swissprotein:Grp1_Pethy + 46 72 30 GLYCINE-RICH CELL WALL STRUCT URAL PROTEIN 1 PRECURSOR Swissprotein:Grp1_Pethy + 176 71 36 GLYCINE-RICH CELL WALL STRUCT URAL PROTEIN 1 PRECURSOR Swissprotein:Grp_Arath + 111 71 3 GLYCINE-RICH CELL WALL STRUCT URAL PROTEIN PRECURSOR Swissprotein:Grp1_Phavu + 44 70 7 GLYCINE-RICH CELL WALL STRUCT URAL PROTEIN 1.0 PRECURSOR (GRP 1.0) Swissprotein:Grp_Arath + 115 69 2 GLYCINE-RICH CELL WALL STRUCT URAL PROTEIN PRECURSOR Swissprotein:Grp_Arath + 82 68 7 GLYCINE-RICH CELL WALL STRUCT URAL PROTEIN PRECURSOR Swissprotein:Grp1_Pethy + 82 67 33 GLYCINE-RICH CELL WALL STRUCT URAL PROTEIN 1 PRECURSOR Swissprotein:Grp_Arath + 91 67 3 GLYCINE-RICH CELL WALL STRUCT URAL PROTEIN PRECURSOR Swissprotein:Prio_Chick + 13 67 19 MAJOR PRION PROTEIN HOMOLOG P RECURSOR (PR-LP) (ACETYLCHOLINE /////////////////////////////////////////////////////////////////////////////// Swissprotein:Lori_Mouse + 44 57 11 LORICRIN Swissprotein:Lori_Human + 45 57 5 LORICRIN Swissprotein:Lori_Human + -31 57 4 LORICRIN Swissprotein:Grp_Arath + 135 57 2 GLYCINE-RICH CELL WALL STRUCT URAL PROTEIN PRECURSOR Swissprotein:Grp2_Phavu + 180 56 21 GLYCINE-RICH CELL WALL STRUCT URAL PROTEIN 1.8 PRECURSOR (GRP 1.8) Swissprotein:Grp2_Phavu + 99 56 6 GLYCINE-RICH CELL WALL STRUCT URAL PROTEIN 1.8 PRECURSOR (GRP 1.8) Swissprotein:Grp2_Phavu + 90 56 4 GLYCINE-RICH CELL WALL STRUCT URAL PROTEIN 1.8 PRECURSOR (GRP 1.8)
Notice that the various lists have entries in common but also have substantial differences. Also notice that in many cases WordSearch finds multiple segments. These are reasons why it is always a good idea to run as many different types of analyses as practical. We ran WordSearch with the Plot option to generate a graphic of the score distribution of that procedure. Make sure to initialize your graphics configuration first (remember setplot) and then type figure of wordsearch.figure to draw the plot:
% figure wordsearch.figure FIGURE makes figures and posters by drawing graphics and text together. You can include output from other GCG graphics programs as part of a figure. Process set to plot with TEK4107 attached to TERM: using the TEKTRONIX graphic interface. When your TEK4107 attached to tty is ready, press <rtn>. <rtn>

The asterisk on the abscissa at a score of about 56 indicates your lowest reported score and helps you to decide if you ran your list large enough. Normally you want your list size big enough to include some of the population of random low scores to help ascertain the significance of the higher scores. In my case above, the list was just big enough to catch the upward slope of the random scores; therefore, the list size was appropriate. The inset shows a blowup of the high score end of the graph -- these are the best alignments found by the program, not the worst!
WordSearch does not illustrate its alignments with sequence pairs. To see only the most interesting alignments found by WordSearch, first use pico to edit your .word file by removing all of the entries near the top that are clearly the same protein as your query. Do not edit out the header information in the file. It may be a good idea to leave the top-most hit as a reference sequence. Also remove the sequences near the bottom of the list which are obviously totally dissimilar. However, do not remove those sequences which are somewhat similar in the middle of the file; they are the interesting ones. We are mainly interested here in how these somewhat similar sequences align with your query. In order to visualize the alignments run the program segments, providing it with your edited WordSearch output file's name; accept the default output file name. This process is shown for the prion example below:
% segments humprp.word Segments aligns and displays the segments of similarity found by WordSearch. What should I call the output file (* humprp.pairs *) ? <rtn> Aligning ............-.. Swissprotein:Prio_Human 253 aa Gaps: 0 Quality: 379.5 - Length: 253 Aligning ............-.. Swissprotein:Grp1_Pethy 384 aa Gaps: 8 Quality: 96.1 - Length: 250 Aligning ............-. Swissprotein:Grp1_Pethy 384 aa Gaps: 6 Quality: 87.6 - Length: 234 Aligning ...........-.. Swissprotein:Grp_Arath 338 aa Gaps: 4 Quality: 91.2 - Length: 234 Aligning ..........-. Swissprotein:Grp1_Phavu 252 aa Gaps: 6 Quality: 84.8 - Length: 215 Aligning ...........-.. Swissprotein:Grp_Arath 338 aa Gaps: 3 Quality: 89.3 - Length: 225 Aligning ............-.. Swissprotein:Grp_Arath 338 aa Gaps: 6 Quality: 92.0 - Length: 248 Aligning ............-.. Swissprotein:Grp1_Pethy 384 aa Gaps: 8 Quality: 96.3 - Length: 294 Aligning ............-.. Swissprotein:Grp_Arath 338 aa Gaps: 5 Quality: 91.2 - Length: 251 Aligning ............-.. Swissprotein:Prio_Chick 273 aa Gaps: 12 Quality: 165.1 - Length: 267 Aligning ........-. Swissprotein:Grp_Arath 338 aa Gaps: 2 Quality: 75.3 - Length: 177 Aligning .........-. Swissprotein:Grp_Arath 338 aa Gaps: 4 Quality: 82.6 - Length: 184 Aligning .........-. Swissprotein:Grp_Arath 338 aa Gaps: 6 Quality: 83.9 - Length: 201 Aligning ............-.. Swissprotein:Grp1_Pethy 384 aa Gaps: 7 Quality: 87.1 - Length: 249 Aligning ............-.. Swissprotein:Grp2_Phavu 465 aa Gaps: 5 Quality: 95.5 - Length: 247 Aligning ............-.. Swissprotein:Grp2_Phavu 465 aa Gaps: 8 Quality: 83.4 - Length: 250 Aligning ............-.. Swissprotein:Grp_Arath 338 aa Gaps: 3 Quality: 94.7 - Length: 244 Aligning ............-. Swissprotein:Grp_Horvu 200 aa Gaps: 11 Quality: 80.3 - Length: 194 Aligning ..........-.. Swissprotein:Grp_Arath 338 aa Gaps: 5 Quality: 85.9 - Length: 215
Use more to display the generated file. The prion example follows in an abridged fashion:
% more humprp.pairs
(BestFit) SEGMENTS from: Humprp.Word September 27, 1995 11:16
(Peptide) WORDSEARCH of: Disk03:[Thompson.Bc378.Week_10]Humprp.Pep check:
8781 from: 1 to: 253
P1;PRIO_HUMAN - MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C)
ID PRIO_HUMAN STANDARD; PRT; 253 AA.
AC P04156;
DT 01-NOV-1986 (REL. 03, CREATED)
DT 01-NOV-1986 (REL. 03, LAST SEQUENCE UPDATE) . . .
AvMatch: 0.54 AvMisMatch: -0.40 GapWeight: 2.00 LengthWeight: 0.10 ..
Humprp.Pep check: 8781 from: 1 to: 253
Swissprotein:Prio_Human check: 8781 from: 1 to: 253
MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C)
Gaps: 0 Quality: 379.5 Ratio: 1.500 Score: 278 Width: 53 Limits: +/-54
. . . . .
1 MANLGCWMLVLFVATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYP 50
||||||||||||||||||||||||||||||||||||||||||||||||||
1 MANLGCWMLVLFVATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYP 50
. . . . .
51 PQGGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTHSQWN 100
||||||||||||||||||||||||||||||||||||||||||||||||||
51 PQGGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTHSQWN 100
. . . . .
101 KPSKPKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPIIHFGSDYEDRYY 150
||||||||||||||||||||||||||||||||||||||||||||||||||
101 KPSKPKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPIIHFGSDYEDRYY 150
. . . . .
151 RENMHRYPNQVYYRPMDEYSNQNNFVHDCVNITIKQHTVTTTTKGENFTE 200
||||||||||||||||||||||||||||||||||||||||||||||||||
151 RENMHRYPNQVYYRPMDEYSNQNNFVHDCVNITIKQHTVTTTTKGENFTE 200
. . . . .
201 TDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSPPVILLISFLIFL 250
||||||||||||||||||||||||||||||||||||||||||||||||||
201 TDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSPPVILLISFLIFL 250
251 IVG 253
|||
251 IVG 253
Humprp.Pep check: 8781 from: 1 to: 253
Swissprotein:Grp1_Pethy check: 1852 from: 46 to: 384
GLYCINE-RICH CELL WALL STRUCTURAL PROTEIN 1 PRECURSOR<
Gaps: 8 Quality: 96.1 Ratio: 0.420 Score: 72 Width: 30 Limits: +/-31
. . . . .
2 ANLGCWMLVLFVATW...SDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNR 48
:.:|.: . |. .: :::| . . .||:..||: .|.|..||.
52 GRFGGRG.PSFGRGRGAGGGFGGGAGGGAGGGLGGGGGLGGGGGAGGGGG 100
. . . . .
49 YPPQG..GGGWGQPHGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTH 96
....| |||:|.. ||| |.. ||| | . ||| |.. ||| |.|:|.
101 LGGGGGAGGGFGGGAGGGAGGGLGGGGGLGGGGGGGAGGGGGVGGGAGSG 150
. . . . .
97 SQWNKPSKPKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPIIHFGSDYE 146
:.:. .: ::|:|:||:.||| ||: |:: : : ||:.:
151 GGFGAGGG.......VGGGAGAGGGVGGGGGFGGGGGGG...VGGGSGHG 190
. . . . .
147 DRY.YRENMHRYPNQVYYRPMDEYSNQNNFVHDCVNITIKQHTVTTTTKG 195
: : :.: :.... .::: :.... . :.:. .....|
191 GGFGAGGGVGGGAGGGLGGGVGGGGGGGSGGGGGIG........GGSGHG 232
. . . . .
196 ENFTETDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSPPVILLIS 245
:.|...:. :: .|:. . .. : ::.: . |:::. |:... : :
233 GGFGAGGG..VGGGVGGGAAGGGGGGGGGGGGGGGGLGGGSGHGGGFGAG 280
Humprp.Pep check: 8781 from: 1 to: 253
Swissprotein:Grp1_Pethy check: 1852 from: 176 to: 384
GLYCINE-RICH CELL WALL STRUCTURAL PROTEIN 1 PRECURSOR
Gaps: 6 Quality: 87.6 Ratio: 0.429 Score: 71 Width: 36 Limits: +/-37
. . . . .
14 ATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGG 63
:. :: |:. ....||:..||: .|.|:. |. ...||||.|.. |
176 GGGGGGGVGGGSGHGGGFGAGGGVGGGAGGGLGGGVGGGGGGGSGGGGGI 225
. . . . .
64 GWGQPHGGGWGQPH......GGGWGQPHGGGWGQGGGTHSQWNKPSKPKT 107
| | .||||:|.. ||| :.. ||| |.|||. :.:. .| . .
226 GGGSGHGGGFGAGGGVGGGVGGGAAGGGGGGGGGGGGGGGGLGGGSGHGG 275
. . . . .
108 NMKHMAG.AAAAGAVVGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHR 156
.: :| :::|::.||| ||: |:: : : ||:.:: :
276 GFGAGGGVGGGAAGGVGGGGGFGGGGGGG...VGGGSGHGGGF....... 315
. . . . .
157 YPNQVYYRPMDEYSNQNNFVHDCVNITIKQHTVTTTTKGENFTETDVKMM 206
.... | :... .: ...... |:.:.:... :
316 ..............GAGGGVGGGAGGGLG..GGGGAGGGGGIGGGHGGGF 349
. . .
207 ERVVEQMCITQYERESQAYYQRGSSMVLFSSPPV 240
: .|: ...: : .| .:| ::. |:...
350 GVGVG....IGIGVGVGAGAGHGVGVGSGSGSGG 379
Humprp.Pep check: 8781 from: 1 to: 253
Swissprotein:Grp_Arath check: 6121 from: 111 to: 338
GLYCINE-RICH CELL WALL STRUCTURAL PROTEIN PRECURSOR
Gaps: 4 Quality: 91.2 Ratio: 0.409 Score: 71 Width: 3 Limits: +/-4
. . . . .
2 ANLGCWMLVLFVATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYPP 51
:.:|. . .:. ::|| . . .|| ..||: .|:|:. |. ..
116 GGIGG...GAGGGSGGGLGGGIGGGAGGGAGGGGGLGGGHGGGIGGGAGG 162
. . . . .
52 QGGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTHSQWNK 101
.:|||:|..|||| |.. |||.|.. ||| |.. ||| |.|||. :. .
163 GAGGGLGGGHGGGIGGGAGGGSGGGLGGGIGGGAGGGAGGGGGAGGGGGL 212
. . . . .
102 PSKPKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPIIHFGSDYEDRYYR 151
.: . ..: ||:: :|:..|| || : |:| : : |:::::
213 GGGHGGGFGGGAGGGLGGGAGGGTGGGFGGGAGGGAGGGAGGGFGGGAGG 262
. . . . .
152 ENMHRYPNQVYYRPMDEYSNQNNFVHDCVNITIKQHTVT...TTTKGENF 198
:. :... : :: :.... . :... . .. .|. ....|:.|
263 GAGGGFGGG....AGGGAGGGAGGGFGGGAGGGHGGGVGGGFGGGSGGGF 308
. . .
199 TETDVKMMERVVEQMCITQYERESQAYYQRGSSM 232
.:.. : ..:. . ..:: ::.| . |:::
309 GGGA....GGGAGGGAGGGFGGGGGAGGGFGGGF 338
///////////////////////////////////////////////////////////////////////////
Humprp.Pep check: 8781 from: 1 to: 253
Swissprotein:Grp_Arath check: 6121 from: 91 to: 338
GLYCINE-RICH CELL WALL STRUCTURAL PROTEIN PRECURSOR
Gaps: 5 Quality: 91.2 Ratio: 0.375 Score: 67 Width: 3 Limits: +/-4
. . . . .
2 ANLGCWMLVLFVATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYPP 51
:.:|. . .:. ::|| .. . .|| ..|:: |.| .||. .:
96 GGIGG...GAGGGAGGGLGGGHGGGIGGGAGGGSGGGLGGGIGGGAGGGA 142
. . . . .
52 QGGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTHSQWNK 101
.||||:|..|||| |.. ||| |.. ||| |.. ||| |.|:|. . .
143 GGGGGLGGGHGGGIGGGAGGGAGGGLGGGHGGGIGGGAGGGSGGGLGGGI 192
. . . . .
102 PSKPKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPIIHFGSDYEDRYYR 151
.: : .. :||:::|::.|| || : |:| : |:: :: :
193 GGGAGGGAGGGGGAGGGGGLGGGHGGGFGGGAGGGLGGGAGGGTGGGFGG 242
. . . . .
152 ENMHRYPNQVYYRPMDEYSNQNNFVHDCVNITIKQHTVTTTTKGENFTET 201
:. ... : ::::.... . ....:. . ...... |:.|.:.
243 GAGGGAGGG....AGGGFGGGAG.GGAGGGFGGGAGGGAGGGAGGGFGGG 287
. . . . .
202 DVKMMERVV....EQMCITQYERESQAYYQRGSSMVL..FSSPPVILLIS 245
.. : .| :. : ..:: :..: . |.: .: :::.. : :
288 AGGGHGGGVGGGFGGGSGGGFGGGAGGGAGGGAGGGFGGGGGAGGGFGGG 337
246 F 246
|
338 F 338
Humprp.Pep check: 8781 from: 1 to: 253
Swissprotein:Prio_Chick check: 9640 from: 13 to: 273
MAJOR PRION PROTEIN HOMOLOG PRECURSOR (PR-LP) (ACETYLCHOLINE
Gaps: 12 Quality: 165.1 Ratio: 0.682 Score: 67 Width: 19 Limits: +/-20
. . . . .
11 LFVATWSDLGLCKK...RPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGW 57
|::|. .|::|:|| :|..|||..|: | |: ....| .. |. . .
13 LLLAACTDVALSKKGKGKPSGGGWGAGSHRQPSYPRQPGYPHNPGYPHNP 62
. . . . .
58 GQPHGGGWGQ....PHGGGWGQ....PHGGGWGQPHGGGWGQGGGTHSQW 99
| ||..|:.: ||..|:.| ||..|: .. |.|:..::|. :
63 GYPHNPGYPHNPGYPHNPGYPQNPGYPHNPGY.PGWGQGYNPSSGGSYHN 111
. . . . .
100 NKPSK.PKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPIIHFGSDYEDR 148
.||.| ||||:||:|||||||||||||||| :|..|| .||:|. | |
112 QKPWKPPKTNFKHVAGAAAAGAVVGGLGGYAMGRUNIXGMNYHFDSPDEYR 161
. . . . .
149 YYRENMHRYPNQVYYRPMDEYSN...QNNFVHDCVNITIKQHTVTTTTK. 194
::.|| ||||.|||| :||. |: || ||.|||:.:..:....|
162 WWSENSARYPNRVYYR...DYSSPVPQDVFVADCFNITVTEYSIGPAAKK 208
. . . . .
195 .......GENFTETDV..KMMERVVEQMCITQYERESQAYYQRGSSMVLF 235
:.| ||.:: |::.:|: :||: || || |. :|:: |
209 NTSEAVAAANQTEVEMENKVVTKVIREMCVQQY.RE....YRLASGIQLH 253
.
236 SSPPVILLISFLIFLIV 252
|: .:| :|::|..
254 ..PADTWLAVLLLLLTT 268
///////////////////////////////////////////////////////////////////////////
It's interesting to note that all of the non-prion hits in the WordSearch output are a glycine-rich cell wall proteins, whereas the FastA and BLAST approaches found these proteins plus some other unrelated ones. WordSearch seems to be particularly sensitive to the glycine repeat regions of the prion sequence. Also notice that multiple segments of the same glycine-rich protein were aligned to the prion protein. As stated earlier, it is good to gather as much different data as you can and not entirely rely on any one algorithm.
5.a. Interpreting Results -- what is significant.
Now find three different higher scoring sequences, one each, from each of the three programs BLAST, FastA, and WordSearch, which are not obviously the same sort of protein as your beta-amyloid query. Only do this for the beta-amyloid protein. We are going to experiment with various methods of analyzing similarity significance with these sequences. I will again illustrate with the prion protein as you follow along my examples. In your case, only do these procedures with the beta-amyloid protein nearest neighbors. There are no `correct' entries to choose, just try to get three completely different interesting ones from just below the obvious matches for each query. Relevant lines that show what I picked as interesting from the prion search examples follow.
From BLAST, the top-most, non-obvious match:
sp|P27177|PRIO_CHICK MAJOR PRION PROTEIN HOMOLOG PRECURS... 125 3.1e-17 3
Even though the above sequence is a prion homolog it is quite a bit different than all the other prions based on their scores and, therefore, it will be an interesting comparison. From the FastA output we'll pick something totally different:
Swissprotein:Roa1_Drome HETEROGENEOUS NUCLEAR RIBONUCLEO... 85 85 121
And from WordSearch another different protein:
Swissprotein:Grp1_Pethy + 46 72 30 GLYCINE-RICH CELL WALL STRUCT
5.a.1. Dot matrix methods: Compare and DotPlot.
Dot matrix analysis is one of the few ways to identify other elements beyond what the dynamic programming algorithms show to be similar between two sequences.
GCG implements dot matrix methods with two programs. Compare generates the data that serves as input to DotPlot which actually draws the matrix. Compare your query sequences to each of its three near neighbors (as described above) using these methods. Start the program by typing compare; supply it with the appropriate sequence names and specify output names that will allow you to keep track of the files. Rerun the program increasing the stringency of the comparisons until the number of points generated is of the same magnitude as the length of the longest sequence being compared. In the first run below, I found that a stringency of 15 out of 30 resulted in 280 points -- very close to the sequences' lengths being compared. In this case, wherever the average of match scores within the window exceeds 15/30's, a point is drawn at the middle of the window; the window is then slid over one position and the process is repeated. Just as in all windowing algorithms, in general you want to use a window of approximately the same size as the feature that you're trying to recognize. For these runs you can leave the window at its default setting of 30; that will be appropriate as an example.
Save the three best output files from this program for the beta-amyloid comparisons. This is three files total: one each from the three programs BLAST, FastA, and WordSearch, all for the beta-amyloid protein query. I will illustrate with two different prion examples:
% compare
Compare compares two protein or nucleic acid sequences and creates a
file of the points of similarity between them for plotting with
DotPlot. Compare finds the points using either a
window-stringency or a word match criterion. The word comparison
is 1,000 times faster than the window-stringency comparison, but
somewhat less sensitive.
COMPARE what horizontal sequence ? humprp.pep
Begin (* 1 *) ? <rtn>
End (* 253 *) ? <rtn>
to what vertical sequence (* humprp.pep *) ? sw:prio_chick
Begin (* 1 *) ? <rtn>
End (* 273 *) ? <rtn>
What comparison window size (* 30 *) ? <rtn>
What stringency (* 9.0 *) ? 15
What should I call the output file (* humprp.pnt *) ? humprp-prio_chick.pnt
Number of points: 280
Writing . Next, provide this file as input to DotPlot. Type dotplot and provide the correct filename; accept all program defaults. The human prion cross chicken prion-homolog protein comparison example follows:
% dotplot
DOTPLOT makes a dot-plot with the output file from COMPARE,
FOLD, or STEMLOOP.
Process set to plot with TEK4107 attached to TERM:
using the TEKTRONIX graphic interface.
DOTPLOT what point file ? humprp-prio_chick.pnt
humprp-prio_chick.pnt contains COMPARE results of
Axis Name Check Start End Dir
Horizontal Humprp.Pep 8781 1 253 for
Vertical Prio_Chick 9640 1 273 for
Window . . . . . . . . . 30
Stringency . . . . . . . 15.0
Number of points . . . . 280
Percent of possible . . 0.405
The minimum density for a one-page plot is
312.5 bases-100 platen units on each axis.
What point density would you like (* 312.50 *) ? <rtn>
DOTPLOT will take 1 pages. Would you like to:
P)lot the points
D)ifferent density
G)et another point file to plot
Q)uit
Please select one (* P *): <rtn>
When your TEK4107 attached to tty is ready, press <rtn>. <rtn>The humprp-prio_chick dotplot follows:

Notice that running the comparison at an appropriate stringency, in this case 15 out of 30, produces a very clean plot with very little confusing noise. Still, sometimes interpreting a dotplot can be a major accomplishment in itself -- just remember that diagonals are regions of similarity between the two sequences and any diagonal off the main center line is indicative of regions which do not correspond in linear placement between the two sequences yet are still similar. The regions of similarity between the human prion protein and the homolog in chicken are very clear in the dotplot. Notice that most coincide in order although one chicken region is seen to repeat strongly four times in the human sequence.
The next example does not show such clear-cut similarity:
% compare humprp.pep sw:roa1_drome
Compare compares two protein or nucleic acid sequences and creates a
file of the points of similarity between them for plotting with
DotPlot. Compare finds the points using either a
window-stringency or a word match criterion. The word comparison
is 1,000 times faster than the window-stringency comparison, but
somewhat less sensitive.
Begin (* 1 *) ? <rtn>
End (* 253 *) ? <rtn>
Begin (* 1 *) ? <rtn>
End (* 365 *) ? <rtn>
What comparison window size (* 30 *) ? <rtn>
What stringency (* 9.0 *) ? 17
What should I call the output file (* humprp.pnt *) ? humprp-roa1_drome.pnt
Number of points: 269
Writing .In this run I had to increase the stringency to 17 of 30 in order to filter appropriately. The resultant dotplot is shown below:

The strong column of diagonals in this example shows that one local region of Roa1_Drome repeats itself several times in the prion protein. These probably relate to the very high level of glycine repeat sequences in both proteins. Perform similar analyses with your beta-amyloid protein and its three interesting partners. When running all the dotplots, take notes of those particular regions in the proteins that appear significantly similar. For example, as noted in the above plot, the region of Roa1_Drome from about residue 210 through 250 repeats strongly in humprp from around residue 40 through residue 100. After performing dot matrix analysis on the three beta-amyloid protein comparisons, rerun the DotPlots with the same optimum parameters as you determined previously; however, add the option -figure=(your last name).dotplot1, 2 or 3 respectively, to the command line to create three Figure output files that will be sent to teacher for evaluation.
5.a.2 The dynamic programming alignment algorithms and significance estimation.
Gap and BestFit are GCG's implementation of pairwise dynamic programming -- use the right program for the job at hand. There is a difference between these two algorithms! Gap is a 'global' alignment scheme while BestFit is more 'local.' Using one versus the other implies that you are looking for distinctly different relationships. Know what this mean. If you already know that the full length of two sequences is pretty close, that they probably belong to the same gene family, then Gap is the program of choice; if you only suspect an area of one is similar to another, then you should use BestFit. To force BestFit to be even more local you can specify the more stringent alternative symbol comparison table pam250.cmp located in the GCG logical directory GenMoreData. Both programs can also generate "gapped" output files in standard sequence formats; this can be very handy as direct input to other GCG routines -- particularly multiple sequence analysis programs.
GCG provides a way to estimate alignment significance with BestFit and Gap. When running either program specify -randomizations=100 on the command line and the second input sequence will be jumbled 100 times after the initial alignment is produced. Comparing the quality scores of the randomized alignments to the initial alignment can help give a feeling for the relative meaning of the scores. An old rule-of-thumb that people often use is, if the actual score is much more than three standard deviations above the mean of the randomized scores, the analysis may be significant; if it is much more than five, than it probably is significant; and if it is above nine, than it definitely is significant. This distance above the mean is often called the "Z score" and can be calculated with the following formula:
Z score = [ ( actual score ) - ( mean of randomized scores) ]
-------------------------------------------------------
( standard deviation of randomized score distribution )This type of significance estimation is known as a Monte Carlo approach. It has many implicit statistical problems; however, few practical alternatives exist. I will use BestFit with Randomizations to first illustrate the prion and Roa1_Drome example. Before beginning with your own data though, study your dotplot notes from before. This approach works best when applied to areas where you already know some similarity exists and you wish to further test that similarity. Therefore, restrict your analyses to those regions identified by the dotplots. However, remember that dotplots show us all the regions that are similar, whereas dynamic programming only gives us one optimal solution. Type bestfit -random=100 and supply the appropriate beta-amyloid sequence comparison data names to run the program. Be sure to restrict the analyses to those regions of each sequence identified by your dotplots. Accept the rest of the program's defaults; however, give the output file names that make sense to you while keeping the .pair extension. The process follows below for the human prion and Drosophila ribonucleoprotein Roa1_Drome comparison:
% bestfit -random=100
BestFit makes an optimal alignment of the best segment of similarity
between two sequences. Optimal alignments are found by inserting gaps to
maximize the number of matches using the local homology algorithm of
Smith and Waterman.
BESTFIT of what sequence 1 ? humprp.pep
Begin (* 1 *) ? 40
End (* 253 *) ? 100
to what sequence 2 (* humprp.pep *) ? sw:roa1_drome
Begin (* 1 *) ? 210
End (* 365 *) ? 250
What is the gap creation penalty (* 3.00 *) ? <rtn>
What is the gap extension penalty (* 0.10 *) ? <rtn>
What should I call the paired output display file (* humprp.pair *) ? humprp-roa1
.pair
Aligning ..-.
Gaps: 1
Quality: 29.2
Quality Ratio: 0.712
% Similarity: 56.098
Length: 46
Randomized alignment Quality
1 24.0
2 24.9
3 23.8
4 28.7
5 25.0
////////////////////////////////////////////////////////////
93 25.3
94 24.4
95 27.5
96 25.3
97 26.0
98 23.6
99 26.1
100 24.2
Average quality based on 100 randomizations: 25.2 +/- 1.9The output file is shown below. In this example, the randomized quality scores are all fairly close to the actual score. The Z score calculates to be 2.1; therefore, the interpretation is that the similarity is not significant.
% more humprp-roa1.pair
BESTFIT of: humprp.pep check: 8781 from: 40 to: 100
P1;PRIO_HUMAN - MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C)
ID PRIO_HUMAN STANDARD; PRT; 253 AA.
AC P04156;
DT 01-NOV-1986 (REL. 03, CREATED)
DT 01-NOV-1986 (REL. 03, LAST SEQUENCE UPDATE)
DT 01-JUN-1994 (REL. 29, LAST ANNOTATION UPDATE) . . .
to: roa1_drome check: 8848 from: 210 to: 250
P1;ROA1_DROME - HETEROGENEOUS NUCLEAR RIBONUCLEOPROTEIN A1 (HNRNP CORE PROTEIN
A1-A)
ID ROA1_DROME STANDARD; PRT; 365 AA.
AC P07909;
DT 01-AUG-1988 (REL. 08, CREATED)
DT 01-AUG-1988 (REL. 08, LAST SEQUENCE UPDATE)
DT 01-OCT-1994 (REL. 30, LAST ANNOTATION UPDATE) . . .
Symbol comparison table: Gencoredisk:Gcgcore/Data/Rundata/Swgappep.Cmp
CompCheck: 1254
Gap Weight: 3.000 Average Match: 0.540
Length Weight: 0.100 Average Mismatch: -0.396
Quality: 29.2 Length: 46
Ratio: 0.712 Gaps: 1
Percent Similarity: 56.098 Percent Identity: 41.463
Average quality based on 100 randomizations: 25.2 +/- 1.9
humprp.pep x roa1_drome September 30, 1995 16:53 ..
40 GQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQPH 85
|.|:||| .:||.:|. ||.:|..:||| . |..||. :
210 GRGGPGG.....RAGGNRGNMGGGNYGNQNGGGNWNNGGNNWGNNR 250The extreme glycine richness of both sequences is what the program found, yet the Monte Carlo test suggests that this similarity is not significant -- do not blindly accept the output of any computer program! It may be strictly artifactual, as in this case.
Contrast the previous run with the output from the comparison between the human prion protein and the chicken homolog. Here the region of similarity was much more extensive; therefore, I extended the analysis from residue 15 through 220 in HumPrP and 15 through 240 in Prio_Chick:
% more prp-chick.pair
BESTFIT of: humprp.pep check: 8781 from: 15 to: 220
P1;PRIO_HUMAN - MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C)
ID PRIO_HUMAN STANDARD; PRT; 253 AA.
AC P04156;
DT 01-NOV-1986 (REL. 03, CREATED)
DT 01-NOV-1986 (REL. 03, LAST SEQUENCE UPDATE)
DT 01-JUN-1994 (REL. 29, LAST ANNOTATION UPDATE) . . .
to: prio_chick check: 9640 from: 15 to: 240
P1;PRIO_CHICK - MAJOR PRION PROTEIN HOMOLOG PRECURSOR (PR-LP) (ACETYLCHOLINE
ID PRIO_CHICK STANDARD; PRT; 273 AA.
AC P27177;
DT 01-AUG-1992 (REL. 23, CREATED)
DT 01-JUL-1993 (REL. 26, LAST SEQUENCE UPDATE)
DT 01-OCT-1993 (REL. 27, LAST ANNOTATION UPDATE) . . .
Symbol comparison table: Gencoredisk:Gcgcore/Data/Rundata/Swgappep.Cmp
CompCheck: 1254
Gap Weight: 3.000 Average Match: 0.540
Length Weight: 0.100 Average Mismatch: -0.396
Quality: 139.3 Length: 223
Ratio: 0.693 Gaps: 7
Percent Similarity: 60.000 Percent Identity: 43.500
Average quality based on 100 randomizations: 62.5 +/- 3.6
humprp.pep x prio_chick September 30, 1995 19:25 ..
. . . . .
17 SDLGLCKK...RPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGG 63
.|::|:|| :|..|||..|: | |: ....| .. |. . . | ||..
19 TDVALSKKGKGKPSGGGWGAGSHRQPSYPRQPGYPHNPGYPHNPGYPHNP 68
. . . . .
64 GWGQ....PHGGGWGQ....PHGGGWGQPHGGGWGQGGGTHSQWNKPSK. 104
|:.: ||..|:.| ||..|: .. |.|:..::|. : .||.|
69 GYPHNPGYPHNPGYPQNPGYPHNPGY.PGWGQGYNPSSGGSYHNQKPWKP 117
. . . . .
105 PKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENM 154
||||:||:|||||||||||||||| :|..|| .||:|. | |::.||
118 PKTNFKHVAGAAAAGAVVGGLGGYAMGRUNIXGMNYHFDSPDEYRWWSENS 167
. . . . .
155 HRYPNQVYYRPMDEYSNQNNFVHDCVNITIKQHTVTTTTK........GE 196
||||.||||. .. |: || ||.|||:.:..:....| :.
168 ARYPNRVYYRDYSSPVPQDVFVADCFNITVTEYSIGPAAKKNTSEAVAAA 217
. .
197 NFTETDV..KMMERVVEQMCITQ 217
| ||.:: |::.:|: :||: |
218 NQTEVEMENKVVTKVIREMCVQQ 240The Z score calculation for this comparison yields a value of 21.3, definite significance!
Run the bestfit -randomizations with all the beta-amyloid comparisons being careful to restrict your analyses to those regions that you determined were most similar from the dotplots. This should produce a total of three .pair Monte Carlo comparison files. Be sure all have identifying names that make sense to you. I will be asking about these results in the exercise report form.
5.b. Where to go next -- what to do with significant matches.
Further directions naturally depend on the focus of your research. If you are dealing with a protein of unknown function, then perhaps functions suggested by matches with other known proteins can be experimentally tested in the wet-laboratory. Often regions of similarity can be refined into a motif and used to search the databases for further matches, especially when more than one sequence can be used to create the motif. If areas turn out to be similar to regions of proteins that have had their three-dimensional structure solved, structural features may be inferred and whole new vistas of research may open. The most powerful analyses will come with multiple sequence alignments and profiles created from similar areas. You will be learning techniques for multiple sequence alignment in next week's exercise. And whenever a reliable multiple sequence alignment can be created, then evolutionary relationships, phylogenies, can be estimated.
6) For extra credit:
Devise a method for ascertaining whether there is any significant similarity between any regions of the human prion protein and any regions of the human beta-amyloid protein? Prepare a file and rcp it to teacher that shows me the most significant region you can find; explain how you found it and describe the significance of the similarity.
7) Finishing up and conclusions -- do you have any?
To get credit for this lab, you need to complete the exercise report form that was copied into your directory at the beginning of the tutorial. Rename it to have your last name with the mv command but leave the extension .week10, and then go into the file using the pico editor to fill in the report.
% mv week10.week10 (your lastname).week10
% pico (your lastname).week10
Remote copy the file to teacher with the rcp command:
% rcp (your lastname).week10 teacher@ribozyme:receive
You should also have three dotplot Figure files from the beta-amyloid comparisons. Be sure to rcp them to teacher also:
% rcp (your lastname).dotplot* teacher@ribozyme:receive
This concludes your computing session for this week. Log off ribozyme, get out of the emulator and back to the overlapping windows screen.
% exit
Press the <alt> and <x> keys together. You will be asked if you really want to exit the program. Respond with <y> to get out of the teemtalk emulator and return to the overlapping windows screen.
This has been a very long exercise; I do apologize. However, database searching techniques and their interpretation are among the most misunderstood of all sequence based biocomputing techniques. There is a tremendous amount of confusion and misinformation in this area and anything that can be done to try to clear up some of the mess is entirely worthwhile. One point that still needs to be made is that the previous techniques were performed largely using GCG's suggested defaults. This usually will work for you, but it is a good idea to think about what these default values imply and adjust them accordingly, especially if the results seem inappropriate after running through a first pass with the default parameters intact.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) Basic Local Alignment Tool. Journal of Molecular Biology 215, 403-410.
Bairoch A. (1992) PROSITE: A Dictionary of Sites and Patterns in Proteins. Nucleic Acids Research 20, 2013-2018.
Gribskov, M. and Devereux, J., editors (1992) Sequence Analysis Primer. W.H. Freeman and Company, New York,. NY
Henikoff, S. and Henikoff, J.G. (1992) Amino Acid Substitution Matrices from Protein Blocks. Proceedings of the National Academy of Sciences USA 89, 10915-10919.
Needleman, S.B. and Wunsch, C.D. (1970) A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular Biology 48, 443-453.
Pearson, P., Francomano, C., Foster, P., Bocchini, C., Li, P., and McKusick, V. (1994) The Status of Online Mendelian Inheritance in Man (OMIM) medio 1994. Nucleic Acids Research 22, 3470-3473.
Pearson, W.R. and Lipman, D.J. (1988) Improved Tools for Biological Sequence Analysis. Proceedings of the National Academy of Sciences USA 85, 2444-2448.
Schwartz, R.M. and Dayhoff, M.O. (1979) Matrices for Detecting Distant Relationships. In Atlas of Protein Sequences and Structure, (M.O. Dayhoff editor) 5, Suppl. 3, 353-358, National Biomedical Research Foundation, Washington D.C.
Smith, T.F. and Waterman, M.S. (1981) Comparison of Bio-Sequences. Advances in Applied Mathematics 2, 482-489.
Wilbur, W.J. and Lipman, D.J. (1983) Rapid Similarity Searches of Nucleic Acid and Protein Data Banks. Proceedings of the National Academy of Sciences USA 80, 726-730.