Advanced Database Similarity Searching: What's available, the methods and the algorithms -- motifs, hashing techniques and heuristics, dot matrix analysis, multiple sequence alignments, and profiles.
Author:
Steven M. Thompson
What do database searches tell us and what can we gain from them? Why even bother? If our new sequence, or portions of it, falls into a preexisting group then we can gain knowledge of its function and possibly even its structure, plus sometimes a confirmation of the sequence's validity. Database searches can often provide valuable insights into enzymatic mechanism and even evolution. Are there any "families" that your newly discovered sequence falls into? Even if no similarity can be found, the very fact that your sequence is new and different could be very important. Granted, it's going to be a lot more difficult to discover functional and structural data about it, but in the long run its characterization might prove very rewarding.
Most database searching programs use a couple of tricks to make things happen faster. These tricks fall into two main categories, that of hashing and approximation. Hashing is the process of breaking your query sequence into small "words" or "ktuples" of a set size and creating a "look-up" table with those words keyed to numbers. Then when any of the words match part of an entry in the database, that match is saved. In general, hashing reduces the complexity of the search problem from N2 for dynamic programming to N, the length of all the sequences in the database. Approximation techniques are collectively known as "heuristics." Webster's defines heuristic as "serving to guide, discover, or reveal; . . . but unproved or incapable of proof." In database searching techniques the heuristic usually restricts the necessary search space by calculating some sort of a statistic that allows the program to decide whether further scrutiny of a particular match should be pursued. This statistic may miss things depending on the parameters set -- that's the heuristic.
Database searching is far more sensitive at the amino acid level than at the DNA level because proteins have twenty match criteria versus DNA's four. This drastically cuts down the "noise" level of the search. Therefore, if you are dealing with coding sequence, it is usually prudent to search at the protein level. However, sometimes a preliminary search of the DNA databases with a DNA query will yield valuable information such as the important questions: "Am I working on something that's already been done; is it worth continuing?" or, "Am I doing something totally unique -- or maybe, just `junk' DNA?" An early nucleotide search can often point out the necessity of continued research or the contemplation of abandonment, before any translational analysis has been completed. This could prove to be a tremendous time saver.
Even though protein searching is more sensitive, the DNA databases are much larger. This drawback has been partly overcome with programs which can take a protein query and compare it to translated nucleotide databases, but one still needs to know if the translation is "real." So, even though there are advantages and disadvantages to both types of searching, the general rule is to query with a peptide sequence, if at all possible, and screen whichever database you choose.
One small database that you should screen just as a matter of course is PROSITES. The GCG program Motifs performs this search. Motifs searches for recognized structural, regulatory and enzymatic consensus sequences in the PROSITE Dictionary of Protein Sites and Patterns (Bairoch, 1992). This approach is wonderful for trying to ascertain function in an unidentified peptide sequence, but keep in mind the inherent problems of consensus style searches explored last week. The program can tolerate mismatches with a mismatch option and it displays an abstract with selected references for each motif signature found. Another small database that should not be ignored is NRL_3D. This database contains all the sequences from the three-dimensional coordinate database PDB and, thus, can serve as a link between structural and sequence based methods.
A very powerful approach that should always be utilized, if at all possible, is the Profile suite (Gribskov, et al., 1987). This strategy works best when one has prepared and refined a multiple sequence alignment of significantly similar sequences or, even better, regions within sequences. PROFILE searching involves forming a "profile" from this set of related sequences and then searching the databases with that profile. Profile searching is tremendously powerful and should be pursued whenever possible. A very appropriate strategy would be to find similar genes to a newly sequenced gene using traditional database searching techniques, and then align all of the significantly similar proteins or protein domains using the multiple sequence alignment program PileUp. One could then run the aligned sequences through the Profile package to generate a profile of the family.
Often Profile analysis can show features not obvious to individual members. A distinct advantage is evolutionary issues are considered by virtue of the Profile algorithms in further manipulations and database searches. Gaps are penalized more heavily in conserved areas than they are in variable regions and the more highly conserved a residue is, the more important it becomes. Generated consensus sequences are not based merely on the positional frequency of particular residues but rather utilize the evolutionary conservation of substitutions based on the BLOSUM62 (Henikoff and Henikoff, 1992) table (by default, other substitution matrices can also be specified).
A big question and a very common mistake that is made in this whole area of searching and alignment is the concept of homology versus similarity: There is a huge difference! Similarity is merely a statistical parameter which describes how much two sequences, or portions of them, are alike according to some set scoring criteria. Homology, by definition, implies an evolutionary relationship -- more than just the fact that we've all evolved from the same old pond scum. You need to be able to demonstrate some type of lineage between the organisms or genes of interest in order to claim homology. Even better, be able show some experimental evidence, morphological or genetic, that corroborates your assertion. There is really no such thing as percent homology; something is either homologous or it is not. Do not make the mistake of calling any old sequence similarity homology, it will get you in trouble with a lot of scientists, especially the evolutionists of the world.
How do you tell if a similarity is significant or is truly homologous? Many of the programs generate percent similarity scores, however these really don't mean a whole lot. The "quality" score means a lot more but is hard to interpret. At least it takes the length of similarity, all of the necessary gaps introduced, and the matching of symbols all into account. To get a better handle on what these various scores mean, read the algorithm section of the GCG Program Manual for the various methods -- statistics are always confusing but the descriptions do help. A couple of the programs generate histograms of score distributions, but again, they can be confusing. One of the best ways of deciding significance is to take advantage of the Monte Carlo randomizations option available in the two comparison programs BestFit and Gap. To utilize this strategy, pull the sequence you wish to evaluate from your search output list and compare it to your query using the appropriate algorithm, either Gap or BestFit depending on whether you're trying to compare the entire length of each sequence or only the best regions of similarity to each, respectively, and specify -randomizations=100 on the command line. This option jumbles the second sequence of the comparison 100 times and then generates scores and a standard deviation based on the jumbled matches. You can then compare the random scores to the unjumbled score using a "Z score" calculation to help decide significance. The BLAST (Altschul, et al., 1990) and ProfileSearch algorithms use a similar approach but base their statistics on a comparison to all the rest, "insignificantly similar," members of the database being searched.
Another powerful method that should always be considered in similarity analysis is the dot matrix procedure. In dot matrix analysis one sequence is plotted on the vertical axis against another on the horizontal axis using a very simple approach; wherever they match according to some scoring system that you specify a dot is generated. Dot matrix analysis can point out areas of similarity between two sequences that all other methods might miss. This is because most other methods align either the overall length of two sequences or just the "best" parts of each to achieve one optimal alignment. Dot matrix methods enable the operator to visualize the entirety of both sequences; if you will, it allows the "Gestalt" of the alignment to be seen. However, their interpretation is entirely up to the user -- you must know what the plots mean and how to successfully filter out extraneous background noise. Using this method correctly you can identify areas within sequences that happen to have significant matches that no other method would ever notice. Again, this is a method which you would perform after running initial searches as it only compares one sequence against another, not the entire database.
Many database searches work best when submitted as a batch process. This is because of the sizes of the databases of interest. We currently have over a million sequences in our nucleotide databases. In spite of the fast hashing style algorithms incorporated, most programs take quite a while to search through that much data. There is no way you want to wait in front of a unusable terminal while the computer cranks away at work comparing your query to that many sequences, therefore, take advantage of batch capablities. Most of the GCG database searches accept a batch option which is really handy. An exception to the standard "submit the search and wait" style of most database searching is the BLAST program. This program uses an extremely fast heuristic statistical hashing algorithm on a large parallel computer located at the National Center for Biotechnological Information. Typical searches run in just a minute or two, after you get through the waiting queue, however, realize the limitations of this algorithm to finding only ungapped areas of similarity and of not being optimized for nucleic acid sequences.
I will return to the prion example for this exercise. As in the previous two exercises, use your own Selected Molecule off the list to complete all of the following steps. Log on to your account and initialize GCG. Create a new directory for this exercise and move down into it. Next, copy your translated output file from the actual database entry containing your gene of interest from last week's gene finding session into this new directory. This file should have the .pep extension and it should be identical to its counterpart in the protein sequence databases.
1) Searching the PROSITES database.
One of the quickest and easiest databases to search with your newly translated protein is Dr. Amos Bairoch's Protein signature database. This can lead to immediate answers and routes of investigation. It should always be utilized -- it's just too fast and simple to ignore.
Start the Motifs program by typing motifs; specify your .pep sequence's name. The prion example follows below:
% motifs humprp.pep
Motifs looks for sequence motifs by searching through proteins for the
patterns defined in the PROSITE Dictionary of Protein Sites and
Patterns. Motifs can display an abstract of the current literature on
each of the motifs it finds.
What should I call the output file (* humprp.motifs *) ? <rtn>
humprp.pep len: 231 .........................
Total finds: 2
Total length: 231
Total sequences: 1
CPU time (sec): 02.79
Output file: "/disk2/usr/local/people/thompson/BC578/EX8/humprp.motifs"
% more humprp.motifs
MOTIFS from: humprp.pep
Mismatches: 0 February 8, 1996 17:08 ..
humprp.pep Check: 414 Length: 231 ! TRANSLATE of: humprp.genbank
check: 4378 from: 116 to: 808
______________________________________________________________________________
Prion_1 AGAAAAGAVVGGLGGY
91: NMKHM AGAAAAGAVVGGLGGY MLGSA
****************************
* Prion protein signatures *
****************************
Prion protein (PrP) [1,2,3] is a small glycoprotein found in high quantity in
the brains of humans or animals infected with a number of degenerative
neurological diseases such as Kuru, Creutzfeldt-Jacob disease (CJD), scrapie
or bovine spongiform encephalopathy (BSE). PrP is encoded in the host genome
and expressed both in normal and infected cells. It has a tendency to
aggregate yielding polymers called rods.
Structurally, PrP is a protein consisting of a signal peptide, followed by
an N-terminal domain that contains tandem repeats of a short motif (PHGGGWGQ
in mammals, PHNPGY in chicken), itself followed by a highly conserved domain
of about 140 residues that contains a disulfide bond. Finally comes a C-
terminal hydrophobic domain post-translationally removed when PrP is attached
to the extracellular side of the cell membrane by a GPI-anchor. The structure
of PrP is shown in the following schematic representation:
+---+----------------+-******-------------------****-----+-----+
|Sig| Tandem repeats | C C S| |
+---+----------------+--------------------|--------|----|+-----+
+--------+ |
GPI
'C': conserved cysteine involved in a disulfide bond.
'*': position of the patterns.
As signature pattern for PrP, we selected a perfectly conserved alanine- and
glycine-rich region of 16 residues as well as a region centered on the second
cysteine involved in the disulfide bond.
-Consensus pattern: A-G-A-A-A-A-G-A-V-V-G-G-L-G-G-Y
-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in SWISS-PROT: NONE.
-Consensus pattern: E-x-[ED]-x-K-[LIVM](2)-x-[KR]-[LIVM](2)-x-[QE]-M-C-x(2)-
Q-Y
[C is involved in a disulfide bond]
-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in SWISS-PROT: NONE.
-Last update: December 1992 / Patterns and text revised.
[ 1] Stahl N., Prusiner S.B.
FASEB J. 5:2799-2807(1991).
[ 2] Brunori M., Chiara Silvestrini M., Pocchiari M.
Trends Biochem. Sci. 13:309-313(1988).
[ 3] Prusiner S.B.
Annu. Rev. Microbiol. 43:345-374(1989).
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Prion_2 Ex(E,D)xK(L,I,V,M)2x(K,R)(L,I,V,M)2x(Q,E)MCx2QY
Ex(D)xK(M){2}x(R)(V){2}x(Q)MCx{2}QY
178: GENFT ETDVKMMERVVEQMCITQY ERESQ
Find reference above under sequence: humprp.pep, pattern: Prion_2.
Extensive
abstract and reference lists follow the identified sequence locations for each
site. This information can save anybody a tremendous amount of work! The
sites themselves are shown with their sequence locations below each consensus
pattern. More sites will be listed if you specify the frequent option. Notice
in my example that Dr. Bairoch included the prion signature in his database;
the characteristic repeat is only found in the prion protein.2) Traditional database searching: FastA and Word approaches -- running the algorithms.
Two different symbol matching algorithms have traditionally been utilized in database searching. These two algorithms (see the GCG Program Manual for details) are incorporated into GCG's FastA (Pearson and Lipman, 1988) and WordSearch (Wilbur and Lipman, 1983) programs Since the algorithms differ, the output results will also differ. As in all types of computerized molecular biology analyses, the prudent will run as many strategies as practical and try to interpret the results in light of this. Most of the following programs accept GCG's batch option; pay attention to the necessity of running these programs in the background -- otherwise you will be stuck at an unusable terminal!
TFastA is one of the most robust of the searching programs onsite. It compares your peptide sequence against translations of the DNA database. This way you can take advantage of the multitude of DNA sequences that never make it to the protein databases and yet still retain the sensitivity of protein searches.
Type tfasta -check -batch to start the Translation FastA program in batch mode and see all of the available options. At the command option prompt add -noalign to suppress all the alignments since most are redundant for our purposes and we will be investigating the more interesting ones later anyway. Some of the other options can be very helpful depending on your specific situation and should be explored in your own research. The optall option in particular is very useful and is now the program default. FastA sorts its output based on a normalized derivative of the optimum score, the result of the final dynamic programming pass in FastA, rather than the initn score, the longest combined word score. Specify your peptide's sequence name at the program's sequence prompt. Accept all of the program defaults including the Expectation Function cutoff value. This should insure that the listing goes beyond those entries which we would expect for these well studied protein molecules. The prion example session follows below:
% tfasta -check -batch
TFastA does a Pearson and Lipman search for similarity between a
query peptide sequence and any group of nucleotide sequences. TFastA
translates the nucleotide sequences in all six reading frames before
performing the comparison. It is designed to answer the question, "What
implied peptide sequences in a nucleotide sequence database are similar
to my peptide sequence?"
Minimal Syntax: % tfasta [-INfile1=]ggamma.pep -Default
Prompted Parameters:
[-INfile2=]GenEMBL:* search set (all of GenEMBL)
[-OUTfile=]ggamma.tfasta output file name
-BEGin=1 -END=148 range of interest
-WORdsize=2 word size
-EXPect=2.0 lists scores until E() value reaches 2.0
Local Data Files:
-MATRix=blosum50.cmp scoring matrix for peptides
Optional Parameters:
Press q to quit or <Return> for more: <rtn>
-GAPweight=16.0 gap creation penalty
-LENgthweight=4.0 gap extension penalty
-SINce=6.90 limits search to sequences dated on or after June 1990
-THREEFrames translates and searches only the three forward reading
frames
-FRAme=1 translates and searches only the frame specified.
-NOPAMfactor uses a constant factor to calculate initial diagonal scores
-LIStsize=40 shows the best 40 scores (overrides EXPect)
-NOATTRibutes suppresses writing the Begin, End, and Strand
list attributes to the list of best scores
-ALIgn=20 shows the best 20 alignments
-NOALIgn suppresses sequence alignments
-OPTall=20 immediately computes opt score when the initn score is 20
or higher; sorts on opt score
-NOOPTall doesn't compute opt score during search; sorts on initn
-SWalign does final alignment as Smith-Waterman
-SHOWall shows complete sequences in alignment, not just overlaps
-MARKx=3 determines the alignment display mode
-NOHIStogram suppresses printing the histogram
-LINEsize=60 number of sequence symbols per line of the alignment
-NODOCLines suppresses sequence documentation in the alignment
-NOMONitor suppresses the screen trace for each search set sequence
-BATch submits the program to run in the batch queue
Press q to quit or <Return> for more: <rtn>
-MINLength=1000 searches only sequences of 1000 or more residues
-MAXLength=5000 searches only sequences of 5000 or fewer residues
Add what to the command line ? -noalign
TFASTA with what query sequence ? humprp.pep
Begin (* 1 *) ? <rtn>
End (* 231 *) ? <rtn>
Search for query in what sequence(s) (* GenEMBL:* *) ? <rtn>
What word size (* 2 *) ? <rtn>
Don't show scores whose E() value exceeds: (* 10.0 *): <rtn>
What should I call the output file (* humprp.tfasta *) ? <rtn>
** tfasta will run as a batch or at job.
** tfasta was submitted using the command:
" batch "
warning: commands will be executed using /bin/sh
job 848620194.b at Thu Nov 21 15:49:54 1996
The job will submit itself and record a log file of any problems encountered
during
its run. This log file will be e-mailed to you by the system; it can prove
handy with debugging attempts if anything goes wrong. You should proceed with
the rest of the exercise at this point.
Now run the normal FastA program searching the SwissProt:* logical sequence specification. Start the program by typing fasta -batch -noalign (the options are the same as in TFastA) to again run the program in batch mode and suppress alignments. Specify your sequence's filename.ext and accept the program's SwissProt database search. Accept the rest of the defaults again. The prion example is illustrated below:
% fasta -batch -noalign
FastA does a Pearson and Lipman search for similarity between a query
sequence and a group of sequences of the same type (nucleic acid or
protein). For nucleotide searches, FastA may be more sensitive than BLAST.
FASTA with what query sequence ? humprp.pep
Begin (* 1 *) ? <rtn>
End (* 231 *) ? <rtn>
Search for query in what sequence(s) (* SwissProt:* *) ? <rtn>
What word size (* 2 *) ? <rtn>
Don't show scores whose E() value exceeds: (* 10.0 *): <rtn>
What should I call the output file (* humprp.fasta *) ? <rtn>
** fasta will run as a batch or at job.
** fasta was submitted using the command:
" batch "
warning: commands will be executed using /bin/sh
job 848620634.b at Thu Nov 21 15:57:14 1996
Again
the batch option automatically submits the program to the background and you
should proceed with the exercise.Repeat the FastA search specifying NRL_3D as the database to search. You do not need to run this search in batch because NRL_3D is relatively small. Be sure not to use the default output file name as this would overwrite your previous FastA search. This search should pull out the corresponding sequence from the 3-dimensional structural coordinate database. Remember NRL_3D is a "link" to PDB which can be used to specify these sequences in GCG programs and point to whole new vistas of investigation.
The last traditional database search we will submit is called WordSearch. Unlike FastA, WordSearch can find more than one region of similarity between two sequences. Just like the previous two, it will accept automatic batch submission. Type wordsearch -batch to begin the program; additionally, take advantage of this program's graphics capability by specifying -plot and -figure on the command line. Give the program the sequence's name and accept the logical SwissProt:* as the sequences to be searched. Specify a list size of 500. An option this program will accept, but that we will not be using this time, is simplify. This option simplifies the input sequences according to a scheme which generalizes amino acids into broad classes based on their chemical characteristics. Another helpful option available is mask. This option allows you to search with only certain characters of your sequence such as only the first and second codon positions in coding DNA. The example session with the prion protein follows below:
% wordsearch -plot -batch -figure
WordSearch identifies sequences similar to a query sequence using a
Wilbur and Lipman search. WordSearch answers the question,
"What sequences in the database are similar to my sequence?" The output
is a list of significant diagonals whose alignments can be displayed
with Segments.
Process set to plot with VERSATERM-TEK4105 attached to term
using the tekd graphic interface.
WORDSEARCH with what query sequence ? humprp.pep
Begin (* 1 *) ? <rtn>
End (* 231 *) ? <rtn>
Search for query in what sequence(s) (* SwissProt:* *) ? <rtn>
What word size (* 2 *) ? <rtn>
List how many best diagonals (* 50 *) ? 500
Integrate how many adjacent diagonals (* 3 *) ? <rtn>
/
What should I call the output file (* humprp.word *) ? <rtn>
** wordsearch will run as a batch or at job.
** wordsearch was submitted using the command:
" batch "
warning: commands will be executed using /bin/sh
job 848621491.b at Thu Nov 21 16:11:31 1996job 848621491.b at Thu Nov 21 16:11:31 1996
3) Running ProfileSearch.
Dr. Gribskov et al. (see references in GCG program manual) have assembled an elegant package for associating distantly related proteins and discovering structural motifs in Profile analysis. John Devereux has written an excellent overview essay of the method in the GCG program manual; please take the time to read this section!
We should actually be running a Profile analysis of a family assembled from the significant results of the previous searches; however, we just don't have the time. Therefore, we will take advantage of the Profile we assembled back in Exercise 5 when we constructed our probes/primers. Copy your .prf file from your Exercise 5 directory into the present directory. My copy example follows (yours will be different):
% cp ~/courses/BC578/EX5/prion.prf .
This program also accepts automatic batch submission so it can be launched in a similar manner as the previous three. Type profilesearch -batch -check to launch the program in batch mode and see all the available options. Answer the prompts as above, however, take advantage of the minlist option. This option allows you to cut-off the search list at a particular Z score; in this case set it equal to 4.0. We will again run the search against the SwissProt database. My screen trace follows:
% profilesearch -batch -check
ProfileSearch uses a profile (representing a group of aligned
sequences) as a query to search the database for new sequences with
similarity to the group. The profile is created with the program
ProfileMake.
Minimal Syntax: % profilesearch [-INfile1=]hsp70.prf -Default
Prompted Parameters:
[-INfile2=]SW:* the search set
-GAPweight=4.50 maximum position-specific gap creation penalty
-LENgthweight=0.05 maximum position-specific gap extension penalty
[-OUTfile=]hsp70.pfs output file name
Local Data Files: None
Optional Parameters:
-LIStsize=100 sets a limit to the size of the output list
(60,000 is the default)
-SEQLimit=100000 maximum number of sequences to search
(can't be set higher than 100,000)
-MINList=2.5 lowest Z score to report in output list.
Press q to quit or <Return> for more: <rtn>
-MINSeq=50 minimum length sequence to examine in the search
-MINNorm=50 minimum length sequence to use in normalization
-NONORmalize turns off normalization of comparison scores
-NOAVErage does not adjust quality scores for sequence
composition
-NOSEArch normalizes a pre-existing file of search scores
-FITfile[=hsp70.fit] output file with curve fitting information for
normalized scores
-CPUlimit=1000 limits the search to 1,000 seconds (default
is 86,400)
-SINce=6.90 limits search to sequences dated on or after
June 1990 (does not work for PIR databases)
-NOMONitor suppresses the screen trace for the search set
sequences
-BATch submits program to the batch queue
Add what to the command line ? -minlist=4.0
PROFILESEARCH version 4.50 November 21, 1996 16:17
PROFILESEARCH with what query profile ? prion.prf
"prion.prf" is a profile of length: 251
Search for query in what sequences(s) (* SwissProt:* *) ? <rtn>
What is the gap creation penalty (* 33.0 *) ? <rtn>
What is the gap extension penalty (* 0.37 *) ? <rtn>
What should I call the output file (* prion.pfs *) ? <rtn>
** profilesearch will run as a batch or at job.
** profilesearch was submitted using the command:
" batch "
warning: commands will be executed using /bin/sh
job 848621889.b at Thu Nov 21 16:18:09 1996
We
did refine our Profile to only include the more conserved portion of the
alignment by trimming the divergent amino and carboxy termini back in Exercise
5 which is good; however, its members should be appropriately weighted to even
out their contributions. The Profile refinement procedure, including
repeatedly searching the databases and including or excluding members and
adjusting their weights as well as adjusting the Profile's length, is known as
validating the Profile. If using Profile analysis in your own research,
following the validation procedures outlined in the GCG Program Manual in the
ProfileScan description is a very prudent idea, but we do not have the time for
that now. As it is, much of the exercise is going to have to wait for the
batch jobs to finish anyway.The ProfileSearch batch job may not be done for a couple days under ribozyme's present load. Don't worry about it until next week and then we can take a look at the results. Proceed with the rest of the exercise at this point.
4) Revisit BLAST; Internet similarity searching.
The BLAST server at NCBI can provide the most up to date and quickest database search available. BLAST is a heuristic algorithm for searching sequence databases developed by the National Center for Biotechnology Information at the National Library of Medicine. The acronym stands for Basic Local Alignment Search Tool. BLAST by default runs on an eight processor parallel computer system at NCBI. It only looks for ungapped segments; however, it ranks matches statistically and provides probability values for each to help evaluate significance. As such, it is best for identifying shorter regions of high similarity -- exactly what you might want with a sequence of unknown function. It is very fast, at least an entire order of magnitude over traditional sequence similarity database searching, yet maintains all of the sensitivity of the old methods for local similarity regions with protein sequences! You'll get your reply in a manner of minutes in most cases. Another advantage with BLAST is it not only shows you the best alignment for each similar sequence found (as in the BestFit type alignments of FastA) but also shows the next best alignments for each up to a certain preset cutoff point. This combines the power of dot-matrix type analyses and the interpretative ease of traditional sequence alignments. One can fine-tune BLAST by altering its operating parameters and taking advantage of the many options available in it; however, BLAST is not appropriate for comparing nucleotide sequences against the nucleotide database. When you are forced to perform this type of nucleotide-to-nucleotide search it is usually best to use FastA instead.
BLAST accesses the latest (GenBank updates every night) database by default, nucleotide or protein. The GCG implementation of BLAST by default runs in a remote client-server mode such that NCBI's database and computer perform the analysis; alternatively you can run it locally if you have custom assembled databases that you wish to screen.
For help in interpreting BLAST results refer to the GCG BLAST documentation, the BLAST HELP file obtained off the web or by sending the single word "HELP" to BLAST@ncbi.nlm.nih.gov (leave the subject line blank), and the suggested text book for the course. You ran a sample BLAST search back in Exercise 4; run blast -check on your selected protein now. Specify the -filter=xs option. This is very powerful for screening out low complexity and repeat sequences from your query to minimize confusion due to random noise. The prion protein is very rich in these types of confounding sequence and so is a very appropriate candidate for this option. Also specify -segments=0, to suppress segment alignments and hence reduce the size of the output file. The standard output file is very long because BLAST automatically aligns the best 100 matches. Accept the default database choice, 1) nr p which stands for "NonRedundant Protein," to search all of the unique sequences from the most current versions of all of the protein databases. Set the listsize parameter to 500. BLAST has recently become quite busy; because of this, you may have to wait for a few moments in a queue. A screen trace of a BLAST session with the prion protein follows:
% blast -check
BLAST searches for sequences similar to a query sequence. The query and the
database searched can be either peptide or nucleic acid in any combination.
BLAST can search databases on your own computer or databases maintained at
the National Center for Biotechnology Information (NCBI) in Bethesda,
Maryland, USA.
Minimal Syntax: % blast [-INfile1=]SW:Zea2_Maize -Default
Prompted Parameters:
[-INfile2=]swissprot database to search
-EXPect=10.0 ignores scores that would occur by chance
more than 10 times
-LIStsize=250 maximum number of sequences listed in the output
[-OUTfile=]zea2.blastp output file name
Local Data Files:
[-DATa1=BLAST.rDBs] list of available remote databases
[-DATa2=BLAST.lDBs] list of available local databases
[-DATa3=BLAST.sDBs] list of available site-specific databases
Optional Parameters:
Press q to quit or <Return> for more: <rtn>
-TOPstrand searches only the top strand of nuc. query
-BOTtomstrand searches only the bottom strand of nuc. query
-DBTOPstrand translates only top strand of nuc. database
-DBBOTtomstrand translates only bottom strand of nuc. database
-FILter=xs filters low complexity segments out of query
-MATrix=PAM120[,PAM250...] specifies one [or more] protein scoring matrix
-TBLASTX if query and database are both nucleotide,
translates both and does protein comparisons
-TRANSlate=1 genetic code for translating query
-DBTRANSlate=1 genetic code for translating database
-REMoteonly searches only remote databases (simplifies menu)
-LOCalonly searches only local databases ( " " )
-DBNucleotideonly searches only nucleic databases ( " " )
-DBProteinonly searches only protein databases ( " " )
-SERver=cruncher.nlm.nih.gov runs remote search on this particular host
-WORdsize=7 sets word size (primarily for nuc-nuc searches)
-SYNonym=10 sets minimum score for equivalent words
-MATCH=5 scoring matrix value for nucleotide matches
-SEGments=100 displays segment pairs from top 100 sequences
-HIStogram shows histogram of hits versus expectation level
-APPend="string" appends "string" to pass-through command line
-BATch submits program to batch queue
Add what to the command line ? -filter=xs -segments=0
BLAST search with what query sequence? humprp.pep
Search for query in what sequence database:
1) nr p Non-redundant GenBank CDS translations+PDB+SwissProt+PIR
2) pdb p PDB protein sequences
3) swissprot p SwissProt sequences
4) yeast p Saccharomyces cerevisiae protein sequences
5) kabat p Kabat Sequences of Proteins of Immunological Interest
6) alu p Translations of Select Alu Repeats from REPBASE
7) month p All new or revised GenBank CDS translation+PDB+SwissProt+PI
8) nr n Non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST's
9) pdb n PDB nucleotide sequences
10) vector n Vector subset of GenBank
11) yeast n Saccharomyces cerevisiae genomic nucleotide sequences
12) est n Non-redundant Database of GenBank+EMBL+DDBJ EST Division
13) sts n Non-redundant Database of GenBank+EMBL+DDBJ STS Division
14) gss n Genome Survey Sequences
15) mito n Database of mitochondrial sequences, Rel. 1.0, July 1995
16) kabat n Kabat Sequences of Nucleic Acid of Immunological Interest
17) epd n Eukaryotic Promotor Database
18) alu n Select Alu Repeats from REPBASE
19) month n All new or revised GenBank+EMBL+DDBJ+PDB sequences released
Please choose one (* 1 *): <rtn>
Ignore hits expected to occur by chance more than (* 10.0 *) times? <rtn>
Limit the number of sequences in my output to (* 250 *) ? 500
What should I call the output file (* humprp.blastp *) ? <rtn>
Trying cruncher.nlm.nih.gov (130.14.25.175)
Connected to cruncher.nlm.nih.gov
Waiting for 2 of 5 other BLAST requests on the system to finish.
Still waiting
.
Search in progress on the network server.
................................
WARNING: HSPs invo.lving 67 database sequences were not reported due to the
limiting value of parameter B = 1.
...
Done!
The complete prion BLAST search reply follows:
% more humprp.blastp
National Center for Biotechnology Information (NCBI)
Experimental GENINFO(R) BLAST Network Service (Cruncher)
Fri Nov 22 13:20:44 EST 1996, Up 10 days, 3 mins, load: 13.38, 11.72, 11.63
If results of this search are reported or published, please mention that
the computation was performed at the NCBI using the BLAST network service.
Problems with the service should be reported to a local system administrator.
PEPTIDE SEQUENCE DATABASES
nr Non-redundant GenBank CDS translations+PDB+SwissProt+PIR
pdb PDB protein sequences
kabat Kabat Sequences of Proteins of Immunological Interest
alu * Translations of Select Alu Repeats from REPBASE
month All new or revised GenBank CDS translation+PDB+SwissProt+PIR
sequences released in the last 30 days
swissprot SwissProt sequences
NUCLEOTIDE SEQUENCE DATABASES
nr Non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST's or
STS's)
est + Non-redundant Database of GenBank+EMBL+DDBJ EST Division
sts + Non-redundant Database of GenBank+EMBL+DDBJ STS Division
pdb PDB nucleotide sequences
vector Vector subset of GenBank
mito * Database of mitochondrial sequences, Rel. 1.0, July 1995
kabat Kabat Sequences of Nucleic Acid of Immunological Interest
epd Eukaryotic Promotor Database
alu *+ Select Alu Repeats from REPBASE
month All new or revised GenBank+EMBL+DDBJ+PDB sequences released in
the last 30 days
* Databases that are not accessible through the NCBI Retrieve E-mail server.
+ The TBLASTX program is restricted to searching these databases.
==============================================================================
You can obtain the BLAST documentation files, send a message consisting of
just the word ''help'' (without the quotes) to: blast@ncbi.nlm.nih.gov
Last modification dates: August 10th 95 for the E-mail server help, January
19th 94 for the BLAST manual and February 7th 95 for the BLAST FAQ.
==============================================================================
For a free subscription to "NCBI News", the NCBI newsletter, send a request
along with your name and postal mailing address to: info@ncbi.nlm.nih.gov
==============================================================================
A new GenBank sequence submission tool, called BankIt, is now available
through the NCBI's home page on the World Wide Web. The URL is
http://www.ncbi.nlm.nih.gov/
==============================================================================
Waiting for 2 of 5 other BLAST requests on the system to finish.
Still waiting
BLASTP 1.4.9MP [26-March-1996] [Build 14:27:01 Apr 1 1996]
Reference: Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers,
and David J. Lipman (1990). Basic local alignment search tool. J. Mol. Biol.
215:403-10.
Query= TITLE humprp.pep
(231 letters)
>lcl|TITLE humprp.pep
KKRPKPGGWNTGGSRYPGQGSPGGNRYXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXTHSQXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXMLGSAMSRPIIHFG
SDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCVNITIKQHXXXXXXXXXXXXXXD
VKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSXXXXXXXXXXXXXXVG
Database: Non-redundant GenBank CDS translations+PDB+SwissProt+SPupdate+PIR
230,895 sequences; 65,241,596 total letters.
Searching..................................................done
Smallest
Sum
High Probability
Sequences producing High-scoring Segment Pairs: Score P(N) N
sp|P04156|PRIO_HUMAN MAJOR PRION PROTEIN PRECURSOR (PRP) ... 506 2.1e-87 2
pir||A05017 major prion PrP27-30 protein precurs... 506 2.2e-87 2
pir||A40372 major prion protein PrP variant 1 - ... 506 2.2e-87 2
gi|190520 (M81930) prion protein [Homo sapiens] 506 2.3e-87 2
gi|190518 (M81929) prion protein [Homo sapiens] 506 2.4e-87 2
sp|P40252|PRIO_GORGO MAJOR PRION PROTEIN PRECURSOR (PRP) ... 503 5.4e-87 2
gi|563209 (U15166) major prion protein precurs... 503 5.4e-87 2
sp|P40253|PRIO_PANTR MAJOR PRION PROTEIN PRECURSOR (PRP) ... 498 2.7e-86 2
sp|P40256|PRIO_PONPY MAJOR PRION PROTEIN PRECURSOR (PRP) ... 490 3.4e-85 2
sp|P40251|PRIO_COLGU MAJOR PRION PROTEIN PRECURSOR (PRP) ... 485 1.7e-84 2
sp|P40254|PRIO_MACFA MAJOR PRION PROTEIN PRECURSOR (PRP) ... 485 1.7e-84 2
gi|563213 (U15164) major prion protein precurs... 485 1.7e-84 2
gi|1658469 (U75389) prion protein [Colobus guer... 485 1.7e-84 2
gi|1658471 (U75386) prion protein [Cercopithecu... 485 1.7e-84 2
gi|1658475 (U75388) prion protein [Cercopithecu... 485 1.7e-84 2
sp|P40250|PRIO_CERAE MAJOR PRION PROTEIN PRECURSOR (PRP) ... 485 1.7e-84 2
sp|P40255|PRIO_MANSP MAJOR PRION PROTEIN PRECURSOR (PRP) ... 485 1.8e-84 2
sp|P40248|PRIO_CALMO MAJOR PRION PROTEIN PRECURSOR (PRP) ... 485 1.8e-84 2
gi|1658467 (U75384) prion protein [Cercocebus a... 485 1.8e-84 2
sp|P40246|PRIO_ATEGE MAJOR PRION PROTEIN PRECURSOR (PRP) ... 485 1.8e-84 2
sp|P40247|PRIO_CALJA MAJOR PRION PROTEIN PRECURSOR (PRP) ... 482 4.4e-84 2
sp|P40257|PRIO_PREFR MAJOR PRION PROTEIN PRECURSOR (PRP) ... 481 6.0e-84 2
gi|1396068 (U08302) prion protein [Presbytis fr... 481 6.0e-84 2
sp|P40249|PRIO_CEBAP MAJOR PRION PROTEIN PRECURSOR (PRP) ... 488 6.0e-84 2
gi|1658477 (U75385) prion protein [Cercocebus t... 481 6.2e-84 2
sp|P40258|PRIO_SAISC MAJOR PRION PROTEIN PRECURSOR (PRP) ... 479 1.1e-83 2
gi|1658481 (U75383) prion protein [Theropithecu... 478 1.7e-83 2
gi|595853 (U15165) major prion protein precurs... 476 3.0e-83 2
sp|P40245|PRIO_AOTTR MAJOR PRION PROTEIN PRECURSOR (PRP) ... 481 4.3e-83 2
sp|P04925|PRIO_MOUSE MAJOR PRION PROTEIN PRECURSOR (PRP) ... 424 1.5e-80 3
gi|200531 (M18071) prion protein [Mus musculus] 423 2.1e-80 3
gi|200527 (M13685) prion protein [Mus musculus] 420 5.4e-80 3
pir||A34759 prion protein - Chinese hamster >gi|... 419 1.4e-79 3
pir||B34759 prion protein - golden hamster >gi|1... 419 1.4e-79 3
sp|P10279|PRIO_BOVIN MAJOR PRION PROTEIN 1 PRECURSOR (PRP... 481 2.1e-79 2
sp|Q01880|PRIP_BOVIN MAJOR PRION PROTEIN 2 PRECURSOR (PRP... 478 5.6e-79 2
gi|217594 (D10612) prion protein [Bos taurus] 477 7.5e-79 2
sp|P40242|PRIO_TRAST MAJOR PRION PROTEIN 1 PRECURSOR (PRP... 475 1.4e-78 2
prf||1921132B PrP protein [Tragelaphus strepsiceros] 475 1.5e-78 2
sp|P40243|PRP2_TRAST MAJOR PRION PROTEIN 2 PRECURSOR (PRP... 475 1.5e-78 2
pir||A23545 major prion PrP27-30 protein - hamster 409 3.3e-78 3
sp|P04273|PRIO_MESAU MAJOR PRION PROTEIN PRECURSOR (PRP) ... 409 3.3e-78 3
gi|191423 (M37381) scrapie prion [Mesocricetus... 409 3.5e-78 3
pir||UJHYIH major prion PrP27-30 protein - golde... 409 3.5e-78 3
pir||A54281 prion protein precursor - sheep >gi|... 471 5.2e-78 2
sp|P23907|PRIO_SHEEP MAJOR PRION PROTEIN PRECURSOR (PRP).... 469 9.9e-78 2
pir||S37149 prion protein - goat >gi|400443 (X74... 467 1.9e-77 2
gi|1149617 (X91999) prion protein [Capra hircus] 464 4.9e-77 2
sp|P47852|PRIO_ODOHE MAJOR PRION PROTEIN PRECURSOR (PRP).... 463 6.7e-77 2
sp|P40244|PRIO_MUSVI MAJOR PRION PROTEIN PRECURSOR (PRP) ... 456 6.2e-76 2
sp|P13852|PRIO_RAT MAJOR PRION PROTEIN (PRP). >pir||A53... 423 2.1e-75 3
gi|556349 (L07623) prion protein [Sus scrofa] 449 5.7e-75 2
gi|619652 (U08952) prion protein [Mustela puto... 446 1.5e-74 2
gi|546665 (S69654) prion protein, PrP [zitter ... 381 2.3e-74 3
gi|1490413 (U28334) major prion protein [Orycto... 441 5.0e-73 2
gi|642126 (L38993) prion protein [Trichosurus ... 400 1.1e-57 2
gi|191432 (K02234) PrP 27-30 protein [Mesocric... 408 2.4e-52 1
gi|200533 (M30384) prion protein PrP27-30 [Mus... 188 1.7e-21 1
pir||A22315 Scrapie prion - mouse (fragment) >pr... 187 2.3e-21 1
sp|P27177|PRIO_CHICK MAJOR PRION PROTEIN HOMOLOG PRECURSO... 125 2.2e-19 3
pir||A37372 prion protein homolog precursor - ch... 120 9.9e-19 3
pir||UJCH major prion protein homolog precurso... 120 9.9e-19 3
pir||S02520 Major prion PrP, scrapie, protein - ... 149 1.5e-14 1
pir||S14078 amyloid protein, 11K - human 115 1.1e-09 1
sp|Q01157|GRW1_LYCES GLYCINE-RICH CELL WALL STRUCTURAL PR... 55 0.55 1
gnl|PID|e153043 (X55691) glycine-rich protein [Solan... 64 0.59 1
pir||S14978 glycine-rich protein (clone wN) - to... 64 0.65 1
pir||S14977 glycine-rich protein (clone wM) - to... 64 0.67 1
pir||A36019 Scrapie prion - golden hamster (frag... 58 0.72 1
sp|P17816|GRP_HORVU GLYCINE-RICH CELL WALL STRUCTURAL PR... 64 0.74 1
sp|P21750|SALA_DROME PROTEIN SPALT-ACCESSORY. >pir||S0026... 47 0.91 2
gnl|PID|e156085 (X55690) glycine-rich protein [Solan... 52 0.92 1
pir||S53051 glycine rich protein - barley (fragm... 59 0.93 1
gi|1673657 (AE000002) Mycoplasma pneumoniae, E0... 59 0.999 1
sp||P100_HUMAN_1 [Segment 1 of 2] DNA-BINDING P52/P10... 32 0.9992 2
sp|P21749|SALA_DROSI PROTEIN SPALT-ACCESSORY. >pir||B3391... 43 0.9998 2
Parameters:
V=500
B=0
-filter=xnu+seg
-echofilter
P=4
-ctxfactor=1.00
E=10
Query ----- As Used ----- ----- Computed ----
Frame MatID Matrix name Lambda K H Lambda K H
+0 0 BLOSUM62 0.320 0.135 0.418 same same same
Query
Frame MatID Length Eff.Length E S W T X E2 S2
+0 0 231 128 10. 59 3 11 22 0.19 31
Statistics:
Query Expected Observed HSPs HSPs
Frame MatID High Score High Score Reportable Reported
+0 0 63 (29.1 bits) 506 (233.6 bits) 461 0
Query Neighborhd Word Excluded Failed Successful Overlaps
Frame MatID Words Hits Hits Extensions Extensions Excluded
+0 0 3821 19292048 4710041 14542333 39674 495
Database: Non-redundant GenBank CDS translations+PDB+SwissProt+SPupdate+PIR
Release date: November 22, 1996
Posted date: 6:19 AM EST Nov 22, 1996
# of letters in database: 65,241,596
# of sequences in database: 230,895
# of database sequences satisfying E: 76
No. of states in DFA: 539 (53 KB)
Total size of DFA: 95 KB (128 KB)
Time to generate neighborhood: 0.01u 0.01s 0.02t Real: 00:00:00
No. of processors used: 4
Time to search database: 34.62u 0.33s 34.95t Real: 00:00:10
Total cpu time: 34.68u 0.37s 35.05t Real: 00:00:10
<END
Some
very important features of BLAST's output need to be pointed out. First notice
that the -filter=xs switch causes the troublesome portions of the query
sequence to be ignored in the search. Especially pay attention to the Poisson
distribution scores. These are the likelihoods that the observed matches could
be due to chance. Therefore, the smaller the number, the more significant.
You should be able to see a fairly clear demarcation where the scores quickly
drop off between the significant hits and the background noise. Also realize
that, if we had chosen to display alignments, more than one segment per
sequence can be presented by the BLAST program. Another important point to
notice is, unlike all other GCG programs, the list generated by this search is
not appropriate as input to other GCG analyses. This output list is
actually generated by NCBI's computer which doesn't know about GCG format
requirements. Therefore, if you want to use the results of a BLAST search in
another GCG program you must manually edit the list changing the database names
to reflect the logicals that GCG understands. For example, change
sp|P40250|PRIO_CERAE to SW:P40250 PRIO_CERAE (either insert a blank space or !
between the access code and sequence name). The "gi" designation is all the
translations from GenBank's CDS references. This database, GenPept, is not
installed on VADMS' systems. Therefore, if you want to include any in
subsequent analyses, you must individually translate them from GenBank. For
instance, "gi|190518 (M81929) prion protein [Homo sapiens]" is the CDS
translation from GenBank accession code M81929. This does make life a bit more
complicated but can be worked around.5) What Next? Comparisons, interpretations, and further analyses.
I hate to do this to you, but this exercise must be performed in two parts! In order to proceed you must have the output files from all of the previous searches, excluding the ProfileSearch, available. I do apologize.
When the output from the previous "traditional" batch jobs is available take a look at the files. They will be located within your Exercise 8 subdirectory. Also all batch jobs mail their log files to your account -- you'll want to delete these if all jobs ran to completion. You may want to print out your search results using the lpr command for reference in the remainder of the exercise although it is not necessary. The abridged prion example output files follow below. Naturally, the topmost "hits" are prion proteins; it's the ones just below the expected hits that may prove interesting:
!!SEQUENCE_LIST 1.0
(Peptide) TFASTA of: humprp.pep from: 1 to: 231 November 21, 1996 19:57
TRANSLATE of: humprp.genbank check: 4378 from: 116 to: 808
(mature protein only: PrP 33-35-C)
generated symbols 1 to: 231.
LOCUS HUMPRP 2415 bp ss-mRNA PRI 20-MAY-1987
DEFINITION Human prion protein (PrP) mRNA, complete cds.
ACCESSION M13899 . . .
TO: GenEMBL:* Sequences: 299,711 Symbols: 390,626,675 Word Size: 2
Sequences too short to analyze: 36 (146 symbols)
Databases searched:
gb_ba, Release 97.0, Released on 0Oct96, Formatted on 0Oct96
gb_in, Release 97.0, Released on 0Oct96, Formatted on 0Oct96
///////////////////////////////////////////////////////////////////
em_sy, Release 43.0, Released on 0May1995, Formatted on 0Mar1996
em_un, Release 43.0, Released on 0May1995, Formatted on 0Mar1996
em_vi, Release 43.0, Released on 0May1995, Formatted on 0Mar1996
Searching all six frames.
Scoring matrix: GenRunData:blosum50.cmp
Variable pamfactor used
Gap creation penalty: 16 Gap extension penalty: 4
Histogram Key:
Each histogram symbol represents 3538 search set sequences
Each inset symbol represents 39 search set sequences
z-scores computed from opt scores
z-score obs exp
(=) (*)
< 20 112558 0 :*===============================
22 1 0 :*
24 2 2 :*
26 39 35 :*
28 154 376 :*
30 544 2286 :*
32 1406 8838 := *
34 3872 23968 :== *
36 11128 49226 :==== *
38 30025 81352 :========= *
40 62970 113478 :================== *
42 111653 138713 :================================ *
44 161831 153014 :===========================================*==
46 202943 155848 :============================================*=============
48 212249 149207 :==========================================*=================
50 194974 136152 :======================================*=================
52 154066 119700 :=================================*==========
54 119534 102245 :============================*=====
56 89439 85406 :========================*=
58 64612 70117 :===================*
60 47274 56799 :============== *
62 34419 45536 :========== *
64 27565 36214 :======== *
66 22429 28623 :======= *
68 18245 22514 :======*
70 15137 17643 :====*
72 12621 13787 :===*
74 10690 10749 :===*
76 8597 8366 :==*
78 7044 6503 :=*
80 5928 5049 :=*
82 4916 3863 :=*
84 4054 3060 :*=
86 3296 2367 :*
88 2586 1832 :*
90 2414 1417 :*
92 1912 1097 :* :============================*===========
94 1455 849 :* :=====================*================
96 1271 657 :* :================*================
98 1074 508 :* :=============*==============
100 951 393 :* :==========*==============
102 724 304 :* :=======*===========
104 631 235 :* :======*==========
106 486 182 :* :====*========
108 431 141 :* :===*========
110 350 109 :* :==*======
112 327 84 :* :==*======
114 270 65 :* :=*=====
116 306 51 :* :=*======
118 203 39 :* :*=====
>120 1686 30 :* :*=======================================
Results sorted and z-values calculated from opt score
1505 scores better than 121 saved
359022 optimizations performed
Joining threshold: 36, optimization threshold: 36, opt. width: 16
The best scores are: frame init1 initn opt z-sc E(1659129) ..
gb_pr:ggu15166 Begin: 23 End: 253
! U15166 Gorilla gorilla major prion ...(1) 1686 1686 1686 1670.1 0
gb_pr:ggu08300 Begin: 23 End: 253
! U08300 Gorilla gorilla prion protei...(1) 1686 1686 1686 1670.1 0
gb_pr:humprp Begin: 39 End: 269
! M13899 Human prion protein (PrP) mR...(2) 1690 1690 1690 1667.2 0
gb_pr:ssu08308 Begin: 23 End: 253
! U08308 Symphalangus syndactylus pri...(1) 1680 1680 1680 1664.2 0
gb_pr:ptu08296 Begin: 23 End: 253
! U08296 Pan troglodytes prion protei...(1) 1680 1680 1680 1664.2 0
gb_pr:hlu08299 Begin: 23 End: 253
! U08299 Hylobates lar prion protein ...(1) 1680 1680 1680 1664.2 0
gb_pr:ptu15039 Begin: 23 End: 253
! U15039 Pan troglodytes major prion ...(1) 1680 1680 1680 1664.2 0
gb_pr:ppu08305 Begin: 23 End: 253
! U08305 Pongo pygmaeus prion protein...(1) 1668 1668 1668 1652.5 0
gb_pr:humprp0a Begin: 41 End: 270
! M13667 Human prion protein 27-30 mR...(2) 940 940 1669 1646.7 0
gb_pr:humprion Begin: 41 End: 270
! D00015 Human prion protein mRNA, hu...(2) 940 940 1669 1646.7 0
gb_pr:cgu08297 Begin: 23 End: 253
! U08297 Colobus guereza prion protei...(1) 1655 1655 1655 1639.7 0
gb_pr:pfu08302 Begin: 23 End: 253
! U08302 Presbytis francoisi prion pr...(1) 1651 1651 1651 1635.8 0
/////////////////////////////////////////////////////////////////////////////
! U28334 Oryctolagus cuniculus major ...(1) 997 997 1510 1496.5 0
gb_pr:humprpra Begin: 15 End: 224
! M81929 Human prion protein gene, 5'...(2) 1306 1306 1498 1486.7 0
gb_pr:agu08309 Begin: 21 End: 232
! U08309 Ateles geoffroyi prion prote...(1) 1327 1327 1375 1366.1 0
gb_om:tioprp Begin: 121 End: 353
! L38993 Trichosurus vulpecular prion...(3) 942 942 1174 1165.4 0
gb_ro:musprpc Begin: 1 End: 78
! M30384 Mouse prion protein (PrP27-3...(2) 523 523 523 538.2 5.9e-22
gb_ov:chkprion Begin: 40 End: 219
! M95404 Chicken prion protein gene, ...(1) 330 357 439 447.7 6.6e-17
gb_ov:chkprlp Begin: 85 End: 258
! M61145 Gallus gallus prion-like pro...(1) 324 324 434 438.0 2.3e-16
gb_pr:s80539 Begin: 9 End: 94
! S80539 PRNP=PrP amyloid [human, Gen...(1) 384 384 399 415.5 4.1e-15
gb_pr:s79978 Begin: 5 End: 86
! S79978 prion protein {144-base pair...(1) 384 384 398 415.1 4.3e-15
gb_pr:hspglap3 Begin: 5 End: 68
! Z48604 H.sapiens prion protein gene...(1) 384 384 389 407.4 1.2e-14
gb_pr:hspglap2 Begin: 1 End: 44
! Z48048 H.sapiens prion protein gene...(1) 375 375 379 400.0 3e-14
gb_in:celrbp1 Begin: 268 End: 372
! D10877 C.elegans mRNA for hnRNP lik...(1) 172 172 236 246.4 1.1e-05
gb_pl:s47405 Begin: 101 End: 198
! S47405 glycine-rich protein {clone ...(2) 197 197 202 217.2 0.00045
gb_pl:bo15h11 Begin: 38 End: 136
! Z74892 B.oleracea mRNA for glycine-...(1) 201 201 201 216.0 0.00053
gb_pl:hvgrp3 Begin: 110 End: 174
! Z48624 H.vulgare mRNA for glycine-r...(3) 201 201 202 216.0 0.00053
gb_pl:bngrp22g Begin: 167 End: 260
! Z15045 B.napus GRP22 gene encoding ...(1) 194 194 200 212.6 0.00082
gb_pl:pvgrp18 Begin: 328 End: 434
! X13596 Bean DNA for glycine-rich ce...(1) 180 180 203 212.6 0.00082
gb_pl:tau32310 Begin: 131 End: 196
! U32310 Triticum aestivum single-str...(2) 181 181 198 211.9 0.00089
gb_pr:hskerat9 Begin: 520 End: 624
! Z29074 H.sapiens mRNA for cytokerat...(1) 168 168 202 210.6 0.0011
gb_pl:vurnagrp Begin: 129 End: 235
/////////////////////////////////////////////////////////////////////////////
! J05230 S.cerevisiae fibrillarin (NO...(1) 103 103 130 139.4 9.7
gb_in:sau32943 Begin: 113 End: 212
! U32943 Schistocerca americana Anten...(3) 112 112 131 139.4 9.8
gb_pl:bnamrnaa Begin: 258 End: 338
! L33282 Brassica napus clone STA 41-...(2) 102 102 127 139.3 9.8
gb_pr:hsl60g9b Begin: 619 End: 683
! Z69363 Human DNA sequence from cosm...(5) 97 97 141 139.3 9.9
gb_pr:hummel18 Begin: 347 End: 409
! D13969 Human mRNA for Mel-18 protei...(5) 83 101 129 139.3 9.9
gb_pr:s79325 Begin: 7 End: 72
! S79325 SYT...SSX1 {translocation br...(1) 102 102 121 139.3 9.9
gb_pr:s79332 Begin: 7 End: 72
! S79332 SYT...SSX2 {translocation br...(1) 102 102 121 139.3 9.9
gb_om:shpcrfa Begin: 379 End: 454
! M22853 Sheep corticotropin releasin...(4) 95 95 131 139.2 10
\\End of List
! CPU time used:
! Database scan: 0:40:49.2
! Post-scan processing: 0:00:20.1
! Total CPU time: 0:41:09.5
! Output File: humprp.tfasta
Had we chosen to not suppress aligning the results, the TFastA output would
also show the sequence alignment at each location. The beginning and ending
alignment points along with the translation frames listed in the output can be
used to go back to the original nucleotide entry to check whether the match-
ups correspond to actually translated areas. Notice that the output file is
an acceptable GCG list file that can serve as input to other programs such as
PileUp. A histogram of the score distribution is also displayed in the FastA
outputs. This can be helpful to get a feeling for the statistical
significance of the search and in ascertaining whether you ran your search
list large enough. The histogram can be suppressed with the nohistogram
option if desired. Another thing to notice in the output is that the entries
are sorted by a "z" score parameter based on a normalization of the opt
scores and their distribution from the rest of the database. This z-
score is a bit different than the more traditional Monte Carlo style
distribution Z score that I will introduce below. Here it is calculated from
a simple linear regression against the natural log of the search set sequence
length. Either type can be very helpful as they help describe the statistical
significance of an alignment. Sometimes initial extended word scores,
initn's, are greatly improved after the opt dynamic programming
and normalization pass. A good example is shown in the above output under the
entries HumPrion and HumPrP0A. Both of these entries scored relatively poor
initn scores of 940 versus their final z-score of 1646.7. This
point underscores the importance of using mutliple algorithms.The Expectation function, E(), is the most important column. It is very similar to the Poisson value in BLAST reports and describes the number of search set sequences that would be needed to obtain a z-score greater than or equal to the z-score obtained in any particular search purely by chance; in-other-words, just like with BLAST P-values, the smaller the number, the better. As a rule-of-thumb, for a search against a protein database of about 10,000 sequences, as long as optimization is not turned off, E() scores of less than 0.01 are almost certainly homologous, and scores between 1 to 10 may be, although these guidelines can be skewed by compositional biases.
Sometimes the sequences found by TFastA will not show up in any other searches. This could be valuable information, especially if sometime during your peptide's evolutionary history it incorporated (was "infected" by) any type of mobile DNA element. This comes to mind in my example because of frequent hits within the herpes virus family.
Next the abridged example FastA output file:
% more humprp.fasta
!!SEQUENCE_LIST 1.0
(Peptide) FASTA of: humprp.pep from: 1 to: 231 November 21, 1996 19:14
TRANSLATE of: humprp.genbank check: 4378 from: 116 to: 808
(mature protein only: PrP 33-35-C)
generated symbols 1 to: 231.
LOCUS HUMPRP 2415 bp ss-mRNA PRI 20-MAY-1987
DEFINITION Human prion protein (PrP) mRNA, complete cds.
ACCESSION M13899 . . .
TO: SwissProt:* Sequences: 52,205 Symbols: 18,531,385 Word Size: 2
Databases searched:
swissprot, Release 33.0, Released on 0Feb1996, Formatted on 0Jul1996
Scoring matrix: GenRunData:blosum50.cmp
Variable pamfactor used
Gap creation penalty: 12 Gap extension penalty: 2
Histogram Key:
Each histogram symbol represents 82 search set sequences
Each inset symbol represents 6 search set sequences
z-scores computed from opt scores
z-score obs exp
(=) (*)
< 20 169 0 :*==
22 0 0 :*
24 1 0 :*
26 0 1 :*
28 1 12 :*
30 21 71 :*
32 171 276 :===*
34 605 748 :======== *
36 1562 1536 :==================*=
38 3025 2538 :==============================*======
40 4317 3540 :===========================================*=========
42 4680 4327 :====================================================*=====
44 4888 4773 :==========================================================*=
46 4879 4862 :===========================================================*
48 4830 4655 :========================================================*==
50 4266 4247 :===================================================*=
52 3507 3734 :=========================================== *
54 2938 3190 :==================================== *
56 2464 2664 :=============================== *
58 2052 2187 :==========================*
60 1600 1772 :==================== *
62 1214 1421 :=============== *
64 930 1130 :============ *
66 723 893 :========= *
68 593 702 :========*
70 478 550 :======*
72 362 430 :=====*
74 308 335 :====*
76 224 261 :===*
78 171 203 :==*
80 147 158 :=*
82 125 121 :=*
84 86 95 :=*
86 73 74 :*
88 63 57 :*
90 44 44 :*
92 51 34 :* :=====*===
94 55 26 :* :====*=====
96 48 20 :* :===*====
98 31 16 :* :==*===
100 19 12 :* :=*==
102 25 9 :* :=*===
104 36 7 :* :=*====
106 28 6 :* :*====
108 21 4 :* :*===
110 22 3 :* :*===
112 17 3 :* :*==
114 15 2 :* :*==
116 21 2 :* :*===
118 14 1 :* :*==
>120 271 1 :*=== :*=======================================
Results sorted and z-values calculated from opt score
1985 scores better than 73 saved
41992 optimizations performed
Joining threshold: 36, optimization threshold: 24, opt. width: 16
The best scores are: init1 initn opt z-sc E(51759)..
sw:prio_human Begin: 23 End: 253
! P04156 homo sapiens (human). major ... 1690 1690 1690 1682.2 0
sw:prio_gorgo Begin: 23 End: 253
! P40252 gorilla gorilla gorilla (low... 1686 1686 1686 1678.3 0
sw:prio_pantr Begin: 23 End: 253
! P40253 pan troglodytes (chimpanzee)... 1680 1680 1680 1672.4 0
sw:prio_ponpy Begin: 23 End: 253
! P40256 pongo pygmaeus (orangutan). ... 1668 1668 1668 1660.5 0
/////////////////////////////////////////////////////////////////////////
sw:prio_trast Begin: 26 End: 264
! P40242 tragelaphus strepsiceros (gr... 1352 1352 1546 1539.9 0
sw:prio_cerae Begin: 23 End: 245
! P40250 cercopithecus aethiops (gree... 1335 1335 1545 1539.4 0
sw:prio_rat Begin: 1 End: 226
! P13852 rattus norvegicus (rat). maj... 1396 1396 1524 1519.1 0
sw:prio_atege Begin: 21 End: 232
! P40246 ateles geoffroyi (black-hand... 1327 1327 1381 1377.9 0
sw:prio_chick Begin: 28 End: 270
! P27177 gallus gallus (chicken). maj... 330 363 488 495.8 4.3e-21
sw:rb87_drome Begin: 245 End: 356
! P48810 drosophila melanogaster (fru... 140 140 217 226.3 4.4e-06
sw:k1ci_human Begin: 497 End: 601
! P35527 homo sapiens (human). kerati... 168 168 218 224.4 5.6e-06
sw:grp1_orysa Begin: 60 End: 158
! P25074 oryza sativa (rice). glycine... 168 168 206 220.6 9.1e-06
sw:grp2_orysa Begin: 31 End: 130
! P29834 oryza sativa (rice). glycine... 181 181 206 220.0 9.9e-06
sw:grp1_phavu Begin: 118 End: 217
! P10495 phaseolus vulgaris (kidney b... 181 181 207 219.0 1.1e-05
sw:grp2_phavu Begin: 177 End: 283
! P10496 phaseolus vulgaris (kidney b... 180 180 209 217.2 1.4e-05
sw:grp_arath Begin: 140 End: 244
! P27483 arabidopsis thaliana (mouse-... 200 200 206 216.2 1.6e-05
sw:grp_horvu Begin: 80 End: 154
! P17816 hordeum vulgare (barley). gl... 168 168 197 210.5 3.3e-05
sw:cora_medsa Begin: 58 End: 167
! Q07202 medicago sativa (alfalfa). c... 147 147 195 208.5 4.3e-05
sw:grp2_sinal Begin: 96 End: 166
! P49311 sinapis alba (white mustard)... 147 147 193 207.6 4.8e-05
sw:k22e_human Begin: 536 End: 641
! P35908 homo sapiens (human). kerati... 166 166 200 206.4 5.7e-05
sw:hmpr_human Begin: 50 End: 123
! Q01851 homo sapiens (human). homeob... 139 139 192 202.5 9.3e-05
sw:k1cj_human Begin: 460 End: 556
! P13645 homo sapiens (human). kerati... 159 159 195 201.9 0.0001
sw:grpa_medfa Begin: 39 End: 130
! Q09134 medicago falcata (sickle med... 146 146 186 201.1 0.00011
/////////////////////////////////////////////////////////////////////////
! P27214 bos taurus (bovine). annexin... 80 124 104 113.2 8.8
sw:antf_sarpe Begin: 23 End: 84
! Q08617 sarcophaga peregrina (flesh ... 78 78 93 113.1 8.8
sw:hxb3_human Begin: 148 End: 181
! P14651 homo sapiens (human). homeob... 97 97 103 113.1 8.9
sw:rog_human Begin: 92 End: 224
! P38159 homo sapiens (human). hetero... 59 59 103 113.0 9
sw:hra1_xancv Begin: 211 End: 300
! P80151 xanthomonas campestris (pv. ... 74 74 105 113.0 9
sw:lmp1_ebv Begin: 310 End: 378
! P03230 epstein-barr virus (strain b... 72 95 102 112.8 9.2
sw:dip_drome Begin: 35 End: 87
! P24492 drosophila melanogaster (fru... 54 54 94 112.8 9.3
sw:vasa_drome Begin: 47 End: 117
! P09052 drosophila melanogaster (fru... 74 74 105 112.5 9.6
sw:prp2_human Begin: 159 End: 235
! P02812 homo sapiens (human). saliva... 88 88 99 112.5 9.6
sw:hmux_drops Begin: 104 End: 175
! P20822 drosophila pseudoobscura (fr... 74 74 99 112.3 9.8
\\End of List
! CPU time used:
! Database scan: 0:02:29.7
! Post-scan processing: 0:00:07.4
! Total CPU time: 0:02:37.2
! Output File: humprp.fasta
Both FastA output files show a histogram of the score distribution, then a
sorted list of the top scores and finally, if alignments are not suppressed,
BestFit style alignments of the score list. They both use the same E()
significance statistic in their reports. Next, the abridged prion example
WordSearch output:
% more humprp.word
!!SEQUENCE_LIST 1.0
(Peptide) WORDSEARCH of: /disk2/usr/local/people/thompson/courses/BC578/EX8/hum
prp.pep check: 414 from: 1 to: 231
TRANSLATE of: humprp.genbank check: 4378 from: 116 to: 808
(mature protein only: PrP 33-35-C)
generated symbols 1 to: 231.
LOCUS HUMPRP 2415 bp ss-mRNA PRI 20-MAY-1987
DEFINITION Human prion protein (PrP) mRNA, complete cds.
ACCESSION M13899 . . .
TO: SwissProt:* Sequences: 52,205 Total-length: 18,531,385 November 21, 1996 19:12
Database Release Information:
swissprot, Release 33.0, Released on 0Feb1996, Formatted on 0Jul1996
Word-size: 2 Words: 74263061 Diagonals: 14,481,967 Total-diagonals: 30,538,535
Integral-width: 3 Alphabet: 20 List-size: 300 CPU minutes: 1.81
Sequence Strd Diag Score Width Documentation ..
SW:PRIO_HUMAN + 22 269 25 P04156 homo sapiens (human).
major prion protein precursor (prp) (prp27-30) (prp
SW:PRIO_GORGO + 22 268 25 P40252 gorilla gorilla gorill
a (lowland gorilla). major prion protein precursor
SW:PRIO_PANTR + 22 267 25 P40253 pan troglodytes (chimp
anzee), hylobates lar (common gibbon), and symphala
SW:PRIO_PONPY + 22 264 25 P40256 pongo pygmaeus (orangu
tan). major prion protein precursor (prp) (prp27-30
SW:PRIO_COLGU + 22 261 25 P40251 colobus guereza. major
prion protein precursor (prp) (prp27-30) (prp33-35
SW:PRIO_PREFR + 22 260 25 P40257 presbytis francoisi. m
ajor prion protein precursor (prp) (prp27-30) (prp3
//////////////////////////////////////////////////////////////////////////////
SW:PRIO_TRAST + 33 226 15 P40242 tragelaphus strepsicer
os (greater kudu). major prion protein 1 precursor
SW:PRIO_CERAE + 14 221 14 P40250 cercopithecus aethiops
(green monkey) (grivet), and cercopithecus dianae.
SW:PRIO_ATEGE + 6 216 15 P40246 ateles geoffroyi (blac
k-handed spider monkey). major prion protein precur
SW:GRP1_PETHY + 68 72 30 P09789 petunia hybrida (petun
ia). glycine-rich cell wall structural protein 1 pr
SW:GRP1_PETHY + 198 71 36 P09789 petunia hybrida (petun
ia). glycine-rich cell wall structural protein 1 pr
SW:GRP_ARATH + 137 69 2 P27483 arabidopsis thaliana (
mouse-ear cress). glycine-rich cell wall structural
SW:GRP_ARATH + 133 69 3 P27483 arabidopsis thaliana (
mouse-ear cress). glycine-rich cell wall structural
SW:GRP1_PHAVU + 66 68 7 P10495 phaseolus vulgaris (ki
dney bean) (french bean). glycine-rich cell wall st
SW:GRP_ARATH + 104 68 8 P27483 arabidopsis thaliana (
mouse-ear cress). glycine-rich cell wall structural
SW:PRIO_CHICK + 35 67 35 P27177 gallus gallus (chicken
). major prion protein homolog precursor (pr-lp) (a
SW:GRP1_PETHY + 109 66 33 P09789 petunia hybrida (petun
ia). glycine-rich cell wall structural protein 1 pr
SW:GRP_ARATH + 181 66 6 P27483 arabidopsis thaliana (
mouse-ear cress). glycine-rich cell wall structural
SW:GRP2_PHAVU + 140 65 12 P10496 phaseolus vulgaris (ki
dney bean) (french bean). glycine-rich cell wall st
SW:GRP2_PHAVU + 72 65 4 P10496 phaseolus vulgaris (ki
dney bean) (french bean). glycine-rich cell wall st
SW:GRP_HORVU + 47 65 54 P17816 hordeum vulgare (barle
y). glycine-rich cell wall structural protein precu
SW:GRP_ARATH + 153 65 7 P27483 arabidopsis thaliana (
mouse-ear cress). glycine-rich cell wall structural
SW:GRP_ARATH + 113 65 2 P27483 arabidopsis thaliana (
mouse-ear cress). glycine-rich cell wall structural
//////////////////////////////////////////////////////////////////////////////
SW:K1CJ_BOVIN + 35 45 91 P06394 bos taurus (bovine). k
eratin, type i cytoskeletal vib (cytokeratin vib).
SW:LORI_MOUSE + 293 45 2 P18165 mus musculus (mouse).
loricrin. 2/96
SW:LORI_MOUSE + 240 45 1 P18165 mus musculus (mouse).
loricrin. 2/96
SW:LORI_MOUSE + 149 45 5 P18165 mus musculus (mouse).
loricrin. 2/96
SW:PRIO_CEBAP + 5 45 4 P40249 cebus apella (brown-ca
pped capuchin). major prion protein precursor (prp)
SW:PRIO_CALJA + 5 45 4 P40247 callithrix jacchus (co
mmon marmoset). major prion protein precursor (prp)
SW:YKR3_CAEEL + 4 45 67 P34309 caenorhabditis elegans
. hypothetical 11.3 kd protein c06g4.3 in chromosom
SW:SPD1_NEPCL + 314 45 10 P19837 nephila clavipes (orb
spider). spidroin 1 (dragline silk fibroin 1) (frag
SW:SPD1_NEPCL + 246 45 19 P19837 nephila clavipes (orb
spider). spidroin 1 (dragline silk fibroin 1) (frag
SW:SPD1_NEPCL + 18 45 7 P19837 nephila clavipes (orb
spider). spidroin 1 (dragline silk fibroin 1) (frag
SW:EGG1_SCHJA + 40 44 4 P19470 schistosoma japonicum
(blood fluke). eggshell protein 1 precursor. 5/91
SW:FBRL_LEIMA + 2 44 71 P35549 leishmania major. fibr
illarin. 11/95
Notice
that the various lists have entries in common but also have substantial
differences, in particular WordSearch found tons of glycine-rich protein
segments, some with scores higher than prion segments themselves. This is
another reason as to why it is always a good idea to run as many different
types of analyses as practical. We ran WordSearch with the plot option to
generate a graphic of the score distribution of that procedure. Be sure to
initialize your graphics configuration first (setplot) and then type
figure wordsearch.figure to draw the plot:
% figure wordsearch.figure Figure makes figures and posters by drawing graphics and text together. You can include output from other GCG graphics programs as part of a figure. Process set to plot with VERSATERM-TEK4105 attached to term using the tekd graphic interface. When your VERSATERM-TEK4105 attached to tty is ready, press <Return>. <rtn>

The asterisk on the abscissa indicates your lowest reported score and helps you to decide if you ran your list large enough. Normally you want your list size big enough to include some of the population of random low scores to help you ascertain the significance of the alignments. In my case above the list was plenty large to catch the upward slope of the random scores. The inset shows a blowup of the high score end of the graph -- these are the best alignments found by the program, not the worst!
Unlike the FastA and BLAST programs WordSearch can not illustrate its alignments with sequence pairs. The separate program Segments can do that. We know how your sequence aligns to itself, so first edit your .word file by removing all of the Swiss Protein entry list lines near the top except one or two positive control sequences that refer to your particular protein family. Do not remove any of the reference header section of the file nor those sequences which are somewhat similar to your query. These are the interesting sequences which are obviously not from the same gene family as your selected molecule yet have fairly high scores. Do remove all the sequences below this point all the way to the bottom of the list. These are all the sequences that are obviously totally dissimilar. We are mainly interested here in how some of the not so similar, "twilight zone," sequences align. In order to visualize the alignments run the program Segments; provide it with your WordSearch output file's name; give the new output file the extension .wordpairs. This process is shown in an abridged fashion for the prion example on the next page:
% segments humprp.word Segments aligns and displays the segments of similarity found by WordSearch. What should I call the output file (* humprp.pairs *) ? humprp.w ordpairs Aligning ...........-.. SW:PRIO_ATEGE 232 aa Gaps: 1 Quality: 315.4 / Length: 226 Aligning ...........-. SW:GRP1_PETHY 384 aa Gaps: 6 Quality: 89.5 / Length: 223 Aligning ...........-. SW:GRP1_PETHY 384 aa Gaps: 7 Quality: 80.7 / Length: 216 Aligning ..........-. SW:GRP_ARATH 338 aa Gaps: 2 Quality: 84.9 / Length: 201 Aligning ..........-. SW:GRP_ARATH 338 aa Gaps: 3 Quality: 86.8 / Length: 210 Aligning .........-. SW:GRP1_PHAVU 252 aa Gaps: 5 Quality: 78.5 / Length: 191 Aligning ...........-.. SW:GRP_ARATH 338 aa Gaps: 5 Quality: 88.9 / Length: 222 Aligning ...........-.. SW:PRIO_CHICK 273 aa Gaps: 11 Quality: 150.4 / Length: 245 Aligning ...........-.. SW:GRP1_PETHY 384 aa Gaps: 7 Quality: 89.9 / Length: 266 ///////////////////////////////////////////////////////////////// Aligning ........-. SW:LORI_MOUSE 481 aa Gaps: 4 Quality: 83.3 / Length: 218Display the generated .wordpairs file. Again, the prion example follows in an abridged fashion:
% more humprp.wordpairs
(BestFit) SEGMENTS from: humprp.word February 14, 1996 09:58
(Peptide) WORDSEARCH of: /disk2/usr/local/people/thompson/BC578/EX8/humprp.pep
check: 414 from: 1 to: 231
TRANSLATE of: humprp.genbank check: 4378 from: 116 to: 808
(mature protein only: PrP 33-35-C)
generated symbols 1 to: 231.
LOCUS HUMPRP 2415 bp ss-mRNA PRI 20-MAY-1987
DEFINITION Human prion protein (PrP) mRNA, complete cds. . . .
AvMatch: 0.54 AvMisMatch: -0.40 GapWeight: 2.00 LengthWeight: 0.10 ..
humprp.pep check: 414 from: 1 to: 231
SW:PRIO_ATEGE check: 1948 from: 6 to: 232
P40246 ateles geoffroyi (black-handed spider monkey). major prion protein p
recur
Gaps: 1 Quality: 315.4 Ratio: 1.453 Score: 216 Width: 15 Limits: +/-16
. . . . .
1 KKRPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGG 50
|||||||||||||||||||||||||||||| |||||||||||
16 KKRPKPGGWNTGGSRYPGQGSPGGNRYPPQ.........GGGWGQPHGGG 56
. . . . .
51 WGQPHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGAVV 100
||||||||||||||||||||||||.|||||||||||||||||||||||||
57 WGQPHGGGWGQPHGGGWGQGGGTHNQWNKPSKPKTNMKHMAGAAAAGAVV 106
. . . . .
101 GGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQ 150
|||||||||||||||:||||.|||||||||||.||||||||||:|:|.||
107 GGLGGYMLGSAMSRPLIHFGNDYEDRYYRENMYRYPNQVYYRPVDQYNNQ 156
. . . . .
151 NNFVHDCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERES 200
||||||||||||||||||||||||||||||||||||||||||||||||||
157 NNFVHDCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERES 206
. .
201 QAYYQRGSSMVLFSSPPVILLISFLI 226
||||||||||||||||||||||||||
207 QAYYQRGSSMVLFSSPPVILLISFLI 232
humprp.pep check: 414 from: 1 to: 231
SW:GRP1_PETHY check: 1852 from: 68 to: 384
P09789 petunia hybrida (petunia). glycine-rich cell wall structural protein
1 pr
Gaps: 6 Quality: 89.5 Ratio: 0.441 Score: 72 Width: 30 Limits: +/-31
. . . . .
4 PKPGGWNTGGSRYPGQGSPGGNRYPPQG..GGGWGQPHGGGWGQPHGGGW 51
. .||:..||: .|.|..||. ....| |||:|.. ||| |.. |||
78 GAGGGLGGGGGLGGGGGAGGGGGLGGGGGAGGGFGGGAGGGAGGGLGGGG 127
. . . . .
52 GQPHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGAVVG 101
| . ||| |.. ||| |.|:|. :.:. .: ::|:|:||:.||
128 GLGGGGGGGAGGGGGVGGGAGSGGGFGAGGG.......VGGGAGAGGGVG 170
. . . . .
102 GLGGYMLGSAMSRPIIHFGSDYEDRY.YRENMHRYPNQVYYRPMDEYSNQ 150
| ||: |:: : : ||:.:: : :.: :.... .::: :..
171 GGGGFGGGGGGG...VGGGSGHGGGFGAGGGVGGGAGGGLGGGVGGGGGG 217
. . . . .
151 NNFVHDCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERES 200
.. . :.:. .....|:.|...:. :: .|:. . .. : ::
218 GSGGGGGIG........GGSGHGGGFGAGGG..VGGGVGGGAAGGGGGGG 257
. .
201 QAYYQRGSSMVLFSSPPVILLIS 223
.: . |:::. |:... : :
258 GGGGGGGGGLGGGSGHGGGFGAG 280
humprp.pep check: 414 from: 1 to: 231
SW:GRP1_PETHY check: 1852 from: 198 to: 384
P09789 petunia hybrida (petunia). glycine-rich cell wall structural protein
1 pr
Gaps: 7 Quality: 80.7 Ratio: 0.453 Score: 71 Width: 36 Limits: +/-37
. . . . .
4 PKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQ 53
. .||:..|.: .|.||.|| || | | .||||:|. ||| |.
202 GAGGGLGGGVGGGGGGGSGGG......GGIGGGSGHGGGFGA..GGGVGG 243
. . . . .
54 PHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAG.AAAAGAVVGG 102
. ||| :.. ||| |.|||. :.:. .| . ..: :| :::|::.|||
244 GVGGGAAGGGGGGGGGGGGGGGGLGGGSGHGGGFGAGGGVGGGAAGGVGG 293
. . . . .
103 LGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNN 152
||: |:: : : ||:.:: : ....
294 GGGFGGGGGGG...VGGGSGHGGGF.....................GAGG 319
. . . . .
153 FVHDCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERESQA 202
| :... .: ...... |:.:.:... :: .|: ...: : .|
320 GVGGGAGGGLG..GGGGAGGGGGIGGGHGGGFGVGVG....IGIGVGVGA 363
.
203 YYQRGSSMVLFSSPPV 218
.:| ::. |:...
364 GAGHGVGVGSGSGSGG 379
humprp.pep check: 414 from: 1 to: 231
SW:GRP_ARATH check: 6121 from: 137 to: 338
P27483 arabidopsis thaliana (mouse-ear cress). glycine-rich cell wall struc
tural
Gaps: 2 Quality: 84.9 Ratio: 0.433 Score: 69 Width: 2 Limits: +/-3
. . . . .
4 PKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQ 53
. .|| ..||: .|:|:. |. ...:|||:|..|||| |.. |||.|.
137 GAGGGAGGGGGLGGGHGGGIGGGAGGGAGGGLGGGHGGGIGGGAGGGSGG 186
. . . . .
54 PHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMA....GAAAAGAV 99
. ||| |.. ||| |.|||. :. . .: . ..: | |::|:|:.
187 GLGGGIGGGAGGGAGGGGGAGGGGGLGGGHGGGFGGGAGGGLGGGAGGGT 236
. . . . .
100 VGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSN 149
.||:|| |:| : : ||:: :: : .... : ::::.
237 GGGFGGGAGGGAGGGAGGGFGGGAGGGAGGGFGGGAGGGAGGGAGGGFGG 286
. . . . .
150 QNNFVH.DCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYER 198
... .| :.|. .: . .... . |.. ...:. : ..:. . ..::
287 GAGGGHGGGVGGGFGGGSGGGFGGGAGGGAGGGAGGGFGGGGGAGGGFGG 336
199 E 199
:
337 G 337
humprp.pep check: 414 from: 1 to: 231
SW:GRP_ARATH check: 6121 from: 133 to: 338
P27483 arabidopsis thaliana (mouse-ear cress). glycine-rich cell wall struc
tural
Gaps: 3 Quality: 86.8 Ratio: 0.430 Score: 69 Width: 3 Limits: +/-4
. . . . .
4 PKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQ 53
. .|| ..||: .|:|:. |. ...:|||:|..|||| |.. |||.|.
137 GAGGGAGGGGGLGGGHGGGIGGGAGGGAGGGLGGGHGGGIGGGAGGGSGG 186
. . . . .
54 PHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGAVVGGL 103
. ||| |.. ||| |.|||. :. . .: . ..: ||:: :|:..||
187 GLGGGIGGGAGGGAGGGGGAGGGGGLGGGHGGGFGGGAGGGLGGGAGGGT 236
. . . . .
104 GGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNF 153
|| : |:| : : |::::: :. :... : :: :....
237 GGGFGGGAGGGAGGGAGGGFGGGAGGGAGGGFGGG....AGGGAGGGAGG 282
. . . . .
154 VHDCVNITIKQHTVT...TTTKGENFTETDVKMMERVVEQMCITQYERES 200
. :... . .. .|. ....|:.|.:.. : ..:. . ..:: ::
283 GFGGGAGGGHGGGVGGGFGGGSGGGFGGGA....GGGAGGGAGGGFGGGG 328
.
201 QAYYQRGSSM 210
.| . |:::
329 GAGGGFGGGF 338
humprp.pep check: 414 from: 1 to: 231
SW:GRP1_PHAVU check: 4254 from: 66 to: 252
P10495 phaseolus vulgaris (kidney bean) (french bean). glycine-rich cell wa
ll st
Gaps: 5 Quality: 78.5 Ratio: 0.444 Score: 68 Width: 7 Limits: +/-8
. . . . .
4 PKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQ 53
. .:|:..:|: ..|.|: ||. ...:|||:|.. | | |:..|| |.
76 GAGAGYGAAGGGHGGGGGNGGGGGGGADGGGYGGGAGKGGGEGYGG..GG 123
. . . . .
54 PHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGAVVGGL 103
::|||:|. |||.|.|||. .. . .| .. ||:: :|| .||
124 ANGGGYGG..GGGSGGGGGGGAG.GAGSGYGGGEGSGAGGGYGGANGGGG 170
. . . . .
104 GGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNF 153
|| |:: :.. | |:. :: |. . :...| . .: :....
171 GGNGGGGGGGSGGAHGGGAAGG...GEGAGQGAGGGYGGGAAGGGGRGSG 217
. . . .
154 VHDCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCIT 194
. :..... ....:|..:.:.:.. : . :. :..
218 GGGGGGYG......GGGARGSGYGGGGGSGEGGGHGGGYYP 252
humprp.pep check: 414 from: 1 to: 231
SW:GRP_ARATH check: 6121 from: 104 to: 338
P27483 arabidopsis thaliana (mouse-ear cress). glycine-rich cell wall struc
tural
Gaps: 5 Quality: 88.9 Ratio: 0.415 Score: 68 Width: 8 Limits: +/-9
. . . . .
6 PGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGW.... 51
.|| ..| : :|.||.|| . .||:| |.. |||:|..||||
111 GGGHGGGIGGGAGGGSGGGLGGGIGGGAGGGAGGGGGLGGGHGGGIGGGA 160
. . . . .
52 GQPHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGAVVG 101
|.. |||:|..|||| |.|:|. |. . .: .. ||::::::..|
161 GGGAGGGLGGGHGGGIGGGAGGGSGGGLGGGIGGGAGGGAGGGGGAGGGG 210
. . . . .
102 GLG...GYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYS 148
||| | :|:: : .: ::: .: : :. :... : ::::
211 GLGGGHGGGFGGGAGGGLGGGAGGGTGGGFGGGAGGGAGGG...AGGGFG 257
. . . . .
149 NQNNFVHDCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYER 198
.... . ....:. . ...... |:.|.:... : .|: ..::
258 GGAG.GGAGGGFGGGAGGGAGGGAGGGFGGGAGGGHGGGVG....GGFGG 302
. .
199 ESQAYYQRGSSMVLFSSPPVIL 220
:|.: :. |.: . :::.. :
303 GSGGGFGGGAGGGAGGGAGGGF 324
humprp.pep check: 414 from: 1 to: 231
SW:PRIO_CHICK check: 9640 from: 35 to: 273
P27177 gallus gallus (chicken). major prion protein homolog precursor (pr-l
p) (a
Gaps: 11 Quality: 150.4 Ratio: 0.674 Score: 67 Width: 35 Limits: +/-36
. . . . .
8 GWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQ....PHGGGWGQ 53
||..|: | |: ....| .. |. . . | ||..|:.: ||..|:.|
35 GWGAGSHRQPSYPRQPGYPHNPGYPHNPGYPHNPGYPHNPGYPHNPGYPQ 84
. . . . .
54 ....PHGGGWGQPHGGGWGQGGGTHSQWNKPSK.PKTNMKHMAGAAAAGA 98
||..|: .. |.|:..::|. : .||.| ||||:||:||||||||
85 NPGYPHNPGY.PGWGQGYNPSSGGSYHNQKPWKPPKTNFKHVAGAAAAGA 133
. . . . .
99 VVGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYS 148
|||||||| :|..|| .||:|. | |::.|| ||||.|||| :||
134 VVGGLGGYAMGRVMSGMNYHFDSPDEYRWWSENSARYPNRVYYR...DYS 180
. . . . .
149 N...QNNFVHDCVNITIKQHTVTTTTK........GENFTETDV..KMME 185
. |: || ||.|||:.:..:....| :.| ||.:: |::.
181 SPVPQDVFVADCFNITVTEYSIGPAAKKNTSEAVAAANQTEVEMENKVVT 230
. . . .
186 RVVEQMCITQYERESQAYYQRGSSMVLFSSPPVILLISFLIFLIV 230
:|: :||: || || |. :|:: | |: .:| :|::|..
231 KVIREMCVQQY.RE....YRLASGIQLH..PADTWLAVLLLLLTT 268
///////////////////////////////////////////////////////////////////////////
humprp.pep check: 414 from: 1 to: 231
SW:LORI_MOUSE check: 8318 from: 165 to: 481
P18165 mus musculus (mouse). loricrin. 5/92
Gaps: 4 Quality: 83.3 Ratio: 0.397 Score: 60 Width: 3 Limits: +/-4
. . . . .
4 PKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQ 53
...|| ..||:.|.|.:|.||.. ...||:| |. ||| | . |||::.
169 SSCGGGSGGGGSYCGGSSGGGSSGGCGGGSGGGKYSGGGGGSSCGGGYSG 218
. . . . .
54 PHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGAVVGGL 103
. |:: | . |||::.|||. :. . .: . .. . :|:..:|:..|:
219 GGGSSGGSSCGGGYSGGGGSSCG.GGGGYSGGGGTSCGGGSSGGGGGGSS 267
. . . . .
104 GGYM...LGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQ 150
..| .|:: | ||.::: | :. .:... |.. :: | .
268 QQYQCQSYGGGSSG.....GSSCGGGYSGGGGSSCGGG..YSGGGGSSCG 310
. . . . .
151 NNFVHDCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERES 200
.. :. ... .. .. ... |:. .:.. : . ..| . ..
311 GGSSGGGSSCGGSGGGGYSGGGGGSCGGGSSGGGGGYYSSQQTSQTSCAP 360
.
201 QAYYQRGSSMVLFSSPPV 218
|. |. ||| . |:...
361 QQSYGGGSSGGGGSCGGG 378
It's
interesting to note that most of the non-prion hits in the WordSearch output
are glycine-rich cell wall proteins, whereas the other approaches found this
protein plus several other unrelated ones. WordSearch does have problems with
the low-complexity, repeat types of sequence that we were able to filter out
with BLAST, here that predominance of glycine repeats. Also notice that
multiple segments for many sequences are found. WordSearch has largely been
superseded by FastA and BLAST approaches but does remain online as an
alternative. As stated earlier, it is good to gather as much different data as
you can. One situation where WordSearch does excell is in finding difficult
DNA alignments such as aligning cDNA to genomic DNA. Peter Rice, formerly at
EMBL, Heidelberg, and now at the Sanger Centre, U.K., wrote the following
concerning this aspect of WordSearch:
To: info-gcg@net.bio.net From: rice@embl-heidelberg.de (Peter Rice) Subject: Re: Way to gap around a 100 bp insertion? Date: 11 Sep 93 13:57:19 GMT In article <CD3st6.B28@watserv2.uwaterloo.ca>, kowalski@sciborg.uwaterloo.ca (Paul Kowalski) writes: > Is there some clever method to gap "around" a 100 bp insertion in genomic > DNA? Lowering weights doesn't seem to help. Sure. There is a program that does just that. Remember WORDSEARCH? The program everyone used to use before FASTA . . . came along for GCG. WORDSEARCH will find any decent sized exon (18 bases or so will get lost in the noise of course). SEGMENTS is simply a BESTFIT run tied to each of the hits. . . . Use WORDSEARCH with a list size of 2 if you only have 2 exons. Use the cDNA as the search sequence, and the genomic sequence as the "database". I use a higher list size for safety, then edit out the lower scores from the .WORD output file. Use SEGMENTS to do the alignments. WORDSEARCH is also wonderful for understanding fragments or contigs that fail to overlap in Fragment Assembly. I still wouldn't recommend WORDSEARCH for database searches, but for just a few sequences (fragment assembly) or just one with several hits (as above) it beats any other program. You can lower the gap weights in SEGMENTS if you like. I often do for Fragment Assembly matching, where you can expect a large number of 1-base gaps (insertions or deletions) so you need a very low gap weight.
Now take three completely different, relatively high scoring sequences, one each, from each of the programs BLAST, FastA, and WordSearch, which do not obviously belong to the same gene family as your protein. These should be those sequences that score the highest of all the dissimilar matches but not be the same thing as your selected molecule. Be sure that each is a different type of entry from each of the searches. There are no "correct" answers here, we just want to see some interesting comparisons. Use typedata -reference to read about each sequence, if you are interested in what they are. We are going to experiment with various methods of analyzing similarity significance with these sequences. Relevant lines from my three example search files follow.
From the BLAST search, the top-most, non-obious match:
pir||S14078 amyloid protein, 11K - human 115 1.1e-09 1
This one actually does turn out to be a misnamed prion protein, but I will leave it in the list in order to show the difference between significant and insignificant match-ups.
From the FastA search, something completely different:
sw:rb87_drome Begin: 245 End: 356
! P48810 drosophila melanogaster (fru... 140 140 217 226.3
4.4e-06
And from the WordSearch output another totally different protein:
SW:GRP1_PETHY + 68 72 30 P09789 petunia hybrida (petunia). glycine-rich cell wall structural protein 1 pr
6) Dot matrix methods.
Dot matrix analysis is one of the few ways to identify other elements beyond what dynamic programming algorithms show to be similar between two sequences. GCG implements dot matrix methods with two programs. Compare generates the data that serves as input to DotPlot which actually draws the matrix. Compare your query sequence to each of its three near neighbors (as described above) using these methods. Start the program by typing compare; supply it with the appropriate sequence names. In general, put the longer sequence along the horizontal axis. Just as in all windowing algorithms, you want to use a window size of approximately the same size as the feature that you're trying to recognize. Leave the window at its default setting of 30 for these runs, unless one of your sequence is too short for this size of a window to find much. Rerun the program increasing the stringency of the comparisons until the number of points generated is of the same order of magnitude as the length of the longest sequence being compared. Below I found that the default stringency score of 33 out of a window of 30 resulted in 516 points -- the same magnitude as the sequences' lengths being compared. In this case, wherever the average of match scores within the window is equal to or exceeds 33, a point will be drawn at the middle of the window, then the window will be slid over one position at which point the process is repeated. You will be saving your three best output files from this program -- one comparing your protein sequence to an interesting non-family member found by BLAST, one to an analogous find by FastA, and finally the best non-Selected Molecule match found by WordSearch. I will illustrate two examples:
% compare
Compare compares two protein or nucleic acid sequences and creates a
file of the points of similarity between them for plotting with
DotPlot. Compare finds the points using either a
window/stringency or a word match criterion. The word comparison
is 1,000 times faster than the window/stringency comparison, but
somewhat less sensitive.
COMPARE what horizontal sequence ? humprp.pep
Begin (* 1 *) ? <rtn>
End (* 231 *) ? <rtn>
to what vertical sequence (* humprp.pep *) ? pir:s14078
Begin (* 1 *) ? <rtn>
End (* 253 *) ? <rtn>
What comparison window size (* 30 *) ? <rtn>
What stringency (* 33 *) ? <rtn>
What should I call the output file (* humprp.pnt *) ? humprp-s14708.pnt
Number of points: 516
Writing ..
After
you have worked out appropriate stringency parameters, provide the .pnt
file as input to DotPlot. Type dotplot and provide the correct
filename; accept all program defaults. One exception would be if you are
comparing a very long sequence to a very short one, then decreasing the
plotting density will spread the graphic onto more than one page allowing you
to 'zoom in' on the section of interest. The prion cross human amyloid protein comparison example follows:
% dotplot
DotPlot makes a dot-plot with the output file from Compare,
FoldRNA, or StemLoop.
Process set to plot with VERSATERM-TEK4105 attached to term:
using the tekd graphic interface.
DOTPLOT what point file ? humprp-s14708.pnt
humprp-s14078.pnt contains COMPARE results of
Axis Name Check Start End Dir
Horizontal humprp.pep 414 1 231 for
Vertical s14078 8781 1 253 for
Window . . . . . . . . . 30
Stringency . . . . . . . 33.0
Number of points . . . . 516
Percent of possible . . 0.883
The minimum density for a one-page plot is
289.8 bases/100 platen units on each axis.
What point density would you like (* 289.8 *) ? <rtn>
DOTPLOT will take 1 pages. Would you like to: <rtn>
P)lot the points
D)ifferent density
G)et another point file to plot
Q)uit
Please select one (* P *): <rtn>
When your VERSATERM-TEK4105 attached to tty is ready, press <Return>. <rtn>
The human prion cross human amyloid protein dotplot follows:

P)lot the points
D)ifferent density
G)et another point file to plot
Q)uit
Please select one (* Q *): <rtn>
Notice
that running the comparison at an appropriate stringency, in this case 33 in a
window of 30, produces a very clean plot with very little confusing noise.
Still, sometimes interpreting a dotplot can be a major accomplishment in itself
-- just remember that diagonals are regions of similarity between the two
sequences and that any diagonal off the main center line is indicative of
regions which do not correspond in linear placement between the two sequences
yet are still similar. The regions of similarity between the human prion
protein and the human amyloid protein are very clear in the dotplot. In fact,
the amyloid protein turns out to be a variant prion molecule; the extensive
full-length similarity confirms this. Also notice the strong direct repeat
region; a sequence located near the beginning of the proteins occurs at least
six times in each sequence. Columns and rows of diagonals always mean direct
repeat sequences.To contrast the extensive similarity seen above, here's the dotplot of the best non-prion match-up from FastA, the ribonucleoprotein RB87 from Drosophila. I ran this comparison at a stringency of 40 with the default window size of 30 to generate 719 points:

Here similarity is restricted to an area of direct repeats -- a doublet section near the beginning of the prion repeats itself several times in the carboxy half of RB87. When running all the dotplots, take notes of those particular regions in the proteins that appear significantly similar. For example, as noted in the above plot, the region of prion sequence from about residue 20 through 60 is quite similar to and repeats itself in RB87 sequence from around residue 210 through 320. We will need these numbers in the next section. Rerun all three of your DotPlots with appropriate stingencies, however, add the option -figure=(your name).dotplot1, 2, or 3 respectively to the command line of each.
7) The dynamic programming alignment algorithms: use the right one for the right job -- Gap and BestFit.
You need to understand the difference between these two algorithms! Gap is more a 'global' alignment scheme and BestFit more 'local.' Using one versus the other implies that you are looking for distinctly different relationships. Know what they mean. If you already know that the full length of two sequences are pretty close, that they probably belong to the same family, then Gap is the program for you; if you only suspect an area of one is similar to another, then you should use BestFit. To force BestFit to be even more local, try specifying a more stringent alternative symbol comparison table, such as pam250.cmp or blosum100.cmp located in the logical directory GenMoreData. Both programs can generate 'gapped' output files in standard sequence formats; this can be very handy as direct input to other GCG routines -- particularly multiple sequence analysis programs.
What is significant? GCG provides a way to estimate significance in these two programs. When running them specify the option randomizations on the command line and the second input sequence will be jumbled a set number of times after the initial alignment is produced. Comparing the quality scores of the randomized alignments to the initial alignment can help you get a feeling for the relative meaning of the scores. An old rule-of-thumb that people often use is, if the actual score is much more than three standard deviations above the mean of the randomized scores, the analysis may be significant, if it is more than five, then it most probably is significant, if it's above around nine, then certainly so. This distance above the mean is often known as a "Z score" and can be calculated with the following formula:
Z score = [ ( actual score ) - ( mean of randomized scores ) ]
% bestfit -random=100
BestFit makes an optimal alignment of the best segment of similarity
between two sequences. Optimal alignments are found by inserting gaps to
maximize the number of matches using the local homology algorithm of
Smith and Waterman.
BESTFIT of what sequence 1 ? humprp.pep
Begin (* 1 *) ? 10
End (* 231 *) ? 220
to what sequence 2 (* humprp.pep *) ? pir:s14078
Begin (* 1 *) ? 30
End (* 253 *) ? 240
What is the gap creation penalty (* 12 *) ? <rtn>
What is the gap extension penalty (* 4 *) ? <rtn>
What should I call the paired output display file (* humprp.pair *) ? humprp-s
14078.pair
Aligning ..........-.
Aligning ..........-..
Gaps: 0
Quality: 1195
Quality Ratio: 5.718
% Similarity: 100.000
Length: 209
Randomized alignment Quality
1 46
2 49
3 41
4 45
5 31
6 54
7 60
8 54
9 36
10 43
11 36
12 32
13 49
/////////////////////////////////////////
96 52
97 43
98 47
99 53
100 55
Average quality based on 100 randomizations: 44.2 +/- 7.7
The
output file is shown below. In this example's case the sequences are identical
within the stretch being compared and the randomized quality scores are all
much lower than the actual score. The Z score comes out at almost 150;
therefore, the interpretation is that the similarity is extremely
significant.
BESTFIT of: humprp.pep check: 414 from: 10 to: 220
TRANSLATE of: humprp.genbank check: 4378 from: 116 to: 808
(mature protein only: PrP 33-35-C)
generated symbols 1 to: 231.
LOCUS HUMPRP 2415 bp ss-mRNA PRI 20-MAY-1987
DEFINITION Human prion protein (PrP) mRNA, complete cds.
ACCESSION M13899 . . .
to: s14078 check: 8781 from: 30 to: 240
P1;UJHU - major prion protein precursor - human
N;Alternate names: 11K amyloid protein; 27-30K sialoglycoprotein; PrP 27-30;
PrP 33-35C; scrapie prion protein
C;Species: Homo sapiens (man)
C;Date: 25-Oct-1987 #sequence_revision 12-Apr-1996 #text_change 12-Apr-1996
C;Accession: A24173; A40372; A05017; S14078
R;Kretzschmar, H.A.; Stowring, L.E.; Westaway, D.; Stubblebine, W.H.; Prusiner,
S.B.; Dearmond, S.J
Symbol comparison table: /disk2/usr/local/soft/seq/gcgf9/gcgcore/data/rundata/b
losum62.cmp
CompCheck: 6430
Gap Weight: 12 Average Match: 2.912
Length Weight: 4 Average Mismatch: -2.003
Quality: 1195 Length: 209
Ratio: 5.718 Gaps: 0
Percent Similarity: 100.000 Percent Identity: 100.000
Average quality based on 100 randomizations: 44.2 +/- 7.7
Match display thresholds for the alignment(s):
| = IDENTITY
: = 2
. = 1
humprp.pep x s14078 November 25, 1996 13:28 ..
. . . . .
10 NTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQPHGGGW 59
||||||||||||||||||||||||||||||||||||||||||||||||||
32 NTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQPHGGGW 81
. . . . .
60 GQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGAVVGGLGGYMLG 109
||||||||||||||||||||||||||||||||||||||||||||||||||
82 GQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGAVVGGLGGYMLG 131
. . . . .
110 SAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCVN 159
||||||||||||||||||||||||||||||||||||||||||||||||||
132 SAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCVN 181
. . . . .
160 ITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERESQAYYQRGSS 209
||||||||||||||||||||||||||||||||||||||||||||||||||
182 ITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERESQAYYQRGSS 231
210 MVLFSSPPV 218
|||||||||
232 MVLFSSPPV 240
It
turns out that the amyloid protein that we were testing is an aberrant prion
particle, in spite of it's name, most often associated with Alzheimer's disease
associated proteins, and this is why the significance level is so high. This
underescores the importance of investigating any entries that appear
interesting. The weird thing is how far down the BLAST output list S14078
was.However, often a seemingly good alignment will not be significant upon further inspection -- do not blindly accept the output of any computer program! Always investigate further for similarities can be strictly artifactual. The second comparison that I chose, sw:rb87_drome against the human prion protein, turned out to be entirely insignificant. From the previous DotPlot analysis, I saw that the region of humprp.pep from about residue 20 through 60 repeats strongly in RB87 around residue 210 through residue 320. A Monte Carlo analysis of this comparison is shown in an abridged fashion below. The Z score is -0.3:
% bestfit -random=100 humprp.pep sw:rb87_drome
BestFit makes an optimal alignment of the best segment of similarity
between two sequences. Optimal alignments are found by inserting gaps to
maximize the number of matches using the local homology algorithm of
Smith and Waterman.
Begin (* 1 *) ? 20
End (* 231 *) ? 60
Begin (* 1 *) ? 210
End (* 386 *) ? 320
What is the gap creation penalty (* 12 *) ? <rtn>
What is the gap extension penalty (* 4 *) ? <rtn>
What should I call the paired output display file (* humprp.pair *) ? humprp-r
b87.pair
Aligning .....-.
Aligning .....-.
BESTFIT of: humprp.pep check: 414 from: 20 to: 60
to: rb87_drome check: 2689 from: 210 to: 320
ID RB87_DROME STANDARD; PRT; 386 AA.
DE HETEROGENEOUS NUCLEAR RIBONUCLEOPROTEIN 87F (HRP36.1 PROTEIN) (P11 . . .
Symbol comparison table: /disk2/usr/local/soft/seq/gcgf9/gcgcore/data/rundata/b
losum62.cmp CompCheck: 6430
Gap Weight: 12 Average Match: 2.912
Length Weight: 4 Average Mismatch: -2.003
Quality: 78 Length: 41
Ratio: 1.902 Gaps: 0
Percent Similarity: 41.463 Percent Identity: 41.463
Average quality based on 100 randomizations: 80.3 +/- 8.0
Match display thresholds for the alignment(s):
| = IDENTITY
: = 2
. = 1
humprp.pep x rb87_drome November 25, 1996 13:55 ..
. . . .
20 GSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQPHGGGWG 60
| | . ||| || ||| || .| ||| |
223 GGGWGGQNRQNGGGNWGGAGGGGGFGNSGGNFGGGQGGGSG 263
The
extreme glycine richness of both sequences is what the programs found, yet the
Monte Carlo test suggests that this similarity is not at all significant, it is
merely the result of compositional bias. The program Seg is now available
outside of BLAST for prefiltering your sequences. This may be particularly
prudent in situations such as the prion molecule where you know that a lot of
repeat and low complexity sequence compostion has the potential to confound
search algorithms.8) And finally -- interpreting Profile analysis -- why even bother?
The Profile method enables the researcher to recognize features that may be otherwise invisible. The greatly enhanced information content of a Profile over individual sequences has the potential to find similar motifs in sequences which may be only distantly related. Even though ProfileSearches do require some work to setup and run -- a meaningful multiple sequence alignment must be assembled and refined, ProfileMake needs to be run, and the search job itself takes quite a long time to run -- it is worth it.
What PROFILE analysis can show us.
When your ProfileSearch finally finishes, take a look at the output. There is a strong chance that some of the sequences in it were not found by other search algorithms. The abridged output file from the example ProfileSearch is shown below:
(Peptide) PROFILESEARCH of: /disk2/usr/local/people/thompson/courses/BC578/EX8/p
rion.prf Length: 251 to: SwissProt:*
Scores are corrected for composition effects
Gap Weight: 33.00
Gap Length Weight: 0.37
Sequences Examined: 49571
CPU time (seconds): 9826
* * * * * * * * * * * * * * * *
Profile information:
(Peptide) PROFILEMAKE v4.50 of: prion.msf{*} Length: 251
Sequences: 23 MaxScore: 1416.57 November 20, 1996 11:38
Gap: 1.00 Len: 1.00
GapRatio: 0.33 LenRatio: 0.10
prion.msf{PRIO_BOVIN} From: 11 To: 261 Weight: 1.00
prion.msf{PRIO_TRAST} From: 11 To: 261 Weight: 1.00 . . .
* * * * * * * * * * * * * * * *
Normalization: November 21, 1996 19:09
Curve fit using 42 length pools
0 of 42 pools were rejected
Normalization equation:
Calc_Score = 276.50 * ( 1.0 - exp(-0.0103*SeqLen + 0.4126) )
Correlation for curve fit: 0.902
Z score calculation:
Average and standard deviation calculated using 49534 scores
37 of 49571 scores were rejected
Z_Score = ( Score/Calc_Score - 1.017 ) / 0.175
Sequence Strd ZScore Orig Length ! Documentation ..
SW:PRIO_CALMO + 26.89 1378.13 241 ! P40248 callicebus moloch (dusky
titi). major prion protein precursor (prp) (prp27-3
SW:PRIO_MANSP + 26.79 1373.70 241 ! P40255 mandrillus sphinx (mandr
ill) (papio sphinx). major prion protein precursor (
SW:PRIO_AOTTR + 26.68 1364.99 239 ! P40245 aotus trivirgatus (night
monkey) (douroucouli). major prion protein precurso
SW:PRIO_COLGU + 26.60 1388.79 253 ! P40251 colobus guereza. major p
rion protein precursor (prp) (prp27-30) (prp33-35c).
////////////////////////////////////////////////////////////////////////////////
SW:PRP2_TRAST + 25.94 1366.14 256 ! P40243 tragelaphus strepsiceros
(greater kudu). major prion protein 2 precursor (pr
SW:PRIO_TRAST + 25.89 1377.09 264 ! P40242 tragelaphus strepsiceros
(greater kudu). major prion protein 1 precursor (pr
SW:PRIO_MESAU + 25.84 1358.16 254 ! P04273 mesocricetus auratus (go
lden hamster). major prion protein precursor (prp) (
SW:PRIO_RAT + 24.99 1266.69 226 ! P13852 rattus norvegicus (rat).
major prion protein (prp) (fragment). 2/95
SW:YACG_ECOLI + 12.96 86.72 50 ! P36681 escherichia coli. hypothe
tical 5.8 kd protein in mutt-guac intergenic region
SW:POR_GLUOX + 11.67 80.79 50 ! P80354 gluconobacter oxydans. po
lyol:nadp oxidoreductase (ec 1.1.1.-) (fragment). 1
SW:INS_GADCA + 11.64 88.42 51 ! P01336 gadus callarias (baltic c
od). insulin. 2/95
SW:YKQ2_CAEEL + 11.52 80.07 50 ! P34297 caenorhabditis elegans. h
ypothetical 5.0 kd protein c06e1.2 in chromosome ii
SW:YKQ5_CAEEL + 11.35 79.31 50 ! P34300 caenorhabditis elegans. h
ypothetical 5.0 kd protein c06e1.5 in chromosome ii
SW:YP55_ECOLI + 11.26 94.04 52 ! P14505 escherichia coli. hypothe
tical 5.5 kd protein in replication origin region.
SW:INS2_BATSP + 11.16 78.40 50 ! P01338 batrachoididae sp. (toadf
ish). insulin 2. 2/95
SW:YCUA_LACCU + 11.15 85.94 51 ! P35923 lactobacillus curvatus. h
ypothetical 6.1 kd protein in cura 3'region (orfb).
SW:MYX4_CRODU + 10.90 84.67 51 ! P24334 crotalus durissus terrifi
cus (south american rattlesnake). myotoxin 4 precur
SW:INS_MYOSC + 10.79 76.70 50 ! P07453 myoxocephalus scorpius (s
horthorn sculpin) (daddy sculpin). insulin. 2/95
SW:YF69_HAEIN + 10.77 76.62 50 ! P44258 haemophilus influenzae. h
ypothetical protein hi1569. 11/95
SW:YIAZ_ECOLI + 10.68 76.21 50 ! P37305 escherichia coli. hypothe
tical 6.0 kd protein in cspa-glys intergenic region
SW:PND2_ECOLI + 10.67 76.17 50 ! P16477 escherichia coli. pnd pro
tein. 8/91
SW:PND1_ECOLI + 10.62 75.93 50 ! P11902 escherichia coli. pnd pro
tein. 11/95
SW:VG38_BPML5 + 10.57 75.69 50 ! Q05248 mycobacteriophage l5. gen
e 38 protein (gp38). 2/94
SW:INS_ONCGO + 10.46 75.18 50 ! P23187 oncorhynchus gorbuscha (p
ink salmon) (humpback salmon), and oncorhynchus kis
//////////////////////////////////////////////////////////////////////////////
ProfileSearch
Z scores are normalized and reflect the significance of the results. Here,
rather than randomizing sequences to evaluate the Z score as is done in the
above Monte Carlo method, it is calculated, as in BLAST, based on all of the
nonsimilar sequences of the database search. As before, generally Z scores
below 3 are not worth considering, those from 4 to 7 may be interesting, and
those above 7 should definitely be checked out further. The program
ProfileSegments makes BestFit style alignments of the results of a
ProfileSearch. A great new option in ProfileSegments allows you to prepare a
multiple sequence alignment of the ProfileSearch segments. Take advantage of
the -msf option in your run below. The full information content of the
profile is utilized in these alignment procedures. When your ProfileSearch has
finished running, use pico to edit its output file the same as you did the
.word file (remember -- keep a couple of reference sequences and several
"interesting, twilight zone" comparisons and get rid of everything else, except
the reference header material). Next, make alignments off of it with
ProfileSegments. Be sure to create enough alignments to cover your revised
list size. Give the output pairs file the extension .profilepairs and the
output msf file an appropriate name. The abridged prion example follows:
% profilesegments -check
ProfileSegments makes optimal alignments showing the segments of
similarity found by ProfileSearch.
Minimal Syntax: % profilesegments [-INfile=]hsp70.pfs -Default
Prompted Parameters:
-SEQLimit=15 limit the number of alignments
[-OUTfile=]hsp70.pairs output file for the alignments
Local Data Files: None
Optional Parameters:
-LOCal aligns the best segment of similarity between
the sequence and profile (local alignment is
the default)
-GLObal aligns the whole sequence and profile (global
alignment)
-NOAVErage does not adjust alignment scores for sequence
composition
-ENDWeight penalizes end gaps like other gaps
-LIMit1=20 lets you set a gap shift limit for the sequence
Press q to quit or <Return> for more:
-LIMit2=20 lets you set a gap shift limit for the profile
-OUTfile2=hs7t_rat.gap new file for sequence 1 with gaps added
-OUTfile3=hsp70.gap new file for the profile consensus with gaps added
-MSF[=hsp70.msf] new MSF file containing alignment of all the
sequences with the profile consensus
-PAIr=1.0,0.5,0.1 thresholds for displaying '|', ':', and '.'
-NOMONitor suppresses the screen summary for each alignment
Add what to the command line ? -msf
PROFILESEGMENTS version 4.50 November 26, 1996 11:11
(Local) PROFILESEGMENTS of what PROFILESEARCH output file ? prion.pfs
Stop after how many alignments (* 15 *) ? 50
What should I call the paired output display file (* prion.pairs *) ? prion.pr
ofilepairs
What should I call the multiple sequence output file (* prion.msf *) ? prion-p
rofile.msf
The following levels will be marked in the alignments:
Bar: 3.21
Colon: 2.34
Dot: 1.47
Aligning ............-............
Sequence: 1 prio_calmo
SW:PRIO_CALMO
Gaps: 4
Quality: 1378.12
Quality Ratio: 5.77
Length: 251
Aligning ............-..
Sequence: 2 yacg_ecoli
SW:YACG_ECOLI
Gaps: 2
Quality: 86.72
Quality Ratio: 3.10
Length: 34
/////////////////////////////////////////////////////////////////////
Display
the results of your ProfileSegments run and notice how much different the
alignments, after the obvious first ones, are from the previous examples. In
my case, the proline, glutamine richness of certain areas of the UV15 protein
was found similar to the prion, and the string of alanines in certain chorion
proteins is considered important by the algorithm. No previous program
recognized either of these similarities. "Clustering" is also much more
critical to Profile analyses than any other method. This is because of profile
analysis' variable gap penalties where conserved areas are not allowed to gap
and variable regions are free to. The abridged prion ProfileSegments example
output file follows:
% more prion.profilepairs
(Local) PROFILESEGMENTS of: PRIO_CALMO check: 8361 from: 1 to: 241
//////////////////////////////////////////////////////////////////////////////
Profile: prion.prf
Gap Weight: 33.000 Average Match: 2.340
Length Weight: 0.370 Average Mismatch: -1.705
Quality: 1378.12 Length: 251
Ratio: 5.77 Gaps: 4
prio_calmo x prion.prf November 26, 1996 11:11 ..
. . . . .
S 2 LVLFVATWSDLGLCKKRPKP.GGWNTGGSRYPGQGSPGGNRYPPQG.... 46
:.:|::.|.| |:||||||| ||||||||||||||||||||||||:
P 1 LVLFVATWSBLGLCKKRPKPGGGWNTGGSRYPGQGSPGGNRYPPQGGGGW 50
. . . . .
S 47 ....GGSWGQPHGGGWGQPHGGGWGQPHGGGWGQPH.GGGWGQGGGTHNQ 91
|| ||||||||||||||||||||||||||||| ||||||:|||| |
P 51 GQPHGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQPHGGGGWGQGGGTHNQ 100
. . . . .
S 92 WNKPSKPKTNMKHVAGAAAAGAVVGGLGGYMLGSAMSRPLIHFGNDYEDR 141
|:||:||||:||| ||||||||||||||||||||||:||.:|||:|||||
P 101 WNKPSKPKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPLIHFGNBYEBR 150
. . . . .
S 142 YYRENMYRYPNQVYYRPVDQYSNQNNFVHDCVNITIKQHTVTTTTKGENF 191
||||||:||||||||||.|||.|||||||||||:|.||||||||||||||
P 151 YYRENMYRYPNQVYYRPVBQYSNQNNFVHBCVNITIKQHTVTTTTKGENF 200
. . . . .
S 192 TETDVKMMERVVEQMCITQYEKESQAYY..QRGSSMVLFSSPPVILLISF 239
||||.|:|||||||||.|||..|||||| :||.:..:||:|||||||||
P 201 TETBVKMMERVVEQMCITQYEKESQAYYBGQRGSSMVLFSSPPVILLISF 250
S 240 L 240
|
P 251 L 251
(Local) PROFILESEGMENTS of: YACG_ECOLI check: 6061 from: 1 to: 50
//////////////////////////////////////////////////////////////////////////////
DE HYPOTHETICAL 5.8 KD PROTEIN IN MUTT-GUAC INTERGENIC REGION. . .
//////////////////////////////////////////////////////////////////////////////
Profile: prion.prf
Gap Weight: 33.000 Average Match: 2.340
Length Weight: 0.370 Average Mismatch: -1.705
Quality: 86.72 Length: 34
Ratio: 3.10 Gaps: 2
yacg_ecoli x prion.prf November 26, 1996 11:11 ..
. . .
S 3 WGEISPFRPFCSKRCQLIDLGEWAAEEKRIPSSG 36
| | || | | | | |
P 8 WSBLG....LCKKRPK..PGGGWNTGGSRYPGQG 35
//////////////////////////////////////////////////////////////////////////////
(Local) PROFILESEGMENTS of: MYX4_CRODU check: 9780 from: 1 to: 51
//////////////////////////////////////////////////////////////////////////////
DE MYOTOXIN 4 PRECURSOR (CROTAMINE 4) (FRAGMENT). . . .
//////////////////////////////////////////////////////////////////////////////
Profile: prion.prf
Gap Weight: 33.000 Average Match: 2.340
Length Weight: 0.370 Average Mismatch: -1.705
Quality: 84.64 Length: 67
Ratio: 2.02 Gaps: 4
myx4_crodu x prion.prf November 26, 1996 11:11 ..
. . . . .
S 1 FLS............EP.......GNAYKQCHKKGGHCFPKEKICIPPSS 31
| | | | || :| .
P 4 FVATWSBLGLCKKRPKPGGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQP 53
.
S 32 DFGKMDCRWR......W 42
| | |
P 54 HGGG....WGQPHGGGW 66
(Local) PROFILESEGMENTS of: INS_MYOSC check: 6776 from: 1 to: 50
//////////////////////////////////////////////////////////////////////////////
DE INSULIN. . . .
//////////////////////////////////////////////////////////////////////////////
Profile: prion.prf
Gap Weight: 33.000 Average Match: 2.340
Length Weight: 0.370 Average Mismatch: -1.705
Quality: 76.70 Length: 6
Ratio: 12.78 Gaps: 0
ins_myosc x prion.prf November 26, 1996 11:11 ..
S 31 IVEQCC 36
:||| |
P 211 VVEQMC 216
(Local) PROFILESEGMENTS of: YF69_HAEIN check: 6428 from: 1 to: 50
//////////////////////////////////////////////////////////////////////////////
DE HYPOTHETICAL PROTEIN HI1569. . . .
//////////////////////////////////////////////////////////////////////////////
Profile: prion.prf
Gap Weight: 33.000 Average Match: 2.340
Length Weight: 0.370 Average Mismatch: -1.705
Quality: 76.62 Length: 24
Ratio: 3.19 Gaps: 0
yf69_haein x prion.prf November 26, 1996 11:11 ..
. .
S 8 YTFIRQGKMLVVKILPKEIFVLSF 31
| | . | | .||
P 227 YYBGQRGSSMVLFSSPPVILLISF 250
(Local) PROFILESEGMENTS of: YIAZ_ECOLI check: 4447 from: 1 to: 50
//////////////////////////////////////////////////////////////////////////////
DE HYPOTHETICAL 6.0 KD PROTEIN IN CSPA-GLYS INTERGENIC REGION. . . .
//////////////////////////////////////////////////////////////////////////////
Profile: prion.prf
Gap Weight: 33.000 Average Match: 2.340
Length Weight: 0.370 Average Mismatch: -1.705
Quality: 76.21 Length: 22
Ratio: 4.48 Gaps: 4
yiaz_ecoli x prion.prf November 26, 1996 11:11 ..
. .
S 17 LLFF..TWMIRD.SLCELHIKQ 35
: | .| | :| |
P 1 LVLFVATW..SBLGLCK...KR 17
(Local) PROFILESEGMENTS of: PND2_ECOLI check: 9184 from: 1 to: 50
//////////////////////////////////////////////////////////////////////////////
DE PND PROTEIN. . . .
//////////////////////////////////////////////////////////////////////////////
Profile: prion.prf
Gap Weight: 33.000 Average Match: 2.340
Length Weight: 0.370 Average Mismatch: -1.705
Quality: 76.16 Length: 35
Ratio: 2.93 Gaps: 5
pnd2_ecoli x prion.prf November 26, 1996 11:11 ..
. . .
S 10 LIVICVTILCFV..WMVRD.SLCGLRLQQ...GNT 38
: |: | | :| | ||
P 1 LVL.......FVATW..SBLGLCKKRPKPGGGWNT 26
//////////////////////////////////////////////////////////////////////////////
(Local) PROFILESEGMENTS of: VG38_BPML5 check: 8056 from: 1 to: 50
//////////////////////////////////////////////////////////////////////////////
DE GENE 38 PROTEIN (GP38). . . .
//////////////////////////////////////////////////////////////////////////////
Profile: prion.prf
Gap Weight: 33.000 Average Match: 2.340
Length Weight: 0.370 Average Mismatch: -1.705
Quality: 75.64 Length: 72
Ratio: 1.76 Gaps: 4
vg38_bpml5 x prion.prf November 26, 1996 11:11 ..
. . . . .
S 1 MKKTIAA.AIIGLAATLSLSAC...................EGGTGSSDY 30
. :: |: | .| | |
P 1 LVLFVATWSBLGL........CKKRPKPGGGWNTGGSRYPGQGSPGGNRY 42
. .
S 31 TPSVPVFI.........QMPGG 43
| | ||
P 43 PPQGGGGWGQPHGGGWGQPHGG 64
//////////////////////////////////////////////////////////////////////////////
All
of the non-prion segment entries that I retained on my .pfs file had
ProfileSearch Z scrores between 10 and 13. Are any of these seven biologically
significant? It may be worth further investigation; some of the motifs look
quite interesting. The multiple sequence alignment output that we created with
the msf option follows below in abridged form:
% more prion-profile.msf
!!AA_MULTIPLE_ALIGNMENT 1.0
Multiple alignment of prion.prf and
prio_calmo yacg_ecoli por_gluox ins_gadca
ykq2_caeel ykq5_caeel yp55_ecoli ins2_batsp
ycua_laccu myx4_crodu ins_myosc yf69_haein
yiaz_ecoli pnd2_ecoli pnd1_ecoli vg38_bpml5
ins_oncgo lha6_rhoac disi_erima ins1_batsp
y476_haein p100_human yftr_metth atpe_rat
ins_progu orne_plaor znfp_lycvp ins_katpe
prion-profile.msf MSF: 360 Type: P November 26, 1996 11:11 Check: 6266 ..
Name: prio_calmo Len: 360 Check: 5192 Weight: 1.00
Name: yacg_ecoli Len: 360 Check: 2335 Weight: 1.00
Name: por_gluox Len: 360 Check: 9283 Weight: 1.00
Name: ins_gadca Len: 360 Check: 7065 Weight: 1.00
Name: ykq2_caeel Len: 360 Check: 9842 Weight: 1.00
Name: ykq5_caeel Len: 360 Check: 6224 Weight: 1.00
Name: yp55_ecoli Len: 360 Check: 8947 Weight: 1.00
Name: ins2_batsp Len: 360 Check: 3627 Weight: 1.00
Name: ycua_laccu Len: 360 Check: 9583 Weight: 1.00
Name: myx4_crodu Len: 360 Check: 71 Weight: 1.00
Name: ins_myosc Len: 360 Check: 319 Weight: 1.00
Name: yf69_haein Len: 360 Check: 997 Weight: 1.00
Name: yiaz_ecoli Len: 360 Check: 8336 Weight: 1.00
Name: pnd2_ecoli Len: 360 Check: 927 Weight: 1.00
Name: pnd1_ecoli Len: 360 Check: 6285 Weight: 1.00
Name: vg38_bpml5 Len: 360 Check: 6103 Weight: 1.00
Name: ins_oncgo Len: 360 Check: 319 Weight: 1.00
Name: lha6_rhoac Len: 360 Check: 5467 Weight: 1.00
Name: disi_erima Len: 360 Check: 3404 Weight: 1.00
Name: ins1_batsp Len: 360 Check: 7387 Weight: 1.00
Name: y476_haein Len: 360 Check: 1904 Weight: 1.00
Name: p100_human Len: 360 Check: 6781 Weight: 1.00
Name: yftr_metth Len: 360 Check: 8238 Weight: 1.00
Name: atpe_rat Len: 360 Check: 8540 Weight: 1.00
Name: ins_progu Len: 360 Check: 1891 Weight: 1.00
Name: orne_plaor Len: 360 Check: 5069 Weight: 1.00
Name: znfp_lycvp Len: 360 Check: 7948 Weight: 1.00
Name: ins_katpe Len: 360 Check: 8166 Weight: 1.00
Name: prion.prf Len: 360 Check: 6016 Weight: 1.00
//
1 50
prio_calmo LV..L..... ..F....... .VAT...... W......... ..........
yacg_ecoli .......... .......... .......... W......... ..........
por_gluox .......... .......... .......... .......... ..........
ins_gadca .......... .......... .......... .......... ..........
ykq2_caeel .......... .......... .......... .......... ..........
ykq5_caeel .......... .......... .......... .......... ..........
yp55_ecoli .......... .......... .......... .......... ..........
ins2_batsp .......... .......... .......... .......... ..........
ycua_laccu LI..IILR.. ..F....... .LIV...... W......... ..........
myx4_crodu .......... ..F....... .LS....... .......... ..........
ins_myosc .......... .......... .......... .......... ..........
yf69_haein .......... .......... .......... .......... ..........
yiaz_ecoli LL..F..... ..F....... ...T...... WMI....... ..........
pnd2_ecoli LI..VICVTI LCF....... .V........ WMV....... ..........
pnd1_ecoli LI..VVCVTI LCF....... .V........ WMV....... ..........
vg38_bpml5 MK..K..... ..T....... .IAA...... .......... ..........
ins_oncgo .......... .......... .......... .......... ..........
lha6_rhoac LI..IH.... ..F....... .IALSTDRFN W......... ..........
disi_erima RC..K..... ..FKRAGKVC RVARGD.... W......... ..........
ins1_batsp .......... .......... .......... .......... ..........
y476_haein .......... .......... .......... .......... ..........
p100_human .......... .......... .......... .......... ..........
yftr_metth .......... .......... .......... .......... ..........
atpe_rat .......... .......... .VAY...... W......... ..........
ins_progu .......... .......... .......... .......... ..........
orne_plaor .L..L..... ..Y....... .CGE...... F......... ..........
znfp_lycvp PL..N..... ..C....... .KSC...... WQKFDSLVRC HDHYLCRHCL
ins_katpe LVEAL..... ..Y....... .L........ .......... ..........
prion.prf LV..L..... ..F....... .VAT...... W......... ..........
//////////////////////////////////////////////////////////////////////////////
301 350
prio_calmo TTTTKGENFT ETDVKMMERV VEQMCITQYE KESQAYY..Q RGSSMVLFSS
yacg_ecoli .......... .......... .......... .......... ..........
por_gluox .......... .......... .......... .......... ..........
ins_gadca .......... .......... .......... .......... ..........
ykq2_caeel .......... .......... .......... .......... ..........
ykq5_caeel .......... .......... .......... .......... ..........
yp55_ecoli .......... .......... .......... .......... ..........
ins2_batsp .......... .......... .......... .......... ..........
ycua_laccu .......... .......... .......... .......... ..........
myx4_crodu .......... .......... .......... .......... ..........
ins_myosc .......... .........I VEQCC..... .......... ..........
yf69_haein .......... .......... .......... .....YTFIR QGKMLVVKIL
yiaz_ecoli .......... .......... .......... .......... ..........
pnd2_ecoli .......... .......... .......... .......... ..........
pnd1_ecoli .......... .......... .......... .......... ..........
vg38_bpml5 .......... .......... .......... .......... ..........
ins_oncgo .......... .........I VEQCC..... .......... ..........
lha6_rhoac .......... .......... .......... .......... ..........
disi_erima .......... .......... .......... .......... ..........
ins1_batsp .......... .......... .......... .......... ..........
y476_haein .......... .......... .......... .......... ..........
p100_human .......... .......... .......... .......... ..........
yftr_metth .......... .......... .......... .......... ..........
atpe_rat .......... .......... .......... .......... ..........
ins_progu .......... .......EGI VDQCC..... .......... ..........
orne_plaor .......... .......... .......... .......... ..........
znfp_lycvp .......... .......... .......... .......... ..........
ins_katpe .......... .......... .......... .......... ..........
prion.prf TTTTKGENFT ETBVKMMERV VEQMCITQYE KESQAYYBGQ RGSSMVLFSS
351
prio_calmo PPVILLISFL
yacg_ecoli ..........
por_gluox ..........
ins_gadca ..........
ykq2_caeel ..........
ykq5_caeel ..........
yp55_ecoli ..........
ins2_batsp ..........
ycua_laccu ..........
myx4_crodu ..........
ins_myosc ..........
yf69_haein PKEIFVLSF.
yiaz_ecoli ..........
pnd2_ecoli ..........
pnd1_ecoli ..........
vg38_bpml5 ..........
ins_oncgo ..........
lha6_rhoac ..........
disi_erima ..........
ins1_batsp ..........
y476_haein ..........
p100_human ..........
yftr_metth ..........
atpe_rat ..........
ins_progu ..........
orne_plaor ..........
znfp_lycvp ..........
ins_katpe ..........
prion.prf PPVILLISFL
Notice
the extreme 'gappiness' of the alignment due to the profile method used. In
this case, I doubt whether it is a biologically meaningful alignment, but you
should be able to see the power of the technique illustrated, had the sequences
actually belonged together. This can be a very handy way to pregap sequences
in order to introduce them into existing alignments.8) Conclusions -- do you have any?
This has been a long one; I do apologize. However, database searching analysis is one of the most commonly misunderstood areas in computational molecular biology and bioinformatics today. There is a tremendous amount of confusion in this field and anything that can be done to try and clear up some of the mess is entirely worthwhile. One point that still needs to be made is that the previous techniques were performed largely using GCG's suggested defaults. This usually will work for you, but it is a good idea to think about what these default values imply and adjust them accordingly, especially if the results seem inappropriate after running through a first pass with the default parameters intact.
The files and the renaming conventions for each that you are to rcp to the teacher account on ribozyme follow (files are copied from $GRAD_DIR/week8 and copied into teacher@ribozyme:receive just as previously):
% cp $GRAD_DIR/week8s/* .
% rcp lastname.* teacher@ribozyme:receive
This concludes your computing session for the week. Log off of ribozyme. Get out of the emulator and back to the Desktop. Do not turn off the computer.
Next week we will enter the controversial world of protein secondary structure prediction. This is where computer analysis "goes way out on a limb" and becomes quite subjective. Realize that most prediction schemes fair 60-75% at best and work better on some proteins than others. However, valuable insights can often be gained -- the techniques are worth learning and like all computer methods, the better you understand them, the more of them that you run, and the more you practice, the better they can work for you!
Bairoch A. (1992) PROSITE: A Dictionary of Sites and Patterns in Proteins. Nucleic Acids Research 20, 2013-2018.
Gribskov, M. and Devereux, J., editors (1992) Sequence Analysis Primer. W.H. Freeman and Company, New York,. NY
Henikoff, S. and Henikoff, J.G. (1992) Amino Acid Substitution Matrices from Protein Blocks. Proceedings of the National Academy of Sciences USA 89, 10915-10919.
Needleman, S.B. and Wunsch, C.D. (1970) A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular Biology 48, 443-453.
Pearson, P., Francomano, C., Foster, P., Bocchini, C., Li, P., and McKusick, V. (1994) The Status of Online Mendelian Inheritance in Man (OMIM) medio 1994. Nucleic Acids Research 22, 3470-3473.
Pearson, W.R. and Lipman, D.J. (1988) Improved Tools for Biological Sequence Analysis. Proceedings of the National Academy of Sciences USA 85, 2444-2448.
Schwartz, R.M. and Dayhoff, M.O. (1979) Matrices for Detecting Distant Relationships. In Atlas of Protein Sequences and Structure, (M.O. Dayhoff editor) 5, Suppl. 3, 353-358, National Biomedical Research Foundation, Washington D.C.
Smith, T.F. and Waterman, M.S. (1981) Comparison of Bio-Sequences. Advances in Applied Mathematics 2, 482-489.
Wilbur, W.J. and Lipman, D.J. (1983) Rapid Similarity Searches of Nucleic Acid and Protein Data Banks. Proceedings of the National Academy of Sciences USA 80, 726-730.