Unknown DNA: Rational Probe Design and Analysis -- the "Guessmer." How to design and analyze oligonucleotide probes and primers for discovering genes in organisms where they have not been identified when the gene's encoded protein sequence is known in other organisms.
Author:
Steven M. Thompson
Economical automated synthesis and the ready availability of oligonucleotides have revolutionized molecular biology. Coupled with polymerase chain reaction (PCR) techniques (Saiki et al., 1988) and ultrasensitive hybridization screening, oligonucleotide probes have allowed the "fishing out" of thousands of genes from complex genomes that previously would have been extremely difficult to find, let alone sequence. Additionally, easily available oligonucleotides have enabled the development of methods for introducing site-specific mutations into known sequences. Because of the high specificity and adjustable stringency of oligonucleotide hybridization, knowing the sequence of a relatively short stretch of unique DNA is sufficient to rapidly isolate and clone the corresponding gene. In the absence of nucleotide data, amino acid sequences can be backtranslated to provide the necessary probe.
Several strategies can be imagined for the design of oligonucleotide probes. If an exact nucleotide sequence is known, then a single oligonucleotide probe for hybridization or a pair of primers for PCR of a defined sequence can be synthesized. In the absence of a defined DNA sequence, a group of similar DNA sequences can sometimes be aligned; however, this is often not possible because DNA is very difficult to align. For these reasons the luxury of having DNA to work with is often not available. In many cases one is forced to work from either a small portion of a protein sequence from an Edman degradation reaction or, as will be illustrated in this exercise, a consensus pattern from a group of related proteins. Using amino acid sequence information requires one to backtranslate the sequence. This is not a trivial chore because of the degeneracy of the genetic code: of the 64 possible codons, 61 encode just 20 amino acids and the remaining three are stop signals. Because of this, different backtranslated probe techniques have been employed. Two common approaches are to use large pools of short, highly degenerate oligonucleotides, or to use small pools of (or even just one) longer oligonucleotide of lesser or no degeneracy. All organisms have preferential biases in codon usage, and this information can be used to our advantage in deciding which codons to synthesize out of all of the possible choices. The strategy that will be employed in this exercise, that of choosing the longest defined stretch of unambiguous peptide and backtranslating it to the most probable oligonucleotide, is known as designing a "guessmer."
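To make the degeneracy problem concrete, here is a minimal Python sketch (not part of GCG) that counts how many different oligonucleotides could encode a given peptide under the standard genetic code. The function names degeneracy and least_degenerate_window are my own, chosen only for illustration.

```python
# Number of codons that encode each amino acid in the standard genetic code
# (one-letter amino acid symbols; the three stop codons are excluded).
CODON_COUNTS = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def degeneracy(peptide):
    """Return how many distinct DNA sequences could encode the peptide."""
    total = 1
    for residue in peptide.upper():
        total *= CODON_COUNTS[residue]
    return total


def least_degenerate_window(peptide, length):
    """Return (start, subpeptide, degeneracy) for the least degenerate
    window of the given length; the first such window wins a tie."""
    best = None
    for start in range(len(peptide) - length + 1):
        window = peptide[start:start + length]
        d = degeneracy(window)
        if best is None or d < best[2]:
            best = (start, window, d)
    return best


if __name__ == "__main__":
    # A stretch taken from the prion alignment shown later in this exercise.
    peptide = "TETDVKMMER"
    print(peptide, degeneracy(peptide))
    print(least_degenerate_window(peptide, 5))
```

Stretches rich in methionine and tryptophan (one codon each) keep the pool small, while leucine, serine, and arginine (six codons each) inflate it quickly. This is why the choice of peptide stretch matters so much.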
Guessmers contain the combination of codons most likely to match the authentic gene. Guessmers work because the decrease in hybridization stability caused by mismatched bases is offset by the increase in stability from longer hybrid sequences. In most cases mismatches will occur in only the third position of incorrect codon choices and, therefore, at least two of the three bases will still be matched. Naturally, the biggest constraint on this type of strategy is that a relatively long stretch of amino acid sequence is required. Because of this, hybridization guessmers are particularly appropriate when one or more strong and sufficiently long consensus elements can be discovered in a protein family. They should be at least about 30 nucleotides in length in order to ensure sufficient hybridization despite potential mismatches, yet it is not worth the extra effort and bother to synthesize them longer than 70 bases. For very good descriptions of the factors involved in guessmer design and analysis and references to primary literature see Sambrook et al. (1990) and Wood (1987). Similar strategies can be used to design paired PCR primers, although they are seldom designed as long as hybridization probes. For good general reviews of PCR methodology see Mullis (1990), White et al. (1989), and Cherfas (1990).
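The "two out of three bases still match" argument can be checked with a short Python sketch. It compares a guessmer with the authentic coding sequence, both in frame and of equal length, and tallies mismatches by codon position. The two example sequences are made up purely for illustration and are not taken from any real gene; the function names are likewise my own.

```python
def mismatches_by_codon_position(guessmer, target):
    """Count mismatches at codon positions 1, 2 and 3 between two
    in-frame DNA strings of equal length."""
    if len(guessmer) != len(target):
        raise ValueError("sequences must have the same length")
    if len(guessmer) % 3 != 0:
        raise ValueError("length must be a whole number of codons")
    counts = [0, 0, 0]
    for i, (a, b) in enumerate(zip(guessmer.upper(), target.upper())):
        if a != b:
            counts[i % 3] += 1
    return counts


def fraction_matched(guessmer, target):
    """Return the fraction of bases that pair correctly."""
    mismatches = sum(mismatches_by_codon_position(guessmer, target))
    return 1 - mismatches / len(guessmer)


if __name__ == "__main__":
    # Hypothetical sequences: both encode Met-Val-Lys but differ in the
    # third position of the second and third codons.
    guess = "ATGGTGAAG"
    real = "ATGGTCAAA"
    print(mismatches_by_codon_position(guess, real))
    print(round(fraction_matched(guess, real), 3))
```

Even if every codon guess were wrong in its third position only, the matched fraction could not fall below two thirds. A longer guessmer therefore keeps adding paired bases faster than it adds mismatches.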
In order to discover possible consensus patterns within a known protein family for the design of a guessmer, the individual members must be maximally aligned and then a consensus must be created. Alignment is often achieved through an automated progressive, pairwise alignment procedure, here the GCG program PileUp, which inserts gaps to align the full length of its members. Other automated alignment methods are also available, such as Higgins' ClustalW (1994), Smith and Smith's PIMA (version 1.4, 1995), and Gupta et al.'s MSA (version 2.0, 1995), as are several different manual alignment editors. Consensus sequences can then be created from the alignment. Many methods merely rely on the positional frequency of individual symbols; however, some utilize much more information. Profile analysis (Gribskov et al., 1989) is one of these. You will be seeing more of profile analysis in exercise 8. Profile analysis takes advantage of the BLOSUM62 (Henikoff and Henikoff, 1992) Dayhoff style scoring matrix (Schwartz and Dayhoff, 1979) that utilizes the relative conservation of various amino acid substitutions within the alignment. Therefore, the resultant consensus residues are the most evolutionarily conserved rather than just statistically the most frequent. This can mean much more to us than an ordinary consensus and is especially appropriate in the design of the type of guessmer that we will be simulating -- that is, a situation in which much sequence information for the protein of interest is known in other organisms but not in the one we are studying.
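For comparison with profile analysis, the simpler "positional frequency" style of consensus mentioned above can be sketched in a few lines of Python. This is only an illustration of the frequency idea, not a reimplementation of any GCG program. The 0.75 threshold, the "x" placeholder, and the function name are assumptions of mine.

```python
from collections import Counter

# Symbols treated as gaps or padding rather than residues.
GAP_CHARS = {".", "~", "-"}


def frequency_consensus(aligned, min_fraction=0.75):
    """Return the most frequent residue in each column, or 'x' where no
    residue is present in at least min_fraction of the sequences."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    consensus = []
    for column in zip(*aligned):
        residues = [c for c in column if c not in GAP_CHARS]
        if not residues:
            consensus.append(".")  # the column holds nothing but gaps
            continue
        residue, count = Counter(residues).most_common(1)[0]
        consensus.append(residue if count / len(column) >= min_fraction else "x")
    return "".join(consensus)


if __name__ == "__main__":
    # Four short segments of the kind seen in the prion alignment.
    rows = ["TETDIKMMER", "TETDIKIMER", "TETDVKMMER", "TETDVKMMER"]
    print(frequency_consensus(rows))
```

A frequency consensus treats an isoleucine-to-valine change exactly like any other difference. A profile weights it as the conservative substitution that it is, which is the extra information referred to above.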
I will illustrate the use of most programs in the sequence analysis exercise series by running through them with the human prion protein as an example. You should not repeat the prion analyses in your own accounts; you are to use your "Selected Molecule" to run the analyses yourself. The prion molecule used as an example is responsible for a debilitating disease and yet is encoded by the organism's own DNA; the gene is expressed in both normal and afflicted cells. Large amounts of proteinaceous plaques aggregate and are deposited in the brains of afflicted animals. The prion protein has an unknown natural function but is found in very high quantities in the brains of animals infected with the degenerative neurological diseases scrapie and Bovine Spongiform Encephalopathy, in wild stock, and kuru, Creutzfeldt-Jakob Disease, or Gerstmann-Straussler Syndrome in humans. It is also involved in Fatal Familial Insomnia and recently gained notoriety as the harbinger of "Mad-Cow Disease." In human beings the gene maps to position 20p12-pter and the disease can be inherited in an autosomal dominant fashion. Seventeen pathologic allelic variants are listed in OMIM (1995). One of the most peculiar aspects of the prion is that no infective nucleic acid entity has ever been found, yet the protein particle itself is highly infectious. Somehow the infectious protein particle induces a posttranslational, pathological change in the host's normal protein to convert it to the aberrant isoform. The primary amino acid sequence is not changed; only the structural conformation of the protein is different. Dr. Stanley B. Prusiner of the University of California, San Francisco, was awarded the 1997 Nobel Prize in Physiology or Medicine by the Karolinska Institute in Stockholm for his work on this system. For further information see Dr. Prusiner's article in Science, available on the World Wide Web at: http://www.sciencemag.org/feature/data/prusiner/245.shl.
The following molecules are again listed to remind you of the choices for use in all subsequent exercises. Each entry has had at least one representative structure and genomic sequence solved. Please use the one molecule that interests you the most for the remainder of the exercise series:
At this point, in previous exercises, you should have already chosen and explored one of the proteins from the "Selected Molecule List" to work with for the duration of the exercises. All necessary commands in the sequence analysis exercise series are typed in bold print; screen traces are in Courier font; and "////////////" indicates abridged data. Really important statements are underlined. In all subsequent examples, through the entire exercise set, substitute your chosen "Selected Molecule" for the example used in my screen traces, here, and in most of my portion of the exercise series, the human prion protein.
How to design the oligo(s)? One way -- the guessmer:
start from known protein sequences and find strong consensus elements within them;
BackTranslate the consensus elements to yield consensus DNA sequences;
use Prime to locate candidate primers within the conserved DNA regions;
test candidate primers' suitability with FindPatterns and Prime.
Preliminary Preparations:
After logging on to ribozyme, initialize the Wisconsin Sequence Analysis Package from the Genetics Computer Group by typing gcg at the system prompt. You'll see the GCG welcome screen describing the various databases and providing the literature reference to the package.
Now create a new directory for this exercise with the command mkdir week5 and move down into it (remember cd week5, or down week5 if you have GCG activated). Finally, copy all the necessary files from ribozyme to the new directory with the standard command. Complete this procedure just as you've done in previous exercises by using the environment variable name $GRAD_DIR and the appropriate path, here /week5s:
% cp $GRAD_DIR/week5s/* .
1) StringSearch the protein database.
To find entries of interest in sequence databases we need to know their proper database names or accession codes. Database text searching programs are often the easiest way to do this. There are several methods; the Entrez program used previously is one of the most powerful. Here we will use GCG's StringSearch program because it creates an output file that can be used as an input list file to other GCG programs. Type stringsearch -check; the check option is really handy in all GCG programs as it allows you a chance to select command line options that you would not otherwise have a chance of using. You'll see the following screen:
% stringsearch -check
StringSearch identifies sequences by searching for character patterns such
as "globin" or "human" in the sequence documentation.
Minimal Syntax: % stringsearch [-INfile=]GenEMBL:* [-STRings=]Pseudo -Default
Prompted Parameters:
-MENu=A A=definitions, B=complete records
[-OUTfile=]genembl.strings output file (of sequence names)
Optional Parameters:
-MATch=Or finds entries with any of the patterns specified
-WIDth=100 limits length of documentation in the output file
-NOHEAding suppresses the heading in the output file
-BATch submits the program to run in the batch queue
-NOSCReen suppresses the screen output
-NOMONitor suppresses the '.'s in the screen trace
Add what to the command line ? -match=or
Notice the -MATch=Or option. You may want to take advantage of it so that a "hit" on any one of the terms you provide will produce a find; if that is appropriate in your case, type -match=or at the "?" prompt. Otherwise the default action of the program is match=and: all terms specified must be present. The rest of the screen will scroll and prompt you for interactive replies:
STRINGSEARCH through what sequence(s) (* GenEMBL:* *) ? sw:*
Do you want to search through:
A) definitions
B) complete sequence annotation
Please choose one (* A *): <rtn>
Search for what text patterns ? prion,scrapie
What should I call the output file (* sw.strings *) ? prion.strings
Either answer sw:* to search through the protein sequences of the Swiss Protein database or alternatively search the PIR/NBRF database, pir:*. Accept the default definitions search, "A". Then come up with terms (separate them with commas and no spaces) which the program will look for in the specified sequences and provide a suitable output file name. If you used the -match=or option, any one match out of the list of terms will yield a result. The program will next display the results of the search as they are found, for the example this abridged screen:
*** Sw:Cyb_Negbr ***
P34872 negaprion brevirostris (shark). cytochrome b (ec 1.10.2.2). 2/94 381aa
*** Sw:Cyb_Prigl ***
P34873 prionace glauca (blue shark). cytochrome b (ec 1.10.2.2). 2/94 381aa
*** Sw:Prio_Aottr ***
P40245 aotus trivirgatus (night monkey) (douroucouli). major prion protein precu
rsor (prp) (prp27-30) (prp33-35c) (fragment). 2/95 239aa
///////////////////////////////////////////////////////////////////////////////
*** Sw:Soma_Prigl ***
P34006 prionace glauca (blue shark). somatotropin (growth hormone). 2/94 183aa
Sequences searched: 52205
Sequences with matches: 31
Patterns sought: prion scrapie
Output file: prion.strings
The abridged output file is displayed below:
% cat prion.strings
!!SEQUENCE_LIST 1.0
! STRINGSEARCH from: sw:* November 19, 1996 11:39
! searching for: "prion" "scrapie" Matches: 1/2 ..
Sw:Cyb_Negbr P34872 negaprion brevirostris (shark). cytochrome b (ec 1.10.2.2). 2/94 381aa
Sw:Cyb_Prigl P34873 prionace glauca (blue shark). cytochrome b (ec 1.10.2.2). 2/94 381aa
Sw:Prio_Aottr P40245 aotus trivirgatus (night monkey) (douroucouli). major prion protein precursor (prp) (prp27-30
Sw:Prio_Atege P40246 ateles geoffroyi (black-handed spider monkey). major prion protein precursor (prp) (prp27-30)
Sw:Prio_Bovin P10279 bos taurus (bovine). major prion protein 1 precursor (prp) (major scrapie-associated fibril p
///////////////////////////////////////////////////////////////////////////////
Sw:Prp2_Trast P40243 tragelaphus strepsiceros (greater kudu). major prion protein 2 precursor (prp) (major scrapie
Sw:Rbl_Barpr P28382 barleria prionitis. ribulose bisphosphate carboxylase large chain (ec 4.1.1.39) (fragment). 1
Sw:Sodc_Prigl P11418 prionace glauca (blue shark). superoxide dismutase (cu-zn) (ec 1.15.1.1). 10/94 152aa
Sw:Soma_Prigl P34006 prionace glauca (blue shark). somatotropin (growth hormone). 2/94 183aa
! Sequences searched: 52205
Notice that some of the proteins included in the above list are not appropriate. For instance, RuBisCO is not a prion, yet it was found because the species name Barleria prionitis contains the substring "prion." Be careful! This will likely occur in your searches also. To fix these problems you will have to edit the output file. Use the pico editor to comment out or remove any lines that do not refer to your particular protein, either by placing an exclamation point "!" in front of the unwanted lines or by cutting them out with Ctrl-k. Furthermore, since these proteins are so well studied, the list may be quite long. To save time in the next step, you may want to pare down your list to not more than about twenty complete (no fragments) and most representative (closest phylogenetically) entries. The resultant file will serve as input to the next step.
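If your list is long, the same "comment out with an exclamation point" edit can be scripted instead of done by hand in pico. The following Python sketch is only a convenience of my own, not a GCG tool. It assumes the heading ends on a line that finishes with ".." and that each entry occupies one line beginning with its database name. The prefixes "Sw:Prio_" and "Sw:Prp" suit the prion example only; substitute whatever identifies your own protein.

```python
def comment_out_unwanted(lines, keep_prefixes):
    """Return the list-file lines with '!' placed in front of every entry
    whose name does not start with one of keep_prefixes (case-insensitive)."""
    wanted = tuple(prefix.lower() for prefix in keep_prefixes)
    result = []
    past_header = False
    for line in lines:
        stripped = line.strip()
        if not past_header:
            # Heading lines pass through unchanged; '..' ends the heading.
            result.append(line)
            if stripped.endswith(".."):
                past_header = True
        elif not stripped or stripped.startswith("!"):
            result.append(line)  # blank lines and existing remarks
        elif stripped.lower().startswith(wanted):
            result.append(line)  # an entry we want to keep
        else:
            result.append("!" + line)  # an unwanted entry, now a remark
    return result


if __name__ == "__main__":
    with open("prion.strings") as handle:
        original = handle.read().splitlines()
    edited = comment_out_unwanted(original, ["Sw:Prio_", "Sw:Prp"])
    # Write to a new file so the original StringSearch output is preserved.
    with open("prion_edited.strings", "w") as handle:
        handle.write("\n".join(edited) + "\n")
```

Look the result over before using it. A name prefix cannot tell a complete sequence from a fragment, so paring the list down to about twenty representative entries still takes your own judgment.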
2) PileUp the hits and evaluate the results.
Next we want to align all of these proteins to determine the most conserved areas suitable for locating a probe. Type pileup -check to run GCG's multiple sequence alignment program with the check `super' option. You will see the following screen:
% pileup -check
PileUp creates a multiple sequence alignment from a group of related
sequences using progressive, pairwise alignments. It can also plot a
tree showing the clustering relationships used to create the alignment.
Minimal Syntax: % pileup [-INfile=]@Hsp70.List -Default
Prompted Parameters:
-GAPweight=3.0 gap creation penalty
-LENgthweight=0.1 gap extension penalty
-DENsity=20.0 number of sequences per 100 pu in the dendrogram
[-OUTfile1=]hsp70.msf output file for multiple sequence alignment
Local Data Files:
-MATRix=blosum62.cmp scoring matrix for peptides
-MATRix=pileupdna.cmp scoring matrix for nucleic acids
Optional Parameters:
-BEGin=1 sets beginning position for every sequence to be aligned
-END=100 sets ending position for every sequence to be aligned
-INSitu realign a portion of an existing alignment
-ENDWeight penalizes end gaps like other gaps
-HIGhroad selects "top" alignment path for equally optimal gaps
Press q to quit or <Return> for more: <rtn>
-LOWroad selects "bottom" alignment path for equally optimal gaps
-MAXSeg=5000 sets maximum segment length for every input sequence
-MAXGap=2000 sets maximum combined length of all gaps added to a sequence
-NOSORt presents output sequences in the same order as input
-LINesize=50 sets the number of sequence symbols per line
-BLOcksize=10 sets the number of sequence symbols per block
-DEGap removes gap characters ('.') from the input sequences
-NOPLOt suppresses plot of clustering relationships
-NOMONitor suppresses screen trace of each alignment
-NOSUMmary suppresses screen summary at the end of the program
-BATch submits program to the batch queue
All GCG graphics programs accept these and other switches. See the Using
Graphics chapter of the USERS GUIDE for descriptions.
-FIGure[=FileName] stores plot in a file for later input to FIGURE
-FONT=3 draws all text on the plot using font 3
-COLor=1 draws entire plot with pen in stall 1
-SCAle=1.2 enlarges the plot by 20 percent (zoom in)
-XPAN=10.0 moves plot to the right 10 platen units (pan right)
-YPAN=10.0 moves plot up 10 platen units (pan up)
-PORtrait rotates plot 90 degrees
Add what to the command line ? <rtn>
PileUp of what sequences ? @prion.strings
We do not need to take advantage of any of the options at this point;
I just wanted you to see all that are available. Depending on the level of
divergence in your data set, better multiple sequence alignments can sometimes
be generated by using alternate scoring matrices with the -matrix= option
(specifying the desired matrix from the GCG logical directory GenMoreData)
and/or different gap penalties. Beginning with GCG version 9.0, the BLOSUM62
(Henikoff and Henikoff, 1992) matrix file, blosum62.cmp, is used as the default
symbol comparison table. Furthermore, appropriate suggested gap creation and
extension penalties are now coded directly into the matrix rather than the
program. This is a greatly improved situation over the normalized Dayhoff PAM
250 table and program encoded penalty values that GCG formerly used. The
BLOSUM table seems quite a bit more robust at handling a wider range of
sequence divergence than the PAM table. Gap penalties can still be adjusted as
desired but the defaults usually work quite well. Here I answer
@prion.strings to indicate that I want PileUp to work on a listing of
sequence names called prion.strings rather than individual sequences within my
directory. Give PileUp your StringSearch output filename with the @
designation. Remember, any time that you want GCG to read a list file the @
designator is required. Next the program finds the sequences in the databases
and brings them over to work with; it then asks some parameter questions for
which you can accept the default answers in this run. Finally it compares each
sequence to every other sequence in a pairwise fashion and then progressively
aligns them from most to least similar. That process is shown in an abridged
fashion on the following screen:
1 PRIO_AOTTR 239 aa
2 PRIO_ATEGE 232 aa
3 PRIO_BOVIN 264 aa
4 PRIO_CALJA 252 aa
5 PRIO_CALMO 241 aa
6 PRIO_CEBAP 252 aa
7 PRIO_CERAE 245 aa
8 PRIO_COLGU 253 aa
9 PRIO_GORGO 253 aa
10 PRIO_HUMAN 253 aa
11 PRIO_MACFA 253 aa
12 PRIO_MANSP 241 aa
13 PRIO_MESAU 254 aa
14 PRIO_MOUSE 254 aa
15 PRIO_MUSVI 257 aa
16 PRIO_ODOHE 256 aa
17 PRIO_PANTR 253 aa
18 PRIO_PONPY 253 aa
19 PRIO_PREFR 253 aa
20 PRIO_RAT 226 aa
21 PRIO_SAISC 260 aa
22 PRIO_SHEEP 256 aa
23 PRIO_TRAST 264 aa
What is the gap creation penalty (* 12 *) ? <rtn>
What is the gap extension penalty (* 4 *) ? <rtn>
This program can display the clustering relationships graphically.
Do you want to:
A) Plot to a FIGURE file called "pileup.figure"
B) Plot graphics on VERSATERM-TEK4105 attached to term
C) Suppress the plot
Please choose one (* A *): <rtn>
The minimum density for a one-page plot is 16.7 sequences/100 platen units.
What density do you want (* 16.7 *) ? <rtn>
What should I call the output file name (* prion.msf *) ? <rtn>
Determining pairwise similarity scores...
1 x 2 5.36
1 x 3 5.05
1 x 4 5.54
1 x 5 5.53
1 x 6 5.51
1 x 7 5.19
1 x 8 5.51
1 x 9 5.45
1 x 10 5.44
1 x 11 5.51
1 x 12 5.49
1 x 13 5.21
1 x 14 5.21
1 x 15 5.15
1 x 16 5.12
1 x 17 5.43
1 x 18 5.49
1 x 19 5.49
1 x 20 5.11
1 x 21 5.41
1 x 22 5.15
1 x 23 5.01
2 x 3 4.91
2 x 4 5.38
2 x 5 5.38
2 x 6 5.36
2 x 7 5.48
2 x 8 5.36
2 x 9 5.33
2 x 10 5.31
///////////////////////////////////////////////////////
18 x 19 5.60
18 x 20 5.28
18 x 21 5.44
18 x 22 5.17
18 x 23 5.05
19 x 20 5.27
19 x 21 5.45
19 x 22 5.15
19 x 23 5.04
20 x 21 5.12
20 x 22 5.10
20 x 23 4.99
21 x 22 4.93
21 x 23 5.10
22 x 23 5.32
Aligning...
1 ............-..
2 ............-..
3 ............-.
............-..
4 ............-.
............-..
5 ............-..
6 ............-.
////////////////////////////////////////
19 ............-.
............-..
20 ............-.
............-..
21 ............-.
............-..
22 .............-.
.............-..
FIGURE instructions are now being written into pileup.figure.
Total sequences: 23
Alignment length: 267
CPU time: 23.31
Output file: /disk2/usr/local/people/thompson/courses/BC578/EX5/prion.msf
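The pairwise scores in the trace above determine the order in which sequences are joined. As a rough illustration only, the Python sketch below reads trace lines of the form "1 x 2 5.36" and ranks the pairs from most to least similar. It does not reproduce PileUp's clustering procedure; it only shows where the progressive alignment would begin. The function names are my own.

```python
def parse_pairwise_scores(trace_lines):
    """Read lines such as '1 x 2 5.36' into {(1, 2): 5.36}; lines that do
    not have that shape are ignored."""
    scores = {}
    for line in trace_lines:
        parts = line.split()
        if len(parts) == 4 and parts[1] == "x":
            scores[(int(parts[0]), int(parts[2]))] = float(parts[3])
    return scores


def rank_pairs(scores):
    """Return the sequence pairs ordered from most to least similar."""
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # A few lines copied from the screen trace above.
    trace = [
        "1 x 2 5.36",
        "1 x 3 5.05",
        "1 x 4 5.54",
        "2 x 3 4.91",
        "2 x 4 5.38",
    ]
    for pair in rank_pairs(parse_pairwise_scores(trace)):
        print(pair)
```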
Give the appropriate responses in your own case and then display the
output file to see what these MSF files look like. My abridged example output
file follows the more prion.msf command below. Notice the interleaved
character of the sequences, yet they all still have unique identities,
addressable by using their MSF filename together with their own name in braces,
e.g. "prion.msf{PRIO_BOVIN}":
% more prion.msf
!!AA_MULTIPLE_ALIGNMENT 1.0
PileUp of: @prion.strings
Symbol comparison table: GenRunData:blosum62.cmp CompCheck: 6430
GapWeight: 12
GapLengthWeight: 4
prion.msf MSF: 267 Type: P November 19, 1996 16:37 Check: 3853 ..
Name: PRIO_BOVIN Len: 267 Check: 6713 Weight: 1.00
Name: PRIO_TRAST Len: 267 Check: 7884 Weight: 1.00
Name: PRIO_ODOHE Len: 267 Check: 5943 Weight: 1.00
Name: PRIO_SHEEP Len: 267 Check: 5848 Weight: 1.00
Name: PRIO_MUSVI Len: 267 Check: 6483 Weight: 1.00
Name: PRIO_MESAU Len: 267 Check: 5224 Weight: 1.00
Name: PRIO_RAT Len: 267 Check: 3036 Weight: 1.00
Name: PRIO_MOUSE Len: 267 Check: 7047 Weight: 1.00
Name: PRIO_ATEGE Len: 267 Check: 2962 Weight: 1.00
Name: PRIO_CERAE Len: 267 Check: 2694 Weight: 1.00
Name: PRIO_GORGO Len: 267 Check: 5032 Weight: 1.00
Name: PRIO_HUMAN Len: 267 Check: 4924 Weight: 1.00
Name: PRIO_PANTR Len: 267 Check: 5092 Weight: 1.00
Name: PRIO_PONPY Len: 267 Check: 5991 Weight: 1.00
Name: PRIO_PREFR Len: 267 Check: 5746 Weight: 1.00
Name: PRIO_MACFA Len: 267 Check: 5311 Weight: 1.00
Name: PRIO_MANSP Len: 267 Check: 6831 Weight: 1.00
Name: PRIO_COLGU Len: 267 Check: 5901 Weight: 1.00
Name: PRIO_CALMO Len: 267 Check: 7346 Weight: 1.00
Name: PRIO_AOTTR Len: 267 Check: 9716 Weight: 1.00
Name: PRIO_CALJA Len: 267 Check: 5464 Weight: 1.00
Name: PRIO_CEBAP Len: 267 Check: 5611 Weight: 1.00
Name: PRIO_SAISC Len: 267 Check: 7054 Weight: 1.00
//
1 50
PRIO_BOVIN MVKSHIGSWI LVLFVAMWSD VGLCKKRPKP GGGWNTGGSR YPGQGSPGGN
PRIO_TRAST MVKSHIGSWI LVLFVAMWSD VALCKKRPKP GGGWNTGGSR YPGQGSPGGN
PRIO_ODOHE MVKSHIGSWI LVLFVAMWSD VGLCKKRPKP GGGWNTGGSR YPGQGSPGGN
PRIO_SHEEP MVKSHIGSWI LVLFVAMWSD VGLCKKRPKP GGGWNTGGSR YPGQGSPGGN
PRIO_MUSVI MVKSHIGSWL LVLFVATWSD IGFCKKRPKP GGGWNTGGSR YPGQGSPGGN
PRIO_MESAU ~~MANLSYWL LALFVAMWTD VGLCKKRPKP .GGWNTGGSR YPGQGSPGGN
PRIO_RAT ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~GGWNTGGSR YPGQGSPGGN
PRIO_MOUSE ~~MANLGYWL LALFVTMWTD VGLCKKRPKP .GGWNTGGSR YPGQGSPGGN
PRIO_ATEGE ~~~~~~~~~M LVLFVATWSD LGLCKKRPKP .GGWNTGGSR YPGQGSPGGN
PRIO_CERAE ~~MANLGCWM LVVFVATWSD LGLCKKRPKP .GGWNTGGSR YPGQGSPGGN
PRIO_GORGO ~~MANLGCWM LVLFVATWSD LGLCKKRPKP .GGWNTGGSR YPGQGSPGGN
PRIO_HUMAN ~~MANLGCWM LVLFVATWSD LGLCKKRPKP .GGWNTGGSR YPGQGSPGGN
PRIO_PANTR ~~MANLGCWM LVLFVATWSD LGLCKKRPKP .GGWNTGGSR YPGQGSPGGN
PRIO_PONPY ~~MANLGCWM LVLFVATWSN LGLCKKRPKP .GGWNTGGSR YPGQGSPGGN
PRIO_PREFR ~~MANLGCWM LVLFVATWSD LGLCKKRPKP .GGWNTGGSR YPGQGSPGGN
PRIO_MACFA ~~MANLGCWM LVLFVATWSD LGLCKKRPKP .GGWNTGGSR YPGQGSPGGN
PRIO_MANSP ~~~~~~~~~M LVLFVATWSD LGLCKKRPKP .GGWNTGGSR YPGQGSPGGN
PRIO_COLGU ~~MANLGCWM LVLFVATWSD LGLCKKRPKP .GGWNTGGSR YPGQGSPGGN
PRIO_CALMO ~~~~~~~~~M LVLFVATWSD LGLCKKRPKP .GGWNTGGSR YPGQGSPGGN
PRIO_AOTTR ~~~~~~~~~M LVLFVATWSD LGLCKKRPKP .GGWNTGGSR YPGQSSPGGN
PRIO_CALJA ~~MANLGCWM LFLFVATWSD LGLCKKRPKP .GGWNTGGSR YPGQGSPGGN
PRIO_CEBAP ~~MANLGCWM LVLFVATWSD LGLCKKRPKP .GGWNTGGSR YPGQGSPGGN
PRIO_SAISC ~~MANLGCWM LVLFVATWSD LGLCKKRPKP .GGWNTGGSR YPGQGSPGGN
51 100
PRIO_BOVIN RYPPQGGGGW GQPHGGGWGQ PHGGGWGQPH GGGWGQPHGG GWGQPHGGGG
PRIO_TRAST RYPSQGGGGW GQPHGGGWGQ PHGGGWGQPH GGGWGQPHGG GWGQPHGGGG
PRIO_ODOHE RYPPQG.... ....GGGWGQ PHGGGWGQPH GGGWGQPHGG GWGQPHGGGG
PRIO_SHEEP RYPPQG.... ....GGGWGQ PHGGGWGQPH GGGWGQPHGG GWGQPHGGGG
PRIO_MUSVI RYPPQG.... ....GGGWGQ PHGGGWGQPH GGGWGQPHGG GWGQPHGGGG
PRIO_MESAU RYPPQG.... ....GGTWGQ PHGGGWGQPH GGGWGQPHGG GWGQPH.GGG
PRIO_RAT RYPPQS.... ....GGTWGQ PHGGGWGQPH GGGWGQPHGG GWGQPH.GGG
PRIO_MOUSE RYPPQ..... ....GGTWGQ PHGGGWGQPH GGSWGQPHGG SWGQPH.GGG
PRIO_ATEGE RYPPQ..... .......... ..GGGWGQPH GGGWGQPHGG GWGQPH.GGG
PRIO_CERAE RYPPQG.... .......... ..GGGWGQPH GGGWGQPHGG GWGQPH.GGG
PRIO_GORGO RYPPQG.... ....GGGWGQ PHGGGWGQPH GGGWGQPHGG GWGQPH.GGG
PRIO_HUMAN RYPPQG.... ....GGGWGQ PHGGGWGQPH GGGWGQPHGG GWGQPH.GGG
PRIO_PANTR RYPPQG.... ....GGGWGQ PHGGGWGQPH GGGWGQPHGG GWGQPH.GGG
PRIO_PONPY RYPPQG.... ....GGGWGQ PHGGGWGQPH GGGWGQPHGG GWGQPH.GGG
PRIO_PREFR RYPPQG.... ....GGGWGQ PHGGGWGQPH GGGWGQPHGG GWGQPH.GGG
PRIO_MACFA RYPPQG.... ....GGGWGQ PHGGGWGQPH GGGWGQPHGG GWGQPH.GGG
PRIO_MANSP RYPPQG.... ....GGGWGQ PHGGGWGQPH GGGWGQPHGG GWGQPH.GGG
PRIO_COLGU RYPPQG.... ....GGGWGQ PHGGGWGQPH GGGWGQPHGG GWGQPH.GGG
PRIO_CALMO RYPPQG.... ....GGSWGQ PHGGGWGQPH GGGWGQPHGG GWGQPH.GGG
PRIO_AOTTR RYPPQS.... ....GG.WGQ PHGGGWGQPH GGGWGQPHGG GWGQPH.GGG
PRIO_CALJA RYPPQG.... ....GG.WGQ PHGGGWGQPH GGGWGQPHGG GWGQPH.GGG
PRIO_CEBAP LYPPQG.... ....GG.WGQ PHGGGWGQPH GGGWGQPHGG SWGQPH.GGG
PRIO_SAISC RYPPQG.GGW GQPHGGGWGQ PHGGGWGQPH GGGWGQPHGG GWGQPH.GGG
//////////////////////////////////////////////////////////////////
201 250
PRIO_BOVIN VTTTTKGENF TETDIKMMER VVEQMCITQY QRESQAYY.. QRGASVILFS
PRIO_TRAST VTTTTKGENF TETDIKMMER VVEQMCITQY QRESEAYY.. QRGASVILFS
PRIO_ODOHE VTTTTKGENF TETDIKMMER VVEQMCITQY QRESQAYY.. QRGASVILFS
PRIO_SHEEP VTTTTKGENF TETDIKIMER VVEQMCITQY QRESQAYY.. QRGASVILFS
PRIO_MUSVI VTTTTKGENF TETDMKIMER VVEQMCVTQY QRESEAYY.. QRGASAILFS
PRIO_MESAU VTTTTKGENF TETDIKIMER VVEQMCTTQY QKESQAYYDG RRSSA.VLFS
PRIO_RAT VTTTTKGENF TETDVKMMER VVEQMCVTQY QKESQAYYDG RRSSA.VLFS
PRIO_MOUSE VTTTTKGENF TETDVKMMER VVEQMCVTQY QKESQAYYDG RRSSSTVLFS
PRIO_ATEGE VTTTTKGENF TETDVKMMER VVEQMCITQY ERESQAYY.. QRGSSMVLFS
PRIO_CERAE VTTTTKGENF TETDVKMMER VVEQMCITQY EKESQAYY.. QRGSSMVLFS
PRIO_GORGO VTTTTKGENF TETDVKMMER VVEQMCITQY ERESQAYY.. QRGSSMVLFS
PRIO_HUMAN VTTTTKGENF TETDVKMMER VVEQMCITQY ERESQAYY.. QRGSSMVLFS
PRIO_PANTR VTTTTKGENF TETDVKMMER VVEQMCITQY ERESQAYY.. QRGSSMVLFS
PRIO_PONPY VTTTTKGENF TETDVKMMER VVEQMCITQY ERESQAYY.. QRGSSMVLFS
PRIO_PREFR VTTTTKGENF TETDVKMMER VVEQMCITQY EKESQAYY.. QRGSSMVFFS
PRIO_MACFA VTTTTKGENF TETDVKMMER VVEQMCITQY EKESQAYY.. QRGSSMVLFS
PRIO_MANSP VTTTTKGENF TETDVKMMER VVEQMCITQY EKESQAYY.. QRGSSMVLFS
PRIO_COLGU VTTTTKGENF TETDVKMMER VVEQMCITQY EKESQAYY.. QRGSSMVLFS
PRIO_CALMO VTTTTKGENF TETDVKMMER VVEQMCITQY EKESQAYY.. QRGSSMVLFS
PRIO_AOTTR VTTTTKGENF TETDVKIMER VVEQMCITQY EKESQAYY.. QRGSSMVLFS
PRIO_CALJA VTTTTKGENF TETDVKMMER VVEQMCITQY EKESQAYY.. QRGSSMVLFS
PRIO_CEBAP VTTTTKGENF TETDVKMMER VVEQMCITQY ERESQAYY.. QRGSSMVLFS
PRIO_SAISC VTTTTKGENF TETDVKMMER VVEQMCITQY EKESQAYY.. QRGSSMVLFS
251 267
PRIO_BOVIN SPPVILLISF LIFLIVG
PRIO_TRAST SPPVILLISF LIFLIVG
PRIO_ODOHE SPPVILLISF LIFLIVG
PRIO_SHEEP SPPVILLISF LIFLIVG
PRIO_MUSVI PPPVILLISL LILLIVG
PRIO_MESAU SPPVILLISF LIFLMVG
PRIO_RAT SPPVILLISF LIFLIVG
PRIO_MOUSE SPPVILLISF LIFLIVG
PRIO_ATEGE SPPVILLISF LI~~~~~
PRIO_CERAE SPPVILLISF LIFLIVG
PRIO_GORGO SPPVILLISF LIFLIVG
PRIO_HUMAN SPPVILLISF LIFLIVG
PRIO_PANTR SPPVILLISF LIFLIVG
PRIO_PONPY SPPVILLISF LIFLIVG
PRIO_PREFR SPPVILLISF LIFLIVG
PRIO_MACFA SPPVILLISF LIFLIVG
PRIO_MANSP SPPVILLISF LI~~~~~
PRIO_COLGU SPPVILLISF LIFLIVG
PRIO_CALMO SPPVILLISF LI~~~~~
PRIO_AOTTR SPPVILLISF L~~~~~~
PRIO_CALJA SPPVILLISF LIFLIVG
PRIO_CEBAP SPPVILLISF LIFLIVG
PRIO_SAISC SPPVILLISF LIFLIVG
Notice the listing of sequence names near the top of the file. This listing contains an important number called the checksum. All GCG sequence programs utilize this number as a unique sequence identifier. If any of your sequences turn out to be identical, their checksum numbers will be the same. Are any of them? If they are, use the pico editor to place an exclamation point, "!", at the start of the checksum line in which the duplicate sequence occurs. GCG interprets exclamation points as remark delineators, so the duplicate sequence will be ignored in subsequent programs. Another important parameter of the checksum block is the Weight category. The weights can be adjusted to even out the contribution of different sequences to a profile. This can be very helpful when you have several very similar sequences aligned to a few more disparate sequences, as we'll discuss in Week 8's exercise.
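With many sequences it is easy to overlook a repeated checksum by eye. The Python sketch below, a helper of my own rather than part of GCG, scans the Name lines of an MSF file and reports any checksum shared by more than one sequence. It assumes the "Name: ... Len: ... Check: ... Weight: ..." layout shown above and stops reading at the "//" divider.

```python
import re
from collections import defaultdict

# Pattern for one line of the MSF name listing, as displayed above.
NAME_LINE = re.compile(
    r"^\s*Name:\s*(\S+)\s+Len:\s*(\d+)\s+Check:\s*(\d+)\s+Weight:\s*([\d.]+)"
)


def duplicate_checksums(msf_lines):
    """Return {checksum: [names]} for every checksum that occurs on more
    than one Name line."""
    by_check = defaultdict(list)
    for line in msf_lines:
        if line.strip() == "//":
            break  # the name listing ends here; the alignment follows
        match = NAME_LINE.match(line)
        if match:
            by_check[int(match.group(3))].append(match.group(1))
    return {check: names for check, names in by_check.items() if len(names) > 1}


if __name__ == "__main__":
    with open("prion.msf") as handle:
        duplicates = duplicate_checksums(handle.read().splitlines())
    if duplicates:
        for check, names in sorted(duplicates.items()):
            print(check, " ".join(names))
    else:
        print("No repeated checksums found.")
```

Because a checksum is only a short number, treat a reported pair as a prompt to compare the two sequences yourself before commenting one of them out.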
Additionally, we created a dendrogram of the similarity between the sequences when we ran PileUp. In order to run any of the GCG graphics programs, you must first tell ribozyme what type of a graphics device you are using. This is known as initializing the graphics configuration. An easy way to do this is to type the command setplot; a menu of various graphics devices will appear and you choose the appropriate one from the list:
% setplot
SETPLOT allows you to choose a plotting configuration from a menu of
available graphics devices at your site.
+-------------> displaying all of 6 option(s) <--------------+
|tek_plot Color Tektronix 4107 |
|versa_plot VersaTerm-PRO Tektronix 4105 |
|tekbw_plot Black/White Tektronix 4014 |
|x_plot Color XWindow graphics |
|ps_plot Encapsulated PostScript File |
|hp_plot Hewlett Packard LaserJet File |
| |
+--------------------------------------------------------------+
Enter a command. Choices are:
<up-arrow> and <down-arrow> scroll the list
<return> makes GCG use the selected device
Q quits without doing anything
C creates and edits a new device
(you can't delete from the site file)
V views the selection (use C to edit a copy)
Done
Plotting Configuration set to:
Language: tekd
Device: VERSATERM-TEK4105
Port or Queue: term:
Use the arrow key to choose versa_plot since we are using VersaTerm-PRO communication/emulation software in this lab for Tektronix color terminal display. Ribozyme will return with a confirmation of the setting. Then in order to see the dendrogram use GCG's program Figure. Type figure pileup.figure to plot the dendrogram to the screen. Ribozyme will return a brief description and ask if you are ready:
% figure pileup.figure
Figure makes figures and posters by drawing graphics and text together. You
can include output from other GCG graphics programs as part of a figure.
Process set to plot with VERSATERM-TEK4105 attached to term
using the tekd graphic interface.
When your VERSATERM-TEK4105 attached to tty is ready, press <Return>. <rtn>

This similarity dendrogram can be very helpful in adjusting the Weight category of sequences in an alignment. The length of the vertical lines is proportional to the difference between the sequences. However, realize that this tree is not an evolutionary tree. No phylogenetic inference algorithms, such as maximum likelihood or parsimony, or correction models, such as Jukes-Cantor or Kimura, are used in its construction. You will learn more of these concepts in Week 10's exercise. PileUp's dendrogram merely indicates the relative similarity of the sequences and, therefore, the clustering order in which the alignment was built.
3) Determine areas of maximal conservation.
Next we need to decide which portions of the alignment to search for probes by determining which areas are the most highly conserved. To design a hybridization probe, the single most highly conserved section is chosen; to design paired PCR primers, two flanking, highly conserved areas are chosen. In order to easily visualize the positional conservation of a multiple sequence alignment we can utilize the GCG graphics program PlotSimilarity. This program draws a graph of the running similarity along a group of aligned sequences. Type plotsimilarity -expand to run PlotSimilarity with its expand option and reply to the "...what sequence(s)" query with your MSF filename followed by an asterisk enclosed by a pair of braces, {*}, to indicate that you want to use all of the sequences included in your particular MSF file. Accept all the suggested default parameters. For the prion example the following screen appears:
% plotsimilarity -expand
PlotSimilarity plots the running average of the similarity among the
sequences in a multiple sequence alignment.
Process set to plot with VERSATERM-TEK4105 attached to term
using the tekd graphic interface.
PLOTSIMILARITY between what sequence(s) ? prion.msf{*}
prion.msf{PRIO_BOVIN}
prion.msf{PRIO_TRAST}
prion.msf{PRIO_ODOHE}
prion.msf{PRIO_SHEEP}
prion.msf{PRIO_MUSVI}
prion.msf{PRIO_MESAU}
prion.msf{PRIO_RAT}
prion.msf{PRIO_MOUSE}
prion.msf{PRIO_ATEGE}
prion.msf{PRIO_CERAE}
prion.msf{PRIO_GORGO}
prion.msf{PRIO_HUMAN}
prion.msf{PRIO_PANTR}
prion.msf{PRIO_PONPY}
prion.msf{PRIO_PREFR}
prion.msf{PRIO_MACFA}
prion.msf{PRIO_MANSP}
prion.msf{PRIO_COLGU}
prion.msf{PRIO_CALMO}
prion.msf{PRIO_AOTTR}
prion.msf{PRIO_CALJA}
prion.msf{PRIO_CEBAP}
prion.msf{PRIO_SAISC}
What window to average (* 10 *) ? <rtn>
The minimum density for this plot is 232.2 residues/100 platen units.
What density do you want (* 232.2 *) ? <rtn>
When your VERSATERM-TEK4105 attached to tty is ready, press <Return>. <rtn>
Upon pressing <Return> the screen will disappear and the graphic will be drawn. The prion.msf similarity plot follows:

This example shows a great deal of sequence similarity. Strong peaks can be seen centered around positions 40, 85, and 155-195, with a small one at 255. The ordinate scale here is dependent on the scoring matrix used by the program, by default the BLOSUM62 table in which amino acid identities vary from 4 to 11. The dashed line across the middle shows the average similarity value for the entire alignment. Try to find similar areas of high conservation within your group of sequences and take notes of them. Also note where the overall similarity of the alignment falls off at the beginning and end; here around positions 10 and 260. Decide on whether you want to design a single hybridization probe or paired PCR primers based on the observed similarity. Either case will do for this exercise. I will illustrate paired PCR guessmers. Rerun PlotSimilarity the same as before, only this time add the option -figure to the command line. This will create a Figure file in your directory for later use and instructor evaluation purposes.
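The running-average idea behind the similarity plot can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it scores each column by the fraction of identical residue pairs rather than with the BLOSUM62 table that PlotSimilarity uses by default, and the names column_similarity, running_similarity, and the toy alignment are hypothetical, not part of GCG.

```python
from itertools import combinations

def column_similarity(column):
    """Fraction of residue pairs in one alignment column that are identical.

    Assumption: simple identity scoring; gaps ('-') never count as a match.
    """
    pairs = list(combinations(column, 2))
    same = sum(1 for a, b in pairs if a == b and a != "-")
    return same / len(pairs)

def running_similarity(alignment, window=10):
    """Average the per-column similarity over a sliding window."""
    length = len(alignment[0])
    per_column = [column_similarity([seq[i] for seq in alignment])
                  for i in range(length)]
    return [sum(per_column[start:start + window]) / window
            for start in range(length - window + 1)]

# Toy alignment (hypothetical): conserved at the start, divergent at the end.
toy = ["GGWNTGGSRYPAAA",
       "GGWNTGGSRYPCCD",
       "GGWNTGGSRYPEFG"]
profile = running_similarity(toy, window=4)
for position, score in enumerate(profile, start=1):
    print(position, round(score, 2))
```

Windows that sit entirely inside the conserved stretch score 1.0 and the score drops as the window slides into the divergent tail, which is the same peak-and-valley pattern you are looking for in the real plot.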
4) Use ProfileMake to create a consensus.
Next we need to generate a consensus of the sequence alignment. The most powerful method that I am aware of toward this goal is the Profile algorithm. A profile, and its inherent consensus, is created with the program ProfileMake. Type profilemake with the option -seqout=(your molecule).cons to generate a normal GCG sequence file of the consensus in addition to the profile file. Also specify the beginning and ending coordinates for the profile by excluding the most divergent amino and carboxy termini, as determined above, with the -begin= and -end= options. In response to "ProfileMake of what aligned sequences ?" answer with the same MSF convention as above, the MSF filename followed by an asterisk enclosed with a pair of braces, {*}. The following screen will appear:
% profilemake -seqout=prion.cons -begin=11 -end=261
PROFILEMAKE Version 4.50 November 20, 1996 11:38
ProfileMake creates a position-specific scoring table, called a
profile, that quantitatively represents the information from a
group of aligned sequences. The profile can then be used for database
searching (ProfileSearch) or sequence alignment (ProfileGap).
PROFILEMAKE of what aligned sequences ? prion.msf{*}
prion.msf{PRIO_BOVIN}, begin: 11 end: 261 len: 251 weight: 1.00
prion.msf{PRIO_TRAST}, begin: 11 end: 261 len: 251 weight: 1.00
prion.msf{PRIO_ODOHE}, begin: 11 end: 261 len: 251 weight: 1.00
prion.msf{PRIO_SHEEP}, begin: 11 end: 261 len: 251 weight: 1.00
prion.msf{PRIO_MUSVI}, begin: 11 end: 261 len: 251 weight: 1.00
prion.msf{PRIO_MESAU}, begin: 11 end: 261 len: 251 weight: 1.00
prion.msf{PRIO_RAT}, begin: 11 end: 261 len: 251 weight: 1.00
prion.msf{PRIO_MOUSE}, begin: 11 end: 261 len: 251 weight: 1.00
prion.msf{PRIO_ATEGE}, begin: 11 end: 261 len: 251 weight: 1.00
prion.msf{PRIO_CERAE}, begin: 11 end: 261 len: 251 weight: 1.00
prion.msf{PRIO_GORGO}, begin: 11 end: 261 len: 251 weight: 1.00
prion.msf{PRIO_HUMAN}, begin: 11 end: 261 len: 251 weight: 1.00
prion.msf{PRIO_PANTR}, begin: 11 end: 261 len: 251 weight: 1.00
prion.msf{PRIO_PONPY}, begin: 11 end: 261 len: 251 weight: 1.00
prion.msf{PRIO_PREFR}, begin: 11 end: 261 len: 251 weight: 1.00
prion.msf{PRIO_MACFA}, begin: 11 end: 261 len: 251 weight: 1.00
prion.msf{PRIO_MANSP}, begin: 11 end: 261 len: 251 weight: 1.00
prion.msf{PRIO_COLGU}, begin: 11 end: 261 len: 251 weight: 1.00
prion.msf{PRIO_CALMO}, begin: 11 end: 261 len: 251 weight: 1.00
prion.msf{PRIO_AOTTR}, begin: 11 end: 261 len: 251 weight: 1.00
prion.msf{PRIO_CALJA}, begin: 11 end: 261 len: 251 weight: 1.00
prion.msf{PRIO_CEBAP}, begin: 11 end: 261 len: 251 weight: 1.00
prion.msf{PRIO_SAISC}, begin: 11 end: 261 len: 251 weight: 1.00
What should I call the output profile (* profile.prf *) ? prion.prf
Take a look at your consensus sequence and notice that all positions are filled; there are no gaps. This is because the Profile algorithm will decide on the most conserved residue for each position, regardless. Also notice that the header contains information relating to the sequence's creation through ProfileMake; this can be valuable. The abridged prion profile consensus sequence follows:
% more prion.cons
!!AA_SEQUENCE 1.0
(Consensus) (Peptide) PROFILEMAKE v4.50 of: prion.msf{*} Length: 251 Sequences: 23 MaxScore: 1416.57 November 20, 1996 11:38
Gap: 1.00 Len: 1.00
GapRatio: 0.33 LenRatio: 0.10
prion.msf{PRIO_BOVIN} From: 11 To: 261 Weight: 1.00
prion.msf{PRIO_TRAST} From: 11 To: 261 Weight: 1.00
prion.msf{PRIO_ODOHE} From: 11 To: 261 Weight: 1.00
prion.msf{PRIO_SHEEP} From: 11 To: 261 Weight: 1.00
prion.msf{PRIO_MUSVI} From: 11 To: 261 Weight: 1.00
prion.msf{PRIO_MESAU} From: 11 To: 261 Weight: 1.00
////////////////////////////////////////////////////////////////////////////////
prion.msf{PRIO_CALJA} From: 11 To: 261 Weight: 1.00
prion.msf{PRIO_CEBAP} From: 11 To: 261 Weight: 1.00
prion.msf{PRIO_SAISC} From: 11 To: 261 Weight: 1.00
Symbol comparison table: GenRunData:blosum62.cmp FileCheck: 9412
Relaxed treatment of non-observed characters
Exponential weighting of characters
Length: 251 November 20, 1996 11:38 Type: P Check: 4796 ..
1 LVLFVATWSB LGLCKKRPKP GGGWNTGGSR YPGQGSPGGN RYPPQGGGGW
51 GQPHGGGWGQ PHGGGWGQPH GGGWGQPHGG GWGQPHGGGG WGQGGGTHNQ
101 WNKPSKPKTN MKHMAGAAAA GAVVGGLGGY MLGSAMSRPL IHFGNBYEBR
151 YYRENMYRYP NQVYYRPVBQ YSNQNNFVHB CVNITIKQHT VTTTTKGENF
201 TETBVKMMER VVEQMCITQY EKESQAYYBG QRGSSMVLFS SPPVILLISF
251 L
You are also welcome to take a look at your resultant .prf file. It is a huge table of numbers that doesn't make a whole lot of sense to us mere mortals; however, it is a tremendously powerful tool in subsequent analysis steps. Other programs can read and interpret all of those numbers to perform very sensitive database searches and alignments, because the profile penalizes misalignments in phylogenetically conserved areas more than in variable regions. We will come back to this later in the database searching exercise. Do not delete your .prf file; you will be using it later in the semester.
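To make the idea of a gap-free consensus concrete, here is a deliberately simplified Python sketch. It is not the Profile algorithm: it ignores the scoring table, the sequence weights, and the exponential character weighting reported in the header above, and simply takes the most frequent non-gap residue in each column. The function name majority_consensus and the toy alignment are hypothetical.

```python
from collections import Counter

def majority_consensus(alignment):
    """Return the most frequent non-gap residue in each alignment column.

    Assumption: a plain majority vote standing in for ProfileMake's
    position-specific scoring; gap characters are skipped so that every
    position is filled whenever at least one sequence has a residue there.
    """
    consensus = []
    for column in zip(*alignment):
        residues = [r for r in column if r not in "-.~"]
        if residues:
            consensus.append(Counter(residues).most_common(1)[0][0])
        else:
            consensus.append("X")  # column made only of gaps
    return "".join(consensus)

# Toy alignment (hypothetical) with one gapped column.
toy_alignment = ["GGW-TG",
                 "GGWNTG",
                 "GAWNSG"]
toy_consensus = majority_consensus(toy_alignment)
print(toy_consensus)
```

Note how the gapped fourth column still receives a residue, which mirrors the observation that the ProfileMake consensus has no gaps.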
5) Trim down your consensus sequence(s) with SeqEd.
Next, look at your MSF file, your consensus file, and your similarity plot to decide which sequences correspond to your peaks. If designing a single hybridization probe, choose the single, longest, least ambiguous sequence you can find based on all the information you have. If designing PCR primers, choose two highly conserved stretches that bracket the longest portion of the alignment possible. Within these stretches we will eventually be locating a nucleotide hybridization guessmer of 30 to 70 bases, i.e. a peptide of 10 to 24 residues, or PCR primers around 20 to 30 bases in length, i.e. about 7 to 10 residues long. (Although, in `real life,' if you do not know your template's exact DNA sequence, PCR primers may need to be even longer to maximize annealing potential.) However, at this point do not delete too much; make these test sequences as long as possible, at least 20 to 30 amino acids. A later step will isolate the best primers within them. Also be careful with your numbering scheme -- the coordinates to delete will not be the same in the MSF file as in the profile consensus since we trimmed the length of the profile when we created it. First copy your consensus sequence into either a single .pep file or two .pep files depending on whether you are designing one hybridization or two paired PCR guessmers and then use SeqEd to trim them down to the desired size. My copy examples follow:
% cp prion.cons prion1.pep
% cp prion.cons prion2.pep
To use SeqEd for this purpose, type seqed followed by your filename.ext and then, when the file has loaded, press Ctrl-d to enter command mode. Next enter the range of residues you want to delete, separated by a comma, starting with the carboxy terminal end, followed by the word delete. Press <Return> and repeat for the other deletion. A screen trace from one of my example .pep files is shown below:
% seqed prion1.pep
SeqEd is an interactive editor for entering and modifying sequences
and for assembling parts of existing sequences into new genetic
constructs. You can enter sequences from the keyboard or from a
digitizer.
prion1.pep ***** K E Y B O A R D ***** seqed
: (Consensus) (Peptide) PROFILEMAKE v4.50 of: prion.msf{*} Length: 258 :
: :
: Gap: 1.00 Len: 1.00 :
: GapRatio: 0.33 LenRatio: 0.10 :
KMMERVVEQMCITQYEKESQAYYBGQRGSSMVLFSSPPVILLISFL
....|.........|.........|.........|.........|.........|.........|.........|....
210 220 230 240 250 260 270 280
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
|......|......|......|......|......|......|......|......|......|......|
0 50 100 150 200 250 300 350 400 450 500
: 46,251 delete <rtn>
: 1,21 delete <rtn>
: exit <rtn>
"prion1.pep" 24 residues
Before leaving SeqEd check to be sure that ProfileMake didn't include any of the ambiguity residues B or Z in your sequence range. If so, replace them with the more appropriate code for your situation. BackTranslate will `trip' over these otherwise in the next step. Type exit to leave SeqEd and write the file to your directory. Repeat with the second peptide sequence if you are designing paired PCR primers. Display your results to check your work. Your new peptide probe sequences should follow the old profile header information; the first prion example, without the profile header, follows:
prion1.pep Length: 24 November 20, 1996 11:58 Type: P Check: 3745 ..
1 GGWNTGGSRY PGQGSPGGNR YPPQ
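The trimming and the ambiguity check can be cross-checked outside of SeqEd with a short Python sketch. The consensus string is copied from the prion.cons listing above; the helper names keep_range and msf_to_consensus are hypothetical, and the coordinate conversion assumes the profile was started at MSF position 11, as in my ProfileMake run.

```python
# Consensus copied from the prion.cons listing; whitespace is removed on load.
CONSENSUS = "".join("""
LVLFVATWSB LGLCKKRPKP GGGWNTGGSR YPGQGSPGGN RYPPQGGGGW
GQPHGGGWGQ PHGGGWGQPH GGGWGQPHGG GWGQPHGGGG WGQGGGTHNQ
WNKPSKPKTN MKHMAGAAAA GAVVGGLGGY MLGSAMSRPL IHFGNBYEBR
YYRENMYRYP NQVYYRPVBQ YSNQNNFVHB CVNITIKQHT VTTTTKGENF
TETBVKMMER VVEQMCITQY EKESQAYYBG QRGSSMVLFS SPPVILLISF
L
""".split())

def keep_range(sequence, first, last):
    """Keep residues first..last (1-based, inclusive).

    Equivalent to the two SeqEd deletions '46,251 delete' and '1,21 delete'
    when called with first=22 and last=45.
    """
    return sequence[first - 1:last]

def msf_to_consensus(msf_position, profile_begin=11):
    """Convert an MSF alignment coordinate to a trimmed-consensus coordinate."""
    return msf_position - profile_begin + 1

peptide1 = keep_range(CONSENSUS, 22, 45)

# Report any B or Z ambiguity codes, which would trip up BackTranslate.
ambiguous = [(i + 1, r) for i, r in enumerate(peptide1) if r in "BZ"]

print(peptide1, len(peptide1))
print("ambiguity codes:", ambiguous)
```

The conversion helper is there because the similarity plot is numbered in MSF coordinates while the consensus starts counting at the -begin position; forgetting that offset is an easy way to trim the wrong residues.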
Your peptide probe sequence(s) may not be as long as my examples. I was exceptionally lucky to find such strong consensus elements in the prion protein. Regardless of what length sequences you come up with, they are still peptide sequences and oligonucleotide probes are necessary for both hybridization and PCR methodology. Backtranslation is not simple because of the degeneracy of the genetic code. GCG has addressed this problem with their program BackTranslate. Alternate codons are indicated in the output along with their order of preference, based on the codon usage table that you specify, for each amino acid of the sequence. You can choose from them; the program generates either the most probable or the most ambiguous sequence.
6) Using BackTranslate.
In order to use BackTranslate you must decide which codon usage table you want the program to utilize. By default BackTranslate will use a frequency table designed from highly expressed E. coli genes. Therefore, if you're working with an E. coli gene, the program's default is appropriate. However, if your protein comes from anything else, you will want to use an alternate table. GCG provides alternate data files in a public data library with the GCG logical name GenMoreData. Therefore, in response to BackTranslate's "Use what codon frequency file?" query, you will want to answer "genmoredata:(the appropriate choice).cod." The available tables, in addition to the default codon usage table, ecohigh.cod, are: celegans_high.cod, celegans_low.cod, drosophila_high.cod, human_high.cod, maize_high.cod, and yeast_high.cod. Even more tables are available at several of the molecular biology data servers such as IUBIO. IUBIO is preset as one of your bookmarks in ribozyme's Gopher client application. Additionally, if you are not satisfied with the available options, GCG has a program, CodonFrequency, that enables you to create your own codon frequency table from known coding sequences.
Choose whichever GCG table makes the most sense for you to use based on the organism in which your protein occurs. Start the program and choose "a" for the back-translation table and most probable sequence output. Give the output file a .probe filename extension:
% backtranslate prion1.pep
BackTranslate backtranslates an amino acid sequence into a
nucleotide sequence. The output helps you recognize minimally
ambiguous regions that might be good for constructing synthetic probes.
Begin (* 1 *) ? <rtn>
End (* 24 *) ? <rtn>
Would you like to see:
a) table of back-translations and most probable sequence
b) table of back-translations and most ambiguous sequence
c) most probable sequence only
d) most ambiguous sequence only
Please choose one (* b *): a
Use what codon frequency file (* GenRunData:ecohigh.cod *) ? genmoredata:human_high.cod
What should I call the output file (* prion1.seq *) ? prion1.probe
Display the resultant .probe file and notice how each codon is listed. The first prion example probe sequence data file is shown following the more command:
% more prion1.probe
!!NA_SEQUENCE 1.0
BACKTRANSLATE of: : prion1.pep check: 3745 from: 1 to: 24
(Consensus) (Peptide) PROFILEMAKE v4.50 of: prion.msf{*} Length: 251
Sequences: 23 MaxScore: 1416.57 November 20, 1996 11:38
////////////////////////////////////////////////////////////////////////////////
Using codon frequencies from: /disk2/usr/local/soft/seq/gcgf9/gcgcore/data/moredata/human_high.cod
////////////////////////////////////////////////////////////////////////////////
Gly Gly Trp Asn Thr Gly Gly
GGC 0.50 GGC 0.50 TGG 1.00 AAC 0.78 ACC 0.57 GGC 0.50 GGC 0.50
GGG 0.24 GGG 0.24 AAT 0.22 ACG 0.15 GGG 0.24 GGG 0.24
GGA 0.14 GGA 0.14 ACA 0.14 GGA 0.14 GGA 0.14
GGT 0.12 GGT 0.12 ACT 0.14 GGT 0.12 GGT 0.12
195 222 222 111 48 31 47
8 - 14
Ser Arg Tyr Pro Gly Gln Gly
AGC 0.34 CGC 0.37 TAC 0.74 CCC 0.48 GGC 0.50 CAG 0.88 GGC 0.50
TCC 0.28 CGG 0.21 TAT 0.26 CCT 0.19 GGG 0.24 CAA 0.12 GGG 0.24
TCT 0.13 AGG 0.18 CCG 0.17 GGA 0.14 GGA 0.14
AGT 0.10 AGA 0.10 CCA 0.16 GGT 0.12 GGT 0.12
TCG 0.09 CGT 0.07
TCA 0.05 CGA 0.06
45 66 156 106 75 72 41
15 - 21
Ser Pro Gly Gly Asn Arg Tyr
AGC 0.34 CCC 0.48 GGC 0.50 GGC 0.50 AAC 0.78 CGC 0.37 TAC 0.74
TCC 0.28 CCT 0.19 GGG 0.24 GGG 0.24 AAT 0.22 CGG 0.21 TAT 0.26
TCT 0.13 CCG 0.17 GGA 0.14 GGA 0.14 AGG 0.18
AGT 0.10 CCA 0.16 GGT 0.12 GGT 0.12 AGA 0.10
TCG 0.09 CGT 0.07
TCA 0.05 CGA 0.06
41 94 72 107 103 63 150
22 - 24
Pro Pro Gln
CCC 0.48 CCC 0.48 CAG 0.88
CCT 0.19 CCT 0.19 CAA 0.12
CCG 0.17 CCG 0.17
CCA 0.16 CCA 0.16
0 0 0
prion1.probe Length: 72 November 20, 1996 13:14 Type: N Check: 1278 ..
1 GGCGGCTGGA ACACCGGCGG CAGCCGCTAC CCCGGCCAGG GCAGCCCCGG
51 CGGCAACCGC TACCCCCCCC AG
The final, resultant nucleotide sequence is the most likely coding sequence for the peptide we specified using the codon frequency chart we chose. Repeat the procedure with your other .pep file, if you are working with paired PCR primers. A viable alternative, often utilized, is to prepare a mixture of oligos containing various codons for those positions that are particularly ambiguous, here the serines and arginines. A few more analyses are appropriate before running off to synthesize our new probes, however. We need to decide which portions of the consensus elements that we have identified are most appropriate for primers. And of those portions, we need to determine if they have significant internal complementation such that strong `hairpin' structures would be formed, and we should also check for self- and primer-dimer complementation. The GCG program Prime can be used for all these tests. We also need to run a DNA database search to make sure that only the type of genes that we are interested in are `found.' GCG's program FindPatterns is probably best for this type of search because it does not allow gapping.
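The "most probable sequence" choice can be reproduced by hand with a small Python sketch. The codon frequencies are copied from the human_high.cod values printed in the prion1.probe table above and cover only the nine amino acids that occur in my peptide; the dictionary name HUMAN_HIGH and the functions most_probable and degeneracy are hypothetical helpers, not GCG code.

```python
from math import prod

# Codon frequencies as shown in the BackTranslate table above (subset only).
HUMAN_HIGH = {
    "G": {"GGC": 0.50, "GGG": 0.24, "GGA": 0.14, "GGT": 0.12},
    "W": {"TGG": 1.00},
    "N": {"AAC": 0.78, "AAT": 0.22},
    "T": {"ACC": 0.57, "ACG": 0.15, "ACA": 0.14, "ACT": 0.14},
    "S": {"AGC": 0.34, "TCC": 0.28, "TCT": 0.13, "AGT": 0.10,
          "TCG": 0.09, "TCA": 0.05},
    "R": {"CGC": 0.37, "CGG": 0.21, "AGG": 0.18, "AGA": 0.10,
          "CGT": 0.07, "CGA": 0.06},
    "Y": {"TAC": 0.74, "TAT": 0.26},
    "P": {"CCC": 0.48, "CCT": 0.19, "CCG": 0.17, "CCA": 0.16},
    "Q": {"CAG": 0.88, "CAA": 0.12},
}

def most_probable(peptide, table=HUMAN_HIGH):
    """Backtranslate by picking the highest-frequency codon for each residue."""
    return "".join(max(table[aa], key=table[aa].get) for aa in peptide)

def degeneracy(peptide, table=HUMAN_HIGH):
    """Number of distinct nucleotide sequences that could encode the peptide."""
    return prod(len(table[aa]) for aa in peptide)

guessmer = most_probable("GGWNTGGSRYPGQGSPGGNRYPPQ")
print(guessmer)
print("Ser-Arg alone can be encoded", degeneracy("SR"), "ways")
```

The degeneracy helper shows why serine and arginine are singled out above: with six codons each, a single Ser-Arg pair already multiplies the number of possible coding sequences by 36.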
7) Use Prime to locate `good' primers within your probe sequences.
The GCG program Prime can locate acceptable primers within a DNA template sequence. The program is quite powerful and contains many, many options to maximize flexibility. We will use it at this point in a somewhat different manner -- merely to find the best primers within a defined stretch of sequence that we have already identified as the best place to locate a primer based on sequence similarity. Launch Prime with the -check `super'-option to see all of its built-in options. The options that we are most concerned with on the first run through the program are -forwardprimers for the 5' probe and -reverseprimers for the 3' probe if you are trying to create paired PCR primers. If you are looking for only the one best hybridization primer, then you will only want to use the -forwardprimers option. You should also probably expand the size of the primer to be found from 18 through 22 to 30 through 50 if you are designing a single hybridization guessmer, and you may even want to increase the length of PCR primers some to help take into account potential mismatches introduced in the backtranslation step. Accept appropriate program defaults but suppress the plot with each pass through the program for each probe sequence. Specify each of your .probe sequences in turn as the sequence to be searched. My 5' example follows:
% prime -check
Prime selects oligonucleotide primers for a template DNA sequence. The
primers may be useful for the polymerase chain reaction (PCR) or for DNA
sequencing. You can allow Prime to choose primers from the whole template
or limit the choices to a particular set of primers listed in a file.
The Polymerase Chain Reaction (PCR) process for amplifying nucleic acids is
covered by U.S. Patent Nos. 4,683,195 and 4,683,202 owned by Hoffmann La
Roche. A license for research may be obtained through the purchase and use
of authorized reagents and thermocyclers from Perkin-Elmer Corp., or by
otherwise negotiating a license with Perkin-Elmer. No license to use PCR is
granted by the purchase or use of the Wisconsin Package(TM).
Minimal Syntax: % prime [-INfile=]ggamma.seq -Default
Prompted Parameters:
-BEGin1=1 -END1=1700 range of interest
-MINPROduct=100 minimum PCR product length
-MAXPROduct=300 maximum PCR product length
-MINPRImer=18 minimum primer length
-MAXPRImer=22 maximum primer length
[-OUTfile1=]ggamma.prime output file name
Press q to quit or <Return> for more: <rtn>
Local Data Files:
-DATa1=prime.cmp scoring matrix for annealing tests
-DATa2=dnastack.ds entropies for DNA melting temperature determination
-DATa3=dnastack.dh enthalpies for DNA melting temperature determination
Optional Parameters:
-LIStsize=25 maximum number of output primers or PCR
products shown
-BEGin2=500 -END2=750 target range for PCR amplification
-INClude=60.0 minimum % of specified PCR target range
range to be included in PCR products
-FORwardprimers select forward primers, only
-REVerseprimers select reverse primers, only
-NOPROducts suppress selection of PCR products
-NOUNIque permit duplicate primer binding sites on template
-PRImers=myfile.dat input file of primers to consider
-REPeats=@mylist.list repeated sequences to check for false priming
-DNAconcentration=50.0 primer DNA concentration (nM)
-SALtconcentration=50.0 salt concentration (mM)
-CLAmp=S specify primer 3' clamp (using IUB ambiguity codes)
-GCMINPRImer=40.0 minimum primer % G+C
Press q to quit or <Return> for more: <rtn>
-GCMAXPRImer=55.0 maximum primer % G+C
-TMMINPRImer=50.0 minimum primer melting temperature (Celsius)
-TMMAXPRImer=65.0 maximum primer melting temperature (Celsius)
-ENDANNEALPrimer=8.0 maximum primer-primer 3' annealing score
-ENDWGTPrimer=2.0 relative weight of primer-primer 3' annealing
score
-ALLANNEALPrimer=14.0 maximum primer-primer annealing score
-ALLWGTPrimer=1.0 relative weight of primer-primer annealing
score
-GCMINPROduct=40.0 minimum product % G+C
-GCMAXPROduct=55.0 maximum product % G+C
-TMMINPROduct=50.0 minimum product melting temperature (Celsius)
-TMMAXPROduct=65.0 maximum product melting temperature (Celsius)
-TMDIFference=2.0 maximum difference between melting temperatures
of two primers in PCR
-ENDANNEALTemplate=16.0 maximum primer-template 3' annealing score
(primer-template annealing is ignored by default)
-ENDWGTTemplate=0.5 relative weight of primer-template 3' annealing
score
-ALLANNEALTemplate=28.0 maximum primer-template annealing score
(primer-template annealing is ignored by default)
-ALLWGTTemplate=0.25 relative weight of primer-template annealing
score
Press q to quit or <Return> for more: <rtn>
-DENsity=1700 number of bases per 100 platen units in the plot
-SPAcing=1.6 number of platen units per line in the plot
-NOPLOt suppresses plot of primer sites
-NOMONitor suppresses screen trace of program progress
-NOSUMmary suppresses screen summary at the end of the program
-BATch submits program to the batch queue
All GCG graphics programs accept these and other switches. See the Using
Graphics chapter of the USERS GUIDE for descriptions.
-FIGure[=FileName] stores plot in a file for later input to FIGURE
-FONT=3 draws all text on the plot using font 3
-COLor=1 draws entire plot with pen in stall 1
-SCAle=1.2 enlarges the plot by 20 percent (zoom in)
-XPAN=10.0 moves plot to the right 10 platen units (pan right)
-YPAN=10.0 moves plot up 10 platen units (pan up)
-PORtrait rotates plot 90 degrees
Add what to the command line ? -forward
Prime of what sequence ? prion1.probe
Begin (* 1 *) ? <rtn>
End (* 72 *) ? <rtn>
Minimum primer length (* 18 *) ? <rtn>
Maximum primer length (* 22 *) ? <rtn>
What should I call the output file name (* prion1.prime *) ? <rtn>
This program can display the primer binding sites graphically.
Do you want to:
A) Plot to a FIGURE file called "prime.figure"
B) Plot graphics on HP7550 attached to /dev/tty15
C) Suppress the plot
Please choose one (* A *): c
Searching for forward primers
INPUT SUMMARY
-------------
Input sequence: prion1.probe
*** PRIME is set to search for primers on forward strand only. ***
Primer constraints:
primer size: 18 - 22
primer 3' clamp: S
primer sequence ambiguity: NOT ALLOWED
primer GC content: 40.0 - 55.0%
primer Tm: 50.0 - 65.0 degrees Celsius
primer self-annealing. . .
3' end: < 8.0 (weight: 2.0)
total: < 14.0 (weight: 1.0)
unique primer binding sites: required
primer-template and primer-repeat annealing. . .
3' end: ignored
total: ignored
repeated sequences screened: none specified
PRIMER SUMMARY
--------------
forward reverse
Number of primers considered: 225 0
Number of primers rejected for . . .
primer 3' clamp: 10 0
primer sequence ambiguity: 0 0
primer GC content: 215 0
primer Tm: 0 0
non-unique binding sites: 0 0
primer self-annealing: 0 0
primer-template annealing: 0 0
primer-repeat annealing: 0 0
Number of primers accepted: 0 0
Number of primers saved: 0 0
Output file: prion1.prime
CPU time: 0.26 seconds
As is often the case, Prime did not find any primers. Prime can sometimes be quite frustrating to run, for its parameters can be too stringent to find anything at all. However, based on the above report we can see that the parameter that most prevented its success in this case is GC content. Therefore, repeat the run using a less stringent GC content (or whatever parameters you are having trouble with). Sometimes it will take many passes through the program, adjusting different parameters each time, in order to finally get something acceptable. From the -check summary above we can see that GC content is required to be between 40 and 55 percent whereas my prion1.probe sequence appears to be quite a bit higher than that. The GCG program Composition can give you an exact count of nucleotide content if you need it. I will increase my gcmaxprimer parameter and see what happens. I also ended up having to increase my tmmaxprimer parameter in order to find some successful primers on prion1.probe. The command line and the abridged results of one of my successful Prime runs follow; notice the -gcmaxprimer and -tmmaxprimer parameters that I had to change. Play with the options in your own case to find at least one best hybridization probe/primer per conserved probe sequence.
% prime -forward -gcmaxprimer=73 -tmmaxprimer=68 prion1.probe
PRIME of: prion1.probe ck: 1278 from: 1 to: 72 November 20, 1996 13:56
INPUT SUMMARY
-------------
Input sequence: prion1.probe
*** PRIME is set to search for primers on forward strand only. ***
Primer constraints:
primer size: 18 - 22
primer 3' clamp: S
primer sequence ambiguity: NOT ALLOWED
primer GC content: 40.0 - 73.0%
primer Tm: 50.0 - 68.0 degrees Celsius
primer self-annealing. . .
3' end: < 8 (weight: 2.0)
total: < 14 (weight: 1.0)
unique primer binding sites: required
primer-template and primer-repeat annealing. . .
3' end: ignored
total: ignored
repeated sequences screened: none specified
PRIMER SUMMARY
--------------
forward reverse
Number of primers considered: 225 0
Number of primers rejected for . . .
primer 3' clamp: 10 0
primer sequence ambiguity: 0 0
primer GC content: 209 0
primer Tm: 5 0
non-unique binding sites: 0 0
primer self-annealing: 0 0
primer-template annealing: 0 0
primer-repeat annealing: 0 0
Number of primers accepted: 1 0
Number of primers saved: 1 0
--------------------------------------------------------------------------------
Primer: 1
[DNA] = 50.000 nM [salt] = 50.000 mM
5' 3'
forward strand primer (18-mer): 55 AACCGCTACCCCCCCCAG 72
primer %GC: 72.2
primer Tm (degrees Celsius): 66.5
annealing score: 16
--------------------------------------------------------------------------------
The output files describe the conditions used in the run and list each accepted primer with its corresponding melting temperature, ranked by annealing score (the lower the annealing score the better; see the GCG Program Manual description).
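The GC-content filter that rejected most of my candidates is easy to verify by hand. The following Python sketch recomputes %G+C for the whole prion1.probe sequence and for the accepted 18-mer at positions 55 through 72; the function names gc_percent and passes_gc are hypothetical, and no attempt is made to reproduce Prime's melting temperature or annealing scores.

```python
def gc_percent(seq):
    """Percentage of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

def passes_gc(seq, gc_min, gc_max):
    """True when the sequence's %G+C lies within the inclusive limits."""
    return gc_min <= gc_percent(seq) <= gc_max

# prion1.probe as written by BackTranslate above.
PROBE = ("GGCGGCTGGAACACCGGCGGCAGCCGCTACCCCGGCCAGGGCAGCCCCGG"
         "CGGCAACCGCTACCCCCCCCAG")

# The accepted forward primer occupies positions 55..72 (1-based).
PRIMER = PROBE[54:72]

print("probe  %GC:", round(gc_percent(PROBE), 1))
print("primer %GC:", round(gc_percent(PRIMER), 1))
print("passes default 40-55 limits:", passes_gc(PRIMER, 40.0, 55.0))
print("passes relaxed 40-73 limits:", passes_gc(PRIMER, 40.0, 73.0))
```

A probe built from the most probable codons of a highly expressed human gene table is strongly GC-biased, so it is no surprise that the default upper limit had to be raised before anything was accepted.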
8) Will your primers only `find' the correct genes?
Next, to ensure that our primers will not hybridize to completely the wrong type of sequence, they need to be checked against the DNA databases. This step can also point out, and allow you to correct, errors in your primer sequence created in the backtranslation step, if enough DNA sequences are available in the database to allow a comparison. To find any candidate DNA sequences we will use GCG's program FindPatterns. This program will not allow gapping and it will accept more than one query per run, so it is very appropriate for this type of search. The DNA databases are huge; therefore, the search should be done in batch mode. GCG has made this an easy chore by providing a -batch option to many of their CPU-intensive programs. The easiest way to run FindPatterns is to provide it with your primers as an input file rather than typing them in interactively. However, FindPatterns reads a specially formatted file containing input patterns, known as a Pattern.Dat file, structured after GCG's restriction enzyme data files. Also, since Prime can be used for evaluating a list of potential primers if we give it a perfectly complementary DNA template sequence, a FindPatterns run may locate an appropriate full-length template DNA sequence for further primer evaluation. To have Prime test a list of suggested primers in this manner they also need to be in the Pattern.Dat format. Therefore, we need to duplicate (one of) our .prime file(s), modify it with the pico editor into the proper format, and include any other properly formatted primers to be tested. Type cp (one of your primer files' name).prime primer.dat:
% cp prion1.prime primer.dat
Now edit your new primer.dat file with the pico editor, changing it into the following format. Delete all the old reference information and replace it with an appropriate explanation to yourself. Make sure that the necessary two periods ".." separate your header information from the data below; they are very important to all GCG programs! Use the Mark Set (Ctrl-Shift-6) and Cut Text (Ctrl-k) features of the pico editor to remove unwanted text. Leave each accepted primer sequence intact. Each entire pattern needs to be on one line; it needs to be prefaced with a name and the offset number 1, and followed with the overhang number 0. If data overruns the edge of your screen, a dollar sign will appear at the end of the line. If this happens, merely moving your cursor past the "$" will cause the line to scroll to the left, allowing you to edit the unseen portion; moving your cursor back restores the screen. To include the other primer patterns, if you are working with paired PCR primers, place your cursor where the other patterns will lie and then press Ctrl-r for "reading" an external file. You will be given a chance to type in a file name and the other file will be placed at the cursor's position. Be sure to repeat the format editing with the new sequences also! The finished prion example is illustrated below:
primer.dat: The pattern data file for searching the DNA databases with
FindPatterns using candidate prion primers from Prime.
Name Offset Pattern Overhang ..
F1 1 AACCGCTACCCCCCCCAG 0
R1 1 TGATCAGCAGGATCACGG 0
R2 1 TGCTGAACAGCACCATGC 0
R3 1 TGAACAGCACCATGCTGC 0
The exact column in which the various fields appear is not important, but the order of the fields is vital! Once you've made the proper changes to your .dat file, exit pico by pressing Ctrl-x; save your changes under the same file name.
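If you would rather generate the pattern file than edit it by hand, the layout described above is simple enough to write from a script. This Python sketch only follows the field order given in this exercise (name, offset 1, pattern, overhang 0, with ".." closing the header); the function name format_pattern_file is hypothetical and the sketch is not a GCG utility.

```python
def format_pattern_file(description, primers):
    """Build the text of a pattern data file in the layout described above.

    description: free-text header line(s) for your own reference
    primers:     list of (name, sequence) tuples
    Assumption: every pattern uses offset 1 and overhang 0, as in this exercise.
    """
    lines = [description, "Name Offset Pattern Overhang .."]
    for name, sequence in primers:
        # Field order matters; exact column positions do not.
        lines.append(f"{name} 1 {sequence.upper()} 0")
    return "\n".join(lines) + "\n"

text = format_pattern_file(
    "primer.dat: candidate prion primers from Prime.",
    [("F1", "AACCGCTACCCCCCCCAG"),
     ("R1", "TGATCAGCAGGATCACGG")],
)
print(text)
```

To use it, redirect the printed text into primer.dat and then check the result in pico before running FindPatterns.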
Now you are ready to run FindPatterns. Type findpatterns -check to get a chance to use the available options. My session below shows the options and parameters that I chose:
% findpatterns -check
FindPatterns identifies sequences that contain short patterns like GAATTC or
YRYRYRYR. You can define the patterns ambiguously and allow mismatches. You
can provide the patterns in a file or simply type them in from the terminal.
Minimal Syntax: % findpatterns [-INfile=]Genbank:Humig* -Default
Prompted Parameters:
-PATterns=GAATTC,RGGAY patterns to be found
[-OUTfile=]findpatterns.find the output file name
Local Data Files:
-DATa=pattern.dat a file with a set of patterns
Optional Parameters:
-MISmatch=1 allows mismatches in the search for your subsequence
-NAMes makes an output file in "file of filenames" format
-ONEstrand searches only the top strand of nucleotide sequences
-SIXbase searches only for patterns with six or more symbols
-CIRcular searches all sequences as if they were circular
Press q to quit or <Return> for more: <rtn>
-ALL does an "overlapping-set" search in nucleotide sequences
-PERFect looks only for perfect matches
-APPend appends the pattern data file to the output file
-SHOw shows every file searched even if there are no finds
-TERminal writes output to the terminal screen instead of a file
-NOMONitor suppresses the screen trace showing each file
-ONCe limits finds to patterns found a maximum of 1 time
-MINCuts=1 limits finds to patterns found a minimum of 1 time
-MAXCuts=3 limits finds to patterns found a maximum of 3 times
-EXCLude=n1,n2 excludes patterns found between positions n1 and n2
-SINce=6.90 limits search to sequences dated on or after June 1990
-BATch Submits the program to run in the batch queue
Add what to the command line ? -data=primer.dat -mismatch=2 -batch
It is very important to provide the correct answers at this point! You need to add the appropriate data file, a realistic mismatch level, and the batch option to the command line. Give your data filename after the -data= qualifier and a mismatch level of less than 20% of the length of your shortest sequence. Do not forget to add the -batch option! The less-than-20% mismatch cutoff is a "rule of thumb" because that is the number of mismatches expected if all codon choices were made on a completely random basis. In this example I use a mismatch level of about 10% (2 mismatches for my 18-base primers). The program will then ask you which sequences you want to find your pattern in. These are not your primer sequences; these are the sequences you want to search your primer patterns against. Therefore, answer with the appropriate subdivision of GenBank. Since I am dealing with prions from human beings, the primate portion of GenBank is most relevant and I answer gb_pr:*, which means that I want to search all of the sequences in the primate subdivision of GenBank. (See the GenHelp User's Guide chapter Using Sequences, topic Using Database Sequences, subtopic Nucleic Acid Database tables, if this still confuses you.) Finally, give the output file an appropriate name and the program will submit itself to the batch queue:
FINDPATTERNS in what sequence(s) ? gb_pr:*
What should I call the output file (* findpatterns.find *) ? prion.finds
** findpatterns will run as a batch or at job.
** findpatterns was submitted using the command:
" batch "
warning: commands will be executed using /bin/sh
job 848528595.b at Wed Nov 20 14:23:15 1996
This will take a while to run so don't worry about its output at this time. The output file will appear in your week5 directory when the job has finished and a message will be e-mailed to you by the system reporting any errors encountered during the run.
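The 20% rule of thumb for the mismatch level is easy to check with a few lines of Python. This is only a sketch: the helper name max_mismatches is hypothetical and is not part of GCG. It returns the largest whole number of mismatches that stays strictly below a given percentage of the primer length.

```python
def max_mismatches(primer_length, percent=20):
    """Largest whole number of mismatches strictly below percent% of the length.

    Hypothetical helper, not a GCG function. Integer arithmetic is used so
    that the strict inequality 100 * k < percent * primer_length is exact.
    """
    return (percent * primer_length - 1) // 100


if __name__ == "__main__":
    shortest = 18  # length of the shortest primer in the example primer.dat
    # Upper limit allowed by the 20% rule of thumb.
    print(max_mismatches(shortest))
    # Percentage of the primer represented by -mismatch=2.
    print(round(100 * 2 / shortest, 1))
```

For an 18-base primer the limit works out to 3 mismatches, so the -mismatch=2 used in the batch search (about 11% of the primer) stays inside the rule of thumb.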
Because you won't have immediate results from this FindPatterns search, and because we do need to proceed with the remainder of the exercise today, we are going to `cheat' on this next step. Even though we are pretending that the system we are working on has never been sequenced in our particular organism, you are to use an actual mRNA sequence for your selected molecule in this next step. In your previous database searching you should have found at least one case where your selected molecule has both a genomic entry and a corresponding mRNA/cDNA entry. Use that mRNA sequence for this step. If this were reality and you really did not have a known mRNA template sequence, then you would have to wait until the above FindPatterns search was finished in order to use the best matching template sequence available for further primer testing and evaluation.
Therefore, rerun FindPatterns on your selected mRNA sequence using all the same options as above except do not use -batch. If the mismatch level that you specified above is not large enough to find any matchups on your mRNA sequence on the first attempt, then rerun the program with a higher value. Since GCG's Prime program will not tolerate any ambiguities or mismatches, we need to do this in order to correct any mismatches between the template and primer sequences for the subsequent and final round of primer testing. My example with the human prion mRNA follows; here I had to increase my mismatch level from 2 to 3:
% findpatterns -data=primer.dat -mismatch=3
FindPatterns identifies sequences that contain short patterns like
GAATTC or YRYRYRYR. You can define the patterns ambiguously and allow
mismatches. You can provide the patterns in a file or simply type them
in from the terminal.
FINDPATTERNS in what sequence(s) ? gb:humprp
What should I call the output file (* humprp.find *) ? <rtn>
HUMPRP len: 2,415
FINDPATTERNS in what sequence(s) ? <rtn>
Total finds: 2
Total length: 2,415
Total sequences: 1
CPU time: 01.07
Output file: /disk2/usr/local/people/thompson/BC578/EX5/humprp.find
The result, given below, points out those exact bases where our backtranslated guessmer got it wrong. Notice that all sequences are listed in the forward direction only. Therefore, if you are working with paired PCR primers, you will have to work out the reverse complement yourself in order to effect the corrections to your reverse primers in the primer.dat file.
! FINDPATTERNS on gb:humprp allowing 3 mismatches
! Using patterns from: primer.dat November 20, 1996 15:11 ..
HUMPRP ck: 4378 len: 2,415 ! M13899 Human prion protein (PrP) mR
NA, complete cds. 1/95
F1 AACCGCTACCCCCCCCAG
188: GAGGC aaccgctacccacctcag GGCGG mis=2
R1 /Rev CCGTGATCCTGCTGATCA
765: TCCAC ctgtgatcctcctgatct CTTTC mis=3
Databases searched:
gb_pr, Release 97.0, Released on 0Oct96, Formatted on 0Oct96
Total finds: 2
Total length: 2,415
Total sequences: 1
CPU time: 01.07
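The kind of search reported above can be imitated with a naive sliding-window comparison. The sketch below is not how FindPatterns is implemented; it is a minimal illustration, and the function names are hypothetical. It scans only the top strand, uses just the short excerpts of HUMPRP printed in the result (five flanking bases on each side of each find), and takes the reverse primer in the top-strand form shown on the /Rev line.

```python
def count_mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_with_mismatches(template, pattern, max_mismatch, first_base=1):
    """Return (position, mismatches) for every window within the limit.

    first_base is the sequence coordinate of template[0], so that positions
    can be reported in the numbering of the full database entry.
    """
    template, pattern = template.upper(), pattern.upper()
    hits = []
    for i in range(len(template) - len(pattern) + 1):
        mis = count_mismatches(template[i:i + len(pattern)], pattern)
        if mis <= max_mismatch:
            hits.append((i + first_base, mis))
    return hits


# Excerpts copied from the FindPatterns result: 5 flanking bases, the
# 18-base find, 5 flanking bases. The find starts at 188 (F1) and 765 (R1).
F1_REGION = "GAGGC" + "aaccgctacccacctcag" + "GGCGG"
R1_REGION = "TCCAC" + "ctgtgatcctcctgatct" + "CTTTC"
F1_GUESS = "AACCGCTACCCCCCCCAG"   # pattern line for F1
R1_SHOWN = "CCGTGATCCTGCTGATCA"   # pattern line for R1 /Rev

if __name__ == "__main__":
    print(find_with_mismatches(F1_REGION, F1_GUESS, 2, first_base=183))
    print(find_with_mismatches(R1_REGION, R1_SHOWN, 2, first_base=760))
    print(find_with_mismatches(R1_REGION, R1_SHOWN, 3, first_base=760))
```

With a limit of 2 only the F1 site is found; the R1 site appears once the limit is raised to 3, which is why the mismatch level had to be increased for this run.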
Print out a copy of your mRNA FindPatterns result (lpr filename if you are in the teaching lab) to assist in the correction of your primer.dat file. Use pico to make the necessary changes in those primers that the program found on your mRNA sequence. Comment out the remainder by placing the GCG remark delineator ! in front of them. My revised primer.dat file follows; I've indicated revisions with lower case but this is not necessary:
primer.dat: The pattern data file for searching the DNA databases with
FindPatterns using candidate prion primers from Prime.
Name Offset Pattern Overhang ..
F1 1 AACCGCTACCCaCCtCAG 0
R1 1 aGATCAGgAGGATCACaG 0
! R2 1 TGCTGAACAGCACCATGC 0
! R3 1 TGAACAGCACCATGCTGC 0
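If you would rather not reverse complement the reverse primer by hand, the corrections can be worked out with a short Python sketch. The function names here are hypothetical, and the sketch assumes that the /Rev line of the FindPatterns result shows the reverse complement of the pattern as it was entered in primer.dat. Corrected bases are written in lower case, as in the revised file above.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def mark_corrections(guess, actual):
    """Return actual in upper case, with lower case wherever it differs from guess."""
    guess, actual = guess.upper(), actual.upper()
    return "".join(a if a == g else a.lower() for g, a in zip(guess, actual))


# Forward primer: the pattern and the find are both on the top strand.
f1_guess = "AACCGCTACCCCCCCCAG"
f1_site = "aaccgctacccacctcag"
f1_fixed = mark_corrections(f1_guess, f1_site)

# Reverse primer: the result lists the top strand, so both the displayed
# pattern and the find are reverse complemented to get back to primer form.
r1_shown = "CCGTGATCCTGCTGATCA"
r1_site = "ctgtgatcctcctgatct"
r1_guess = reverse_complement(r1_shown)
r1_fixed = mark_corrections(r1_guess, reverse_complement(r1_site))

if __name__ == "__main__":
    print("F1", f1_fixed)
    print("R1", r1_fixed)
```

The two strings printed should agree with the F1 and R1 patterns in the revised primer.dat file; if they do not, recheck which strand each sequence was copied from.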
9) Final testing using our revised primer.dat file.
For our final primer evaluation we will rerun GCG's program Prime; however, this time we will not be using it to locate appropriate primers on conserved sequences as we did earlier. Rather, we will use it to evaluate our final list of suggested primers against a known sequence in its entirety. (In this case the known sequence is your selected molecule's actual mRNA sequence; if this were reality, the known sequence would be the best match found by the previous FindPatterns search of all of GenBank.) Your primer.dat file should now contain only those primers that have passed all previous tests and that have had their mismatches corrected. Relaunch Prime with the same `magic' options that you were forced to use previously. If different settings were used for your forward and reverse primers previously, use a combination of the more relaxed parameters. If you are dealing with paired PCR primers, then you do not want to check forward or reverse primers only; this time you want to check your entire suggested list, so do not use the forward-only or reverse-only options, but do use the -primers= option. Specify your known mRNA sequence as the one to be used in the run and accept the defaults up to the "Maximum product length" query. There, specify the entire length of your mRNA sequence. Again, suppress the plot. My example screen trace follows:
% prime -gcmaxprimer=73 -tmmaxprimer=68 -primers=primer.dat
Prime selects oligonucleotide primers for a template DNA sequence. The
primers may be useful for the polymerase chain reaction (PCR) or for DNA
sequencing. You can allow Prime to choose primers from the whole template
or limit the choices to a particular set of primers listed in a file.
The Polymerase Chain Reaction (PCR) process for amplifying nucleic acids is
covered by U.S. Patent Nos. 4,683,195 and 4,683,202 owned by Hoffmann La
Roche. A license for research may be obtained through the purchase and use
of authorized reagents and thermocyclers from Perkin-Elmer Corp., or by
otherwise negotiating a license with Perkin-Elmer. No license to use PCR is
granted by the purchase or use of the Wisconsin Package(TM).
Prime of what sequence ? gb:humprp
Begin (* 1 *) ? <rtn>
End (* 2415 *) ? <rtn>
Minimum primer length (* 18 *) ? <rtn>
Maximum primer length (* 22 *) ? <rtn>
Minimum product length (* 100 *) ? <rtn>
Maximum product length (* 300 *) ? 2415
What should I call the output file name (* humprp.prime *) ? <rtn>
This program can display the primer binding sites graphically.
Do you want to:
A) Plot to a FIGURE file called "prime.figure"
B) Plot graphics on VERSATERM-TEK4105 attached to term
C) Suppress the plot
Please choose one (* A *): c
Selecting primers from list
..
Selecting primer pairs
INPUT SUMMARY
-------------
Input sequence: gb_pr:humprp
Input primer list: primer.dat with 2 primers.
Primer constraints:
primer size: 18 - 22
primer 3' clamp: S
primer sequence ambiguity: NOT ALLOWED
primer GC content: 40.0 - 73.0%
primer Tm: 50.0 - 68.0 degrees Celsius
primer self-annealing. . .
3' end: < 8 (weight: 2.0)
total: < 14 (weight: 1.0)
unique primer binding sites: required
primer-template and primer-repeat annealing. . .
3' end: ignored
total: ignored
repeated sequences screened: none specified
Product constraints:
product length: 100 - 2415
product GC content: 40.0 - 55.0
product Tm: 70.0 - 95.0 degrees Celsius
duplicate primer endpoints: NOT ALLOWED
difference in primer Tm: < 2.0 degrees Celsius
primer-primer annealing. . .
3' end: < 8 (weight: 2.0)
total: < 14 (weight: 1.0)
PRIMER SUMMARY
--------------
forward reverse
Number of primers considered: 1 1
Number of primers rejected for . . .
primer 3' clamp: 0 0
primer sequence ambiguity: 0 0
primer GC content: 0 0
primer Tm: 0 0
non-unique binding sites: 0 0
primer self-annealing: 0 0
primer-template annealing: 0 0
primer-repeat annealing: 0 0
Number of primers accepted: 1 1
PRODUCT SUMMARY
---------------
Number of products considered: 1
Number of products rejected for. . .
product length: 0
product GC content: 0
product Tm: 0
product position: 0
duplicate primer endpoints: 0
difference in primer Tm: 1
primer-primer annealing: 0
Number of products accepted: 0
Number of products saved: 0
Output file: humprp.prime
CPU time: 0.49 seconds
Again, the results may be frustratingly negative. This time, in my case, no products were found because the difference in primer Tm values was too large. As frustrating as Prime can be, it does point out the exact conditions that must be altered from standard PCR reactions in order to have any success in the wet lab. In my case, rerunning the program with tmdifference set to 10 degrees and gcmaxproduct set to correspond with gcmaxprimer did the trick. The program does not indicate whether this is a totally impossible PCR condition, so do not blindly accept the results! On your last run through the program you may want to accept the plot to see the graphic. This may all seem like a genuine pain just to get a couple of primers for PCR; however, realize that successful primers found in this manner will most likely work with all similar organisms for this gene. You will not have to repeat the experience until you are given a totally different system to work on. The results from my successful final run are shown below:
PRIME of: gb_pr:humprp ck: 4378 from: 1 to: 2415 November 20, 1996 16:35
INPUT SUMMARY
-------------
Input sequence: gb_pr:humprp
Input primer list: primer.dat with 2 primers.
Primer constraints:
primer size: 18 - 22
primer 3' clamp: S
primer sequence ambiguity: NOT ALLOWED
primer GC content: 40.0 - 73.0%
primer Tm: 50.0 - 68.0 degrees Celsius
primer self-annealing. . .
3' end: < 8 (weight: 2.0)
total: < 14 (weight: 1.0)
unique primer binding sites: required
primer-template and primer-repeat annealing. . .
3' end: ignored
total: ignored
repeated sequences screened: none specified
Product constraints:
product length: 100 - 2415
product GC content: 40.0 - 73.0
product Tm: 70.0 - 95.0 degrees Celsius
duplicate primer endpoints: NOT ALLOWED
difference in primer Tm: < 10.0 degrees Celsius
primer-primer annealing. . .
3' end: < 8 (weight: 2.0)
total: < 14 (weight: 1.0)
PRIMER SUMMARY
--------------
forward reverse
Number of primers considered: 1 1
Number of primers rejected for . . .
primer 3' clamp: 0 0
primer sequence ambiguity: 0 0
primer GC content: 0 0
primer Tm: 0 0
non-unique binding sites: 0 0
primer self-annealing: 0 0
primer-template annealing: 0 0
primer-repeat annealing: 0 0
Number of primers accepted: 1 1
PRODUCT SUMMARY
---------------
Number of products considered: 1
Number of products rejected for. . .
product length: 0
product GC content: 0
product Tm: 0
product position: 0
duplicate primer endpoints: 0
difference in primer Tm: 0
primer-primer annealing: 0
Number of products accepted: 1
Number of products saved: 1
--------------------------------------------------------------------------------
Product: 1
[DNA] = 50.000 nM [salt] = 50.000 mM
PRIMERS
-------
forward primer: F1
reverse primer: R1
5' 3'
forward primer (18-mer): 188 AACCGCTACCCACCTCAG 205
reverse primer (18-mer): 782 AGATCAGGAGGATCACAG 765
forward reverse
primer %GC: 61.1 50.0
primer Tm (degrees Celsius): 58.7 50.1
PRODUCT
-------
product length: 595
product %GC: 57.5
product Tm: 82.5 degrees Celsius
difference in primer Tm: 8.6 degrees Celsius
annealing score: 53
optimal annealing temperature: 57.9 degrees Celsius
--------------------------------------------------------------------------------
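Several of the numbers in this report can be checked by hand or with a short Python sketch, which is a useful sanity test after editing primer.dat. The function names are hypothetical. The sketch does not attempt to reproduce Prime's Tm calculation, whose method is not described here; the Tm values are simply copied from the report.

```python
def gc_percent(seq):
    """Percentage of G and C bases in a DNA string."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def product_length(forward_start, reverse_start):
    """Length of the product spanning both primer 5' ends, inclusive."""
    return reverse_start - forward_start + 1


FORWARD = "AACCGCTACCCACCTCAG"    # F1, 5' end at position 188
REVERSE = "AGATCAGGAGGATCACAG"    # R1, 5' end at position 782
TM_FORWARD, TM_REVERSE = 58.7, 50.1   # copied from the Prime report

if __name__ == "__main__":
    print(round(gc_percent(FORWARD), 1), round(gc_percent(REVERSE), 1))
    print(product_length(188, 782))
    tm_difference = round(abs(TM_FORWARD - TM_REVERSE), 1)
    # The first run required a difference below 2.0; the second below 10.0.
    print(tm_difference, tm_difference < 2.0, tm_difference < 10.0)
```

The primer %GC values, the 595-base product length, and the 8.6 degree Tm difference all agree with the report, and the last line shows why the pair was rejected under the default 2.0 degree limit but accepted once the limit was relaxed to 10.0 degrees.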
Here's the plot produced by the run; not very impressive in this case:

10) Evaluations and exercise credit.
Before leaving for the day you need to send some of the files in your week5 directory to the teacher account so we can see how you've done. However, you must first change their filenames to reflect your name so that we can tell who sent what. Follow the examples shown below, renaming and sending only those files indicated. First rename, one at a time, the two figure files from the multiple sequence alignment portion of the exercise, giving them your last name as the filename and the extensions .pileup and .plotsim respectively (i.e. replace lastname in the commands below with your own last name).
% mv pileup.figure lastname.pileup
% mv plotsimilarity.figure lastname.plotsim
Next rename the following files in a similar manner. Do not rename all of your files in the directory or else you may interfere with your FindPatterns batch job!
% mv (just your edited, final version).strings lastname.strings
% mv *.msf lastname.msf
% mv primer.dat lastname.dat
The sequence analysis exercise 5 report form, week5s.week5s, and a confirmation report form, confirm.confirm, were copied to your directory at the beginning of the exercise. We want you to confirm the general nature of the type of project that you want to do in the course. Rename these files giving them your last name as a filename, just as above, and then use the pico editor to answer the questions on the forms.
Finally, send all these lastname.* files to teacher using the remote copy command rcp (while in your week5 directory). This is the only way that you can get credit for completing the exercise, so don't space it out! Please, only send the requested files, not your entire directory.
% rcp lastname.* teacher@ribozyme:receive
By now, the fifth week of the semester, you should have a pretty good handle on what you will be doing for your final project. You should have already had a conference with Dr. Dunker concerning your project and you should know whether you will be working more in sequence analysis or more in molecular modeling for the project. You should know the general scope of your project and you should be aware of all the relevant journal literature available. You should also have a feel for the limitations of computational methods in addressing your desired goals. Above all you should have already explored all databases (bibliographic, sequence, and structure) using text-based approaches for information relevant to your system. One of the best tools for this exploration is Entrez, as illustrated in earlier exercises. If you've been really ambitious, you may have started exploring similarity-based searching methods even though they have not yet been covered in the lab. We encourage you to have as much of this preliminary text-based project searching done by now as possible, because the remainder of the semester, starting with this present lab, is much more `labor intensive' and requires a considerably greater commitment of computer time to finish the assigned exercises.
After you have completely finished everything in today's lab exercise, logout of ribozyme and exit the VersaTerm-Pro terminal emulation program. This will return you to the initial Apple Desktop. Do not shut off the machine. This concludes the exercise for the week.
Next week's session will show how to assemble your newly discovered sequencing data into meaningful sequences using GCG's Fragment Assembly System (FAS).
Back in the wet lab you would have synthesized the oligos (and labeled them, if doing hybridization), performed the PCR or hybridization screen, and isolated the products with plaque/colony purification or direct PCR purification, as appropriate.
After you have found a candidate sequence, what next? Often it is restriction mapping.
The unknown stretch of DNA is restriction digested with various enzymes and agarose gel electrophoresed; the resultant fragment sizes are extrapolated from migration distances. From this information a tentative restriction map can be hypothesized.
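Estimating fragment sizes from migration distances is usually done against a size ladder run on the same gel, using the common approximation that log10(size) falls roughly linearly with distance over a limited range. The Python sketch below fits that line by least squares. It is an illustration only: the function names are hypothetical and the ladder values are invented for the example, not measured data.

```python
import math


def fit_log_size(ladder):
    """Least-squares fit of log10(size_bp) = a + b * distance_mm.

    ladder is a list of (distance_mm, size_bp) pairs from marker bands.
    """
    xs = [d for d, _ in ladder]
    ys = [math.log10(s) for _, s in ladder]
    n = len(ladder)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def estimate_size(distance_mm, a, b):
    """Fragment size in base pairs predicted for a migration distance."""
    return 10 ** (a + b * distance_mm)


if __name__ == "__main__":
    # Invented ladder, chosen to be exactly log-linear for clarity.
    ladder = [(10.0, 10000), (20.0, 1000), (30.0, 100)]
    a, b = fit_log_size(ladder)
    print(round(estimate_size(25.0, a, b)))
```

Real gels deviate from the straight line at both ends of the size range, so estimates are most trustworthy for bands that fall between ladder markers.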
This type of restriction mapping, i.e. reconstructing a physical map based on overlaps without having an actual sequence, is computationally very difficult. Few automated solutions exist.
Alternative strategies include subcloning the pieces into a manageable vector and then sequencing those fragments, or direct PCR product sequencing.
Anyway, after generating some sequence data, the other type of restriction mapping, in which you do know the sequence and merely want to know where all the various restriction enzymes may cut, can be very helpful. The GCG programs Map, MapPlot, MapSort, and PlasmidMap can all assist in guiding and illustrating this process. Once all cut sites have been mapped, SeqEd can be used to actually perform the subcloning operation on the computer before doing it in the wet lab.
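When the sequence is known, locating recognition sites is a simple pattern search. The Python sketch below is not a substitute for Map or MapSort; it is a minimal illustration with hypothetical function names, three well-known enzymes, and an invented toy sequence. It reports where each recognition site begins, not the exact cut position within the site.

```python
import re

# Recognition sequences for three common enzymes. All three are palindromic,
# so scanning the top strand alone finds every site.
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def find_sites(seq, sites=SITES):
    """Map each enzyme name to the 1-based start positions of its sites."""
    seq = seq.upper()
    hits = {}
    for name, site in sites.items():
        # A lookahead pattern reports overlapping occurrences as well.
        hits[name] = [m.start() + 1 for m in re.finditer("(?=" + site + ")", seq)]
    return hits


if __name__ == "__main__":
    toy = "TTGAATTCAAGGATCCTTGAATTCAA"   # invented example sequence
    for enzyme, positions in find_sites(toy).items():
        print(enzyme, positions)
```

Enzymes with ambiguous or non-palindromic recognition sequences would need ambiguity-code handling and a search of both strands, which this sketch leaves out.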
Cherfas, J. (1990). Genes Unlimited. New Scientist 14, 29-33.
Gribskov, M., Luethy, R., and Eisenberg, D. (1989). Profile Analysis. In Methods in Enzymology, 183, (pp 146-159), Academic Press, San Diego, California, U.S.A.
Gupta, S. K., Kececioglu, J., and Schaffer, A. A. (1995). Making the Shortest-Paths Approach to Sum-of-Pairs Multiple Sequence Alignment More Space Efficient in Practice. Proc. 6th Annual Combinatorial Pattern Matching conference (CPM `95).
Henikoff, S. and Henikoff, J.G. (1992). Amino Acid Substitution Matrices from Protein Blocks. Proceedings of the National Academy of Sciences U.S.A. 89, 10915-10919.
Mullis, K.B. (1990). The Unusual Origin of the Polymerase Chain Reaction. Scientific American April, 56-65.
Saiki, R.K., Gelfand, D.H., Stoffel, S., Scharf, S.J., Higuchi, R., Horn, G.T., Mullis, K.B., and Erlich, H.A. (1988). Primer-Directed Enzymatic Amplification of DNA with a Thermostable DNA Polymerase. Science 239, 487-491.
Sambrook, J., Fritsch, E.F., and Maniatis, T. (1989). Synthetic Oligonucleotide Probes. In Molecular Cloning A Laboratory Manual, 2nd ed. (pp 11.2-11.53), Cold Spring Harbor Laboratory Press, New York, New York, USA.
Schwartz, R.M. and Dayhoff, M.O. (1979). Matrices for Detecting Distant Relationships. In Atlas of Protein Sequences and Structure, 5, Suppl. 3, (pp 353-358), National Biomedical Research Foundation, Washington, D.C., U.S.A.
Smith, R.F. and Smith, T.F. (1992). Pattern-Induced Multi-sequence Alignment (PIMA) algorithm employing secondary structure-dependent gap penalties for comparative protein modelling. Protein Engineering 5, 35-41.
Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22, 4673-4680.
White, T.J., Arnheim, N., and Erlich, H.A. (1989). The Polymerase Chain Reaction. Trends in Genetics 5, 185-189.
Wood, W.I. (1987). Gene Cloning Based on Long Oligonucleotide Probes. In Methods in Enzymology, 152, (pp 443-447), Academic Press, San Diego, California, USA.
Software Used:
Edelman, I., Olsen, S., and Devereux, J. (1996) Program Manual for the Wisconsin Package, Version 9.1. Genetics Computer Group (GCG), Madison, Wisconsin, USA 53711.