Gene Finding Strategies: After the Sequencing's Done, What Next? How are coding sequences recognized in genomic DNA? Searching by signal versus searching by content -- transcriptional/translational regulatory sites and exon/intron splice sites; "nonrandomness" and codon usage -- understanding the concepts and differentiating between the approaches.
Author:
Steven M. Thompson
Current genome projects are generating billions of base pairs of data at an explosive rate. GenBank has doubled in size almost every 14 months since 1982 and now contains over a million sequencs. Given all this unknown DNA, how are encoded genes determined and positioned? Translating from all translational start codons to all "nonsense" chain terminating, stop codons in every frame provides a list of ORF's (Open Reading Frames), but which of them, if any, actually code for proteins? And if you are dealing with eukaryotic genomic DNA, then exons and introns considerably complicate the matter. Two general approaches to the gene finding problem can be imagined:
We can utilize both of these facts to help locate the position of genes in DNA. These methods are often known as "searching by signal" and "searching by content" respectively. Although neither method is absolutely reliable, one seldom has the luxury of knowing the complete amino acid sequence to the protein of interest and simply translating DNA until the correct pieces fall out. This is the only method that would be 100% positive. Since we are usually forced to discover just where these pieces are, especially with genomic DNA, computerized analysis becomes invaluable.
Signal searches look for transcriptional and translational features. Transcriptional regulatory sites such as promoters and other transcription factor and enhancer binding sequences can help identify the beginnings of genes, however, some of these motifs can be quite distant from the actual start of transcription. The prokaryote Shine-Dalgarno consensus and other eukaryote ribosome binding sites obviously relate to translation initiation, as does the methionine start codon. However, matters can be complicated by alternative start codons; AUG works in about 90% of cases, but there are exceptions in some prokaryotes and organellar genomes. Transcriptional terminator and attenuator sequences can help identify gene ends as do the termination (stop) codons, but, again, exceptions can be found, especially in some ciliated protists and due to eukaryote suppresser tRNA's. Furthermore, splice site donor-acceptor consensus sequences can point to intron-exon borders. All of these types of signals can help us recognize coding sequences. A major problem is simple consensus pattern type searching is often either overly or insufficiently stringent because of the variable and loosely defined nature of these types of sites. Weight matrix approaches are more powerful, but not nearly as simple to set up in most commercial sequence analysis packages.
There are two general strategies for finding coding regions of DNA based on the content of the DNA itself -- one is based on the local "nonrandomness" of a stretch and the other is based on the known codon usage of a particular life form. The first, the nonrandomness test, does not tell us anything about the particular strand or reading frame; however, it does not require a previously built codon usage table. The codon usage approach is based on the fact that different types of organisms use different frequencies of codons to code for particular amino acids. This approach requires a codon usage table built up from known translations; however, it also tells us the strand and reading frame as opposed to the former. Both of these content approaches rely on implicit biological constraints imposed on the genetic code, constraints which we can utilize to help discriminate structural genes from all the rest of the, some would incorrectly say, "junk" DNA found in most genomes. The content approach is often more accurate; it does not generate nearly as many false positives as signal type searches, but its answers aren't concise either.
All available methods should be used together to help reinforce and/or reject the others' findings. Furthermore, the similarity searching methods to be taught in Exercise 8 should be used for inference through homology. The chore of identifying coding sequences is far from trivial and is a long way from being solved in an unambiguous manner; however, it is extremely important anytime anyone starts sequencing genomic DNA and doesn't have the luxury of an available cDNA library.
Several powerful E-mail servers have been established which can also help with these types of analyses. Most combine many of the methods discussed above. In particular they have been designed to use neural net and artificial intelligence approaches to help in the "decision" process. A careful application and interpretation of the many resources at one's disposal can go a long way to increasing our understanding of gene structure and function. But, as always, carry a healthy dose of skepticism to, and be extremely wary, at any session with the terminal, as the naive can easily be misled into accepting inappropriate or downright wrong results.
The objective of this laboratory exercise is to gain familiarity with and to utilize many of the computerized methods available for trying to identify coding regions within genomic DNA. You will be using your selected molecule throughout the exercise and hopefully you will be able to apply the experience to your own research programs. This exercise can be particularly frustrating; hang in there. Its purpose is to show just how difficult of a problem this is. I don't expect you to get the "right" answer, just close answers and lots of insight.
Preliminary Preparations:
Log on to your account and initialize gcg. Create a new subdirectory for this exercise, move to it, and copy your consensus output file from last week's Fragment Assembly session into the new directory:
% cp ../week6/lastname.consensus .
Also copy the necessary files and the exercise report form from ribozyme to the new directory. Complete this procedure just the same as you've done in previous exercises by using the environment variable name $GRAD_DIR:
% cp $GRAD_DIR/week7s/* .
Prelude: Exercise 6 Housecleaning!
If you have finished Exercise 6, you need to return to that directory to delete the Fragment Assembly System's databases in order to free up some space on ribozyme. Issue the command gelstart -delete and answer the project query with your fragment assembly project's name. Confirm that you really do want to delete the entire project. The deletion process can take a few moments as the database is quite extensive. Also delete the Practice session's database by repeating the above procedure with its name. Move back over into your new Exercise 7 directory to proceed with the rest of the exercise.
1) Map your sequence consensus.
The first order of business is to utilize GCG's mapping program to translate all reading frames within your sequence. Again, I'll illustrate with the prion example, but you should use your own molecule of interest from the Selected Molecule list. Type map (your molecule's consensus filename.ext [from Exercise 6]) and a whole list of questions is presented. At this point we are not really interested in where the restriction sites are, therefore, we need to tell the program "no enzymes:" To do this press either the # sign or a <space> then <return> to select no enzymes. Realize that this program's primary purpose is to generate restriction maps and it is very powerful in this function. Next, we need to tell the program to select all six reading frames; select "s." This will generate a list of all Unidentified Reading Frames (URF's) as opposed to Open Reading Frames (ORF's) which by definition start with a start codon and end with a stop codon. Since many exons will not begin with a start codon (only the first will necessarily begin with one), all six frames is the more appropriate choice for genomic eukaryotic sequences. Accept the default output and display the file to the terminal after it's ready. The prion example session follows:
% map prion.consensus
Map maps a DNA sequence and displays both strands of the mapped sequence
with restriction enzyme cut points above the sequence and protein
translations below.
Begin (* 1 *) ? <rtn>
End (* 2415 *) ? <rtn>
Select the enzymes: Type nothing or "*" to get all enzymes. Type "?"
for help on what enzymes are available and how to select them.
Enzyme(* * *): <space><rtn>
What protein translations do you want:
a) frame 1 b) frame 2 c) frame 3
d) frame 4 e) frame 5 f) frame 6
t)hree forward frames s)ix frames o)pen frames only
n)o protein translation q)uit
Please select (capitalize for 3-letter) (* t *): s
What should I call the output file (* prion.map *) ? <rtn>
Mapping .
Writing ..........
MAP complete with:
Sequence Length: 2,415
Enzymes Chosen: 0
Cutsites found: 0
CPU time: 00.32
Output file(s): prion.map
Next, a small portion of the prion output file is displayed from the more command:
% more prion.map
(Linear) MAP of: prion.consensus check: 4378 from: 1 to: 2415
November 21, 1996 10:55 ..
CGGCGCCGCGAGCTTCTCCTCTCCTCACGACCGAGGCAGAGCAGTCATTATGGCGAACCT
1 ---------+---------+---------+---------+---------+---------+ 60
GCCGCGGCGCTCGAAGAGGAGAGGAGTGCTGGCTCCGTCTCGTCAGTAATACCGCTTGGA
a R R R E L L L S S R P R Q S S H Y G E P -
b G A A S F S S P H D R G R A V I M A N L -
c A P R A S P L L T T E A E Q S L W R T L -
1 ---------+---------+---------+---------+---------+---------+ 60
d A G R A E G R R V V S A S C D N H R V -
e R R S S R R E E R G L C L L * * P S G -
f P A A L K E E G * S R P L A T M I A F R -
TGGCTGCTGGATGCTGGTTCTCTTTGTGGCCACATGGAGTGACCTGGGCCTCTGCAAGAA
61 ---------+---------+---------+---------+---------+---------+ 120
ACCGACGACCTACGACCAAGAGAAACACCGGTGTACCTCACTGGACCCGGAGACGTTCTT
a W L L D A G S L C G H M E * P G P L Q E -
b G C W M L V L F V A T W S D L G L C K K -
c A A G C W F S L W P H G V T W A S A R S -
61 ---------+---------+---------+---------+---------+---------+ 120
d K A A P H Q N E K H G C P T V Q A E A L -
e Q S S S A P E R Q P W M S H G P G R C S -
f P Q Q I S T R K T A V H L S R P R Q L F -
///////////////////////////////////////////////////////////////////////////
Both strands of the DNA sequence are shown together with all of the URF's
below each block; asterisks indicate stop codons. Frames a through c are in
the forward direction and d through f are in the reverse. Use the lpr
command to make yourself a printout of your URF map; you'll need to refer to
it later on. At this point we can see that there are many potentially
translated stretches; so what? What can be done with them; how can we turn them into potential genes?
Typical signals to look for are promoter and terminator consensus stretches. GCG provides a searching program named Terminator for looking for the latter in prokaryotic rho-independent cases; however, promoter signals from both prokaryotes and eukaryotes are so varied that they do not have a `canned' search for them. An impressive eukaryotic transcription factor consensus sequence database has been assembled though, and prokaryotic promoter sequences are fairly well characterized. We can utilize the GCG program FindPatterns to look for these type of sites within our sequence. GCG also provides the ability to find short consensus patterns based on a family of related sequences using weight matrix analysis with the programs Consensus and FitConsensus. These can be used to form and search for specific promoters or other signals based on known sequences. Also, remember that many termination sites are accompanied by inverted repeats, and enhancer sequences are often strong direct repeats; because of these points, the GCG programs StemLoop and Repeat, as well as dotplot procedures, may be helpful.
I am forced to change my example molecule for this exercise as the previous prion example was a cDNA so it is not appropriate for illustrating gene splicing. Therefore, the human cytokine interleukin I beta gene, IL1B, will be used to illustrate the procedures throughout the remainder of this exercise. The much simpler prokaryote case will only be mentioned in these exercises. As always, use your own selected molecule, not the example in the following steps!
2) Look for potential regulatory sites by screening with FindPatterns.
You've already used FindPatterns to match up your probe to the database in Exercise 5; now we're going to use it to try to locate promoter regions in our sequences.
I have written and placed a prokaryote promoter consensus pattern based on the E. coli data of Hawley and McClure (1983) in the GCG logical directory location GenMoreData. You are welcome to screen your eukaryotic sequence with this pattern, although it is not required and probably will not be relevant since the pattern encompasses both the -35 and -10 regions (if it was just the -10 portion, some eukaryotic TATA boxes might be found with it). If desired, run FindPatterns on your consensus sequence specifying data= genmoredata:promoter.dat. If you run this search, use the append option to attach the data file to the end of your output so you can see what it looks like. Specify your consensus sequence's filename.extension in response to the "what sequence" query. The first part of an example screen trace follows:
% findpatterns -data=genmoredata:promoter.dat -mismatch=1 -append FINDPATTERNS identifies sequences with short pattern queries like GAATTC or YRYRYRYR. You can define the patterns ambiguously and allow mismatches. You can provide the patterns in a file or simply type them in from the terminal.Notice that this example was run with a mismatch level of one. Remember that FindPatterns allows the user to set any mismatch level desired. The prokaryote promoter data pattern follows; it may be of help to you in other research work.
The standard E. coli RNA polymerase promoter "Pribnow" box file for the
program FINDPATTERNS. This pattern includes both the -35 & the -10 region.
For an incredibly extensive list of eukaryotic transcription factor
recognition sites see the GCG public datafile TFSites.Dat. To specify one of
these files use the command line option -data=_datafile_name.
Name Offset Pattern Overhang Documentation ..
Promoter 1 TTGACwx{15,21}TAtAaT 0 !Hawley & McClure (1983)
Now for the tougher case -- your eukaryote example. A huge public pattern
data file is available. This file (Ghosh, 1990) is called tfsites.dat and is
available in the GCG logical directory location GenMoreData. Run FindPatterns
specifying genmoredata:tfsites.dat as an alternate data file. Do not
use a mismatch level above zero, as way too many hits would result; also do
not use the append option, as this data file is really huge and too much space
in your account would be consumed by it. (You are welcome to take a look at
the Transcription Factor Sites database by using the GCG command name
to resolve the disk location of genmoredata.) Do use the onestrand
option to restrict your search to the forward direction only since these
sequences all transcibe in the forward direction only and to help restrict the
size of the output file. Specify your consensus sequence's
filename.ext in response to the program's query. The interleukin
example FindPatterns session follows:
% findpatterns -data=genmoredata:tfsites.dat -one
FindPatterns identifies sequences that contain short patterns like
GAATTC or YRYRYRYR. You can define the patterns ambiguously and allow
mismatches. You can provide the patterns in a file or simply type them
in from the terminal.
FINDPATTERNS in what sequence(s) ? humil1.consensus
What should I call the output file (* humil1.find *) ? <rtn>
Humil1.Consensus len: 9,721 ...........................................
...............................
FINDPATTERNS in what sequence(s) ? <rtn>
Total finds: 705
Total length: 9,721
Total sequences: 1
CPU time: 35.02
Output file: /disk2/usr/local/people/thompson/BC578/EX7/il1/humil1.find
Notice the output is still really huge; 705 finds doesn't do us much good.
Scim through your output and see if anything stands out as significant. The
tfsites.dat file is available in printed form here in the lab and you can look
through it to note any interesting references concerning finds within your
sequence. Similarly, you can use the system level search utility grep
to search tfsites.dat (after you've resolved its disk location with
GCG's name program) for your entries of interest to get the
references.Other signals that could be looked for in a similar fashion are the Shine-Dalgarno prokaryote translational initiation site, (AGG,GAG,GGA)x{6,9}ATG (Stormo et al., 1982) and other types of eukaryotic ribosome binding sequences. These type of searches are based on complementarity to the 16s rRNA in prokaryotes; however, in eukaryotes ribosomes seem to initiate translation at the first AUG encountered following the modified guanosine 5' cap. Kozak (1984) has compiled a consensus at the start codon of cc(A,g)ccAUGg which doesn't seem to relate to 18s rRNA complementarity at all, yet seems to hold true in many cases. Find this pattern in your sequence by typing findpatterns; provide Kozak's pattern within the program when queried. Keep increasing the mismatch level until you find the pattern.
A main point of these consensus searches is "Don't believe everything your computer tells you!" A computer can provide guidance and insight but the limitations can sometimes be overwhelming as is all too evident in these promoter analyses. Oftentimes weight matrix analysis is more appropriate to this type of search. Unfortunately, they are not nearly as simple to set up.
3) Using the weight matrix program FitConsensus to find regulatory sites.
GCG has preassembled consensus weight matrices of the donor and acceptor site sequences at exon-intron splice junctions for use with FitConsensus available in their public data files. However, they do not provide any others; therefore, I have reformatted the four weight matrix descriptions of eukaryotic RNA polymerase II promoter elements reported by Bucher (1990) into a form appropriate for GCG's programs. Additionally, McLauchlan et al. (1985) assembled a eukaryotic terminator weight matrix which I have reformatted for GCG use.
Use all seven of these weight matrices in FitConsensus to help locate the transcription start, each subsequent exon's splice site, and termination signals. The GCG donor and acceptor consensus weight matrix files as well as the others mentioned above have all been placed in the GCG logical location GenMoreData. Start the program by typing fitconsensus (your consensus sequence's filename.ext); specify the genmoredata:donor.csn file matrix; generate lists of only the 20 best fits and give the output file your sequence's name but the extension .donor. The screen trace follows:
% fitconsensus humil1.consensus
FITCONSENSUS uses a consensus table written by CONSENSUS as a
probe to find the best examples of the consensus in a DNA sequence. You
can specify the number of fits you want to see, and FITCONSENSUS
tabulates them with their position, frame, and a statistical measure of
their quality.
Begin (* 1 *) ? <rtn>
End (* 9721 *) ? <rtn>
Using what consensus table file ? genmoredata:donor.csn
Your consensus table has position(s) with 100% certainty,
Are these necessary conditions for a fit (* Yes *) ? <rtn>
Show how many fits (* 40 *) ? 20
What should I call the output file (* humil1.fit *) ? humil1.donor
Run the same type of analysis with the acceptor.csn data file supplying
the appropriate responses to the program's queries and each of the other five
matrices. These have the file names tata.csn, cap.csn,
ccaat.csn, gc.csn, and terminator.csn. They are all
located in GenMoreData. After running each of these searches, display the
results of each to see what they look like. The interleukin donor output
follows:
% more humil1.donor
FITCONSENSUS of: humil1.consensus Check: 5335 from: 1 to: 9721
LOCUS HUMIL1BX 9721 bp ds-DNA PRI 15-DEC-1989
DEFINITION Human interleukin 1 beta gene.
ACCESSION X04500
KEYWORDS interleukin.
SOURCE Human (Homo sapiens) fetal liver DNA, clones
BDC[3,240,426,454,457]. . . .
Using Consensus: genmoredata:donor.csn
CONSENSUS from:
Splice site sequences
from Stephen Mount NAR 10(2) 459;472 figure 1 page 460
List-size: 20 Average quality: 37.03 September 26, 1991 08:22 ..
position: 731 1375 1442 1511 1561 2236 2524 2649 2857 3140 3899
frame: 2 1 2 2 1 1 1 3 1 2 2
quality: 52.42 49.83 47.17 47.75 50.67 56.17 55.50 49.50 49.33 56.08 52.75
position: 4389 5323 5580 6035 6472 7402 7534 8991 9332
frame: 3 1 3 2 1 1 1 3 2
quality: 48.42 49.83 50.25 53.83 47.42 57.00 49.92 47.25 49.67
Notice the output includes the position, frame, and "quality" of each match.
The position and quality are tremendously helpful. Position is where in your
sequence the specified weight matrix begins, perhaps not where the site that
you are most concerned is occurs, rather only where the matrix begins. This
is particularly important in the donor and acceptor matrices as these both
begin in front of the splice site, not at it. Quality is the percentage fit
to the matrix. The frame designation is troubling -- it is quite misleading
as it identifies the frame of the best fit to the matrix not to the coding
region -- so disregard it. Also note that none of the hits would have been
found by a normal FindPatterns type consensus search without mismatches
because the scores are all less than 100%. Later on in the exercise we will
see how well these predicted sites correlate with the actual splice sites.4) Help in locating the ends of genes: Terminator in prokaryotes; StemLoop, Repeat, and finding poly(A) signals in eukaryotes.
The GCG program Terminator will find about 95% of all prokaryotic factor-independent terminators. This is great odds for any computer algorithm; even its namesake Arnold Schwarzenegger would have a hard time matching this! This program is mentioned for those of you involved with prokaryotic data in your own research programs. However, realize that a main disadvantage of most signal searches, even a sophisticated two-dimensional approach like Terminator, is they find too many false positive sites, in other words they are not discriminatory enough. Just like Schwarzenegger, a few innocents always manage to get in the way.
The GCG programs StemLoop and Repeat may provide some regulatory insight with eukaryotic sequences since many eukaryotic terminators also have hairpin structures associated with them and some enhancer sequences contain strong direct repeats.
The sequence YGTGTTYY has been reported as a eukaryotic terminator consensus (McLauchlan et al., 1985 [this is the consensus from the weight matrix used above]) and the poly(A) adenylation signal AAUAAA is well conserved (Proudfoot and Brownlee, 1976); however, realize the inherent problems with consensus searches, as has been previously illustrated. Run the poly(A) search mentioned above; you should already have data regarding the termination site from the FitConsensus runs previously made. Type findpatterns to launch the program in the interactive mode; provide the poly(A) pattern within the program as you are queried. Again, just as with Kozak's pattern, rerun the program increasing the mismatch level until you find at least one poly(A) pattern.
You have now been exposed to many of the pitfalls of signal type searches. In general, the second type of gene-finding technique, "searching by content," is more reliable, at least it seems to be less fraught with false positive problems, however, it can not locate exact positions. Used in concert ,with the former, the two can be quite powerful tools.
Searching by content utilizes the fact that genes necessarily have many constraints placed upon them. This induces certain periodicities and patterns which we can use to help locate coding sequences as opposed to noncoding stretches of DNA. These constraints arise in a number of fashions -- the three base genetic code, the "wobble" hypothesis, an unequal use of synonymous codons, translational factors, the amino acid content of the encoded proteins themselves, and, possibly, because of remnants of an ancient genetic code. All together these factors create distinctly unique coding sequences; non- coding stretches do not exhibit this type of periodic compositional bias. This fact can serve as a gene finding tool in two manners.
5) Using methods based on sequence composition alone.
The first technique relies solely on the base compositional bias of every third position -- nonrandomness. A truly random sequence does not show any type of pattern at all and is not characteristic of any coding sequence. The program can estimate the probability that any stretch of DNA sequence is either coding or noncoding. It will not tell us the strand or the reading frame; however, it does not require any a priori assumptions as it relies exclusively on a statistical evaluation of the sequence itself. To run this program, and many of the subsequent ones, first make sure that your GCG graphics setting has been initialized because it produces graphical output. This only has to be done once per session, but it is essential. Remember the command:
% setplot then pick versa_plot for color VersaTerm-PRO tektronix emulation
Next type testcode (your consensus sequence's filename.ext) to launch the program. Accept the next series of default answers and the program will ask you if you are ready for it to plot the figure; press <return>. The emulator will then jump into graphics mode and the TestCode plot will be drawn. A screen trace from the interleukin example follows:
% testcode humil1.consensus
TESTCODE helps you identify protein coding sequences by plotting a
measure of the non-randomness of the composition at every third base.
The statistic does not require a codon frequency table.
Process set to plot with VERSATERM-TEK4105 attached to term
using the tekd graphic interface.
Begin (* 1 *) ? <rtn>
End (* 9721 *) ? <rtn>
Reverse (* No *) ? <rtn>
What window size in bp (* 200 *) ? <rtn>
The minimum density for a one page plot is: 9721.0 bases/page
A typical density is about 3000.0 bases/page
What density would you like (* 9721.0 *) ? <rtn>
When your VERSATERM-TEK4105 attached to tty is ready, press <rtn>. <rtn>
One limitation of this program is it is not designed to detect coding regions
shorter than 200 base pairs, hence the 200 bp window size. No claim is made
for significance with windows less than the default 200; therefore, smaller
eukaryotic exons may be missed. The resulting plot is shown below:

The plot is divided into three regions. The top and bottom areas predict coding and noncoding regions, respectively, to a confidence level of 95%, while the middle area claims no statistical significance. Diamonds and vertical bars above the graph denote potential stop and start codons respectively. Rerun TestCode with -figure=(your last name).testcode on the command line to generate a file to send to teacher for evaluation.
6) Running codon usage analysis programs.
Even more tables are available from various network file servers accessible through E-mail, annonymous ftp, Gopher, and the Web. GCG also provides a program called CodonFrequency for those who need to build custom codon usage tables. To make your own codon usage table gather all of the known and appropriate coding sequences for your organism and run them through this program. A strategy to achieve this objective, if the need ever arises, follows: To find the sequences in the database you could use the GCG searching program StringSearch. The resulting output file is then edited down to include only those sequences that refer to structural genes, excluding inappropriate organelle genomes, if that is of concern. The title lines of the sequence files aren't quite what you need though -- you need the sequences' references which list CDS regions. Use fetch -reference -out=filename.ext @your StringSearch filename to pull over the references from the database into your own directory. Then use the program CodonFrequency being very careful to only specify the coding regions for each gene from your list.
A sample Frames session with the human cytokine interleukin I beta gene example is illustrated below. However, since we are dealing with eukaryotic genomic sequence only the first exon will actually have a start codon. This program is more appropriate for sequences without exons such as cDNA or prokaryotic data. Therefore, do not perform a Frames analysis on your Selected Molecule FAS consensus sequence.
% frames -rare=genmoredata:human_high.cod
FRAMES shows open reading frames for the six translation frames of a
DNA sequence. FRAMES can superimpose the pattern of rare codon
choices if you provide it with a codon frequency table.
Process set to plot with VERSATERM-TEK4105 attached to TERM:
using the tekd graphic interface.
Plot FRAMES for what sequence ? humil1.consensus
Begin (* 1 *) ?
End (* 9721 *) ?
Mark rare codons at or below what threshold (* 0.0 *) ?
When your VERSATERM-TEK4105 attached to tty is ready, press <rtn>.

The plot shows all open reading frames and marks rare codon choices with a dot above each.
Now that you've seen Frames, run CodonPreference. Type codonpreference with the noframes option, since we know this is eukaryotic genomic DNA so ORF's are not relevant. Next, you will be asked for a sequence name; enter your consensus's name. The program then asks about beginning, ending and reversal; accept the defaults. The next question is very important! The codon usage table that you need to use will not be GenRunData:EcoHigh.Cod; type in the correct response for your sequence instead. Accept the next defaults and press return after the "are you ready?" prompt. The interleukin session is illustrated below:
% codonpreference -noframes
CODONPREFERENCE is a frame-specific gene finder that tries to
recognize protein coding sequences by virtue of the similarity of their
codon usage to a codon frequency table or by the bias of their
composition (usually GC) in the third position of each codon.
Process set to plot with VERSATERM-TEK4105 attached to term
using the tekd graphic interface.
CODONPREFERENCE for what sequence ? humil1.consensus
Begin (* 1 *) ? <rtn>
End (* 9721 *) ? <rtn>
Reverse (* No *) ? <rtn>
What codon frequency file (* genrundata:ecohigh.cod *) ? genmoredata:human_hig
h.cod
What codon preference window size (in codons) (* 25 *) ? <rtn>
The minimum density for a one-page plot is 318.93 bases/cm. <rtn>
What density would you like (* 318.93 *) ? <rtn>
When your VERSATERM-TEK4105 attached to tty is ready, press <Return>. <rtn>

The plot shows two curves, a codon preference curve and a third position bias curve, for each reading frame of the sequence in question. These are color coded on a color monitor. The curves rise above background scatter in areas of strong probability of coding potential. The horizontal lines within each plot are the average values of each attribute. CodonPreference moves its window in increments of three recalculating its statistic at each position to generate a continuous function so that each function defines an individual reading frame. An open reading frame display would accompany each panel with start codons represented as vertical lines rising above each box and stop codons shown as lines falling below the reading frame boxes had we not suppressed it with the noframes option. Rare codon choices are also shown for each frame as hash marks. The final output from this program helps in the interpretation of its significance:
Average codon preference for frame 1 = 0.7821 Average codon preference for frame 2 = 0.7721 Average codon preference for frame 3 = 0.7864 Average codon preference for a random sequence = 0.7224Calculations of average codon preference for each frame and the whole sequence randomized are shown which can be compared to the peaks of the plot. One must realize, however, that not all genes show particularly high codon usage preference. This is especially true of genes which are only weakly expressed. Therefore, you must, as always, interpret results with more than just a few grains of salt. Use as many sources of information as possible!
Rerun CodonPreference with the same parameters as before, however, specify the figure option on the command line to generate an output file with the name (your last name).codonpref.
7) E-mail servers for gene-finding analyses.
An additional source of information that should not be ignored are the powerful Internet E-mail servers available for these type of analyses. VADMS has installed a very nice utility, the MSU program, by Ranier Fuchs of EMBL, to take advantage of these servers in a user-friendly manner. More information on MSU can be obtained by reading the documentation file found in the directory specified by the environment variable $DOC_DIR. Four gene-finding servers available through MSU are GRAIL, GeneId, NetGene, and GenMark. GRAIL (Gene Recognition and Analysis Internet Link) is a neural net system for detecting exons (trained on human data). It looks for base composition that "appear" exon-like. It does not define boundaries. GeneId is an Artificial Intelligence system for analyzing genomic DNA and predicting exon and gene structure. NetGene predicts splice sites likelihood using neural net techniques. GenMark is based on a special type of Markov chain model of coding and noncoding sequences. Many of these services are also available through the web and can be found by searching for their name or at standard launching points.
MSU, as installed on ribozyme, prefers it's input sequences not to be in GCG format. It does readily accept PIR/NBRF, GenBank, EMBL, and FastA formats though. Therefore, first run GCG's program ToPIR in order to convert your sequence's format into PIR compatible form:
% topir humil1.consensus
ToPIR writes GCG sequence(s) into a single file in PIR format.
Begin (* 1 *) ? <rtn>
End (* 9721 *) ? <rtn>
Reverse (* No *) ? <rtn>
What should I call the output file (* humil1.pir *) ? <rtn>
HUMIL1 9721 characters.
Launch
the MSU program by typing msu at the prompt. An abridged screen trace
follows:
% msu **************** MSU (Mail Server Utility) ********************* Version 1.4 (Apr 1994) - R. Fuchs, EMBL Data Library Sequence loaded: None Options: L - Load sequence H - Retrieve HELP file from server R - Register with service O - Other options Q - Quit (exit program) Enter L, H, R, O, or Q to quit: hType h to get HELP files from the gene finding servers. Specify at least the first three exon finding servers, numbers 13, 14, and 15 then type m to return to the main menu. The HELP files will be mailed back to you promptly.
Retrieve HELP file on service: (press RETURN for next page) 1 - EMBL BLITZ server 2 - BIOCCELERATOR 3 - FLASH 4 - NCBI BLAST 5 - EMBL FASTA server 6 - NBRF/PIR FASTA server 7 - EMBL QuickSearch server 8 - CBRG (ETH Zuerich) 9 - BLOCKS server 10 - MotifFinder Options: M - Back to Main Menu Enter 1-22 or M to go back to Main Menu: Retrieve HELP file on service: (press RETURN for next page) <rtn> 11 - ProteinPredict 12 - nnpredict 13 - NetGene 14 - Grail 15 - GeneID (less than 20kb) 16 - GenMark 17 - PYTHIA (Rpts) 18 - PYTHIA (Alu) 19 - GenomeNet BLAST 20 - GenomeNet FASTA Options: M - Back to Main Menu Enter 1-22 or M to go back to Main Menu: 13 Request mailed to netgene@virus.fki.dth.dk at Wed Sep 21 10:17:49 1994 The reply should soon arrive in your mailbox PRESS <RETURN> TO CONTINUE... <rtn> /////////////////////////////////////////////////////////////////////////////////// Enter 1-26 or M to go back to Main Menu: 14 Request mailed to grail@ornl.gov at Wed Sep 21 10:18:19 1994 The reply should soon arrive in your mailbox /////////////////////////////////////////////////////////////////////////////////// Enter 1-26 or M to go back to Main Menu: 15 Request mailed to geneid@darwin.bu.edu at Wed Sep 21 10:19:00 1994 The reply should soon arrive in your mailbox /////////////////////////////////////////////////////////////////////////////////// Enter 1-26 or M to go back to Main Menu: m Type m to return to the main menu. **************** MSU (Mail Server Utility) ********************* Version 1.4 (Apr 1994) - R. Fuchs, EMBL Data Library Sequence loaded: None Options: L - Load sequence H - Retrieve HELP file from server R - Register with service O - Other options Q - Quit (exit program) Enter L, H, R, O, or Q to quit: lEnter l to load your PIR formatted consensus sequence into the utility, then reply with its name. After the sequence is loaded, tell the program to submit it to NetGene, number 8.
Enter file name: humil1.pir **************** MSU (Mail Server Utility) ********************* Version 1.4 (Apr 1994) - R. Fuchs, EMBL Data Library Sequence loaded: humil1.consensus Nucleotide, 9721 residues, 1-AGAAA...TCGAC-9721 Services: (press RETURN for next page) 1 - BIOCCELERATOR 2 - NCBI BLAST 3 - EMBL FASTA server 4 - NBRF/PIR FASTA server 5 - EMBL QuickSearch server 6 - CBRG (ETH Zuerich) 7 - BLOCKS server 8 - NetGene 9 - Grail 10 - GeneID (less than 20kb) Options: L - Load sequence S - Set sequence limits (1 - 9721) H - Retrieve HELP file from server R - Register with service O - Other options Q - Quit (exit program) Enter 1-17, L, S, H, R, O, or Q to quit: <rtn> **************** MSU (Mail Server Utility) ********************* Version 1.4 (Apr 1994) - R. Fuchs, EMBL Data Library Sequence loaded: humil1.consensus Nucleotide, 9721 residues, 1-AGAAA...TCGAC-9721 Services: (press RETURN for next page) 11 - GenMark 12 - PYTHIA (Rpts) 13 - PYTHIA (Alu) 14 - GenomeNet BLAST 15 - GenomeNet FASTA 16 - GenQuest (Q) 17 - ProDom Options: L - Load sequence S - Set sequence limits (1 - 9721) H - Retrieve HELP file from server R - Register with service O - Other options Q - Quit (exit program) Enter 1-17, L, S, H, R, O, or Q to quit: 8 Service NetGene Neural network prediction of splice sites in vertebrates Request mailed to netgene@virus.fki.dth.dk at Thu Sep 22 11:13:04 1994 The reply should soon arrive in your mailbox PRESS <RETURN> TO CONTINUE... <rtn>Next, submit your sequence to GeneID. It is number 10 on the menu:
/////////////////////////////////////////////////////////////////////////////////// Enter 1-18, L, S, H, R, O, or Q to quit: 10 Service GeneID (less than 20kb) Pediction of exons and gene structure. Suppress automatic comparison of best model with protein database (Y/N or ?) [N] : <rtn>Press <return> to accept the default protein comparison.
Request mailed to geneid@darwin.bu.edu at Thu Sep 22 11:13:53 1994 The reply should soon arrive in your mailbox PRESS <RETURN> TO CONTINUE... <rtn>GRAIL is special in that it requires registration; therefore, if you consider this type of service worthwhile to your own research program, you may wish to register. If so, return to the main menu by replying m. You are not required to register with or use the services of GRAIL.
Enter 1-18, L, S, H, R, O, or Q to quit: m **************** MSU (Mail Server Utility) ********************* Version 1.4 (Apr 1994) - R. Fuchs, EMBL Data Library Sequence loaded: humil1.consensus Nucleotide, 9721 residues, 1-AGAAA...TCGAC-9721 Options: L - Load sequence H - Retrieve HELP file from server R - Register with service O - Other options Q - Quit (exit program) Enter L, H, R, O, or Q to quit: rIf you are registering, then answer the prompt with r and then specify 2 to enable registration at GRAIL. My registration response follows:
Register with service: (press RETURN for next page) 1 - FLASH 2 - Grail 3 - GenMark Options: M - Back to Main Menu Enter 1-3 or M to go back to Main Menu: 2 Service Grail Gene finding service. Registration required at grail@ornl.gov Your name (or ?): Steve Thompson Your affiliation (or ?): Washington State University VADMS Center Your postal address (or ?): WSU VADMS Pullman WA 99164-4660 Your email address (or ?): thompson@ribozyme.vadms.wsu.edu Your phone number (or ?): 509-335-3179 Request mailed to grail@ornl.gov at Thu Sep 22 11:57:45 1994 The reply should soon arrive in your mailbox PRESS <RETURN> TO CONTINUE... <rtn> Register with service: (press RETURN for next page) 1 - FLASH 2 - Grail 3 - GenMark Options: M - Back to Main Menu Enter 1-3 or M to go back to Main Menu: mFinally enter q to quit the program.
Enter L, H, R, O, or Q to quit: qIf you decided to register with GRAIL, you may want to reenter MSU after you have gotten your user ID number back from GRAIL and then submit your sequence there following the prompts.
Abridged results of the MSU exon finding analyses for the human interleukin I gene follow below in the order that they were submitted. Refer to your new HELP files for interpretations. Based on your readings of the HELP files, some analyses may be more appropriate than others for your particular sequence, for instance, netgene is optimized for vertebrate DNA and thus may infer incorrect splice sites for other types of sequences.
From: SMTP%"server@cbs.dth.dk"
Subj: humil1.consensus: NetGene splice site prediction
------------------------------------------------------------------------
NetGene
Neural Network Prediction of Splice Sites
Reference:
Brunak, S., Engelbrecht, J., and Knudsen, S. (1991). Prediction of
Human mRNA donor and acceptor sites from the DNA sequence. Journal of
Molecular Biology 220:49-65.
------------------------------------------------------------------------
Report ERRORS to Jacob Engelbrecht engel@virus.fki.dth.dk.
Potential splice sites are assigned by combining output from a local and
a global network. The prediction is made with two cutoffs: 1) Highly
confident sites (no or few false positives, on average 50% of the true
sites detected); 2) Nearly all true sites (more false positives - on
average of all positions 0.1% false positive donor sites and 0.4% false
positive acceptor sites, at 95% detection of true sites). The network
performance on sequences from distantly related organisms has not been
quantified. Due to the non-local nature of the algorithm sites closer
than 225 nucleotides to the ends of the sequence cannot be assigned.
Column explanations, field identifiers:
POSITION in your sequence (either first or last base in intron).
Joint CONFIDENCE level for the site (relative to the cutoff).
EXON INTRON gives 20 bases of sequence around the predicted site.
LOCAL is the site confidence from the local network.
GLOBAL is the site confidence from the global network.
------------------------------------------------------------------------
The sequence: humil1.consensus contains 8521 bases, and has the following compos
ition:
A 2293 C 2068 G 1834 N 2 T 2324
1) HIGHLY CONFIDENT SITES:
==========================
ACCEPTOR SITES:
POSITION CONFIDENCE INTRON EXON LOCAL GLOBAL
DONOR SITES:
POSITION CONFIDENCE EXON INTRON LOCAL GLOBAL
2) NEARLY ALL TRUE SITES:
=========================
ACCEPTOR SITES:
POSITION CONFIDENCE INTRON EXON LOCAL GLOBAL
7766 0.11 TTCGCTGCAG^AGTGTAGATC 0.21 0.63
4944 0.05 CTCTCCGCAG^TGCTCCTTCC 0.11 0.69
2243 0.01 ATTTACTCAG^AGATCATTTC 0.46 0.11
DONOR SITES:
POSITION CONFIDENCE EXON INTRON LOCAL GLOBAL
From: geneid@bir.cedb.uwf.edu (GeneID server)
Subject: humil1.consensus
>------------------------------ GENE-ID OUTPUT -------------------------------<
_______________________ Announcing Promoter Prediction: _______________
1. GeneID now scans for vertebrate PolII promoters as well and outputs
predicted promoter regions along with their score and likelihood.
For more information, send an email with the keyword "help" to
geneid@bir.cedb.uwf.edu or check the gopher server at:
name=Bioinformatics Resource Gopher
host=dna.cedb.uwf.edu
type=+
____________________________________________________________________________
GENEID SERVER
Bioinformatics Resource, The University owf West Florida, U.S.A.
Developed by Steen Knudsen, Roderic Guigo, Kathleen Klose and Temple Smith
at MBCRR, Harvard University, BMERC, Boston University, and BIR, UWF.
References:
Guigo,R., Knudsen,S., Drake,N., and Smith,T. (1992)
Prediction of Gene Structure. Journal of Molecular Biology, 245:45-56.
This server uses the NCBI BLAST Network Service:
Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers and
David J. Lipman (1990). Basic local alignment search tool. J. Mol.
Biol. 215:403-410.
-------------------------------------------------------------------------------
Processed keywords and options:
Genomic Sequence...
Column explanations:
Position: Position in your sequence counted from base 1.
Start: Position in your sequence counted from base 1.
Stop: Position in your sequence counted from base 1.
Score: The score calculated for the prediction.
Higher score means higher confidence.
Ranking scores: Calculated scores used for ranking gene models.
See paper for definition.
Frame (genomic sequences only):
First exons: The reading frame defined by the startcodon.
Internal exons: A registration of which of the three frames
are open (denoted by 1) in the particular exon.
The last of the four numbers denotes the
codon remainder of the exon.
Last exons: The reading frame defined by the startcodon.
Equivalent exons: Exons that are mutually interchangeable.
See paper for details.
-------------------------------------------------------------------------------
REPORT for genomic DNA sequence HUMIL1.CONSENSUS, 9721 bases.
Thu Sep 22 13:08:57 CDT 1994
0. Potential PolII PROMOTERS
Promoter prediction:
Region Score Likelihood
400 - 900 0.668 Marginal prediction
900 - 1300 0.647 Marginal prediction
1400 - 1800 1.095 Highly likely prediction
1800 - 2100 1.313 Highly likely prediction
2100 - 2300 0.577 Marginal prediction
2500 - 2800 0.561 Marginal prediction
////////////////////////////////////////////////////////////////////////////////
1. potential START CODONS
Position Score Sequence
222 0.75 AGGAATGG
229 0.78 GATTATGG
////////////////////////////////////////////////////////////////////////////////
2431 0.87 CGTCATGG
2481 0.88 AGCCATGG
2511 0.70 TGAAATGA
////////////////////////////////////////////////////////////////////////////////
9611 0.76 AGGCATGC
9619 0.73 TGGCATGC
2. potential STOP CODONS
Position Codon
66 TGA
106 TAA
////////////////////////////////////////////////////////////////////////////////
8305 TAA
8309 TGA
8337 TAA
8368 TGA
8387 TAG
////////////////////////////////////////////////////////////////////////////////
9693 TGA
9706 TGA
3. potential DONOR SITES
Position Score Sequence
73 0.70 GAAGGTGGCAG
103 0.70 AGTGGTTAATT
////////////////////////////////////////////////////////////////////////////////
2505 0.62 GCCAGTGAAAT
2527 0.91 ACAGGTCAGTG
2549 0.64 ACCAGTAACAT
////////////////////////////////////////////////////////////////////////////////
2962 0.75 CCAGGTGTTTC
3143 0.95 GAAGGTAAGAC
3234 0.71 CATGGTATTTT
////////////////////////////////////////////////////////////////////////////////
5219 0.72 CCGCGTCAGTT
5326 0.79 GAAGGTAGTTA
5329 0.66 GGTAGTTAGCC
////////////////////////////////////////////////////////////////////////////////
5984 0.62 TCTGGTCCATA
6038 0.89 CAAGGTAAATG
6109 0.63 CAAAGTAGAAA
////////////////////////////////////////////////////////////////////////////////
7310 0.73 GAAAGTAATGA
7405 0.98 GGAGGTAAGTG
7409 0.72 GTAAGTGAATG
////////////////////////////////////////////////////////////////////////////////
9555 0.74 CATAGTGAGAC
9691 0.71 TGCAGTGAGCT
4. potential ACCEPTOR SITES
Position Score Sequence
162 0.72 CCCCTAGGCCACCCAGA
185 0.83 TTTTAGCTATCTGCAGG
////////////////////////////////////////////////////////////////////////////////
3031 0.75 CAAGTCTTGTACACAGG
3092 0.85 CCTGTTTTTATTACAGT
3121 0.74 GACTTGTTCTTTGAAGC
////////////////////////////////////////////////////////////////////////////////
5086 0.76 GCATGGTTCCTGGCAGG
5125 0.80 CTCGCTCTCTCCGCAGT
5137 0.80 GCAGTGCTCCTTCCAGG
////////////////////////////////////////////////////////////////////////////////
5762 0.75 CTGGTGTTGTCATCAGA
5874 0.76 GTGCTCCACATTTCAGA
6017 0.81 AGCTCTCCACCTCCAGG
////////////////////////////////////////////////////////////////////////////////
7180 0.77 CTGTGGGTTCCCACAGA
7275 0.86 CCCCTTCTTTCTTCAGT
7302 0.73 ATGTCCTTTGTACAAGG
////////////////////////////////////////////////////////////////////////////////
8077 0.77 AACATCTTTCCCATAGG
8127 0.87 TTCTCTTTCGCTGCAGA
8187 0.71 ATTTGTCTTCAACAAGA
////////////////////////////////////////////////////////////////////////////////
9677 0.71 GATTGCTTGAGCCCAGA
9716 0.80 ATCCGTTGACCTGCAGG
-------------------------------------------------------------------------------
5a. potential FIRST EXONS (from startcodon to donor site)
Score Frame Start Stop Blastx matches Blast score
-----Highly confident predictions, more than 50% true exons: ---
0.85 2 2481 2527
0.77 2 2514 2527
-----Medium confident predictions, more than 10% true exons: ---
0.71 2 7380 7405
0.70 0 7286 7405
////////////////////////////////////////////////////////////////////////////////
-----Marginal predictions, less than 1% true exons: ------------
0.43 0 862 864
0.43 2 2431 2438
////////////////////////////////////////////////////////////////////////////////
6a. potential INTERNAL EXONS
Score Frame Start Stop Blastx matches Blast score
-----Highly confident predictions, more than 50% true exons: ---
-----Medium confident predictions, more than 10% true exons: ---
0.80 1 0 0 2466 2527
0.74 0 1 0 7275 7405
0.67 0 0 1 3092 3143
0.57 1 0 0 5762 6038
0.57 1 1 1 6017 6038
0.56 0 1 0 7302 7405
0.55 1 0 0 5125 5326
0.53 1 0 0 2364 2527
0.52 1 0 0 5137 5326
0.51 1 0 0 8127 8236
0.51 1 0 0 5735 6038
0.50 0 0 1 3804 3902
0.47 1 1 1 5556 5583
0.45 0 1 0 5874 6038
0.44 0 0 1 578 734
////////////////////////////////////////////////////////////////////////////////
-----Marginal predictions, less than 1% true exons: ------------
0.25 0 1 1 1370 1445
0.25 1 0 0 9504 9555
////////////////////////////////////////////////////////////////////////////////
7a. potential LAST EXONS (from acceptor site to stopcodon)
Score Frame Start Stop Blastx matches Blast score
-----Highly confident predictions, more than 30% true exons: ---
-----Medium confident predictions, more than 10% true exons: ---
0.57 2 2466 2473
0.55 1 8127 8133
0.55 2 7007 7020
0.55 1 7007 7013
////////////////////////////////////////////////////////////////////////////////
-----Marginal predictions, less than 1% true exons: ------------
0.40 0 7275 7376
0.39 2 7275 7495
////////////////////////////////////////////////////////////////////////////////
-------------------------------------------------------------------------------
5b. potential FIRST EXON CLASSES
(Only exons upstream of 3'-most predicted last exon)
Score Frame Start Stop Equivalent exons, if any
-----Highly confident predictions, more than 50% true exons: --
0.85 2 2481 2527 2514 2527 2388 2527 2397 2527
2431 2438 2379 2527 2511 2527 2416 2465
0.71 2 7380 7405
0.70 0 7286 7405
-----Medium confident predictions, more than 10% true exons: --
0.61 0 3138 3143
0.59 1 6026 6038 5978 6038
0.53 1 5236 5326 5254 5326
0.52 0 764 766 862 864
0.50 0 9341 9352
////////////////////////////////////////////////////////////////////////////////
-----Marginal predictions, less than 1% true exons: -----------
0.34 1 4483 4495
////////////////////////////////////////////////////////////////////////////////
6b. potential INTERNAL EXON CLASSES
first exon, and upstream of 3'-most predicted last exon)
Score Frame Start Stop Equivalent exons, if any
-----Highly confident predictions, more than 50% true exons: --
0.80 1 0 0 1 2466 2527 2364 2527 2322 2527
0.74 0 1 0 1 7275 7405 7302 7405
0.67 0 0 1 0 3092 3143
-----Medium confident predictions, more than 10% true exons: --
0.80 1 0 0 1 2466 2527 2364 2527 2322 2527
0.67 0 0 1 0 3092 3143
0.57 1 0 0 0 5762 6038 5762 6116
0.55 1 0 0 0 5125 5326 5137 5326 5086 5326
0.51 1 0 0 1 8127 8236
0.51 1 0 0 0 5735 6038 5735 6116
////////////////////////////////////////////////////////////////////////////////
-----Marginal predictions, less than 1% true exons: -----------
7b. potential LAST EXON CLASSES
(Only exons downstream of 5'-most predicted first exon)
Score Frame Start Stop Equivalent exons, if any
-----Confident predictions, more than 30% true exons: ---------
-----Medium confident predictions, more than 10% true exons: --
0.57 2 2466 2473 2364 2419
0.55 1 7007 7013
0.55 1 8127 8133 8226 8307
0.55 2 7007 7020 7150 7376 7180 7376
0.54 0 411 413 185 256 162 275 249 275
0.54 1 2466 2514
0.53 0 8127 8339 8226 8339
0.52 1 3957 3963
////////////////////////////////////////////////////////////////////////////////
-----Marginal predictions, less than 1% true exons: -----------
-------------------------------------------------------------------------------
8. potential GENES
12978 gene models were analyzed and ranked according to likelihood
The 20 most likely genes are
Ranking scores List of exons (*) constituting gene (first, internal(s), last)
0.797 1.450 2481 2527 3092 3143 5125 5219 5556 5583
5762 6038 7275 7405 8127 8339
0.797 1.444 2481 2527 3092 3143 5125 5219 5556 5583
6017 6038 7275 7405 8127 8339
0.797 1.406 2481 2527 3092 3143 5125 5219 5556 5583
5762 6038 7275 7405 8127 8236 8342 8480
0.797 1.400 2481 2527 3092 3143 5125 5219 5556 5583
6017 6038 7275 7405 8127 8236 8342 8480
0.797 1.387 2481 2527 3092 3143 5125 5219 5556 5583
5735 6038 7275 7405 8127 8339
0.797 1.385 2481 2527 3092 3143 5125 5326 5874 6038
7275 7405 8127 8339
0.797 1.364 2481 2527 3092 3143 3957 4177 5556 5583
5762 6038 7275 7405 8127 8339
////////////////////////////////////////////////////////////////////////////////
(*) Exons listed are the representatives of equivalence classes, if any,
and are the best bet from each class.
If two gene models are non-overlapping they may constitute two separate genes.
From: grail@gwen.EPM.ORNL.GOV (Gene Recognition and Analysis Internet Link)
>humil1.consensus, len = 9721
pos f-strand frame orf || pos r-strand frame orf
||
.....................................................................
////////////////////////////////////////////////////////////////////////////////
.....................................................................
2461 0.066 - - - - || 7261 0.128 - - - -
2471 0.092 - - - - || 7251 0.010 - - - -
.....................................................................
.....................................................................
2491 0.051 - - - - || 7231 0.000 - - - -
2501 0.264 - - - - || 7221 0.000 - - - -
2511 0.363 - - - - || 7211 0.000 - - - -
2521 0.124 - - - - || 7201 0.000 - - - -
.....................................................................
////////////////////////////////////////////////////////////////////////////////
.....................................................................
5121 0.074 - - - - || 4601 0.235 - - - -
5131 0.033 - - - - || 4591 0.093 - - - -
5141 0.094 - - - - || 4581 0.162 - - - -
5151 0.436 - - - - || 4571 0.692 2 4547 - 4655
5161 0.938 1 5044 - 5332 || 4561 0.938 2 4547 - 4655
5171 0.938 1 5044 - 5332 || 4551 0.938 2 4547 - 4655
5181 0.938 1 5044 - 5332 || 4541 0.938 2 4499 - 4547
5191 0.938 1 5044 - 5332 || 4531 0.938 2 4499 - 4547
5201 0.938 1 5044 - 5332 || 4521 0.938 2 4499 - 4547
5211 0.938 1 5044 - 5332 || 4511 0.937 2 4499 - 4547
5221 0.938 1 5044 - 5332 || 4501 0.896 2 4499 - 4547
////////////////////////////////////////////////////////////////////////////////
.....................................................................
5891 0.071 - - - - || 3831 0.002 - - - -
5901 0.153 - - - - || 3821 0.005 - - - -
5911 0.193 - - - - || 3811 0.017 - - - -
////////////////////////////////////////////////////////////////////////////////
.....................................................................
7311 0.072 - - - - || 2411 0.000 - - - -
7321 0.922 2 7247 - 7493 || 2401 0.000 - - - -
7331 0.854 2 7247 - 7493 || 2391 0.000 - - - -
7341 0.937 2 7247 - 7493 || 2381 0.000 - - - -
7351 0.936 2 7247 - 7493 || 2371 0.000 - - - -
7361 0.936 2 7247 - 7493 || 2361 0.000 - - - -
////////////////////////////////////////////////////////////////////////////////
.....................................................................
8161 0.872 3 7965 - 8337 || 1561 0.000 - - - -
8171 0.935 3 7965 - 8337 || 1551 0.000 - - - -
8181 0.938 3 7965 - 8337 || 1541 0.000 - - - -
8191 0.938 3 7965 - 8337 || 1531 0.000 - - - -
8201 0.938 3 7965 - 8337 || 1521 0.035 - - - -
8211 0.938 3 7965 - 8337 || 1511 0.001 - - - -
8221 0.938 3 7965 - 8337 || 1501 0.001 - - - -
////////////////////////////////////////////////////////////////////////////////
Potential exons are listed in the following
pos strand strand_prob frame quality orf
5151 - 5291 f 0.71 1 excellent 5044 - 5332
2751 - 2821 r 0.63 3 good 2691 - 2871
7321 - 7391 f 1.00 2 excellent 7247 - 7493
1951 - 1971 r 0.69 1 good 1921 - 2086
8161 - 8331 f 0.99 3 excellent 7965 - 8337
8)
Interpretations and data analysis.Next, draw PostScript plots of all of your Figure files from this exercise. Initialize GCG graphics to PostScript by either typing the command postscript, and then specifying epsf for Encapsulated PostScript File as the graphics device, and then specifying a temporary filename as the output port, or by using the PostScript choice on the SetPlot menu. Until you reset your graphics configuration back to Tektronix, all graphics output will now be written to your account in PostScript format under the specified temporary filename. A sample screen trace of the PostScript command follows:
% postscript
Use Postscript graphics with what device:
LaserWriter
Lzr1200
LN03-ScriptPrinter
LPS20
ColorScript-100
EPSF (single page encapsulated postscript format)
Please choose one ( * LASERWRITER * ) epsf
To what port is your EPSF connected (* term: *) temp.ps
Plotting Configuration set to:
Language: psd
Device: EPSF
Port or Queue: temp.ps
Now use the Figure program to draw each plot to a PostScript file. Launch the
program by typing figure with the option -plot=(new actual
output filename) in order to redirect the output from temp.ps to yourplot.ps.
A screen trace illustrates:
% figure -plot=testcode.ps Figure makes figures and posters by drawing graphics and text together. You can include output from other GCG graphics programs as part of a figure. FIGURE from what file ? humil1.testcode PostScript instructions for a EPSF are now being sent to testcode.ps.Finally, use lpr to print your new PostScript files as well as any pertinent output text files from the signal searches and E-mail server analyses that you performed earlier. The point -- you want to compare all of these methods of analysis and try to make some educated guesses as to where your genes and exons actually start and stop.
Use a big table or the floor, if necessary, to analyze and ponder over your data. It probably won't be as easy as you would have hoped for. Fortunately, often in a lab situation cDNA data is also available on a given sequence, although with the increased emphasis on genomic sequencing this is becoming less and less true. Another advantage in a real analysis is you would be using any similarities discovered through database searching to assist your interpretation. This is often very helpful, especially if you are dealing with a system that has much available database information. You will be learning more about these techniques in next week's exercise. Make yourself notes on where you feel the actual starts and stops of the coding regions are in your sequence. Annotate the .map file you made in the beginning of the exercise with the precise sequence number of each position. Be sure to include exon and intron borders. Indicate on your .map printout where the various signals and content biases you found are located. This information will be required in the exercise report form at the conclusion and it will be utilized in the next step.
9) Translate the sequence.
The next logical step in sequence analysis is to translate your tentative URF's into a protein. First we will translate the putative coding regions which the previous analyses have indicated and then we will use the reference information of the actual database file(s) to translate the "real thing." Finally we will compare the two results with the Gap program to see how well our "educated guesses" have been. Realize that the validity of your interpretations will relate directly to your understanding of the molecular biology of the system you are dealing with.
Type translate and you will be asked from what sequence; answer with your consensus sequence's filename.ext. Now be careful -- you do not want to translate the entire sequence! Give the proper coordinates for the first, 5', coding region, as you determined in the above synthesis of your analyses, and the program will verify their locations. Next, you will be asked if you want to add another exon, or more genes, or just write out the translation. You need to make the correct decisions based on the sequence you are working on. Since your gene contains introns, you would, naturally, want to add more exons until finished at which point you would write it out and accept the default output filename. Refer to the screen trace following the next section for an example of Translate with the Genbank entry HumIL1Bx.
Remember the genomic nucleotide sequence that you found in the database back in Exercise 3 and then used as a verification of your GelAssemble work in Exercise 6, well, we need to use that entry again. In most cases this is the entry for your Selected Molecule in the organism that has had both its genomic sequence and three dimensional structure solved. Go back and check your notes or move over into previous subdirectories to verify its identity; note the sequence's name and the database it comes from. We are not using cDNA sequences here. The cDNA or mRNA entries do not have any introns so their header information cannot corroborate or refute your coding region analyses, so don't bother with them. As in earlier exercises, it may be a good idea to check with your lab instructor to be sure that you are using the proper sequence. Return to the Exercise 7 subdirectory and use typedata -reference to allow you to read about the exon-intron structure of the sequence. An example command line is shown below with Genbank:HumIL1Bx:
% typedata -reference genbank:humil1bx
Some of the Selected Molecule genomic sequences may not be found as just one entry in GenBank. If you are dealing with a genomic gene not found as one continuous sequence within the database, then you will have use TypeData on each one separately and use the "add another exon from a new sequence" option of Translate in the next step. Now Translate the actual sequence based on the information in its reference header. Of course, you must write down the relevant information first. You want to look for the terms CDS and exon within the FEATURES table; write down the numbers that follow each as well as the brief description of the segment. The relevant reference information from the interleukin example follows:
FEATURES Location/Qualifiers
CDS join(2481. .2527,3092. .3143,5125. .5326,5874. .6038,
7275. .7405,8127. .8339)
/note="prointerleukin-1-beta"
repeat_region 656. .668
/note="Alu repeat copy A 5' flank"
repeat_region 669. .960
/note="Alu repeat copy A"
repeat_region 961. .971
/note="Alu repeat copy A 3' flank"
prim_transcript 1934. .8941
/note="IL mRNA and introns"
intron 2006. .2465
/note="IL intron A"
exon 2481. .2527
/note="prointerleukin-1-beta, exon 2 ( first expressed
exon) /map='2q13-q21' /hgml_locus_uid='LG0004J'"
/gene="IL1B"
intron 2528. .3091
/note="IL intron B"
exon 3092. .3143
/note="prointerleukin-1-beta, exon 3"
intron 3144. .5124
/note="IL intron C"
repeat_region 4642. .4658
/note="Alu repeat copy B 5' flank"
repeat_region 4659. .4988
/note="Alu repeat copy B"
repeat_region 4989. .5007
/note="Alu repeat copy B 3' flank"
exon 5125. .5326
/note="prointerleukin-1-beta, exon 4"
intron 5327. .5873
/note="IL intron D"
exon 5874. .6038
/note="prointerleukin-1-beta, exon 5"
intron 6039. .7274
/note="IL intron E"
exon 7275. .7405
/note="prointerleukin-1-beta, exon 6"
intron 7406. .8126
/note="IL intron F"
exon 8127. .8339
/note="prointerleukin-1-beta, exon 7"
repeat_region 9251. .9721
/note="Alu repeat copy C"
Notice that the reference section is chock full of good information. It
contains important literature references, descriptive comments and oftentimes
valuable structural and regulatory insights. Using the CDS information for
your entry, Translate it. An abridged screen trace from the interleukin
translation session follows:
% translate
TRANSLATE translates nucleotide sequences into peptide sequences.
TRANSLATE from what sequence ? gb:humil1bx
Begin (* 1 *) ? 2481
End (* 9721 *) ? 2527
Reverse (* No *) ? <rtn>
Range begins ATGGC and ends TACAG. Is this correct (* Yes *) ? <rtn>
That is done, now would you like to:
A) Add another exon from this sequence
B) Add another exon from a new sequence
C) Translate and then add more genes from this sequence
D) Translate and then add more genes from a new sequence
W) Translate assembly and write everything into a file
Please choose one (* W *): a
Begin (* 1 *) ? 3092
End (* 9721 *) ? 3143
Reverse (* No *) ? <rtn>
Range begins TGGCA and ends TGAAG. Is this correct (* Yes *) ? <rtn>
////////////////////////////////////////////////////////////////////////
This procedure is repeated as many times as you have separate exons; finally
the sequence is written out:
That is done, now would you like to: A) Add another exon from this sequence B) Add another exon from a new sequence C) Translate and then add more genes from this sequence D) Translate and then add more genes from a new sequence W) Translate assembly and write everything into a file Please choose one (* W *): <rtn> What should I call the output file (* Humil1bx.Pep *) ? <rtn>The last program to be used today should be familiar by now. Use Gap to compare your hypothetical protein from the previous coding analysis to the actual protein translated just now. You all should know how to run this comparison, if you have any questions, refer to the GCG program manual or the online genhelp system. Name the Gap output file after your last name and give it the extension .peppair. I hope that I can trust everyone not to cheat by using database feature table entries in your inital translation. The purpose of this exercise is to learn, not to get a perfect translation. I do not expect anyone to get perfect translations and will be suspicious if somebody does.
10) Conclusions and evaluations.
In order to get credit for this lab, at the conclusion of the exercise remote copy (rcp) to the teacher account three output files. Also complete the exercise report form with the pico editor and then remote copy it to teacher. This is all done exactly the same way as in previous exercises. Be sure that the files you send the teacher account are all named with your last name as the filename but retain either the standard GCG file extensions or week7s as the extension for the report form. These files are:
Since they all have the same filename, (your name), you can remote copy them to teacher all at the same time using a wild card:
% rcp (your name).* teacher@ribozyme:receive
Logout of ribozyme and then exit the emulator. As before, leave the computer turned on.
You have been exposed to a perplexing variety of techniques within this laboratory exercise for the identification of protein coding regions in genomic DNA. As in all molecular and biological computer analyses, the more you understand the chemical, physical and biological systems involved, the better your chance of success in analyzing them. Certain strategies are inherently more appropriate to others in certain circumstances. Making these types of subjective, discriminatory decisions and utilizing all of the available options so that you can generate the most practical data for evaluation are two of the most important "take-home" messages that this course can offer!
Several general references are available in this field -- most provide extensive weight matrices for consensus pattern searches. Naturally each would have to be tailored into the format correct for whichever matrix searching program you might be using. They also all describe many of the factors involved and the constraints used in content type algorithms. The suggested textbook for the course, Sequence Analysis Primer, by Gribskov and Devereux, is a very good starting point.
Next week we will investigate what to do with your protein sequence now that you've got it. Enter the world of database searching -- a world of batch processes, parallel processing, and huge numbers, and many questions such as, "Is this significant; does it mean anything?" and "This is homologous?" A world often fraught with frustrating results but also exciting discoveries!
The current EPD online (Release 48, September 1996) has 1285 total promoter entries (846 independent entries) and is available on ribozyme through the GCG database logical EPD. The references and sequences have been built into a GCG style database. The user's manual file is EPD.doc, available in the directory specified by the environment variable $DOC_DIR.
% fasta humil1.consensus epd:*
EPD17079 (+) Hs IL-1b'; range -499 to 100.
SCORES Init1: 2400 Initn: 2400 Opt: 2400
100.0% identity in 600 bp overlap
Bucher, P. (1990). Weight Matrix Descriptions of Four Eukaryotic RNA Polymerase II Promoter Elements Derived from 502 Unrelated Promoter Sequences. Journal of Molecular Biology 212, 563-578.
Bucher, P. (1995). The Eukaryotic Promoter Database EPD. EMBL Nucleotide Sequence Data Library Release 42, Postfach 10.2209, D-6900 Heidelberg.
Ghosh, D. (1990). A Relational Database of Transcription Factors. Nucleic Acids Research 18, 1749-1756.
Hawley, D.K. and McClure, W.R. (1983). Compilation and Analysis of Escherichia coli promoter sequences. Nucleic Acids Research 11, 2237-2255.
Kozak, M. (1984). Compilation and Analysis of Sequences Upstream from the Translational Start Site in Eukaryotic mRNAs. Nucleic Acids Research 12, 857-872.
McLauchen, J., Gaffrey, D., Whitton, J. and Clements, J. (1985). The Consensus Sequences YGTGTTYY Located Downstream from the AATAAA Signal is Required for Efficient Formation of mRNA 3' Termini. Nucleic Acid Research 13 , 1347-1368.
Proudfoot, N.J. and Brownlee, G.G. (1976). 3' Noncoding Region in Eukaryotic Messenger RNA. Nature 263, 211-214.
Stormo, G.D., Schneider, T.D. and Gold, L.M. (1982). Characterization of Translational Initiation Sites in E. coli. Nucleic Acids Research 10, 2971-2996.
von Heijne, G. (1987). SIGPEP: A Sequence Database for Secretory Signal Peptides. Protein Sequences & Data Analysis 1, 41-42.