'97 BC/BP 578

Week 7

Sequence Series

Gene Finding Strategies: After the Sequencing's Done, What Next? How are coding sequences recognized in genomic DNA? Searching by signal versus searching by content -- transcriptional/translational regulatory sites and exon/intron splice sites; "nonrandomness" and codon usage -- understanding the concepts and differentiating between the approaches.

Author:

Steven M. Thompson

Introduction

Current genome projects are generating billions of base pairs of data at an explosive rate. GenBank has doubled in size almost every 14 months since 1982 and now contains over a million sequencs. Given all this unknown DNA, how are encoded genes determined and positioned? Translating from all translational start codons to all "nonsense" chain terminating, stop codons in every frame provides a list of ORF's (Open Reading Frames), but which of them, if any, actually code for proteins? And if you are dealing with eukaryotic genomic DNA, then exons and introns considerably complicate the matter. Two general approaches to the gene finding problem can be imagined:

We can utilize both of these facts to help locate the position of genes in DNA. These methods are often known as "searching by signal" and "searching by content" respectively. Although neither method is absolutely reliable, one seldom has the luxury of knowing the complete amino acid sequence to the protein of interest and simply translating DNA until the correct pieces fall out. This is the only method that would be 100% positive. Since we are usually forced to discover just where these pieces are, especially with genomic DNA, computerized analysis becomes invaluable.

Signal searches look for transcriptional and translational features. Transcriptional regulatory sites such as promoters and other transcription factor and enhancer binding sequences can help identify the beginnings of genes, however, some of these motifs can be quite distant from the actual start of transcription. The prokaryote Shine-Dalgarno consensus and other eukaryote ribosome binding sites obviously relate to translation initiation, as does the methionine start codon. However, matters can be complicated by alternative start codons; AUG works in about 90% of cases, but there are exceptions in some prokaryotes and organellar genomes. Transcriptional terminator and attenuator sequences can help identify gene ends as do the termination (stop) codons, but, again, exceptions can be found, especially in some ciliated protists and due to eukaryote suppresser tRNA's. Furthermore, splice site donor-acceptor consensus sequences can point to intron-exon borders. All of these types of signals can help us recognize coding sequences. A major problem is simple consensus pattern type searching is often either overly or insufficiently stringent because of the variable and loosely defined nature of these types of sites. Weight matrix approaches are more powerful, but not nearly as simple to set up in most commercial sequence analysis packages.

There are two general strategies for finding coding regions of DNA based on the content of the DNA itself -- one is based on the local "nonrandomness" of a stretch and the other is based on the known codon usage of a particular life form. The first, the nonrandomness test, does not tell us anything about the particular strand or reading frame; however, it does not require a previously built codon usage table. The codon usage approach is based on the fact that different types of organisms use different frequencies of codons to code for particular amino acids. This approach requires a codon usage table built up from known translations; however, it also tells us the strand and reading frame as opposed to the former. Both of these content approaches rely on implicit biological constraints imposed on the genetic code, constraints which we can utilize to help discriminate structural genes from all the rest of the, some would incorrectly say, "junk" DNA found in most genomes. The content approach is often more accurate; it does not generate nearly as many false positives as signal type searches, but its answers aren't concise either.

All available methods should be used together to help reinforce and/or reject the others' findings. Furthermore, the similarity searching methods to be taught in Exercise 8 should be used for inference through homology. The chore of identifying coding sequences is far from trivial and is a long way from being solved in an unambiguous manner; however, it is extremely important anytime anyone starts sequencing genomic DNA and doesn't have the luxury of an available cDNA library.

Several powerful E-mail servers have been established which can also help with these types of analyses. Most combine many of the methods discussed above. In particular they have been designed to use neural net and artificial intelligence approaches to help in the "decision" process. A careful application and interpretation of the many resources at one's disposal can go a long way to increasing our understanding of gene structure and function. But, as always, carry a healthy dose of skepticism to, and be extremely wary, at any session with the terminal, as the naive can easily be misled into accepting inappropriate or downright wrong results.

Exercise Series #7: A "Real-Life" Project Oriented Approach.

The objective of this laboratory exercise is to gain familiarity with and to utilize many of the computerized methods available for trying to identify coding regions within genomic DNA. You will be using your selected molecule throughout the exercise and hopefully you will be able to apply the experience to your own research programs. This exercise can be particularly frustrating; hang in there. Its purpose is to show just how difficult of a problem this is. I don't expect you to get the "right" answer, just close answers and lots of insight.

Preliminary Preparations:

Log on to your account and initialize gcg. Create a new subdirectory for this exercise, move to it, and copy your consensus output file from last week's Fragment Assembly session into the new directory:

% cp ../week6/lastname.consensus .

Also copy the necessary files and the exercise report form from ribozyme to the new directory. Complete this procedure just the same as you've done in previous exercises by using the environment variable name $GRAD_DIR:

% cp $GRAD_DIR/week7s/* .

Prelude: Exercise 6 Housecleaning!

If you have finished Exercise 6, you need to return to that directory to delete the Fragment Assembly System's databases in order to free up some space on ribozyme. Issue the command gelstart -delete and answer the project query with your fragment assembly project's name. Confirm that you really do want to delete the entire project. The deletion process can take a few moments as the database is quite extensive. Also delete the Practice session's database by repeating the above procedure with its name. Move back over into your new Exercise 7 directory to proceed with the rest of the exercise.

1) Map your sequence consensus.

URFs and ORFs: whats the difference? Using MAP to locate potentially translated reading frames.

The first order of business is to utilize GCG's mapping program to translate all reading frames within your sequence. Again, I'll illustrate with the prion example, but you should use your own molecule of interest from the Selected Molecule list. Type map (your molecule's consensus filename.ext [from Exercise 6]) and a whole list of questions is presented. At this point we are not really interested in where the restriction sites are, therefore, we need to tell the program "no enzymes:" To do this press either the # sign or a <space> then <return> to select no enzymes. Realize that this program's primary purpose is to generate restriction maps and it is very powerful in this function. Next, we need to tell the program to select all six reading frames; select "s." This will generate a list of all Unidentified Reading Frames (URF's) as opposed to Open Reading Frames (ORF's) which by definition start with a start codon and end with a stop codon. Since many exons will not begin with a start codon (only the first will necessarily begin with one), all six frames is the more appropriate choice for genomic eukaryotic sequences. Accept the default output and display the file to the terminal after it's ready. The prion example session follows:

% map prion.consensus

Map maps a DNA sequence and displays both strands of the mapped sequence
with restriction enzyme cut points above the sequence and protein
translations below.

                  Begin (* 1 *) ?  <rtn>
                End (*  2415 *) ?  <rtn>

 Select the enzymes:  Type nothing or "*" to get all enzymes. Type "?"
 for help on what enzymes are available and how to select them.

                                       Enzyme(* * *):  <space><rtn>

 What protein translations do you want:

      a) frame 1   b) frame 2   c) frame 3
      d) frame 4   e) frame 5   f) frame 6

      t)hree forward frames   s)ix frames   o)pen frames only

      n)o protein translation   q)uit

 Please select (capitalize for 3-letter) (* t *):  s

 What should I call the output file (* prion.map *) ?  <rtn>

 Mapping .

 Writing ..........

 MAP complete with:

   Sequence Length:   2,415
    Enzymes Chosen:       0
    Cutsites found:       0
          CPU time:   00.32

    Output file(s): prion.map

Next, a small portion of the prion output file is displayed from the more command:

% more prion.map

 (Linear) MAP of: prion.consensus  check: 4378  from: 1  to: 2415

                             November 21, 1996 10:55  ..
 
         CGGCGCCGCGAGCTTCTCCTCTCCTCACGACCGAGGCAGAGCAGTCATTATGGCGAACCT
       1 ---------+---------+---------+---------+---------+---------+ 60
         GCCGCGGCGCTCGAAGAGGAGAGGAGTGCTGGCTCCGTCTCGTCAGTAATACCGCTTGGA
 
a        R  R  R  E  L  L  L  S  S  R  P  R  Q  S  S  H  Y  G  E  P   -
b         G  A  A  S  F  S  S  P  H  D  R  G  R  A  V  I  M  A  N  L  -
c          A  P  R  A  S  P  L  L  T  T  E  A  E  Q  S  L  W  R  T  L -
       1 ---------+---------+---------+---------+---------+---------+ 60
d           A  G  R  A  E  G  R  R  V  V  S  A  S  C  D  N  H  R  V   -
e            R  R  S  S  R  R  E  E  R  G  L  C  L  L  *  *  P  S  G  -
f          P  A  A  L  K  E  E  G  *  S  R  P  L  A  T  M  I  A  F  R -
 
         TGGCTGCTGGATGCTGGTTCTCTTTGTGGCCACATGGAGTGACCTGGGCCTCTGCAAGAA
      61 ---------+---------+---------+---------+---------+---------+ 120
         ACCGACGACCTACGACCAAGAGAAACACCGGTGTACCTCACTGGACCCGGAGACGTTCTT
 
a        W  L  L  D  A  G  S  L  C  G  H  M  E  *  P  G  P  L  Q  E   -
b         G  C  W  M  L  V  L  F  V  A  T  W  S  D  L  G  L  C  K  K  -
c          A  A  G  C  W  F  S  L  W  P  H  G  V  T  W  A  S  A  R  S -
      61 ---------+---------+---------+---------+---------+---------+ 120
d        K  A  A  P  H  Q  N  E  K  H  G  C  P  T  V  Q  A  E  A  L   -
e         Q  S  S  S  A  P  E  R  Q  P  W  M  S  H  G  P  G  R  C  S  -
f          P  Q  Q  I  S  T  R  K  T  A  V  H  L  S  R  P  R  Q  L  F -

///////////////////////////////////////////////////////////////////////////
Both strands of the DNA sequence are shown together with all of the URF's below each block; asterisks indicate stop codons. Frames a through c are in the forward direction and d through f are in the reverse. Use the lpr command to make yourself a printout of your URF map; you'll need to refer to it later on. At this point we can see that there are many potentially translated stretches; so what? What can be done with them; how can we turn them into potential genes?

Signal Methods: promoters, terminators, repeat regions . . .

Using FINDPATTERNS and CONSENSUS/FITCONSENSUS; also consider STEMLOOP & REPEAT.

Typical signals to look for are promoter and terminator consensus stretches. GCG provides a searching program named Terminator for looking for the latter in prokaryotic rho-independent cases; however, promoter signals from both prokaryotes and eukaryotes are so varied that they do not have a `canned' search for them. An impressive eukaryotic transcription factor consensus sequence database has been assembled though, and prokaryotic promoter sequences are fairly well characterized. We can utilize the GCG program FindPatterns to look for these type of sites within our sequence. GCG also provides the ability to find short consensus patterns based on a family of related sequences using weight matrix analysis with the programs Consensus and FitConsensus. These can be used to form and search for specific promoters or other signals based on known sequences. Also, remember that many termination sites are accompanied by inverted repeats, and enhancer sequences are often strong direct repeats; because of these points, the GCG programs StemLoop and Repeat, as well as dotplot procedures, may be helpful.

I am forced to change my example molecule for this exercise as the previous prion example was a cDNA so it is not appropriate for illustrating gene splicing. Therefore, the human cytokine interleukin I beta gene, IL1B, will be used to illustrate the procedures throughout the remainder of this exercise. The much simpler prokaryote case will only be mentioned in these exercises. As always, use your own selected molecule, not the example in the following steps!

2) Look for potential regulatory sites by screening with FindPatterns.

One-dimensional signal hunting: simple consensus and pattern matching.

You've already used FindPatterns to match up your probe to the database in Exercise 5; now we're going to use it to try to locate promoter regions in our sequences.

I have written and placed a prokaryote promoter consensus pattern based on the E. coli data of Hawley and McClure (1983) in the GCG logical directory location GenMoreData. You are welcome to screen your eukaryotic sequence with this pattern, although it is not required and probably will not be relevant since the pattern encompasses both the -35 and -10 regions (if it was just the -10 portion, some eukaryotic TATA boxes might be found with it). If desired, run FindPatterns on your consensus sequence specifying data= genmoredata:promoter.dat. If you run this search, use the append option to attach the data file to the end of your output so you can see what it looks like. Specify your consensus sequence's filename.extension in response to the "what sequence" query. The first part of an example screen trace follows:

% findpatterns -data=genmoredata:promoter.dat -mismatch=1 -append

FINDPATTERNS identifies sequences with short pattern queries like GAATTC
or YRYRYRYR.  You can define the patterns ambiguously and allow mismatches.
You can provide the patterns in a file or simply type them in from the
terminal.
Notice that this example was run with a mismatch level of one. Remember that FindPatterns allows the user to set any mismatch level desired. The prokaryote promoter data pattern follows; it may be of help to you in other research work.

The standard E. coli RNA polymerase promoter "Pribnow" box file for the 
program FINDPATTERNS. This pattern includes both the -35 & the -10 region.  
For an incredibly extensive list of eukaryotic transcription factor 
recognition sites see the GCG public datafile TFSites.Dat.  To specify one of 
these files use the command line option -data=_datafile_name.

Name    Offset    Pattern           Overhang  Documentation  ..

Promoter     1    TTGACwx{15,21}TAtAaT     0  !Hawley & McClure (1983)
Now for the tougher case -- your eukaryote example. A huge public pattern data file is available. This file (Ghosh, 1990) is called tfsites.dat and is available in the GCG logical directory location GenMoreData. Run FindPatterns specifying genmoredata:tfsites.dat as an alternate data file. Do not use a mismatch level above zero, as way too many hits would result; also do not use the append option, as this data file is really huge and too much space in your account would be consumed by it. (You are welcome to take a look at the Transcription Factor Sites database by using the GCG command name to resolve the disk location of genmoredata.) Do use the onestrand option to restrict your search to the forward direction only since these sequences all transcibe in the forward direction only and to help restrict the size of the output file. Specify your consensus sequence's filename.ext in response to the program's query. The interleukin example FindPatterns session follows:
% findpatterns -data=genmoredata:tfsites.dat -one

FindPatterns identifies sequences that contain short patterns like
GAATTC or YRYRYRYR.  You can define the patterns ambiguously and allow
mismatches. You can provide the patterns in a file or simply type them
in from the terminal.

 FINDPATTERNS in what sequence(s) ?  humil1.consensus

 What should I call the output file (* humil1.find *) ?  <rtn>

    Humil1.Consensus len:      9,721 ...........................................
...............................

 	FINDPATTERNS in what sequence(s) ?  <rtn>

     Total finds:        705
    Total length:      9,721
 Total sequences:          1
        CPU time:      35.02

    Output file: /disk2/usr/local/people/thompson/BC578/EX7/il1/humil1.find
Notice the output is still really huge; 705 finds doesn't do us much good. Scim through your output and see if anything stands out as significant. The tfsites.dat file is available in printed form here in the lab and you can look through it to note any interesting references concerning finds within your sequence. Similarly, you can use the system level search utility grep to search tfsites.dat (after you've resolved its disk location with GCG's name program) for your entries of interest to get the references.

Other signals that could be looked for in a similar fashion are the Shine-Dalgarno prokaryote translational initiation site, (AGG,GAG,GGA)x{6,9}ATG (Stormo et al., 1982) and other types of eukaryotic ribosome binding sequences. These type of searches are based on complementarity to the 16s rRNA in prokaryotes; however, in eukaryotes ribosomes seem to initiate translation at the first AUG encountered following the modified guanosine 5' cap. Kozak (1984) has compiled a consensus at the start codon of cc(A,g)ccAUGg which doesn't seem to relate to 18s rRNA complementarity at all, yet seems to hold true in many cases. Find this pattern in your sequence by typing findpatterns; provide Kozak's pattern within the program when queried. Keep increasing the mismatch level until you find the pattern.

A main point of these consensus searches is "Don't believe everything your computer tells you!" A computer can provide guidance and insight but the limitations can sometimes be overwhelming as is all too evident in these promoter analyses. Oftentimes weight matrix analysis is more appropriate to this type of search. Unfortunately, they are not nearly as simple to set up.

3) Using the weight matrix program FitConsensus to find regulatory sites.

Two-dimensional signal hunting: weight matrices.

GCG has preassembled consensus weight matrices of the donor and acceptor site sequences at exon-intron splice junctions for use with FitConsensus available in their public data files. However, they do not provide any others; therefore, I have reformatted the four weight matrix descriptions of eukaryotic RNA polymerase II promoter elements reported by Bucher (1990) into a form appropriate for GCG's programs. Additionally, McLauchlan et al. (1985) assembled a eukaryotic terminator weight matrix which I have reformatted for GCG use.

Use all seven of these weight matrices in FitConsensus to help locate the transcription start, each subsequent exon's splice site, and termination signals. The GCG donor and acceptor consensus weight matrix files as well as the others mentioned above have all been placed in the GCG logical location GenMoreData. Start the program by typing fitconsensus (your consensus sequence's filename.ext); specify the genmoredata:donor.csn file matrix; generate lists of only the 20 best fits and give the output file your sequence's name but the extension .donor. The screen trace follows:

% fitconsensus humil1.consensus

FITCONSENSUS uses a consensus table written by CONSENSUS as a
probe to find the best examples of the consensus in a DNA sequence. You
can specify the number of fits you want to see, and FITCONSENSUS
tabulates them with their position, frame, and a statistical measure of
their quality.

                  Begin (* 1 *) ?  <rtn>
                End (*  9721 *) ?  <rtn>

 Using what consensus table file ?  genmoredata:donor.csn

 Your consensus table has position(s) with 100% certainty,

 Are these necessary conditions for a fit (* Yes *) ?  <rtn>

 Show how many fits (* 40 *) ?  20

 What should I call the output file (* humil1.fit *) ?  humil1.donor
Run the same type of analysis with the acceptor.csn data file supplying the appropriate responses to the program's queries and each of the other five matrices. These have the file names tata.csn, cap.csn, ccaat.csn, gc.csn, and terminator.csn. They are all located in GenMoreData. After running each of these searches, display the results of each to see what they look like. The interleukin donor output follows:
% more humil1.donor

 FITCONSENSUS of: humil1.consensus  Check: 5335  from: 1  to: 9721

LOCUS       HUMIL1BX     9721 bp ds-DNA             PRI       15-DEC-1989
DEFINITION  Human interleukin 1 beta gene.
ACCESSION   X04500
KEYWORDS    interleukin.
SOURCE      Human (Homo sapiens) fetal liver DNA, clones
            BDC[3,240,426,454,457]. . . .

 Using Consensus: genmoredata:donor.csn

 CONSENSUS from:
Splice site sequences
from Stephen Mount NAR 10(2) 459;472 figure 1 page 460

 List-size: 20  Average quality: 37.03      September 26, 1991  08:22   ..

  position:   731  1375  1442  1511  1561  2236  2524  2649  2857  3140  3899
     frame:     2     1     2     2     1     1     1     3     1     2     2
   quality: 52.42 49.83 47.17 47.75 50.67 56.17 55.50 49.50 49.33 56.08 52.75


  position:  4389  5323  5580  6035  6472  7402  7534  8991  9332
     frame:     3     1     3     2     1     1     1     3     2
   quality: 48.42 49.83 50.25 53.83 47.42 57.00 49.92 47.25 49.67
Notice the output includes the position, frame, and "quality" of each match. The position and quality are tremendously helpful. Position is where in your sequence the specified weight matrix begins, perhaps not where the site that you are most concerned is occurs, rather only where the matrix begins. This is particularly important in the donor and acceptor matrices as these both begin in front of the splice site, not at it. Quality is the percentage fit to the matrix. The frame designation is troubling -- it is quite misleading as it identifies the frame of the best fit to the matrix not to the coding region -- so disregard it. Also note that none of the hits would have been found by a normal FindPatterns type consensus search without mismatches because the scores are all less than 100%. Later on in the exercise we will see how well these predicted sites correlate with the actual splice sites.

4) Help in locating the ends of genes: Terminator in prokaryotes; StemLoop, Repeat, and finding poly(A) signals in eukaryotes.

The GCG program Terminator will find about 95% of all prokaryotic factor-independent terminators. This is great odds for any computer algorithm; even its namesake Arnold Schwarzenegger would have a hard time matching this! This program is mentioned for those of you involved with prokaryotic data in your own research programs. However, realize that a main disadvantage of most signal searches, even a sophisticated two-dimensional approach like Terminator, is they find too many false positive sites, in other words they are not discriminatory enough. Just like Schwarzenegger, a few innocents always manage to get in the way.

The GCG programs StemLoop and Repeat may provide some regulatory insight with eukaryotic sequences since many eukaryotic terminators also have hairpin structures associated with them and some enhancer sequences contain strong direct repeats.

The sequence YGTGTTYY has been reported as a eukaryotic terminator consensus (McLauchlan et al., 1985 [this is the consensus from the weight matrix used above]) and the poly(A) adenylation signal AAUAAA is well conserved (Proudfoot and Brownlee, 1976); however, realize the inherent problems with consensus searches, as has been previously illustrated. Run the poly(A) search mentioned above; you should already have data regarding the termination site from the FitConsensus runs previously made. Type findpatterns to launch the program in the interactive mode; provide the poly(A) pattern within the program as you are queried. Again, just as with Kozak's pattern, rerun the program increasing the mismatch level until you find at least one poly(A) pattern.

Content Methods: what the sequence looks like

You have now been exposed to many of the pitfalls of signal type searches. In general, the second type of gene-finding technique, "searching by content," is more reliable, at least it seems to be less fraught with false positive problems, however, it can not locate exact positions. Used in concert ,with the former, the two can be quite powerful tools.

Searching by content utilizes the fact that genes necessarily have many constraints placed upon them. This induces certain periodicities and patterns which we can use to help locate coding sequences as opposed to noncoding stretches of DNA. These constraints arise in a number of fashions -- the three base genetic code, the "wobble" hypothesis, an unequal use of synonymous codons, translational factors, the amino acid content of the encoded proteins themselves, and, possibly, because of remnants of an ancient genetic code. All together these factors create distinctly unique coding sequences; non- coding stretches do not exhibit this type of periodic compositional bias. This fact can serve as a gene finding tool in two manners.

5) Using methods based on sequence composition alone.

Nonrandomness techniques: TESTCODE, a gene finding algorithm based exclusively on statistics

The first technique relies solely on the base compositional bias of every third position -- nonrandomness. A truly random sequence does not show any type of pattern at all and is not characteristic of any coding sequence. The program can estimate the probability that any stretch of DNA sequence is either coding or noncoding. It will not tell us the strand or the reading frame; however, it does not require any a priori assumptions as it relies exclusively on a statistical evaluation of the sequence itself. To run this program, and many of the subsequent ones, first make sure that your GCG graphics setting has been initialized because it produces graphical output. This only has to be done once per session, but it is essential. Remember the command:

% setplot then pick versa_plot for color VersaTerm-PRO tektronix emulation

Next type testcode (your consensus sequence's filename.ext) to launch the program. Accept the next series of default answers and the program will ask you if you are ready for it to plot the figure; press <return>. The emulator will then jump into graphics mode and the TestCode plot will be drawn. A screen trace from the interleukin example follows:

% testcode humil1.consensus

TESTCODE helps you identify protein coding sequences by plotting a
measure of the non-randomness of the composition at every third base.
The statistic does not require a codon frequency table.

  Process set to plot with VERSATERM-TEK4105 attached to term
  using the tekd graphic interface.

                  Begin (* 1 *) ?  <rtn>
                End (*  9721 *) ?  <rtn>
               Reverse (* No *) ?  <rtn>

 What window size in bp (* 200 *) ?  <rtn>

 The minimum density for a one page plot is: 9721.0 bases/page
 A typical density is about 3000.0 bases/page
 What density would you like (* 9721.0 *) ?  <rtn>

 When your VERSATERM-TEK4105 attached to tty is ready, press <rtn>.  <rtn>
One limitation of this program is it is not designed to detect coding regions shorter than 200 base pairs, hence the 200 bp window size. No claim is made for significance with windows less than the default 200; therefore, smaller eukaryotic exons may be missed. The resulting plot is shown below:

The plot is divided into three regions. The top and bottom areas predict coding and noncoding regions, respectively, to a confidence level of 95%, while the middle area claims no statistical significance. Diamonds and vertical bars above the graph denote potential stop and start codons respectively. Rerun TestCode with -figure=(your last name).testcode on the command line to generate a file to send to teacher for evaluation.

6) Running codon usage analysis programs.

Codon Usage: codon frequency tables and using CODONFREQUENCY.

The second content type of gene finding strategy utilizes the fact that different organisms have different codon usage preferences. In other words, genomes use synonymous codons unequally in a phylogenetic fashion. Codon usage frequency is not the genetic translation code -- the genetic code is nearly universal across all phylogenetic lines. However, not all lineages use the same percentage of the various degenerate codons the same amount. The manner in which different types of organisms utilize the available codons is usually tabulated into what is known as a codon usage or frequency table. You have already been exposed to this concept in Exercise 5 when you used one to backtranslate a peptide probe sequence into a oligonucleotide. In order to utilize the codon usage type of gene finding strategy a codon usage table for the particular organism in question must be accessible. Just as in Exercise 5, the GCG default table for these programs is from E. coli. If your protein comes from anything else, this table is not appropriate. Remember that GCG provides other tables in the public data library GenMoreData; refer to exercise 5 for help on how to access these tables.

Even more tables are available from various network file servers accessible through E-mail, annonymous ftp, Gopher, and the Web. GCG also provides a program called CodonFrequency for those who need to build custom codon usage tables. To make your own codon usage table gather all of the known and appropriate coding sequences for your organism and run them through this program. A strategy to achieve this objective, if the need ever arises, follows: To find the sequences in the database you could use the GCG searching program StringSearch. The resulting output file is then edited down to include only those sequences that refer to structural genes, excluding inappropriate organelle genomes, if that is of concern. The title lines of the sequence files aren't quite what you need though -- you need the sequences' references which list CDS regions. Use fetch -reference -out=filename.ext @your StringSearch filename to pull over the references from the database into your own directory. Then use the program CodonFrequency being very careful to only specify the coding regions for each gene from your list.

Programs that can utilize codon usage tables to help find genes: FRAMES & CODONPREFERENCE.

The two GCG content analysis programs that can use codon usage tables in this context -- Frames, a very simple open reading frame identifier which can utilize codon frequency tables to show rare codon usages, and the quite sophisticated codon frequency analyzer CodonPreference -- need to know which codon usage table you want to use. You must determine and specify the codon usage table appropriate for your situation. As stated above, the GCG logical directory location for these alternate tables is GenMoreData. If you do not know how to specify the correct table, use GCG's "super-option" -check. As is the case in all GCG programs, this will give you a chance to use any option while in the program.

A sample Frames session with the human cytokine interleukin I beta gene example is illustrated below. However, since we are dealing with eukaryotic genomic sequence only the first exon will actually have a start codon. This program is more appropriate for sequences without exons such as cDNA or prokaryotic data. Therefore, do not perform a Frames analysis on your Selected Molecule FAS consensus sequence.

% frames -rare=genmoredata:human_high.cod

FRAMES shows open reading frames for the six translation frames of a
DNA sequence. FRAMES can superimpose the pattern of rare codon
choices if you provide it with a codon frequency table.

  Process set to plot with VERSATERM-TEK4105 attached to TERM:
  using the tekd graphic interface.

 Plot FRAMES for what sequence ?  humil1.consensus

                  Begin (* 1 *) ?
                End (*  9721 *) ?

 Mark rare codons at or below what threshold  (* 0.0 *) ?

 When your VERSATERM-TEK4105 attached to tty is ready, press <rtn>.

The plot shows all open reading frames and marks rare codon choices with a dot above each.

Now that you've seen Frames, run CodonPreference. Type codonpreference with the noframes option, since we know this is eukaryotic genomic DNA so ORF's are not relevant. Next, you will be asked for a sequence name; enter your consensus's name. The program then asks about beginning, ending and reversal; accept the defaults. The next question is very important! The codon usage table that you need to use will not be GenRunData:EcoHigh.Cod; type in the correct response for your sequence instead. Accept the next defaults and press return after the "are you ready?" prompt. The interleukin session is illustrated below:

% codonpreference -noframes

CODONPREFERENCE is a frame-specific gene finder that tries to
recognize protein coding sequences by virtue of the similarity of their
codon usage to a codon frequency table or by the bias of their
composition (usually GC) in the third position of each codon.

    Process set to plot with VERSATERM-TEK4105 attached to term
  using the tekd graphic interface.

 CODONPREFERENCE for what sequence ?  humil1.consensus

                  Begin (* 1 *) ?  <rtn>
                End (*  9721 *) ?  <rtn>
               Reverse (* No *) ?  <rtn>

 What codon frequency file (* genrundata:ecohigh.cod *) ?  genmoredata:human_hig
h.cod

 What codon preference window size (in codons) (* 25 *) ?  <rtn>

 The minimum density for a one-page plot is 318.93 bases/cm.  <rtn>

 What density would you like (* 318.93 *) ?  <rtn>

 When your VERSATERM-TEK4105 attached to tty is ready, press <Return>.  <rtn>

The plot shows two curves, a codon preference curve and a third position bias curve, for each reading frame of the sequence in question. These are color coded on a color monitor. The curves rise above background scatter in areas of strong probability of coding potential. The horizontal lines within each plot are the average values of each attribute. CodonPreference moves its window in increments of three recalculating its statistic at each position to generate a continuous function so that each function defines an individual reading frame. An open reading frame display would accompany each panel with start codons represented as vertical lines rising above each box and stop codons shown as lines falling below the reading frame boxes had we not suppressed it with the noframes option. Rare codon choices are also shown for each frame as hash marks. The final output from this program helps in the interpretation of its significance:

 Average codon preference for frame 1 = 0.7821
 Average codon preference for frame 2 = 0.7721
 Average codon preference for frame 3 = 0.7864
 Average codon preference for a random sequence = 0.7224
Calculations of average codon preference for each frame and the whole sequence randomized are shown which can be compared to the peaks of the plot. One must realize, however, that not all genes show particularly high codon usage preference. This is especially true of genes which are only weakly expressed. Therefore, you must, as always, interpret results with more than just a few grains of salt. Use as many sources of information as possible!

Rerun CodonPreference with the same parameters as before, however, specify the figure option on the command line to generate an output file with the name (your last name).codonpref.

7) E-mail servers for gene-finding analyses.

An additional source of information that should not be ignored are the powerful Internet E-mail servers available for these type of analyses. VADMS has installed a very nice utility, the MSU program, by Ranier Fuchs of EMBL, to take advantage of these servers in a user-friendly manner. More information on MSU can be obtained by reading the documentation file found in the directory specified by the environment variable $DOC_DIR. Four gene-finding servers available through MSU are GRAIL, GeneId, NetGene, and GenMark. GRAIL (Gene Recognition and Analysis Internet Link) is a neural net system for detecting exons (trained on human data). It looks for base composition that "appear" exon-like. It does not define boundaries. GeneId is an Artificial Intelligence system for analyzing genomic DNA and predicting exon and gene structure. NetGene predicts splice sites likelihood using neural net techniques. GenMark is based on a special type of Markov chain model of coding and noncoding sequences. Many of these services are also available through the web and can be found by searching for their name or at standard launching points.

MSU, as installed on ribozyme, prefers it's input sequences not to be in GCG format. It does readily accept PIR/NBRF, GenBank, EMBL, and FastA formats though. Therefore, first run GCG's program ToPIR in order to convert your sequence's format into PIR compatible form:

% topir humil1.consensus

ToPIR writes GCG sequence(s) into a single file in PIR format.

                  Begin (* 1 *) ?  <rtn>
                End (*  9721 *) ?  <rtn>
               Reverse (* No *) ?  <rtn>

 What should I call the output file (* humil1.pir *) ?  <rtn>

 HUMIL1 9721 characters.
Launch the MSU program by typing msu at the prompt. An abridged screen trace follows:
% msu

**************** MSU (Mail Server Utility) *********************
Version 1.4 (Apr 1994) - R. Fuchs, EMBL Data Library

Sequence loaded: None

Options:
  L - Load sequence
  H - Retrieve HELP file from server   R - Register with service
  O - Other options                    Q - Quit (exit program)

Enter L, H, R, O, or Q to quit: h
Type h to get HELP files from the gene finding servers. Specify at least the first three exon finding servers, numbers 13, 14, and 15 then type m to return to the main menu. The HELP files will be mailed back to you promptly.
Retrieve HELP file on service: (press RETURN for next page)
  1 - EMBL BLITZ server
  2 - BIOCCELERATOR
  3 - FLASH
  4 - NCBI BLAST
  5 - EMBL FASTA server
  6 - NBRF/PIR FASTA server
  7 - EMBL QuickSearch server
  8 - CBRG (ETH Zuerich)
  9 - BLOCKS server
 10 - MotifFinder

Options:
  M - Back to Main Menu

Enter 1-22 or M to go back to Main Menu:

Retrieve HELP file on service: (press RETURN for next page) <rtn>
 11 - ProteinPredict
 12 - nnpredict
 13 - NetGene
 14 - Grail
 15 - GeneID (less than 20kb)
 16 - GenMark
 17 - PYTHIA (Rpts)
 18 - PYTHIA (Alu)
 19 - GenomeNet BLAST
 20 - GenomeNet FASTA

Options:
  M - Back to Main Menu

Enter 1-22 or M to go back to Main Menu: 13

Request mailed to netgene@virus.fki.dth.dk at Wed Sep 21 10:17:49 1994
The reply should soon arrive in your mailbox

PRESS <RETURN> TO CONTINUE... <rtn>

///////////////////////////////////////////////////////////////////////////////////

Enter 1-26 or M to go back to Main Menu: 14

Request mailed to grail@ornl.gov at Wed Sep 21 10:18:19 1994
The reply should soon arrive in your mailbox

///////////////////////////////////////////////////////////////////////////////////

Enter 1-26 or M to go back to Main Menu: 15

Request mailed to geneid@darwin.bu.edu at Wed Sep 21 10:19:00 1994
The reply should soon arrive in your mailbox

///////////////////////////////////////////////////////////////////////////////////

Enter 1-26 or M to go back to Main Menu: m

Type m to return to the main menu.

**************** MSU (Mail Server Utility) *********************
Version 1.4 (Apr 1994) - R. Fuchs, EMBL Data Library

Sequence loaded: None


Options:
  L - Load sequence
  H - Retrieve HELP file from server   R - Register with service
  O - Other options                    Q - Quit (exit program)

Enter L, H, R, O, or Q to quit: l

Enter l to load your PIR formatted consensus sequence into the utility, then reply with its name. After the sequence is loaded, tell the program to submit it to NetGene, number 8.
Enter file name: humil1.pir

**************** MSU (Mail Server Utility) *********************
Version 1.4 (Apr 1994) - R. Fuchs, EMBL Data Library

Sequence loaded: humil1.consensus
Nucleotide, 9721 residues, 1-AGAAA...TCGAC-9721

Services: (press RETURN for next page)
  1 - BIOCCELERATOR
  2 - NCBI BLAST
  3 - EMBL FASTA server
  4 - NBRF/PIR FASTA server
  5 - EMBL QuickSearch server
  6 - CBRG (ETH Zuerich)
  7 - BLOCKS server
  8 - NetGene
  9 - Grail
 10 - GeneID (less than 20kb)

Options:
  L - Load sequence                    S - Set sequence limits (1 - 9721)
  H - Retrieve HELP file from server   R - Register with service
  O - Other options                    Q - Quit (exit program)

Enter 1-17, L, S, H, R, O, or Q to quit:  <rtn>

**************** MSU (Mail Server Utility) *********************
Version 1.4 (Apr 1994) - R. Fuchs, EMBL Data Library

Sequence loaded: humil1.consensus
Nucleotide, 9721 residues, 1-AGAAA...TCGAC-9721

Services: (press RETURN for next page)
 11 - GenMark
 12 - PYTHIA (Rpts)
 13 - PYTHIA (Alu)
 14 - GenomeNet BLAST
 15 - GenomeNet FASTA
 16 - GenQuest (Q)
 17 - ProDom

Options:
  L - Load sequence                    S - Set sequence limits (1 - 9721)
  H - Retrieve HELP file from server   R - Register with service
  O - Other options                    Q - Quit (exit program)

Enter 1-17, L, S, H, R, O, or Q to quit: 8

Service NetGene
Neural network prediction of splice sites in vertebrates

Request mailed to netgene@virus.fki.dth.dk at Thu Sep 22 11:13:04 1994
The reply should soon arrive in your mailbox

PRESS <RETURN> TO CONTINUE... <rtn>
Next, submit your sequence to GeneID. It is number 10 on the menu:
///////////////////////////////////////////////////////////////////////////////////

Enter 1-18, L, S, H, R, O, or Q to quit: 10

Service GeneID (less than 20kb)
Pediction of exons and gene structure.

Suppress automatic comparison of best model with protein database (Y/N or ?) [N]
:  <rtn>

Press <return> to accept the default protein comparison.
Request mailed to geneid@darwin.bu.edu at Thu Sep 22 11:13:53 1994
The reply should soon arrive in your mailbox

PRESS <RETURN> TO CONTINUE...  <rtn>
GRAIL is special in that it requires registration; therefore, if you consider this type of service worthwhile to your own research program, you may wish to register. If so, return to the main menu by replying m. You are not required to register with or use the services of GRAIL.
Enter 1-18, L, S, H, R, O, or Q to quit: m

**************** MSU (Mail Server Utility) *********************
Version 1.4 (Apr 1994) - R. Fuchs, EMBL Data Library

Sequence loaded: humil1.consensus
Nucleotide, 9721 residues, 1-AGAAA...TCGAC-9721

Options:
  L - Load sequence
  H - Retrieve HELP file from server   R - Register with service
  O - Other options                    Q - Quit (exit program)

Enter L, H, R, O, or Q to quit: r
If you are registering, then answer the prompt with r and then specify 2 to enable registration at GRAIL. My registration response follows:
Register with service: (press RETURN for next page)
  1 - FLASH
  2 - Grail
  3 - GenMark

Options:
  M - Back to Main Menu

Enter 1-3 or M to go back to Main Menu: 2

Service Grail
Gene finding service. Registration required at grail@ornl.gov

Your name (or ?): Steve Thompson
Your affiliation (or ?): Washington State University VADMS Center
Your postal address (or ?): WSU VADMS Pullman WA 99164-4660
Your email address (or ?): thompson@ribozyme.vadms.wsu.edu
Your phone number (or ?): 509-335-3179
Request mailed to grail@ornl.gov at Thu Sep 22 11:57:45 1994
The reply should soon arrive in your mailbox

PRESS <RETURN> TO CONTINUE... <rtn>
Register with service: (press RETURN for next page)
  1 - FLASH
  2 - Grail
  3 - GenMark

Options:
  M - Back to Main Menu

Enter 1-3 or M to go back to Main Menu: m
Finally enter q to quit the program.
Enter L, H, R, O, or Q to quit: q
If you decided to register with GRAIL, you may want to reenter MSU after you have gotten your user ID number back from GRAIL and then submit your sequence there following the prompts.

Abridged results of the MSU exon finding analyses for the human interleukin I gene follow below in the order that they were submitted. Refer to your new HELP files for interpretations. Based on your readings of the HELP files, some analyses may be more appropriate than others for your particular sequence, for instance, netgene is optimized for vertebrate DNA and thus may infer incorrect splice sites for other types of sequences.

From:   SMTP%"server@cbs.dth.dk"
Subj:   humil1.consensus: NetGene splice site prediction

------------------------------------------------------------------------
                               NetGene
               Neural Network Prediction of Splice Sites

Reference:
Brunak, S.,  Engelbrecht,  J., and  Knudsen, S.  (1991).  Prediction  of
Human mRNA donor and acceptor  sites from the DNA  sequence.  Journal of
Molecular Biology 220:49-65.

------------------------------------------------------------------------

Report ERRORS to Jacob Engelbrecht engel@virus.fki.dth.dk.

Potential splice sites are assigned by combining output from a local and
a global  network.  The  prediction is made with two cutoffs:  1) Highly
confident  sites (no or few false  positives, on average 50% of the true
sites  detected);  2) Nearly all true sites (more  false  positives - on
average of all positions 0.1% false positive  donor sites and 0.4% false
positive  acceptor  sites, at 95% detection of true sites).  The network
performance on sequences from distantly  related  organisms has not been
quantified.  Due to the  non-local  nature of the algorithm sites closer
than 225 nucleotides to the ends of the sequence cannot be assigned.

Column explanations, field identifiers:

POSITION in your sequence (either first or last base in intron).

Joint CONFIDENCE level for the site (relative to the cutoff).
EXON INTRON gives 20 bases of sequence around the predicted site.
LOCAL is the site confidence from the local network.
GLOBAL is the site confidence from the global network.

------------------------------------------------------------------------
The sequence: humil1.consensus contains 8521 bases, and has the following compos
ition:
A 2293 C 2068 G 1834 N 2 T 2324

1) HIGHLY CONFIDENT SITES:
==========================

ACCEPTOR SITES:
POSITION     CONFIDENCE        INTRON EXON         LOCAL   GLOBAL

DONOR SITES:
POSITION     CONFIDENCE          EXON INTRON       LOCAL   GLOBAL

2) NEARLY ALL TRUE SITES:
=========================

ACCEPTOR SITES:
POSITION     CONFIDENCE        INTRON EXON         LOCAL   GLOBAL
    7766           0.11    TTCGCTGCAG^AGTGTAGATC    0.21     0.63
    4944           0.05    CTCTCCGCAG^TGCTCCTTCC    0.11     0.69
    2243           0.01    ATTTACTCAG^AGATCATTTC    0.46     0.11

DONOR SITES:
POSITION     CONFIDENCE          EXON INTRON       LOCAL   GLOBAL
From: geneid@bir.cedb.uwf.edu (GeneID server)
Subject: humil1.consensus

>------------------------------ GENE-ID OUTPUT -------------------------------<

_______________________  Announcing Promoter Prediction:  _______________

1. GeneID now scans for vertebrate PolII promoters as well and outputs
   predicted promoter regions along with their score and likelihood.
   For more information, send an email with the keyword "help" to
   geneid@bir.cedb.uwf.edu or check the gopher server at:
     name=Bioinformatics Resource Gopher
     host=dna.cedb.uwf.edu
     type=+
____________________________________________________________________________

                           GENEID SERVER
    Bioinformatics Resource, The University owf West Florida, U.S.A.

Developed by Steen Knudsen, Roderic Guigo, Kathleen Klose and Temple Smith
   at MBCRR, Harvard University, BMERC, Boston University, and BIR, UWF.

References:
Guigo,R., Knudsen,S., Drake,N.,  and Smith,T. (1992)
Prediction of Gene Structure. Journal of Molecular Biology, 245:45-56.
This server uses the NCBI BLAST Network Service:
Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers and
David J. Lipman (1990).  Basic local alignment search tool.  J. Mol.
Biol. 215:403-410.

-------------------------------------------------------------------------------
Processed keywords and options:
Genomic Sequence...

Column explanations:
  Position:   Position in your sequence counted from base 1.
  Start:      Position in your sequence counted from base 1.
  Stop:       Position in your sequence counted from base 1.
  Score:      The score calculated for the prediction.
              Higher score means higher confidence.
  Ranking scores:  Calculated scores used for ranking gene models.
                   See paper for definition.
  Frame (genomic sequences only):
           First exons: The reading frame defined by the startcodon.
        Internal exons: A registration of which of the three frames
                        are open (denoted by 1) in the particular exon.
                        The last of the four numbers denotes the
                        codon remainder of the exon.
            Last exons: The reading frame defined by the startcodon.
  Equivalent exons:   Exons that are mutually interchangeable.
                      See paper for details.

-------------------------------------------------------------------------------

REPORT for genomic DNA sequence HUMIL1.CONSENSUS, 9721 bases.
Thu Sep 22 13:08:57 CDT 1994

0. Potential PolII PROMOTERS
Promoter prediction:
  Region      Score  Likelihood
  400 -  900  0.668  Marginal prediction
  900 - 1300  0.647  Marginal prediction
 1400 - 1800  1.095  Highly likely prediction
 1800 - 2100  1.313  Highly likely prediction
 2100 - 2300  0.577  Marginal prediction
 2500 - 2800  0.561  Marginal prediction
////////////////////////////////////////////////////////////////////////////////

1. potential START CODONS

    Position  Score  Sequence
     222      0.75   AGGAATGG
     229      0.78   GATTATGG
////////////////////////////////////////////////////////////////////////////////
    2431      0.87   CGTCATGG
    2481      0.88   AGCCATGG
    2511      0.70   TGAAATGA
////////////////////////////////////////////////////////////////////////////////
    9611      0.76   AGGCATGC
    9619      0.73   TGGCATGC

2. potential STOP CODONS

    Position  Codon
      66      TGA
     106      TAA
////////////////////////////////////////////////////////////////////////////////
    8305      TAA
    8309      TGA
    8337      TAA
    8368      TGA
    8387      TAG
////////////////////////////////////////////////////////////////////////////////
    9693      TGA
    9706      TGA

3. potential DONOR SITES

    Position  Score  Sequence
      73      0.70   GAAGGTGGCAG
     103      0.70   AGTGGTTAATT
////////////////////////////////////////////////////////////////////////////////
    2505      0.62   GCCAGTGAAAT
    2527      0.91   ACAGGTCAGTG
    2549      0.64   ACCAGTAACAT
////////////////////////////////////////////////////////////////////////////////
    2962      0.75   CCAGGTGTTTC
    3143      0.95   GAAGGTAAGAC
    3234      0.71   CATGGTATTTT
////////////////////////////////////////////////////////////////////////////////
    5219      0.72   CCGCGTCAGTT
    5326      0.79   GAAGGTAGTTA
    5329      0.66   GGTAGTTAGCC
////////////////////////////////////////////////////////////////////////////////
    5984      0.62   TCTGGTCCATA
    6038      0.89   CAAGGTAAATG
    6109      0.63   CAAAGTAGAAA
////////////////////////////////////////////////////////////////////////////////
    7310      0.73   GAAAGTAATGA
    7405      0.98   GGAGGTAAGTG
    7409      0.72   GTAAGTGAATG
////////////////////////////////////////////////////////////////////////////////
    9555      0.74   CATAGTGAGAC
    9691      0.71   TGCAGTGAGCT

4. potential ACCEPTOR SITES

    Position  Score  Sequence
     162      0.72   CCCCTAGGCCACCCAGA
     185      0.83   TTTTAGCTATCTGCAGG
////////////////////////////////////////////////////////////////////////////////
    3031      0.75   CAAGTCTTGTACACAGG
    3092      0.85   CCTGTTTTTATTACAGT
    3121      0.74   GACTTGTTCTTTGAAGC
////////////////////////////////////////////////////////////////////////////////
    5086      0.76   GCATGGTTCCTGGCAGG
    5125      0.80   CTCGCTCTCTCCGCAGT
    5137      0.80   GCAGTGCTCCTTCCAGG
////////////////////////////////////////////////////////////////////////////////
    5762      0.75   CTGGTGTTGTCATCAGA
    5874      0.76   GTGCTCCACATTTCAGA
    6017      0.81   AGCTCTCCACCTCCAGG
////////////////////////////////////////////////////////////////////////////////
    7180      0.77   CTGTGGGTTCCCACAGA
    7275      0.86   CCCCTTCTTTCTTCAGT
    7302      0.73   ATGTCCTTTGTACAAGG
////////////////////////////////////////////////////////////////////////////////
    8077      0.77   AACATCTTTCCCATAGG
    8127      0.87   TTCTCTTTCGCTGCAGA
    8187      0.71   ATTTGTCTTCAACAAGA
////////////////////////////////////////////////////////////////////////////////
    9677      0.71   GATTGCTTGAGCCCAGA
    9716      0.80   ATCCGTTGACCTGCAGG

-------------------------------------------------------------------------------

5a. potential FIRST EXONS (from startcodon to donor site)
     Score  Frame  Start  Stop          Blastx matches    Blast score

     -----Highly confident predictions, more than 50% true exons: ---
     0.85   2       2481  2527
     0.77   2       2514  2527

     -----Medium confident predictions, more than 10% true exons: ---
     0.71   2       7380  7405
     0.70   0       7286  7405
////////////////////////////////////////////////////////////////////////////////

     -----Marginal predictions, less than 1% true exons: ------------
     0.43   0        862   864
     0.43   2       2431  2438
////////////////////////////////////////////////////////////////////////////////

6a. potential INTERNAL EXONS
     Score  Frame       Start  Stop      Blastx matches    Blast score

     -----Highly confident predictions, more than 50% true exons: ---

     -----Medium confident predictions, more than 10% true exons: ---
     0.80   1 0 0      2466  2527
     0.74   0 1 0      7275  7405
     0.67   0 0 1      3092  3143
     0.57   1 0 0      5762  6038
     0.57   1 1 1      6017  6038
     0.56   0 1 0      7302  7405
     0.55   1 0 0      5125  5326
     0.53   1 0 0      2364  2527
     0.52   1 0 0      5137  5326
     0.51   1 0 0      8127  8236
     0.51   1 0 0      5735  6038
     0.50   0 0 1      3804  3902
     0.47   1 1 1      5556  5583
     0.45   0 1 0      5874  6038
     0.44   0 0 1       578   734
////////////////////////////////////////////////////////////////////////////////

     -----Marginal predictions, less than 1% true exons: ------------
     0.25   0 1 1      1370  1445
     0.25   1 0 0      9504  9555
////////////////////////////////////////////////////////////////////////////////

7a. potential LAST EXONS (from acceptor site to stopcodon)
     Score  Frame  Start  Stop          Blastx matches    Blast score
     -----Highly confident predictions, more than 30% true exons: ---

     -----Medium confident predictions, more than 10% true exons: ---
     0.57   2       2466  2473
     0.55   1       8127  8133
     0.55   2       7007  7020
     0.55   1       7007  7013
////////////////////////////////////////////////////////////////////////////////

     -----Marginal predictions, less than 1% true exons: ------------
     0.40   0       7275  7376
     0.39   2       7275  7495
////////////////////////////////////////////////////////////////////////////////

-------------------------------------------------------------------------------

5b. potential FIRST EXON CLASSES
   (Only exons upstream of 3'-most predicted last exon)

     Score  Frame  Start  Stop     Equivalent exons, if any

     -----Highly confident predictions, more than 50% true exons: --
     0.85   2       2481  2527      2514  2527      2388  2527      2397  2527
                    2431  2438      2379  2527      2511  2527      2416  2465
     0.71   2       7380  7405
     0.70   0       7286  7405

     -----Medium confident predictions, more than 10% true exons: --
     0.61   0       3138  3143
     0.59   1       6026  6038      5978  6038
     0.53   1       5236  5326      5254  5326
     0.52   0        764   766       862   864
     0.50   0       9341  9352
////////////////////////////////////////////////////////////////////////////////

     -----Marginal predictions, less than 1% true exons: -----------
     0.34   1       4483  4495
////////////////////////////////////////////////////////////////////////////////

6b. potential INTERNAL EXON CLASSES
   first exon, and upstream of 3'-most predicted last exon)

     Score  Frame       Start  Stop     Equivalent exons, if any

     -----Highly confident predictions, more than 50% true exons: --
     0.80   1 0 0 1      2466  2527      2364  2527      2322  2527
     0.74   0 1 0 1      7275  7405      7302  7405
     0.67   0 0 1 0      3092  3143

     -----Medium confident predictions, more than 10% true exons: --
     0.80   1 0 0 1      2466  2527      2364  2527      2322  2527
     0.67   0 0 1 0      3092  3143
     0.57   1 0 0 0      5762  6038      5762  6116
     0.55   1 0 0 0      5125  5326      5137  5326      5086  5326
     0.51   1 0 0 1      8127  8236
     0.51   1 0 0 0      5735  6038      5735  6116
////////////////////////////////////////////////////////////////////////////////

     -----Marginal predictions, less than 1% true exons: -----------

7b. potential LAST EXON CLASSES
   (Only exons downstream of 5'-most predicted first exon)
     Score  Frame  Start  Stop       Equivalent exons, if any

     -----Confident predictions, more than 30% true exons: ---------

     -----Medium confident predictions, more than 10% true exons: --
     0.57   2       2466  2473      2364  2419
     0.55   1       7007  7013
     0.55   1       8127  8133      8226  8307
     0.55   2       7007  7020      7150  7376      7180  7376
     0.54   0        411   413       185   256       162   275       249   275
     0.54   1       2466  2514
     0.53   0       8127  8339      8226  8339
     0.52   1       3957  3963
////////////////////////////////////////////////////////////////////////////////

     -----Marginal predictions, less than 1% true exons: -----------

-------------------------------------------------------------------------------

8. potential GENES

12978 gene models were analyzed and ranked according to likelihood
The 20 most likely genes are

Ranking scores   List of exons (*) constituting gene (first, internal(s), last)
 0.797  1.450    2481  2527      3092  3143      5125  5219      5556  5583
                 5762  6038      7275  7405      8127  8339
 0.797  1.444    2481  2527      3092  3143      5125  5219      5556  5583
                 6017  6038      7275  7405      8127  8339
 0.797  1.406    2481  2527      3092  3143      5125  5219      5556  5583
                 5762  6038      7275  7405      8127  8236      8342  8480
 0.797  1.400    2481  2527      3092  3143      5125  5219      5556  5583
                 6017  6038      7275  7405      8127  8236      8342  8480
 0.797  1.387    2481  2527      3092  3143      5125  5219      5556  5583
                 5735  6038      7275  7405      8127  8339
 0.797  1.385    2481  2527      3092  3143      5125  5326      5874  6038
                 7275  7405      8127  8339
 0.797  1.364    2481  2527      3092  3143      3957  4177      5556  5583
                 5762  6038      7275  7405      8127  8339
////////////////////////////////////////////////////////////////////////////////

 (*) Exons listed are the representatives of equivalence classes, if any,
     and are the best bet from each class.
 If two gene models are non-overlapping they may constitute two separate genes.  


From: grail@gwen.EPM.ORNL.GOV (Gene Recognition and Analysis Internet Link)

>humil1.consensus, len = 9721

 pos   f-strand frame     orf       ||  pos   r-strand frame      orf
                                    ||
 .....................................................................
////////////////////////////////////////////////////////////////////////////////
 .....................................................................
 2461   0.066     -      - - -      ||  7261   0.128     -       - - -
 2471   0.092     -      - - -      ||  7251   0.010     -       - - -
 .....................................................................
 .....................................................................
 2491   0.051     -      - - -      ||  7231   0.000     -       - - -
 2501   0.264     -      - - -      ||  7221   0.000     -       - - -
 2511   0.363     -      - - -      ||  7211   0.000     -       - - -
 2521   0.124     -      - - -      ||  7201   0.000     -       - - -
 .....................................................................
////////////////////////////////////////////////////////////////////////////////
 .....................................................................
 5121   0.074     -      - - -      ||  4601   0.235     -       - - -
 5131   0.033     -      - - -      ||  4591   0.093     -       - - -
 5141   0.094     -      - - -      ||  4581   0.162     -       - - -
 5151   0.436     -      - - -      ||  4571   0.692     2    4547 -  4655
 5161   0.938     1   5044 -  5332  ||  4561   0.938     2    4547 -  4655
 5171   0.938     1   5044 -  5332  ||  4551   0.938     2    4547 -  4655
 5181   0.938     1   5044 -  5332  ||  4541   0.938     2    4499 -  4547
 5191   0.938     1   5044 -  5332  ||  4531   0.938     2    4499 -  4547
 5201   0.938     1   5044 -  5332  ||  4521   0.938     2    4499 -  4547
 5211   0.938     1   5044 -  5332  ||  4511   0.937     2    4499 -  4547
 5221   0.938     1   5044 -  5332  ||  4501   0.896     2    4499 -  4547
////////////////////////////////////////////////////////////////////////////////
 .....................................................................
 5891   0.071     -      - - -      ||  3831   0.002     -       - - -
 5901   0.153     -      - - -      ||  3821   0.005     -       - - -
 5911   0.193     -      - - -      ||  3811   0.017     -       - - -
////////////////////////////////////////////////////////////////////////////////
 .....................................................................
 7311   0.072     -      - - -      ||  2411   0.000     -       - - -
 7321   0.922     2   7247 -  7493  ||  2401   0.000     -       - - -
 7331   0.854     2   7247 -  7493  ||  2391   0.000     -       - - -
 7341   0.937     2   7247 -  7493  ||  2381   0.000     -       - - -
 7351   0.936     2   7247 -  7493  ||  2371   0.000     -       - - -
 7361   0.936     2   7247 -  7493  ||  2361   0.000     -       - - -
////////////////////////////////////////////////////////////////////////////////
 .....................................................................
 8161   0.872     3   7965 -  8337  ||  1561   0.000     -       - - -
 8171   0.935     3   7965 -  8337  ||  1551   0.000     -       - - -
 8181   0.938     3   7965 -  8337  ||  1541   0.000     -       - - -
 8191   0.938     3   7965 -  8337  ||  1531   0.000     -       - - -
 8201   0.938     3   7965 -  8337  ||  1521   0.035     -       - - -
 8211   0.938     3   7965 -  8337  ||  1511   0.001     -       - - -
 8221   0.938     3   7965 -  8337  ||  1501   0.001     -       - - -
////////////////////////////////////////////////////////////////////////////////

     Potential exons are listed in the following

      pos     strand   strand_prob    frame    quality        orf

 5151 -  5291   f        0.71          1      excellent   5044 -  5332
 2751 -  2821   r        0.63          3      good        2691 -  2871
 7321 -  7391   f        1.00          2      excellent   7247 -  7493
 1951 -  1971   r        0.69          1      good        1921 -  2086
 8161 -  8331   f        0.99          3      excellent   7965 -  8337           
8) Interpretations and data analysis.

Next, draw PostScript plots of all of your Figure files from this exercise. Initialize GCG graphics to PostScript by either typing the command postscript, and then specifying epsf for Encapsulated PostScript File as the graphics device, and then specifying a temporary filename as the output port, or by using the PostScript choice on the SetPlot menu. Until you reset your graphics configuration back to Tektronix, all graphics output will now be written to your account in PostScript format under the specified temporary filename. A sample screen trace of the PostScript command follows:

% postscript

 Use Postscript graphics with what device:

  LaserWriter
  Lzr1200
  LN03-ScriptPrinter
  LPS20
  ColorScript-100
  EPSF (single page encapsulated postscript format)

 Please choose one ( * LASERWRITER * ) epsf

 To what port is your EPSF connected (* term: *) temp.ps

 Plotting Configuration set to:

       Language: psd
         Device: EPSF
  Port or Queue: temp.ps
Now use the Figure program to draw each plot to a PostScript file. Launch the program by typing figure with the option -plot=(new actual output filename) in order to redirect the output from temp.ps to yourplot.ps. A screen trace illustrates:
% figure -plot=testcode.ps

Figure makes figures and posters by drawing graphics and text
together. You can include output from other GCG graphics programs as
part of a figure.

 FIGURE from what file ?  humil1.testcode

 PostScript instructions for a EPSF are now being sent to testcode.ps.
Finally, use lpr to print your new PostScript files as well as any pertinent output text files from the signal searches and E-mail server analyses that you performed earlier. The point -- you want to compare all of these methods of analysis and try to make some educated guesses as to where your genes and exons actually start and stop.

Use a big table or the floor, if necessary, to analyze and ponder over your data. It probably won't be as easy as you would have hoped for. Fortunately, often in a lab situation cDNA data is also available on a given sequence, although with the increased emphasis on genomic sequencing this is becoming less and less true. Another advantage in a real analysis is you would be using any similarities discovered through database searching to assist your interpretation. This is often very helpful, especially if you are dealing with a system that has much available database information. You will be learning more about these techniques in next week's exercise. Make yourself notes on where you feel the actual starts and stops of the coding regions are in your sequence. Annotate the .map file you made in the beginning of the exercise with the precise sequence number of each position. Be sure to include exon and intron borders. Indicate on your .map printout where the various signals and content biases you found are located. This information will be required in the exercise report form at the conclusion and it will be utilized in the next step.

9) Translate the sequence.

Translation: where to start and stop Exons and Introns, Splice Junctions, etc.

What about precursor versus mature, signal peptides and processing? How can we tell just what makes up the mature protein?

Many matters complicate computerized translation. Eukaryotic exons and introns can be especially confusing, but prokaryotic and organellar DNA have their own problems too. One that concerns all genes is whether you want to translate the entire thing or whether it has a signal peptide portion and how to tell which is what. A database of preprotein signal peptides is available through Gunnar von Heijne for just this type of analysis (1987). The EGCG (Extended GCG package from EMBL, online in the VADMS system) incorporates von Heijne's method with the SigCleave program and can be run with a -prokaryote switch to change from the default eukaryote search matrix. Type egenhelp for more information on EGCG.

The next logical step in sequence analysis is to translate your tentative URF's into a protein. First we will translate the putative coding regions which the previous analyses have indicated and then we will use the reference information of the actual database file(s) to translate the "real thing." Finally we will compare the two results with the Gap program to see how well our "educated guesses" have been. Realize that the validity of your interpretations will relate directly to your understanding of the molecular biology of the system you are dealing with.

Type translate and you will be asked from what sequence; answer with your consensus sequence's filename.ext. Now be careful -- you do not want to translate the entire sequence! Give the proper coordinates for the first, 5', coding region, as you determined in the above synthesis of your analyses, and the program will verify their locations. Next, you will be asked if you want to add another exon, or more genes, or just write out the translation. You need to make the correct decisions based on the sequence you are working on. Since your gene contains introns, you would, naturally, want to add more exons until finished at which point you would write it out and accept the default output filename. Refer to the screen trace following the next section for an example of Translate with the Genbank entry HumIL1Bx.

Remember the genomic nucleotide sequence that you found in the database back in Exercise 3 and then used as a verification of your GelAssemble work in Exercise 6, well, we need to use that entry again. In most cases this is the entry for your Selected Molecule in the organism that has had both its genomic sequence and three dimensional structure solved. Go back and check your notes or move over into previous subdirectories to verify its identity; note the sequence's name and the database it comes from. We are not using cDNA sequences here. The cDNA or mRNA entries do not have any introns so their header information cannot corroborate or refute your coding region analyses, so don't bother with them. As in earlier exercises, it may be a good idea to check with your lab instructor to be sure that you are using the proper sequence. Return to the Exercise 7 subdirectory and use typedata -reference to allow you to read about the exon-intron structure of the sequence. An example command line is shown below with Genbank:HumIL1Bx:

% typedata -reference genbank:humil1bx

Some of the Selected Molecule genomic sequences may not be found as just one entry in GenBank. If you are dealing with a genomic gene not found as one continuous sequence within the database, then you will have use TypeData on each one separately and use the "add another exon from a new sequence" option of Translate in the next step. Now Translate the actual sequence based on the information in its reference header. Of course, you must write down the relevant information first. You want to look for the terms CDS and exon within the FEATURES table; write down the numbers that follow each as well as the brief description of the segment. The relevant reference information from the interleukin example follows:

FEATURES             Location/Qualifiers
     CDS             join(2481. .2527,3092. .3143,5125. .5326,5874. .6038,
                     7275. .7405,8127. .8339)
                     /note="prointerleukin-1-beta"
     repeat_region   656. .668
                     /note="Alu repeat copy A 5' flank"
     repeat_region   669. .960
                     /note="Alu repeat copy A"
     repeat_region   961. .971
                     /note="Alu repeat copy A 3' flank"
     prim_transcript 1934. .8941
                     /note="IL mRNA and introns"
     intron          2006. .2465
                     /note="IL intron A"
     exon            2481. .2527
                     /note="prointerleukin-1-beta, exon 2 ( first expressed
                     exon) /map='2q13-q21' /hgml_locus_uid='LG0004J'"
                     /gene="IL1B"
     intron          2528. .3091
                     /note="IL intron B"
     exon            3092. .3143
                     /note="prointerleukin-1-beta, exon 3"
     intron          3144. .5124
                     /note="IL intron C"
     repeat_region   4642. .4658
                     /note="Alu repeat copy B 5' flank"
     repeat_region   4659. .4988
                     /note="Alu repeat copy B"
     repeat_region   4989. .5007
                     /note="Alu repeat copy B 3' flank"
     exon            5125. .5326
                     /note="prointerleukin-1-beta, exon 4"
     intron          5327. .5873
                     /note="IL intron D"
     exon            5874. .6038
                     /note="prointerleukin-1-beta, exon 5"
     intron          6039. .7274
                     /note="IL intron E"
     exon            7275. .7405
                     /note="prointerleukin-1-beta, exon 6"
     intron          7406. .8126
                     /note="IL intron F"
     exon            8127. .8339
                     /note="prointerleukin-1-beta, exon 7"
     repeat_region   9251. .9721
                     /note="Alu repeat copy C"
Notice that the reference section is chock full of good information. It contains important literature references, descriptive comments and oftentimes valuable structural and regulatory insights. Using the CDS information for your entry, Translate it. An abridged screen trace from the interleukin translation session follows:
% translate

TRANSLATE translates nucleotide sequences into peptide sequences.

 TRANSLATE from what sequence ?  gb:humil1bx

                  Begin (* 1 *) ?  2481
                End (*  9721 *) ?  2527
               Reverse (* No *) ?  <rtn>

 Range begins ATGGC and ends TACAG.  Is this correct (* Yes *) ?  <rtn>

 That is done, now would you like to:

   A) Add another exon from this sequence
   B) Add another exon from a new sequence

   C) Translate and then add more genes from this sequence
   D) Translate and then add more genes from a new sequence

   W) Translate assembly and write everything into a file

 Please choose one (* W *):  a

                  Begin (* 1 *) ?  3092
                End (*  9721 *) ?  3143
               Reverse (* No *) ?  <rtn>

 Range begins TGGCA and ends TGAAG.  Is this correct (* Yes *) ?  <rtn>

////////////////////////////////////////////////////////////////////////
This procedure is repeated as many times as you have separate exons; finally the sequence is written out:
 That is done, now would you like to:

   A) Add another exon from this sequence
   B) Add another exon from a new sequence

   C) Translate and then add more genes from this sequence
   D) Translate and then add more genes from a new sequence

   W) Translate assembly and write everything into a file

 Please choose one (* W *):  <rtn>

 What should I call the output file (* Humil1bx.Pep *) ?  <rtn>
The last program to be used today should be familiar by now. Use Gap to compare your hypothetical protein from the previous coding analysis to the actual protein translated just now. You all should know how to run this comparison, if you have any questions, refer to the GCG program manual or the online genhelp system. Name the Gap output file after your last name and give it the extension .peppair. I hope that I can trust everyone not to cheat by using database feature table entries in your inital translation. The purpose of this exercise is to learn, not to get a perfect translation. I do not expect anyone to get perfect translations and will be suspicious if somebody does.

10) Conclusions and evaluations.

In order to get credit for this lab, at the conclusion of the exercise remote copy (rcp) to the teacher account three output files. Also complete the exercise report form with the pico editor and then remote copy it to teacher. This is all done exactly the same way as in previous exercises. Be sure that the files you send the teacher account are all named with your last name as the filename but retain either the standard GCG file extensions or week7s as the extension for the report form. These files are:

  1. (your name).testcode
  2. (your name).codonpref
  3. your Gap output, (your name).peppair
  4. the week7s report form, (your name).weeek7s

Since they all have the same filename, (your name), you can remote copy them to teacher all at the same time using a wild card:

% rcp (your name).* teacher@ribozyme:receive

Logout of ribozyme and then exit the emulator. As before, leave the computer turned on.

You have been exposed to a perplexing variety of techniques within this laboratory exercise for the identification of protein coding regions in genomic DNA. As in all molecular and biological computer analyses, the more you understand the chemical, physical and biological systems involved, the better your chance of success in analyzing them. Certain strategies are inherently more appropriate to others in certain circumstances. Making these types of subjective, discriminatory decisions and utilizing all of the available options so that you can generate the most practical data for evaluation are two of the most important "take-home" messages that this course can offer!

Several general references are available in this field -- most provide extensive weight matrices for consensus pattern searches. Naturally each would have to be tailored into the format correct for whichever matrix searching program you might be using. They also all describe many of the factors involved and the constraints used in content type algorithms. The suggested textbook for the course, Sequence Analysis Primer, by Gribskov and Devereux, is a very good starting point.

Next week we will investigate what to do with your protein sequence now that you've got it. Enter the world of database searching -- a world of batch processes, parallel processing, and huge numbers, and many questions such as, "Is this significant; does it mean anything?" and "This is homologous?" A world often fraught with frustrating results but also exciting discoveries!

Supplemental Information for Further Exploration:

Phillipp Buchers Eukaryotic Promoter Database (EPD):
Dr. Bucher has assembled an extensive list of eukaryotic promoter regions compiled from the EMBL database. His database includes a user's manual, the sequence information itself, and an independent, journal abstracted data reference section for each entry. In order to be included in EPD an entry must:
1) be recognized by eukaryotic RNA POL II,
2) be active in eukaryotes (excludes phycophytes, fungi, myxomycetes, protists),
3) be experimentally defined or sufficiently similar to one defined as such,
4) be biologically functional,
5) be available in the current EMBL release,
6) be distinct from other promoters in the database.

The current EPD online (Release 48, September 1996) has 1285 total promoter entries (846 independent entries) and is available on ribozyme through the GCG database logical EPD. The references and sequences have been built into a GCG style database. The user's manual file is EPD.doc, available in the directory specified by the environment variable $DOC_DIR.

All GCG database searching programs work through the logical EPD in the usual manner. For example, FastA will find the entry:
% fasta humil1.consensus epd:*

EPD17079 (+) Hs IL-1b'; range -499 to 100.

SCORES     Init1: 2400 Initn: 2400 Opt: 2400
           100.0% identity in 600 bp overlap

References

Bucher, P. (1990). Weight Matrix Descriptions of Four Eukaryotic RNA Polymerase II Promoter Elements Derived from 502 Unrelated Promoter Sequences. Journal of Molecular Biology 212, 563-578.

Bucher, P. (1995). The Eukaryotic Promoter Database EPD. EMBL Nucleotide Sequence Data Library Release 42, Postfach 10.2209, D-6900 Heidelberg.

Ghosh, D. (1990). A Relational Database of Transcription Factors. Nucleic Acids Research 18, 1749-1756.

Hawley, D.K. and McClure, W.R. (1983). Compilation and Analysis of Escherichia coli promoter sequences. Nucleic Acids Research 11, 2237-2255.

Kozak, M. (1984). Compilation and Analysis of Sequences Upstream from the Translational Start Site in Eukaryotic mRNAs. Nucleic Acids Research 12, 857-872.

McLauchen, J., Gaffrey, D., Whitton, J. and Clements, J. (1985). The Consensus Sequences YGTGTTYY Located Downstream from the AATAAA Signal is Required for Efficient Formation of mRNA 3' Termini. Nucleic Acid Research 13 , 1347-1368.

Proudfoot, N.J. and Brownlee, G.G. (1976). 3' Noncoding Region in Eukaryotic Messenger RNA. Nature 263, 211-214.

Stormo, G.D., Schneider, T.D. and Gold, L.M. (1982). Characterization of Translational Initiation Sites in E. coli. Nucleic Acids Research 10, 2971-2996.

von Heijne, G. (1987). SIGPEP: A Sequence Database for Secretory Signal Peptides. Protein Sequences & Data Analysis 1, 41-42.