'97 BC/BP 578

Week 6

Sequence Series

DNA Sequencing: The GCG Fragment Assembly Package (FAS)--getting your fragments into the computer and assembled into a continuous sequence.

Author:

Steven M. Thompson

Fragment Assembly Systems:
one of the best things going for coping with DNA sequencing data!

DNA sequencing data can be voluminous and perplexing; its management is a formidable task. The Genetics Computer Group (GCG) has implemented a tremendously powerful integrated package of programs for building up complete DNA sequences and managing the myriad of data obtained in routine DNA sequencing experiments so commonplace in molecular biology laboratories. They turn a sometimes dreaded and often tedious job, that of recreating an entire sequenced stretch, into a manageable proposition. This package is called the Fragment Assembly System (FAS). As installed, it can build a continuous DNA sequence from up to 1,650 individual fragment sequences up to a total length of 380,000 bases per project. The system is based on an "electronic notebook" concept similar to Staden's (1980); its editor, GelAssemble, was developed from Dr. William Gilbert's MSE program. The Fragment Assembly System has five objectives:

  1. to provide a manageable method for storing fragment sequence data in a project database that remains invisible to the user, so that the user need not deal with the intricacies of data and file manipulation.

  2. to recognize the overlaps between separate fragments and perform alignments of those fragments; yet allow for . . .

  3. user manipulation and editing of these alignments so that one is not "trapped" into accepting anything that the system suggests and any alterations can be performed in an interactive "on-screen" manner.

  4. to display the alignment and allow export of any part or all of it to standard GCG file formats in either a base-by-base sequence file or in a pictorial "Big Picture" representation.

  5. Finally, the system generates a consensus, at any point desired within the process, based on your accepted alignment using standard ambiguity codes and it can easily export that consensus into your directory structure.
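
To make objective 5 concrete, here is a minimal Python sketch of how a consensus with standard IUPAC ambiguity codes can be derived from aligned fragments. It is not GCG's implementation: the function name consensus, the (offset, sequence) representation, and the rule of reporting every base seen in a column are assumptions made for illustration only.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases observed.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def consensus(fragments):
    """Build a consensus from (offset, sequence) pairs.

    offset is the 0-based column at which the fragment starts.
    This is an illustrative simplification, not the GCG algorithm.
    """
    length = max(offset + len(seq) for offset, seq in fragments)
    columns = [set() for _ in range(length)]
    for offset, seq in fragments:
        for i, base in enumerate(seq.upper()):
            if base in "ACGT":  # skip gaps and unreadable symbols
                columns[offset + i].add(base)
    # A column with no readable base is reported as N.
    return "".join(IUPAC[frozenset(c)] if c else "N" for c in columns)


# Three hypothetical overlapping fragments; column 4 holds both A and G.
print(consensus([(0, "ACGTAC"), (3, "TACGG"), (4, "GCGGA")]))
```

In this toy alignment the only disagreement is an A against a G in one column, so that position is written with the purine code R.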

The Fragment Assembly System has been extensively modified and has become very powerful since it was first introduced with GCG version 7. The entire process is now quite quick and easy. The programs assemble contigs automatically, so the user no longer has to assemble them by hand. The program GelMerge both discovers and assembles overlaps. An enhanced GelAssemble program serves as the editor for checking and manipulating alignments; the contigs, however, arrive in it already assembled. A further enhancement is that GelMerge can optionally excise designated vector sequences automatically. GelMerge has a whole slew of options and the GelAssemble editor has several powerful features; therefore, be sure to review the documentation carefully before attempting to take advantage of the fragment assembly package. Furthermore, GCG provides a very good overview essay of the FAS in the Program Manual -- please take the time to read that essay.

The Fragment Assembly System can provide an incredible relief from the reams of paperwork necessary in traditional sequencing data management. It can free up massive amounts of time to allow the investigator to concentrate more on the research and less on the tedium of any given project. If you still run and read gels manually, that part will remain time consuming, and that is your choice; however, there is absolutely no sense in not using the computer to manage the resulting data -- enter the fragments directly into the computer and let it do the work.

Selected Molecule List

The following molecules are again listed for your reference. Please continue to use the same one as in Exercise 5. Note the entry number for your choice; it will be needed later in the exercise:

  1. plant ribulose bisphosphate carboxylase/oxygenase, small subunit only

  2. mammalian P21 ras proto-oncogene transforming protein

  3. mammalian basic fibroblast growth factor

  4. fungal superoxide dismutase

Exercise Series #6: A "Real-Life" Project Oriented Approach.

Preliminary Preparations:

After logging into your account on ribozyme, type gcg at the system prompt to activate the Genetics Computer Group's programs. Also create (mkdir) and move down into (cd or down) a new week6 subdirectory for all of exercise 6. Next, copy the necessary files and the exercise report form from ribozyme to the new directory. Complete this procedure just the same as you've done in previous exercises by using the environment variable name $GRAD_DIR and the appropriate path, here /week6s:

% cp $GRAD_DIR/week6s/* .

Review gel entry: SETKEYS

So, how to start? The first order of business is to get your fragments into the system; therefore, before we actually begin learning FAS we are going to review some data entry. Remember from Exercise 4 that GCG provides a neat means of reassigning the keys of the keyboard called SetKeys. This is very handy for sequence entry because the letters of the bases, A, C, G, T, are pretty awkward to type repeatedly since they are so spread out over the left portion of the keyboard. SetKeys creates a file that is recognized by all GCG sequence editors.

1) Run SetKeys to establish remapped key assignments.

Decide whether you like SetKeys based on your experience in Exercise 4. I am not going to require you to use SetKeys if you do not like it -- I empathize with you and don't use it myself. If you want to use SetKeys, launch the program and decide which keys you wish to represent the various bases and ambiguity codes. The following screen will be displayed; the responses shown are the choices I would make if I were to use it:

% setkeys

SETKEYS writes a file in your directory that redefines your
keyboard's keys for sequence entry with the programs SEQED,
LINEUP, GELENTER, and GELASSEMBLE.  The output file, called
Set.Keys, can be edited if you want to use keys that were not
defined in your interactive session with SETKEYS.

 Choose key(s) for each nucleotide:

 What key(s) should mean G ?  j
 What key(s) should mean A ?  k
 What key(s) should mean T ?  l
 What key(s) should mean C ?  ;

 Now choose key(s) for the common ambiguity codes:

 What key(s) should mean R ?  r
 What key(s) should mean Y ?  y
 What key(s) should mean N ?  n
 What key(s) should mean <Delete> ?  <Delete> (i.e. press the
Delete/Backspace key)

 SetKeys complete: output file is
/disk2/usr/local/people/thompson/BC578/EX6/set.keys.

The new key assignments can then be seen by using the more command on the new set.keys file. This file can be modified later with a text editor if desired.

% more set.keys

 SETKEYS output file for initializing SEQED, LINEUP, GELENTER, and
GELASSEMBLE

                        January 21, 1996  09:01          ..
Change j   into   G
Change k   into   A
Change l   into   T
Change ;   into   C
Change r   into   R
Change y   into   Y
Change n   into   N

Now, as long as you work in this directory, all GCG sequence editors will use only these key assignments. Remember, you must use the key assignments that you chose with SetKeys if the set.keys file is present in the directory in which you are working!
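
The remapping recorded in set.keys can be pictured with a short Python sketch. It only illustrates what the "Change x into Y" lines mean; it is not how the GCG editors read the file. The function names parse_set_keys and translate are hypothetical, and letting unassigned keys pass through unchanged is an assumption of the sketch.

```python
import re

# The assignment lines from the set.keys listing shown above.
SET_KEYS_TEXT = """\
Change j   into   G
Change k   into   A
Change l   into   T
Change ;   into   C
Change r   into   R
Change y   into   Y
Change n   into   N
"""


def parse_set_keys(text):
    """Return {typed_key: base} from 'Change <key> into <base>' lines."""
    mapping = {}
    for line in text.splitlines():
        match = re.match(r"\s*Change\s+(\S)\s+into\s+(\S)\s*$", line)
        if match:  # header and date lines do not match and are skipped
            mapping[match.group(1)] = match.group(2)
    return mapping


def translate(typed, mapping):
    """Convert typed keys to bases; unassigned keys pass through (assumption)."""
    return "".join(mapping.get(ch, ch) for ch in typed)


keys = parse_set_keys(SET_KEYS_TEXT)
print(translate("jkl;jjkn", keys))
```

With the assignments above, the right-hand home-row keystrokes jkl;jjkn stand for the sequence GATCGGAN.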

FAS sessions using fragments from data sets provided > entire sequence.

In an actual laboratory situation I would suggest entering fragment sequences directly as the data is read off the gels, to avoid an accumulation of excess data and the tedious job of entering kilobases of information all at one time. Here, however, to conserve time, the data for all but one of your Selected Molecule fragments will be provided in electronic format; the remaining fragment will be handed out. Mistakes have purposely been placed in the overlaps of the Selected Molecule fragments to force some interaction with the fragment assembly package; otherwise the system would automatically assemble the entire sequence with no user intervention -- not the objective of a learning experience.

Make sure that you are in the directory created above -- stay in it for the duration of this exercise except where otherwise noted. Please make sure that you have read the introduction to the FAS package within the GCG Program Manual -- most of the FAS documentation is helpful reference but does not require comprehensive reading; however, this section is essential!

2) Practice session with GCG supplied data.

To assist in learning the system GCG has installed a set of practice sequencing fragments in their data libraries. Before using your Selected Molecule data, run through the fragment assembly system with the example set provided by GCG. To use these fragments in a practice session use the command Fetch as follows:

% fetch 9b*.seq

This will copy them to your present directory (they are in GenDocData). This set contains vector sequences from GenBank:synblkspv. GelMerge can excise these before creating the alignment, if you specify the sequence name in response to the vector prompt in GelStart along with specifying the option -excise in GelMerge (separate more than one with commas). GelMerge -nomerge -report can write a file (your_project.report) with information about what sequences will be excised if -excise is added to the command line subsequently (for the faint of heart).

Before FAS can be used, the initialization program GelStart must be run. Give an appropriate project name (name only, no extension); confirm that it is a new project by answering y, since the default answer is "No"; specify the proper vector name; and press Return at the restriction site prompt:

% gelstart

GELSTART begins a fragment assembly session by creating a new
fragment assembly project or by identifying an existing project.

 What is the name of your fragment assembly project?  practice

 GELSTART cannot find this project. Is it a new one (* No *) ? y

 You have a new project named "PRACTICE".

 Which vector sequence(s) would you like highlighted?  genbank:synblkspv

 Which restriction site(s) would you like highlighted ? <rtn>

 Project PRACTICE has 0 fragments in 0 contigs.

You are ready to run the other fragment assembly programs.

GelEnter is used either as a sequence editor for raw data entry or to transfer preexisting files into the FAS database. Use GelEnter with its -enter option to load the practice GCG fragments:

% gelenter -enter=9b*.seq

GELENTER adds fragment sequences to a fragment assembly project. It
accepts sequence data from your terminal keyboard, a digitizer, or
existing sequence files.

 "9b_06" 338 nucleotides
 ....
 "9b_88" 180 nucleotides

Now that all of the fragments have been loaded into the system, let the overlap program GelMerge discover how they all fit together. Give the command gelmerge with the option -check to read and choose from the lengthy options list available:

% gelmerge -check

GELMERGE aligns the sequences in a fragment assembly project into
assemblies called contigs.  You can view and edit these
assemblies in GELASSEMBLE.

Syntax: % GELMerge -Default

Required Parameters:

-WORdsize=7           word size for overlap determination
-STRIngency=0.8       minimum fraction of matching words in overlap
-MINOverlap=14        minimum length of overlap

Local Data Files:

-DATa1=GelMergeDNA.Cmp       comparison table for contig assembly
-DATa2=GelMergeLocalDNA.Cmp  comparison table for vector recognition

Optional Parameters:

-MINIdent=14             minimum run of identical bases found at least
                           once in an overlap between two contigs
-MAXGap=10               maximum gap size for overlap determination
-GAPweight=0.8           gap creation penalty in contig assembly
-LENgthweight=0.2        gap extension penalty in contig assembly
-ARChive                 creates contigs from the original gel readings
-WORKing                 creates contigs from individual working

 Press <rtn> for more:  <rtn>

                           fragment (with gaps removed)
-REPortfile[=filename]   write report of recognized vector sequences
-EXCise                  remove vector sequences from single-fragment
                           contigs
-VECTORSTrigency=0.8     minimum fraction of matches in vector recognition
-VECTORMINIdent=10       minimum run of identical bases found at least
                           once in a match between vector and fragment
-VECTORMAXGap=5          maximum gap size in first step of vector
                           recognition
-VECTORGAPweight=3.0     gap creation penalty in vector recognition
-VECTORLENgthweight=0.3  gap extension penalty in vector recognition
-NOMERge                 suppresses contig assembly
-NOMONitor               suppresses screen trace of program progress
-NOSUMmary               suppresses screen summary at the end of the
                          program
-BATch                   submits program to the batch queue

Notice that the option list is extensive. This makes the program very powerful, especially at recognizing and excising vector sequences. The two options that we wish to use at this point are -report and -excise; these will, respectively, produce a report file of the vector sequences found and automatically cut them out of the fragments. Therefore, type these options at the "Add what to the command line ?" prompt. Accept all of the default parameters after this point in this first run through the GelMerge program. Notice that the program reads the fragments, searches for the vector sequences, cuts them out, finds overlaps between the fragments, and aligns them, all in one session:

 Add what to the command line ?  -report -excise

 What word size (* 7 *) ?  <rtn>

 What fraction of the words in an overlap must match (* 0.80 *) ?  <rtn>

 What is the minimum overlap length (* 14 *) ?  <rtn>

   Reading ....................

 Searching for synblkspv
           ...*..*...*.........
  Excising all 271 bases from 9b_19
  Excising all 169 bases from 9b_28
  Excising all 235 bases from 9b_42

 Comparing .................

  Aligning ................

   Writing ....

          Input Contigs:         20
         Output Contigs:          4
    Zero Length Contigs:          3

               CPU time:      01.75

Notice the three "Zero Length Contigs" caused by GelMerge's excise option. Take a look at the report file that was written, in my case, practice.report:

 GELMERGE vector report of Project: Practice

  VECTORS:  synblkspv

  Word-size: 7  Stringency: 0.80  MinIdentities: 12  MaxGap: 5
  Gap Weight: 3.0  Length Weight: 0.3    ..

//////////////////////////////////////////////////////////////////

9b_19  Length: 271    Excised all 271 bases
synblkspv  Length: 2964
                  .         .         .         .         .
     271 TTTTCTGTGACTGGTGAGTACTCAACCAAGTCATTCTGAGAATAGTGTAT 222
         ||||||||||||||||||||||||||||||||||||||||||||||||||
    2508 TTTTCTGTGACTGGTGAGTACTCAACCAAGTCATTCTGAGAATAGTGTAT 2557
                  .         .         .         .         .
//////////////////////////////////////////////////////////////////
                  .         .
9b_28  Length: 169    Excised all 169 bases
synblkspv  Length: 2964
                  .         .         .         .         .
     169 CTGGTAGCGGTGGXTTTTTTTGTTTGCAAGCAGCAGATTACGCGCAGAAA 120
         ||||||||||||| ||||||||||||||||||||||||||||||||||||
    1741 CTGGTAGCGGTGG.TTTTTTTGTTTGCAAGCAGCAGATTACGCGCAGAAA 1789
                  .         .         .         .         .
//////////////////////////////////////////////////////////////////
                  .
      19 AGGATCTTCACCTAGATCC 1
         |||||||||||||||||||
    1890 AGGATCTTCACCTAGATCC 1908

9b_42  Length: 235    Excised all 235 bases
synblkspv  Length: 2964
                  .         .         .         .         .
       5 CAAGAAGATCCTTTGATCTTTTCTACXGGXTCTGACGCTCAGTGGAACGA 54
         |||||||||||||||||||||||||| || ||||||||||||||||||||
    1800 CAAGAAGATCCTTTGATCTTTTCTACGGGGTCTGACGCTCAGTGGAACGA 1849
                  .         .         .         .         .
//////////////////////////////////////////////////////////////////
                  .         .
     205 TATCTCAGCGATCTGTCTATTTCGTTC 231
         |||||||||||||||||||||||||||
    2000 TATCTCAGCGATCTGTCTATTTCGTTC 2026

Notice the alignments between the vector sequences and the three fragments. Imagine the confusion that this could cause in an attempted contig assembly unless recognized beforehand.
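
In the first alignment of the report the fragment positions run downward (271 to 222) while the vector positions run upward (2508 to 2557), which suggests that the match lies on the opposite strand of the fragment. The Python sketch below shows one simple way to look for a run of identical bases on either strand. It is not the GelMerge algorithm; the function names, the exact-match rule, and the toy sequences are assumptions chosen for illustration, and the default of 10 merely echoes the -VECTORMINIdent value in the option list.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def shared_run(fragment, vector, min_ident=10):
    """Find the first exact run of min_ident bases shared with the vector.

    Returns (strand, fragment_start, vector_start) using 0-based positions,
    where fragment_start is counted on the strand that matched, or None.
    Exact matching only: a simplification, not GCG's method.
    """
    vector = vector.upper()
    strands = (("+", fragment.upper()), ("-", reverse_complement(fragment)))
    for strand, seq in strands:
        for i in range(len(seq) - min_ident + 1):
            j = vector.find(seq[i:i + min_ident])
            if j != -1:
                return strand, i, j
    return None


# Hypothetical toy data: the "vector" contains only A and C, so a fragment
# made of G and T can match it only through its reverse complement.
toy_vector = "ACCACACCCAACACCACAAACCCACACAAC"
toy_fragment = reverse_complement(toy_vector[8:24])
print(shared_run(toy_fragment, toy_vector))
```

A fragment that matches only on the "-" strand has been read from the opposite strand relative to the vector entry, which is why checking both orientations matters before assembly.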

Run GelView to see how it displays the results found by GelMerge. Type gelview; specify "term" as the name of the output file so that no file is created; it is merely scrolled to the screen. Again notice the three "Zero Length Contigs" listed.

The next program to run is GelAssemble. Please peruse the manual before attempting to launch it. GCG places vertical bars along the margins of sections in the manual which differ from previous releases so it is easy to spot the new features if you've used previous releases.

Now launch the editor by typing gelassemble. The first screen that you will see is a diagrammatic list of the available contigs found by GelMerge. Scroll through the contigs with your arrow direction keys as explained in the manual. Only the first will contain any fragments; the others are the "Zero Length Contigs" produced by GelMerge. The screen should look like this:

% gelassemble

GELASSEMBLE is a multiple sequence editor for viewing and editing
contigs assembled by GELMERGE.




  11        9b_15                 +------------>
  10        9b_07               <-------------+
   9        9b_87             +----------->
   8        9b_32         <---------+
   7        9b_68         +------------>
   6        9b_84         <-----------+
   5        9b_29        +-------->
   4        9b_31       <----------------+
   3        9b_25       +------------>
   2       9b_06       +----------------->
   C       CONSENSUS   +----------------------------------->

|----------|----------|----------|----------|----------|
                       0         200        400        600        800
Contig    1 of    4

Scroll with <up-arrow>, <down-arrow>, <Prev-Screen>, and
<Next-Screen>

<right-arrow> for next contig, <left-arrow> for previous contig
<Ctrl-D> for Screen Mode, <Ctrl-K> to load a contig:

After selecting the contig that you would like to edit, in this case the first since the others are empty, press <Ctrl-k>. The screen will change into "Screen Mode:"

GelAssemble                          9b_06
GCG
                     Absolute:       1   Relative:       1


TACTGGT
CCCCCAGCCCCTACTGGT
AAGGCCTCTGCXGGCCCCAG.CCCTACTGGT
GCCCAACATGCAGCCTCTGCCCCCAA.GCCTCTGC.GGCCCCAG.CCCTACTGGT
AGGAAGTCAAGCTCAGCCTGCCCCCAAGGCCTCTGCGGGCCCCAG.CCCTACTGGT
AACACAGACCATTGCTGCAGCCCAACATGCAGCCTCTGCCCCCAAGGCCTCTGCXGGCCCCAG.CCCTACTGGT
AACACAGACCATTGCTGCAGcccaaCAtGCagcctCTGCCCCCAAgGCCTCTGCgGgCCCCAG.CCCTACTGGT
....|.........|.........|.........|.........|.........|.........|.........|....
    0        10        20        30        40        50        60        70

   7       +---------------->
   6      <---------------+
   5      +---------->
   4    <----------------------+
   3    +---------------->
   2   *---------------------->
   C   +--------------------------------------------->

|......|......|......|......|......|......|......|......|......|......|
       0     100    200    300    400    500    600    700    800    900
1000

<Ctrl-D> for Command Mode, or ? for help
                                                                 Edit Mode: INS

Notice on your screen that areas of discrepancy are highlighted with inverse video. Press <Ctrl-d> to get the command prompt. Explore some of the commands listed in the manual while the contig alignment is in the editor in front of you. Return to the initial Contig List by giving the contig command. The "Zero Length Contigs" can now be erased from the database by loading them one at a time into the editor and issuing the erase command. Exit the GelAssemble editor.

Now run GelView again and notice that the three "Zero Length Contigs" are no longer listed:

% gelview

GELVIEW displays the structure of the contigs in a fragment
assembly project.

 What should I call the output file (** Practice.View **) ?  term

 Practice has 17 Fragments in 1 Contigs

GELVIEW Fragment Assembly contig display of Project: Practice
    September 15, 1993  11:24

Contig: 9b_06

  18       9b_48                                 <------+
  17       9b_46                              +------->
  16       9b_79                            <---------+
  15       9b_88                        +--------->
  14       9b_44                        +------------------>
  13       9b_21                       <------------+
  12       9b_74                   <------------+
  11       9b_15                  +------------>
  10       9b_07                <-------------+
   9       9b_87              +----------->
   8       9b_32          <---------+
   7       9b_68          +------------>
   6       9b_84          <-----------+
   5       9b_29         +-------->
   4       9b_31        <----------------+
   3       9b_25        +------------>
   2       9b_06       +----------------->
   C       CONSENSUS   +----------------------------------->
                       |----------|----------|----------|---------|---------|
                       0         200        400        600        800

 17 Fragments in 1 Contigs

3) Your Selected Molecule data: Initialize the Fragment Assembly System.

Now that you've seen how the system works with the practice data set, begin a new project with your Selected Molecule data. GelStart must be run every time you start a new session with the assembler. The first time a project is started you assign it a project name and thereafter always refer to it by that name. As you saw above, GelStart has options to provide for the recognition of vector and restriction site sequences. This can be very handy for identifying overcloned vector portions; however, we will not utilize it with the Selected Molecule data as I can assure you that no vector sequence is included in them. Type gelstart again; give your project an appropriate name considering the protein you are working on. Press <Return> at all other prompts. I'll illustrate with the prion example from the previous exercise:

% gelstart

GELSTART begins a fragment assembly session by creating a new
fragment assembly project or by identifying an existing project.

 What is the name of your fragment assembly project?  prion

 GELSTART cannot find this project. Is it a new one (* No *) ?  y

 You have a new project named "PRION".

 Which vector sequence(s) would you like highlighted?  <rtn>

 Which restriction site(s) would you like highlighted ?  <rtn>

 Project PRION has 0 fragments in 0 contigs.

 You are ready to run the other fragment assembly programs.

4) Enter your gel fragment sequences.

First type gelenter without the -enter option for manual fragment entry. A ruler line and sequence name prompt will appear -- type in whatever fragment name you desire for the sequence handed out to you, without an extension. The cursor will move to the start of the ruler line; proceed to enter your respective sequence, using the key assignments of SetKeys if you used it! This program works just like SeqEd; don't be intimidated. Enter your entire fragment sequence without interruption. Mistakes must be corrected before the sequence is saved (GelEnter will not allow you to work on the same fragment twice); the commands of GelEnter are the same as those of SeqEd. Check your work and correct any errors.

% gelenter

GELENTER adds fragment sequences to a fragment assembly project. It
accepts sequence data from your terminal keyboard, a digitizer, or
existing sequence files.

for-1                      ***** K E Y B O A R D *****
GELENTER


     CGGCGCCGCGAGCTTCTCCTCTCCTCA    etc.....
....|.........|.........|.........|.........|.........|.........|.........|...
    0        10        20        30        40        50        60        70


    ^~~~~~~~~~~~~~~~~~~~
    |......|......|......|......|......|......|......|......|......|......|
    0     10     20     30     40     50     60     70     80     90     100

After finishing, type <Ctrl-d> to enter command mode then type the command exit to save the sequence into the database and leave GelEnter.

Next we need to use GelEnter to enter the rest of the fragments. To get your fragments use the copy command with the environment variable name and path $GRAD_DIR/geldata. Designate your group of fragments by its number from the Selected Molecule list. For example, to retrieve all of the ras fragments issue the command cp $GRAD_DIR/geldata/2* . (the final period designates the current directory as the destination). Make sure that these sequences all end up in your current directory and that they all have the common extension ".seq". Type gelenter -enter=(your number)*.seq to enter the new fragment sequences into your database. Be sure to specify your sequences' number; otherwise the 9b fragment group will be installed in the same project as well! GelEnter will read and enter each of the fragments into the database automatically:

% gelenter -enter=(your_number)*.seq

GELENTER adds fragment sequences to a fragment assembly project. It
accepts sequence data from your terminal keyboard, a digitizer, or
existing sequence files.

 "For-1...through...
 "For-8"  302 nucleotides

5) Find how the pieces fit together.

Now the vital job of discovering the overlaps between the fragments and assembling the pieces is performed by GelMerge. The program assembles contigs from individual fragments and previously assembled contigs. Type gelmerge to launch the process. For the first pass accept the default values for word size, match fraction, and minimum overlap length. The program will read, compare, and assemble the fragment sequences that it can and then return the system prompt. However, if no overlaps are found, this will be reported and you will need to decrease the search stringency.

% gelmerge

GELMERGE aligns the sequences in a fragment assembly project into
assemblies called contigs.  You can view and edit these
assemblies in GELASSEMBLE.

 What word size (* 7 *) ?  <rtn>

 What fraction of the words in an overlap must match (* 0.80 *) ? <rtn>

 What is the minimum overlap length (* 14 *) ?  <rtn>

   Reading ....................

 Comparing ....................

  Aligning .................

   Writing ...

          Input Contigs:          8
         Output Contigs:          5

               CPU time:       5.68
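
The three prompts above (word size, fraction of matching words, minimum overlap length) can be pictured with a small Python sketch. It is a gapless simplification written for this exercise, not GelMerge's actual method, which is not described here; the function names and the toy sequences are hypothetical. It compares the end of one fragment with the start of another, word by word, and accepts the overlap only if enough words agree.

```python
def word_match_fraction(a, b, overlap, word_size=7):
    """Fraction of aligned words that agree when the last `overlap` bases
    of `a` are laid over the first `overlap` bases of `b` (no gaps)."""
    n_words = overlap - word_size + 1
    if n_words < 1:
        return 0.0
    tail, head = a[-overlap:], b[:overlap]
    hits = sum(
        tail[i:i + word_size] == head[i:i + word_size]
        for i in range(n_words)
    )
    return hits / n_words


def best_overlap(a, b, word_size=7, stringency=0.8, min_overlap=14):
    """Longest end-of-a / start-of-b overlap passing both thresholds, or None."""
    for overlap in range(min(len(a), len(b)), min_overlap - 1, -1):
        if word_match_fraction(a, b, overlap, word_size) >= stringency:
            return overlap
    return None


# Hypothetical fragments: the last 20 bases of frag_a reappear at the start
# of frag_b, except for one deliberate misread (T read as A) at the third base.
frag_a = "GATTACAGGCTTAACCGGTCATGCAAGTCCGATAGC"
frag_b = "GGACATGCAAGTCCGATAGCTTGCACGTAACTGGAT"

print(best_overlap(frag_a, frag_b))                  # default stringency 0.80
print(best_overlap(frag_a, frag_b, stringency=0.6))  # relaxed stringency
```

In this toy case the single misread spoils 3 of the 14 words in the overlap, so only 11/14 (about 0.79) of them match: the overlap is rejected at 0.80 (the first call returns None) but found at 0.60 (the second returns 20). That is the effect of lowering the match fraction when GelMerge fails to join fragments that you know overlap.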

6) Check it out.

In order to see how well GelMerge worked we can use the program GelView. Type gelview; this will provide a view of the current status of the project. Accept the default output file name and the system prompt will be returned. Take a look at your new .view file to see the contigs that GelMerge discovered. These are the results of GelMerge at this first pass -- don't get discouraged, with each pass through the system, more will fall into place:

% gelview

GELVIEW displays the structure of the contigs in a fragment
assembly project.

   What should I call the output file (** prion.view **)?  <rtn>

   Prion has 8 Fragments in 5 Contigs

The prion.view example contig display is shown below:

GELVIEW Fragment Assembly contig display of Project: Prion
    July 11, 1991  13:13

Contig: For-6

   3       For-7                          +-------------------->
   2       For-6       +------------------->
   C       CONSENSUS   +--------------------------------------->
                       |----------|----------|----------|---------|---------|
                       0         200        400        600        800

Contig: For-1

   5       For-3                                 +---------------->
   4       For-2                   +----------------->
   3(   2) For-1        +------------->
   2       For-1       +------------->
   C       CONSENSUS   +------------------------------------------>
                       |----------|----------|----------|---------|---------|
                       0         200        400        600        800

//////////////////////////////////////////////////////////////////////////////

8 Fragments in 5 Contigs

Notice in my example the program discovered two groups of overlapping sequences and in one of them a fragment occurs twice. Three other fragments were not found to align with anything.

7) Check out the assembled pieces.

For the verification and adjustment phase type gelassemble:

% gelassemble

GELASSEMBLE is a multiple sequence editor for viewing and editing
contigs assembled by GELMERGE.


   5       For-3                                 +---------------->
   4       For-2                   +----------------->
   3(   2) For-1        +------------->
   2       For-1       +------------->
   C       CONSENSUS   +------------------------------------------>

|----------|----------|----------|----------|----------|
                       0         200        400        600        800

Contig         1  of   5

Scroll with <up-arrow>, <down-arrow>, <Prev-Screen>, and
<Next-Screen>

<right-arrow> for next contig, <left-arrow> for previous contig
<Ctrl-D> for Screen Mode, <Ctrl-K> to load a contig:

Okay, now press <Ctrl-k> to "load a contig." The screen changes to show both the pictorial representation, the "Big Picture," and the actual sequence data. Notice that the cursor is placed at the beginning of the bottommost fragment sequence. The longer sequence below it is the consensus for the whole contig. After moving my cursor to the top fragment, the following screen appears:

GelAssemble                          For-3
GCG
                     Absolute:     489   Relative:       1



TACATTTGGGCAGTGACTATGAGGACCG.TTACTATCGTGAAAACATGCACCGTTAC
AGTGCCATGAGCAGGCCCATCATACATTTCGGCAGTGACTATGAGGACCGGTTACTATCGTGAAAACATGCACCGTTAC


AGTGCCATGAGCAGGCCCATCATACATTTCGGCAGTGACTATGAGGACCGgTTACTATCGTGAAAACATGCACCGTTAC
....|.........|.........|.........|.........|.........|.........|.........|....
   470       480       490       500       510       520       530       540
   7
   6
   5A                                    *-------------------->
   4                  +---------------------->
   3    +----------------->
   2   +---------------->
   C  M+------------------------------------------------------>

|......|......|......|......|......|......|......|......|......|......|
       0     100    200    300    400    500    600    700    800    900
1000

<Ctrl-D> for Command Mode, or ? for help
                                                                Edit Mode:INS

Fragment offset may have to be adjusted slightly to improve alignments in the contig. This is done by adding or subtracting spaces with the space bar and delete key (the space bar will move the sequence right regardless of where the cursor is, but be careful with the delete key; if the cursor is within a sequence it will delete the character to its left). Adjust the alignment if needed and decide whether to accept or reject the contig. Cursor motion can be controlled with the direction keys and/or with the commands listed in the command summaries of the Program Manual; all GelAssemble commands are also accessible by typing a "?". Several powerful control key commands help you find your way around the edit screen. Some extremely useful cursor screen mode commands follow:

Several additional features of the screen mode display are worth noting:

Discrepancies are highlighted; check for any areas of highlighting at the junctions. Gaps may have to be introduced to improve alignment of the junctions; insert "n's" to represent possible deletions in your reading of the gel. It seems to help if you work on junction problems from the top down and from the right end toward the left; the reasons become apparent as larger contigs are managed -- the anchoring concept becomes important and you'll have to worry about it less by working down and in. Any changes made within one sequence will affect all the other alignments as soon as you are working with more than just two. Therefore GCG has built in the anchoring function -- pay attention! Any changes made within an anchored fragment are propagated throughout all anchored fragments! This can be a tremendous help and/or a terrible hindrance when more than just a pair of fragments are being worked with -- just be careful. <KP7> and <KP8> anchor and unanchor sequences respectively from the screen mode.

A few tips that I have discovered are:

After all editing, press <Ctrl-D> to enter command mode, in which global manipulations can be accomplished; some that I've found very useful follow:

The LOAD command allows a whole different contig to be added onto an existing one and fit in manually. This can be very helpful in cases where GelMerge fails to discover an overlap that is known to be present.

After you are satisfied with the assembled contig you must save any changes made while in the editor by issuing the write command. The contig is written to the database named after the lowermost fragment within it and the cursor returns to its last position in the sequence.

Next you need to perform the same type of manipulations on the other contigs found by GelMerge, so reenter command mode and enter the command contig. The same contig that you just worked on will be displayed. However, you don't want to work on it again, so use the cursor direction keys to select a different contig and load that one in the same fashion as before. Again, adjustments are made as necessary, a consensus is reformed, and the contig is saved with write. When all of your contigs have been processed, exit GelAssemble by reentering command mode and typing exit. The current contig is written and the system prompt is returned.

8) Find new overlaps.

In all likelihood, more than one contig will still be present after your first run through GelMerge and GelAssemble. Therefore, run GelMerge again, only this time decrease the stringency some, for example, by changing the match fraction from 0.80 to 0.60:

% gelmerge

GELMERGE aligns the sequences in a fragment assembly project into
assemblies called contigs.  You can view and edit these
assemblies in GELASSEMBLE.

 What word size (* 7 *) ?  <rtn>

 What fraction of the words in an overlap must match (* 0.80 *) ? 0.60

 What is the minimum overlap length (* 14 *) ?  <rtn>

   Reading ....................

 Comparing ....................

  Aligning .................

   Writing ...

          Input Contigs:          5
         Output Contigs:          1

               CPU time:       3.24
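
To make the three GelMerge prompts more concrete, here is a minimal Python sketch of one way a word size, a match fraction, and a minimum overlap length could interact when testing whether the end of one fragment overlaps the start of another. It is not GelMerge's actual algorithm, which this exercise does not describe: the function names are hypothetical, the comparison is ungapped, and only one orientation is tried. The example reads are a 30-base stretch of prion sequence, with one deliberately miscalled base in the second read.

```python
def word_match_fraction(left, right, word_size):
    """Fraction of aligned words (k-mers) that are identical in two equal-length strings."""
    n_words = len(left) - word_size + 1
    if n_words <= 0:
        return 0.0
    matches = sum(
        left[i:i + word_size] == right[i:i + word_size] for i in range(n_words)
    )
    return matches / n_words


def find_overlap(frag_a, frag_b, word_size=7, min_fraction=0.80, min_overlap=14):
    """Longest overlap of the end of frag_a with the start of frag_b that passes
    the thresholds, or None. Ungapped comparison only (an assumption of this sketch)."""
    longest = min(len(frag_a), len(frag_b))
    for length in range(longest, min_overlap - 1, -1):
        tail, head = frag_a[-length:], frag_b[:length]
        if word_match_fraction(tail, head, word_size) >= min_fraction:
            return length
    return None


if __name__ == "__main__":
    read_a = "ATGTGTATCACCCAGTACGAGAGGGAATCT"
    # Same 30 bases with base 16 miscalled (T read as A), followed by new sequence.
    read_b = read_a[:15] + "A" + read_a[16:] + "TTTTTTTTTT"
    print(find_overlap(read_a, read_b, min_fraction=0.80))
    print(find_overlap(read_a, read_b, min_fraction=0.60))
```

In this example a single miscalled base spoils 7 of the 24 seven-base words in the 30-base overlap, leaving a match fraction of about 0.71. The overlap is therefore missed at 0.80 and found at 0.60, which is the kind of effect that lowering the stringency is meant to exploit.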

Rerunning GelView now shows an entirely different picture:

% gelview

GELVIEW displays the structure of the contigs in a fragment
assembly project.

 What should I call the output file (** prion.view **) ?  <rtn>

 Prion has 8 Fragments in 1 Contigs

Display the resulting file:

% more prion.view

GELVIEW Fragment Assembly contig display of Project: Prion
    September 15, 1993  11:24

   9       For-8                                             +----->
   8       For-7                                       +------>
   7       For-6                                +------>
   6       For-5                            +---->
   5       For-4                    +-------->
   4       For-3               +----->
   3       For-2          +----->
   2       For-1       +--->
   C       CONSENSUS   +------------------------------------------->
                       |----------|----------|----------|---------|---------|
                       0         600       1200       1800       2400


 8 Fragments in 1 Contigs

9) Repeat the process as many times as it takes.

The new contigs will need to be checked, so relaunch GelAssemble. Again, proceed into screen mode and load the contigs just the same as before. Adjustments are made as necessary, a new consensus is created, the contig is saved with write, and contig is used to repeat the process with any other contigs that might still be present. Be careful with anchoring -- pay attention to the "A's" in the diagram at the bottom of the screen. Anchoring doesn't always behave as nicely as one would like it to. Once you are satisfied with all the alignments exit GelAssemble.

GelView can verify the assembly process at any stage. As mentioned before, a convenient output filename to give GelView is "term". This causes the output to scroll to the screen rather than creating a new file. If there is still more than one contig present after your second round through, another iteration through the system, using even less stringent overlap parameters, should bring in the final alignment. Try decreasing the minimum overlap required as well as decreasing the match fraction; only decrease the word size as a last resort.

Finally, a resulting consensus sequence can't do us much good if it's stuck in the assembly system's database. Therefore, after you are finished assembling your complete sequence, relaunch GelAssemble. From the command line of screen mode and with the cursor's current position on the consensus, enter the command seqout; this will cause the consensus sequence generated by the system to be written out to the current directory. GelAssemble will prompt you for a file name; give it something appropriate. Another couple of handy GelAssemble commands at this point, or at any other for that matter, are prettyout and bigpicture. These write the alignment to output files at the base-by-base sequence level and at the pictorial level, respectively. If any of your fragments are anchored, only those anchored fragments will be exported. Therefore, make sure all of your fragments are either anchored or unanchored and then give these commands, accepting the default file names; and lastly, exit GelAssemble.

% gelassemble

GELASSEMBLE is a multiple sequence editor for putting sequences
together into assemblies called contigs.

GelAssemble                    *** CONSENSUS ***
GCG
                     Absolute:     703   Relative:     703





ACCAGAGAnnnTCGAGCATGGnnnnCTTCTCCTCTCCAC
ATGTGTATCACCCAGTACGAGAGGGAATCTCAGGCCTATTACCAGAGAGGATCGAGCATGGTCCTCTTCnnnnCTCCAC
ATGTGTATCACCCAGTACGAGAGGGAATCTCAGGCCTATTACCAGAGAggaTCGAGCATGGtcctCTTCtcctCTCCAC
....|.........|.........|.........|.........|.........|.........|.........|....
   690       700       710       720       730       740       750       760
  10A                               +--->
   9A                          +---->
   8A                     +---->
   7A                  +-->
   6A            +----->
   5A        +--->
   C   +--------*----------------------->

|......|......|......|......|......|......|......|......|......|......|
       0     500   1000   1500   2000   2500   3000   3500   4000   4500   5000

 What should I call the output file (** For-1.Consensus **)  ?  prion.consensus
                                                                 Edit Mode:INS
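
The consensus that seqout writes is built column by column from the aligned fragments, using standard ambiguity codes where fragments disagree. The following Python sketch illustrates that idea only; GelAssemble's exact rule is not given in this exercise. Two choices here are assumptions: an "n" or a blank in a fragment is treated as contributing no information, and the lowercase that GelAssemble shows in uncertain columns is not reproduced. The function name and the space-padded input rows are hypothetical.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases observed.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def consensus(rows):
    """Column-wise consensus of aligned fragments.

    Each row is one fragment, left-padded with spaces to its offset in the contig.
    Only A, C, G and T count as evidence; a column with no evidence becomes N.
    """
    width = max(len(row) for row in rows)
    result = []
    for col in range(width):
        observed = {row[col].upper() for row in rows if col < len(row)}
        bases = frozenset(observed & set("ACGT"))
        result.append(IUPAC[bases] if bases else "N")
    return "".join(result)


if __name__ == "__main__":
    # Second fragment starts one column to the right of the first.
    print(consensus(["ACGTn", " CGAA"]))
```

In the small example the fourth column holds a T in one fragment and an A in the other, so the sketch reports W; the final column is resolved by the second fragment because the first has only an "n" there.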

10) Corroborate your sequence with the "real thing."

Next, we're going to cheat and bring in the actual sequence. Since any true sequencing project would use more than just one fragment per stretch of DNA, usually sequencing both the forward and reverse strands twice apiece, I suppose it's not really cheating that badly. Remember that near the end of last week's exercise you ran FindPatterns with your probe/primers against the appropriate subdivision of GenBank; you will check and use those results now. They should corroborate what you found in your preliminary database searches in Exercise Three. Here we are interested in the genomic sequence, not the mRNA/cDNA.

Move over into your exercise 5 subdirectory and take a look at your FindPatterns output file. This file is liable to be pretty huge, so use the search function within more to find the relevant entry. Display the file with the more command; once the file has loaded, type a forward slash "/" followed by the search string. In my case, I searched for the word "prion" to quickly locate all hits on prion sequences. You only have to type the search phrase once; thereafter, just type the slash and press return to repeat the search. You are looking for entries that refer to the genomic or gene sequence but not mRNA/cDNA entries. The relevant portion from the prion example follows:

% more prion.finds

! FINDPATTERNS on gb_pr:* allowing 3 mismatches

! Using patterns from: primer.dat  January 22, 1996 15:16 ..

            AGU08309  ck: 651   len: 696   ! U08309 Ateles geoffroyi prion
protein gene, partial cds. 2/95

F1                    AACCGCTACCCCCCCCAG
           118: GAGGC aaccgctacccaccccag GGTGG mis=1

F2                    CAACCGCTACCCCCCCCAG
           117: GGAGG caaccgctacccaccccag GGTGG mis=1

F3                    AACCGCTACCCCCCCCAGG
           118: GAGGC aaccgctacccaccccagg GTGGT mis=1

F4                    CTGGAACACCGGCGGCAG
            69: GGAGG atggaacactgggggcag CCGAT mis=3

R1 /Rev               CCGTGATCCTGCTGATCA
           668: CCCAC ctgtgatcctcctgatct CTTTC mis=3
--More--(0%)
/prion

            HUMPRPRA  ck: 4384  len: 673   ! M81929 Human prion protein gene,
5' end. 1/95

F1                    AACCGCTACCCCCCCCAG
           116: GAGGC aaccgctacccacctcag GGCGG mis=2

F2                    CAACCGCTACCCCCCCCAG
           115: GGAGG caaccgctacccacctcag GGCGG mis=2

F3                    AACCGCTACCCCCCCCAGG
           116: GAGGC aaccgctacccacctcagg GCGGT mis=2

F4                    CTGGAACACCGGCGGCAG
            67: GGAGG atggaacactgggggcag CCGAT mis=3

            HUMPRPRB  ck: 66    len: 699   ! M81930 Human prion protein gene,
5' end. 1/95
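
When the FindPatterns file is very large, the same screening can be scripted. The Python sketch below pulls out entry header lines whose title contains a keyword and drops those that mention mRNA or cDNA. The header layout in the regular expression is inferred from the listing above, and the sketch assumes each header sits on a single line in the file (the wrapped titles above are taken to be an effect of the terminal width). The function name is hypothetical.

```python
import re

# Assumed header layout, inferred from the listing above:
#   NAME  ck: <checksum>  len: <length>  ! <accession> <title ...>
HEADER = re.compile(r"^\s*(\S+)\s+ck:\s*(\d+)\s+len:\s*(\d+)\s+!\s*(\S+)\s+(.*)$")


def genomic_entries(lines, keyword="prion", exclude=("mrna", "cdna")):
    """Return (name, accession, length, title) for entry headers that mention
    the keyword and none of the excluded words (case-insensitive)."""
    hits = []
    for line in lines:
        match = HEADER.match(line)
        if match is None:
            continue  # pattern hits, comments and blank lines are skipped
        name, _checksum, length, accession, title = match.groups()
        lowered = title.lower()
        if keyword.lower() in lowered and not any(word in lowered for word in exclude):
            hits.append((name, accession, int(length), title.strip()))
    return hits


if __name__ == "__main__":
    with open("prion.finds") as handle:  # the FindPatterns output file from above
        for name, accession, length, title in genomic_entries(handle):
            print(f"{name:<10} {accession:<8} {length:>6}  {title}")
```

A keyword filter like this only narrows the list; whether an entry really is the genomic sequence you need still has to be checked against its title and with your lab instructor.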

In all cases the sequences you need are genomic and not mRNA/cDNA, so be sure to pick entries that correspond to the genomic DNA sequence for your chosen protein. In some cases there may be genomic sequences available for your particular protein from more than one organism of the desired sort, and in other cases the genomic sequence from one organism may be spread over more than one entry. Therefore, check with your lab instructor before proceeding with the next step to be sure that you are choosing the correct genomic sequence(s); otherwise, it will not work.

The entry name will be all uppercase, followed by the accession code and title line in the FindPatterns output. You should have found these genomic entries while database browsing in Exercise Three and they will correspond to the cDNA entries that you used in Exercise Five. Now move back over to your Exercise Six subdirectory and use Gap to compare your protein's actual DNA databank entry to your assembled sequence's consensus using the appropriate database:name convention. If your sequence's database entry is in multiple pieces, then you will need to run Gap analyses of each one separately, using the results of the first to help guide the starting position of the subsequent ones. Otherwise, Gap will try to align each piece from the very beginning of your consensus sequence. Give the output file the extension gelpair. In my prion example's case no full length genomic sequence exists in the database, so I will illustrate with the cDNA:

% gap

GAP uses the algorithm of Needleman and Wunsch to find the alignment of
two complete sequences that maximizes the number of matches and minimizes
the number of gaps.

 GAP of what sequence 1 ?  genbank:humprp

                  Begin (* 1 *) ?  <rtn>
                End (*  2415 *) ?  <rtn>
               Reverse (* No *) ?  <rtn>

 to what sequence 2 (* gen1:humprp *) ?  prion.consensus

                  Begin (* 1 *) ?  <rtn>
                End (*  2415 *) ?  <rtn>
               Reverse (* No *) ?  <rtn>

 What is the gap weight (* 50 *) ?  <rtn>

 What is the gap length weight (* 3 *) ?  <rtn>

 What should I call the paired output display file (* humprp.pair *) ? humprp.gelpair

 Aligning .................................................................

          Gaps:     0
       Quality: 24150
 Quality Ratio: 10.00
  % Similarity: 100.000
        Length:  2415

The screen trace shows the percent similarity of the two sequences and the output file can be used to locate the exact position of any mismatches. If you have some discrepancies, launch SeqEd to correct the mistakes in your consensus sequence.
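
Locating the mismatches amounts to walking the two aligned sequences column by column. The Python sketch below does this for two aligned strings of equal length and reports each mismatch by its position in the second sequence, so that a mismatch can be found again in the consensus file. It does not parse a .gelpair file and does not reproduce Gap's own similarity calculation; the percentage here is plain identity over columns where neither sequence has a gap. The function name and the choice of "." and "-" as gap characters are assumptions.

```python
def mismatch_report(aligned_ref, aligned_query, gap_chars=".-"):
    """Return (percent_identity, mismatches) for two aligned, equal-length strings.

    Each mismatch is (position_in_query, ref_base, query_base), with positions
    counted from 1 over the query's own bases (gaps in the query are not counted).
    """
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have equal length")
    query_pos = 0
    compared = identical = 0
    mismatches = []
    for ref_base, query_base in zip(aligned_ref.upper(), aligned_query.upper()):
        query_is_gap = query_base in gap_chars
        if not query_is_gap:
            query_pos += 1
        if query_is_gap or ref_base in gap_chars:
            continue  # gapped columns are skipped, not scored
        compared += 1
        if ref_base == query_base:
            identical += 1
        else:
            mismatches.append((query_pos, ref_base, query_base))
    percent = 100.0 * identical / compared if compared else 0.0
    return percent, mismatches


if __name__ == "__main__":
    percent, mismatches = mismatch_report("ACGT.ACGT", "ACCTTAC.T")
    print(f"{percent:.1f}% identical")
    for position, ref_base, query_base in mismatches:
        print(f"consensus position {position}: {query_base} should be {ref_base}")
```

Reporting positions in the coordinates of the second sequence is deliberate: those are the numbers you would look for on the ruler of a sequence editor when correcting the consensus.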

% seqed

prion.consensus            ***** K E Y B O A R D *****                 SEQED

TTGTAAATGTTTAATATCTGACTGAAATTAAACGAGCGAAGATGAGCACC
....|.........|.........|.........|.........|.........|.........|.........|....
  2370      2380      2390      2400      2410      2420      2430      2440

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
    |......|......|......|......|......|......|......|......|......|......|
    0     500   1000   1500   2000   2500   3000   3500   4000   4500   5000

 "prion.consensus"  2415 nucleotides

11) Finishing up.

At the conclusion of this exercise, rename (mv) four files and remote copy (rcp) them to the teacher account. Do not rename all of the files in your directory. Also complete the Exercise report form with the pico editor, then remote copy it to teacher. This is all done exactly as in previous exercises: rename the selected files with your last name as the filename and retain the standard GCG file extension. The four files, followed by the report form, are:

  1. gelassemble.bigp
  2. gelassemble.prty
  3. your consensus sequence from the FAS session
  4. the .gelpair file from the above Gap run (before making any corrections!)
  5. the week6s report form

Since all five files will have a common filename, rcp them all at once:

% rcp lastname.* teacher@ribozyme:receive
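
If you want to double-check the renaming before typing the mv commands, the rule (last name as the filename, original extension kept) can be sketched in a few lines of Python. This only prints the commands and does not touch any files; the function name is hypothetical, and the consensus and gelpair filenames are the ones used in the prion example, so substitute your own.

```python
from pathlib import PurePosixPath


def rename_plan(filenames, last_name):
    """Map each file to '<last_name><original extension>' without touching the disk."""
    plan = {}
    for name in filenames:
        suffix = PurePosixPath(name).suffix  # e.g. ".bigp"
        target = last_name + suffix
        if target in plan.values():
            raise ValueError(f"two files would both become {target}")
        plan[name] = target
    return plan


if __name__ == "__main__":
    files = ["gelassemble.bigp", "gelassemble.prty", "prion.consensus", "humprp.gelpair"]
    for old, new in rename_plan(files, "lastname").items():
        print(f"mv {old} {new}")
```

The collision check matters because two files with the same extension would otherwise overwrite one another once they share a filename.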

Log out of ribozyme and then exit the emulator. As before, leave the computer turned on.

GCG has provided an extremely powerful utility for building up DNA sequences from individual sequencing fragments with the FAS package. It is hard to imagine trying to put together all this information without help from the computer -- unfortunately many biologists still live in the dark ages, as far as computer technology is concerned, and do not utilize this tremendous tool, or similar ones available. Please check it out; learn the system and spread the word! Next week -- How to tell if and where you've got any structural genes within your sequence -- Gene Finding Strategies.

Reference

Staden, R. (1980). A New Computer Method for the Storage and Manipulation of DNA Gel Reading Data. Nucleic Acids Research 8, 3673-3694.