'97 BC/BP 578

Week 4

Sequence Series

Learning about entering data for various uses. Exploring data formats and how to use them.

Author:

Susan Jean Johns

Background Information on Data Entry

There are two basic ways to enter data into a computer. One is to use an editor or a piece of software to create a new data file in the proper format. The second is to modify an existing data file already in the proper format, by editing or some other means, to contain the new data.

The steps to create a data file from scratch are as follows:

1) collect the necessary raw data for the creation of the file
2) become familiar with the format needed for the desired computing tasks
3) determine the proper way to create the data, through editing or software usage
4) enter the data
5) check the created data for errors by visual examination
6) check the created data for errors by using it in a computing task similar to
   the desired one, and see if it functions properly


To modify an existing data file:

1) collect the necessary raw data 
2) become familiar with the format being used
3) determine the proper way to change the data, through editing or software usage
4) change the existing data
5) check the changed data for errors by visual examination
6) check the changed data for errors by using it in a computing task similar to
   the desired one, and see if it functions properly

You must know the type of format the data needs to be in prior to entering it.

For doing sequence analysis tasks, data will be entered in the GCG format. There are other possible data formats, but GCG will be used in this course since that is the software package supported for sequence analysis by the VADMS Center.

Molecular modelling tasks have a greater variety of potential formats into which the data can be entered. The format used depends on the size of the molecule, the source of its raw data, and the software used to produce the raw data.

You can enter small molecular structures directly into a given program's format through the program's graphical interface followed by minimization.

Large molecular data sets, such as those for proteins or nucleotides, are usually entered in a PDB format. New data of this type is collected by sophisticated computer software and the output is usually some form of a PDB format. At times, this data must be modified to make it workable with currently available visualization software. Some visualization packages read ASCII PDB data files directly, while others use their own form of this data and require data file conversion.

Format conversion for molecular modelling packages is a well-known problem. Many packages have their own conversion programs to allow data created with other software to be used within their program.



Background Information on Sequence Analysis Data Formats

At WSU, the software supported for sequence analysis tasks uses a GCG format. The databases are created and stored using software from GCG. GCG uses one-letter code for the sequence data, which is more compact than three-letter code. GCG sequence files have header information at the beginning of the file, the actual sequence at the end of the file, and a checksum line between the two sections.

The nature of the header section of the file depends on the database from which it was extracted or the verboseness of the person who created the file. If the file is from a database, there will be information on the name of the sequence, its source, its accession number, references, and feature information about the sequence. A sequence file that is not from a database may contain anything in its header section. The information placed there depends solely on the whims of its creator. Hopefully, at a minimum, the name of the sequence and some information about its preparation or features will be there.

Located between the header and sequence sections is what GCG refers to as the checksum line. It contains the filename of the sequence file, the length of the sequence, the date the file was created, the type of data it is (P for protein and N for nucleotide), and a number. This number is used in GCG programs to see if any scrambling of the data has occurred for whatever reason. After the checksum number are two periods. GCG uses the location of these two periods to signal the end of non-sequence material in the file, and the beginning of the actual sequence information.
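The fields of the checksum line can be picked apart with a short script. The following is a minimal sketch in Python, not part of the GCG package. The pattern is read off the two example files in this handout, so the exact spacing and field order are assumptions that may not hold for every GCG release. The name parse_checksum_line is made up for this illustration.

```python
import re

# Pattern for the dividing line described above. The field layout is taken
# from the example files in this handout; other GCG versions may differ.
CHECKSUM_LINE = re.compile(
    r"^\s*(?P<name>\S+)\s+Length:\s*(?P<length>\d+)\s+(?P<date>.*?)"
    r"\s+Type:\s*(?P<type>[PN])\s+Check:\s*(?P<check>\d+)\s*\.\.\s*$"
)


def parse_checksum_line(line):
    """Return the fields of a GCG checksum line as a dict, or None."""
    match = CHECKSUM_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Length and checksum are numbers; the other fields stay as text.
    fields["length"] = int(fields["length"])
    fields["check"] = int(fields["check"])
    return fields


info = parse_checksum_line(
    "   1CRN  Length: 46  January 18, 1996 07:16  Type: P  Check: 923  .."
)
print(info["name"], info["length"], info["type"], info["check"])
```

A line that does not end in two periods, or that lacks one of the fields, gives None rather than a partial result.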

In the sequence portion of the file, the data is shown in blocks of ten, with fifty characters to a line. Each data line is preceded by a number showing the position in the sequence of the first character in that line.
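The layout just described can be reproduced with a few lines of Python. This is only a sketch of the layout, not the code GCG uses. The eight-column width for the position number is read off the example files later in this handout and is an assumption. The function name format_gcg_sequence is invented for the illustration.

```python
def format_gcg_sequence(sequence, block=10, per_line=5):
    """Lay out a sequence in numbered lines of space-separated blocks."""
    width = block * per_line  # 50 sequence characters per line
    lines = []
    for start in range(0, len(sequence), width):
        chunk = sequence[start:start + width]
        blocks = [chunk[i:i + block] for i in range(0, len(chunk), block)]
        # The position of the first character is right-justified in eight
        # columns; that width is taken from the examples, not from a spec.
        lines.append(f"{start + 1:>8}  " + " ".join(blocks))
    return lines


for line in format_gcg_sequence("TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN"):
    print(line)
```

A 46-residue sequence fits on one line, while a 479-base sequence needs ten lines, the last one starting at position 451.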

With this format information, you can create a GCG-convertible file. Use an editor to create a file that contains whatever is desired in the header information. Between the header information and the actual sequence data, there must be a line containing just two periods, one right after the other. The sequence can be entered in any manner you find convenient. No numbering is necessary, and a three-letter code may be used if desired. After the file is created, run the GCG program REFORMAT. REFORMAT will convert this file from a text file into a usable GCG sequence file. Command switches can be used to handle the conversion of three-letter codes to one-letter codes.
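As an illustration of the raw layout, here is a minimal Python sketch that assembles the text of such a file. The helper name make_raw_gcg is hypothetical. The sketch refuses header lines that contain two adjacent periods, on the assumption (based on the server exercise later in this handout) that REFORMAT would mistake them for the divider. The melittin data is the sequence used in section 5a.

```python
def make_raw_gcg(header_lines, sequence):
    """Build the text of a raw file: header, a '..' line, then the sequence."""
    for line in header_lines:
        # Two adjacent periods in the header could be taken for the divider.
        if ".." in line:
            raise ValueError(f"header line contains '..': {line!r}")
    return "\n".join(list(header_lines) + ["..", sequence]) + "\n"


text = make_raw_gcg(
    ["melittin", "source: honeybee"],
    "GIGAVLKVLTTGLPALISWISRKKRQQ",
)
print(text, end="")
```

The output would still have to be saved to a file and run through REFORMAT before GCG programs could use it.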

Modifying an existing sequence file can be done by making the necessary changes in the header and sequence sections. After the use of REFORMAT on this changed file, the new output file containing the modified data is ready for use.

In the GCG software suite a program called SEQED is normally used to enter and/or modify sequence data. This program can be run from a subdirectory in which the keyboard has been redefined to allow easy entry of nucleotide data.

Like GCG, other software packages have their own data format for storing sequence data files. When working with data gathered off the nets or from other computer users, it is not unusual to have conflicting data formats to deal with. Luckily, conversion software exists to handle the problem of converting one data format into another.

While databases are stored in the GCG format here, they are not in all places. Each of the databases has its own individual format in which it stores information. The main differences are in the way reference information is presented, the order in which the parts of the information are presented, and the length of their respective access codes. Therefore, data on the same sequence from various databases will look different, but contain the same information.



Examples of GCG formatted data files.

A protein file from the NRL_3D database in GCG format.

P1;1CRN - crambin - Abyssinian crambe
C;Species: Crambe abyssinica (Abyssinian crambe)
A;Note: seed
R;Hendrickson, W.A.; Teeter, M.M.
submitted to the Brookhaven Protein Data Bank, April 1981
A;Reference number: A50099; PDB:1CRN
R;Teeter, M.M.
Proc. Natl. Acad. Sci. U.S.A. 81, 6014, 1984
A;Title: Water structure of a hydrophobic protein at atomic resolution.
 pentagon rings of water molecules in crystals of crambin.
R;Hendrickson, W.A.; Teeter, M.M.
Nature 29, 107, 1981
A;Title: Structure of the hydrophobic protein crambin determined directly from
 the anomalous scattering of sulphur.
R;Teeter, M.M.; Hendrickson, W.A.
J. Mol. Biol. 12, 219, 1979
A;Title: Highly ordered crystals of the plant seed protein crambin.
C;Resolution: 1.5 angstroms
C;Determination: X-ray diffraction
C;Keywords: seed
F;1-4,32-35/Region: beta sheet
F;7-19/Region: helix (right hand alpha) (3/10 conformation res 17,19)
F;23-30/Region: helix (right hand alpha) (distorted 3/10 at res 30)
F;41-44/Region: turn
F;3-40/Disulfide bonds:
F;4-32/Disulfide bonds:
F;16-26/Disulfide bonds:

   1CRN  Length: 46  January 18, 1996 07:16  Type: P  Check: 923  ..

       1  TTCCPSIVAR SNFNVCRLPG TPEAICATYT GCIIIPGATC PGDYAN

The NRL_3D database was derived from the PIR (Protein Information Resource) system and uses their conventions for formatting the sequence information. The first line of the file says that the data is about a protein (P1;), its access code is 1CRN, the name of the protein is crambin and its source is Abyssinian crambe. Lines which start with R; denote the start of reference information. The C; denotes a comment line and the F; a feature information line.

Since this file is in GCG format, a checksum line separates the header or reference information from the sequence data. This line contains the access code for the sequence in the database, its length, the type of sequence it is (P for protein) and the checksum number for the sequence. Note the two periods at the end of the line. The location of these two periods allows GCG to determine where the header information ends and the sequence data starts.
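To make the idea of a checksum concrete, the sketch below computes a position-weighted sum of the sequence characters in Python. This is not taken from GCG documentation. It follows the scheme used by freely available format converters, and the details (weights cycling from 1 to 57, upper-casing, remainder modulo 10000) should be treated as assumptions. It does reproduce the Check: 923 value in the crambin file above.

```python
def gcg_checksum(sequence):
    """Position-weighted checksum; the 1..57 weight cycle is an assumption."""
    total = 0
    for index, symbol in enumerate(sequence):
        # Weight each character code by its position, restarting after 57.
        total += (index % 57 + 1) * ord(symbol.upper())
    return total % 10000


crambin = "TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN"
print(gcg_checksum(crambin))  # prints 923
```

Because each character is weighted by its position, swapping two different residues or dropping one generally changes the number, which is how scrambled data gets noticed.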

At the end of the file is the actual sequence data. The default GCG format puts the sequence into numbered lines containing blocks of 10 characters, 5 blocks to a line. The number at the beginning of the line is the position in the sequence of the first character in the first block on that line of data. Since crambin is only 46 residues long it takes a single line to show the entire sequence.

A nucleotide file from the GenBank database in GCG format.

LOCUS       A22411        479 bp    DNA             PAT     09-NOV-1994
DEFINITION  DNA for CMV Ribozyme RNA-spacer-antisense RNA.
ACCESSION   A22411
NID         g641479
KEYWORDS    .
SOURCE      unidentified.
  ORGANISM  unidentified
            unclassified.
REFERENCE   1  (bases 1 to 479)
  AUTHORS   Muellner,H., Uhlmann,E., Eckes,P., Schneider,R. and Uijtewaal,B.
  TITLE     Multifunctional RNA with self-processing activity, its production
            and use
  JOURNAL   Patent: EP 0421376-A 1 10-APR-1991;
            HOECHST AKTIENGESELLSCHAFT
COMMENT     NCBI gi: 641479
FEATURES             Location/Qualifiers
     source          1. .479
                     /organism="Artificial sequences"
BASE COUNT      134 a    110 c    133 g    102 t
ORIGIN

 A22411  Length: 479  January 18, 1996 07:27  Type: N  Check: 6241 ..

       1  CCGGGAGGTA GCTCCTGATG AGTCCGTGAG GACGAAACAA CCTTGTCGTC

      51  GACAAAATGG TCAGTATGCC CCTCGAGTGG TCTCCTTATG GAGAACCTGT

     101  GGAAAACCAC AGGCGGTACC CGCACTCTTG GTAATATCAG TGTATTACCG

     151  TGCACGAGCT TCTCACGAAG CCCTTCCGAA GAAATCTAGG AGATGATTTC

     201  AAGGGTAGCT CGACAACCTG GATCCAAAAT GGTCAGTATG CCCCCCATGG

     251  CAACAGATTG GCGAATGAGA AAGTGGGTGG AGGACTTATC ATAGTAACAG

     301  AAGAGAGACT AGAACTGCAG AAAATGGTCA GTATGCCCCA GATCTACCGG

     351  AGGTTCTACT AGCATTGGGA GAGCTCGATT TGTCCATAGG CACACTGAGA

     401  CGCAAAAAGC TTAAGGTTGT CGAGCTACCG GGGCCCAGGG CATACTCTGA

     451  TGAGTCCGTG AGGACGAAAC CATTTTGGG

GenBank uses its own conventions for formatting sequence information. The first line of the file shows its access code as A22411, that it contains a DNA sequence (DNA) which is 479 bases long, and was deposited in the database on Nov 9, 1994. The coding of the rest of the lines is easier to understand because this database uses longer identifiers than the PIR system does.

Since this file is in GCG format, a checksum line separates the header or reference information from the sequence data. At the end of the file is the actual sequence data. Since this sequence is 479 bases long, it takes 10 lines to show the entire sequence.
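Both example files can be taken apart the same way: everything up to the line with the two periods is header, and everything after it is sequence. The Python sketch below does this and compares the declared length with what is actually present. The name read_gcg is hypothetical. Treating the first line that contains ".." as the divider, and keeping only letters as sequence, are simplifying assumptions.

```python
import re


def read_gcg(text):
    """Split GCG file text into (header lines, declared length, sequence)."""
    lines = text.splitlines()
    # Assumption: the first line containing '..' is the checksum line.
    divider = next((i for i, line in enumerate(lines) if ".." in line), None)
    if divider is None:
        raise ValueError("no '..' divider found")
    declared = re.search(r"Length:\s*(\d+)", lines[divider])
    # Keep letters only, which drops position numbers and blanks.
    # Simplification: other symbols a sequence might contain are dropped too.
    sequence = "".join(
        ch for line in lines[divider + 1:] for ch in line if ch.isalpha()
    )
    length = int(declared.group(1)) if declared else None
    return lines[:divider], length, sequence


sample = (
    "P1;1CRN - crambin - Abyssinian crambe\n"
    "\n"
    "   1CRN  Length: 46  January 18, 1996 07:16  Type: P  Check: 923  ..\n"
    "\n"
    "       1  TTCCPSIVAR SNFNVCRLPG TPEAICATYT GCIIIPGATC PGDYAN\n"
)
header, declared_length, sequence = read_gcg(sample)
print(declared_length, len(sequence))
```

If the two numbers printed differ, the file has been damaged or edited without being reformatted.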



Background Information on Molecular Modelling Formats

Normally, sequence analysis folks don't need to know anything about molecular modelling formats other than that they differ depending on the size of the molecule being worked with. Small molecules may be reported in a number of different formats, while large molecules like proteins are in PDB format. Some software can access these PDB (ASCII) files directly; others have conversion programs to transform the data into their own internal formats.

Molecular modelling efforts require a modeller to be familiar with the general ideas of the PDB format for storing X-ray crystallographic data. A PDB file is an ASCII file with lines 80 characters long. In general, the data is divided into various subject areas, each identified by a code in the first six characters of a line. The access code for the structure and the line number in the file are located at the end of each line.

A listing of some of the common subject areas:

HEADER      type of the material studied
COMPND      name of the material studied
SOURCE      source of the material used for the crystal
AUTHOR      who did the work
JRNL        journal reference for the work
REVDAT      revisions to the original data submitted
REMARK      comments on some aspect of the crystallization process, the
            refinement process used, references or changes in the data
SEQRES      the sequence of the material studied
HET         the names of non peptide units in the structure other than water
FORMUL      the formula(s) for these non peptide units
HELIX       helical assignments within the structure and their type
SHEET       sheet assignments within the structure and their type
TURN        turn assignments within the structure and their type
SSBOND      the location of disulfide linkages in the structure
CRYST1      the crystal's cell parameters
ORIGX       transformation values
SCALE       scaling factors for the crystal
HETATM      atom data for non peptide units of the structure
ATOM        atom data for peptide residues of the structure
CONECT      connections between atoms in the structure
TER         the end of a protein chain
MASTER      line stating the number of various types of areas
END         the end of the file

Each code signifies a set format for its line. PDB files don't have tabs. The presence of tabs in a file that otherwise looks fine to the eye will cause conversion software to crash.
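Since tabs cannot be seen on the screen, it helps to let the computer look for them. The Python sketch below reports lines that contain a tab or run past 80 characters. The function name check_pdb_lines and the sample lines are invented for this illustration and do not come from a real PDB entry.

```python
def check_pdb_lines(lines):
    """Return (line number, problem) pairs for tabs and over-long lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if "\t" in text:
            problems.append((number, "contains a tab"))
        if len(text) > 80:
            problems.append((number, "longer than 80 characters"))
    return problems


# Made-up lines: the second hides a tab, the third is too long.
sample = ["HEADER    PLANT SEED PROTEIN\n", "REMARK\tnote\n", "X" * 81]
for number, problem in check_pdb_lines(sample):
    print(number, problem)
```

An empty result does not prove the file is valid; it only rules out these two problems.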

The use of these codes makes locating certain types of desired data from a PDB file very easy. Any data that has been made into an accepted subject area can be searched for with grep. Non subject area data can be more difficult to find.
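The same kind of search can be sketched in Python by comparing the first six columns of each line with the code wanted. The function name select_records is hypothetical. The sample lines are abbreviated illustrations built from the crambin disulfide positions listed earlier in this handout (3-40, 4-32, 16-26), not lines copied from a real PDB file.

```python
def select_records(lines, record_name):
    """Return the lines whose first six columns match the record name."""
    wanted = record_name.ljust(6)[:6]
    selected = []
    for line in lines:
        # Pad short lines such as END so all six columns can be compared.
        code = line.rstrip("\n")[:6].ljust(6)
        if code == wanted:
            selected.append(line)
    return selected


sample = [
    "HEADER    PLANT SEED PROTEIN",
    "SSBOND   1 CYS      3    CYS     40",
    "SSBOND   2 CYS      4    CYS     32",
    "SSBOND   3 CYS     16    CYS     26",
    "END",
]
for line in select_records(sample, "SSBOND"):
    print(line)
```

Comparing all six columns keeps similar codes apart. With grep, anchoring the pattern at the start of the line serves the same purpose.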

Care must be taken with PDB files for the following reasons: some workers in this field name things after themselves, the residue codes for unusual amino acids may vary, and the newest x-ray equipment appears to have developed its own order for the component atoms of a peptide residue that does not match the one originally used in earlier PDB data. These small changes can cause problems with software written to convert files from one format to another.

Week 4 Exercise

This series of exercises will acquaint you with entering data for a number of different uses. Items in these instructions which appear in bold should be entered followed by pressing the RETURN key.

1) Activate the computer

Pressing any key changes the terminal from screen saver mode to active.



2) Select the RIBOZYME icon

From the Launcher window, select the RIBOZYME icon by moving the arrow with the mouse over to the RIBOZYME icon and pressing the mouse button twice. Successful connection to ribozyme is denoted by the appearance of a ribozyme information line and a login: prompt.

IRIX (ribozyme)

login:


3) Log onto ribozyme.

Once the login: prompt appears, log on to the machine by entering first your account name to the login: prompt, and then your password to the Password: prompt.

Now that you are on ribozyme, enter sequence data into the computer. To do this, you will go through a number of steps designed to give you insight into how data entry works.



4) Create a subdirectory to keep this week's work in.

To keep data in separate working areas, it is necessary to create subdirectories. This is done with the mkdir command. Create the following subdirectory in your account.

% mkdir week4

Now move into that location using the following command line.

% cd week4

Copy over the data files needed for this week's activities.

% cp $GRAD_DIR/week4s/*.*  .


5) Entering sequence data into the computer.

There are two ways you can enter sequence data from scratch with the VADMS software. The first way uses an editor to create a file. To do this, you need to be familiar with the format required by the software the file will be used with. The second way uses the GCG program SEQED. This program will create a sequence file in compatible GCG format. See the GCG program manual for complete details. A brief description is given later in this exercise. The method you choose to enter a sequence depends on how comfortable you feel with editing files. Simple changes will allow a file created with one data entry method to be usable in another software system.

section 5a

In this section you will use the editor, pico, to create a sequence file in raw GCG format and then reformat it into final GCG format.

Enter the following protein sequence into a file with pico. The name of the protein is melittin, its source is the honeybee, and it is 27 amino acids long. Call your file bee.seq. The sequence you wish to enter is

GIGAVLKVLTTGLPALISWISRKKRQQ

% pico bee.seq

Now insert some sort of comment lines, naming the protein, its source, and any other information you feel is important for future reference. The better the comments made at the beginning of a user-entered sequence, the more useful that data will be to future users of the sequence. Follow this with two periods on a line by itself. This is used by GCG as a marker to denote the end of the header or comment information and the beginning of the actual sequence.

..

Enter the sequence using capital letters, since most formats require them and it's a good habit to get into. There are times when DNA sequencers use lower case letters to denote bases that they are not sure of. If you are using such a coding system for any reason, be sure it will not negatively affect any later analysis work you want to run on the sequence. Test it out on dummy data and if need be, keep two forms of the sequence data around -- one coded and the other in all upper case letters.
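Keeping the two forms of a sequence in step is easy to automate. The Python sketch below produces the all upper case form and records which positions were lower case in the coded form. The function name split_case_coding and the short DNA fragment are made up for the illustration.

```python
def split_case_coding(sequence):
    """Return the upper-case form and 1-based positions of lower-case symbols."""
    uncertain = [
        position
        for position, symbol in enumerate(sequence, start=1)
        if symbol.islower()
    ]
    return sequence.upper(), uncertain


# Dummy data: positions 5 and 11 are marked as uncertain by lower case.
upper, uncertain = split_case_coding("CCGGgAGGTAnCT")
print(upper)
print(uncertain)
```

The list of positions can go into the header comments of the upper case file, so the uncertainty is not lost.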

Finish the editing session by pressing Ctrl-x and responding to the exiting prompts appropriately.

Type the file you just created. See if you can spot any possible problems with the data prior to using your file in the GCG program REFORMAT.

% cat bee.seq
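Some of the problems worth looking for by eye can also be checked by a short script. The Python sketch below is only an aid and does not replace REFORMAT. It looks for exactly one line of two periods, stray ".." in the header, characters in the sequence that are not letters, and an unexpected length. The name check_raw_file is hypothetical.

```python
def check_raw_file(text, expected_length):
    """Return a list of problems found in a raw (pre-REFORMAT) file."""
    lines = text.splitlines()
    dividers = [i for i, line in enumerate(lines) if line.strip() == ".."]
    if len(dividers) != 1:
        return [f"expected one '..' line, found {len(dividers)}"]
    problems = []
    if any(".." in line for line in lines[:dividers[0]]):
        problems.append("header contains '..'")
    # Join the sequence lines and drop blanks before checking them.
    sequence = "".join("".join(lines[dividers[0] + 1:]).split())
    bad = sorted({symbol for symbol in sequence if not symbol.isalpha()})
    if bad:
        problems.append(f"unexpected characters in sequence: {bad}")
    if len(sequence) != expected_length:
        problems.append(f"length {len(sequence)}, expected {expected_length}")
    return problems


good = "melittin from honeybee\n..\nGIGAVLKVLTTGLPALISWISRKKRQQ\n"
print(check_raw_file(good, 27))
```

An empty list means none of these particular problems were found. A wrong residue typed in place of the right one would still slip through, so the visual check remains necessary.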

Now use your file with the GCG program REFORMAT. Invoke the GCG package by entering gcg. The GCG software package is set up to run programs simply by entering their name. You will be using the program REFORMAT to convert your edited file into a GCG compatible sequence file.

% gcg

The GCG welcome message appears on the screen.

REFORMAT is an interesting program that can be run in a variety of ways. Here you will be using the simplest aspects of the program. The program will ask you for the name of the file to work with and what to call the changed file. The default value will name it the same as before, so try to keep your raw data distinct from sequence files that will work in the system by using a different extension for each case. If the file is OK, there will be no error messages. If you get one, go back and revise bee.seq and repeat this process until there aren't any error messages. User input is shown in bold type.

% reformat

REFORMAT rewrites sequence file(s), symbol comparison table(s), or
enzyme data file(s) so that they can be read by GCG programs.

REFORMAT what sequence file(s) ?   bee.seq <rtn>

    bee.seq  length:  27 aa

Type the reformatted file and note the differences between it and the original file you entered with the pico editor. While the comment or header information you entered is the same, the two periods have been replaced by an elaborate checksum line and your sequence data is now in blocks of 10 characters. Examine the reported length for the reformatted sequence. If it is not the 27 that it should be, locate the problem, correct it via editing and reformat it again.

% cat bee.seq


section 5b

In this section you will use the GCG program SEQED to enter a simple protein sequence into the computer in GCG usable format.

Activate the SEQED program by entering its name at the prompt.

% seqed

The screen displays a double set of lines numbered from 0 to either 70 or 100. Since a filename was not given when the program was started, the software prompts you for the name of the sequence to work with. Respond as shown below. User input is shown in bold type.

SEQED of what sequence ? bee2.seq <rtn>

Because there is no such file in your present directory location to work with, the program starts prompting you for information to insert in the header or comment portion of the new file. The cursor moves to the top of the screen. Enter some comment lines. For this section you will be entering the same data as in section 5a, so put in comments relating to the name of the material and its source. When you are finished entering comment information, press Ctrl-d to return to sequence entry mode.

When the cursor moves to the first position on the top of the two lines, enter the sequence. As you enter the actual sequence, the cursor moves to the right and each new character appears on the screen. Symbols appear as you move along on the lower line as well. When you are finished, press Ctrl-d to go into the command mode of the program.

When you are in the command mode, a colon will be in the bottom left-hand corner of the screen. Save your efforts to an output file by entering exit. The program will write the file with the same name you gave it earlier, bee2.seq, and return a notice that the file contains so many residues, and then quit.

Type off the results of your efforts. Notice the P after the Type: term in the checksum line. This indicates that the file contains a protein sequence.

% cat bee2.seq

SEQED is a complex program with many different options. For more complete information on its operation, consult the GCG manual located in your carrel drawer.



6) Converting data received directly from a server.

Sometimes it is necessary to get data files directly from a database server. The database servers on the networks have their information updated every evening while VADMS' sequence databases are updated on a bi-monthly or quarterly basis. Therefore, a newly published paper may refer to an accession number or access code for a sequence that is not locally available. Assume for the purposes of this section that you have come across information that the needed sequence is in GenBank and its access code is M31742. Use the example given below to submit a request for this sequence. User input is shown in bold type.

% pine

Once in the mail utility, enter c for compose message. This will bring you to the COMPOSE MESSAGE screen. Fill in the lines as shown below.

To      :retrieve@ncbi.nlm.nih.gov<rtn>
Cc      :<rtn>
Attchmnt:<rtn>
Subj    :<rtn>
----- Message Text -----
datalib genbank <rtn>
begin <rtn>
m31742 <rtn>
<CTRL-x>

When the Send message? prompt appears, reply with y. The message will be written and sent off to the server. The screen returns to the mailer's main menu. It takes only a few moments for a reply to be returned to you. The speed of your response depends on the time of the day the request was submitted and how busy the networks are. Wait for 2 to 4 minutes then press the RETURN key twice. If all is well, there should be a new mail message waiting for you from the RETRIEVE-Server. Press the RETURN key to read that message.

To have data to work with it is necessary to extract this message into a file. This is done by pressing the e key and responding to the EXPORT: (copy message) to file in current directory: prompt with a file name. For the purposes of this exercise, use your last name for the filename and get as the extension. Exit the mail utility by entering q and responding to the Really quit pine? prompt with y.

The created file needs to have the following things done to it. Use the editor pico to make these changes. First, remove the mailing header information at the top of the file. In this case, it means removing all the lines prior to the one starting with LOCUS. Then move down in the file to where the actual sequence data is given. Between the ORIGIN line and the sequence data there should be a line containing just two periods, .., to assist the GCG reformatting process. Next note that the feature information above this section also contains some ".." notations. This will confuse the REFORMAT program. Therefore put a space between these periods. With the file so edited, exit the editor and go through the reformatting process on this modified file as was done in section 5a. If no problems are reported in the reformatting process and the file looks like a regular GCG sequence file when typed off, your efforts have been successful. If something is wrong, get help from your lab instructor prior to continuing on to the next section of this exercise.
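For this exercise the changes should be made by hand in pico, but the same three edits can be sketched in Python to show exactly what is being done. The function name prepare_genbank_for_reformat is hypothetical, and the sample message, including the access code M00000, is made up and much shorter than a real server reply.

```python
def prepare_genbank_for_reformat(text):
    """Apply the three edits described above to an exported mail message."""
    lines = text.splitlines()
    # Edit 1: drop the mail header, i.e. everything before the LOCUS line.
    start = next(
        (i for i, line in enumerate(lines) if line.startswith("LOCUS")), None
    )
    if start is None:
        raise ValueError("no LOCUS line found")
    output = []
    seen_origin = False
    for line in lines[start:]:
        if not seen_origin:
            # Edit 3: separate adjacent periods in the annotation lines.
            while ".." in line:
                line = line.replace("..", ". .")
            output.append(line)
            if line.startswith("ORIGIN"):
                # Edit 2: add the divider line right after ORIGIN.
                output.append("..")
                seen_origin = True
        else:
            output.append(line)
    if not seen_origin:
        raise ValueError("no ORIGIN line found")
    return "\n".join(output) + "\n"


mail = (
    "From: server\n"
    "Subject: reply\n"
    "\n"
    "LOCUS       M00000\n"
    "FEATURES             Location/Qualifiers\n"
    "     source          1..12\n"
    "ORIGIN\n"
    "        1 ccgggaggta gc\n"
    "//\n"
)
print(prepare_genbank_for_reformat(mail), end="")
```

The result still has to go through REFORMAT, which is where the checksum line and the final layout are produced.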

Type off the results of your modifications using the cat command.

% cat (your lastname).get

More information on using servers can be found in a handout in your carrel drawer. Refer to it in the future if you need to use this service to get at needed sequences, both primary and structural.



7) Converting between various data formats.

Now that you have worked with GCG formatted data files, it is time to explore the world of other data formats. GCG's is only one of many possible formats that you may come across in your computing work. Different formats can result from running a specific program or software suite that is used for a specific purpose. Getting data from collaborators across the country might mean that their data is formatted with different software than what is used here.

First type off the GCG formatted data file 1crn.nrl_3d. Notice the way the data is organized. This file was given at the start of the exercise and is from the NRL_3D database. It contains the sequence information for the protein crambin. Record your observations on the way the file is organized in the space provided below.

% cat 1crn.nrl_3d

Crambin file observations: ______________________________________________

___________________________________________________________________

___________________________________________________________________

The GCG software suite contains a number of programs for automatically converting between its own format and other formats. A listing of these programs is given below.

name      description
-------------------------------------------------------------------------------------
tofasta   converts GCG sequence(s) into FastA format
toig      converts GCG sequence file(s) into a single file in IntelliGenetics format
topir     writes GCG sequence(s) into a single file in PIR format
tostaden  writes a GCG sequence into a file in Staden format

There is another program that will do format conversions as well. Its name is readseq. Readseq allows interconversion among a number of different formats. This software is more flexible in that it works with more formats than the four supported by GCG.

Run the 1crn.nrl_3d file through readseq to produce a PIR-formatted output file. Then run the file through the topir program. Compare the results of the two runs with the original data file that was used. Instructions on using these two programs are given below. Use them as guides for creating the desired output files. User input is shown in bold type. In the readseq program there are two different forms of PIR format. Option 3 (NBRF) is the standard PIR format and option 14 (PIR/CODATA) is the special PIR format for use with their new Atlas system. Use option 3.

% readseq
readSeq (1Feb93), multi-format molbio sequence reader.

Name of output file (?=help, defaults to display):
1crn.pir-readseq<rtn>
         1. IG/Stanford           10. Olsen (in-only)
         2. GenBank/GB            11. Phylip3.2
         3. NBRF                  12. Phylip
         4. EMBL                  13. Plain/Raw
         5. GCG                   14. PIR/CODATA
         6. DNAStrider            15. MSF
         7. Fitch                 16. ASN.1
         8. Pearson/Fasta         17. PAUP/NEXUS
         9. Zuker (in-only)       18. Pretty (out-only)
   Choose an output format (name or #): 
3<rtn>

Name an input sequence or -option:
1crn.nrl_3d<rtn>

Name an input sequence or -option:
<rtn>

Now repeat the process using the topir program to create another PIR formatted file.

% topir

ToPIR writes GCG sequence(s) into a single file in PIR format.

 TOPIR of what GCG sequence(s) ?  1crn.nrl_3d<rtn>

                  Begin (* 1 *) ? <rtn>
                End (*    46 *) ? <rtn>

 What should I call the output file (* 1crn.pir *) ? <rtn>

 1CRN 46 characters.

Type off these two output files and compare the results obtained by each method. How do these output files compare with one another and the original as far as reference information is concerned? Record your observations below and on the next page.

% cat 1crn.pir-readseq

% cat 1crn.pir

comparison results: ___________________________________________________

___________________________________________________________________

___________________________________________________________________


Based on your observations and your previous experience with manual format conversions, what sort of process would have to be run to convert a GCG formatted file into a PIR formatted file retaining complete reference information?

conversion process to be used in this case: __________________________________

The readseq menu presented earlier lists a number of formats that GCG doesn't recognize. Run through the conversion process again, this time using the a22411.gb_pat sequence. Produce output files in the following formats: EMBL, DNAStrider and Pearson/Fasta. Use the following example as a guide. It gives the steps to follow for the EMBL conversion. Use the following extensions to tell your output files apart: embl for the EMBL conversion, dnas for the DNAStrider conversion and pfasta for the Pearson/Fasta conversion.

% readseq
readSeq (1Feb93), multi-format molbio sequence reader.

Name of output file (?=help, defaults to display):
a22411.embl<rtn>
         1. IG/Stanford           10. Olsen (in-only)
         2. GenBank/GB            11. Phylip3.2
         3. NBRF                  12. Phylip
         4. EMBL                  13. Plain/Raw
         5. GCG                   14. PIR/CODATA
         6. DNAStrider            15. MSF
         7. Fitch                 16. ASN.1
         8. Pearson/Fasta         17. PAUP/NEXUS
         9. Zuker (in-only)       18. Pretty (out-only)

Choose an output format (name or #):
4<rtn>

Name an input sequence or -option:
a22411.gb_pat<rtn>

Name an input sequence or -option:
<rtn>

Type off these output files and look at the differences in their respective formats.

% cat a22411.embl

% cat a22411.dnas

% cat a22411.pfasta

There are many formats available and readseq is a good tool for converting from one to another. It also handles conversions of files containing multiple sequences.
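To give a feel for what a converter does, the Python sketch below writes a sequence in the simplest of these layouts, Pearson/Fasta: one line starting with > that holds a name and a short description, followed by the sequence. This is an illustration only and is not how readseq or the GCG programs are implemented. The function name to_fasta and the 60-character line width are assumptions, so the output may differ in detail from the files you just created.

```python
def to_fasta(name, description, sequence, width=60):
    """Return FastA-style text: a '>' title line, then wrapped sequence."""
    title = f">{name} {description}".rstrip()
    body = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([title] + body) + "\n"


crambin = "TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN"
print(to_fasta("1CRN", "crambin", crambin), end="")
```

Only a name and a one-line description fit on the title line, which is worth keeping in mind when comparing the converted files with the originals.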

A listing of the GCG programs to be used for converting another format into GCG's format is given below.

name         description
---------------------------------------------------------------------------------
fromembl     reformats sequences from EMBL distribution flat file into individual 
             files in GCG format
fromfasta    reformats one or more sequences from FastA format into individual 
             files in GCG format
fromgenbank  reformats one or more sequences from GenBank flat file format 
             into individual files in GCG format
fromig       reformats one or more sequences from IntelliGenetics format into 
             individual files in GCG format
frompir      reformats sequences from the NBRF protein database into individual 
             files in GCG format
fromstaden   changes a sequence from Staden format into GCG format.

To use one of these converter programs more realistically, assume the following scenario. You are interested in finding any new homeobox sequences since the last full release of GenBank. The hope is to be able to locate a possible human homeobox-like sequence. A good spot to check out homeobox sequences is at a site concerned with fruit fly research, since this is a hot topic of research in this field. Such a site is located at Indiana University, where the gopher search will begin. The screen traces shown on the next few pages are truncated to take up less space. The blank space between the menus and the bottom of the displayed screens has been reduced.

% gopher

The following screen appears on your terminal.

                   Internet Gopher Information Client v2.1.3

                       Home Gopher server: serval.net.wsu.edu

 -->  1.  About WSUinfo/
      2.  Student Information System/
      3.  WSU Campuses Information/
      4.  Desktop Resources/
      5.  Discussion Forums/
      6.  Library Resources/
      7.  Software Archives/
      8.  Gopher Tunnels/
      9.  News & Weather/
      10. Internet Reference/


Press ? for Help, q to Quit                                         Page: 1/1

Once the screen is displayed, press the v key. This will load the existing list of bookmarks into the program and start the process of searching molecular biology sites for the information being sought.

                   Internet Gopher Information Client v2.1.3

                                   Bookmarks

 -->  1.  Computational Biology (Welchlab - Johns Hopkins University)/
      2.  Brookhaven National Laboratory Protein Data Bank/
      3.  EMBnet BioInformation Resource EMBL (Germany)/
      4.  IUBio Biology Archive, Indiana University/
      5.  PIR Archive, University of Houston/


Press ? for Help, q to Quit, u to go up a menu                      Page: 1/1

Select option 4 from this list by moving the horizontal arrow down to that position with the terminal down arrow key and pressing the RETURN key. This moves you to the Indiana University gopher site.

                   Internet Gopher Information Client v2.1.3

                   IUBio Biology Archive, Indiana University

 -->  1.  Genbank-Sequences/
      2.  IUBio-Software+Data/
      3.  About-IUBio-Archive.text  [24May95, 26kb]
      4.  FlyBase @ IUBio Gopher/
      5.  FlyBase @ IUBio HTML <HTML>
      6.  HTML door to IUBio <HTML>
      7.  Molecular-Biology/
      8.  Network-News/
      9.  Other-Bio-Gophers/
      10. Other-Gophers-and-Things/
      11. Species/
      12. This-Server/
      13. old FlyBase/


Press ? for Help, q to Quit, u to go up a menu                      Page: 1/1

Press the RETURN key once this screen is displayed to search the GenBank sequences for the desired information. The following screen next appears.

                   Internet Gopher Information Client v2.1.3

                               Genbank-Sequences

 -->  1.  About  [7Oct96, 7kb]
      2.  Search GenBank <?>
      3.  Search GenBank EST <?>
      4.  Search Genbank (gopher form)/ <??>
      5.  Search Genbank (html form) <HTML>
      6.  Search PIR <?>
      7.  Search PIR (gopher form)/ <??>
      8.  Search PIR (html form) <HTML>
      9.  Search Prosite protein database <?>
      10. Search Swiss-Protein <?>
      11. Search Swiss-Protein (gopher form)/ <??>
      13. Srs-FastA: Similarity Search of GenBank Subsets <HTML>
          c-----------------------------------------------------
      15. genbank-release-brief  [24Oct96, 8kb]
      16. genbank-release-doc  [24Oct96, 81kb]
      17. genbank-update  [13Nov96, 1kb]
      18. prosite-entry-list  [13Jun95, 43kb]

Press ? for Help, q to Quit, u to go up a menu                      Page: 1/2

Move the cursor down to option 2 and press the RETURN key to start the GenBank database search.

                   Internet Gopher Information Client v2.1.3

                               Genbank-Sequences

      1.  About  [ 7Oct96, 7kb]
 -->  2.  Search GenBank <?>
      3.  Search GenBank EST <?>
      4.  Search Genbank (gopher form)/ <??>
--------------------------------Search GenBank---------------------------------
|                                                                             |
| Words to search for                                                         |
|                                                                             |
|                                                                             |
|                                                                             |
| [Help: ^-]  [Cancel: ^G]                                                    |
------------------------------------------------------------------------------
      13. Srs-FastA: Similarity Search of GenBank Subsets <HTML>
          c-----------------------------------------------------
      15. genbank-release-brief  [14Oct95, 7kb]
      16. genbank-release-doc  [31Dec95, 82kb]
      17. genbank-update  [17Jan96, 1kb]
      18. prosite-entry-list  [13Jun95, 43kb]

Press ? for Help, q to Quit, u to go up a menu                      Page: 1/2

In the highlighted box enter the term homeobox. The system goes off and searches for the requested information and returns the following screen or something like it.

                   Internet Gopher Information Client v2.1.3

                            Search GenBank: homeobox

          iAccession ..| Description ......................... [Date, Size]
 -->  2.  Titles of matches to "{ homeobox }"
          Items since last full GenBank release

      4.  (  1) U74092 Pisum sativum branch induction, putative homeobox, par..
      5.  (  2) U74093 Pisum sativum branch induction, putative homeobox, par..
      6.  (  3) U74094 Pisum sativum branch induction, putative homeobox, par..
      7.  (  4) U74095 Pisum sativum branch induction, putative homeobox, par..
      8.  (  5) U73753 Pisum sativum clone Phox1, branch induction partial mR..
      9.  (  6) U73754 Pisum sativum clone Phox2, branch induction partial mR..
      10. (  7) U73755 Pisum sativum clone Phox3, branch induction partial mR..
      11. (  8) U73756 Pisum sativum clone Phox4, branch induction partial mR..
      12. (  9) U73946 Caenorhabditis elegans LIM homeodomain protein CeLIM-7..
      13. ( 10) U72347 Caenorhabditis elegans LIM homeobox gene CeLIM-6 mRNA,..
      14. ( 11) U72348 Caenorhabditis elegans LIM homeobox gene CeLIM-4 mRNA,..
      15. ( 12) L41846 Polycelis nigra homeodomain protein (Pnox1 b) gene, pa..
      16. ( 13) L41857 Polycelis felina homeodomain protein (Pfox3) gene, par..
      17. ( 14) L41848 Polycelis nigra homeodomain protein (Pnox3) gene, part..
      18. ( 15) L41856 Polycelis felina homeodomain protein (Pfox2) gene, par..

Press ? for Help, q to Quit, u to go up a menu                      Page: 1/10

Look closely at this list. Notice that this page contains no human proteins. There are also ten pages in the total list. Use the arrow key to move your way through the additional pages until you find a human homeobox protein. When you have found a page with a human homeobox protein listed, move the arrow down to this position and press the RETURN key. A screen similar to the one given below appears.

                  Internet Gopher Information Client v2.1.3

                            Search GenBank: homeobox

 -( 82) U31762 Human homeobox protein (Dlx-8) mRNA, partial cds. [960530, 0k]--
 |                                                                             |
 |  -->   1. text/plain English (USA) [0k] (default)                           |
 |        2. application/rtf English (USA) [0k]                                |
 |        3. biosequence/genbank English (USA) [0k]                            |
 |        4. biosequence/fasta English (USA) [0k]                              |
 |        5. biosequence/gcg English (USA) [0k]                                |
 |        6. biosequence/embl English (USA) [0k]                               |
 |        7. biosequence/nbrf English (USA) [0k]                               |
 |        8. biosequence/phylip English (USA) [0k]                             |
 |        9. biosequence/msf English (USA) [0k]                                |
 |       10. biosequence/paup English (USA) [0k]                               |
 |                                                                             |
 |  Choose a document type (1-10):                                             |
 |  [Help: ?]  [Cancel: ^G]                                                    |
 ------------------------------------------------------------------------------
      89. ( 85) S76222 CTs-Hox3=homeobox [Ctenodrilus serratus, Genomic, 82 n..
      90. ( 86) S76224 CTs-Lox2=homeobox [Ctenodrilus serratus, Genomic, 82 n..
Press ? for Help, q to Quit, u to go up a menu       Receiving Information...    

Notice all the formats in which you can display, and thus bring over, the data. This shows quite strongly that not all biocomputing sites use the GCG software package for sequence analysis. To get the data in a format that will later require conversion, press the RETURN key to see the file as a plain text data set. The screen changes to show the desired plain text version of the chosen protein.

Press the s key to let the software know that you want to save this file into your own account on the platform you are using. The screen changes and puts up a box over the middle of the present screen. The highlighted box below the Save in file line is filled with junk. Press the delete key until that entire line is blank and then enter human.homeobox as the filename for the saved data.

------------------------------------------------------------------------------
|                                                                            |
| Save in file:                                                              |
|                                                                            |
| junk in this space                                                         |
|                                                                            |
|  [Help: ?]  [Cancel: ^G]                                                   |
------------------------------------------------------------------------------

Wait a minute or so to ensure that the file has been transferred over and then press the q key twice. This should cause the prompt about quitting the program to appear at the bottom of the screen. Respond by pressing the RETURN key to return to the machine prompt.

Type off the new file with the cat command.

% cat human.homeobox

This file was from the GenBank database so it is in that format. To convert it into a GCG formatted file use the fromgenbank program. Instructions are given below for doing this. User input is shown in bold type.

% fromgenbank

FromGenBank reformats one or more sequences in the flat file format
of the GenBank database into individual sequence files in GCG format.

 Reformat what GenBank data file?  human.homeobox <rtn>

    xxxxxxxx.seq  xxxx bp.

 reformatted: human.homeobox
 total files: 1
 total bases: xxxx

Notice how this program didn't ask for the desired name of an output file. It used the GenBank accession number contained in the file for that purpose and gave the file the extension seq. Type off this file and see what changes have been made in it. Replace the xxxxxxxx.seq given in the command line below with the actual name of the file given in the fromgenbank run.

% cat xxxxxxxx.seq
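
The following Python sketch shows roughly what a GenBank-to-sequence conversion has to do: find the ACCESSION line, then collect the bases between the ORIGIN line and the closing // line. It is a simplified illustration, not the fromgenbank program itself; the function name parse_genbank and the sample entry (including the accession X00000) are made up for the example.

```python
def parse_genbank(text):
    """Return (accession, sequence) from the text of one GenBank flat-file entry."""
    accession = None
    bases = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("ACCESSION"):
            fields = line.split()
            if len(fields) > 1:
                accession = fields[1]
        elif line.startswith("ORIGIN"):
            # Sequence lines follow the ORIGIN keyword.
            in_sequence = True
        elif line.startswith("//"):
            # A line of two slashes ends the entry.
            in_sequence = False
        elif in_sequence:
            # Keep the letters only; position numbers and blanks are dropped.
            bases.extend(ch for ch in line if ch.isalpha())
    return accession, "".join(bases).upper()


if __name__ == "__main__":
    # Hypothetical entry, invented for this example.
    entry = (
        "LOCUS       EXAMPLE\n"
        "ACCESSION   X00000\n"
        "ORIGIN\n"
        "        1 acaaccggcc caacgactcg\n"
        "//\n"
    )
    accession, sequence = parse_genbank(entry)
    print(accession + ".seq", len(sequence), "bp.")
```

Naming the output after the accession number, as in the last line, mirrors the behavior you observed in the fromgenbank run.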


8) Entering DNA sequencing data into the computer.

In section 5 of this exercise, you entered a protein sequence into the computer. Protein sequences are best entered using the normal terminal keyboard, since you need so many of the keys to provide the necessary one-letter codes for the 20+ commonly used amino acids. DNA sequences, however, work with a much more limited set of codes, and are best handled by re-defining the keyboard to put all the needed keystrokes into a convenient section of the keyboard for one-handed entry of the data.

Redefining a keyboard can create problems for a user in that the other keys no longer work in their normal fashion. This can be avoided by setting up a subdirectory containing a file that redefines the keyboard. The keyboard is then redefined only in this special place and not for the entire account. You can do your nucleotide sequence entry in this special subdirectory, and protein entry elsewhere in your account. GCG's SEQED looks to see if this special redefining file is present and acts accordingly.

Create a subdirectory for your nucleotide sequence entry work with the name dna_entry and then move there.

% mkdir dna_entry

% cd dna_entry

Now that you are in the special subdirectory for nucleotide data entry, create the file that redefines the keyboard. This is done using the GCG program SETKEYS. SETKEYS will ask you for the keys to use for the four bases and three common ambiguity codes plus a delete key. A file called set.keys is then created to contain this information. This file can be further edited if you need to have more keys defined at a later date. Use the example given below to learn to use this program. Give some thought as to how you would prefer the keys to be assigned prior to running the program. Some keys are not allowed to be used. The keys , and / are two of these. It may be necessary to select a remapping scheme and then go into SEQED and see that there aren't any warning messages from the scheme you have come up with. The example given on the next page contains acceptable key reassignments. Record how you set it somewhere in your notes so that you can refer to it later in the class and use it again for entering other DNA sequences. User input is shown in bold type.

% setkeys

SetKeys writes a file in your directory that redefines your keyboard's
keys for sequence entry with the programs SEQED, LINEUP, GELENTER, and
GELASSEMBLE.  The output file, called set.keys, can be edited if you
want to use keys that were not defined in your interactive session with
SetKeys.

Choose key(s) for each nucleotide:

What key(s) should mean G ?  j <rtn>
What key(s) should mean A ?  k <rtn>
What key(s) should mean T ?  l <rtn>
What key(s) should mean C ?  ; <rtn>

Now choose key(s) for the common ambiguity codes:

What key(s) should mean R ?  u <rtn>
What key(s) should mean Y ?  i <rtn>
What key(s) should mean N ?  o <rtn>
What key(s) should mean <Delete> ?  p <rtn>

SetKeys complete: output file is "/disk3/usr/local/people/bcsxx/week4/dna_en.
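
To see what the remapping amounts to, here is a small Python sketch that translates keystrokes into bases using the assignments from the example above. It only illustrates the idea; it does not read or write a set.keys file, and the names KEY_MAP, DELETE_KEY and apply_keys are assumptions made for the example.

```python
# Key assignments taken from the SETKEYS example above.
KEY_MAP = {"j": "G", "k": "A", "l": "T", ";": "C", "u": "R", "i": "Y", "o": "N"}
DELETE_KEY = "p"


def apply_keys(keystrokes):
    """Translate a string of keystrokes into the sequence they would enter."""
    entered = []
    for key in keystrokes:
        if key == DELETE_KEY:
            # Remove the most recently entered base, if there is one.
            if entered:
                entered.pop()
        elif key in KEY_MAP:
            entered.append(KEY_MAP[key])
        # Keys without an assignment are ignored in this sketch.
    return "".join(entered)


if __name__ == "__main__":
    # The first four bases of the practice sequence are ACAA.
    print(apply_keys("k;kk"))
```

With this layout all eight keys sit under the right hand, which is the point of the remapping.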

With the keyboard now ready, it is time to enter a typical nucleotide sequence. Most modern labs can easily produce reading gels containing between 300 and 500 bases. Given below is a nucleotide sequence of 300 bases for entry. Use the SEQED program to enter this sequence. Refer to the earlier protein example or to the manual for assistance prior to calling your instructor for help. Give the sequence the filename unknown.seq. Remember, you must now use your new key assignments.

ACAACCGGCCCAACGACTCGATGAGGGAACTTTGGACACACTCGCAGCTC
ACAGGTGAACGATATGGCTCCAAGAAGAGTGTAGCCATCCTGACCAGCGG
TGTGACAGCCGGCGCCGCCGAGGAATTTACTTACATCATGAAGAGGCTGG
GCCGGGCCCTGGTCGTTGGTGAAGTGACAAGTGGAGGCTGCCAGCCACCA
CAGACCTACCACGTGGACGACACGCATCTCTATATCACCATCCCCACAGC
TCGCTCTGTGGGCGCCACGGACGGCAGTTCCTGGGAAGGGGTGGGTGTGA

% seqed unknown.seq

When the sequence is completely entered, exit from the program and examine it. There is always one nagging problem with nucleotide data entry: the sequences are so long that it is easy to make a mistake that could throw later analysis efforts off. Bases can be dropped or extra copies of correct ones inserted, so just seeing that the length is correct is not enough. One way to get around this is to double enter a sequence blindly. SEQED can be run in a checking mode in the following manner:

% seqed unknown.seq

When the unknown sequence is displayed on the screen, get into the command mode by entering Ctrl-d. At the colon prompt enter check /blind. [Normally in GCG a dash is used before an option; however, in this case a slash is used.] When the colon returns again after the sequence has disappeared, press the RETURN key to get into the entry mode. With the cursor now above the top number line, re-enter the 300 bases. When a base is entered that doesn't agree with the one entered previously in that position, a beep will occur and a ^ will appear under the position in question. The arrow keys may now be used to toggle back and forth between the two lines to see if you can spot the problem, and determine which of the lines is correct. The only line shown on the screen is the one currently being looked at. To use your delete key, position the cursor at the right of the character to be removed. Even this checking won't remove all errors if care is not taken in referring between the original data and the checking lines. Corrections made in the original blind sequence are what will be saved when the sequence is rewritten to an output file.
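
The position-by-position comparison behind double entry can be sketched in a few lines of Python. This is only an illustration of the principle, not SEQED's own code; the function name check_entries is an assumption.

```python
def check_entries(first, second):
    """Return a marker line with '^' under each position where two entries disagree."""
    length = max(len(first), len(second))
    marks = []
    for i in range(length):
        # Positions past the end of the shorter entry count as disagreements.
        a = first[i] if i < len(first) else "-"
        b = second[i] if i < len(second) else "-"
        marks.append(" " if a == b else "^")
    return "".join(marks)


if __name__ == "__main__":
    original = "ACGTAC"
    retyped = "AGTAC"  # the C in position 2 was dropped
    print(original)
    print(retyped)
    print(check_entries(original, retyped))
```

The example shows why a dropped base is so disruptive: every position after the omission is flagged, not just the one where the mistake was made.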

Since you really don't know what the unknown.seq might be, do some database searching to determine what you are dealing with. The sequence you have been working with is from the GenBank database. In the GCG database system there are a number of GenBank subdivisions. The question here is which one contains the match for your entered sequence? A listing of GenBank subdivisions is given below.

GB_BA => gb_ba => bacterial sequences
GB_EST1 => gb_est1 => first part of the expressed sequence tag entries
GB_EST2 => gb_est2 => second part of the expressed sequence tag entries
GB_EST3 => gb_est3 => third part of the expressed sequence tag entries
GB_EST4 => gb_est4 => fourth part of the expressed sequence tag entries
GB_EST5 => gb_est5 => fifth part of the expressed sequence tag entries
GB_EST6 => gb_est6 => sixth part of the expressed sequence tag entries
GB_EST7 => gb_est7 => seventh part of the expressed sequence tag entries
GB_EST8 => gb_est8 => eighth part of the expressed sequence tag entries
GB_EST9 => gb_est9 => ninth part of the expressed sequence tag entries
GB_GSS => gb_gss => genome survey sequences
GB_HTG => gb_htg => high-throughput sequencing data
GB_IN => gb_in => invertebrate sequences
GB_OM => gb_om => other mammalian sequences
GB_OV => gb_ov => other vertebrate sequences
GB_PAT => gb_pat => patent sequences
GB_PH => gb_ph => phage sequences
GB_PL => gb_pl => plant sequences
GB_PR => gb_pr => primate sequences
GB_RO => gb_ro => rodent sequences
GB_ST => gb_st => structural RNA sequences
GB_STS => gb_sts => sequence tagged site sequences
GB_SY => gb_sy => synthetic and chimeric sequences
GB_TAGS => points to the est and sts sequences
GB_UN => gb_un => unannotated sequences
GB_VI => gb_vi => viral sequences

In the real world, you would want to know if your data shows any similarity to other reported sequences. In this case, you want to find out both that and how correctly you have entered the sequence.

One way to do this is to run the GCG program FASTA. Since such searches on the entire GenBank database can take a lot of time, you would normally use the -batch command switch to set up a command file and have the job run automatically in batch mode. However, for this exercise you have found out that the original sequence is from a rodent, and you can therefore run the program interactively on just that subdivision of the database.

% fasta

You are now in the program, selecting the parameters to be used. The query sequence is unknown.seq, the database to be searched is gb_ro:*, and you want only the top 5 sequences to be saved. For the rest of the parameters, accept the default values shown by pressing the RETURN key and moving on to the next item. Given below and on the next page is an example of what you can expect in this procedure. User input is shown in bold type. Use your last name as the filename for the output file.

FASTA does a Pearson and Lipman search for similarity between a query
sequence and any group of sequences of the same type (nucleic acid or
protein.) For nucleotide searches, FASTA may be more sensitive than
BLAST.

 FASTA with what query sequence ?  unknown.seq <rtn>

                 Begin (* 1 *) ? <rtn>

               End (*   300 *) ? <rtn>

 Search for query in what sequence(s) (* GenEMBL:* *) ?  gb_ro:* <rtn>

 What word size (* 6 *) ? <rtn>

 Don't show scores whose E() value exceeds: (* 2.0 *):

 What should I call the output file (* Unknown.Fasta *) ? <rtn>

** a listing of the sequences being searched is shown **

** information is given on the statistics for the search **

The best scores are shown. The size of this list is determined by the number of hits that are above a default expectation score. The list contains x entries.

 How many alignments would you like to see (* x *) ? <rtn>

** information is given on the time needed to conduct the search. **

Type off the resulting output file, unknown.fasta, and examine the alignment results. The first sequence in the list should have excellent agreement with your unknown.seq file. If this agrees 100% with the sequence you entered, then there were no mistakes. Any lower figure, and there are problems in the sequence somewhere. Edit this file with pico to contain just the best alignment. This file will be used later in giving a report of your accuracy.

% pico unknown.fasta
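
As a rough illustration of what the word size prompt controls, the Python sketch below lists every place where a short word from the query occurs exactly in a database sequence. A FASTA-style search starts from word matches of this kind and then scores and extends them; that later scoring is not shown. The function name shared_words and the two short test sequences are assumptions made for the example.

```python
from collections import defaultdict


def shared_words(query, subject, word_size=6):
    """Return (query_pos, subject_pos) pairs (0-based) where a word matches exactly."""
    # Index every word of the query by its starting position.
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)
    # Look up every word of the subject in that index.
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    for query_pos, subject_pos in shared_words("ACGTACGG", "TTACGTACTT"):
        print("query", query_pos, "matches subject", subject_pos)
```

A single mistyped base removes every word that overlaps it, so an entry error lowers the number of word hits as well as the final percent identity.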

There is another means of doing database searching, BLAST. This program uses a GCG-formatted sequence file to submit a BLAST search to the server at NCBI. The process is automatic and requires very little input from the end user.

Your results should take just a few minutes if you can get connected and the server isn't down. If either you can't get connected to the server or it is down, go on with the rest of the exercise and try this section later. Do a BLAST search on your unknown.seq file by following the instructions given below. User input is shown in bold type. When everything works and you get your output file, type it off to see what sort of hits you got.

% blast

BLAST finds sequences in a database that are similar to a query
sequence.  BLAST can either search databases on your own computer or it
can search the databases maintained at the National Center for Biotechnology
Information (NCBI). 

 BLAST search with what query sequence?  unknown.seq <rtn>

 Search for query in what sequence database:

   1) nr          p Non-redundant GenBank CDS translations+PDB+SwissProt+PIR   
   2)   pdb       p PDB protein sequences 
   3)   swissprot p SwissProt sequences 
   4) yeast       p Saccharomyces cerevisiae protein sequences
   5) kabat       p Kabat Sequences of Proteins of Immunological Interest         
   6) alu         p Translations of Select ALu Repeats from REPBASE
   7) month       p All new or revised GenBank CDS translations+PDB+SwissProtein+PI
   8) nr          n Non-redundant GenBank+EMBL+DDJB+PDB sequences (but no EST's  
   9)   pdb       n PDB nucleotide sequences  
  10)   vector    n Vector subset of GenBank 
  11) yeast       n Saccharomyces cerevisiae genomic nucleotide sequences 
  12) est         n Non-redundant Database of GenBank+EMBL+DDJB EST Division 
  13) sts         n Non-redundant Database of GenBank+EMBL+DDJB STS Division 
  14) gss         n Genome Survey Sequences  
  15) mito        n Database of mitochondrial sequences, Rel. 1.0, July 1995  
  16) kabat       n Kabat Sequences of Nucleic Acid of Immunological InterestV
  17) epd         n Eukaryotic Promotor Database   
  18) alu         n Select ALu Repeats from REPBASE        
  19) month       n All new or revised GenBank +EMBL+DDJB+PDB sequences released 
  Please choose one (* 1 *):  8 <rtn>

 Ignore scores that could occur by chance more than (* 10.0 *) times? <rtn>

 Limit the number of sequences in my output to (* 250 *) ?  5 <rtn>

 What should I call the output file (* Unknown.Blastn *) ?  (your lastname).blastn<rtn>


 Trying cruncher.nlm.nih.gov (130.14.25.175)

 Connected to cruncher.nlm.nih.gov

 Search in progress on the network server.

 ........................................

 Retrieving results.

 Done!

You may encounter a waiting statement after the connection line. If the number of jobs in the queue is not too big (there appears to be a limit of 5 for this type of search), you will still be connected while these earlier jobs are finished. Type off these results and compare them to the earlier FASTA ones.

Normally it is best to do this type of database searching with protein sequences rather than nucleotide ones. This is just meant to be an example of how to do this type of searching. The algorithms and scoring schemes for these two programs (FASTA and BLAST) will be explored in depth in the exercise for week 8.



9) Entering data from a pseudo gel.

Normally in a lab situation an individual reads a sequence directly off the gel. To do this one needs to know the orientation of the gel, how the lanes were set up etc., and then the gel is read from the bottom to the top picking off the sequence as one goes. To give you a flavor of this type of data, an autorad has been obtained, and a paper version of the data has been created from an actual gel and is included here. Use the paper version of this data to finish reading in the first 100 bases of a sequence through the use of SEQED. See section 5b for details on using the SEQED program.

You are working with known data here. To see how accurate you have been in reading this gel, do a similarity check on the two sequences with the GCG program GAP. Respond according to the example given below and on the next page. User input is shown in bold type. First copy over the necessary comparison file. In the example the xxx denotes your last name.

% gap

GAP uses the algorithm of Needleman and Wunsch to find the alignment of
two complete sequences that maximizes the number of matches and minimizes 
the number of gaps.

 GAP of what sequence 1 ?  gel.seq <rtn>

                 Begin (* 1 *) ? <rtn>
               End (*   100 *) ? <rtn>
              Reverse (* No *) ? <rtn>

 to what sequence 2 (* gel.seq *) ?  gelx.data <rtn>

                 Begin (* 1 *) ? <rtn>
               End (*   109 *) ? <rtn>
              Reverse (* No *) ? <rtn>

 What is the gap creation penalty (* 50 *) ?  <rtn>

 What is the gap extension penalty (* 3 *) ? <rtn>

 What should I call the paired output display file (* gel.pair *) ? xxx.pair<rtn>

Aligning .....-.

Information on the quality of the gapping is displayed.

Examine the results of the gapping process by looking at the output file, (your lastname).pair and finding out how well you read in the gel. The problems with getting your data accurate with nucleotide sequencing are so great that many labs have more than one person working with the same material and doing numerous gels just to be sure of their data. Gel reading errors most likely occur at the very beginning and end of the gel.
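
For readers who want to see the idea behind GAP, here is a minimal Python sketch of a Needleman and Wunsch style global alignment. It is a simplification under stated assumptions: it uses one linear gap penalty rather than separate gap creation and gap extension penalties, and the match, mismatch and gap scores are arbitrary values chosen for the example, not GAP's defaults.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # The first row and column represent leading gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: each cell is the best of a diagonal step or a gap.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom right corner to recover one best alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[len(a)][len(b)], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    total, line1, line2 = needleman_wunsch("ACGT", "AGT")
    print(total)
    print(line1)
    print(line2)
```

In the small example the missing C is represented by a gap character in the second line, which is how a base skipped while reading the gel would show up in your own paired output.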

Send (your lastname).pair file over to the teacher account and then move back into the week4 subdirectory.

% rcp (your lastname).pair teacher@ribozyme:receive

% cd ..


10) Using SEQED to manipulate sequences

You have already seen how to enter sequence data using SEQED. Now you will explore how to manipulate data using this program, together with results gathered from other software, to perform a simulated cloning on the computer.

To explore this, assume that you are interested in inserting the interesting region of the toxin.seq into the pbr322 plasmid. You will also be using a revised version of the standard GCG enzyme.dat file which more closely reflects the restriction enzymes found in a typical lab setting.

background - understanding restriction enzymes:

Look at this revised enzyme.dat file to see how complex a pattern various restriction enzymes can look for. Remember that there are a number of ambiguity codes for bases.

% cat enzyme.dat

Here is a small excerpt from the file. Each line gives, in order: the name of the restriction enzyme; the offset of the cut site from the beginning of the recognition pattern on the top strand; the recognition pattern, with the top-strand cut site marked by a ' and the bottom-strand cut site marked by an underscore; and the overhang, which is the distance from the top-strand cut site to the point where the bottom strand is cut. In the example given below, two of the restriction enzymes (TaqI and XbaI) have bottom cut sites that differ from the top ones. The advantage of such sticky ends is that they orient inserted fragments in the cloning process. Blunt end cutters such as SspI and StuI can result in products which do not have the desired insert orientation.

      SspI       3 AAT'ATT       0
      StuI       3 AGG'CCT       0
      TaqI       1 T'CG_A        2
      XbaI       1 T'CTAG_A      4
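
The four fields described above can be pulled apart with a short Python sketch. It assumes each line holds only the name, offset, marked pattern and overhang shown in the excerpt; lines in a full enzyme.dat file may carry further fields that this sketch ignores. The function name parse_enzyme_line is an assumption.

```python
def parse_enzyme_line(line):
    """Split one enzyme line into its name, offset, recognition site and overhang."""
    name, offset, pattern, overhang = line.split()[:4]
    # Remove the cut-site markers to get the bare recognition site.
    site = pattern.replace("'", "").replace("_", "")
    return {
        "name": name,
        "offset": int(offset),
        "site": site,
        "overhang": int(overhang),
        # An overhang of 0 means both strands are cut at the same place.
        "blunt": int(overhang) == 0,
    }


if __name__ == "__main__":
    excerpt = [
        "      SspI       3 AAT'ATT       0",
        "      StuI       3 AGG'CCT       0",
        "      TaqI       1 T'CG_A        2",
        "      XbaI       1 T'CTAG_A      4",
    ]
    for entry in excerpt:
        enzyme = parse_enzyme_line(entry)
        kind = "blunt" if enzyme["blunt"] else "sticky"
        print(enzyme["name"], enzyme["site"], kind)
```

Note that in each excerpt line the offset equals the number of bases before the ' mark, and the overhang equals the number of bases between the ' and the underscore.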

The simplest form of cloning is to cut out a gene fragment with one restriction enzyme and to use the same enzyme to cut the plasmid for insertion. To be effective, it is necessary for the chosen restriction enzyme to cut the vector only once.

part 1 - collecting data on the toxin sequence:

To start off this process, first find out what you can about the toxin contained in the toxin.seq file. Record the location of the area of interest below (this is the CDS region).

% cat toxin.seq

area of interest: _______________________________________________________

To find out the desired information, use MAPSORT. This program will determine the restriction enzyme cut sites and the sizes of the fragments produced.

% mapsort

MAPSORT finds the coordinates of the restrictions enzyme cuts in a
DNA sequence and sorts the fragments of the resulting digest by size.
MAPSORT can sort the fragments from a single or multiple enzyme 
digests.

 (Linear) MAPSORT of what sequence ?  toxin.seq <rtn>

                  Begin (* 1 *) ? <rtn>
                End (*   940 *) ? <rtn>

Is this sequence circular (* No *) ? <rtn>

 *** I read your enzyme data file "enzyme.dat"!! ***

Select the enzymes: Type nothing or " * " to get all enzymes. Type "?"
for help on what enzymes are available and how to select them.

                                   Enzyme(* * *):

What should I call the output file (* toxin.mapsort *) ? <rtn>

 Mapping .

Type off the results of this program. Information is given for those restriction enzymes that cut the sequence, where the cuts occur and how big the fragments are. The fragments are even arranged by size. Linear mapsort results give the starting and ending point of the sequence, in this case 0 and 940. If the restriction enzyme has cut the sequence only once, there will only be a single number given between the 0 and the 940. What you want is a restriction enzyme that produces at least two cuts in the sequence. For a restriction enzyme to be useful, the location of these cuts should produce a fragment that contains the entire region of interest. From the results, select the restriction enzyme that cuts out the fragment you want and record its cut points below.

% cat toxin.mapsort

restriction enzyme to use: ______________________________________________

enzyme pattern: ______________________________________________________

enzyme cut points:      start  ______________________      end _________________
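
The kind of calculation described in this part can be sketched in Python: locate every occurrence of a recognition pattern (allowing the standard ambiguity codes), sort the fragments of a linear digest by size, and check whether two cuts flank a region of interest. This is an illustrative sketch, not MAPSORT's implementation; the function names and the short test sequence are assumptions, and only the top strand is searched.

```python
import re

# Standard nucleotide ambiguity codes written as regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def cut_positions(sequence, site, offset):
    """Return, for each site found, the number of bases that lie before the cut."""
    pattern = "".join(IUPAC[base] for base in site.upper())
    # The lookahead lets overlapping sites be reported as well.
    matches = re.finditer("(?=" + pattern + ")", sequence.upper())
    return [m.start() + offset for m in matches]


def linear_fragments(length, cuts):
    """Return the fragment sizes of a linear digest, largest first."""
    edges = [0] + sorted(cuts) + [length]
    return sorted((b - a for a, b in zip(edges, edges[1:])), reverse=True)


def flanking_cuts(cuts, region_start, region_end):
    """Return the nearest cuts outside a region (1-based, inclusive), or None."""
    left = [c for c in cuts if c < region_start]
    right = [c for c in cuts if c >= region_end]
    if left and right:
        return max(left), min(right)
    return None


if __name__ == "__main__":
    sequence = "GGTCGAAATCGACC"  # made-up test sequence with two TaqI sites
    cuts = cut_positions(sequence, "TCGA", 1)
    print(cuts)
    print(linear_fragments(len(sequence), cuts))
    print(flanking_cuts(cuts, 5, 8))
```

An enzyme is a candidate for cutting out the region of interest only when flanking_cuts returns a pair of positions, which corresponds to the "at least two cuts" requirement in the text.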


part 2 - collecting data on the pBR322 plasmid:

Look at a classic plasmid, pBR322. pBR322 was once a very popular plasmid, but it had limitations. Two features of note are the ampicillin (beta-lactamase) and the tetracycline resistance gene areas. These allowed for testing of a successful cloning of a gene fragment into either of these two areas by simple wet lab techniques.

Type off the pbr322.seq file and record below the location of the two interesting features of the plasmid.

% cat pbr322.seq

ampicillin(beta-lactamase) region: _________________________________________

tetracycline resistance region: ____________________________________________

Determine how many of our working set of restriction enzymes cut pBR322 and where. To do this, run the program MAPSORT with the command switch -once to restrict the results to single cutters (the ideal situation) and the command switch -cir to treat the sequence as circular. Follow the example given below and on the next page. Use your lastname for the name of the produced mapsort file.

% mapsort -once -cir

MAPSORT finds the coordinates of the restrictions enzyme cuts in a
DNA sequence and sorts the fragments of the resulting digest by size.
MAPSORT can sort the fragments from a single or multiple enzyme
digests.

(Circular) MAPSORT of what sequence ? pbr322.seq <rtn>

              Begin (* 1 *) ? <rtn>
            End (*  4363 *) ? <rtn>

*** I read your enzyme data file "enzyme.dat"!! ***

Select the enzymes: Type nothing or "*" to get all enzymes. Type "?"
for help on what enzymes are available and how to select them.

                                      Enzyme(* * *): <rtn>

[This selects all the enzymes present in your refrigerator list.]

What should I call the output file (* pbr322.mapsort *) ? <rtn>

Mapping ...

Rename the output of this program to (your lastname).mapsort and then print it with lpr on the lab's printer.

% mv pbr322.mapsort (your lastname).mapsort

% lpr (your lastname).mapsort

Look at the results of this program. The file lists those restriction enzymes from the enzyme.dat file that cut the pBR322 plasmid only once, and where each cut occurs along the sequence. At the end of the file is a listing of those enzymes from the enzyme.dat file that don't cut the sequence and those that were excluded due to multiple cuts. Note that only some of the 30 possible enzymes cut this sequence. Check whether the restriction enzyme that can be used to clip out the desired region of the toxin sequence is a single cutter of pBR322, and note where it cuts.

______________________________________________________________

______________________________________________________________

______________________________________________________________
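
The idea of a single cutter on a circular sequence can be sketched in Python as follows. This is not how MAPSORT is implemented; it is a minimal illustration with a made-up 33-base circular sequence and hypothetical function names. The four recognition sequences are the standard ones for these enzymes. Positions are the 1-based start of the recognition pattern, and a site that runs across the junction between the last and first base is counted too.

```python
def find_sites_circular(seq, pattern):
    """1-based start positions of pattern in a circular sequence."""
    seq, pattern = seq.upper(), pattern.upper()
    # Append the first bases again so a site spanning the origin is found.
    extended = seq + seq[:len(pattern) - 1]
    sites = []
    start = extended.find(pattern)
    while start != -1:
        sites.append(start + 1)
        start = extended.find(pattern, start + 1)
    return sites


def single_cutters(seq, enzymes):
    """Map each enzyme that cuts exactly once to its site position."""
    result = {}
    for name, pattern in enzymes.items():
        sites = find_sites_circular(seq, pattern)
        if len(sites) == 1:
            result[name] = sites[0]
    return result


# Made-up circular sequence, for illustration only.
toy_plasmid = "TTCACGTAAGCTTACGGATCCATGGATCCTGAA"
enzymes = {"EcoRI": "GAATTC", "BamHI": "GGATCC",
           "HindIII": "AAGCTT", "PstI": "CTGCAG"}

print(single_cutters(toy_plasmid, enzymes))   # {'EcoRI': 31, 'HindIII': 8}
```

In this toy case BamHI is excluded because it cuts twice, PstI because it does not cut at all, and the EcoRI site is found only because the sequence is treated as circular.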


part 3 - performing the cloning:

With all the necessary data collected, perform the actual cloning operation. Use the program SEQED. Start it by entering seqed. Respond to the prompt about SeqEd of what sequence ? with the name of the pBR322 sequence you are using (pbr322.seq).

% seqed

The sequence is loaded and the screen shows you the end of the sequence with the cursor blinking after base 4361. Enter 1 <rtn> to move to the front of the sequence. Now that you are at the beginning of the sequence, it is time to explain how to proceed. To know what values to enter in the following steps, you need to understand the numbers reported to you by the MAPSORT program and the way in which SEQED operates.

SEQED inserts sequence data in the following manner. The program inserts the new sequence at the position you give it and moves the base at that position, along with the rest of the original sequence, to the end of the newly included sequence.

MAPSORT gives you the starting point of the restriction enzyme pattern that you are using, not where the actual cut occurs. If the cut site is not just before the first character of the pattern, the value you recorded as the starting point on page 2x will have to be corrected. The difference between the start of the restriction enzyme recognition pattern and the actual cleavage point is called the offset: the number of bases between the beginning of the pattern and the ` symbol in the enzyme pattern. The same correction applies to the insertion point in the vector.

The actual insertion position is the pattern start position plus the offset (recorded value + offset). The actual starting point of the sequence to be included is the recorded starting position plus the offset (recorded value + offset). The corrected ending position of the insert is the recorded ending position plus the offset minus one (recorded value + offset - 1). Make these corrections and record the corrected values below.

corrected insertion point (value + offset): ___________________________________

corrected starting point (value + offset): ____________________________________

corrected ending point (value + offset -1): __________________________________
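
The three corrections can be checked with a short Python sketch. The function names and the recorded values (100, 12 and 900) are hypothetical, and the pattern is written with the ` symbol for the cut site as in the text above; use the pattern and values from your own MAPSORT results.

```python
CUT_MARK = "`"  # cut-site symbol as printed in this handout (assumption)


def cut_offset(marked_pattern, mark=CUT_MARK):
    """Number of bases between the start of the pattern and the cut mark."""
    return marked_pattern.index(mark)


def corrected_values(vector_site, insert_start, insert_end, offset):
    """Return (insertion point, starting point, ending point), corrected."""
    return (vector_site + offset,      # recorded value + offset
            insert_start + offset,     # recorded value + offset
            insert_end + offset - 1)   # recorded value + offset - 1


offset = cut_offset("G`AATTC")   # one base lies before the cut mark
print(offset)                                   # 1
print(corrected_values(100, 12, 900, offset))   # (101, 13, 900)
```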

Enter the corrected insertion point at which the plasmid is to be cut and press RETURN. This moves you along the sequence to the actual insertion point. The base that occurs just after the cut should be highlighted with the cursor. At this point press ctrl-d to go into the command mode of the program. At the : prompt, enter the following:

: include toxin.seq <rtn>

The program will come back with the following queries about the inserting of the toxin sequence.

The x in the example is the corrected starting point and the y the corrected ending point you recorded earlier above. User input is shown in bold type.

             seqed include of toxin.seq

              Begin (* 1 *) ?  x <rtn>
            End (*   940 *) ?  y <rtn>
           Reverse (* No *) ? <rtn>

first 50 bases: [the actual first 50 bases are shown] 

This line should start with the portion of the restriction enzyme pattern that is to the right of the cut site.

last 50 bases: [the actual last 50 bases are shown] 

This line should end with the portion of the restriction enzyme pattern that is to the left of the cut site.

is this what you want included (* yes *) ? <rtn>

Back at the : prompt, enter write clone.seq <rtn> to write your work out to a file and then enter quit to get out of the program.
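
What the include step does to the sequence can be mimicked with Python string slicing. This is a toy sketch and not SEQED itself: the function name and both short sequences are made up, with the pattern start at base 4 in the toy vector and at bases 3 and 13 in the toy insert source, and an offset of 1 as for a pattern written G`AATTC.

```python
def include_at(vector, position, source, begin, end):
    """Insert source[begin..end] (1-based, inclusive) before base `position`
    of vector; that base and everything after it move behind the insert."""
    fragment = source[begin - 1:end]
    return vector[:position - 1] + fragment + vector[position - 1:]


toy_vector = "AAAGAATTCTTT"            # pattern starts at base 4
toy_source = "CCGAATTCGGGGGAATTCCC"    # pattern starts at bases 3 and 13

# Corrected values with offset 1:
#   insertion point 4 + 1 = 5, start 3 + 1 = 4, end 13 + 1 - 1 = 13
clone = include_at(toy_vector, 5, toy_source, 4, 13)
print(clone)                   # AAAGAATTCGGGGGAATTCTTT
print(clone.count("GAATTC"))   # 2
```

The included piece begins with AATTC, the part of the pattern to the right of the cut, and ends with G, the part to the left of it, so the full pattern is rebuilt at both junctions.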


part 4 - checking your work:

Check your work with MAPSORT. Run the program as shown before, but use clone.seq as the input file and your selected enzyme as the enzyme to be used. The resulting mapsort file should show two cut sites for your chosen enzyme, and the fragment between the cut sites should still be big enough to hold the desired region of the sequence. If it doesn't, repeat the above sections until it does. An example of a successful result is shown on the next page. The first number in the Size line is the size of the desired fragment and the second is the size of the pBR322 plasmid.

% mapsort -cir

Cuts at:    xxxx    xxxx    xxxx
   Size:         933    4361
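
The two sizes in a result like this follow directly from the cut positions on a circular sequence, as the Python sketch below shows. The function name and the cut positions (100 and 1033) are hypothetical; they were chosen only so that the sizes match the example above for a clone of 4361 + 933 = 5294 bases.

```python
def circular_fragment_sizes(cut_points, seq_length):
    """Fragment sizes for a digest of a circular sequence."""
    cuts = sorted(cut_points)
    if not cuts:
        return [seq_length]   # uncut circle stays in one piece
    sizes = [b - a for a, b in zip(cuts, cuts[1:])]
    # The last fragment wraps around the origin back to the first cut.
    sizes.append(seq_length - cuts[-1] + cuts[0])
    return sizes


print(circular_fragment_sizes([100, 1033], 5294))   # [933, 4361]
```

A single cut on a circle gives one fragment the length of the whole sequence, which is why two cut sites are expected in the finished clone.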

When you are satisfied with your work, rename the clone.seq file to (your lastname).clone and use rcp to ship it off to the teacher account.

% mv clone.seq (your lastname).clone

% rcp (your lastname).clone teacher@ribozyme:receive


11) Finishing up.

Use the editor, pico, to fill in the report. Send the report form, lastname.get, and lastname.blastn files over to the teacher account.

% mv week4s.week4s (your lastname).week4s

% pico (your lastname).week4s

% rcp (your lastname).week4s teacher@ribozyme:receive

% rcp (your lastname).get teacher@ribozyme:receive

% rcp (your lastname).blastn teacher@ribozyme:receive


This concludes your computing session for this week. Log off the computer.

% logout

Now exit the emulator program by selecting the Quit option from the File menu on the control bar. You will be returned to the Launcher window screen.