'96 BC/BP 378

Week 6

Working with DNA - section 1. You will learn how to enter DNA sequence data for various uses, explore nucleotide primary sequence databases and the format of the data therein, do simple nucleotide database string searches and create effective data sets, and do a cloning operation manually on the computer.

Author:

Susan Jean Johns


Nucleic Acid Background Information

Biochemistry is the study of the molecular basis of life. Nucleic acids are a vital part of that study. Nucleic acids resemble proteins in that they are built from many smaller building blocks linked end to end. The building blocks of a nucleic acid, called nucleotides, are much more complex than any amino acid. Each nucleotide is composed of a phosphate group, a sugar moiety and either a purine or pyrimidine base. The nature of the sugar moiety determines the type of nucleotide. DNA molecules contain deoxyribose, while RNA molecules contain ribose. Three of the four nitrogenous bases are present in both DNA and RNA: adenine, guanine and cytosine. Thymine is only found in DNA and uracil only in RNA. Thymine and uracil are structurally similar to one another.

Nucleotides are linked together by regular 5'-3' phosphodiester bonds. These repeating sugar phosphate units are always linked together by the same chemical bonds and form the backbone of the molecule. The bases provide the molecule with its high degree of variability. Early studies of the composition of DNA showed that the amount of adenine (A) was similar, if not identical to, that of thymine (T). Likewise the amount of guanine (G) was similar, if not identical to, that of cytosine (C).

DNA molecules are actually two nucleotide strands held together by hydrogen bonding. One strand is the base complement of the other. DNA molecules can thus form double stranded helical structures. The backbone units are on the outside of the helix with the bases on the inside. This structure is stabilized by hydrogen bonding between the bases of the adjacent strands. When inverted repeats exist within a DNA molecule, denaturation followed by renaturation can form stem loop structures. RNA exists as a single stranded entity and can fold up into elaborate structures.
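The base-pairing rule described above can be sketched in a few lines of Python. This is only an illustration and is not part of the GCG suite; the function names are made up for this handout, and the test sequence is the 30-base practice sequence used later in this week's exercise.

```python
# Illustrative sketch only; these function names are hypothetical, not GCG commands.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the complementary strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def base_counts(seq):
    """Count each of the four bases in a sequence."""
    seq = seq.upper()
    return {base: seq.count(base) for base in "ACGT"}

top = "GGGTGGGACCCCTTTCGGGGTCCTGTTCAA"
bottom = reverse_complement(top)
print(bottom)

# Summed over both strands, A pairs with T and G pairs with C,
# so the counts of A and T agree, as do the counts of G and C.
print(base_counts(top + bottom))

# An inverted repeat: this stretch and its reverse complement both occur
# in the top strand, so the single strand could fold into a stem loop.
stem = "GGACCCC"
print(stem in top, reverse_complement(stem) in top)
```

A sequence that equals its own reverse complement, such as GAATTC, reads the same on both strands; many restriction enzyme recognition sites have this property.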

Chromosomes had long been known to be the bearers of heredity. As the structure of DNA became more fully understood, a chromosome was redefined as a single genetically specific DNA molecule to which are attached a large number of proteins involved in maintaining structure and regulating gene expression. While RNA is the genetic material in some viruses, it is primarily involved in the translation of regions of DNA sequence into proteins. Some RNAs even serve as catalysts (ribozymes).

In addition to their main chromosomes, many bacteria contain large numbers of tiny circular DNA molecules with only a few thousand base pairs. These tiny chromosomes were found to carry genes that convey resistance to antibiotics and were not linked to the main chromosomes. Known as plasmids, these molecules resulted from the need for high numbers of enzymes to neutralize antibiotics without greatly increasing the size of the main chromosomes. Some plasmids replicate until a cell contains 10 to 200 copies of the plasmid. These relaxed-control plasmids are the ones used in cloning work.


Background Information on Nucleotide Sequence Data Entry

In order to work with nucleotide sequence data on the computer, its relevant information has to be entered in a form the machine and analysis software can recognize. When that data doesn't exist in any other source, then the user must input the data. Here at WSU, the VADMS Computing Resource supports the use of the Genetics Computer Group software suite (GCG) for protein and nucleotide sequence analysis tasks. Therefore, in order to have nucleotide sequence data in a format that can be used by this software, it should be entered via GCG's sequence entry program, SEQED. To better understand what happens in SEQED, here is some information about the standard way of organizing information in a primary sequence data file.

Three basic parts to a primary sequence entry exist. The first is the relevant reference or background information on the sequence. The nature of this section of the file depends on the database from which it was extracted or the verbosity of the person who created the file. If the file is from a database, there will be information on the name of the sequence, its source, its accession number, references and feature information. A sequence file that is not from a database may contain anything in its header section. The information placed there depends solely on the whims of its creator.

When you enter a sequence, think of the purpose for that data entry. If the data has the potential to be used for a long period of time, i.e., is necessary information for a research group or a lab, be detailed when you enter this information. Give all the important facts about the sequence that you currently have. You can change this part of the file later if more information becomes available on the sequence. If the data is only to be used in an exercise for this class, be brief with the reference information.

Located between the header and sequence sections is what GCG refers to as the checksum line. It contains the filename of the sequence file, the length of the sequence, the date the file was created, the type of data it is (P for protein and N for nucleotide), and a number. This number is used in GCG programs to see if any scrambling of the data has occurred for whatever reason. After the checksum number are two periods. GCG uses the location of these two periods to signal the end of nonsequence material in the file, and the beginning of the actual sequence information.

The last part of the sequence file is the actual sequence data itself. Normally the data is shown in blocks of ten, with fifty characters to a line. Each data line is preceded by the position in the sequence of the first character in that line. For ease in reading the data, a blank line is placed between each actual sequence line. Examples of actual sequence files from the various VADMS supported databases will be given in the next section.
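The three-part layout just described (header, checksum line, numbered sequence lines) can be pulled apart with a short Python sketch. This is not how GCG itself reads files; the function name is hypothetical, the sample text is invented, and the Check value in it is a placeholder rather than a real GCG checksum.

```python
import re

def parse_gcg(text):
    """Split GCG-style text into (header lines, checksum-line fields, sequence)."""
    lines = text.splitlines()
    # Assumption: the divider is the first line that ends with two adjacent periods.
    divider = next(i for i, line in enumerate(lines)
                   if line.rstrip().endswith(".."))
    header = lines[:divider]
    match = re.search(r"Length:\s*(\d+).*Type:\s*([NP])\s+Check:\s*(\d+)",
                      lines[divider])
    if match is None:
        raise ValueError("checksum line not recognized")
    info = {"length": int(match.group(1)),
            "type": match.group(2),
            "check": int(match.group(3))}
    # Sequence lines start with a position number; drop it and the
    # spaces that separate the blocks of ten.
    sequence = ""
    for line in lines[divider + 1:]:
        parts = line.split()
        if parts and parts[0].isdigit():
            sequence += "".join(parts[1:])
    return header, info, sequence

# Invented sample file; "Check: 1234" is a placeholder value.
example = """A 30-base practice sequence typed in for class.

  little.seq  Length: 30  July 19, 1995 09:22  Type: N  Check: 1234 ..

       1  GGGTGGGACC CCTTTCGGGG TCCTGTTCAA
"""

header, info, sequence = parse_gcg(example)
print(info)
print(sequence, len(sequence) == info["length"])
```

Comparing the stated length with the number of characters actually read is a quick sanity check, though it cannot replace the checksum that the GCG programs compute.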


Background Information on Nucleotide Primary Sequence Databases

Large databases containing nucleotide primary sequence information exist. Normally you check the databases to see if the information you want has already been entered and if it matches the sequence data you need or want to use. In order to do that, you need to know something about the way databases are organized and how to search them.

Nucleotide sequence databases have pointers to allow you to extract sequence information. A number of different terms are similar in name but different in purpose. A sequence's access code is composed of a group of 6 to 10 alphanumeric characters, depending on its database of origin.

When a sequence is deposited into a database, it is given an accession number. Accession numbers do not change as the data is absorbed into different databases. Often the best way to search for newly published sequences is by accession numbers because their final access code may not have been determined before the paper went to press. Accession number searching is possible through GCG's STRINGSEARCH. The accession number given at the time the sequence is deposited is known as the primary accession number. If the sequence was developed from work on earlier sequences, those numbers will also be given and they are known as secondary accession numbers.

VADMS supports the following primary sequence nucleotide databases: EMBL and GenBank. Of the two, the entries in EMBL have a more complete and logical organization method. GenBank is now the primary depository of nucleotide data in the world. EMBL is the database of choice for European investigators. EMBL uses a maximum of 8 characters in its access codes and GenBank uses 10.

An example of the GCG format of each of these databases is given on the next two pages. This is followed by an example of a cloning vector sequence from GenBank. If necessary the text has been modified to fit on a single page. The checksum line has been shown in bold to help you identify that line of the file.


An example file from the EMBL database.

DL;LECBPC     - Tomato chlorophyll a/b-binding protein gene Cab-1A, complete cds
ID   LECBPC standard; DNA; PLN; 535 BP.
XX
AC  M14445;
XX
DT   16-JUL-1988 (Rel. 16, Created)
DT   06-JUL-1989 (Rel. 20, Last updated, Version 1)
XX
DE   Tomato chlorophyll a/b-binding protein gene Cab-1A, complete cds.
XX
KW   chlorophyll binding protein.
XX
OS   Lycopersicon esculentum 
OC   Eukaryota; Plantae; Embryobionta; Magnoliophyta; Magnoliopsida;
OC   Asteridae; Solanales; Solanaceae.
XX
RN   [ 1 ]
RP   1-535
RA   Pichersky E., Bernatzky R., Tanksley S.D., Breidenbach R.B.,
RA   Kausch A.P., Cashmore A.R.;
RT   "Molecular characterization and genetic mapping of two clusters of
RT   genes encoding chlorophyll a/b-binding proteins in Lycopersicon
RT   esculentum (tomato)";
RL   Gene 40:247-258(1985).
XX
DR   SWISS-PROT; P14274; CB2A_LYCES.
XX
FH   Key            Location/Qualifiers
FH
FT   source         1. .535
FT                  /organism="Lycopersicon esculentum"
FT   CDS            38. .535
FT                  /note="chlorophyll a/b-binding protein Cab-1A"
XX
SQ   Sequence 535 BP; 125 A; 136 C; 131 G; 143 T; 0 other;

   Lecbpc  Length: 535  July 19, 1995 09:22  Type: N  Check: 5409 ..

       1  CCATAAAATA CTCAACACTT TTCTCTTAGT ATAAATCATG GCAGCTGCTG

      51  CAATGGCTCT TTCTTCCCCT TCATTTGCTG GACAGGCAGT CAAACTCTCA

     101  CCATCTGCCT CAGAAAATTC TGGAAATGGA AGGATCACTA TGAGAAAGGC

     151  TGTCGCCAAG TCTGCCCCAT CTAGCAGCCC ATGGAGCTTG GTCCATGCAC

     201  AAAGCATCTT GGCCATCTGG GCTTGCCAAG TTGTGTTGAT GGGAGCCGTT

     251  GAGGGATACC GCATTGCTGG TGGACCTCTT GGTGAGGTTG TCGACCCACT

     301  CTACCCCGGT GGCAGCTTCG ACCCATTAGG CCTTGCTGAA GACCCGGAGG

     351  CATTTGCTGA GCTTAAGGTT AAGGAGATCA AGAACGGCAG ACTTGCTATG

     401  TTCTCTATGT TTGGGTTCTT TGTTCAGGCC ATTGTTACCG GAAAGGGTCC

     451  ATTGGAGAAC CTTGCTGATC ACCTTGCAGA CCCTGTAAAC AACAACGCCT

     501  GGGCATTTGC CACAAACTTT GTTCCCGGAA AGTGA


An example file from the GenBank database.

DL;SYNHUMLYZ - Synthetic human lysozyme gene, complete cds
LOCUS SYNHUMLYZ 418 bp DNA SYN	12-JUN-1992
DEFINITION Synthetic human lysozyme gene, complete cds.
ACCESSION D00413
KEYWORDS    chemical synthesized oligomer; lysozyme; synthetic DNA.
SOURCE      Chemical synthesized oligomer DNAs, clone pHLY1.
  ORGANISM  Artificial gene
            Artificial sequences; Genes.
REFERENCE	1 (bases 1 to 418)
  AUTHORS	Muraki,M., Jigami,Y., Tanaka,H., Kishimoto,F., Agui,H., Ogino,S. and Nakasato,S.
  TITLE		Expression of synthetic human lysozyme gene in Escherichia coli
  JOURNAL	Agric. Biol. Chem. 49, 2829-2831 (1985)
COMMENT		The coding region was designed using the most frequently used
				codons in the genes of abundant proteins in yeast.
				NCBI gi: 220949
FEATURES 	         Location/Qualifiers
     source	         1. .418
		         /organism="Artificial gene"
     CDS	         22. .417
		         /note="lysozyme; NCBI gi: 220950"
		         /codon_start=1

		         /translation="MKVFERCELARTLKRLGMDGYRGISLANWMCLAKWESGYNTRAT
		         NYNAGDRSTDYGIFQINSRYWCNDGKTPGAVNACQLSCSALLQDNIADAVACAKRVVR
		         DPQGIRAWVAWRNRCQNRDVRQYVQGCGV"
BASE COUNT	    105 a 85 c 111 g 117 t
ORIGIN	BamHI site.

 Synhumlyz  Length: 418  July 19, 1995 09:19  Type: N  Check: 9586 ..

       1  GATCCGTTAG GAGTTTAATC GATGAAGGTT TTCGAACGTT GTGAATTGGC

      51  CAGAACTTTG AAGAGATTGG GTATGGACGG TTACCGTGGT ATCTCTTTGG

     101  CTAACTGGAT GTGTTTGGCC AAGTGGGAAT CTGGTTACAA CACTAGAGCT

     151  ACTAACTACA ACGCCGGTGA CCGTTCTACT GACTACGGTA TCTTCCAAAT

     201  TAACTCTAGA TACTGGTGTA ACGACGGTAA GACTCCAGGC GCCGTTAACG

     251  CCTGTCAGTT GTCTTGTTCT GCTTTGTTGC AAGACAACAT CGCTGACGCC

     301  GTTGCCTGTG CTAAGAGAGT CGTTAGAGAC CCACAAGGTA TCAGAGCTTG

     351  GGTCGCTTGG CGTAACAGAT GTCAAAACAG AGACGTCAGA CAATACGTTC

     401  AAGGTTGTGG TGTCTAAT


A typedata screen trace of a cloning vector sequence from GenBank.

LOCUS       SYNPBR322    4361 bp    DNA   circular  SYN 29-JUN-1994
DEFINITION  Cloning vector pBR322, complete genome.
ACCESSION   J01749 K00005 L08654 M10282 M10283 M10286 M10356 M10784 M10785
            M10786 M33694 V01119
NID         g208958
KEYWORDS    ampicillin resistance; beta-lactamase; cloning vector;
            drug resistance protein; origin of replication; plasmid;
            tetracycline resistance.
SOURCE      Cloning vector plasmid pBR322 from E.coli; pBR322 DNA in pXf3 [4].
  ORGANISM  Cloning vector
            Artificial sequences; Cloning vehicles.
REFERENCE   1  (bases 1 to 3; 3259 to 4361)
  AUTHORS   Sutcliffe,J.G.
  TITLE     Nucleotide sequence of the ampicillin resistance gene of
            Escherichia coli plasmid pBR322
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 75, 3737-3741 (1978)
  MEDLINE   79012484

////////////////////////////////////////////////////////////////////////////

FEATURES             Location/Qualifiers
     source          1. .4361
                     /organism="Cloning vector"
                     /sub_species="Cloning vector pBR322"
                     /sequenced_mol="DNA"
                     /tissue_lib="ATCC 31344, ATCC 37017"

////////////////////////////////////////////////////////////////////////////

      CDS             86. .1276
                     /gene="tet"
                     /codon_start=1
                     /transl_table=11
                     /product="tetracycline resistance protein"  

////////////////////////////////////////////////////////////////////////////

      CDS             complement(3293. .4153)
                     /gene="bla"
                     /note="E-286"
                     /codon_start=1
                     /transl_table=11
                     /product="beta-lactamase" 
////////////////////////////////////////////////////////////////////////////

BASE COUNT      983 a   1210 c   1134 g   1034 t
ORIGIN      EcoRI site.

 SYNPBR322  Length: 4361  June 24, 1996 13:44  Type: N  Check: 5483 ..

       1  TTCTCATGTT TGACAGCTTA TCATCGATAA GCTTTAATGC GGTAGTTTAT

      51  CACAGTTAAA TTGCTAACGC AGTCAGGCAC CGTGTATGAA ATCTAACAAT

////////////////////////////////////////////////////////////////////////////

    4351  GTCTTCAAGA A


Background Information on Restriction Enzymes and Simple Cloning Techniques

In order to work well with DNA sequences, it is necessary to cleave these sequences at specific points. Restriction enzymes allow you to do this. The fragments created by digesting a DNA sequence with a restriction enzyme can be easily separated by using an agarose gel. The most useful restriction enzymes are those that have rare recognition sites, therefore producing a small number of fragments that can be easily separated from one another on agarose gels.

Several different types of restriction enzyme recognition sites exist. Some enzymes cut both the reading and the complement DNA strands at the same place. These are known as blunt end cutters. Blunt end fragments do not stick together well. Other enzymes produce staggered cuts in the two strands, producing sticky ends. Fragments from any source with sticky ends produced by the same restriction enzyme can stick together and later be permanently joined together with DNA ligase.

Here is a small list of the many possible restriction enzymes. Each entry gives, in order: the name of the restriction enzyme; the offset of the cutting site from the beginning of the recognition pattern on the top (5 prime) strand; the recognition pattern, with the top-strand cut site marked by a ' and the bottom-strand cut site marked by an underscore; and the overhang, that is, the distance from the top-strand cut site to the point where the bottom (3 prime) strand is cut. An overhang of 0 marks a blunt end cutter, for which only the ' is shown. In the example below, data is given on thirty enzymes. Notice that these recognition sites are not always simple patterns. Many of these enzymes use DNA ambiguity codes to define their cut sites.

AatII   5 G_ACGT'C       -4  HhaI    3 G_CG'C    -2  PvuII   3 CAG'CTG   0
AccI    2 GT'mk_AC        2  HincII  3 GTy'rAC    0  Sau3AI  0 'GATC_    4
AluI    2 AG'CT           0  HindIII 1 A'AGCT_T   4  ScaI    3 AGT'ACT   0
BamHI   1 G'GATC_C        4  HinfI   1 G'AnT_C    3  SmaI    3 CCC'GGG   0
BglI    7 GCCn_nnn'nGGC  -3  HpaI    3 GTT'AAC    0  SphI    5 G_CATG'C -4
BglII   1 A'GATC_T        4  KpnI    5 G_GTAC'C  -4  SspI    3 AAT'ATT   0
EcoRI   1 G'AATT_C        4  NcoI    1 C'CATG_G   4  StuI    3 AGG'CCT   0
EcoRV   3 GAT'ATC         0  NotI    2 GC'GGCC_GC 4  TaqI    1 T'CG_A    2
HaeII   5 r_GCGC'y       -4  PstI    5 C_TGCA'G  -4  XbaI    1 T'CTAG_A  4
HaeIII  2 GG'CC           0  PvuI    4 CG_AT'CG  -2  XhoI    1 C'TCGA_G  4
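The notation in the list above can be checked mechanically. The Python sketch below derives the offset and overhang from a pattern such as G'AATT_C and then looks for cut positions in a sequence. It is an illustration only and is not how the GCG mapping programs are implemented; the function names are hypothetical, and the ambiguity table covers only the codes that appear in this list. The test sequence is the first 40 bases of the pBR322 entry shown earlier.

```python
import re

# Ambiguity codes needed for the enzymes listed above (a subset of the full set).
AMBIGUITY = {"A": "A", "C": "C", "G": "G", "T": "T",
             "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
             "N": "[ACGT]"}

def parse_site(pattern):
    """Return (recognition sequence, top-strand offset, overhang)."""
    bare = pattern.replace("'", "").replace("_", "").upper()
    offset = pattern.replace("_", "").index("'")
    if "_" in pattern:
        bottom = pattern.replace("'", "").index("_")
    else:
        bottom = offset  # blunt end cutters show only the ' mark
    return bare, offset, bottom - offset

def find_cuts(seq, pattern):
    """Return the number of bases to the left of each top-strand cut."""
    bare, offset, _ = parse_site(pattern)
    regex = "".join(AMBIGUITY[ch] for ch in bare)
    # The lookahead allows overlapping sites to be reported as well.
    return [m.start() + offset
            for m in re.finditer("(?=" + regex + ")", seq.upper())]

print(parse_site("G'AATT_C"))   # EcoRI
print(parse_site("G_ACGT'C"))   # AatII
print(parse_site("GT'mk_AC"))   # AccI

# First 40 bases of the pBR322 entry shown earlier in this handout.
pbr322_start = "TTCTCATGTTTGACAGCTTATCATCGATAAGCTTTAATGC"
print(find_cuts(pbr322_start, "A'AGCT_T"))  # HindIII
```

This sketch treats the sequence as linear and searches the top strand only, which is enough for sites that read the same on both strands.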

The simplest form of cloning is cutting out a fragment with a single restriction enzyme that cuts the DNA so that the desired section of the sequence lies between two of its cut sites. This fragment is then isolated and inserted into a plasmid that contains only a single cut site for the enzyme used to produce the original fragment. When this type of cloning doesn't work, more complex cloning strategies have to be used.
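The single-enzyme strategy can be sketched on plain strings in Python. This is a simplified illustration, not the SEQED procedure used in the exercise: it handles the top strand only, writes the circular plasmid as a linear string that does not begin inside the site, and ignores the orientation of the insert. The function names and the toy sequences are invented.

```python
def cut_linear(seq, site, offset):
    """Cut a linear sequence at every occurrence of site.

    offset is the top-strand cut position within the site (1 for G'AATT_C).
    """
    fragments, start = [], 0
    pos = seq.find(site)
    while pos != -1:
        fragments.append(seq[start:pos + offset])
        start = pos + offset
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    return fragments

def clone(source, plasmid, site, offset):
    """Move the fragment between the first two sites of source into plasmid."""
    pieces = cut_linear(source, site, offset)
    if len(pieces) < 3:
        raise ValueError("need two cut sites flanking the region of interest")
    if plasmid.count(site) != 1:
        raise ValueError("plasmid must contain exactly one cut site")
    insert = pieces[1]
    cut = plasmid.find(site) + offset
    return plasmid[:cut] + insert + plasmid[cut:]

# Toy sequences invented for illustration; EcoRI is G'AATT_C.
source = "CCGAATTCAAATTTGAATTCGG"
plasmid = "TTTTGAATTCAAAA"
recombinant = clone(source, plasmid, "GAATTC", 1)
print(cut_linear(source, "GAATTC", 1))
print(recombinant)
```

Because both ends of the insert were made by the same enzyme, the recognition site is regenerated at each junction of the recombinant.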

Exercise for week 6

This series of exercises will help you enter DNA primary sequence data directly from the keyboard, work with sequences from the databases, read a sequencing gel and perform a simple computerized cloning task. Enter instructions in bold followed by pressing the ENTER key. The <rtn> symbol given in program examples means to press the ENTER key as well.

1) Activate the computer.

Activate the machine you want to use, make connections with ribozyme and log into your account.


2) Move to this week's subdirectory and copy over to it the necessary files.

% cd six

Now copy over all the files needed to do this week's exercise. They are located in the directory location $UGRAD_DIR/week6.

% cp $UGRAD_DIR/week6/* .


3) Run the demo that describes this week's activities.

This week's demo is unlike those of the past: it runs on ribozyme and is composed of text files with a few graphical additions. The demo is self-pacing in that it has a number of pause statements in the body of the demo which you can use to control the flow of information. GCG is automatically started and the graphics device type set in the demo. You will need to issue the gcg and tek_plot commands later to have them available in your computer session. To start this process enter the following command.

% demo6

Background information is given on the nature of the demo you are viewing. GCG is activated and the graphics device type is set. A simple graphics file introduces the subjects to be covered in this week's activities: data entry, restriction enzymes and manual cloning. The reasoning behind the use of the SETKEYS program to redefine your keyboard is given, along with an example of the use of the SEQED program to enter a nucleotide sequence.

Take care when entering nucleotide sequences. A small mistake can greatly affect the results of analyses you run on generated sequences. One way to check the accuracy of your work when you are entering a known sequence is to run a comparison of the data that you entered with that of the known sequence. This is done by running the GCG program GAP. An example of running GAP and displaying its output is given.

Nucleotide sequence data can be obtained in a number of ways. One method is to request the data from a server. The resulting file has to be modified to be compatible with the GCG software. Another way of obtaining nucleotide sequence data is to read the information directly off a sequencing gel. You will be shown a simple version of such a gel and given basic instructions on how to read it. The sequence data you read from the gel is known data and can be checked for accuracy by running the GAP program.

You will work with restriction enzymes this week. You will generate a map of a classic plasmid. An example of what such a map looks like is given. By using the program MAPSORT you will gather data on the pUC19 plasmid and a nucleotide sequence with a given region of interest. With the collected data and through the use of SEQED, you will perform a manual cloning. You will search additional nucleotide sequences of interest with these tools to see if they can be cloned into pUC19 as well.


4) Entering DNA sequencing data into the computer.

In the week 3 exercise, you entered a protein sequence into the computer. Protein sequences are best entered using the normal terminal keyboard, since you need so many of the keys to provide the necessary one-letter codes for the 20 most commonly used amino acids. DNA sequences, however, work with a much more limited set of codes, and are best handled by redefining the keyboard to put all the needed keystrokes into a convenient section of the keyboard for one-handed data entry.

Redefining a keyboard can create problems; the other keys no longer work in their normal fashion. Avoid this by setting up a subdirectory containing a file that redefines the keyboard. The keyboard is only redefined in this special place, not for the entire account. You can then do your nucleotide sequence entry in this special subdirectory and not affect terminal operation in the rest of your account. GCG's SEQED looks to see if this special redefining file is present and acts accordingly.

A subdirectory for your nucleotide sequence entry work has been created in this week's subdirectory location with the name dna_entry. Enter the following command to move over into that subdirectory.

% cd dna_entry

Activate the GCG software suite by entering gcg. The welcome message for the software system appears on the screen.

% gcg

Now that you are in the special subdirectory for nucleotide data entry with the GCG system active, create the file that redefines the keyboard. Use the GCG program SETKEYS. SETKEYS asks you for the keys to use for the four bases and three common ambiguity codes plus a delete key. A file called set.keys is then created to contain this information. This file can be further edited if you need to have more keys defined at a later date. Use the example given below to prepare yourself to use this program. Give some thought as to how you would prefer the keys to be assigned prior to running the program. Some keys are not allowed to be used. The keys , and / are two of these. It may be necessary to select a remapping scheme and then go into SEQED and see that there aren't any warning messages from the scheme you have come up with. The example given below contains acceptable key reassignments. Record how you set it somewhere in your notes so that you can refer to it later in the class and use it again for entering other DNA sequences. User input is shown in bold type.

% setkeys

SETKEYS writes a file in your directory that redefines your 
keyboard's keys for sequence entry with the programs SEQED,
LINEUP, GELENTER, and GELASSEMBLE. The output file, called 
Set.Keys, can be edited if you want to use keys that were not defined in
your interactive session with SETKEYS.

Choose key(s) for each nucleotide:

What key(s) should mean G ?  j <rtn>
What key(s) should mean A ?  k <rtn>
What key(s) should mean T ?  l <rtn>
What key(s) should mean C ?  ; <rtn>

Now choose key(s) for the common ambiguity codes:

What key(s) should mean R ?  i <rtn>
What key(s) should mean Y ?  o <rtn>
What key(s) should mean N ?  p <rtn>
What key(s) should mean <Delete> ?  ' <rtn>
SetKeys complete: output file is "/disk3/.../expxx/six/dna_entry/set.keys".
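To see what a key redefinition accomplishes, here is a small Python sketch that translates typed keys into bases. It is only a model of the idea, not a description of how SEQED reads the set.keys file. It assumes the right-hand home-row assignment j, k, l and ; for G, A, T and C, with i, o and p for the ambiguity codes and ' for delete; the function name is hypothetical.

```python
# Assumed key assignments (right-hand home row); adjust to match your own.
KEYMAP = {"j": "G", "k": "A", "l": "T", ";": "C",
          "i": "R", "o": "Y", "p": "N"}
DELETE_KEY = "'"

def decode_keystrokes(keys):
    """Translate a string of typed keys into a nucleotide sequence."""
    sequence = []
    for key in keys:
        if key == DELETE_KEY:
            if sequence:
                sequence.pop()  # remove the base to the left of the cursor
        elif key in KEYMAP:
            sequence.append(KEYMAP[key])
        else:
            raise ValueError("key %r has no assignment" % key)
    return "".join(sequence)

print(decode_keystrokes("jjjljjjk;;"))   # first ten bases of the practice sequence
print(decode_keystrokes("jk'l;"))        # the A is deleted before T and C are typed
```

Keeping all four bases under one hand is the point of the remapping: the other hand is free to follow the sequence on paper.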

With the keyboard now ready, enter a typical nucleotide sequence. Most modern labs can easily produce reading gels containing between 300 and 500 bases. However, to start with, just enter a 30-base sequence using the SEQED program. Refer to the earlier protein examples or to the manual for assistance before calling your instructor for help. Give the sequence the filename little.seq. Remember to use your new key assignments.

GGGTGGGACCCCTTTCGGGGTCCTGTTCAA

% seqed little.seq

When the sequence is completely entered, exit from the program and examine the file using type. Is the sequence 30 bases long? Does the sequence entered match that given above? If not, go back and correct it. Get help from your instructors if you have problems.

Given below is a nucleotide sequence of 300 bases to enter. Use the SEQED program to enter this sequence. Refer to the earlier protein example or to the manual for assistance before calling for help. Give the sequence the filename unknown.seq.

	  ACAACCGGCCCAACGACTCGATGAGGGAACTTTGGACACACTCGCAGCTC
	  ACAGGTGAACGATATGGCTCCAAGAAGAGTGTAGCCATCCTGACCAGCGG
	  TGTGACAGCCGGCGCCGCCGAGGAATTTACTTACATCATGAAGAGGCTGG
	  GCCGGGCCCTGGTCGTTGGTGAAGTGACAAGTGGAGGCTGCCAGCCACCA
	  CAGACCTACCACGTGGACGACACGCATCTCTATATCACCATCCCCACAGC
	  TCGCTCTGTGGGCGCCACGGACGGCAGTTCCTGGGAAGGGGTGGGTGTGA

% seqed unknown.seq

When the sequence is completely entered, exit from the program and examine it. There is always one nagging problem with nucleotide data entry: the sequences are so long that it is easy to make a mistake that could throw later analysis efforts off. Bases can be dropped or extra copies of correct ones inserted, so just seeing that the length is correct is not enough. One way to get around this is to double enter a sequence blindly. SEQED can be run in a checking mode in the following manner:

% seqed unknown.seq

When the unknown sequence is displayed on the screen, get into the command mode by entering Ctrl-d. At the colon prompt enter Check /Blind. When the colon returns again after the sequence has disappeared, press the ENTER key to get into the entry mode. With the cursor now above the top number line, re-enter the 300 bases. When a base is entered that doesn't agree with the one entered previously in that position, a beep will occur and a ^ will appear under the position in question. Use the arrow keys to toggle back and forth between the two lines to see if you can spot the problem, and determine which of the lines is correct. The only line shown on the screen is the one currently being looked at. To use your delete key, position the cursor at the right of the character to be removed. Even this checking won't remove all errors if care is not taken in referring between the original data and the checking lines. Corrections made in the original blind sequence are what will be saved when the sequence is rewritten to an output file.
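The idea behind blind double entry can be modeled in Python as a position-by-position comparison of the two entries. This sketch only imitates the ^ markers described above; it is not SEQED's own code, and the function name is hypothetical.

```python
def check_entries(first, second):
    """Return a marker line with ^ under each position where the entries differ."""
    length = max(len(first), len(second))
    a = first.ljust(length)   # pad the shorter entry with spaces
    b = second.ljust(length)
    return "".join(" " if x == y else "^" for x, y in zip(a, b))

original = "GGGTGG"
substituted = "GGCTGG"   # one wrong base
dropped = "GGTGG"        # one base left out

print(original)
print(check_entries(original, substituted))
print(original)
print(check_entries(original, dropped))
```

A single substituted base gives one marker, while a dropped or extra base shifts everything after it and tends to produce a run of markers. That pattern is a useful clue when deciding which of the two entries is wrong.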

Once you are satisfied with your work, make a copy of this file in the sub-directory one level back along your directory tree by using the following instructions. With that task finished, move back up to that level of your account using the GCG term up.

% cp unknown.seq ../

% up

Now that you have the file, how accurate is it? The correct sequence is contained in a file known by the name of checking.seq. To determine the accuracy of your work, use the GCG program GAP to compare your sequence to this file. The workings of this program will be explained in detail in week 10. For now, just know that the program provides a means of comparing two sequences and determining how similar they are to one another. The higher the reported percentage of similarity, the better the two sequences agree with one another. A similarity value of 100% means that the sequences are identical.

Given below is a guide to use when running the GAP program. User input is shown in bold type.

% gap

GAP uses the algorithm of Needleman and Wunsch to find the alignment of
two complete sequences that maximizes the number of matches and minimizes
the number of gaps.

GAP of what sequence 1 ?  unknown.seq <rtn>

                 Begin (* 1 *) ? <rtn>
               End (*   300 *) ? <rtn>
              Reverse (* No *) ? <rtn>

to what sequence 2 (* unknown.seq *) ?  checking.seq <rtn>

                 Begin (* 1 *) ? <rtn>
               End (*   300 *) ? <rtn>
              Reverse (* No *) ? <rtn>

What is the gap weight (* 5.00 *) ?  <rtn>

What is the gap length weight (* 0.30 *) ? <rtn>

What should I call the paired output display file (* unknown.pair *) ? <rtn>

Aligning ...........-..
Information on the quality of the gapping is then displayed.

%

Record on the next page the % similarity value from your GAP run. If your results were not 100% there are some problems with your sequence. If the similarity value is 100%, congratulations!

% similarity for GAP run: _________________________________________________

Type off the results of the program to see what its output looks like and to locate any trouble spots. The resulting output file shows both sequences (they appear in the order that they were entered into the program). Your unknown sequence will be on the top of each set of compared sequences. The data is shown in lines of 50 bases each with a comparison line between each compared 50-base region. Where a perfect match occurs between the two sequences, there will be a | symbol in this comparison line. A blank space shows a mismatch.

% cat unknown.pair
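For those curious about what a Needleman and Wunsch style comparison involves, here is a minimal Python sketch of a global alignment that prints a comparison line of | symbols. It is not GAP itself: GAP uses a separate gap weight and gap length weight, while this sketch assumes a single fixed penalty per gap position, and the match, mismatch and gap values are arbitrary choices. The percentage reported is simple identity over the aligned length, which may not be calculated the same way as GAP's similarity figure.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Align two sequences end to end; return (top, comparison line, bottom)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + step,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        step = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + step:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("."); i -= 1
        else:
            top.append("."); bottom.append(b[j - 1]); j -= 1
    top = "".join(reversed(top))
    bottom = "".join(reversed(bottom))
    middle = "".join("|" if x == y else " " for x, y in zip(top, bottom))
    return top, middle, bottom

def percent_identity(top, bottom):
    """Percentage of aligned columns in which the two sequences agree."""
    if not top:
        return 0.0
    return 100.0 * sum(x == y for x, y in zip(top, bottom)) / len(top)

top, middle, bottom = global_align("GGGTGG", "GGTGG")
print(top)
print(middle)
print(bottom)
print(round(percent_identity(top, bottom), 1))
```

Unlike a position-by-position check, an alignment can open a gap where a base was dropped, so the bases after the error line up again.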

Entering nucleotide sequence data is not easy. The larger the sequence, the greater the chance for errors in the sequence. It has been estimated that sequence errors are as high as 1% in the GenBank database.


5) Converting data received directly from a server.

Sometimes it is necessary to get data files directly from a database server. The database servers on the networks have their information updated every evening while VADMS' sequence databases are updated bi-monthly or quarterly. Therefore, a newly published paper may refer to an accession number or access code for a sequence that is not locally available. Assume for the purposes of this section that you have discovered that the needed sequence is in GenBank and its access code is M31742. Use the example given below to submit a request for this sequence. User input is shown in bold type. Refer back to the exercise for week 2 for pine mailer information if you have problems.

% pine

Enter c to allow you to compose a message to be sent off.

   To      :retrieve@ncbi.nlm.nih.gov <rtn>
   Cc      :<rtn>
   Attchmnt:<rtn>
   Subject :<rtn>
----- Message Text -----

///////////////   empty space until the bottom of the screen ///////////////

^G Get Help  ^X Send      ^R Rich Hdr  ^Y PrvPg/Top ^K Cut Line  ^O Postpone
^C Cancel    ^D Del Char  ^J Attach    ^V NxtPg/End ^U UnDel Line^T To AddrBk

Enter in the following lines as the actual message to be sent.

   datalib genbank <rtn>
   begin  <rtn>
   m31742  <rtn>
   <rtn>

Enter ^x (Ctrl-x) to allow you to send off your request, followed by y to confirm that you really want to send it off. Enter q to quit the pine mailer program and confirm this decision with a y to the quitting prompt.

The speed of your response depends on the time of day the request was submitted and how busy the networks are. If you do not receive a return mail message in a few minutes, use the file called get_seq.seq for the next part of the exercise. Information that comes from this server over the networks is sent as mail messages. In order to use this data, go into pine, read the message and extract it into a file. Once you are in pine and reading a message, you can extract it by entering e and answering the prompt with lastname.get. Because of the way in which your account is set up, the extracted mail message will be in your current directory location. After exiting pine, you can work with the file in the pico editor to get it into usable shape.

% pico lastname.get

Do the following to the created file. Remove the mailing header information at the top of the file. In this case, it means removing all the lines prior to the one starting with LOCUS. Then move down in the file to where the actual sequence data is given. Between the ORIGIN line and the sequence data should be a line containing just two periods, .., to assist the GCG reformatting process. Next note that the feature information above this section also contains some ".." notations. This will confuse the REFORMAT program. Therefore, add a space between the two periods at each of these points of confusion. At the very end of the file is a line with two slashes. Remove this line. With the file so edited, exit the editor and go through the reformatting process on this modified file as shown below.
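The same four edits can be written out as a Python sketch, which may help you check that you have done each of them in pico. It is an illustration only: the function name is hypothetical, and the sample mail text is invented and much shorter than a real reply from the server.

```python
import re

def prepare_for_reformat(text):
    """Apply the four manual edits described above to an extracted mail message."""
    lines = text.splitlines()
    # 1. Drop the mail header: everything before the line starting with LOCUS.
    start = next(i for i, line in enumerate(lines) if line.startswith("LOCUS"))
    out = []
    in_sequence = False
    for line in lines[start:]:
        if line.strip() == "//":
            continue  # 4. remove the line with two slashes
        if in_sequence:
            out.append(line)
            continue
        # 3. Put a space between adjacent periods above the sequence data.
        out.append(re.sub(r"\.(?=\.)", ". ", line))
        if line.startswith("ORIGIN"):
            out.append("..")  # 2. divider line between ORIGIN and the sequence
            in_sequence = True
    return "\n".join(out) + "\n"

# Invented sample message for illustration only.
sample = """From: retrieve@ncbi.nlm.nih.gov
Subject: reply

LOCUS       TESTSEQ  12 bp  DNA
FEATURES             Location/Qualifiers
     source          1..12
ORIGIN
        1 ggatccgaat tc
//
"""

print(prepare_for_reformat(sample))
```

After these edits the only line containing two adjacent periods is the divider, which is what lets the reformatting step tell the annotation from the sequence.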

REFORMAT is an interesting program that can be run in a variety of ways. Here you will use the simplest aspects of the program. The program will ask you for the name of the file to work with. If the file is OK, there will be no error messages from the reformatting process. If you get one, go back and revise the file and repeat this process until there are no error messages. User input is shown in bold type. Note that the sequence you received from the server is all in lower case letters. To convert it to upper case use the -upper command switch.

% reformat

REFORMAT rewrites sequence file(s), symbol comparison table(s), or enzyme 
data file(s) so that they can be read by GCG programs.

REFORMAT what sequence file(s) ? lastname.get <rtn>

         [information is given here on the length of the sequence]
%

If no problems are reported in the reformatting process and upon typing the file off it looks like a regular GCG sequence file, your efforts have been successful. If something is wrong, get help from your lab instructor prior to continuing on to the next section of this exercise.

More information on using servers can be found in a handout in your carrel drawer. Refer to it in the future if you need to use this service to get needed sequences.


6) Entering data from a pseudo gel.

Normally in a lab situation an individual reads a sequence directly off a gel. To do this, one needs to know the orientation of the gel, how the lanes were set up etc., and then the gel is read from the bottom to the top. To give you a flavor of this type of data, an autorad has been obtained, and a paper version of the data has been created from an actual gel and is included in this exercise booklet. Use the paper version of this autorad data to read in the first 100 bases of the sequence.

To do this most effectively, move back to the directory location where you established the set keys definitions. Refer to the earlier section of this week's exercise to refresh your memory on how you defined the keyboard.

% cd data_entry

Now start up SEQED. Enter the first 100 bases of the paper version of the autoradiogram. Name your output file (your lastname).gel . Refer back to the instructions in section 4 on using SEQED for this purpose.

You are working with known data. The sequence has been given the name gelx.seq. To see how accurate you have been in reading this gel, do a similarity check on the two sequences with the GCG program GAP. Copy the (your lastname).gel file to the sub-directory one layer up and then return to that sub-directory one layer up with the term up. Respond according to the example given below. User input is shown in bold type.

% cp (your lastname).gel ../

% up

% gap

GAP uses the algorithm of Needleman and Wunsch to find the alignment of
two complete sequences that maximizes the number of matches and minimizes 
the number of gaps.

 GAP of what sequence 1 ?  gelx.seq <rtn>

                  Begin (* 1 *) ? <rtn>
                End (*   109 *) ? <rtn>
               Reverse (* No *) ? <rtn>
 to what sequence 2 (* gelx *) ?  (your lastname).gel <rtn>

                  Begin (* 1 *) ? <rtn>
                End (*   100 *) ? <rtn>
               Reverse (* No *) ? <rtn>

 What is the gap weight (* 5.00 *) ?  <rtn>

 What is the gap length weight (* 0.30 *) ? <rtn>

 What should I call the paired output display file (* gelx.pair *) ? <rtn>

Aligning .....-.
Information on the quality of the gapping is then displayed.

%

Examine the results of the gapping process by looking at the output file, gelx.pair and finding out how well you read in the gel. The problems with getting your data accurate with nucleotide sequencing are so great that many labs have more than one person working with the same material and doing numerous gels just to make sure of their data.
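To see what a global alignment of this kind does, here is a minimal Python sketch of a Needleman and Wunsch style alignment. It is not the GCG implementation: it assumes a score of 1 for a match, 0 for a mismatch and a flat penalty of 1 per gap position, rather than GAP's gap weight and gap length weight, and the two short sequences are made up.

```python
def global_align(a, b, match=1, mismatch=0, gap=-1):
    """Align two complete sequences; return the two gapped strings."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: each cell is the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom right corner to recover one best alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


def percent_identity(top, bottom):
    """Percentage of alignment columns in which both sequences agree."""
    matches = sum(1 for x, y in zip(top, bottom) if x == y and x != "-")
    return 100.0 * matches / len(top)


known, typed = "GATTACA", "GATACA"   # made-up stand-ins for gelx.seq and your file
top, bottom = global_align(known, typed)
print(top)
print(bottom)
print(round(percent_identity(top, bottom), 1))
```

A missed base in your reading shows up as a gap character in the second line, and a misread base as a mismatched column.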


7) Determining information from a cloning plasmid.

One common thing to do with a nucleotide sequence is to determine where various restriction enzymes cut it. There are numerous restriction enzymes. Some are the private property of drug firms. Others are widely available. The default enzyme.dat file that comes with GCG has almost 200 enzymes listed in it. Nothing is more frustrating than finding a restriction enzyme that does just what you want, only to discover that you can't get any of it. In this week's subdirectory is a more realistic list of enzymes, those that probably every sequencing lab on this campus has in a refrigerator.

Look at this file to see how complex a pattern various restriction enzymes can look for. Remember that there are a number of ambiguity codes for bases. Review your plastic encased reference sheet for these codes.

% cat enzyme.dat

Here is a small excerpt from the file. Each line gives, in order: the name of the restriction enzyme; the offset of the cutting site from the beginning of the recognition pattern on the top strand; the recognition pattern, with the top strand cut site marked by a ' symbol; and the overhang, which is the distance from the cut site on the top strand to the point where the bottom strand is cut. An underscore in the pattern marks the cut site location on the bottom strand. In the example given below, two of the restriction enzymes have bottom cut sites that are different from the top ones. The advantage of such sticky ends is that they orient inserted fragments in the cloning process. Blunt end cutters such as SspI and StuI can result in products which do not have the desired insert orientation.

		SspI       3 AAT'ATT       0
		StuI       3 AGG'CCT       0
		TaqI       1 T'CG_A        2
		XbaI       1 T'CTAG_A      4

Look at a classic plasmid, pBR322. This plasmid is found in the GenBank database and has the access code synpbr322. You might have also noticed that its sequence was given as the vecbase example data file in the front of this handout. pBR322 was once a very popular plasmid, but it had limitations. Two features of note are the ampicillin and the tetracycline resistance gene areas. These allowed for testing of a successful cloning of a gene fragment into either of these two areas by simple wet lab techniques.

The simplest form of cloning is to cut out a gene fragment with one restriction enzyme and to use the same enzyme to cut the plasmid for insertion. To be effective, it is necessary for the chosen restriction enzyme to cut the vector only once. Determine how many of our working set of restriction enzymes cut pBR322 and where. To do this run the program MAPSORT with the command switch -once to restrict the results to single cutters and the command switch -cir to have the sequence be circular. Follow the example given below. Use your lastname for the name of the produced mapsort file.

% mapsort -once -cir

MAPSORT finds the coordinates of the restriction enzyme cuts in a
DNA sequence and sorts the fragments of the resulting digest by size.
MAPSORT can sort the fragments from a single or multiple enzyme
digests.

 (Circular) MAPSORT of what sequence ? gb:synpbr322 <rtn>

             Begin (* 1 *) ? <rtn>
              End (*  4363 *) ? <rtn>

   *** I read your enzyme data file "enzyme.dat"!! ***

   Select the enzymes: Type nothing or "*" to get all enzymes. Type "?"
   for help on what enzymes are available and how to select them.

                                      Enzyme(* * *): <rtn>

   [This selects all the enzymes present in your refrigerator list.]

   What should I call the output file (* synpbr322.mapsort *) ? <rtn>

   Mapping ...

%

Rename the output of this program to (your lastname).mapsort and then print it off on the lab's printer using lpr.

% mv synpbr322.mapsort (your lastname).mapsort

% lpr (your lastname).mapsort

Look at the results of this program. The file contains a listing of those restriction enzymes from the enzyme.dat that cut the pBR322 plasmid only once and where that cut occurs along the sequence. At the end of the file is a listing of those enzymes from the enzyme.dat file that don't cut the sequence and those that were excluded due to multiple cuts. Note that only some of the 30 possible enzymes cut this sequence. Record the names of these cutting enzymes and their respective cut sites below.

______________________________________________________________________________

______________________________________________________________________________

______________________________________________________________________________

______________________________________________________________________________

______________________________________________________________________________

______________________________________________________________________________

______________________________________________________________________________
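The idea behind a single cutter search on a circular sequence can be sketched in Python as follows. This is an illustration, not how MAPSORT is written. It assumes that positions are reported as the 1-based start of the recognition pattern, that only the top strand needs to be searched (true for palindromic sites), and it uses a made-up 15 base circle rather than pBR322.

```python
import re

# Standard IUPAC ambiguity codes for bases, written as regular expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
    "S": "[CG]", "W": "[AT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_sites(sequence, site, circular=True):
    """Return the 1-based start of every occurrence of site in sequence."""
    seq = sequence.upper()
    # On a circle a site may run across the origin, so the first few bases
    # are appended to the end before searching.
    searched = seq + seq[:len(site) - 1] if circular else seq
    # The lookahead lets overlapping occurrences be found as well.
    regex = re.compile("(?=" + "".join(IUPAC[base] for base in site.upper()) + ")")
    return [m.start() + 1 for m in regex.finditer(searched)]


toy_plasmid = "TTCAAGGCCTAAGAA"        # invented circular sequence
sites = {"EcoRI": "GAATTC", "StuI": "AGGCCT", "TaqI": "TCGA"}

cuts = {name: find_sites(toy_plasmid, site) for name, site in sites.items()}
single_cutters = {name: pos[0] for name, pos in cuts.items() if len(pos) == 1}
print(single_cutters)
```

In the toy sequence the EcoRI site runs across the origin, so it is found only when the sequence is treated as circular. That is the reason for the -cir switch in the MAPSORT run.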

Now use the information from the pBR322 data file at the beginning of the handout (page 7) and the information contained in your mapsort output file to fill in the representation of the plasmid on the next page. When looking at the pBR322 data file, the tetracycline resistance region is easy to find. The ampicillin region is harder to find because another name, beta-lactamase, is used to describe that area.

This is a circular plasmid, so be sure to denote the start and stop point of the sequence as a small line at the top middle of the circle. Consider the immediate right of this point to be position 1 of the sequence and the left position 4361. Draw in the approximate locations of the two resistance gene areas. Use a curved box that reaches inside the circle to show this feature. Label each box with a shortened version of the drug name so that you can tell them apart. Then add all the cut sites of the single cutter enzymes to the picture. Be sure to label each cut site with its respective restriction enzyme's name.

Using the information displayed above, answer the following questions.


1) How many enzymes only cut the plasmid once? _______________________________

2) Which restriction enzymes cut the tetracycline resistance area? 

______________________________________________________________________________

3) Which restriction enzymes cut the ampicillin resistance area? 

______________________________________________________________________________

4) After referring to the enzyme.dat file or the enzyme listing on page 8 of this 
exercise, which of the restriction enzymes that cut drug resistance areas have 
sticky ends?

______________________________________________________________________________

______________________________________________________________________________

______________________________________________________________________________

Now you know why pBR322 is no longer in favor. It was far too limited in the number of restriction enzymes that could be used with it.


8) Working with restriction enzymes (a manual cloning experience)

section 8a
To do a manual cloning, we need a vector to use. In this case, you will use the popular cloning vector, pUC19. To find the sequence of pUC19 in the databases, do a STRINGSEARCH run on the synthetic part of the GenBank database using the instructions given next.

% stringsearch

STRINGSEARCH identifies sequences by searching  with character patterns
such as "globin" or "human" in the sequence documentation.

STRINGSEARCH through what sequence(s)  (* GenEMBL:* *) ?  gb_sy:* <rtn>

Do you want to search through:
    A) definitions
    B) complete sequence records

Please choose one (* A *):  a <rtn>

Search for what text patterns ?  puc19 <rtn>

What should I call the output file (* gb_sy.strings *) ?  puc.look <rtn>

*** Gb_sy:Arpt7e19 ***

//////////////////////////////////////////////////////////////////////

    Sequences searched:     2210
Sequences with matches:        7
       Patterns sought:    puc19

           Output file: puc.look

%

Type off the results of your search. Record below the access code for the pUC19 sequence you want to use. [Note - you do not want the complement sequence - it is the high copy number plasmid you are after.]

% cat puc.look

pUC19 access code: ___________________________________________________________

Now that you have the access code, look at the information contained in this file. The vector is very popular because it produces multiple copies of inserted materials and its polylinker region contains single restriction enzyme cut sites. Use typedata to find out the type of information that you can collect from this source.

% typedata gb_sy:code

What you really need is information on the location of the polylinker region and which restriction enzymes cut pUC19 there. Since the database entry doesn't contain that data, you will need to find it by running MAPSORT on this data set.

% mapsort -once -cir

MAPSORT finds the coordinates of the restriction enzyme cuts in a
DNA sequence and sorts the fragments of the resulting digest by size.
MAPSORT can sort the fragments from a single or multiple enzyme
digests.

 (Circular) MAPSORT of what sequence ? gb_sy:code <rtn>

              Begin (* 1 *) ? <rtn>
              End (*  2686 *) ? <rtn>

*** I read your enzyme data file "enzyme.dat"!! ***

Select the enzymes: Type nothing or "*" to get all enzymes. Type "?"
for help on what enzymes are available and how to select them.

                                   Enzyme(* * *): <rtn>

[This selects all the enzymes present in your refrigerator list.]

What should I call the output file (* code.mapsort *) ? <rtn>

Mapping ...

%

Rename the output to (your lastname-puc).mapsort and then print it off on the lab's printer using lpr.

% mv code.mapsort (your lastname-puc).mapsort

% lpr (your lastname-puc).mapsort

The beauty of a polylinker region is that it is designed to contain a number of single cutter sites within a relatively small area (less than 100 bases), have a detectable marker for successful insertions and a promoter to ensure the production of any downstream gene product. Examine your hard copy to determine the following pieces of information: the possible location of the polylinker region, the restriction enzymes that cut it, and where they cleave it. Record this information below.

possible polylinker location: ________________________________________________

enzyme(s) that cut and where: ________________________________________________

______________________________________________________________________________

______________________________________________________________________________

The simplest type of cloning is one in which the same restriction enzyme that cuts the plasmid is used to produce the fragment to be inserted into it.

section 8b
Now collect data on the material to be inserted in the plasmid. Sometimes it is easier to have bacteria produce gene products than to collect the material from the natural source. In this case, we are interested in making a large batch of the paralytic neurotoxin protein from Pyemotes tritici. You have the mRNA sequence to work with, toxin.seq, which is in this week's subdirectory. Type off this file on the screen and note the type of information contained therein. To confirm the success of the cloning process, it would be nice to know the size of the protein we are interested in. In order to produce a fragment to insert, we need to know which of the single cutters of pUC19 cut this sequence and if any of them produce a fragment that will contain the coding for the desired protein. Record below the location of the coding region.

% cat toxin.seq

CDS region: _________________________________________________________________

To find out the desired information use MAPSORT. This program will determine restriction enzyme cut sites and produced fragment sizes.

% mapsort

MAPSORT finds the coordinates of the restriction enzyme cuts in a
DNA sequence and sorts the fragments of the resulting digest by size.
MAPSORT can sort the fragments from a single or multiple enzyme digests.

 (Linear) MAPSORT of what sequence ?  toxin.seq <rtn>

                  Begin (* 1 *) ? <rtn>
                   End (*   940 *) ? <rtn>

Is this sequence circular (* No *) ? <rtn>

*** I read your enzyme data file "enzyme.dat"!! ***

Select the enzymes: Type nothing or " * " to get all enzymes. Type "?"
for help on what enzymes are available and how to select them.

		     Enzyme(* * *):
Enter in the names of the restriction enzymes you recorded earlier that cut the polylinker region of pUC19 one at a time. When you are finished with your list, just press the ENTER key and the program will move on to the next step.

What should I call the output file (* toxin.mapsort *) ? <rtn>

Mapping .

%
Type off the results of this program. Information is given for those restriction enzymes that cut the sequence, where the cuts occur and how big the fragments are. The fragments are even arranged by size. Linear mapsort results give the starting and ending point of the sequence, in this case 0 and 940. If the restriction enzyme has cut the sequence only once, there will only be a single number given between the 0 and the 940. What you want is a restriction enzyme that produces at least two cuts in the sequence. For a restriction enzyme to be useful, the location of those cuts should produce a fragment that contains the entire region of interest. From the results, select the restriction enzyme that cuts out the fragment you want and record its cut points below.

% cat toxin.mapsort

restriction enzyme to use: ___________________________________________________

Enzyme cut points:        start  ___________________  end  ___________________
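The selection rule described above (at least two cuts, with the whole region of interest between two neighboring cuts and no cut inside it) can be sketched in Python. The enzyme names, cut positions and CDS range below are all invented, and the comparison uses the reported pattern positions directly, so treat it as a first screen only.

```python
def flanking_cuts(cuts, cds_start, cds_end):
    """Return the closest pair of cuts that encloses the CDS, or None."""
    if any(cds_start < c < cds_end for c in cuts):
        return None                      # the enzyme would cut inside the CDS
    left = [c for c in cuts if c <= cds_start]
    right = [c for c in cuts if c >= cds_end]
    if not left or not right:
        return None                      # no cut on one side of the CDS
    return max(left), min(right)


# Hypothetical MAPSORT style results for three made-up enzymes.
results = {
    "EnzA": [120, 700],
    "EnzB": [50, 400, 900],
    "EnzC": [300],
}
cds_start, cds_end = 150, 650            # hypothetical coding region

for name, cuts in results.items():
    print(name, flanking_cuts(cuts, cds_start, cds_end))
```

Only the first made-up enzyme passes: the second cuts inside the coding region and the third cuts only once.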

Look at the enzyme.dat file and record below the pattern that your chosen enzyme cuts at and where that cut actually occurs (use the ' mark to show this point).

enzyme pattern: _________________________   cut location: ____________________

Now determine where your chosen restriction enzyme actually cuts the polylinker region of the pUC19 plasmid by running MAPSORT on that sequence using only the name of your chosen enzyme. Use the example below as a guide. User input is shown in bold. Enter the access code you found for pUC19 where it has code and the name of your enzyme where it has enz.

% mapsort -cir

MAPSORT finds the coordinates of the restriction enzyme cuts in a
DNA sequence and sorts the fragments of the resulting digest by size.
MAPSORT can sort the fragments from a single or multiple enzyme 
digests.

 (Linear) MAPSORT of what sequence ?  gb_sy:code <rtn>

               Begin (* 1 *) ? <rtn>
              End (* 2686 *) ? <rtn>

Is this sequence circular (* No *) ? yes

*** I read your enzyme data file "enzyme.dat"!! ***

Select the enzymes: Type nothing or "*" to get all enzymes. Type "?"
for help on what enzymes are available and how to select them.

                                    Enzyme(* * *):  enz <rtn>

When the next prompt for an enzyme appears, press ENTER to continue with the analysis.

What should I call the output file (* code.mapsort *) ? puc19.map2 <rtn>

Mapping .

%

Type off the results of this run. Since this is a single cutter of pUC19 and the plasmid is circular, the cut site number is repeated. Record this location below.

Enzyme cut site in pUC19: ____________________________________________________

section 8c
With all the necessary data collected, perform the actual cloning operation. Use the program SEQED. Start it by entering seqed. Respond to the prompt about SeqEd of what sequence ? with the name of the pUC19 sequence you are using (gb_sy:code).

% seqed

The sequence is loaded and the screen shows you the end of the sequence with the cursor blinking after base 2686. Enter 1 <rtn> to move to the front of the sequence. Now at the beginning of the sequence, it is time to explain how to proceed. To understand what values to be entered in the following steps, you need to understand the numbers reported to you by the MAPSORT program and the way in which SEQED operates.

SEQED inserts sequence data in the following manner. The program inserts the new sequence at the position you tell it to and moves the indicated base and the rest of the original sequence to the end of the newly included sequence. MAPSORT gives you the starting point of the restriction enzyme pattern that you are using, not where the actual cut occurs. If the cut site is not just before the first character of the pattern, the value you recorded as the starting point on page 21 will have to be corrected. This is also the case for the insertion point in the vector. The actual insertion position will be the pattern start position plus the number of bases between the beginning of the pattern and the ' symbol in the enzyme pattern. The actual starting point of the sequence to be included is the recorded starting position plus the number of bases between the beginning of the pattern and the ' symbol in the enzyme pattern. The corrected ending position of the insert is the recorded ending position plus the number of bases between the beginning of the pattern and the ' symbol in the enzyme pattern, minus one. Make these corrections and record the corrected values below.

corrected insertion point: ______________________________________________________

corrected starting point: _______________________________________________________

corrected ending point: _________________________________________________________
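The three corrections described above are simple arithmetic, sketched here in Python. The XbaI pattern is taken from the enzyme.dat excerpt earlier in this exercise; the three recorded positions are invented, so substitute your own values.

```python
def corrected_positions(pattern, vector_site, insert_first, insert_last):
    """Turn recorded pattern start positions into SEQED positions."""
    # Number of bases between the beginning of the pattern and the ' mark.
    # The underscore is removed first so that it is not counted as a base.
    offset = pattern.replace("_", "").index("'")
    return {
        "insertion_point": vector_site + offset,
        "start": insert_first + offset,
        "end": insert_last + offset - 1,
    }


# Hypothetical recorded values: site at 400 in the vector, sites at 100 and
# 800 in the sequence to be inserted.
print(corrected_positions("T'CTAG_A", 400, 100, 800))
# {'insertion_point': 401, 'start': 101, 'end': 800}
```

With an offset of 1, the insertion point and the starting point each move up by one base, while the ending point stays where it was recorded.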

Enter the corrected insertion point at which the plasmid is to be cut and press ENTER. This moves you along the sequence to the actual insertion point. The base that occurs just after the cut should be highlighted with the cursor. At this point press ctrl-d to go into the command mode of the program. At the : prompt, enter the following:

: include toxin.seq <rtn>

The program will come back with the following queries about the inserting of the toxin sequence.

The x in the example is the corrected starting point and the y is the corrected ending point you recorded above. User input is shown in bold type.

                seqed include of toxin.seq

                 Begin (* 1 *) ?  x <rtn>
               End (*   940 *) ?  y <rtn>
              Reverse (* No *) ? <rtn>

first 50 bases: [the actual first 50 bases are shown] 

This line should start with the portion of the restriction enzyme pattern that is to the right of the cut site.

last 50 bases: [the actual last 50 bases are shown]

This line should end with the portion of the restriction enzyme pattern that is to the left of the cut site.

is this what you want included (* yes *) ? <rtn>

Back at the : prompt, enter write clone.seq <rtn> to write your work out to a file and then enter quit to get out of the program.
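The include operation you just performed amounts to splicing one string into another. The Python sketch below shows the idea on two made-up sequences that each carry the XbaI site TCTAGA; it assumes 1-based, inclusive positions as used in this exercise and is not a description of how SEQED works internally.

```python
def include_fragment(vector, insert, insertion_point, start, end):
    """Place insert[start..end] in front of the base at insertion_point."""
    fragment = insert[start - 1:end]               # 1-based, inclusive range
    # The base at insertion_point and everything after it move to the end
    # of the newly included sequence.
    return vector[:insertion_point - 1] + fragment + vector[insertion_point - 1:]


toy_vector = "CCCCTCTAGAGGGG"            # invented; TCTAGA starts at base 5
toy_insert = "AATCTAGACATGCATTCTAGATT"   # invented; TCTAGA starts at 3 and 16

# Corrected values for XbaI (T'CTAG_A, one base before the cut mark):
# insertion point 5 + 1 = 6, start 3 + 1 = 4, end 16 + 1 - 1 = 16.
clone = include_fragment(toy_vector, toy_insert, 6, 4, 16)
print(clone)
print(clone.count("TCTAGA"))
```

The included fragment starts with CTAGA and ends with T, so the XbaI site is rebuilt at both ends of the insert and the finished clone contains the site twice.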

section 8d
Checking your work. Check the result of your cloning with MAPSORT. Run the program as shown before, but use clone.seq as the input file and your selected enzyme as the enzyme to be used. The resulting mapsort file should show two cut sites for your chosen enzyme and that the fragment between the cut sites is still big enough to hold the desired region of the sequence. If it doesn't, repeat section 8c until it does. An example of a successful result is shown on the next page. The first number in the size line is the size of the desired fragment and the second is the size of the pUC19 plasmid.

% mapsort -cir

Cuts at: 396 1329 396

Size: 933 2686
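The two sizes in this example follow directly from the cut positions. A short Python sketch of the arithmetic is given below; it assumes the clone is 2686 + 933 = 3619 bases long, which is the pUC19 length plus the inserted fragment, and the function name is invented for the example.

```python
def circular_fragments(cut_positions, total_length):
    """Sizes of the pieces produced by cutting a circle at the given positions."""
    cuts = sorted(set(cut_positions))
    if not cuts:
        return [total_length]
    sizes = []
    for i, cut in enumerate(cuts):
        following = cuts[(i + 1) % len(cuts)]    # wraps around to the first cut
        # Distance going around the circle; a single cut gives the full length.
        sizes.append((following - cut) % total_length or total_length)
    return sizes


print(circular_fragments([396, 1329, 396], 3619))
# [933, 2686]
```

The repeated 396 in the Cuts at line is the same site reached again after going once around the circle, so it is counted only once.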

When you are satisfied with your work, rename the clone.seq file to (your lastname).clone and use rcp to ship it off to the teacher account.

% mv clone.seq (your lastname).clone

% rcp (your lastname).clone teacher@ribozyme:receive


9) Finding suitable cut sites to use in cloning attempts

Not all attempts at cloning are as successful as the one you just did. The selection of proper plasmids and restriction enzymes to do what you want can be tricky. The following sequences have been selected to allow you to experience cloning attempts a little more realistically.

Four sequences have been chosen for this section of the exercise and given the names ctry1.seq, ctry2.seq, ctry3.seq and ctry4.seq. Determine if any of the restriction enzymes in your refrigerator set will cut out the region of interest of the four sequences. Then determine if you can clone the region of interest fragment into the pUC19 plasmid.

First type off the files and record below the location of their respective regions of interest. Running MAPSORT on these files will let you know if any restriction enzyme(s) from the refrigerator set will do the job you want done. You can either view the results of these mapsort runs on the screen or create hard copy of them on the printer. Record your findings below. Lastly, compare the name(s) of the restriction enzymes that work for you with those single cutters that exist in the polylinker region of pUC19 and see if any of the fragments produced can be successfully cloned into the plasmid without additional modifications to the resultant fragment. Use the examples given previously in this exercise as guides for performing these tasks.

Region of interest in ctry1: ________________________________________________

Mapsort results on ctry1: ___________________________________________________

______________________________________________________________________________

Region of interest in ctry2: ________________________________________________

Mapsort results on ctry2: ___________________________________________________

______________________________________________________________________________

Region of interest in ctry3: ________________________________________________

Mapsort results on ctry3: ___________________________________________________

______________________________________________________________________________

Region of interest in ctry4: ________________________________________________

Mapsort results on ctry4: ___________________________________________________

______________________________________________________________________________

Which sequence(s) can be cloned into the pUC19 polylinker region?

______________________________________________________________________________


10) Another cloning attempt (optional)

Use a ctry sequence that can be cloned into pUC19 from section 9 and perform that operation. Repeat the relevant sections of 8b, 8c and 8d to accomplish the task. Record the necessary data in the space provided below. Name your cloning attempt clone2.seq.

Region of interest: _____________________________________________________________

Restriction enzyme to use: ______________________________________________________

Enzyme cut points:          start _______________          end __________________

enzyme pattern: _____________________      cut location: ________________________

Enzyme cut site on pUC19: _______________________________________________________

Check your work. If everything has gone correctly, you can check your results with the MAPSORT program. Run the program as shown before, this time using clone2.seq as the input file and your selected enzyme as the enzyme to be used. The resulting mapsort file should show two cut sites for your chosen enzyme and that the fragment between the cut sites is still big enough to hold the desired region of the sequence. If it doesn't, work on it until it does.

When you are satisfied with your work, rename the clone2.seq file to (your lastname).clone2 and use rcp to ship it to the teacher account.


11) Printing off a plasmid map.

With all the work you have done with cloning attempts this week, adding a plasmid map to your structural data file makes sense. To see what the GCG form of a plasmid map looks like, use the week6.image file. First, rename this file to reflect your own last name and then print it off on the teaching lab printer.

% mv week6.image (your lastname).image6

% lpr (your lastname).image6

Pick up your hardcopy at the printer. The image shown is that of a plasmid map of pBR322. Most biochemical supply catalogues use plasmid maps as a means of getting visual information across about the plasmids they sell. Their content is slightly different than that of a GCG plasmid map, but the general idea is the same. Save this information.


12) Finishing up.

Rename the report form to your last name, go into the file using the pico editor and fill in all the questions except those dealing with surfing the nets. The surfing questions are for extra credit and will give you an idea of the type of nucleotide structural information that is available on the internet.

% mv week6.week6 (your lastname).week6

% pico (your lastname).week6

If you don't intend to do the extra credit, rcp over your report form to the teacher account and log off of the machine. Otherwise, don't rcp over the report form and continue on with the extra credit optional portion of the exercise.

% rcp (your lastname).week6 teacher@ribozyme:receive

This concludes your computing session for this week. Log off ribozyme, get out of the emulator and back to the overlapping windows screen.

% logout

Press the alt and x keys together. This will cause the screen to ask if you really want to exit the program. Respond with y to get out of the teemtalk emulator and return to the overlapping windows screen.

Extra Credit (Optional) - Surfing the Nets for Nucleotide Structures

Back at the windows screen, you can explore the Nets for nucleotide structures. We will go back to the Molecules R US web site. This site contains data for all the entries contained in the PDB x-ray database, some of which are for nucleotide structures. To go there, select the Netscape icon (the large N) with the arrow and press the left mouse button. The arrow changes to an hourglass while the connection is being made to the VADMS home page. Use the arrow to select the Bookmarks menu, and the FORM for PDB query: Molecules R US entry from this menu.

You are now connected to the Molecules R US home page. Depending on network traffic, it may take a moment for their logo to appear. This is a form driven system. Note the empty white box beside the Enter search keyword line. Move the arrow to the beginning of this box and press the left mouse button. You are now ready to enter either a PDB access code for a structure or a keyword to search the database with. The PDB access code for a hammerhead ribozyme is 1rmn. Type this in the box and press ENTER.

The results of the PDB database search for the 1rmn code are shown on the screen in blue text. There is only one data file with that access code. Keyword searches would have produced a number of hits to choose from. Move the arrow to that line (it turns to a hand in this process), and click the left mouse button.

You have reached the form to actually request a structural image of the desired access code. Position the arrow on the Submit Request box and press the left mouse button.

This process launches the RasMol modelling program. This appears as a black box on the screen with the program's name at the top. Now there is a structure in the RasMol window, a wireframe representation in default CPK colors. Use the scroll bar at the bottom of the screen to rotate the molecule. Try out the various options under the Display and Colours menus. The ones that are most meaningful for nucleotide structures are shapely and group under Colours. In the Display options, Ribbons and Strands produce the same image. When you are finished exploring, select the Exit option from the File menu of the RasMol window to close the RasMol program.

Click twice on the Back button at the top of the screen to go back to the initial query screen. At this point you can either exit the program or explore with the following access codes: 1rrn (rotate this one with the bottom scroll bar 1/4 inch to the left to get the best view of the structure), 1sun (this is an intron from bacteriophage T4) and 1tra (a naturally occurring ribozyme; this is a very large structure and it will be slow).

If you decide to explore the other access codes listed above, you will need to move the arrow to the white box and click the left mouse button. Then use the Backspace key to remove the text found in the box and type in the new code. Repeat the instructions given above.

To exit the program select the File option from the top of the screen and select its Exit option. This will return you to the overlapping windows screen.

After you have checked out the images, get back on ribozyme. Move to the six sub-directory and finish filling out the report form for the week with your comments on surfing the nets for structural data. Rcp this file over to the teacher account and log off the system.

% pico (your lastname).week6

% rcp (your lastname).week6 teacher@ribozyme:receive

This concludes your computing session for this week. Log off ribozyme, get out of the emulator and back to the overlapping windows screen.

% logout

Press the alt and x keys together. This will cause the screen to ask if you really want to exit the program. Respond with y to get out of the teemtalk emulator and return to the overlapping windows screen.