'96 BC/BP 378

Week 3

Working with proteins - section 1. You will learn how to enter protein sequence data for various uses; explore protein primary sequence databases and the format of the data therein; do simple protein database string searches and create effective data sets; and determine protein molecular weights, isoelectric points, phobic regions and trypsin cut sites.

Author:

Susan Jean Johns


Protein Background Information

Biochemistry is the study of the molecular basis of life. In that study, proteins play a crucial role. Proteins catalyze reactions, and transport and store small molecules and ions. They also contribute to coordinated motion, mechanical support, immune protection, and the generation and transmission of nerve impulses, and they control growth.

The basic structural units of proteins are amino acids. There are twenty different amino acids from which all proteins are constructed. Each of these twenty amino acids varies in the size, shape, charge, hydrogen-bonding capacity and chemical reactivity of its respective side chain. Individual amino acids are linked to one another via peptide bonds, -CO-NH-. Proteins are precisely defined amino acid sequences whose composition is specified by genes.


                          amino acid basics  

An individual amino acid can be represented in the following manner.  The R 
shown here represents the side chain of the individual amino acid.    

                                  R   
                                  |      
α-amino end                  NH2-CH-COOH               α-carboxyl end  

The individual side chains of the amino acids vary from one another with respect 
to a number of different characteristics.  These simple building blocks are 
capable of forming complex three-dimensional structures that allow proteins to 
perform so many biological functions.  

In peptide linkages the α-carboxyl group of one amino acid is joined to the α-amino 
group of the next amino acid.  Many amino acids joined together form a polypeptide 
chain or a protein.  Because the two ends of an amino acid are different 
from one another, such a polypeptide chain has direction.  An individual amino acid 
unit in a polypeptide chain is known as a residue.  To ensure that all proteins 
are reported in the same manner, a convention has been established: the amino 
end is taken as the beginning of the chain, and the sequence is written in the order 
of the component amino acids until the carboxyl end of the chain is reached.  

Such a polypeptide chain has the following representation:                   

                                     R  
                                     |        
      amino end          H -- [- NH-CH-CO -] -- OH                carboxyl end        
                                            n  

Polypeptide chains consist of a regular repeating part known as the main chain 
or backbone, the -NH-CH-CO- part of the representation given above, and a 
variable part consisting of the individual side chains, the R of the 
representation.  The mean molecular weight of an amino acid residue is about 110.  
Protein mass is measured in daltons. A dalton is equal to one atomic mass unit.  
A protein with a molecular weight of 20,000 therefore has a mass of 20,000 daltons, 
or 20 kd (kilodaltons), and would contain about 180 to 200 amino acids.
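
The arithmetic behind that estimate can be sketched in a few lines of Python. This is only an illustration: the function name is made up for this example, and the value of 110 is the approximate mean weight quoted above, so the result is a rough guide and not an exact count.

```python
# Approximate mean residue weight quoted in the text (an estimate, not exact).
MEAN_RESIDUE_WEIGHT = 110.0


def estimate_residue_count(mass_daltons: float) -> int:
    """Rough number of amino acid residues in a protein of the given mass."""
    return round(mass_daltons / MEAN_RESIDUE_WEIGHT)


# A 20 kd protein falls inside the 180 to 200 residue range given above.
print(estimate_residue_count(20_000))
```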


Some proteins contain disulfide bonds. These are cross-linkages formed between polypeptide chains, or between parts of the same chain, by the oxidation of cysteine residues. Such disulfide bonds are usually a characteristic of extracellular proteins.

Proteins have primary structure, the order of the individual amino acids that comprise the protein and the location, if any, of its disulfide bonds. Proteins also have secondary structure. This refers to the spatial arrangement of amino acid residues that are near one another in the primary sequence. Such arrangements can produce periodic structures due to steric interactions and result in such constructions as helixes, sheets and turns. The tertiary structure of a protein refers to the spatial relationship of amino acid residues that are far apart from one another in the primary sequence. When a protein is composed of more than one primary sequence, or chain, there are additional relationships between these chains or sub-units with one another known as quaternary structures.

Proteins therefore have two different types of available data: data based on their primary sequence, and data derived from determining their actual structural conformation through x-ray crystallography. Since the understanding of proteins is so vital to the understanding of the molecular basis of life, both of these types of data are being collected into databases for use by the molecular biology community. These databases provide reported sequences and structures for comparison and analysis work.

By understanding the information contained in a protein's primary sequence, its relationship to other proteins, its philic and phobic regions and its possible secondary structure can be determined. If a protein's structure has been determined, it can be visualized to better understand the spatial nature of the protein and its behavior.

Knowledge of protein sequences is important for the following reasons. First, such sequence information is essential in determining the mechanism of biological reactions. Second, relating sequence information and three-dimensional structure provides the rules that govern protein folding. The sequence serves as the link between the genetic message in DNA and the three-dimensional structure that performs the protein's biological function. Third, alterations in a sequence can produce abnormal function and disease. Finally, a protein's sequence can reveal information about its evolutionary history.

Use of the computer to determine some of these relationships for proteins will be explored in the next three weeks' laboratory sessions.


Background Information on Protein Sequence Data Entry

In order to work with a protein on the computer, its relevant information has to be entered into the computer in a form the machine and analysis software can recognize. When that data does not exist in any other source, you must input the data you are interested in yourself. Here at WSU, the VADMS Computing Resource supports the use of the Genetics Computer Group software suite (GCG) for protein and nucleotide analysis tasks. Therefore, in order to have protein sequence data in a format that can be used by this software, it is entered via GCG's sequence entry program, SEQED. To better understand what is happening in SEQED, here is some information about the standard means of organizing information for a primary sequence data file.

There are three basic parts to a primary sequence entry. The first is the relevant reference or background information on the sequence. The nature of this section of the file depends on the database from which it was extracted or the verboseness of the person who created the file. If the file is from a database, there will be information on the name of the sequence, its source, its accession number, references and feature information about the sequence. A sequence file that is not from a database may contain anything in its header section. The information placed there depends solely on the whims of its creator. Hopefully, at a minimum, the name of the sequence and some information about its preparation or features will be there. When you enter a sequence, think of the purpose for that data entry. If the data has the potential to be used for a long period, e.g., as necessary information for a research group or a lab, be detailed when you enter this information. Give all the important facts about the sequence that you currently have. You can change this part of the file later as more information becomes available on the sequence. If the data is only to be used in an exercise for this class, be brief with the reference information.

Located between the header and sequence sections is what GCG refers to as the checksum line. It contains the filename of the sequence file, the length of the sequence, the date the file was created, the type of data it is (P for protein and N for nucleotide), and a number. This number is used in GCG programs to see if any scrambling of the data has occurred for whatever reason. After the checksum number are two periods. GCG uses the location of these two periods to signal the end of nonsequence material in the file, and the beginning of the actual sequence information.
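
The handout does not say how the checksum number is computed. The sketch below uses the rule commonly described for GCG checksums, which is an assumption here and not something taken from the GCG manual: each character's ASCII code is multiplied by a position weight that cycles from 1 to 57, and the total is reduced modulo 10000. The function name is made up for this example. Applied to the insulin A chain from the NRL_3D example in this handout, the rule gives 7820, the Check value shown in that file.

```python
def gcg_checksum(sequence: str) -> int:
    """Checksum of a sequence under the assumed GCG rule.

    Each upper-cased character contributes (position weight) * (ASCII code),
    where the weight cycles 1, 2, ..., 57, 1, 2, ... along the sequence.
    """
    total = 0
    for index, residue in enumerate(sequence.upper()):
        total += ((index % 57) + 1) * ord(residue)
    return total % 10000


# Insulin chain A (9INSA) from the NRL_3D example file.
print(gcg_checksum("GIVEQCCTSICSLYQLENYCN"))
```

A single changed, dropped or swapped character usually alters this number, which is how a program can detect that the data has been scrambled.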

The last part of the sequence file is the actual sequence data itself. Normally the data is shown in blocks of ten, with fifty characters to a line. Each data line is preceded by the position in the sequence of the first character in that line. For ease in reading the data, a blank line is placed between each actual sequence line. Examples of actual sequence files from the various VADMS supported databases are given in the next section.
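
The layout just described can be reproduced with a short Python sketch. The function name and the exact column widths are assumptions, chosen to match the example files shown later in this handout; they are not taken from GCG's own documentation.

```python
def format_sequence_lines(sequence: str, per_line: int = 50, block: int = 10) -> str:
    """Lay out a sequence in blocks of ten, fifty residues to a line.

    Each line starts with the position of its first residue, and a blank
    line separates consecutive sequence lines.
    """
    lines = []
    for start in range(0, len(sequence), per_line):
        chunk = sequence[start:start + per_line]
        blocks = " ".join(chunk[i:i + block] for i in range(0, len(chunk), block))
        # Position numbers are right-aligned in an 8-character field.
        lines.append(f"{start + 1:>8}  {blocks}")
    return "\n\n".join(lines)


print(format_sequence_lines("GIVEQCCTSICSLYQLENYCN"))
```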


Background Information on Protein Primary Sequence Databases

There are large databases containing protein primary sequence information. You don't have to enter in every sequence that you are interested in. Normally you check the databases to see if the information you want has already been entered and if it matches the sequence data you need or want to use. In order to do that you need to know something about the way databases are organized and how to search them.

The information stored in protein sequence databases has pointers to allow you to extract the desired sequence information. A number of different terms are similar in name but different in purpose. A sequence's access code is composed of a group of 6 to 10 alphanumeric characters, depending on its database of origin. In the PIR databases, the extent to which an access code is composed of characters shows the reliability of the data. For well-established sequences with verified data, access codes are mainly letters. Unverified sequences have more numbers than letters in their access codes, with really questionable data having an * before its title information.

At the time when a sequence is deposited into a database, it is given an accession number. Protein sequences may have two letters at the beginning of their accession numbers followed by 4 numbers. Accession numbers do not change as the data is absorbed into different databases. Often the best way to search for newly published sequences is through their accession numbers because their final access code may not have been determined before the paper went to press. Accession number searching is possible through GCG's STRINGSEARCH. The accession number given at the time the sequence is deposited is known as the primary accession number. If the sequence was developed from work on earlier sequences, those numbers will also be given and they are known as secondary accession numbers.

VADMS supports the following primary sequence protein databases: NRL_3D, Protein and SwissProtein. Of these databases, NRL_3D contains only protein sequences whose x-ray structures have been solved. It thereby serves as a bridge between the primary sequence databases and the structural one, PDB. You can find information in NRL_3D about the location of secondary structural elements such as helixes and sheets. Protein and SwissProtein are just for primary sequence information. Of the two, the entries in SwissProtein have a more complete and logical organization method to them. Protein has been divided into three sections, PIR1, PIR2, and PIR3. The most complete protein entries are in PIR1 and they work their way down to the least complete entries in PIR3.

Unfortunately, each of the three major sources of protein primary sequence databases uses its own slightly different form of data entry and access codes. SwissProtein uses a code with a maximum length of 10, NRL_3D appears to have a maximum length of 8, and Protein a length of 6. Each has its own way of creating access code names and determining what information is placed in the first line of the data entry.

A note on the NRL_3D database. In some x-ray structures there is more than one copy of a protein in the crystal's unit cell. The NRL_3D database treats this situation as if each copy were a different protein. It treats proteins with multiple chains or substrate proteins in the same manner. A single x-ray structure file can therefore result in numerous NRL_3D database files whose access codes are identical in their first four characters. It is impossible to tell from this naming convention whether the files are just multiple copies of the same protein or distinct sequences.
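
As a small illustration of this naming convention, the Python sketch below groups access codes by their first four characters. The function name is made up for this example, and the codes 1ABC1 and 1ABC2 are hypothetical; only 9INSA and 9INSB appear in this handout. As noted above, the grouping shows which entries came from the same structure file, but not whether they are identical copies or distinct chains.

```python
from collections import defaultdict


def group_by_structure(access_codes):
    """Group NRL_3D access codes by their shared four-character prefix."""
    groups = defaultdict(list)
    for code in access_codes:
        groups[code[:4].upper()].append(code)
    return dict(groups)


# 9INSA and 9INSB are the two insulin chains; 1ABC1 and 1ABC2 are hypothetical.
print(group_by_structure(["9INSA", "9INSB", "1ABC1", "1ABC2"]))
```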

An example of the GCG format of each of these databases is given on the next three pages. If necessary the text has been modified to fit on a single page. The checksum line has been shown in bold to help you identify that line of the file.


An example file from the NRL_3D database.

P1;9INSA      -  insulin, chain A - pig           
ENTRY            9INSA      #type complete
TITLE            insulin, chain A - pig
ORGANISM         #formal_name Sus scrofa domestica #common_name domestic pig
REFERENCE        A50814
   #authors      Badger, J.; Dodson, G.G.
   #submission   submitted to the Brookhaven Protein Data Bank, October 1991
   #cross-references PDB:9INS
REFERENCE
   #authors      Badger, J.; Harris, M.R.; Reynolds, C.D.; Evans, A.C.;
                   Dodson, E.J.; Dodson, G.G.; North, A.C.T.
   #journal      Acta Crystallogr. (1991) B47:127
   #title        Structure of the pig insulin dimer in the cubic crystal.
REFERENCE
   #authors      Badger, J.; Caspar, D.L.D.
   #journal      Proc. Natl. Acad. Sci. U.S.A. (1991) 88:622
   #title        Water structure in cubic insulin crystals.
REFERENCE
   #authors      Dodson, E.J.; Dodson, G.G.; Lewitova, A.; Sabesan, M.
   #journal      J. Mol. Biol. (1978) 125:387
   #title        Zinc-free cubic pig insulin: crystallization and structure
                   determination.
COMMENT          Resolution: 1.7 angstroms
COMMENT          R-value: 0.178
COMMENT          Determination: X-ray diffraction
KEYWORDS         Hormone
FEATURE
   1-10               #region helix (right hand alpha)\
   12-17              #region helix (right hand 3-10)\
   6-11               #disulfide_bonds\
   7                  #disulfide_bonds interchain (to 9INSB:7)\
   20                 #disulfide_bonds interchain (to 9INSB:19)
SUMMARY          #length 21  #molecular-weight 2384  #checksum 7820
SEQUENCE

  9insa  Length: 21  November  9, 1994 20:32  Type: P  Check: 7820.

       1  GIVEQCCTSI CSLYQLENYC N

Notice the secondary structure information in the feature portion of the file. It also gives disulfide bond locations.


An example file from the Protein database.

P1;CCSP       -  cytochrome c - spinach           
ENTRY            CCSP       #type complete
TITLE            cytochrome c - spinach
ORGANISM         #formal_name Spinacia oleracea #common_name spinach
DATE             #text_change 19-May-1994
ACCESSIONS       A00065
REFERENCE        A00065
   #authors      Brown, R.H.; Richardson, M.; Scogin, R.; Boulter, D.
   #journal      Biochem. J. (1973) 131:253-256
   #title        The amino acid sequence of cytochrome c from Spinacea
                   oleracea L. (spinach).
   #cross-references MUID:73229096
   #contents     cv. Monster Viroflay
   #accession    A00065
      ##molecule_type protein
      ##residues      1-111 ##label BRO
CLASSIFICATION   #superfamily cytochrome c; cytochrome c homology
KEYWORDS         acetylation; electron transfer; heme; methylation;
                   mitochondrion; oxidative phosphorylation; respiratory chain
FEATURE
   1                  #modified_site acetylated amino end (Ala) #status
                        experimental\
   22,25              #binding_site heme (Cys) (covalent) #status
                        experimental\
   26,88              #binding_site heme iron (His, Met) (axial ligands)
                        #status predicted\
   80,94              #modified_site N6,N6,N6-trimethyllysine (Lys) #status
                        experimental
SUMMARY          #length 111  #molecular-weight 12054  #checksum 9939
SEQUENCE

  Ccsp  Length: 111  November  9, 1994 20:41  Type: P  Check: 9939

       1  ATFSEAPPGN KDVGAKIFKT KCAQCHTVDL GAGHKQGPNL NGLFGRQSGT 

      51  AASYSYSAAN KNKAVIWSED TLYEYLLNPK KYIPGTKMVF PGLKKPQDRA 

     101  DLIAYLKDST Q

This particular entry gives information on the amino acid residues involved in binding the heme group within this protein.


An example file from the SwissProtein database.

P1;14KD_DAUCA - 14 KD PROLINE-RICH PROTEIN
ID   14KD_DAUCA     STANDARD;      PRT;   137 AA.
AC   P14009;
DT   01-JAN-1990 (REL. 13, CREATED)
DT   01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE)
DT   01-JUN-1994 (REL. 29, LAST ANNOTATION UPDATE)
DE   14 KD PROLINE-RICH PROTEIN.
OS   DAUCUS CAROTA (CARROT).
OC   EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE; APIALES;
OC   UMBELLIFERAE.
RN   [1]
RP   SEQUENCE FROM N.A.
RA   ALEITH F., RICHTER G.;
RL   PLANTA 183:17-24(1990).
DR   EMBL; X15436; DC215G.
DR   PIR; S35714; S35714.
FT   DOMAIN       30     49       PRO-RICH.
SQ   SEQUENCE   137 AA;  14392 MW;  96106 CN;

14kd_Dauca  Length: 137  November  9, 1994 20:41  Type: P  Check: 2374.

       1  MGSKNSASVA LFFTLNILFF ALVSSTEKCP DPYKPKPKPT PKPTPTPYPS 

      51  AGKCPRDALK LGVCADVLNL VHNVVIGSPP TLPCCSLLEG LVNLEAAVCL 

     101  CTAIKANILG KNLNLPIALS LVLNNCGKQV PNGFECT

This feature section points out a region of the sequence with a high concentration of prolines in it.


Background Information on Extracting Data from a Protein's Primary Sequence

Once the primary sequence of a protein is in the computer, additional data can be extracted from that sequence to shed more light on the protein's nature. Some protein characteristics that depend on sequence information are extinction coefficient, molecular weight, isoelectric point, the hydrophobic nature of the protein and its proteolytic cut sites. With the exact order of the component amino acids defined in the sequence, other information that is based on this data can be determined.

Extinction coefficient:
The extinction coefficient of a protein depends on its composition. The extent to which a protein solution absorbs at 280 nm depends on its cysteine, tyrosine and tryptophan content. This absorption characteristic is the basis for 280 nm monitoring of fractions collected from column chromatography runs.
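
A rough estimate of this value can be sketched in Python by counting the absorbing residues. The per-residue coefficients below are commonly cited approximate values (in M^-1 cm^-1) and are an assumption of this example, not figures from this handout or from GCG. The sketch also assumes, unless told otherwise, that cysteines are paired into disulfide bonds, since it is the disulfide that is usually credited with absorbing at 280 nm.

```python
# Assumed approximate molar extinction coefficients at 280 nm (M^-1 cm^-1).
EXT_TRP = 5500      # per tryptophan
EXT_TYR = 1490      # per tyrosine
EXT_CYSTINE = 125   # per disulfide bond (pair of cysteines)


def extinction_280(sequence, disulfides=None):
    """Estimate the 280 nm extinction coefficient from composition alone."""
    seq = sequence.upper()
    if disulfides is None:
        # Assumption: every possible pair of cysteines forms a disulfide.
        disulfides = seq.count("C") // 2
    return (seq.count("W") * EXT_TRP
            + seq.count("Y") * EXT_TYR
            + disulfides * EXT_CYSTINE)


# Somatostatin: one tryptophan, no tyrosine, two cysteines.
print(extinction_280("AGCKNFFWKTFTSC"))
```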

Molecular weight:
The molecular weight of a protein depends on the exact composition of the protein. By having the exact order of the amino acids defined in the sequence, the number of each amino acid can be determined. This number is then multiplied by the residue weight for each amino acid and the sum of these products determined. By adding 18.0 to this sum, the exact molecular weight of the protein can be determined.
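
The calculation just described can be sketched in Python. The residue weights in the table are standard average residue masses, rounded to two decimals; they are supplied for this example and are not taken from this handout or from GCG, so small differences from a program's output should be expected. The water term is the 18.0 mentioned above, written here as 18.015.

```python
# Average residue masses in daltons (amino acid minus one water); assumed values.
RESIDUE_WEIGHT = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.11, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER = 18.015  # one water is added back for the free amino and carboxyl ends


def molecular_weight(sequence):
    """Sum the residue weights of the sequence and add one water."""
    return sum(RESIDUE_WEIGHT[aa] for aa in sequence.upper()) + WATER


# Leucine enkephalin, sequence 1 of this week's exercise.
print(round(molecular_weight("YGGFL"), 1))
```

The sketch raises a KeyError for any character that is not one of the twenty standard one-letter codes, which is a deliberate choice so that typing errors in a sequence are not silently ignored.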

Isoelectric point:
The isoelectric point of a protein is the pH at which its net charge is zero. This depends on the actual number of basic and acidic amino acids present in the protein. The isoelectric point for a protein can be used to determine which spot is which on an isoelectric focusing gel. Care must be taken with the computer determination of this value however. The programs doing this just add up the number of basic and acidic amino acids and then determine the isoelectric point. They do not make any attempt to take into account the actual folding pattern of the protein, if it has disulfide bridges or not, or if oppositely charged groups are either next to one another or have formed salt bridges. Any or all of these factors influence the isoelectric point of a native protein. The number generated by the computer for the isoelectric point is at best only an estimate.
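
The kind of estimate described here can be sketched in Python. The pKa values below are one illustrative set and are an assumption of this example; different programs use different tables, and the sketch is not GCG's actual method. It counts the charged groups, computes the net charge at a given pH from the Henderson-Hasselbalch relationship, and searches for the pH where that charge crosses zero. Like the programs described above, it ignores folding, disulfide bridges and salt bridges.

```python
# Illustrative pKa values (assumed; published tables differ from one another).
PKA_POSITIVE = {"N_TERM": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"C_TERM": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}


def net_charge(sequence, ph):
    """Net charge of the unfolded chain at the given pH."""
    seq = sequence.upper()
    charge = 0.0
    for group, pka in PKA_POSITIVE.items():
        count = 1 if group == "N_TERM" else seq.count(group)
        charge += count / (1.0 + 10 ** (ph - pka))
    for group, pka in PKA_NEGATIVE.items():
        count = 1 if group == "C_TERM" else seq.count(group)
        charge -= count / (1.0 + 10 ** (pka - ph))
    return charge


def estimate_pi(sequence, tolerance=0.001):
    """Bisect between pH 0 and 14 for the pH at which the net charge is zero."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid   # still positive, so the isoelectric point is higher
        else:
            high = mid
    return round((low + high) / 2.0, 2)


# Melittin (sequence 5) is rich in lysine and arginine, so a high value is expected.
print(estimate_pi("GIGAVLKVLTTGLPALISWISRKKRQQ"))
```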

Hydrophobicity:
Individual amino acids interact with water in characteristic ways. Amino acids with philic, or water-loving, side chains form hydrogen bonds with water. Those with phobic, or water-hating, side chains tend to cluster together, away from water. When amino acids are linked together into a protein, the hydrophobic nature of their side chains (now the only regions of the protein capable of interacting with the surrounding environment, apart from the vastly outnumbered charges at the ends of the protein chain) has a great deal of influence on how the protein behaves.

Proteins attempt to fold in response to their hydrophobic side chains. Phobic side chains tend to gather together in the inner region of the protein. Philic side chains tend to be on the outside surface of the protein, where they can interact with the surrounding environment. There may be individual phobic or philic amino acids in places where they might not be expected due to their specific sequence; however, the overall hydrophobicity of a given region should match this general convention: phobic for the inner portions of a protein and philic for the surface.

The hydrophobicity of a section of a protein is determined by the following method. A window size is chosen. Starting at the beginning of the protein the amino acids within this window are converted to their respective hydrophobicity values based on the selected hydrophobicity scale being used and then summed and divided by the size of the window. This average value is then either recorded or shown on a plot. The window is then incremented down the sequence by one and the process repeated. The resulting file or plot is then examined to determine the hydrophobic nature of the protein.
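
The sliding-window method above can be sketched in Python. The scale used here is the Kyte-Doolittle hydropathy scale, chosen only as an example of "the selected hydrophobicity scale"; the handout does not say which scale the course software uses, and the function name and default window size are likewise assumptions.

```python
# Kyte-Doolittle hydropathy values (positive = phobic, negative = philic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=7):
    """Average hydropathy of each window, moving down the sequence by one."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]


# Melittin (sequence 5): the phobic start and philic end show up in the averages.
profile = hydropathy_profile("GIGAVLKVLTTGLPALISWISRKKRQQ", window=7)
print([round(value, 2) for value in profile])
```

A sequence of length N gives N - window + 1 averages, so a sequence shorter than the window produces an empty profile.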

Proteolytic enzyme cut sites:
Proteolytic enzymes cleave proteins at exact points in their sequences. How many cut sites a given enzyme produces depends on the primary sequence of the protein and on where that particular enzyme cleaves proteins. Trypsin cleaves proteins on the carboxyl side of arginine, R, and lysine, K, residues. This means that there will be either an arginine or a lysine at the carboxyl end of each fragment produced, except possibly the last one.

The following is a listing of the various proteolytic enzymes and their respective cut sites. The ' mark shows where the enzyme actually cuts a protein sequence.

   Chymo   1       F'     .         ! Chymotrypsin
   Chymo   1       W'     .         !     "
   Chymo   1       Y'     .         !     "
   CnBr    1       M'     .         ! Cyanogen Bromide
   NH2OH   1       N'G    .         ! Hydroxylamine
   NTCB    0      'C      .         ! NTCB + Ni
   pH2.5   1       D'P    .         ! pH 2.5
   ProEn   1       P'     .         ! Proline Endopeptidase
   Staph   1       E'     .         ! Staphylococcal Protease
   Tryp    1       K'     .         ! Trypsin
   Tryp    1       R'     .         !    "

Notice that in most cases the cleavage site immediately follows the recognized amino acid. Any computer software looking for such cleavage sites is in reality looking for a one-character string pattern. Some of the enzymes may cut after more than one amino acid, but they always look for the same amino acids at which to cleave the protein. Computer software is very good at locating such patterns, even far more complex ones.
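
A minimal Python sketch of this pattern search is given below. It implements only the simple rule from the listing, cutting immediately after each recognized residue, and the function name is made up for this example; it is not how GCG implements the search. Applied to the somatostatin sequence used in this week's exercise, trypsin's K and R rule produces three fragments.

```python
def cleave_after(sequence, residues="KR"):
    """Cut a sequence on the carboxyl side of every residue in `residues`."""
    fragments = []
    start = 0
    for position, amino_acid in enumerate(sequence.upper()):
        if amino_acid in residues:
            fragments.append(sequence[start:position + 1])
            start = position + 1
    if start < len(sequence):
        # Whatever follows the last cut site is the carboxyl-terminal fragment.
        fragments.append(sequence[start:])
    return fragments


# Trypsin (cuts after K and R) on somatostatin.
print(cleave_after("AGCKNFFWKTFTSC"))
# Chymotrypsin (cuts after F, W and Y) on leucine enkephalin.
print(cleave_after("YGGFL", residues="FWY"))
```

Two-residue sites such as N'G or D'P in the listing need the following residue to be checked as well, which this sketch does not do.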

Exercise for week 3

This series of exercises acquaints you with entering protein sequence data for a number of different uses. Various protein databases and their file formats will be explored. Molecular weights, isoelectric points, hydrophobicity plots and trypsin cut sites will be determined for a number of proteins. Databases will be searched for information of interest and effective data sets made. From this point on in the class, each week's laboratory session will start with a demonstration illustrating the points to be covered during the week's work. Instructions in bold should be entered followed by pressing the ENTER key. The <rtn> symbol given in program examples also means to press the ENTER key.

1) Activate the computer.

Move the mouse to get out of screen saver mode. A screen appears that shows overlapping windows. An example of this screen is shown below.


2) Select the RIBOZYME icon.

From the teemtalk window, select the RIBOZYME icon by moving the cursor arrow with the mouse over to the RIBOZYME icon and pressing the left mouse button twice.

Successful connection to ribozyme is denoted by the appearance of a teemtalk screen followed by a Ribozyme information line and a login: prompt.


3) Log onto ribozyme.

Once the login: prompt appears, enter your account name at the login: prompt, and then your password at the Password: prompt.


4) Move to this week's subdirectory and copy over to it the necessary files.

% cd three

Now copy over all the files needed to do this week's exercise. They are located in the directory location $UGRAD_DIR/week3.

% cp $UGRAD_DIR/week3/* .


5) Run the demo that describes this week's activities.

From now on in the course a demo will give you an overview of the material that you will be working with during the upcoming week's laboratory sessions. This week's demo is graphical and deals with viewing proteins as a collection of linked amino acids and seeing how the individual amino acid's properties of charge and hydrophobicity affect the resulting protein.

Graphical demos are actually run on a different computer. To reach this machine and place yourself in the account and directory from which you can run the demo, enter the following command.

% model1

Now get into MacroModel and view the demo for week three. MacroModel is the graphics program that you used in week one to visualize the demos describing what the course was going to be about. Entering mmv30 starts up the program. You respond to the question about a script file with week3.log and the question about doing a batch process with n.

$ mmv30

week3.log

n

The demo shows you the building blocks of proteins, the standard 20 amino acids. These are shown in a manner similar to that given on your reference card. They are in alphabetical order, seven each in the top two rows and six in the bottom one. On the reference card the amino acids are shown going down the page instead of across it.

Next the peptide bond formation is shown and a sequence is generated for a small peptide, the leucine form of enkephalin. You will be working with this sequence later in the exercise.

Amino acids have the properties of charge and hydrophobicity, therefore so do the proteins and peptides that are built from them. You will see the amino acids color coded with respect to both their charge and hydrophobicity and then view how these characteristics are carried over to the enkephalin sequence.

Proteins are cut into smaller fragments by proteolytic enzymes such as trypsin. These proteolytic enzymes cleave proteins at very specific locations. They cleave proteins similar to the way restriction enzymes cleave nucleotides. Trypsin cleaves proteins on the carboxyl sides of arginine, R, and lysine, K. A trypsin cleavage of a theoretical somatostatin molecule is shown. The lysine residues are shown in yellow and the rest of the molecule in aqua. The first cut occurs after the first lysine. This is shown by the bond being broken and that section of the structure being moved away from the rest of the molecule. The bond after the second lysine is then broken and the middle segment is moved to the right to show the clear breaks between the three sections.

Proteins fold into three-dimensional structures. A folded structure of somatostatin is shown. This image is reduced to a size that will work well in the next step. A stereo view is generated, and you rotate the structure and generate another stereo view of the molecule before you stop the demo. Instructions for creating the final stereo view are given below.

When the terminal beeps at you there will be a stereo image on the screen. Get a stereo viewer from your instructor and check the image out. When you are finished, select the STREO button again. A single image returns. Select the Rot Y button. Respond with 60 for the angle of rotation. Select the STREO button again. A new stereo image of your rotated molecule appears. When you are finished looking at it, select STOP, and respond with y to the two questions to end the demo.

Log out of this machine by entering the command logout at the dollar sign prompt.

$ logout

Back on ribozyme, one of the files that were copied over at the beginning of the exercise contains images of some of the information shown in the preceding demo. To see what this data looks like in a slightly different format, use the week3.images file. First, rename this file to reflect your own last name and then print it on the teaching lab printer.

% mv week3.images (your lastname).images3

% lpr (your lastname).images3

Pick up your hardcopy at the printer. The images shown are of the 20 amino acids color coded according to charge and the stereo view of somatostatin. Save this information. Start compiling a collection of the various molecular images you will be shown in the course.


6) Entering protein primary sequence data into the computer.

There are two ways you can enter sequence data from scratch with the VADMS system. The first uses the editor to create a file. The second uses the GCG program SEQED, which creates a sequence file in GCG-compatible format. Only the second method, SEQED, is covered in this exercise, and it is the one used throughout the course. A brief description of its use is given later in this exercise.

The method you choose to enter a sequence with depends on which software package you plan to use most extensively, and how comfortable you feel with editing software. Simple changes will allow a file created with one data entry method to be usable in another software system. In this course you will only be using the SEQED program from the GCG package for sequence data entry.

Invoke the GCG package by entering gcg. The GCG software package is set up to run programs simply by entering their name. You will be using the program SEQED to enter your protein sequences.

% gcg

The GCG welcome message appears on the screen.

     Welcome to the WISCONSIN PACKAGE
     Version 8.1-UNIX, August 1995
     Installed on irix

     Copyright 1982, 1983, 1984, 1985, 1986, 1987, 1989, 1991, 1992, 1994, 1995
     Genetics Computer Group, Inc.  All rights reserved.

     Published research assisted by this software should cite:
     Program Manual for the Wisconsin Package, Version 8, September 1994,
     Genetics Computer Group, 575 Science Drive, Madison, Wisconsin, USA 53711

     Databases available:
          GenBank                        Release 94.0  ( 4/96)
          EMBL (Abridged)                Release 43.0  ( 5/95)
          PIR-Protein                    Release 45.0  ( 6/95)
          SWISS-PROT                     Release 31.0  ( 3/95)
          NRL_3D                         Release 19.0  ( 6/95)
          PROSITE                        Release 12.2  ( 3/95)
          Restriction Enzymes (REBASE)                 ( 6/95)

     Help is available with the command % genhelp
     or by calling (608) 231-5200 or sending e-mail to Help@GCG.Com

This process may seem to take a relatively long time. A large number of logical and symbol assignments are being made so that you can easily use the package. Once this is done, you only need to enter a program's name in order to run it. A copy of the GCG manual is located inside the desk drawer of each carrel in the lab.

In this section you will use the GCG program SEQED to enter 5 simple protein sequences into the computer in GCG usable format. The respective sequences are given below and on the next page. They are all actual peptides or proteins of biological interest. Don't let the length of the sequence fool you. A number of hormones that regulate vital bodily functions are only a few amino acids long. Many biological poisons or toxins are also relatively small proteins.

     sequence 1 - leucine form of enkephalin     YGGFL

     sequence 2 - adipokinetic hormone           QLNYSPDW

     sequence 3 - somatostatin                   AGCKNFFWKTFTSC

     sequence 4 - heat stable enterotoxin ST-2   NTFYCCELCCYPACAGCN

     sequence 5 - melittin                       GIGAVLKVLTTGLPALISWISRKKRQQ

Activate the SEQED program by entering its name at the prompt.

% seqed

The screen changes and displays a double set of lines numbered from 0 to either 70 or 100. Since a filename was not given when the program was started, the software prompts you for the name of the sequence to work with. Respond as shown below. Use the following names for the five sequences: mo1.seq, mo2.seq, mo3.seq, mo4.seq and mo5.seq. User input is shown in bold type.

SEQED of what sequence ? mo1.seq <rtn>

Because there is no such file in your present directory location to work with, the program starts prompting you for information to insert in the header or comment portion of the new file. The cursor moves to the top of the screen. Enter some comment lines. Use the information given on the sequence data line if nothing else. When you are finished entering comment information, press Ctrl-d to return to sequence entry mode.

When the cursor moves to the first position on the top of the two lines, enter the sequence. As you enter the actual sequence, the cursor moves to the right and each new character appears on the screen. Symbols appear as you move along on the lower line as well. When you are finished, press Ctrl-d to go into the command mode of the program.

Being in the command mode is denoted by the presence of a colon in the bottom left-hand corner of the screen. Save your efforts to an output file by entering exit. The program will write the file with the same name you gave it earlier, mo1.seq, and return a notice that the file contains so many residues, and then quit.

Type off the results of your efforts. Notice the P after the Type: term in the checksum line. This indicates that the data set is a protein. The length of your generated sequence should match that shown on the preceding page. If not, you have a problem and should ask your instructor for assistance.

% cat mo1.seq

Repeat this process with the other four proteins given on the previous page. When you are finished with this section you should have five sequence files to work with.

SEQED is a complex program with many different options. For more complete information on its operation, consult the GCG manual located in your carrel drawer.


7) Determining the Molecular Weights of the 5 sequences

Now that you have 5 protein sequences to work with, determine their respective molecular weights. There are two ways of doing this. The method that works with a sequence actually residing in your account is to use the program PEPTIDESORT. This program is intended for more complex tasks than molecular weight determination. However, it can be tricked into reporting just the molecular weight of the sequence and its composition. An example of using the program in this mode is given below.

% peptidesort

PEPTIDESORT shows the peptide fragments from a digest of an amino acid
sequence.  It sorts the peptides by weight, position, and HPLC retention
at pH 2.1, and shows the composition of each peptide.  It also prints
a summary of the composition of the whole protein.

PEPTIDESORT of what protein sequence ?  mo1.seq <rtn>
                 Begin (* 1 *) ?  <rtn>
                End (*    5 *) ?  <rtn>

Select the enzymes:  Type nothing or "*" to get all enzymes.  Type "?"
for help on what enzymes are available and how to select them.

                                      Enzyme(* * *):  press space bar<rtn>

What should I call the output file (* mo1.pepsort *)<rtn>

Use this example as a guide to process all five of your protein sequences through the program. After the first pass through the program you can use the keyboard's up arrow to recall the previous command instead of retyping the name of the program.

When all five proteins have been run through peptidesort, use the following search command to seek out only the information that you are interested in from the resulting output data files. Record the results below.

% grep Molec *.pepsort

mo1.seq molecular weight: _____________________________________________________

mo2.seq molecular weight: _____________________________________________________

mo3.seq molecular weight: _____________________________________________________

mo4.seq molecular weight: _____________________________________________________

mo5.seq molecular weight: _____________________________________________________
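
The figure PEPTIDESORT reports can be sanity-checked by hand: a peptide's molecular weight is the sum of its residue weights plus one water for the free amino and carboxyl ends. The sketch below is only an illustration of that arithmetic. The table holds commonly quoted average residue masses, which is an assumption here; PEPTIDESORT may use slightly different values, so expect small differences in the last digits. The function name is made up for this example.

```python
# Average residue masses in daltons (each amino acid minus one water).
# Assumption: commonly quoted average-isotope values, not taken from GCG.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one H2O accounts for the free N- and C-terminal ends


def molecular_weight(sequence: str) -> float:
    """Sum the residue masses of a one-letter sequence and add one water."""
    return sum(RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER


if __name__ == "__main__":
    # A glycine dipeptide, chosen so the lab answers are not given away.
    print(round(molecular_weight("GG"), 2))
```

An unknown letter raises a KeyError, which is a convenient way to catch a typing mistake made during sequence entry.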


8) Solving a problem using these techniques

Now that you can enter a sequence and determine its molecular weight, use these skills to devise a way to determine the average residue weight of an amino acid. Review your problem set for last week for the definition of residue weight and its relationship to molecular weight. Describe your technique and the results of your efforts in a file you create with the pico editor called (your lastname).ave. Include in this file the sequence file you generated to solve the problem. Use rcp to send off this file to the teacher account.

% rcp (your lastname).ave teacher@ribozyme:receive


9) Working with Protein Primary Sequence Databases

You don't always have to work only with data that you have generated yourself. You can use pre-existing data found in databases. To use this type of data you need to be able to search the databases for the information that you want. Use the GCG program STRINGSEARCH for this.

By using STRINGSEARCH, information contained in the header portion of the data file can be searched. Two levels of searching are available. The first is the fastest, but only looks at the definition information. Definitions in this case contain the name of the organism, name of material, the sequence length, and possibly the date. The definition lines for the Protein and SwissProtein databases also contain the primary accession number for the sequences. This type of searching may miss sequences of interest unless you know just the right keywords to search for. The second level is much slower, but it looks through everything in the reference section to find the desired character string. This type of search may find more information than you really want.
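
The difference between the two search levels can be pictured with a small sketch. This is not how STRINGSEARCH is implemented; the Record class, the entry names and the case-insensitive matching are all assumptions made for illustration, and the two records are invented.

```python
from dataclasses import dataclass


@dataclass
class Record:
    name: str        # hypothetical database:entry name
    definition: str  # the one-line definition
    reference: str   # the rest of the reference section


def search(records, pattern, complete=False):
    """Return names of records whose text contains the pattern.

    complete=False looks only at the definition line (fast, may miss hits).
    complete=True also looks through the reference section (slow, may over-match).
    """
    needle = pattern.lower()
    hits = []
    for rec in records:
        text = rec.definition
        if complete:
            text = text + "\n" + rec.reference
        if needle in text.lower():
            hits.append(rec.name)
    return hits


# Invented records, for illustration only.
RECORDS = [
    Record("demo:entry1", "MELITTIN PRECURSOR", "venom peptide"),
    Record("demo:entry2", "HYPOTHETICAL PROTEIN", "mentions melittin in a comment"),
]

if __name__ == "__main__":
    print(search(RECORDS, "melittin"))                 # definitions only
    print(search(RECORDS, "melittin", complete=True))  # complete records
```

The second call returns an extra entry whose definition never mentions the search term, which is the kind of unwanted hit a complete-record search can produce.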

To explore the databases, use the STRINGSEARCH program to look for definition information on melittin. You are familiar with melittin already since you have generated a sequence file for it. Melittin is the principal peptide component of honeybee venom. Give the output file you generate the name mels.look to match the example given on the next page. We will look through only the SwissProtein database this time around. The short version of the name for the SwissProtein database is sw. User input is shown in bold type.

% stringsearch

STRINGSEARCH identifies sequences by searching sequence documentation
with character patterns such as 'globin' or 'human'.

 STRINGSEARCH through what sequence(s) (* GenEMBL:* *) ?  sw:*<rtn>

 Do you want to search through:
     A) definitions
     B) complete sequence records

 Please choose one (* A *):  a <rtn>

 Search for what text patterns ?  melittin <rtn>

 What should I call the output file (* sw.Strings *) ?  mels.look<rtn>

*** Swissprotein:Mel1_Apime ***

//////////////////////////////////////////////////////////////////////

*** Swissprotein:Mel_apifl ***
MELITTIN 26bp

     Sequences searched:    43470
 Sequences with matches:        4
        Patterns sought: melittin

            Output file: mels.look

%

Now type off the results of your STRINGSEARCH search. The output can be used as a listing file for names of desired sequences.

% cat mels.look

Look closely at the results. In every database search the user needs to review the results to see that only useful information is included in the output file. This means checking the results for any incomplete or partial sequences. It also means ensuring that all the data are consistent. You should only work with similar forms of the desired protein. Your search may also have found other proteins that happen to use the same term in their definition line. Remove these from your search list.

Not all the sequence lengths are the same. In fact there is quite a range of values. Now look at the descriptions given. Notice the words precursor or precur and fragment. The precursor form of a protein includes additional sections of the protein that are not present in the mature version. Unless a majority of the sequences found are also precursor forms, or you are studying precursor forms of a given protein, it is best to exclude them from the results. The other option is to modify all the precursor forms into mature forms and then use them. Sequence fragments are also a problem, as you can never be sure what part of the protein they actually come from.

Remove any problem sequences in the results file. The best way to do this is to edit the file with the pico editor and remove the questionable lines.

% pico mels.look
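
The same clean-up can be pictured as a simple line filter. The sketch below is only an illustration of what you are doing by hand in pico. It assumes each hit occupies a single line of the listing, which may not match the real layout of mels.look, and the sample lines are invented.

```python
# Terms that mark a questionable entry, taken from the discussion above.
PROBLEM_TERMS = ("precursor", "precur", "fragment")


def keep_line(line: str) -> bool:
    """True if the line mentions none of the problem terms (case-insensitive)."""
    lowered = line.lower()
    return not any(term in lowered for term in PROBLEM_TERMS)


def filter_listing(lines):
    """Return only the lines worth keeping, in their original order."""
    return [line for line in lines if keep_line(line)]


if __name__ == "__main__":
    # Invented listing lines, for illustration only.
    sample = [
        "demo:entry1  MELITTIN PRECURSOR",
        "demo:entry2  MELITTIN",
        "demo:entry3  MELITTIN (FRAGMENT)",
    ]
    for line in filter_listing(sample):
        print(line)
```

Editing by hand remains the safer route for a short list, since you can read each description before deciding to drop it.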

There is another possible problem with your data set: duplicate sequences. Remember how the NRL_3D database treats multiple copies of the same protein within a unit cell. There is yet another reason why duplicate copies of a sequence may appear in your data set: the same protein from different species may be identical. Because there is currently no way for one sequence to have multiple access codes, and valid data would be lost if these sequences were simply eliminated, they are kept. The practice of doing research based primarily on species groupings is also a reason to keep identical sequences with different access codes. The different databases, by the very manner in which they collect their data, also ensure that they contain copies of data found in other databases. In this case, keep the copy of the sequence from the database whose format you prefer and delete the copy from the less desired database(s).

GCG uses the checksum to determine if the data in a sequence have been corrupted. Since this number is generated based on the sequence's actual amino acid content, it is highly unlikely that two different sequences would have the same checksum. The checksum number then can be used to determine if there are any duplicate sequences in your data set.
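
To see why a checksum is a handy duplicate detector, consider this sketch of a position-weighted checksum. It is not GCG's own code. The weights cycling from 1 to 57 and the modulus of 10000 follow the form in which the GCG checksum is commonly reimplemented, and should be treated as an assumption; the function and variable names are made up for illustration.

```python
from collections import defaultdict


def position_weighted_checksum(sequence: str) -> int:
    """Weight each character code by its position (cycling 1..57), modulo 10000."""
    total = 0
    for index, char in enumerate(sequence.upper()):
        weight = index % 57 + 1
        total += weight * ord(char)
    return total % 10000


def find_duplicates(named_sequences):
    """Group sequence names that share a checksum; return groups of two or more."""
    groups = defaultdict(list)
    for name, seq in named_sequences.items():
        groups[position_weighted_checksum(seq)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


if __name__ == "__main__":
    # Invented names; the second entry repeats the first in lower case.
    data = {"demo1": "AAKGGRCC", "demo2": "aakggrcc", "demo3": "CCRGGKAA"}
    print(find_duplicates(data))
```

Because each character is weighted by its position, two sequences with the same composition but a different order usually get different checksums. A matching checksum is strong evidence of a duplicate, not proof, so compare the sequences themselves before deleting anything.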

GCG only generates a checksum line when it either displays data from a database on the terminal screen or creates a file containing a sequence from a database. The GCG program that performs this function is FETCH.

FETCH can work from a file containing a listing of filenames or access codes found in a database such as the STRINGSEARCH output file you have generated, mels.look.

FETCH, like most GCG programs, can be run in a command line mode. This means that a program requiring only simple input can be given that information on the same line as the program name, and it then does whatever is needed without further prompting. It can also be run interactively, with the user supplying the necessary data when prompted. For practice, run FETCH in command line mode. When run this way, FETCH names its output files as follows: the access code for the sequence becomes the filename and the database name becomes the extension. When the input is a file that lists other files or sequences, put the @ symbol before the filename so that GCG processes the entries within that file correctly. An example of its use is given below.

% fetch @try.fil

The following type of information is returned to your terminal screen.

FETCH copies GCG sequences or data files from the GCG database
into your directory or displays them on your terminal screen.

 A30484.Pir2
 Hsxl4.Pir1

%

Generate copies of the sequences contained in your mels.look file by using FETCH in the manner given above, replacing @try.fil with @mels.look. Once you have copies of these files in your own account use the search utility to locate the checksum lines and compare the numbers from the various files to locate duplicate files. This would be done by the command given on the next page. Record the results of your searches below.

% grep Check *.sw

duplicate sequences: _______________________________________________________

Remove duplicate sequences. Edit the file with the pico editor to remove any duplicate sequence lines.

% pico mels.look

Once mels.look has been revised for the last time, delete the files that you brought over with FETCH from the databases to your own account. Use the del version of the UNIX rm command to remove the files you no longer need. The del command prompts you for confirmation of the deletion process.

% del *.sw

There is another way to use the FETCH program. The term typedata has been defined to run FETCH and have the program display the information on the screen instead of creating a data file. Now that the files have been removed from your account, use typedata to display one of the remaining hits listed in the mels.look file. In the example given below, xxx:yyyyy represents the name of the desired sequence: xxx is the database it comes from, the : is the separator that tells the software it is dealing with a database, and yyyyy is its access code.

% typedata xxx:yyyyy

If typedata is used instead of the file producing form of FETCH then you would have to carefully examine each file for the desired information instead of using the search utility to find out what you need to know.

Now that the output file has been revised to contain only valid sequences, use it to determine the molecular weights of the sequences found. This is done with the program MOL_WT. This is a locally produced program that only works on sequences in a database. It will not work on sequences such as the ones you generated earlier with the seq extension. You can use the list of melittin sequences as input for the program only if you precede its filename with an @ sign. This symbol tells the program that you are using a file that contains a listing of sequences to be worked with. An example of using MOL_WT is given below. Use this as a guide for determining the range of melittin molecular weights. This program is slow in coming up, so be patient.

% mol_wt

Mol_WT what sequence(s) ? @mels.look <rtn>

What should I call the output file (*  *) ?  mels.wts<rtn>

%

Type off the results of this run and determine the range of the molecular weights for these proteins. Record the observed range on the next page.

% cat mels.wts

Observed range: _____________________________________________________________


10) Determining isoelectric points for the entered sequences

GCG contains a program called ISOELECTRIC which determines protein isoelectric points. This is a graphics program. In order to use it, you must let the GCG package know what type of graphics device you are using. Enter the special VADMS term tek_plot.

% tek_plot

Information is shown on the screen about setting the graphics device to a Tektronix terminal. Now run the program five times to get the isoelectric points of the five sequences you entered earlier in this week's laboratory session: mo1.seq, mo2.seq, mo3.seq, mo4.seq and mo5.seq. Use the example given below as a guide for running this program.

% isoelectric

ISOELECTRIC plots the charge as a function of pH for any peptide 
sequence.

 Process set to plot with TEK4107 attached to term:
 using the tekd graphic interface.
 ISOELECTRIC of what protein sequence ? mo1.seq<rtn>

                 Begin (* 1 *) ? <rtn>
               End (*     5 *) ? <rtn>

When your TEK4107 attached to tty is ready, press <RETURN>.<rtn>

When you are finished recording your observations either below or on the next page, press the ENTER key and get out of the program. Then enter the term clearplot and respond to its prompt by pressing the ENTER key. This will clear the screen making it ready for a new plot. Run each of the sequences through ISOELECTRIC.

Record the found isoelectric points below.

mo1.seq isoelectric point: _______________________________________________________

mo2.seq isoelectric point: _______________________________________________________

mo3.seq isoelectric point: _______________________________________________________

mo4.seq isoelectric point: _______________________________________________________

mo5.seq isoelectric point: _______________________________________________________
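
ISOELECTRIC plots charge against pH; the isoelectric point is the pH at which that curve crosses zero. The sketch below shows one way such a value can be estimated: add up the fractional charge of every ionizable group at a given pH, then search for the pH where the total is zero. The pKa values are an illustrative textbook-style set and are an assumption; ISOELECTRIC may use different ones, so its answers need not match. Cysteines are treated as free thiols, and all names are made up for this example.

```python
# Assumed pKa values, for illustration only (not taken from GCG).
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA = 8.6
C_TERM_PKA = 3.6


def net_charge(sequence: str, ph: float) -> float:
    """Net charge of a peptide at the given pH (Henderson-Hasselbalch)."""
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))   # free amino end
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))  # free carboxyl end
    for aa in sequence.upper():
        if aa in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[aa]))
        elif aa in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[aa] - ph))
    return charge


def isoelectric_point(sequence: str, tolerance: float = 0.001) -> float:
    """Bisect between pH 0 and 14 for the pH where the net charge is zero."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid   # still positive, so the crossing lies at higher pH
        else:
            high = mid
    return (low + high) / 2.0


if __name__ == "__main__":
    print(round(isoelectric_point("GG"), 2))
```

Bisection works here because the net charge can only fall as the pH rises, so the curve crosses zero once.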

11) Determining the hydrophobic nature of the entered sequences

Now that the isoelectric points have been determined, determine the nature of the 5 entered proteins with respect to the hydrophobicity of their component amino acid side chains. The program to be used is PK23. This particular program was developed locally for Kyte-Doolittle hydrophobicity plots where the user could do more than one plot on the screen at one time, selecting the window sizes to be used for each plot. The default window size for such a plot is 7. This won't work with all the sequences you have entered, since the smallest one is only 5 amino acids long. Use a window size of 3 for the first two sequences and window sizes of 3 and 7 for the last three sequences. Use the example given on the next page to run the PK23 program. Record your observations on the phobic nature of the five sequences in the space provided for that purpose on the next page.

PK23 is a graphics program. However, since you have already told GCG that you want graphics output, you don't need to do it again this lab period unless you change the type of graphics output you want. Here is an example of running the program on a protein with two windows. The resulting plot is also shown.

% pk23

 Process set to plot with TEK4107 attached to term:
 using the tekd graphic interface.

Kyte - Doolittle plotting program

Please enter the filename.ext p:ccho <rtn>

                  Begin (*   1 *) ? <rtn>

                End (*     104 *) ? <rtn>

 enter number of window sizes to be  tried: 2<rtn>

 Average of hydrophilicity over how many acids (* 7 *) ?  3<rtn>

 Average of hydrophilicity over how many acids (* 7 *) ?  7<rtn>

 values are -3.8333338 to 3.9000008

 When your TEK4107 attached to tty is ready, press <RETURN>.<rtn>

Run each of the five sequences through the program using the instructions given on the previous page. Remember to use clearplot to clear the screen, making it ready for each new plot. Record the results below.

mo1.seq phobic nature: _________________________________________________________

mo2.seq phobic nature: _________________________________________________________

mo3.seq phobic nature: _________________________________________________________

mo4.seq phobic nature: _________________________________________________________

mo5.seq phobic nature: _________________________________________________________
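
The curve PK23 draws is a moving average. A minimal sketch of the idea follows, using the published Kyte-Doolittle hydropathy values, on which positive numbers are hydrophobic and negative numbers hydrophilic. The function name is made up for illustration, and PK23 itself may differ in details such as how it treats the ends of the sequence.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, window: int = 7):
    """Average the hydropathy over every run of `window` consecutive residues."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]


if __name__ == "__main__":
    peptide = "AAKGG"  # invented five-residue peptide
    print(len(hydropathy_profile(peptide, window=3)))  # windows that fit
    print(len(hydropathy_profile(peptide, window=7)))  # none fit
```

A sequence of length n yields n - window + 1 points, so a five-residue peptide gives three points with a window of 3 and none at all with the default window of 7. This is why the smaller window is needed for the short sequences.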

12) Checking for Trypsin Cut Sites

Proteins are cut into smaller fragments by proteolytic enzymes such as trypsin. These enzymes cleave proteins at specific sites, much as restriction enzymes cleave nucleotide sequences. Trypsin cleaves a protein after each arginine (R) or lysine (K) residue in its sequence. When the computer looks for trypsin cleavage points, it treats the sequence as a character string and checks for the position of every R and K character in that string. It then divides the sequence into fragments based on what it has found. Locating proteolytic enzyme cut sites is a simple form of the pattern searching that is used throughout sequence analysis. For more information on the proteolytic enzymes, refer to your reference card. It will show you where various enzymes cleave protein sequences.
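
The string-based search just described can be sketched in a few lines. This follows only the simplified rule stated above (cut after every K or R) and is not PEPTIDESORT's code; the function name and the test peptide are made up for illustration.

```python
def tryptic_fragments(sequence: str, cut_after: str = "KR"):
    """Split a one-letter sequence after every residue listed in cut_after."""
    seq = sequence.upper()
    fragments = []
    start = 0
    for position, residue in enumerate(seq):
        if residue in cut_after:
            # The cut falls after this residue, so it ends the fragment.
            fragments.append(seq[start:position + 1])
            start = position + 1
    if start < len(seq):
        fragments.append(seq[start:])  # whatever follows the last cut site
    return fragments


if __name__ == "__main__":
    # Invented peptide with one K and one R.
    print(tryptic_fragments("AAKGGRCC"))
```

A sequence with no K or R comes back as a single fragment, which is how you can tell that trypsin does not cut it.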

Remember when you used PEPTIDESORT to determine the molecular weights of the protein sequences that you entered? Its main purpose is to provide information on where proteolytic enzymes clip proteins into fragments. To do this the program uses a list of available proteolytic enzymes and information on where they cut a protein. What the program actually does is search for the character string that represents a given cut site. When it finds a cut site, it stores that information. After the entire sequence has been searched, it divides the initial sequence into parts based on the location of the found cut sites. This is done for each of the selected proteolytic enzymes. The results are organized in a number of ways: according to position, weight, and HPLC retention times (small fragments to largest ones).

% peptidesort

PEPTIDESORT shows the peptide fragments from a digest of an amino acid
sequence.  It sorts the peptides by weight, position, and HPLC retention 
at pH 2.1, and shows the composition of each peptide.  It also prints
a summary of the composition of the whole protein.

PEPTIDESORT of what protein sequence ?  mo1.seq<rtn>
                 Begin (* 1 *) ?  <rtn>
                End (*    5 *) ?  <rtn>

Select the enzymes:  Type nothing or "*" to get all enzymes.  Type "?"
for help on what enzymes are available and how to select them.

                                      Enzyme(* * *):  tryp<rtn>

   TRYP      TRYP
   "TRYP" selected 2 enzymes new total: 2.
                                      Enzyme: <rtn>

What should I call the output file (*mo1.pepsort *)  mo1.tryp<rtn>

%

Check the results of this run by typing off the output file to see whether trypsin cuts mo1.seq anywhere. This is done in the following manner.

% more mo1.tryp

Try this analysis on the rest of the sequences that you generated earlier. Record below those sequences which trypsin actually cleaves.

trypsin cleaves: _______________________________________________________________

13) Looking at a problem

One class of proteins serves as metal ion traps. These are known as metallothioneins. There are a number of different types of these proteins. The ones we are interested in are called metallothionein-ii. These can be found by using the STRINGSEARCH program. As before, sw is the short version of the logical name for the SwissProtein database. Quotes are used in the example given on the next page because we are looking for exactly the phrase found between the quotes. The space at the end of the phrase is required to limit the search to only those hits of interest.

% stringsearch

STRINGSEARCH identifies sequences by searching sequence documentation
with character patterns such as 'globin' or 'human'.

 STRINGSEARCH through what sequence(s) (* GenEMBL:* *) ?  sw:*<rtn>

 Do you want to search through:
     A) definitions
     B) complete sequence records

 Please choose one (* A *):  a <rtn>

 Search for what text patterns ?  "metallothionein-ii "<rtn>

 What should I call the output file (* sw.Strings *) ?  metal.look<rtn>

//////////////////////////////////////////////////////////////////////

     Sequences searched:    43470
 Sequences with matches:       12
        Patterns sought: metallothionein-ii 

            Output file: metal.look

%

Use the data set generated in this manner to find out what you can about these proteins. Look at the actual data files with typedata. Check the range of their molecular weights. Are they phobic or philic in nature? Use PK23 to find out. See if trypsin cleaves them into smaller fragments. While you have the PEPTIDESORT data in front of you, read it carefully to see if there is anything unusual about the amino acid composition of these proteins. Create your own report form on this investigation using the pico editor. Name it in the following manner, (your lastname).metal, where (your lastname) represents your own last name. Be sure to include your name within the report as well so your instructors can tell who it is from in case of similar last names among the students taking the course. Use rcp to get your report file to the teacher account.

% pico (your lastname).metal

% rcp (your lastname).metal teacher@ribozyme:receive

14) Extra Credit (optional)

Use the skills that you have developed to search the SwissProtein database for the names of antifreeze sequences. Remove the names of sequences that have the terms precursor or fragment in them. Also remove those lines that have the terms proteins, clone or clones in them. Determine the range of molecular weights as well as any other information you can gather on the proteins listed in your revised search list. Create your own report form on what you have discovered about the antifreeze proteins with the pico editor. Name it in the following manner, (your lastname).anti, where (your lastname) represents your own last name. Include your name within the report so your instructors can tell who it is from. Use rcp to get your report file to the teacher account.


15) Finishing up.

Rename the report form for this exercise to have your last name, go into the file with the editor, pico, to fill in the report and then rcp it to the teacher account. Be sure that you have sent over the earlier files that were called for with the extensions ave, metal and anti (optional).

% mv week3.week3 (your lastname).week3 

% pico (your lastname).week3 

% rcp (your lastname).week3 teacher@ribozyme:receive

% rcp (your lastname).ave teacher@ribozyme:receive

% rcp (your lastname).metal teacher@ribozyme:receive

% rcp (your lastname).anti teacher@ribozyme:receive

This concludes your computing session for this week. Log off ribozyme, get out of the emulator and back to the overlapping windows screen.

% logout

Press the alt and x keys together. This will cause the screen to ask if you really want to exit the program. Respond with y to get out of the teemtalk emulator and return to the overlapping windows screen.