'97 BC/BP 578

Week 5

Modelling Series

Learning about ways to estimate protein secondary structures and the usefulness of these techniques for other purposes.

Author:

Susan Jean Johns

Protein Secondary Structure Information

The determination of protein secondary structure has been an intriguing puzzle for some time. When Linus Pauling made his prediction in 1948 that proteins would be composed of alpha helix and beta sheet units, no protein structures had yet been determined. His prediction was based solely on the idea that the potential hydrogen bonding possible in such structures would increase their stability and make them more probable. Improvements in x-ray diffraction techniques later made it possible to solve protein three-dimensional structures. The predicted structural units were there.

As more and more structures were determined, the beginnings of possible folding patterns were observed. Soluble globular proteins have started to be understood in general terms. "The principle underlying the structure of helices, sheets, and turns is the simultaneous formation of hydrogen bonds by buried peptide groups and the retention of single residue conformations close to those of minimum energy. The shape of the helix and sheet structures make these structural elements pack together in a small number of relative orientations. The links between secondary structures tend to be right-handed and short, and do not form knots." As a result, globular proteins usually fold into a few common patterns. These proteins can roughly be grouped into four classes: all-alpha, all-beta, mixed alpha/beta formed from beta-alpha-beta units, and alpha + beta where the helix and sheet units are segregated.

The possibility of a given section of a peptide folding to form a helix, a sheet or a turn is primarily dependent on the preferred conformations of the constituent residues and the packing quality of the surface formed. Because folding forces have this local character, prediction schemes based only on local or semi-local sequence patterns have been devised, with relative success. Beyond these generalities, the detailed mechanisms of folding are still only vaguely understood.

Even as the body of determined structures grows, questions remain as to the relationship between solved crystal structures and proteins in solution. What effect do ionic conditions have on secondary structure? What effect does protein concentration have? Do crystals with different space groups produce the same or similar protein structures? Do x-ray and NMR (solution) structure determinations on the same protein agree with one another? If not, what are the causes of the differences?



Secondary Structure Prediction

Indications of the possible structures of proteins came from initial studies on polypeptides. As protein structures were solved it appeared that the conformation of a residue in a protein was the same as in the homopolymeric form. This correlation is far from perfect, however.

After a number of structures had been determined, various groups carried out statistical studies on the data to determine whether an individual amino acid prefers a given secondary structure type. These efforts resulted in the empirical prediction schemes of Chou-Fasman and Garnier-Robson. The Chou-Fasman method is a group of rules applied to a given sequence. The rules are ambiguous, which has made the method difficult to automate. The Garnier-Robson method is based on consistent application of information theory, with auxiliary information from circular dichroism used to bias its prediction. This method is unambiguous and easy to automate.
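To give a feel for what a rule of the Chou-Fasman kind looks like, here is a minimal sketch of just one step, helix nucleation: scan a window of six residues and flag it when at least four residues are helix formers. The propensity values are the helix values as commonly tabulated for the Chou-Fasman method and should be checked against the original publication before being relied on. The function name and the 1.03 cutoff separating formers from the rest are assumptions of this sketch, and the extension and termination rules of the full method are left out.

```python
# Helix propensities P(alpha) as commonly tabulated for Chou-Fasman;
# verify against the original publication before relying on them.
P_ALPHA = {
    "E": 1.51, "M": 1.45, "A": 1.42, "L": 1.21, "K": 1.16,
    "F": 1.13, "Q": 1.11, "W": 1.08, "I": 1.08, "V": 1.06,
    "D": 1.01, "H": 1.00, "R": 0.98, "T": 0.83, "S": 0.77,
    "C": 0.70, "Y": 0.69, "N": 0.67, "P": 0.57, "G": 0.57,
}


def helix_nuclei(sequence, window=6, needed=4, cutoff=1.03):
    """Return 1-based start positions of windows that could nucleate a helix.

    Simplification (an assumption of this sketch): a residue counts as a
    helix former when its propensity exceeds `cutoff`.
    """
    starts = []
    for i in range(len(sequence) - window + 1):
        formers = sum(1 for aa in sequence[i:i + window] if P_ALPHA[aa] > cutoff)
        if formers >= needed:
            starts.append(i + 1)
    return starts


# First ten residues of the lysozyme example used later in this exercise.
print(helix_nuclei("KVFGRCELAA"))
```

For these ten residues only the window starting at residue 5 qualifies. Deciding how far such a nucleus extends, and what to do when helix and sheet rules both fire, is where the ambiguity of the method comes in.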

Another approach is to look for periodicity in regular secondary structures. Such information is best seen in helical wheel diagrams, where the view down the helical axis shows groupings of similar kinds of amino acids. The regular appearance of apolar residues spaced 3 or 4 residues apart could be a pattern characteristic of alpha helices, while sheets might show uniformly apolar sections - if completely buried within a protein - or alternating polar and apolar residues if on the surface. Some proteins display these patterns to a certain extent. Such studies have resulted in Lim's prediction scheme and Eisenberg's hydrophobic moment technique.
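The periodicity idea can be made concrete with a small calculation in the spirit of the hydrophobic moment. Each residue contributes a vector whose length is its hydropathy and whose direction advances by a fixed angle per residue (100 degrees for an ideal alpha helix, 3.6 residues per turn). A large resultant means the apolar residues cluster on one face. This is a minimal sketch: it uses the Kyte-Doolittle hydropathy scale rather than the scale used in Eisenberg's own work, and the function name is hypothetical.

```python
import math

# Kyte-Doolittle hydropathy values. Substituting this scale for the one
# used in the original hydrophobic moment work is an assumption of the sketch.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_moment(sequence, angle_deg=100.0):
    """Mean hydrophobic moment of a segment.

    Residue n contributes a vector of length H(n) pointing at n * angle_deg
    around the axis; the result is the length of the vector sum divided by
    the number of residues.
    """
    delta = math.radians(angle_deg)
    sin_sum = sum(KYTE_DOOLITTLE[aa] * math.sin(n * delta)
                  for n, aa in enumerate(sequence))
    cos_sum = sum(KYTE_DOOLITTLE[aa] * math.cos(n * delta)
                  for n, aa in enumerate(sequence))
    return math.hypot(sin_sum, cos_sum) / len(sequence)


# Compare a helical repeat (100 degrees) with a strand-like repeat (180 degrees).
segment = "GIGAVLKVLTTGLPALISWIKRKRQ"
print(hydrophobic_moment(segment, 100.0), hydrophobic_moment(segment, 180.0))
```

A uniformly apolar stretch gives a moment near zero at any angle, while a strictly alternating polar/apolar stretch gives a large moment at 180 degrees, which mirrors the buried-sheet and surface-sheet patterns described above.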

Other scientists have looked at all the possible structural conformations for various sequence sections that exist in the known structures and tried to form prediction schemes based on their findings. The thought is that a similar sequence will have similar secondary structures wherever it is found. To do this, a measure of homology (or similarity) needs to be established between the studied sequences, and the possible conformations found need to be weighted in order to form a final prediction. The algorithms of Nishikawa and Ooi, Levin, and Sweet are all based on this theme. The differences among them result from the comparison choices made and the scoring systems used.
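The general shape of such a similarity-based scheme can be sketched in a few lines. The example below is not any of the published algorithms: the fragment database is made up for illustration, similarity is a plain count of identical positions (the published methods use more refined scoring), and the function names are hypothetical. It only shows the two ingredients named above, a similarity measure and a weighted vote over the conformations found.

```python
from collections import Counter

# Hypothetical mini-database of (fragment, state of its central residue).
# A real method would draw these fragments from solved structures.
KNOWN_FRAGMENTS = [
    ("AELLK", "H"),
    ("EELLK", "H"),
    ("VTVKV", "E"),
    ("TVTVY", "E"),
    ("NGSPD", "-"),
]


def identity_score(a, b):
    """Similarity measure: number of positions with identical residues."""
    return sum(1 for x, y in zip(a, b) if x == y)


def predict_central_state(window, database=KNOWN_FRAGMENTS):
    """Weighted vote over all database fragments; weight = similarity score."""
    votes = Counter()
    for fragment, state in database:
        votes[state] += identity_score(window, fragment)
    state, weight = votes.most_common(1)[0]
    # With no similarity to anything in the database there is no basis to predict.
    return state if weight > 0 else "?"


print(predict_central_state("AELLR"))
```

Changing either the similarity measure or the way votes are weighted changes the prediction, which is exactly where the published methods differ from one another.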



Prediction Reliability

Studies done on the reliability of the various prediction schemes show disheartening results. Depending on whether three or four secondary structural elements are used, random chance would result in either a 33% or a 25% chance of a prediction being correct. The different approaches touched on here improve the chances only to 45% to 55%. Reported higher percentages usually are the result of a biased data set, and not an actual improvement in the technique devised.
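The percentages quoted here are per-residue scores: the fraction of residues whose predicted state matches the state assigned from the solved structure. A minimal sketch of that bookkeeping follows, assuming both the prediction and the assignment are written as strings with one letter per residue. The function names are hypothetical, and the chance level assumes uniform guessing among the states.

```python
def percent_correct(predicted, observed):
    """Percentage of residues whose predicted state matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("prediction and observation must have the same length")
    if not predicted:
        raise ValueError("empty input")
    matches = sum(1 for p, o in zip(predicted, observed) if p == o)
    return 100.0 * matches / len(predicted)


def chance_level(n_states):
    """Expected percent correct when guessing uniformly among n_states."""
    return 100.0 / n_states


# Three of four residues agree; compare with the three- and four-state baselines.
print(percent_correct("HHHE", "HHEE"), chance_level(3), chance_level(4))
```

Scoring your own predictions from this exercise against the assignments in the solved structure is a quick way to see how far above the chance level each method lands.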

As a user of such prediction schemes you should be cautious in applying and interpreting their results. It is best to use these predictions only in cases where other types of potentially confirming evidence are available, such as the presence of antigen-producing regions, or estimates derived from physical data.



Selected Molecule List

Select a molecule that most closely fits the general type of work that you are doing in the wet lab or plan to use in your project and continue to use it throughout the semester. Use the same selected molecule for this exercise as you did for week 3.

1) higher plant ribulose bisphosphate carboxylase/oxygenase, small subunit only

2) mammalian P21 ras proto-oncogene transforming protein

3) mammalian basic fibroblast growth factor

4) fungal superoxide dismutase

Week 5 Exercise

This series of exercises will acquaint you with the various methods on the computer for estimating protein secondary structure. Some of these programs use primary sequence data, others are based on statistical analysis or interpretation of crystallographic results.

1) Activate the computer

Pressing any key changes the terminal from screen saver mode to active.



2) Select the RIBOZYME icon and log onto ribozyme.

From the Launcher window, select the RIBOZYME icon and press the mouse button twice. Successful connection to ribozyme is denoted by the appearance of a ribozyme information line and a login: prompt. Once the login: prompt appears, log on to the machine by entering first your account name and then your password at the Password: prompt.



3) Create a subdirectory to keep this week's work in.

To keep data in separate working areas create subdirectories. This is done with the mkdir command. Create the following subdirectory in your account.

% mkdir week5

Move into that location using the following command line.

% cd week5

Copy over the files needed for this week's activities.

% cp $GRAD_DIR/week5m/*.* .


4) Locating structural information in NRL_3D files.

The NRL_3D database contains primary sequences of all the solved protein structures in the PDB database. This database is a good source of information on PDB protein files that can be searched using sequence analysis software.

Lysozyme was one of the earliest proteins whose crystal structure was determined. There are numerous lysozyme structures in the PDB database from various sources. Limit the files you look at to lysozymes from chicken egg white with secondary structure assignments, by using multiple passes of GCG's STRINGSEARCH program. To use this program activate the GCG software suite.

% gcg

The welcoming message for the package appears on the screen. Now start the process of doing string searches on a series of ever-decreasing datasets. Use the instructions here as a guide through the process. User input is shown in bold type.

% stringsearch

StringSearch identifies sequences by searching for character patterns
such as "globin" or "human" in the sequence documentation.

 STRINGSEARCH through what sequence(s) (* GenEMBL:* *) ? nrl_3d:*<rtn>

 Do you want to search through:

     A) definitions
     B) complete sequence annotation

 Please choose one (* A *):<rtn>

 Search for what text patterns ?  lysozyme<rtn>

 What should I call the output file (* nrl_3d.strings *) ? lyso.strings<rtn>

A listing of the hits is produced on the screen. Looking at this listing shows how many possible data files you may be dealing with.

     Sequences searched:     xxxx
 Sequences with matches:      xxx
        Patterns sought: lysozyme

            Output file: lyso.strings

The list of lysozyme data files contained in lyso.strings is all the lysozyme structures currently in the NRL_3D database. Not all these files are from egg white or even from chicken for that matter. To restrict the lysozyme files to just those that are from chicken, repeat the process this time using the output file from the first search as the source of the data for the second search. To use the first output file as input for the second run, an @ symbol is placed directly before the name of the first output file.

% stringsearch

StringSearch identifies sequences by searching for character patterns
such as "globin" or "human" in the sequence documentation.

 STRINGSEARCH through what sequence(s) (* GenEMBL:* *) ? @lyso.strings<rtn>

 Do you want to search through:

     A) definitions
     B) complete sequence annotation

 Please choose one (* A *):<rtn>

 Search for what text patterns ?  chicken<rtn>

 What should I call the output file (* lyso.strings *) ? lyso2.strings<rtn>

A listing of the hits is produced on the screen. Looking at this listing shows that the number of possible data files has been reduced.

     Sequences searched:      xxx
 Sequences with matches:       xx
        Patterns sought:  chicken

            Output file: lyso2.strings

There are still too many files to deal with. Type out the results of the second search with the more command. Press the space bar to advance to the next screen of information after you have carefully examined the data presented.

% more lyso2.strings

Many of the remaining files are the results of mutation studies. Looking closely at these definition lines reveals that by using the search phrase 17) - chicken 129aa it would be possible to restrict the output even more and zero in on the desired data. Putting this search phrase within double quotes ensures that the entire phrase, including the spaces between terms, will be used in the searching process.

% stringsearch

StringSearch identifies sequences by searching for character patterns
such as "globin" or "human" in the sequence documentation.

 STRINGSEARCH through what sequence(s) (* GenEMBL:* *) ? @lyso2.strings<rtn>

 Do you want to search through:

     A) definitions
     B) complete sequence annotation

 Please choose one (* A *):<rtn>

 Search for what text patterns ?  "17) - chicken 129aa"<rtn>

 What should I call the output file (* lyso2.strings *) ? lyso3.strings<rtn>

A listing of the successful hits is shown.

     Sequences searched:       xx
 Sequences with matches:        x
        Patterns sought: 17) - chicken 129aa

            Output file: lyso3.strings

Now that the list is manageable, the question is: do any of these data sets contain secondary structure assignments? The terms to be searched for this time are helix, sheet and turn and the complete sequence annotation needs to be looked at to find this out.

% stringsearch

StringSearch identifies sequences by searching for character patterns
such as "globin" or "human" in the sequence documentation.

 STRINGSEARCH through what sequence(s) (* GenEMBL:* *) ? @lyso3.strings<rtn>

 Do you want to search through:

     A) definitions
     B) complete sequence annotation

 Please choose one (* A *):  b<rtn>

 Search for what text patterns ? helix,sheet,turn<rtn>

 What should I call the output file (* lyso3.strings *) ? lyso4.strings<rtn>

A listing of the first hit for each of the search terms found in the data files is shown.


     Sequences searched:        x
 Sequences with matches:        x
        Patterns sought: helix sheet turn

            Output file: lyso4.strings

Type out these results with the cat command.

% cat lyso4.strings

Any of these data files would serve the purpose of this exercise. One of them is already in your account. It was copied over into your week5 subdirectory at the start of the exercise. This file will be used to do additional analysis on model1 using software that works directly on the PDB formatted data files. Do a directory listing to determine which lysozyme file is already in your account and record that access code below.

% ls -la

selected lysozyme access code:___________________________________________

Search the selected data set for the secondary structure assignments with the following commands. Record your results below. The typedata program is a version of FETCH that sends the results to the screen instead of a file. The xxxx in these commands represents the chosen lysozyme access code.

% typedata nrl_3d:xxxx | grep helix 

% typedata nrl_3d:xxxx | grep sheet 

% typedata nrl_3d:xxxx | grep turn 

helix locations: __________________________________________________

sheet locations: __________________________________________________

turn locations: ___________________________________________________

Repeat this process looking for a similar listing of possible PDB files to use for your selected molecule. After you have located a listing of possible PDB access codes for your selected molecule, choose one of the possible files to determine its secondary structural information. Here the xxxx represents the access code for your selected molecule in the NRL_3D database.

selected molecule access code: ______________________________________

% typedata nrl_3d:xxxx | grep helix

% typedata nrl_3d:xxxx | grep sheet

% typedata nrl_3d:xxxx | grep turn

helix locations: __________________________________________________

sheet locations: __________________________________________________

turn locations: ___________________________________________________

Now use the instructions in section 9 of week 4's exercise (gopher access of PDB data) to bring over the same PDB coordinate data into your account. If your chosen access code is longer than 4 characters, use only the first four for your gopher search. Save the file with a pdb extension.



5) Doing secondary structure prediction from sequence data.

Having seen the secondary structure assignments for the lysozyme and your selected molecule from the solved structures, determine what sort of assignments can be derived from computer software. The GCG program to use for this task is called PEPTIDESTRUCTURE. Run PEPTIDESTRUCTURE on both the lysozyme data and your selected molecule data. An example run is given below. Use this information as a guide when running your own analysis. In the example nrl_3d:xxxx represents the access code of your chosen lysozyme data set. User input is shown in bold type.

% peptidestructure

PeptideStructure makes secondary structure predictions for a peptide
sequence. The predictions include (in addition to alpha, beta, coil, and
turn) measures for antigenicity, flexibility, hydrophobicity, and surface
probability.  PlotStructure displays the predictions graphically.


  PEPTIDESTRUCTURE for what peptide sequence ? nrl_3d:xxxx<rtn>

                  Begin (* 1 *) ? <rtn>
                End (*   129 *) ? <rtn>

 Calculate hydrophilicity according to

     H)opp-Woods or
     K)yte-Doolittle

 Please choose one (* K *) : <rtn>

  What should I call the output file (* xxxx.p2s *) ? <rtn>

Use the more command to display the results of this run one screen's worth at a time. The very first part of this output is given below. In the example xxxx represents the access code for your NRL_3D file.

% more xxxx.p2s

This file needs some explaining. After the descriptive material at the top stating what sequence was used and references for the determinations that were made, the file lists the results of the run. The results are reported for each residue of the sequence. There are nine different columns of data. The first and second columns report the location of the residue in the sequence and the residue's name. Columns seven and eight are the secondary structure prediction columns. An h in either of these columns denotes a helical structure prediction, b stands for a sheet, and t for a turn. Column seven contains the Chou-Fasman predictions. This scheme uses upper and lower case letters to denote the strength of the prediction. Strong predictions are in upper case and weak are in lower case. Column eight contains the Garnier-Osguthorpe-Robson secondary structure prediction results. Only upper case letters are used in this prediction scheme. The rest of the reported results deal with other protein characteristics that can be determined from primary sequence analysis and are not of interest to the present line of inquiry.

PEPTIDESTRUCTURE of: nrl_3d:xxxx  check: 2383  from: 1  to: 129

lysozyme (EC 3.2.1.17) - chicken

Hydrophilicity (Kyte-Doolittle) averaged over a window of: 7
Surface Probability according to Emini
Chain Flexibility according to Karplus-Schulz
Secondary Structure according to Chou-Fasman
Secondary Structure according to Garnier-Osguthorpe-Robson
Antigenicity Index according to Jameson-Wolf

                    Date: January 20, 1996 20:44

Pos  AA  GlycoS  HyPhil  SurfPr  FlexPr  CF-Pred GORPred AI-Ind ..

 1     K       .  -0.675   0.476   1.000       .       H  -0.450
 2     V       .   0.360   0.730   1.000       .       H   0.450
 3     F       .  -0.117   0.306   1.000       .       H  -0.150
 4     G       .   0.400   0.265   1.000       .       H   0.450
 5     R       .  -0.700   0.295   0.981       t       H  -0.400
 6     C       .  -0.357   0.344   0.971       t       H  -0.100
 7     E       .  -0.214   0.351   0.954       H       H  -0.300
 8     L       .  -0.529   0.181   0.934       H       H  -0.600
 9     A       .  -1.443   0.334   0.912       H       H  -0.600
10     A       .  -0.529   0.386   0.911       H       H  -0.600

///////////////////////////////////////////////////////////////

Look carefully at the results of this run and record the locations of the various secondary structural elements below.
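If you would rather not read the two prediction columns by eye, the table can be collapsed into two strings with one letter per residue. The sketch below does this for the ten sample rows shown above. It assumes that every data row splits on whitespace into exactly nine fields with the Chou-Fasman and GOR letters in fields seven and eight, which holds for the sample but has not been checked against every output the program can produce. The function name is hypothetical.

```python
# The ten sample rows from the PEPTIDESTRUCTURE output shown above.
SAMPLE_ROWS = """\
 1     K       .  -0.675   0.476   1.000       .       H  -0.450
 2     V       .   0.360   0.730   1.000       .       H   0.450
 3     F       .  -0.117   0.306   1.000       .       H  -0.150
 4     G       .   0.400   0.265   1.000       .       H   0.450
 5     R       .  -0.700   0.295   0.981       t       H  -0.400
 6     C       .  -0.357   0.344   0.971       t       H  -0.100
 7     E       .  -0.214   0.351   0.954       H       H  -0.300
 8     L       .  -0.529   0.181   0.934       H       H  -0.600
 9     A       .  -1.443   0.334   0.912       H       H  -0.600
10     A       .  -0.529   0.386   0.911       H       H  -0.600
"""


def prediction_strings(rows):
    """Return (chou_fasman, gor) strings, one character per residue."""
    cf, gor = [], []
    for line in rows.splitlines():
        fields = line.split()
        # Data rows have nine fields and start with a residue number;
        # headers and blank lines are skipped.
        if len(fields) != 9 or not fields[0].isdigit():
            continue
        cf.append(fields[6])   # column seven: Chou-Fasman
        gor.append(fields[7])  # column eight: Garnier-Osguthorpe-Robson
    return "".join(cf), "".join(gor)


cf, gor = prediction_strings(SAMPLE_ROWS)
print(cf)
print(gor)
```

Printed one above the other, the two strings make it easy to see where the methods agree and where they do not.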

lysozyme prediction results (Chou-Fasman):

helix locations: __________________________________________________

sheet locations: __________________________________________________

turn locations: ___________________________________________________

lysozyme prediction results (GOR):

helix locations: __________________________________________________

sheet locations: __________________________________________________

turn locations: ___________________________________________________

Repeat this process with your selected molecule and record the location of its various secondary structural elements below.

selected molecule prediction results (Chou-Fasman):

helix locations: __________________________________________________

sheet locations: __________________________________________________

turn locations: ___________________________________________________

selected molecule prediction results (GOR):

helix locations: __________________________________________________

sheet locations: __________________________________________________

turn locations: ___________________________________________________


6) Getting secondary structure predictions off the nets

VADMS has in its computing resource a mail server utility, MSU. With MSU you use the Internet for various types of sequence analysis including secondary structure predictions. MSU requires a very basic form of sequence file.

You need a data file to work with for both your chosen lysozyme sequence and your selected molecule and to have these data files in your own account. Use the GCG program FETCH to bring these sequences over into your account. In the FETCH command lines given below nrl_3d:xxxx is the name of your chosen lysozyme access code and nrl_3d:yyyy is your selected molecule access code. This form of the FETCH command puts in the name of the input file as a command switch option.

% fetch -in=nrl_3d:xxxx

% fetch -in=nrl_3d:yyyy

With the desired files in your account, run the sequence conversion tool readseq. Use the instructions given below to do the conversion. The purpose of this step is to strip down the data to a format that can be used with the MSU program. The example shown uses a representation of your lysozyme file. User input is shown in bold. Repeat this process on your selected molecule file.

% readseq
readSeq (1Feb93), multi-format molbio sequence reader.

Name of output file (?=help, defaults to display):
xxxx.pro<rtn>
         1. IG/Stanford           10. Olsen (in-only)
         2. GenBank/GB            11. Phylip3.2
         3. NBRF                  12. Phylip
         4. EMBL                  13. Plain/Raw
         5. GCG                   14. PIR/CODATA
         6. DNAStrider            15. MSF
         7. Fitch                 16. ASN.1
         8. Pearson/Fasta         17. PAUP/NEXUS
         9. Zuker (in-only)       18. Pretty (out-only)

Choose an output format (name or #):
8<rtn>

Name an input sequence or -option:
xxxx.nrl_3d<rtn>

Name an input sequence or -option:
<rtn>    [this exits you from the program]   
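For reference, the Pearson/Fasta format chosen above is about as bare as a sequence file gets: a header line beginning with > followed by the residues. The sketch below writes that general shape in Python. It is not a replacement for readseq, whose exact header and line width may differ, and the function name is hypothetical. The sequence used is the short 2MLTA example that appears in the nnpredict reply later in this exercise.

```python
def to_fasta(name, sequence, width=60):
    """Format a sequence as FASTA text: a '>' header, then wrapped residues."""
    # Drop any spaces or line breaks left over from another format.
    cleaned = "".join(sequence.split()).upper()
    lines = [">" + name]
    lines += [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return "\n".join(lines) + "\n"


print(to_fasta("2MLTA", "GIGAVLKVLT TGLPALISWI KRKRQ"), end="")
```

Everything else in the original data file (annotation, numbering, checksums) is dropped, which is what the mail servers expect.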

In MSU the number of servers shown depends on the nature of the sequence being used. It will show 15 initially for a protein sequence and will display more if you care to register with the various servers. The servers you are interested in using are ProteinPredict and nnpredict; both use neural net approaches to generate the prediction. However, ProteinPredict has been acting funny lately, so you will get data only from nnpredict. Follow the example given here to submit your sequences for analysis by this network tool. User input is shown in bold. In the example, xxxx.pro represents your chosen lysozyme sequence and yyyy.pro your selected molecule sequence.

% msu

**************** MSU (Mail Server Utility) *********************
Version 1.4 (Apr 1994) - R. Fuchs, EMBL Data Library

Sequence loaded: None

Options:
  L - Load sequence                    H - Retrieve HELP file from server
  R - Register with service            O - Other options
  Q - Quit (exit program)

Enter L, H, R, O, or Q to quit:

The utility loads and you have a number of options. Select l so that you can load your sequence. Then enter the name of the file you want to use to the utility's prompt.

Enter L, H, R, O, or Q to quit: l<rtn>
Enter file name: xxxx.pro<rtn>

The sequence is loaded and a series of options appears as to where to send off the data for further analysis. Check to make sure that the correct number of residues have loaded. If not, repeat the process with a file you are sure has been modified to work with this utility.

**************** MSU (Mail Server Utility) *********************
Version 1.4 (Apr 1994) - R. Fuchs, EMBL Data Library

Sequence loaded: xxxx.pro
Protein, 129 residues, 1-KVFGR...RGCRL-129

Services: (press RETURN for next page)
  1 - EMBL BLITZ server
  2 - BIOCCELERATOR
  3 - FLASH
  4 - NCBI BLAST
  5 - EMBL FASTA server
  6 - NBRF/PIR FASTA server
  7 - CBRG (ETH Zuerich)
  8 - BLOCKS server
  9 - MotifFinder
 10 - ProteinPredict

Options:
  L - Load sequence                    S - Set sequence limits (1 - 129)
  H - Retrieve HELP file from server   R - Register with service
  O - Other options                    Q - Quit (exit program)

Enter 1-15, L, S, H, R, O, or Q to quit:

At this point, press the RETURN key to move on to the next page of services listings.

Enter 1-15, L, S, H, R, O, or Q to quit: <rtn>

The following screen appears after you have pressed the RETURN key.

   **************** MSU (Mail Server Utility) *********************
Version 1.4 (Apr 1994) - R. Fuchs, EMBL Data Library

Sequence loaded: xxxx.pro
Protein, 129 residues, 1-KVFGR...RGCRL-129

Services: (press RETURN for next page)
 11 - nnpredict
 12 - GenomeNet BLAST
 13 - GenomeNet FASTA
 14 - GenQuest (Q)
 15 - ProDom

Options:
  L - Load sequence                    S - Set sequence limits (1 - 129)
  H - Retrieve HELP file from server   R - Register with service
  O - Other options                    Q - Quit (exit program)

Enter 1-15, L, S, H, R, O, or Q to quit: 

At this point, enter 11 to send off your data to the nnpredict server.

Enter 1-15, L, S, H, R, O, or Q to quit: 11<rtn>

The utility responds by asking you the necessary questions to send off a request to this server. For this pass just accept the default values for the prediction.

Service nnpredict
Neural network secondary protein structure prediction

Select prediction options:
  1 - n
  2 - a
  3 - b
  4 - a/b

1-4 or ? [1]:<rtn>

Request mailed to nnpredict@celeste.ucsf.edu at Tue Jan 23 19:28:38 1996
The reply should soon arrive in your mailbox

PRESS <RETURN> TO CONTINUE...

At this point press the RETURN key and go through this process again. This time use your converted selected molecule data file. Load in your second sequence and submit the requests to the same server with this new data. After you have submitted the second request, exit the utility in the following manner. Press the RETURN key to get back to a utility screen where you enter q and press the RETURN key again to return to the machine prompt.

You may have mail waiting for you when you exit the MSU utility. Nnpredict can respond very quickly at times when network traffic is light, much slower when traffic is heavy.

These files require explanation. The nnpredict folks send back just the minimum of information in an e-mail message that looks like the one below. It repeats the sequence that was sent and then gives the prediction. You will have to count along your sequence to figure out where the various predicted secondary structures are. Nnpredict now has a web site as well.

From nnpredict@celeste.ucsf.edu Tue Jan 23 19:49:24 1996
Date: Tue, 23 Jan 1996 19:35:07 -0800 (PST)
From: the nnpredict server <nnpredict@celeste.ucsf.edu>
To: teacher@ribozyme.vadms.wsu.edu
Cc: nnpredict-request@celeste.ucsf.edu
Subject: Reply to your nnpredict query

Sequence:
>2MLTA, GIGAVLKVLTTGLPALISWIKRKRQ

Secondary structure prediction for 2MLTA, (option n):
---HEEEEEH----HHHHHHH----
(H = helix, E = strand, - = no prediction)

Sequence:

nnpredict was written by Donald Kneller.
Copyright (C) 1991 Regents of the University of California.

Try the NEW nnpredict Web page:
http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html
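Counting along the prediction string by hand is error prone, so a few lines of Python can do the counting instead. The sketch below groups runs of identical symbols in the example prediction for 2MLTA and reports each run as a range of residue numbers, counting from 1. The function name is hypothetical.

```python
from itertools import groupby


def segments(prediction):
    """Map each state symbol to a list of 1-based (start, end) residue ranges."""
    found = {}
    position = 1
    for symbol, run in groupby(prediction):
        length = len(list(run))
        found.setdefault(symbol, []).append((position, position + length - 1))
        position += length
    return found


example = "---HEEEEEH----HHHHHHH----"
for symbol, ranges in segments(example).items():
    print(symbol, ranges)
```

For the 2MLTA example this gives helix at residues 4, 10 and 15-21 and strand at 5-9. The remaining residues carry no prediction.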

Record the nnpredict results below after you have received them from the server.

nnpredict lysozyme prediction results:

helix locations: __________________________________________________

sheet locations: __________________________________________________

turn locations: ___________________________________________________

nnpredict selected molecule prediction results:

helix locations: __________________________________________________

sheet locations: __________________________________________________

turn locations: ___________________________________________________


7) Determining other secondary structure values from a PDB file.

While you wait for your prediction results to arrive, explore using actual PDB files to get automated secondary structure assignments based on actual coordinate data. The software to do this is on model1. Therefore, after you have moved your PDB files over to model1, start a telnet session on model1.

The software on model1 assumes that you have exactly the protein or protein chain you want to use in your PDB data file and none other. If necessary, edit your PDB file(s) so they only contain the desired data.

First, ftp the PDB files over to your account on model1 in the following manner.

% ftp model1.vadms.wsu.edu

When the model1 machine prompt appears, enter your account name and password on that computer. You are using the FTP protocol to transfer a file from ribozyme to model1. Follow these commands to move the file. Replace the bcsxx of the example with your own account name. In the example given below xxxx.pdb represents your chosen lysozyme PDB file and yyyy.pdb your selected molecule PDB file. User input is shown in bold type.

Connected to model1.vadms.wsu.edu
220 model1.vadms.wsu.edu MultiNet FTP Server Process 3.4(14) at Wed 24-Jan-96 11:59AM-PST
Name (model1.vadms.wsu.edu:bcsxx): bcsxx<rtn>
331 user name (bcsxx) ok. Password, please.
 Password:(enter your own password<rtn>)

230 User BCSXX logged into DISK1:{BCSXX} at WED 24-JAN-96 12:01PM-PST, job 23b.
Remote system type is VMS.
ftp>type ascii<rtn>
200 Type A ok.
ftp>put xxxx.pdb<rtn>
local: xxxx.pdb remote xxxx.pdb
200 Port 10.72 at Host 134.121.43.151 accepted.
150 ASCII Store of DISK1:[BCSXX]XXXX.PDB:1 started.
226 Transfer completed. xxxxx (8) bytes transferred.
xxxxx bytes sent in 0.03 seconds (1077.52 Kbytes/s)
ftp>put yyyy.pdb<rtn>
local: yyyy.pdb remote yyyy.pdb
200 Port 10.72 at Host 134.121.43.151 accepted.
150 ASCII Store of DISK1:[BCSXX]YYYY.PDB:1 started.
226 Transfer completed. xxxxx (8) bytes transferred.
xxxxx bytes sent in 0.03 seconds (1077.52 Kbytes/s)
ftp>quit<rtn>
221 QUIT command received. Goodbye.

Now, you can either log out of ribozyme and log into model1 or you can open a telnet session on model1. Since no modelling will be done this time, a telnet session is a good option. Instructions are given below. User input is shown in bold type. When the Username: prompt appears log in as usual.

% telnet model1<rtn>

Trying 134.121.12.92...
Connected to model1.vadms.wsu.edu
Escape character is '^]'

    Welcome to OpenVMS VAX 6.1

Username:

You will be running two pieces of software that directly analyze the PDB files to determine secondary structural assignments. The first one of these is called define_s. Before running the program, copy over a couple of files that allow the program to run. These have been given the logical names de_s and de_ch to help you copy them into your account. Use the following two command lines to copy these files into your account.

$ copy de_s *.*

$ copy de_ch *.*

To do the desired define_s analysis, enter the command line given below. Some initial data appears on the screen relating to the naming of various generated output files and questions about the various types of information the resulting output file will contain. These queries have already been answered for you and can be ignored. A listing of residue numbers appears, one to a line, as the program moves through the sequence. The program will pause at the last residue number. These calculations may take a while depending on the complexity and length of the molecule being studied. Two prompts appear at the end of the run asking if certain generated files are to be deleted. Respond with y to remove them; if you are curious about their contents, reply with n instead. In the command line given below, XXXX represents the access code of your chosen lysozyme PDB file.

$ @define_s XXXX

The xxxx.sss file contains the results you are interested in. This software uses the letter B to denote beta sheet, and @ for alpha helix. Ignore any other symbols that may appear in the output. The desired results are given in lines that contain the term ELEMNT. Do a search of the output file for the term ELEMNT to get your results. In VMS the sea command (short for SEARCH) is similar to grep in UNIX.

$ sea xxxx.sss elemnt

The data lines of interest are shown on the screen. In each line the type of secondary structural unit is given in the second column, and its location (first and last residue) is given in the last two columns. An example of output from a define_s analysis on another protein is given below.

REMARK  ELEMNT = SYMBOL,HP(X,1),APT(X,1),IS,IEE
ELEMNT B    16.189    12.911     4.031     8.169     6.391    16.699  1    6
ELEMNT @     7.546     3.291    15.254     3.898     5.580    -2.612  6   18
ELEMNT B     3.882     5.449    -6.729     5.205    13.377    -2.551 19   22
ELEMNT @     7.451    14.902    -1.520     0.676    12.012    10.115 22   31
ELEMNT B     4.694    12.472    10.125    29.800    13.665     8.034 31   39
ELEMNT B     4.953    11.444    10.856    22.227    18.508     4.900 31   37

In this example listing the following secondary structural elements are located: sheets at 1-6, 19-22 and 31-39 (the second entry starting at 31, 31-37, lies within 31-39), and helices at 6-18 and 22-31.
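The same reading of the listing can be scripted. The following is a minimal Python sketch, not part of the define_s package, that pulls the element ranges out of ELEMNT lines. It assumes the layout of the example above (symbol in the second column, first and last residue in the last two columns); the function name parse_elemnt is made up for illustration.

```python
def parse_elemnt(lines):
    """Collect (first, last) residue ranges from define_s ELEMNT lines.

    Assumption: fields are whitespace separated, the symbol is the second
    field and the last two fields are residue numbers, as in the example.
    """
    names = {"B": "sheet", "@": "helix"}  # symbols as described in this exercise
    found = {"sheet": [], "helix": []}
    for line in lines:
        fields = line.split()
        # Skip the REMARK header and anything that is not a data line.
        if len(fields) < 4 or fields[0].upper() != "ELEMNT":
            continue
        kind = names.get(fields[1])
        if kind is None:
            continue  # other symbols are ignored, as the text instructs
        found[kind].append((int(fields[-2]), int(fields[-1])))
    return found


example = """\
REMARK  ELEMNT = SYMBOL,HP(X,1),APT(X,1),IS,IEE
ELEMNT B    16.189    12.911     4.031     8.169     6.391    16.699  1    6
ELEMNT @     7.546     3.291    15.254     3.898     5.580    -2.612  6   18
ELEMNT B     3.882     5.449    -6.729     5.205    13.377    -2.551 19   22
ELEMNT @     7.451    14.902    -1.520     0.676    12.012    10.115 22   31
ELEMNT B     4.694    12.472    10.125    29.800    13.665     8.034 31   39
ELEMNT B     4.953    11.444    10.856    22.227    18.508     4.900 31   37
""".splitlines()

result = parse_elemnt(example)
print("sheets: ", result["sheet"])
print("helices:", result["helix"])
```

Every ELEMNT line is reported, so overlapping entries such as 31-39 and 31-37 both appear and still have to be reconciled by eye.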

Record below the define_s results for your chosen lysozyme PDB file.

define_s lysozyme prediction results:

helix locations: __________________________________________________

sheet locations: __________________________________________________

Repeat the process, this time using your selected molecule PDB file. Record the results of this run below.

define_s selected molecule prediction results:

helix locations: __________________________________________________

sheet locations: __________________________________________________

With the results of the define_s run in hand, do a DSSP determination. This software has been set up to run when you type dssp followed by the PDB access code of the file to be used. In the command line given below, xxxx represents the PDB access code for your chosen lysozyme structure.

$ dssp xxxx

Informational lines appear on the screen relating to the setting of various flags within the program. The screen will hold at FLAGACCESS DONE for a while until the calculations are completed and then produce a PRINTOUT DONE line followed by a statement showing the deletion of a file. The larger the protein being used the longer the wait. Your results are located in a file called xxxx.dssp, where the xxxx denotes the PDB access code used in the run.

The results of this program are written to a file that is 132 characters wide. It is best to produce a hard copy of your results on the Commons printer to look at. While the printer in the teaching lab doesn't handle 132 characters, the information we want is within the first 80 characters of each printed line. The command cpr followed by the name of your file will cause the file to be printed on the printer in the teaching lab. A trailer page will help determine whose output is whose at the printer, because it gives the username of the account that sent the preceding pages of text.

$ cpr xxxx.dssp

Examine your results. To get the required information, look at the letter appearing under the s in the structure column. The letters G and H denote helical findings; B, S and E are for sheets and T is for turns. Record your results below.

dssp lysozyme prediction results:

helix locations: __________________________________________________

sheet locations: __________________________________________________

turn locations: ___________________________________________________

dssp selected molecule prediction results:

helix locations: __________________________________________________

sheet locations: __________________________________________________

turn locations: ___________________________________________________
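Turning a column of per-residue letters into ranges for the blanks above is mechanical, and can be sketched in Python. This is only an illustration: the letter grouping is the one given in this exercise, the input letters are hypothetical and typed in by hand, and the function name collapse is made up. It also assumes the residues are numbered consecutively.

```python
from itertools import groupby

# Grouping of dssp letters as given in this exercise.
CLASS_OF = {"G": "helix", "H": "helix",
            "B": "sheet", "S": "sheet", "E": "sheet",
            "T": "turn"}


def collapse(residues):
    """Turn (residue_number, letter) pairs into (first, last) ranges per class.

    Consecutive residues of the same class are merged into one range;
    letters not listed in CLASS_OF (including blanks) are skipped.
    """
    ranges = {"helix": [], "sheet": [], "turn": []}
    for kind, run in groupby(residues, key=lambda r: CLASS_OF.get(r[1])):
        run = list(run)
        if kind is not None:
            ranges[kind].append((run[0][0], run[-1][0]))
    return ranges


# Hypothetical letters for residues 1-12, read off a printout by hand.
letters = " EEE TTHHHHG"
print(collapse(enumerate(letters, start=1)))
```

Because G and H fall in the same class, a run such as HHHHG is reported as a single helix range.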

This completes your work on model1. Log off of model1 and you will be back in your original session on ribozyme.

$ logout


8) Comparing the various techniques.

So far you have completed a number of analysis runs to determine secondary structural assignments. You have used a secondary structure prediction program based on primary sequence to get two structural predictions. You have examined actual coordinate data for secondary structural assignments with the dssp and define_s programs. You have also gone out on the net for another structural prediction and looked at the reported assignments in the published structure. It is now time to compare these results with one another and see how well they agree.

Below is a composite listing of all the results from these various methods for the crambin protein. In this comparison, h marks a predicted helical residue, s a predicted sheet residue, t a predicted turn residue and x a position where the residue was predicted as both a helix and a sheet. Look carefully at this comparison.

      DSSP:  ss sshhhhhhhhhhhtt  hhhhhhhhhs ss sss   hhh  
  define_s: sssssshhhhhhhhhhhhsssxhhhhhhhhxssssssss       
    author: ssss hhhhhhhhhhhhhh   hhhhhhhh ssss     tttt  
   CF pred: sssssssss ttsssssstt   sssssssssssstt   ttt   
  GOR pred: tttssssssttttttt  tt  sssssttttssssstt  ttt   
 nnpredict:       ss               ssss   sssss
  Sequence: TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN 
            1        1         2         3         4      
                     0         0         0         0

The best you can say about these results is that they don't agree. There does appear to be agreement within each type of analysis (i.e., the results from coordinate data analysis more or less agree with one another, as do those based on primary sequence techniques). There are only two residues whose structure is the same in all six methods.
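The count of fully agreeing residues can be checked mechanically. The sketch below is an illustration in Python, not part of the exercise software. It re-types the six tracks from the listing above in chunks of ten residues, with '-' standing in for a blank (no assignment), and reports the positions where every method gives the same assigned state. The function name unanimous is made up for illustration.

```python
# Tracks copied from the crambin listing, ten residues per chunk.
# '-' replaces a blank; 'x' (helix and sheet) is treated as its own state.
TRACKS = {
    "DSSP":      "-ss-sshhhh" "hhhhhhhtt-" "-hhhhhhhhh" "s-ss-sss--" "-hhh--",
    "define_s":  "sssssshhhh" "hhhhhhhhss" "sxhhhhhhhh" "xssssssss-" "------",
    "author":    "ssss-hhhhh" "hhhhhhhhh-" "--hhhhhhhh" "-ssss-----" "tttt--",
    "CF pred":   "sssssssss-" "ttsssssstt" "---sssssss" "ssssstt---" "ttt---",
    "GOR pred":  "tttsssssst" "tttttt--tt" "--sssssttt" "tssssstt--" "ttt---",
    "nnpredict": "------ss--" "----------" "---ssss---" "sssss-----" "------",
}


def unanimous(tracks):
    """Return 1-based positions where all tracks share one assigned state."""
    length = len(next(iter(tracks.values())))
    hits = []
    for i in range(length):
        column = {track[i] for track in tracks.values()}
        # One distinct symbol in the column, and it is not the blank marker.
        if len(column) == 1 and "-" not in column:
            hits.append(i + 1)
    return hits


print(unanimous(TRACKS))  # [33, 34]
```

Positions that every method leaves blank are deliberately not counted as agreement, since a blank only means that no helix, sheet or turn was assigned there.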

This type of comparison takes some editing to produce, but it can be very useful. However, it doesn't take into account protein folding and how the component structural elements might interact with one another. To do that requires some sort of graphical analysis method.

To generate a graphical analysis of these results, follow these steps. Start with the example.mol-template file you copied over to your account at the beginning of the exercise. This file contains the bare bones data for creating a PostScript file like the one at the end of the exercise. Make a copy of this file so that you will have one to put your selected molecule results in.

% cp example.mol-template select.mol-template

The file divides a page up into six areas in which to show the results of the various analysis runs. In this file any sheet sections will be colored red, helical sections green and known turn sections blue. There is a coil section defined that handles any undefined regions of the sequence. The final edited version of the file will be run through the program molscript to produce the final image.

Molscript has some limitations that need to be taken into account when creating a graphical comparison file. The most important of these is that it will not draw a sheet section with fewer than three residues in it. This may mean that you will have to make decisions about how to enter your data points when very small sheet sections are predicted close to one another.

To use this technique, organize your data in the following manner. List your predicted structural elements in the order they occur in the sequence. Then check for any small predicted sections (three residues is the smallest allowed for all structures except coils). Either expand small sections or drop them. Fill in the gaps between the structural elements with coil sections. Each coil section needs to overlap the ending residue of the element before it and the beginning residue of the element after it.

For example, the nnpredict results for crambin show three sheet sections: 7-8, 24-27 and 31-35. If you want to show the first predicted sheet, you have to expand it to three residues. Then the coils have to be put in. This results in the following coil and sheet lines in the file.

 coil from 1 to 6;             strand from 6 to 8; 
 coil from 8 to 24;            strand from 24 to 27; 
 coil from 27 to 31;           strand from 31 to 35; 
 coil from 35 to 46;  
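The bookkeeping in this example (expand short elements, then bridge the gaps with overlapping coils) can be sketched in Python. This is an illustration only, not a molscript feature. The function name molscript_lines and the choice to grow a short element toward the start of the sequence are assumptions made to reproduce the lines above.

```python
def molscript_lines(elements, n_residues, min_len=3):
    """Build coil/strand/helix/turn lines for a molscript input file.

    elements: (kind, first, last) tuples, 1-based and inclusive.
    Short elements are widened to min_len residues, and each coil shares
    its end residues with the elements on either side of it.
    """
    lines = []
    prev_end = 1
    for kind, start, end in sorted(elements, key=lambda e: e[1]):
        if end - start + 1 < min_len:
            start = max(1, end - min_len + 1)    # grow toward residue 1 first
            end = max(end, start + min_len - 1)  # grow forward only if clamped
        if start > prev_end:
            lines.append(f"coil from {prev_end} to {start};")
        lines.append(f"{kind} from {start} to {end};")
        prev_end = end
    if prev_end < n_residues:
        lines.append(f"coil from {prev_end} to {n_residues};")
    return lines


# The nnpredict sheet sections for crambin (46 residues).
nnpredict = [("strand", 7, 8), ("strand", 24, 27), ("strand", 31, 35)]
for line in molscript_lines(nnpredict, 46):
    print(line)
```

The sketch does not resolve overlapping or very closely spaced elements; those still call for the judgement described above. The generated lines also have to be placed by hand under the matching colour settings in the template.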

A section of the example.mol-template file is shown below. The file contains six sections; the first one, for the Chou-Fasman prediction results, is shown here. The areas that need to be changed are marked with x's. Once you have determined the number of sheet, helix, turn and coil entries you need, start editing the file.

First, you change the xxxx in the xxxx.pdb line to reflect the access code of the PDB file you are using. You either make more copies of the coil line or delete it as the situation warrants. Usually, you will need to make copies of this line in the file and modify each line to reflect the various coil sections that you will need to connect all the structural elements of the protein together.

Repeat this process with the sheet section. The molscript program uses the term strand instead of sheet. If no sheets are predicted, leave no strand lines after the set planecolour red; line. Move on to the helix and turn sections and do the same thing. In each case, remove any lines after the set planecolour line if the given prediction contains no structural elements of that type.

After the structural elements have been entered, change the xxxxx at the bottom of the section to reflect the name of the protein being used for this comparison. Change all six of the sections of this file to enter the data you gathered from the following sources: Chou-Fasman prediction, GOR prediction, define_s assignments, dssp assignments, nnpredict prediction and the reported assignments from the NRL_3D database.

! this is an attempt at using molscript
plot
  area 50.0 550.0 250.0 750.0;
  read mol "xxxx.pdb";
  transform atom *
    by centre position atom *
    ;

  coil from x to xx;

  set planecolour red;
    strand from x to xx;

  set planecolour green;
    helix from x to xx;

  set planecolour blue;
    turn from x to xx;

   set depthcue 0.0, labelsize 20.0;
   label 0.0 -15.0 0.0 "xxxxx CF";

end_plot

When you have whipped the example.mol-template file into shape, enter the following command to create the final image. Use (your lastname)-lyso.ps as the filename of the lysozyme graphical comparison file.

% molscript < example.mol-template > (your lastname)-lyso.ps

Watch the program as it progresses to see if there are any problems. If there are, ask your instructor for help. You should also check that the structural elements you entered really appear on the screen. If an element is too small, the resulting output will be distorted and will not match what you entered.

Print off the file on the teaching lab printer with the following command.

% lpr (your lastname)-lyso.ps

If the image looks OK, repeat this process with your selected molecule, this time using the select.mol-template file. Use (your lastname)-sel.ps as the filename of the selected molecule graphical comparison file.



9) Finishing up.

Use the pico editor to fill in the report form, then send it to the teacher account along with your two graphical comparison files:

% mv week5m.week5m (your lastname).week5m

% pico (your lastname).week5m

% rcp (your lastname).week5m teacher@ribozyme:receive

% rcp (your lastname)-lyso.ps teacher@ribozyme:receive

% rcp (your lastname)-sel.ps teacher@ribozyme:receive

This concludes your computing session for this week. Log off the computer.

Now exit the emulator program by selecting the Quit option from the File location on the control bar. You will be returned to the Launcher window screen.

References

Define_S, F.M. Richards and C.E. Kundrot, Proteins 3: 71-84 (1988).

DSSP, W. Kabsch and C. Sander. Biopolymers 22: 2577-2637 (1983).

von Heijne, G., Sequence Analysis in Molecular Biology, Treasure Trove or Trivial Pursuit, pp. 81-93. Academic Press, Inc. (1987).

Per J. Kraulis, "MOLSCRIPT: a program to produce both detailed and schematic plots of protein structures", Journal of Applied Crystallography (1991) vol 24, pp 946-950.



Internet resources used:

nnpredict site:
http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html