Modelling Series
Learning about ways to estimate protein secondary structures and the usefulness of these techniques for other purposes.
Author:
Susan Jean Johns
The determination of protein secondary structure has been an intriguing puzzle for some time. When Linus Pauling made his prediction in 1948 that proteins would be composed of alpha helix and beta sheet units, no protein structures had yet been determined. His prediction was based solely on the idea that the potential hydrogen bonding possible in such structures would increase their stability and make them more probable. Improvements in x-ray diffraction techniques made it possible to solve protein three-dimensional structures. The predicted subunits were there.
As more and more structures were determined, the beginnings of possible folding patterns were observed. Soluble globular proteins have started to be understood in general terms. "The principle underlying the structure of helices, sheets, and turns is the simultaneous formation of hydrogen bonds by buried peptide groups and the retention of single residue conformations close to those of minimum energy. The shape of the helix and sheet structures make these structural elements pack together in a small number of relative orientations. The links between secondary structures tend to be right-handed and short, and do not form knots." As a result, globular proteins usually fold into a few common patterns. These proteins can roughly be grouped into four classes: all-alpha, all-beta, mixed alpha/beta formed from beta-alpha-beta units, and alpha + beta where the helix and sheet units are segregated.
The possibility of a given section of a peptide folding to form a helix, a sheet or a turn is primarily dependent on the preferred conformations of the constituent residues and the packing quality of the surface formed. Because folding forces are largely local in character, prediction schemes based only on local or semi-local sequence patterns have been devised, with relative success. Once past these generalities, the detailed mechanisms of folding are still only vaguely understood.
Even as the body of determined structures grows, questions remain as to the relationship between solved crystal structures and proteins in solution. What effect do ionic conditions have on secondary structure? What effect does protein concentration have? Do crystals with different space groups produce the same or similar protein structures? Do x-ray and NMR (solution) structure determinations on the same protein agree with one another? If not, what are the causes of the differences?
Indications of the possible structures of proteins came from initial studies on polypeptides. As protein structures were solved it appeared that the conformation of a residue in a protein was the same as in the homopolymeric form. This correlation is far from perfect, however.
After a number of structures had been determined, various groups attempted to do statistical studies on the data to determine any preference on the part of an individual amino acid for a given secondary structure type. These efforts resulted in the empirical prediction schemes of Chou-Fasman and Garnier-Robson. The Chou-Fasman method is a group of rules applied to a given sequence. It is an ambiguous method that has proven difficult to automate. The Garnier-Robson method is based on consistent application of information theory with auxiliary information from circular dichroism used to bias its prediction. This method is unambiguous and easy to automate.
Another approach is to look for periodicity in regular secondary structures. Such information is best seen from helical wheel diagrams, where the view down the helical axis shows groupings of similar kinds of amino acids. The regular appearance of apolar residues spaced 3 or 4 residues apart could be a pattern characteristic of alpha helices, while sheets might show uniformly apolar sections, if completely buried within a protein, or alternating polar and apolar residues if on the surface. Some proteins display these patterns to a certain extent. Such studies have resulted in the prediction scheme of Lim and Eisenberg's hydrophobic moment technique.
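The periodicity idea can be illustrated with a hydrophobic moment calculation. The following is a minimal sketch under stated assumptions, not Eisenberg's published procedure: the Kyte-Doolittle hydropathy values are swapped in for his hydrophobicity scale, the function name hydrophobic_moment is invented for this example, and 100 degrees per residue (helix) and 180 degrees per residue (strand) are idealized angles.

```python
import math

# Kyte-Doolittle hydropathy values. Assumption: this scale is used here only
# for illustration; it is not the scale used in Eisenberg's own work.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_moment(sequence, angle_deg=100.0):
    """Length of the vector sum of per-residue hydrophobicities.

    Residue n (counting from 0) points in the direction n * angle_deg
    around the axis, so a large value means the apolar residues cluster
    on one side for that assumed periodicity.
    """
    angle = math.radians(angle_deg)
    sin_sum = sum(KYTE_DOOLITTLE[aa] * math.sin(n * angle)
                  for n, aa in enumerate(sequence))
    cos_sum = sum(KYTE_DOOLITTLE[aa] * math.cos(n * angle)
                  for n, aa in enumerate(sequence))
    return math.hypot(sin_sum, cos_sum)


if __name__ == "__main__":
    window = "TTCCPSIVARSNFNVCRL"  # an 18-residue window chosen for illustration
    print("helix periodicity :", round(hydrophobic_moment(window, 100.0), 2))
    print("strand periodicity:", round(hydrophobic_moment(window, 180.0), 2))
```

Comparing the two numbers for the same window gives a rough indication of which periodicity the residues fit better; it is not a prediction on its own.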
Other scientists have looked at all the possible structural conformations for various sequence sections that exist in the known structures and tried to form prediction schemes based on their findings. The thought is that a similar sequence will have similar secondary structures wherever it is found. To do this, a measure of homology (or similarity) needs to be established between the studied sequences, and the possible conformations found need to be weighted in order to form a final prediction. The algorithms of Nishikawa and Ooi, Levin, and Sweet are all based on this theme. The differences result from the comparison choices made and the scoring systems used.
Studies done on the reliability of the various prediction schemes show disheartening results. Depending on whether three or four secondary structural elements are used, random chance would result in either a 33% or a 25% chance of a prediction being correct. The different approaches touched on here only improve the chances to 45% to 55% of the prediction being correct. Reported higher percentages usually are the result of a biased data set, and not an actual improvement in the technique devised.
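The percentages quoted above are per-residue scores, and the arithmetic behind them can be sketched as follows. This is an illustrative sketch only: the function names and the two example strings are invented here, with h, s and t standing for helix, sheet and turn.

```python
def chance_level(n_states):
    """Expected accuracy when guessing uniformly among n_states classes."""
    return 1.0 / n_states


def per_residue_accuracy(predicted, observed):
    """Fraction of residues whose predicted state equals the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("strings must have the same length")
    if not observed:
        raise ValueError("strings must not be empty")
    matches = sum(1 for p, o in zip(predicted, observed) if p == o)
    return matches / len(observed)


if __name__ == "__main__":
    observed = "hhhhhhttssss"   # made-up assignment from a solved structure
    predicted = "hhhhttttshss"  # made-up prediction for the same residues
    print("three states, chance:", round(chance_level(3), 2))
    print("four states, chance :", round(chance_level(4), 2))
    print("this prediction     :", round(per_residue_accuracy(predicted, observed), 2))
```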
As a user of such prediction schemes you should be cautious in applying and interpreting their results. It is best to use these predictions only in cases where other types of potential confirming evidence are available, such as the presence of antigen producing regions, or estimates derived from physical data.
Select a molecule that most closely fits the general type of work that you are doing in the wet lab or plan to use in your project and continue to use it throughout the semester. Use the same selected molecule for this exercise as you did for week 3.
1) higher plant ribulose bisphosphate carboxylase/oxygenase, small subunit only
2) mammalian P21 ras proto-oncogene transforming protein
3) mammalian basic fibroblast growth factor
4) fungal superoxide dismutase
Week 5 Exercise
This series of exercises will acquaint you with the various methods on the computer for estimating protein secondary structure. Some of these programs use primary sequence data, others are based on statistical analysis or interpretation of crystallographic results.
1) Activate the computer
Pressing any key changes the terminal from screen saver mode to active.
2) Select the RIBOZYME icon and log onto ribozyme.
From the Launcher window, select the RIBOZYME icon and press the mouse button twice. Successful connection to ribozyme is denoted by the appearance of a ribozyme information line and a login: prompt. Once the login: prompt appears, log on to the machine by entering first your account name and then your password to the Password: prompt.
3) Create a subdirectory to keep this week's work in.
To keep data in separate working areas create subdirectories. This is done with the mkdir command. Create the following subdirectory in your account.
% mkdir week5
Move into that location using the following command line.
% cd week5
Copy over the files needed for this week's activities.
% cp $GRAD_DIR/week5m/*.* .
The NRL_3D database contains primary sequences of all the solved protein structures in the PDB database. This database is a good source of information on PDB protein files that can be searched using sequence analysis software.
Lysozyme was one of the earliest proteins whose crystal structure was determined. There are numerous lysozyme structures in the PDB database from various sources. Limit the files you look at to lysozymes from chicken egg white with secondary structure assignments, by using multiple passes of GCG's STRINGSEARCH program. To use this program activate the GCG software suite.
% gcg
The welcoming message for the package appears on the screen. Now start the process of doing string searches on a series of ever-decreasing datasets. Use the instructions here as a guide through the process. User input is shown in bold type.
% stringsearch
StringSearch identifies sequences by searching for character patterns
such as "globin" or "human" in the sequence documentation.
STRINGSEARCH through what sequence(s) (* GenEMBL:* *) ? nrl_3d:*<rtn>
Do you want to search through:
A) definitions
B) complete sequence annotation
Please choose one (* A *):<rtn>
Search for what text patterns ? lysozyme<rtn>
What should I call the output file (* nrl_3d.strings *) ? lyso.strings<rtn>
A listing of the hits is produced on the screen. Looking at this
listing shows how many possible data files you may be dealing with.
Sequences searched: xxxx
Sequences with matches: xxx
Patterns sought: lysozyme
Output file: lyso.strings
The list of lysozyme data files contained in lyso.strings is all the lysozyme structures currently in the NRL_3D database. Not all these files are from egg white or even from chicken for that matter. To restrict the lysozyme files to just those that are from chicken, repeat the process this time using the output file from the first search as the source of the data for the second search. To use the first output file as input for the second run, an @ symbol is placed directly before the name of the first output file.
% stringsearch
StringSearch identifies sequences by searching for character patterns
such as "globin" or "human" in the sequence documentation.
STRINGSEARCH through what sequence(s) (* GenEMBL:* *) ? @lyso.strings<rtn>
Do you want to search through:
A) definitions
B) complete sequence annotation
Please choose one (* A *):<rtn>
Search for what text patterns ? chicken<rtn>
What should I call the output file (* lyso.strings *) ? lyso2.strings<rtn>
A listing of the hits is produced on the screen. Looking at this listing
shows that the number of possible data files has been reduced.
Sequences searched: xxx
Sequences with matches: xx
Patterns sought: chicken
Output file: lyso2.strings
There still are too many files to deal with. Type off the results of the second search with the more command. Press the space bar to advance to the next screen of information after you have carefully examined the data presented.
% more lyso2.strings
Many of the remaining files are the results of mutation studies. Looking closely at these definition lines reveals that by using the search phrase 17) - chicken 129aa it would be possible to restrict the output even more and zero in on the desired data. Putting this search phrase within double quotes ensures that the entire phrase, including the spaces between terms, will be used in the searching process.
% stringsearch
StringSearch identifies sequences by searching for character patterns
such as "globin" or "human" in the sequence documentation.
STRINGSEARCH through what sequence(s) (* GenEMBL:* *) ? @lyso2.strings<rtn>
Do you want to search through:
A) definitions
B) complete sequence annotation
Please choose one (* A *):<rtn>
Search for what text patterns ? "17) - chicken 129aa"<rtn>
What should I call the output file (* lyso2.strings *) ? lyso3.strings<rtn>
A listing of the successful hits is shown.
Sequences searched: xx
Sequences with matches: x
Patterns sought: 17) - chicken 129aa
Output file: lyso3.strings
Now that the list is manageable, the question is: do any of these data sets contain secondary structure assignments? The terms to be searched for this time are helix, sheet and turn and the complete sequence annotation needs to be looked at to find this out.
% stringsearch
StringSearch identifies sequences by searching for character patterns
such as "globin" or "human" in the sequence documentation.
STRINGSEARCH through what sequence(s) (* GenEMBL:* *) ? @lyso3.strings<rtn>
Do you want to search through:
A) definitions
B) complete sequence annotation
Please choose one (* A *): b<rtn>
Search for what text patterns ? helix,sheet,turn<rtn>
What should I call the output file (* lyso3.strings *) ? lyso4.strings<rtn>
A listing of the first hit for each of the search terms found in the data
files is shown.
Sequences searched: x
Sequences with matches: x
Patterns sought: helix sheet turn
Output file: lyso4.strings
Type off these results with the cat command.
% cat lyso4.strings
Any of these data files would serve the purpose of this exercise. One of them is already in your account. It was copied over into your week5 subdirectory at the start of the exercise. This file will be used to do additional analysis on model1 using software that works directly on the PDB formatted data files. Do a directory listing to determine which lysozyme file is already in your account and record that access code below.
% ls -la
selected lysozyme access code:___________________________________________
Search the selected data set for the secondary structure assignments with the following commands. Record your results below. The typedata program is a version of FETCH that sends the results to the screen instead of a file. The xxxx in these commands represents the chosen lysozyme access code.
% typedata nrl_3d:xxxx | grep helix
% typedata nrl_3d:xxxx | grep sheet
% typedata nrl_3d:xxxx | grep turn
helix locations: __________________________________________________
sheet locations: __________________________________________________
turn locations: ___________________________________________________
selected molecule access code: ______________________________________
% typedata nrl_3d:xxxx | grep helix
% typedata nrl_3d:xxxx | grep sheet
% typedata nrl_3d:xxxx | grep turn
helix locations: __________________________________________________
sheet locations: __________________________________________________
turn locations: ___________________________________________________
5) Doing secondary structure prediction from sequence data.
Having seen the secondary structure assignments for the lysozyme and your selected molecule from the solved structures, determine what sort of assignments can be derived from computer software. The GCG program to use for this task is called PEPTIDESTRUCTURE. Run PEPTIDESTRUCTURE on both the lysozyme data and your selected molecule data. An example run is given below. Use this information as a guide when running your own analysis. In the example nrl_3d:xxxx represents the access code of your chosen lysozyme data set. User input is shown in bold type.
% peptidestructure
PeptideStructure makes secondary structure predictions for a peptide
sequence. The predictions include (in addition to alpha, beta, coil, and
turn) measures for antigenicity, flexibility, hydrophobicity, and surface
probability. PlotStructure displays the predictions graphically.
PEPTIDESTRUCTURE for what peptide sequence ? nrl_3d:xxxx<rtn>
Begin (* 1 *) ? <rtn>
End (* 129 *) ? <rtn>
Calculate hydrophilicity according to
H)opp-Woods or
K)yte-Doolittle
Please choose one (* K *) : <rtn>
What should I call the output file (* xxxx.p2s *) ? <rtn>
Use the more command to display the results of this run one screen's
worth at a time. The very first part of this output is given below.
In the example xxxx represents the access code for your NRL_3D file.
% more xxxx.p2s
This file needs some explaining. After the descriptive material at the top stating what sequence was used and references for the determinations that were made, the file lists the results of the run. The results are reported for each residue of the sequence. There are nine different columns of data. The first and second columns report the location of the residue in the sequence and the residue's name. Columns seven and eight are the secondary structure prediction columns. An h in either of these columns denotes a helical structure prediction, b stands for a sheet, and t for a turn. Column seven contains the Chou-Fasman predictions. This scheme uses upper and lower case letters to denote the strength of the prediction. Strong predictions are in upper case and weak are in lower case. Column eight is the Garnier-Osguthorpe-Robson secondary prediction results. Only upper case letters are used in this prediction scheme. The rest of the reported results deal with other protein characteristics that can be determined from primary sequence analysis and are not of interest to the present line of inquiry.
PEPTIDESTRUCTURE of: nrl_3d:xxxx check: 2383 from: 1 to: 129
lysozyme (EC 3.2.1.17) - chicken
Hydrophilicity (Kyte-Doolittle) averaged over a window of: 7
Surface Probability according to Emini
Chain Flexibility according to Karplus-Schulz
Secondary Structure according to Chou-Fasman
Secondary Structure according to Garnier-Osguthorpe-Robson
Antigenicity Index according to Jameson-Wolf
Date: January 20, 1996 20:44
Pos AA GlycoS HyPhil SurfPr FlexPr CF-Pred GORPred AI-Ind ..
  1 K    .    -0.675  0.476  1.000    .       H    -0.450
  2 V    .     0.360  0.730  1.000    .       H     0.450
  3 F    .    -0.117  0.306  1.000    .       H    -0.150
  4 G    .     0.400  0.265  1.000    .       H     0.450
  5 R    .    -0.700  0.295  0.981    t       H    -0.400
  6 C    .    -0.357  0.344  0.971    t       H    -0.100
  7 E    .    -0.214  0.351  0.954    H       H    -0.300
  8 L    .    -0.529  0.181  0.934    H       H    -0.600
  9 A    .    -1.443  0.334  0.912    H       H    -0.600
 10 A    .    -0.529  0.386  0.911    H       H    -0.600
///////////////////////////////////////////////////////////////
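Picking the two prediction columns out of a long .p2s listing by eye is tedious, so the step can be sketched in a few lines of Python. This is a sketch under assumptions: it treats every data row as nine whitespace-separated fields, as in the excerpt above, with a period where no value is given; a real file laid out differently would need a different reader. The function name read_prediction_columns is invented for this example.

```python
SAMPLE_ROWS = """\
Pos AA GlycoS HyPhil SurfPr FlexPr CF-Pred GORPred AI-Ind ..
1 K . -0.675 0.476 1.000 . H -0.450
2 V . 0.360 0.730 1.000 . H 0.450
3 F . -0.117 0.306 1.000 . H -0.150
4 G . 0.400 0.265 1.000 . H 0.450
5 R . -0.700 0.295 0.981 t H -0.400
6 C . -0.357 0.344 0.971 t H -0.100
7 E . -0.214 0.351 0.954 H H -0.300
8 L . -0.529 0.181 0.934 H H -0.600
9 A . -1.443 0.334 0.912 H H -0.600
10 A . -0.529 0.386 0.911 H H -0.600
""".splitlines()


def read_prediction_columns(lines):
    """Return (chou_fasman, gor) strings with one character per residue."""
    chou_fasman, gor = [], []
    for line in lines:
        fields = line.split()
        # Data rows have exactly nine fields and begin with the position.
        if len(fields) != 9 or not fields[0].isdigit():
            continue
        chou_fasman.append(fields[6])  # column seven: Chou-Fasman
        gor.append(fields[7])          # column eight: Garnier-Osguthorpe-Robson
    return "".join(chou_fasman), "".join(gor)


if __name__ == "__main__":
    cf, gor_pred = read_prediction_columns(SAMPLE_ROWS)
    print("CF :", cf)
    print("GOR:", gor_pred)
```

To use it on a real run, the lines would come from open("xxxx.p2s") instead of SAMPLE_ROWS.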
Look carefully at the results of this run and record the locations of the various secondary structural elements below.
lysozyme prediction results (Chou-Fasman):
helix locations: __________________________________________________
sheet locations: __________________________________________________
turn locations: ___________________________________________________
lysozyme prediction results (Garnier-Osguthorpe-Robson):
helix locations: __________________________________________________
sheet locations: __________________________________________________
turn locations: ___________________________________________________
selected molecule prediction results (Chou-Fasman):
helix locations: __________________________________________________
sheet locations: __________________________________________________
turn locations: ___________________________________________________
selected molecule prediction results (Garnier-Osguthorpe-Robson):
helix locations: __________________________________________________
sheet locations: __________________________________________________
turn locations: ___________________________________________________
VADMS has in its computing resource a mail server utility, MSU. With MSU you use the Internet for various types of sequence analysis including secondary structure predictions. MSU requires a very basic form of sequence file.
You need a data file to work with for both your chosen lysozyme sequence and your selected molecule and to have these data files in your own account. Use the GCG program FETCH to bring these sequences over into your account. In the FETCH command lines given below nrl_3d:xxxx is the name of your chosen lysozyme access code and nrl_3d:yyyy is your selected molecule access code. This form of the FETCH command puts in the name of the input file as a command switch option.
% fetch -in=nrl_3d:xxxx
% fetch -in=nrl_3d:yyyy
With the desired files in your account, run the sequence conversion tool readseq. Use the instructions given below to do the conversion. The purpose of this step is to strip down the data to a format that can be used with the MSU program. The example shown uses a representation of your lysozyme file. User input is shown in bold. Repeat this process on your selected molecule file.
% readseq
readSeq (1Feb93), multi-format molbio sequence reader.
Name of output file (?=help, defaults to display):
xxxx.pro<rtn>
1. IG/Stanford 10. Olsen (in-only)
2. GenBank/GB 11. Phylip3.2
3. NBRF 12. Phylip
4. EMBL 13. Plain/Raw
5. GCG 14. PIR/CODATA
6. DNAStrider 15. MSF
7. Fitch 16. ASN.1
8. Pearson/Fasta 17. PAUP/NEXUS
9. Zuker (in-only) 18. Pretty (out-only)
Choose an output format (name or #):
8<rtn>
Name an input sequence or -option:
xxxx.nrl_3d<rtn>
Name an input sequence or -option:
<rtn> [this exits you from the program]
In MSU the number of servers shown depends on the nature of the sequence being used. It shows 15 initially for a protein sequence and displays more if you care to register with the various servers. The servers you are interested in using are ProteinPredict and nnpredict; both use neural net approaches to generate their predictions. However, ProteinPredict has been acting funny lately, so you will get data only from nnpredict. Follow the example given here to submit your sequences for analysis by this network tool. User input is shown in bold. In the example, xxxx.pro represents your chosen lysozyme sequence and yyyy.pro your selected molecule sequence.
% msu
**************** MSU (Mail Server Utility) *********************
Version 1.4 (Apr 1994) - R. Fuchs, EMBL Data Library
Sequence loaded: None
Options:
L - Load sequence
H - Retrieve HELP file from server
R - Register with service
O - Other options
Q - Quit (exit program)
Enter L, H, R, O, or Q to quit:
The utility loads and you have a number of options. Select l so that you can load your sequence. Then enter the name of the file you want to use to the utility's prompt.
Enter L, H, R, O, or Q to quit: l<rtn>
Enter file name: xxxx.pro<rtn>
The sequence is loaded and a series of options appears as to where to send off the data for further analysis. Check to make sure that the correct number of residues have loaded. If not, repeat the process with a file you are sure has been modified to work with this utility.
**************** MSU (Mail Server Utility) *********************
Version 1.4 (Apr 1994) - R. Fuchs, EMBL Data Library
Sequence loaded: xxxx.pro
Protein, 129 residues, 1-KVFGR...RGCRL-129
Services: (press RETURN for next page)
1 - EMBL BLITZ server
2 - BIOCCELERATOR
3 - FLASH
4 - NCBI BLAST
5 - EMBL FASTA server
6 - NBRF/PIR FASTA server
7 - CBRG (ETH Zuerich)
8 - BLOCKS server
9 - MotifFinder
10 - ProteinPredict
Options:
L - Load sequence
S - Set sequence limits (1 - 129)
H - Retrieve HELP file from server
R - Register with service
O - Other options
Q - Quit (exit program)
Enter 1-15, L, S, H, R, O, or Q to quit:
At this point, press the RETURN key to move on to the next page of services listings.
Enter 1-15, L, S, H, R, O, or Q to quit: <rtn>
The following screen appears after you have pressed the RETURN key.
**************** MSU (Mail Server Utility) *********************
Version 1.4 (Apr 1994) - R. Fuchs, EMBL Data Library
Sequence loaded: xxxx.pro
Protein, 129 residues, 1-KVFGR...RGCRL-129
Services: (press RETURN for next page)
11 - nnpredict
12 - GenomeNet BLAST
13 - GenomeNet FASTA
14 - GenQuest (Q)
15 - ProDom
Options:
L - Load sequence
S - Set sequence limits (1 - 129)
H - Retrieve HELP file from server
R - Register with service
O - Other options
Q - Quit (exit program)
Enter 1-15, L, S, H, R, O, or Q to quit:
At this point, enter 11 to send off your data to the nnpredict server.
Enter 1-15, L, S, H, R, O, or Q to quit: 11<rtn>
The utility responds by asking you the necessary questions to send off a request to this server. For this pass just accept the default values for the prediction.
Service nnpredict
Neural network secondary protein structure prediction
Select prediction options:
1 - n
2 - a
3 - b
4 - a/b
1-4 or ? [1]:<rtn>
Request mailed to nnpredict@celeste.ucsf.edu at Tue Jan 23 19:28:38 1996
The reply should soon arrive in your mailbox
PRESS <RETURN> TO CONTINUE...
At this point press the RETURN key and go through this process again. This time use your converted selected molecule data file. Load in your second sequence and submit the requests to the same server with this new data. After you have submitted the second request, exit the utility in the following manner. Press the RETURN key to get back to a utility screen where you enter q and press the RETURN key again to return to the machine prompt.
You may have mail waiting for you when you exit the MSU utility. Nnpredict can respond very quickly at times when network traffic is light, much slower when traffic is heavy.
These files require explanation. The nnpredict folks send back just the minimum of information in an e-mail message that looks like the one below. It repeats the sequence that was sent and then gives the prediction. You will have to count along your sequence to figure out where the various predicted secondary structures are. Nnpredict now has a web site as well.
From nnpredict@celeste.ucsf.edu Tue Jan 23 19:49:24 1996
Date: Tue, 23 Jan 1996 19:35:07 -0800 (PST)
From: the nnpredict server <nnpredict@celeste.ucsf.edu>
To: teacher@ribozyme.vadms.wsu.edu
Cc: nnpredict-request@celeste.ucsf.edu
Subject: Reply to your nnpredict query
Sequence:
>2MLTA, GIGAVLKVLTTGLPALISWIKRKRQ
Secondary structure prediction for 2MLTA, (option n):
---HEEEEEH----HHHHHHH----
(H = helix, E = strand, - = no prediction)
Sequence:
nnpredict was written by Donald Kneller.
Copyright (C) 1991 Regents of the University of California.
Try the NEW nnpredict Web page: http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html
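Counting along the prediction string by hand is error prone, so the counting can be sketched in Python. This is only an illustration: the function name segments is invented here, and the input is the prediction string from the 2MLTA reply shown above. Residues are numbered from 1.

```python
from itertools import groupby


def segments(prediction, skip="-"):
    """Return (symbol, first, last) for each run of identical symbols.

    Positions are 1-based residue numbers. Runs made of a character
    listed in skip (by default the "no prediction" dash) are left out.
    """
    result = []
    position = 1
    for symbol, run in groupby(prediction):
        length = len(list(run))
        if symbol not in skip:
            result.append((symbol, position, position + length - 1))
        position += length
    return result


if __name__ == "__main__":
    for symbol, first, last in segments("---HEEEEEH----HHHHHHH----"):
        print(symbol, first, last)
```

For the example string this reports a strand at residues 5-9 and a helix at residues 15-21, plus two single-residue helix calls at positions 4 and 10.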
Record the nnpredict results below after you have received them from the server.
nnpredict lysozyme prediction results:
helix locations: __________________________________________________
sheet locations: __________________________________________________
turn locations: ___________________________________________________
nnpredict selected molecule prediction results:
helix locations: __________________________________________________
sheet locations: __________________________________________________
turn locations: ___________________________________________________
While you wait for your prediction results to arrive, explore using actual PDB files to get automated secondary structure assignments based on actual coordinate data. The software to do this is on model1. Therefore, after you have moved your PDB files over to model1, start a telnet session on model1.
The software on model1 assumes that you have exactly the protein or protein chain you want to use in your PDB data file and none other. If necessary, edit your PDB file(s) so they only contain the desired data.
First, ftp the PDB files over to your account on model1 in the following manner.
% ftp model1.vadms.wsu.edu
When the model1 machine prompt appears, enter your account name and password on that computer. You are using FTP protocol to transfer a file from ribozyme to model1. Follow these commands to move the file. Replace the bcsxx of the example with your own account name. In the example given below xxxx.pdb represents your chosen lysozyme PDB file and yyyy.pdb your selected molecule PDB file. User input is shown in bold type.
Connected to model1.vadms.wsu.edu
220 model1.vadms.wsu.edu MultiNet FTP Server Process 3.4(14) at Wed 24-Jan-96 11:59AM-PST
Name (model1.vadms.wsu.edu:bcsxx): bcsxx<rtn>
331 user name (bcsxx) ok. Password, please.
Password:(enter your own password<rtn>)
230 User BCSXX logged into DISK1:{BCSXX} at WED 24-JAN-96 12:01PM-PST, job 23b.
Remote system type is VMS.
ftp>type ascii<rtn>
200 Type A ok.
ftp>put xxxx.pdb<rtn>
local: xxxx.pdb remote xxxx.pdb
200 Port 10.72 at Host 134.121.43.151 accepted.
150 ASCII Store of DISK1:[BCSXX]XXXX.PDB:1 started.
226 Transfer completed. xxxxx (8) bytes transferred.
xxxxx bytes sent in 0.03 seconds (1077.52 Kbytes/s)
ftp>put yyyy.pdb<rtn>
local: yyyy.pdb remote yyyy.pdb
200 Port 10.72 at Host 134.121.43.151 accepted.
150 ASCII Store of DISK1:[BCSXX]YYYY.PDB:1 started.
226 Transfer completed. xxxxx (8) bytes transferred.
xxxxx bytes sent in 0.03 seconds (1077.52 Kbytes/s)
ftp>quit<rtn>
221 QUIT command received. Goodbye.
Now, you can either log out of ribozyme and log into model1 or you can open a telnet session on model1. Since no modelling will be done this time, a telnet session is a good option. Instructions are given below. User input is shown in bold type. When the Username: prompt appears log in as usual.
% telnet model1<rtn>
Trying 134.121.12.92...
Connected to model1.vadms.wsu.edu
Escape character is '^]'
Welcome to OpenVMS VAX 6.1
Username:
You will be running two pieces of software that directly analyze the PDB files to determine secondary structural assignments. The first one of these is called define_s. Before running the program, copy over a couple of files that the program needs in order to run. These have been given the logical names de_s and de_ch to help you copy them into your account. Use the following two command lines to copy these files into your account.
$ copy de_s *.*
$ copy de_ch *.*
To do the desired define_s analysis, enter the command line given below. Some initial data appears on the screen relating to the naming of the various generated output files, along with questions about the types of information the resulting output file will contain. These queries have already been answered for you and can be ignored. A listing of residue numbers then appears, one to a line, as the program moves through the sequence, and the program pauses at the last residue number. These calculations may take a while depending on the complexity and length of the molecule being studied. Two prompts appear at the end of the run asking whether certain generated files are to be deleted. If you are curious about their contents, reply with n; otherwise respond with y so that the files are removed. In the command line given below, XXXX represents the access code of your chosen lysozyme PDB file.
$ @define_s XXXX
The xxxx.sss file contains the results you are interested in. This software uses the letter B to denote beta sheet, and @ for alpha helix. Ignore any other symbols that may appear in the output. The desired results are given in lines that contain the term ELEMNT. Do a search of the output file for the term ELEMNT to get your results. In VMS the sea (search) command is similar to grep in UNIX.
$ sea xxxx.sss elemnt
The resulting data lines of interest are shown on the screen. These lines have the type of secondary structural unit given as the second column on the line and its location is the last two columns on the line. An example of output from a define_s analysis on another protein is given below.
REMARK ELEMNT = SYMBOL,HP(X,1),APT(X,1),IS,IEE
ELEMNT B 16.189 12.911  4.031  8.169  6.391 16.699  1  6
ELEMNT @  7.546  3.291 15.254  3.898  5.580 -2.612  6 18
ELEMNT B  3.882  5.449 -6.729  5.205 13.377 -2.551 19 22
ELEMNT @  7.451 14.902 -1.520  0.676 12.012 10.115 22 31
ELEMNT B  4.694 12.472 10.125 29.800 13.665  8.034 31 39
ELEMNT B  4.953 11.444 10.856 22.227 18.508  4.900 31 37
Looking at this example listing, the following secondary structural elements are located: sheets from 1-6, 19-22, and 31-39, and helices from 6-18 and 22-31.
Record below the define_s results for your chosen lysozyme PDB file.
define_s lysozyme prediction results:
helix locations: __________________________________________________
sheet locations: __________________________________________________
define_s selected molecule prediction results:
helix locations: __________________________________________________
sheet locations: __________________________________________________
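The second program is called dssp. Enter the command line given below, where xxxx represents the access code of your PDB file.
$ dssp xxxx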
Informational lines appear on the screen relating to the setting of various flags within the program. The screen will hold at FLAGACCESS DONE for a while until the calculations are completed and then produce a PRINTOUT DONE line followed by a statement showing the deletion of a file. The larger the protein being used the longer the wait. Your results are located in a file called xxxx.dssp, where the xxxx denotes the PDB access code used in the run.
The results of this program are produced in a file that is 132 characters wide. It is best to produce a hard copy of your results on the Commons printer to look at. While the printer in the teaching lab doesn't handle 132 characters, the information we want is within the first 80 characters of the printed line. The command cpr followed by the name of your desired file will cause the file to be printed on the printer in the teaching lab. A trailer page will help determine whose output is whose at the printer because it gives the username of the account that sent the preceding pages of text.
$ cpr xxxx.dssp
Examine your results. To get the required information, look at the letter appearing under the s in the structure column. The letters G and H denote helices, B, S and E denote sheets, and T denotes turns. Record your results below.
dssp lysozyme prediction results:
helix locations: __________________________________________________
sheet locations: __________________________________________________
turn locations: ___________________________________________________
dssp selected molecule prediction results:
helix locations: __________________________________________________
sheet locations: __________________________________________________
turn locations: ___________________________________________________
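Turning a column of structure letters into residue ranges is a simple run-length grouping, sketched below in Python. The sketch is an illustration only: the function name segments is hypothetical, the input is assumed to be a string of structure letters that you have copied out of the dssp output (one character per residue, blank where no letter is given), and the residues are assumed to be numbered consecutively. The grouping of letters follows the one used in this exercise.

```python
from itertools import groupby

# Grouping used in this exercise: G and H as helix, B, S and E as sheet, T as turn.
CLASSES = {"G": "helix", "H": "helix",
           "B": "sheet", "S": "sheet", "E": "sheet",
           "T": "turn"}


def segments(codes, first_residue=1):
    """Collapse a per-residue string of letters into (class, start, end) runs."""
    result = []
    position = first_residue
    # Consecutive residues that map to the same class form one run;
    # letters outside CLASSES (including blanks) map to None and are skipped.
    for label, run in groupby(codes, key=lambda c: CLASSES.get(c)):
        length = len(list(run))
        if label is not None:
            result.append((label, position, position + length - 1))
        position += length
    return result


# Made-up letters for illustration, not taken from a real dssp run.
print(segments(" EE  HHHHGGG TT"))
# prints [('sheet', 2, 3), ('helix', 6, 12), ('turn', 14, 15)]
```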
$ logout
So far you have completed a number of analysis runs to determine secondary structural assignments. You have used a secondary structure prediction program based on primary sequence to get two structural predictions. You have looked at actual coordinate data for secondary structural assignments with the dssp and define_s programs. You have also gone out on the net for another structural prediction and looked at the reported assignments in the published structure. It is now time to compare all these results with one another and see how well they agree.
Below is a composite listing of the results from these various methods for the crambin protein. In this comparison, h is for a predicted helical residue, s for a predicted sheet residue, t for a predicted turn residue and x for positions where the residue was predicted as both a helix and a sheet. Look carefully at this comparison.
DSSP: ss sshhhhhhhhhhhtt hhhhhhhhhs ss sss hhh
define_s: sssssshhhhhhhhhhhhsssxhhhhhhhhxssssssss
author: ssss hhhhhhhhhhhhhh hhhhhhhh ssss tttt
CF pred: sssssssss ttsssssstt sssssssssssstt ttt
GOR pred: tttssssssttttttt tt sssssttttssssstt ttt
nnpredict: ss ssss sssss
Sequence: TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
1 1 2 3 4
0 0 0 0
The best you can say about these results is that they don't agree. There appears to be agreement within each type of analysis (i.e., the results from coordinate data analysis more or less agree with one another, as do those based on primary sequence techniques). There are only two residues whose structure is the same in all six methods.
This type of comparison requires editing to generate the resulting data and can be very useful. However, it doesn't take into account protein folding and how the component structural elements might interact with one another. To do that requires some sort of graphical analysis method.
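The residue-by-residue tally used in the text comparison above (how many positions are given the same structure by every method) can be sketched in Python as follows. This is an illustration under stated assumptions: the function name agreement and the three short rows are made up, each row is assumed to hold one character per residue with blanks kept in place, and positions that are blank in every row are not counted as agreement.

```python
def agreement(rows):
    """Return the 1-based positions where every method makes the same non-blank call."""
    # Compare only as far as the shortest row reaches.
    length = min(len(row) for row in rows.values())
    agreed = []
    for i in range(length):
        calls = {row[i] for row in rows.values()}
        # One distinct letter across all rows, and that letter is not a blank.
        if len(calls) == 1 and calls != {" "}:
            agreed.append(i + 1)
    return agreed


# Hypothetical rows for illustration; these are not the crambin results.
rows = {
    "method_a": "sshhh tt",
    "method_b": "ss hh  t",
    "method_c": " shhh tt",
}
print(agreement(rows))  # prints [2, 4, 5, 8]
```

Whether a position left blank by every method should count as agreement is a judgment call; the sketch leaves it out.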
To generate a graphical analysis of these results, do the following steps. Start with the example.mol-template file you copied over to your account at the beginning of the exercise. This file contains the bare bones data for creating a postscript file like the one at the end of the exercise. Make a copy of this file so that you will have one to put your selected molecule results in.
% cp example.mol-template select.mol-template
The file divides a page up into six areas in which to show the results of the various analysis runs. In this file any sheet sections will be colored red, helical sections green and known turn sections blue. There is a coil section defined that handles any undefined regions of the sequence. The final edited version of the file will be run through the program molscript to produce the final image.
Molscript has some limitations that need to be taken into account when creating a graphical comparison file. The most important of these is that it will not draw a sheet section with fewer than three residues in it. This may mean that you will have to make decisions as to how to enter your data points when very small sheet sections are predicted close to one another.
To use this technique, organize your data in the following manner. List your predicted structural elements in the order they occur in the sequence. Then check to see if there are any small predicted sections (three residues is the smallest for all structures except coils). Either expand small sections or drop them. Fill in the gaps between the structural elements with coil sections. A coil section needs to overlap the ending and beginning residues of the predicted elements it connects.
For example, the nnpredict results for crambin show three sheet sections: 7-8, 24-27 and 31-35. In this example, if you want to show the first predicted sheet, you would have to expand the size of that sheet to three residues. Then the coils would have to be put in. This would result in the following coil and sheet lines in the file.
coil from 1 to 6;
strand from 6 to 8;
coil from 8 to 24;
strand from 24 to 27;
coil from 27 to 31;
strand from 31 to 35;
coil from 35 to 46;
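The bookkeeping just described (expand short elements, then bridge the gaps with overlapping coils) can be sketched in Python. This is an illustration only and not part of molscript: the function name molscript_lines is hypothetical, short elements are always expanded toward the start of the sequence as in the crambin example, and elements that would overlap after expansion are not handled.

```python
def molscript_lines(elements, last_residue, min_length=3):
    """Build coil and element lines from (kind, start, end) tuples in sequence order.

    kind is the molscript keyword to use: "strand", "helix" or "turn".
    """
    lines = []
    previous_end = 1
    for kind, start, end in elements:
        # Expand a too-short element toward the start of the sequence
        # (7-8 becomes 6-8), one of the two choices given in the text.
        if end - start + 1 < min_length:
            start = max(1, end - min_length + 1)
        # The connecting coil shares its end residues with its neighbours.
        if start > previous_end:
            lines.append(f"coil from {previous_end} to {start};")
        lines.append(f"{kind} from {start} to {end};")
        previous_end = end
    if previous_end < last_residue:
        lines.append(f"coil from {previous_end} to {last_residue};")
    return lines


# nnpredict sheet sections for crambin (46 residues), as given in the text.
crambin = [("strand", 7, 8), ("strand", 24, 27), ("strand", 31, 35)]
print("\n".join(molscript_lines(crambin, last_residue=46)))
```

The lines come out in sequence order. In the template file they still have to be sorted by hand into the coil, strand, helix and turn sections.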
A section of the example.mol-template file is shown below. The file contains six sections; the first one, for the Chou-Fasman prediction results, is shown here. Those areas that need to be changed are shown with x's. After you have determined the number of sheet, helix, turn and coil entries you need, you can start to edit the file.
First, you change the xxxx in the xxxx.pdb line to reflect the access code of the PDB file you are using. You either make more copies of the coil line or delete it as the situation warrants. Usually, you will need to make copies of this line in the file and modify each line to reflect the various coil sections that you will need to connect all the structural elements of the protein together.
Repeat this process with the sheet section. The molscript program uses the term strand instead of sheet. If there are no sheets predicted then don't have any lines after the set planecolour red; line. Move on to the helix and turn sections, and do the same thing. Remove any lines after the set planecolour line if there aren't any predicted structural elements of that type in the given prediction.
After the structural elements have been entered, change the xxxxx at the bottom of the section to reflect the name of the protein being used for this comparison. Change all six of the sections of this file to enter the data you gathered from the following sources: Chou-Fasman prediction, GOR prediction, define_s assignments, dssp assignments, nnpredict prediction and the reported assignments from the NRL_3D database.
! this is an attempt at using molscript
plot
area 50.0 550.0 250.0 750.0;
read mol "xxxx.pdb";
transform atom *
by centre position atom *
;
coil from x to xx;
set planecolour red;
strand from x to xx;
set planecolour green;
helix from x to xx;
set planecolour blue;
turn from x to xx;
set depthcue 0.0, labelsize 20.0;
label 0.0 -15.0 0.0 "xxxxx CF";
end_plot
When you have whipped the example.mol-template file into shape, enter the following command to create the final image. Use (your last name)-lyso.ps for the filename of the lysozyme graphical comparison file.
% molscript < example.mol-template > (your lastname)-lyso.ps
Watch the program as it progresses to see if there are any problems. If there are, ask for help from your instructor. You should also check to see if the structural elements you entered really appear on the screen. If an element is too small, weird things will happen to the resulting output and they will not match what you entered.
Print off the file on the teaching lab printer with the following command.
% lpr (your lastname)-lyso.ps
If the image looks ok, repeat this process with your selected molecule, using the select.mol-template file this time. Use (your last name)-sel.ps for the selected molecule graphical comparison file.
9) Finishing up.
Use the editor pico to fill in the report form and send it over to the teacher account. Also send the two graphical comparison files over to the teacher account:
% mv week5m.week5m (your lastname).week5m
% pico (your lastname).week5m
% rcp (your lastname).week5m teacher@ribozyme:receive
% rcp (your lastname)-lyso.ps teacher@ribozyme:receive
% rcp (your lastname)-sel.ps teacher@ribozyme:receive
This concludes your computing session for this week. Log off the computer.
Now exit the emulator program by selecting the Quit option from the File location on the control bar. You will be returned to the Launcher window screen.
Define_S: F.M. Richards and C.E. Kundrot, Proteins 3: 71-84 (1988).
DSSP: W. Kabsch and C. Sander, Biopolymers 22: 2577-2637 (1983).
G. von Heijne, Sequence Analysis in Molecular Biology: Treasure Trove or Trivial Pursuit, pp. 81-93. Academic Press, Inc. (1987).
MOLSCRIPT: P.J. Kraulis, "MOLSCRIPT: a program to produce both detailed and schematic plots of protein structures", Journal of Applied Crystallography 24: 946-950 (1991).
Internet resources used: