Estimating protein secondary structure and physical attributes: Learning about various methods and the usefulness and limitations of these predictions.
Authors:
Susan Jean Johns and Steven M. Thompson
The determination of protein secondary structure has been an intriguing puzzle. When Linus Pauling made his prediction in 1948 that proteins would be composed of alpha helix and beta sheet units, no protein structures had yet been determined. His prediction was based solely on the idea that the hydrogen bonding possible in such structures would increase their stability and make them more probable. Improvements in x-ray diffraction techniques later made it possible to solve three-dimensional protein structures, and the predicted units were indeed present.
As more and more structures were determined, the beginnings of possible folding patterns were observed. Soluble globular proteins have started to be understood in general terms. "The principle underlying the structure of helices, sheets, and turns is the simultaneous formation of hydrogen bonds by buried peptide groups and the retention of single residue conformations close to those of minimum energy. The shape of the helix and sheet structures make these structural elements pack together in a small number of relative orientations. The links between secondary structures tend to be right-handed and short, and do not form knots." As a result, globular proteins usually fold into a few common patterns. These proteins can roughly be grouped into four classes: all alpha, all beta, mixed alpha/beta formed from beta-alpha-beta units, and alpha + beta where the helix and sheet units are segregated.
The formation of a peptide into a helix, a sheet, or a turn depends primarily on the preferred conformations of the constituent residues and the packing quality of the surface formed. Prediction schemes based only on local or semi-local sequence patterns have been devised and have had relative success. Beyond these generalities, the detailed mechanisms of folding are only vaguely understood.
Indications of the possible secondary structures of proteins came from initial studies on polypeptides. As protein structures were solved it appeared that the conformation of residues in proteins was similar to their homopolymeric form. This correlation is far from perfect, however.
Even as the body of determined structures grows, questions remain as to what the relationship is between solved crystal structures and proteins in solution. What effect do ionic conditions have on secondary structure? What effect does protein concentration have? Do crystals with different space groups produce the same or similar protein structures? Do x-ray and NMR structure determinations on the same protein agree with one another? If not, why not?
After a number of structures had been determined, various research groups performed statistical studies on these data to determine the preferences of individual amino acids for given secondary structure types. These efforts resulted in the empirical prediction schemes of Chou-Fasman and Garnier-Robson. The Chou-Fasman method is a group of rules applied to a given sequence. It is an ambiguous method that has proven difficult to automate. The Garnier-Robson method is based on the consistent application of information theory with auxiliary information from circular dichroism (CD) used to bias its prediction. This method is unambiguous and easy to automate. Both methods have been incorporated into the GCG Wisconsin Sequence Analysis Package.
Dichroism is the differential absorption of light according to its polarization. In circular dichroism, the sample is probed with both left- and right-circularly polarized light. Chiral molecules, those which are not superimposable on their mirror images, absorb these two forms of circularly polarized light to different extents. CD instruments measure this difference over a range of wavelengths for a given sample and output the results as a spectrum. CD studies can be used to obtain experimental secondary structure estimates by interpreting the spectra produced.
Another approach is to look for periodicity in regular secondary structures. Such information can often be seen best with helical wheel diagrams where the view down the helical axis shows groupings of similar kinds of amino acids. The regular appearance of apolar residues spaced 3 or 4 residues apart could be a pattern indicative of alpha helices, while sheets might show uniformly apolar sections -- if completely buried within a protein -- or alternating polar and apolar residues if on the surface. Some proteins have been shown to display these patterns to a certain extent. Such studies have resulted in the prediction scheme of Lim and Eisenberg's hydrophobic moment technique.
Others have looked at all the possible structural conformations for various sequence sections that exist in the known structures and tried to form prediction schemes based on their findings. The thought is that a similar sequence will have similar secondary structures wherever it is found. To do this, a measure of similarity must be established between the studied sequences and the possible conformations weighed to form a final prediction. The algorithms of Nishikawa and Ooi, Levin, and Sweet are all based on this theme. The differences result from the comparison choices made and the scoring systems used.
Protein secondary structure prediction reliability is quite controversial. Studies done on the reliability of various prediction schemes show disheartening results. Depending on whether four or three secondary structural elements are used, random chance would result in either a 25% or a 33% chance of a prediction being correct. Most of the different approaches touched on here only improve those chances to between 45% and 55% of the prediction being correct. Reported higher percentages are often the result of a biased data set, and not an actual improvement in the technique devised. However, recent advances, such as the PredictProtein server at EMBL, combine neural net technology with the strength of multiple sequence analysis to improve reliability up to and beyond 70% in many situations.
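The percentages quoted above are per-residue agreement figures. A minimal Python sketch of how such a figure might be computed is given here. It assumes the predicted and observed states are supplied as equal-length strings of one-letter codes (H, B, T); the function names and the ten-residue example strings are hypothetical and are not taken from any of the programs discussed.

```python
def percent_correct(predicted, observed):
    """Percentage of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("state strings must have the same length")
    if not observed:
        raise ValueError("state strings must not be empty")
    matches = sum(1 for p, o in zip(predicted, observed) if p == o)
    return 100.0 * matches / len(observed)


def chance_level(n_states):
    """Expected percentage correct when guessing uniformly among n_states."""
    return 100.0 / n_states


# Hypothetical example: H = helix, B = sheet, T = turn.
observed = "HHHHHTTBBB"
predicted = "HHHTTTTBBH"
print(percent_correct(predicted, observed))  # 7 of 10 positions agree
print(chance_level(3), chance_level(4))
```

A score is only meaningful relative to the chance level for the number of states used, which is why the same method can look better or worse depending on how many structural classes are counted.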
One of the more important things to realize is many of the algorithms are based on soluble, globular proteins; therefore, when dealing with other types of proteins you must alter parameters and interpret the results in this light. Using the same parameters with all types of proteins would not be appropriate. Since defaults are often based on the soluble type guidelines, one must be especially careful when working with membrane-associated proteins. The applicability of these parameters is vital and one must tailor them appropriately. The simplest parameter to change is often the window size. It should be set approximately to the size of the feature being analyzed (e.g., use a window size of about 21 when trying to find membrane-spanning helices).
Users of such prediction schemes must be cautious in the application and interpretation of their results. It is best to use these predictions only in cases where other, potentially confirming experimental evidence is available, such as the presence of antigen producing regions, or estimates derived from physical data. In all cases the computer must be thought of as a tool only; experimental evidence should be used to corroborate.
A comparison of four insulin-like growth factor II sequences and some secondary structure analysis methods follows. The PDB entry is a model of the human mature form only, the SwissProt sequence is the human precursor protein, the GenBank entry is a translation based on the reference CDS information from human entry HumIGF2g, and the Profile consensus is based on the conserved portion of a multiple sequence alignment of all unique IGF2 protein entries. Following each block of the sequence alignment are the predicted and modeled secondary structural elements of the protein.
. . . . . .
PDB AYRPSETLCGGELVDTLQFVCGDRGFYF...SRPAS 33
|||||||||||||||||||||||||||| |||||
SwissPro MGIPMGKSMLVLLTFLAFASCCIAAYRPSETLCGGELVDTLQFVCGDRGFYF...SRPAS 57
|||||||||||||||||||||||||||||||||||||||||||||||||||| |||||
GenBank MGIPMGKSMLVLLTFLAFASCCIAAYRPSETLCGGELVDTLQFVCGDRGFYF...SRPAS 57
|||||||||||||||||||||||||||| |||.|
Profile AYRPSETLCGGELVDTLQFVCGDRGFYFRLPSRPSS 36
PDB secondary structure data: HHHHHHHHHHH
TTTTT TTTTT
GCG CF: TTttHHHHHHHHHHHttBBBBB TTt tTT BBBBBTTTTBBB...BBttt
GCG GOR: HHHHHHHHHHHHHHHH TTTTTTTT BBBBBBBBBTTTTTTT
GCG AI: xxx x xxxx xx
Amphi: AAAAAA
HelicalWheel: HYDROPHOBIC amphiphilic
. . . . . .
PDB RVSRRSR..................GIVEECCFRSCDLALLETYCATPAKSE 67
||||||| |||||||||||||||||||||||||||
SwissPr RVSRRSR..................GIVEECCFRSCDLALLETYCATPAKSERDVSTPPT 99
||||||| |||||||||||||||||||||||||||||||||||
GenBank RVSRRSRGIVEECCFRRKQHSSTMPGIVEECCFRSCDLALLETYCATPAKSERDVSTPPT 117
||.|||| |||||||||||||||||||||||||||
Profile RVNRRSR..................GIVEECCFRSCDLALLETYCATPAKSE 70
PDB secondary structure data: HHHHHHHH HHHHHH
TTTTT
GCG CF: ttttthhhhhhhhhhhhTT TThhhhhhhhtttHHHHHH hhhhhhhhhh tt
GCG GOR: TTTTTT HHHHHHHTTT TTTTTTHHHHHHHHHHHH TTTT TT
GCG AI: xxxxxx xxxxx x xxxx
Amphi: AAAAAA AAAAAAA
HelicalWheel: weakly amphiphilic
. . . . . .
SwissPr VLPDNFPRYPVGKFFQYDTWKQSTQRLRRGLPALLRARRGHVLAKELEAFREAKRHRPLI 159
||||||| .|:|||||||||||||||||||||||||||||||||||||||||||||||||
GenBank VLPDNFPEIPLGKFFQYDTWKQSTQRLRRGLPALLRARRGHVLAKELEAFREAKRHRPLI 177
GCG CF: TTt BBBBB tt ttthhhhhhhhh HHHHHHHHHHHHHHHHHbbb
GCG GOR: TTTTTTTTTT TTTTT HHHHHHHH HHHHHHHHHHHHH
GCG AI: xxxxxx xxxxx
Amphi: AAAAAAAAAA
HelicalWheel: amphiphilic
. .
SwissPr ALPTQDPAHGGAPPEMASNRK 180
|||||||||||||||||||||
GenBank ALPTQDPAHGGAPPEMASNRK 198
GCG CF: bb ttttttthhhhhhhtt
GCG GOR: HHHHHHH
GCG AI: xxx xx
X-ray data can be interpreted in many different ways. The structural assignments made by the author of the structure may not agree with assignments made via programs using the same coordinate data as input. Even the assignments made by computer software will vary. Actual x-ray data is a guide to, not a final configuration for, secondary structural elements of any given protein. The actual starting and ending points of these structural units are often subject to conjecture and may be somewhat subjective.
This exercise will acquaint you with various computer methods to estimate protein secondary structure. Some of these programs use experimentally determined data, others are based on statistical analysis or interpretation of crystallographic results.
As in the previous four exercises, run all the analyses on your Selected Molecule. Use the examples given here with the human prion protein as a guide for running your own studies. Many of the following programs are GCG graphics routines so be sure to initialize both GCG and the graphics configuration before beginning the remainder of the session!
1) Physical Characteristics/Protein Mapping: PEPTIDEMAP, PEPTIDESORT, ISOELECTRIC
These three GCG programs enable you to generate protease digestion data, molecular weight and amino acid composition information, and HPLC retention and isoelectric point values. All results can be experimentally verified and often may assist in experimental design. You are welcome to run your selected molecule through these programs, although it is not required. They are very fast and easy to use, and may prove useful in your own labs.
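To give a feel for the kind of calculation involved, here is a minimal Python sketch of an amino acid composition count and a simulated tryptic digest. It is not GCG's implementation: the function names are hypothetical, and the cleavage rule (cut after K or R unless the next residue is P) is a common simplification that ignores missed cleavages.

```python
from collections import Counter


def composition(sequence):
    """Count how many times each amino acid occurs in the sequence."""
    return Counter(sequence)


def tryptic_peptides(sequence):
    """Cleave after K or R unless the next residue is P (simplified rule)."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


# First 28 residues of mature IGF-II, taken from the alignment shown earlier.
fragment = "AYRPSETLCGGELVDTLQFVCGDRGFYF"
print(tryptic_peptides(fragment))
print(composition(fragment)["G"])
```

The arginine at position 3 is followed by proline and is therefore left uncut, so the fragment yields two peptides rather than three.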
2) Hydrophobicity and Amphiphilicity:
Hydrophobicity is a measure of how much a molecule hates water. Each amino acid can be assigned a hydrophobicity value. This has been done by many researchers, hence the abundance of different hydrophobicity scales. In all hydrophobicity scales the more positive the number, the more hydrophobic the residue; the converse holds in hydrophilicity scales. Hydrophobic residues tend to lie buried in the interior of a protein while hydrophilic residues tend toward a surface. Correspondingly, in membrane-associated proteins, those residues in contact with the lipid bilayer tend toward strong hydrophobicity. The pattern of hydrophobic and hydrophilic residues in a protein can often reveal aspects of protein structure. The most common structures hypothesized in this manner are membrane-spanning helices. To search for this type of helix, window sizes of 19 to 21 should be used since about 20 amino acids are required to span the membrane.
Two homegrown programs, PK23 and GES, plot Kyte-Doolittle and Goldman et al. data, respectively. PK23 allows you to specify up to four different window sizes. Type pk23 to launch the first program. Supply appropriate parameters for your selected molecule:
% pk23
Process set to plot with VERSATERM-TEK4105 attached to term
using the tekd graphic interface.
Kyte - Doolittle plotting program
Please enter the filename.ext humprp.pep
Begin (* 1 *) ? <rtn>
End (* 231 *) ? <rtn>
Enter number of windows (1-4): 3
Average of hydrophilicity over how many acids (* 7 *) ? 5
Average of hydrophilicity over how many acids (* 7 *) ? 7
Average of hydrophilicity over how many acids (* 7 *) ? 9
values are -4.1600008 to 3.5600008
When your VERSATERM-TEK4105 attached to tty is ready, press <Return>. <rtn>

Type ges to launch the Goldman et al. hydrophobicity plot. Notice the way the program runs is quite different than PK23:
% ges
Process set to plot with VERSATERM-TEK4105 attached to term
using the tekd graphic interface.
This is the program GESPLOT
It will either create a file of GES values
or produce a plot of the results
Do you want just a file of values - 1,
or a plot of the results - 2,
or both a file and a plot - 3? : 2
Change the window from the default of 20? y=1 : 1
Enter new window size
Must be between 1 and 50
new value: 7
GESPLOT of what protein sequence ? humprp.pep
Begin (* 1 *) ? <rtn>
End (* 231 *) ? <rtn>
What density in residues per page (* 231 *) ? <rtn>
That will take 1 pages, is that alright (* Yes *) ? <rtn>
When your VERSATERM-TEK4105 attached to tty is ready, press <Return>. <rtn>

Take notes of any striking peaks and valleys or turn your plots into PostScript files and print them out. This information will be required in the report form.
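The curves drawn by these programs are moving averages of per-residue scale values. The Python sketch below shows one way such a profile might be computed with the published Kyte-Doolittle hydropathy values. It is an illustration only and not the code of PK23 or GES; the function name and the choice to report the window's center position are assumptions.

```python
# Kyte-Doolittle hydropathy values (more positive = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def windowed_hydropathy(sequence, window=7):
    """Return (center_position, mean_hydropathy) pairs; positions are 1-based."""
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    values = [KYTE_DOOLITTLE[residue] for residue in sequence]
    profile = []
    for start in range(len(values) - window + 1):
        mean = sum(values[start:start + window]) / window
        center = start + window // 2 + 1
        profile.append((center, mean))
    return profile


# Signal-peptide region of the IGF-II precursor from the alignment shown earlier.
sequence = "MGIPMGKSMLVLLTFLAFASCCIA"
for center, mean in windowed_hydropathy(sequence, window=7):
    print(f"{center:3d} {mean:6.2f}")
```

Rerunning the loop with window=19 or window=21 gives the broader smoothing suggested above for membrane-spanning helices, at the cost of losing short features.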
The helical hydrophobic moment, as described by David Eisenberg, quantitatively shows how asymmetrically distributed residue hydrophobicities are, by using vector mathematics. This value, calculated with the appropriate window size, can often help you identify "amphiphilic" structures. These are alpha-helices or beta-sheets with one polar and one apolar face. This type of alpha-helix is often found in membrane channels with several helices clustered together, hydrophilic to the middle and hydrophobic to the membrane. Amphiphilic structures are also commonly found on the surface of proteins, often with their hydrophilic face exposed to the solvent and their hydrophobic face interacting with a lipid membrane. The GCG implementation of hydrophobic moment plotting, Moment, can be difficult to interpret. Run Moment on your sequence by typing moment if you want, but it is not required. Another homegrown program, MOM, was created to simplify this interpretation. Run MOM on your selected molecule by typing mom at the command line. The plot can get confusing when both alpha and beta curves are drawn on the same plot, so make two separate plots. Take notes of the sequence location of any particularly striking moment peaks or print out a PostScript plot of the figure.
% mom
This is the program MOMENTPLOT
It will create plot(s) of MOMENT values
MOMENTPLOT of what protein sequence ? humprp.pep
Begin (* 1 *) ? <rtn>
End (* 231 *) ? <rtn>
Enter the type of plot desired
helical moment = 1
beta moment = 2
both values = 3 1
max is 0.82 min is 0.01
max is 1.04 min is 0.03
When your TEK4107 attached to tty is ready, press <Return>. <rtn>
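For readers who want to see the vector sum behind a moment plot, a minimal Python sketch follows. It is not the code of Moment or MOM. It assumes the Kyte-Doolittle scale (other scales can be substituted), a rotation of 100 degrees per residue for an alpha helix, and division by the window length to give a per-residue value; the function names and the test peptide are hypothetical. Angles near 160 to 180 degrees are commonly used for beta strands.

```python
import math

# Kyte-Doolittle values; using this particular scale is an assumption.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_moment(segment, angle_degrees=100.0):
    """Mean hydrophobic moment per residue for one segment of sequence."""
    if not segment:
        raise ValueError("segment must not be empty")
    delta = math.radians(angle_degrees)
    sin_sum = 0.0
    cos_sum = 0.0
    for n, residue in enumerate(segment):
        h = KYTE_DOOLITTLE[residue]
        # Each residue contributes a vector of length h pointing at angle n * delta.
        sin_sum += h * math.sin(n * delta)
        cos_sum += h * math.cos(n * delta)
    return math.hypot(sin_sum, cos_sum) / len(segment)


def moment_profile(sequence, window=11, angle_degrees=100.0):
    """Slide a window along the sequence; return (1-based start, moment) pairs."""
    return [
        (start + 1, hydrophobic_moment(sequence[start:start + window], angle_degrees))
        for start in range(len(sequence) - window + 1)
    ]


# Hypothetical peptide with leucines and lysines in a helical repeat.
peptide = "LKKLLKKLLKKLLKKL"
for start, moment in moment_profile(peptide, window=11):
    print(f"{start:3d} {moment:5.2f}")
```

A segment whose hydrophobic residues all fall on one face gives vectors that reinforce one another and a large moment; a uniformly hydrophobic or randomly ordered segment gives vectors that largely cancel.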

A fourth piece of homegrown software known as Amphi can be run to help confirm the location of possible surface helices. T-cell antigenicity has been found to correlate very highly with amphiphilic (amphipathic) alpha helices, especially those present after partial cleavage and/or unfolding. This program attempts to determine if potential amphipathic helices exist in a sequence through hydrophobic moment analysis. Its main function is to locate the possible T-cell antigenic sites that correlate with those amphipathic helices. Run Amphi on your molecule by typing amphi; generate the short form of output.
% amphi
Amphi
This program is for the predication of helper
T-cell antigenic sites that correlate with
amphipathic helices
Enter input filename: humprp.pep
Enter output filename: humprp.amphi
Enter desired block length [7 or 11]: 11
Begin (* 1 *) ? <rtn>
End (* 231 *) ? <rtn>
length is the sequence is 231
if you would like a detailed output
type 1 else type 0 0
Examine the results of this program on your protein to see if it correlates
with the MOM plot made earlier. The example prion output is below; wherever
the "as" value climbs above ten, the area is predicted as a good amphipathic
site:
Prediction of helper T cell antigenic sites for humprp.pep
predicted amphipathic segments
        mid points     angles      as
        of blocks
--------------------------------------
  P       33- 41     85.-135.    16.6
  P       43- 49     85.-135.    12.4
  P       51- 57     85.-135.    12.4
  P       59- 65     85.-135.    12.0
K P       74- 82     95.-130.    21.9
          90- 94     80.-125.     9.8
  P       98-116     80.-100.    39.1
         123-125     85.- 90.     6.0
         130-149     85.-115.    52.4
*        152-156     90.-105.    10.6
*        172-176    115.-135.     9.6
         181-189     80.-110.    21.7
No of predicted blocks 105
3) Secondary Structure Prediction Programs:
3a) PEPPLOT analysis:
The first combination program to investigate is GCG's PepPlot. It can produce up to nine different graphical panels displaying various secondary structure and physical attributes. Sometimes running the program with only a few of the panel displays activated can be more effective than showing them all. Since this program can take advantage of many options, run it with the check option. The most important options to note are the hwindow and geswindow settings. These are set at different values by default; this will not yield nicely congruous hydrophobicity plots. Therefore, decide on an appropriate window size for your circumstance and assign it to both window sizes. Notice that the resultant plot has many striking peaks and valleys -- these can indicate several structural features, especially when they all agree with one another. However, do not blindly accept the predictions; take what is offered with a "grain of salt."
A sample PepPlot session and its resultant graphic using the human prion protein is illustrated next. Use your Selected Molecule with the program PepPlot as illustrated here.
% pepplot -check
PepPlot plots measures of protein secondary structure and
hydrophobicity in parallel panels of the same plot.
Minimal Syntax: % pepplot PIR:Kihua -Default
Prompted Parameters:
-BEGin=1 -END=100 the range of interest
-DENsity=87 density in residues per 100 platen units
-MENu=a sequence display
b charged-polar-hydrophobic residue cartoon
c beta forming-breaking symbols
d Chou-Fasman alpha-beta prediction curves
e alpha forming-breaking symbols
f Chou-Fasman NH2-Ends prediction curves
g Chou-Fasman CO2-Ends prediction curves
h Chou-Fasman turn prediction curve
i helical hydrophobic moment for alpha and beta
j hydropathy and hydrophilicity
Local Data Files:
-DATa1=pepplot.dat amino acid attributes except for Garnier
Press q to quit or <Return> for more:
-DATa2=garnier.dat amino acid attributes for Garnier
-DATa3=ges.dat hydrophobicities for the GES curve
Optional Parameters:
-CFFile[=kihua.cho] writes out the Chou and Fasman predictions
-GARnierfile[=kihua.gar] writes out the Garnier predictions
-MOMentfile[=kihua.mom] writes out the Hydrophobic moment values
-NOPLOt suppresses the whole plot
-HWINdow=9 sets the window for hydropathy averaging
-NOGES suppresses the GES curve (default)
-GESWindow=20 sets the window for GES scale averaging
-SHOwseq insists on showing the sequence in panel 1
-BOXES draws a box around each quantitative panel
-NOTITle suppresses the plot's title
All GCG graphics programs accept these and other switches. See the Using
Graphics chapter of the USERS GUIDE for descriptions.
-FIGure[=FileName] stores plot in a file for later input to FIGURE
-FONT=3 draws all text on the plot using font 3
-COLor=1 draws entire plot with pen in stall 1
-SCAle=1.2 enlarges the plot by 20 percent (zoom in)
Press q to quit or <Return> for more: <rtn>
-XPAN=10.0 moves plot to the right 10 platen units (pan right)
-YPAN=10.0 moves plot up 10 platen units (pan up)
-PORtrait rotates plot 90 degrees
Add what to the command line ? -hwindow=7 -geswindow=7
Process set to plot with VERSATERM-TEK4105 attached to term
using the tekd graphic interface.
PEPPLOT of what protein sequence ? humprp.pep
Begin (* 1 *) ? <rtn>
End (* 231 *) ? <rtn>
The minimum density for a one-page plot is 200.9 residues/100 platen units.
What density do you want (* 200.9 *) ? <rtn>
What Panels do you want to plot?
a) Sequence
b) Charged-polar-hydrophobic residue schematic
c) Beta forming-breaking symbols
d) Chou-Fasman Alpha-Beta prediction curves
e) Alpha forming-breaking symbols
f) Chou-Fasman NH2-end prediction curves
g) Chou-Fasman CO2-end prediction curves
h) Chou-Fasman Turn prediction curve
i) Helical Hydrophobic Moment for Alpha and Beta
j) Hydropathy and Hydrophilicity
Please choose one or more (* ABCDEFGHIJ *): <rtn>
You may want to only choose a few rather than all for your own research situation. I certainly recommend this for publication purposes, especially if you create a GCG Figure file and edit that file to customize the presentation.
When your VERSATERM-TEK4105 attached to tty is ready, press <Return>. <rtn>

3b) PEPTIDESTRUCTURE/PLOTSTRUCTURE analysis:
PeptideStructure makes secondary structure predictions, including B-cell antigenicity, flexibility, and surface probability, as well as a hydrophilicity determination; PlotStructure graphically displays these predictions. PeptideStructure must be run first. The program is optimized for soluble, globular proteins; therefore, the window size should be changed for anything other than these types of proteins. Use your Selected Molecule primary sequence file in the program PeptideStructure to determine how GCG's other secondary structure program estimates its secondary structure. Use the prion example given below as a guide to running the program. The broadening option can be helpful; accept the default Kyte-Doolittle hydrophilicity scale calculation.
% peptidestructure -check
PeptideStructure makes secondary structure predictions for a peptide
sequence. The predictions include (in addition to alpha, beta, coil, and
turn) measures for antigenicity, flexibility, hydrophobicity, and surface
probability. PlotStructure displays the predictions graphically.
Minimal Syntax: % peptidestructure [-INfile=]PIR:Mmecf -Default
Prompted Parameters:
-BEGin=1 first base of sequence
-END=362 last base of sequence
-MENu=k K)yte-Doolittle or H)opp-Woods menu
[-OUTfile=]mmecf.p2s output file name
Local Data Files: None
Optional Parameters:
-HWINdow=7 sets the window for the hydrophilicity calculation
-BROAdening broadens peaks of antigenic index outside of strong helices
Add what to the command line ? -broad
The broadening option allows for easier visualization of antigenic peaks by smoothing them out somewhat. Changing the hwindow parameter would be appropriate for anything other than a soluble protein.
PEPTIDESTRUCTURE for what peptide sequence ? humprp.pep
Begin (* 1 *) ? <rtn>
End (* 231 *) ? <rtn>
Calculate hydrophilicity according to
H)opp-Woods or
K)yte-Doolittle
Please choose one (* K *) : <rtn>
What should I call the output file (* humprp.p2s *) ? humprp_kyte.p2s
The output file from this run, humprp_kyte.p2s, contains structural and antigenic index predictions. Data is displayed in this file in columns under given headings. The Chou-Fasman predictions are under the CF-Pred heading, the Garnier-Robson are under the GORPred heading and the antigenic index information under the AI-Ind heading. Chou-Fasman uses both upper and lower case letters in their prediction scheme. Upper case letters denote strongly predicted structures while lower case indicate weakly predicted ones. Garnier-Robson uses only upper case letters in their prediction scheme, not bothering with weak predictions. Use lpr to print off your .p2s output file to study.
Consider an antigenic index value of 1.00 or greater to be a probable antigenic site. This antigenic index, as opposed to AMPHI's, is based on the amount of predicted surface exposure, flexibility of a portion of the molecule, hydrophilicity values, and secondary predictions all combined, rather than just the predicted existence of amphipathic helices. As such, it attempts to predict all major immunogenic determining sites, especially those associated with B-cell humoral response epitopes, not T-cell.
PlotStructure can now be used to display the data in the .p2s file graphically. Answer the PlotStructure input file name prompt with the name of your .p2s file. The program will ask for the sequence's beginning and ending; accept the defaults. Next, you will be asked if you want a panel or a squiggly graph. For the first pass, accept the default squiggly plot. Next you will be asked whether to plot with Chou-Fasman or Garnier-Robson predictions. Both are based on soluble, globular proteins so it doesn't really matter, but let's go with Garnier-Robson this round; sometimes it's more accurate than Chou-Fasman. For this first round let's superimpose hydrophilicity by choosing "H." The program will ask you if you are ready; press <return>. A squiggle plot of your protein will be drawn. Refer to the GCG manual for interpreting the various types of lines that describe the secondary elements of the protein. Next, rerun PlotStructure, but this time choose the option to generate a panel graph. This graph is much more informative. Finally, repeat the panel graph choice, but specify -figure=(your last name).p2splot on the command line to create a figure file of your work.
A screen trace and the resultant plot from a PlotStructure session with the prion protein follow:
% plotstructure
PlotStructure plots the measures of protein secondary structure in
the output file from PeptideStructure. The measures can be shown on
parallel panels of a graph or with a two-dimensional "squiggly"
representation.
Process set to plot with VERSATERM-TEK4105 attached to term:
using the tekd graphic interface.
PLOTSTRUCTURE of what PEPTIDESTRUCTURE output file ? humprp_kyte.p2s
humprp_kyte.p2s is PEPTIDESTRUCTURE of: humprp.pep check: 414 from: 1 to: 231
calculated on: February 18, 1996 16:51
Plot Begin (* 1 *) <rtn>
Plot End (* 231 *) <rtn>
Do you want a
1)-dimensional (panel graph) or a
2)-dimensional (squiggly) plot
Please choose one (* 2 *) : 1
When your VERSATERM-TEK4105 attached to tty is ready, press <Return>. <rtn>

4) Secondary structural information in PDB files.
All of the Selected Molecules for this class have had either their structures determined or close homologs' structures determined; all have PDB access codes. You should know this access code due to earlier searching efforts. In those efforts you should have explored visualizing their structure with Entrez and/or RasMol. Since the NRL_3D database corresponds to all of the sequences from Brookhaven's PDB, and also contains secondary structure annotation, we can use its entry as an easy way to see the secondary assignments found in the PDB file. NRL_3D uses a sequence naming convention based on the corresponding PDB entry. For instance, the NRL_3D entry that contains the sequence of PDB entry 1GF2, the insulin-like growth factor II used as an example for the comparison file shown previously, is 1GF2. In cases where the PDB file contains multiple chains, NRL_3D adds numbers or letters to the end of the name to differentiate chains. You should know the variation of the PDB code that NRL_3D uses for your molecule.
Use the command line typedata -reference on your entry to read the author's secondary structure assignment. Compare the secondary information in it to what you discovered in the above predictive analyses. Take notes of your findings. A few warnings about PDB data need to be heeded though. First of all, PDB data is always on the mature protein whereas standard sequence database entries are usually precursor molecules. This will yield a numbering discrepancy between PDB/NRL and PIR/SwissProt. A good way to cope with this is to run a quick Gap alignment of the two sequences. Secondly, PDB data entries are often multiple chains, especially when the molecule exists in nature as a complex. This is apparent in the RuBisCO Selected Molecule case. Both issues can create confusion.
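When the mature chain is an exact, ungapped piece of the precursor, the numbering discrepancy is a constant offset that can be found with a simple substring search. The Python sketch below illustrates this with the first residues of the IGF-II entries from the alignment shown earlier. The function names are hypothetical, and the approach is only a shortcut: it fails as soon as the two entries differ by a substitution or a gap, which is where a real alignment with Gap is needed.

```python
def numbering_offset(precursor, mature):
    """Number of residues that precede the mature chain within the precursor."""
    offset = precursor.find(mature)
    if offset < 0:
        raise ValueError("mature sequence is not an exact substring of the precursor")
    return offset


def to_precursor_numbering(position, offset):
    """Convert a 1-based mature-chain position to 1-based precursor numbering."""
    return position + offset


# First residues of the precursor and mature IGF-II entries from the alignment.
precursor = "MGIPMGKSMLVLLTFLAFASCCIAAYRPSETLCGGELVDTLQFVCGDRGFYF"
mature = "AYRPSETLCGGELVDTLQFVCGDRGFYF"

offset = numbering_offset(precursor, mature)
print(offset)                              # 24 residues of leader sequence
print(to_precursor_numbering(28, offset))  # mature residue 28 is precursor residue 52
```

The offset of 24 agrees with the alignment, where the same column is numbered 33 in the PDB entry and 57 in the precursor.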
5) Run HELICALWHEEL to verify the occurrence of amphiphilic helices.
After all other analyses have finished, HelicalWheel can be used on those areas of the molecule which are, or show the potential to be, an amphiphilic helix or sheet. Examining the results of helical wheel analysis on those areas of a predicted or known secondary element can verify whether any asymmetrical ordering of the hydrophobicity pattern of the residues within that element is present and can often give information on potential packing patterns. It is by far the easiest way to visualize this phenomenon.
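The geometry behind a helical wheel is simple enough to sketch in a few lines of Python. The example below is not GCG's HelicalWheel; it assumes an ideal alpha helix with 100 degrees of rotation per residue, and both the function name and the apolar residue grouping are hypothetical simplifications. The input segment comes from the IGF-II alignment shown earlier and is used only as sample text, not as a claim about where the helix lies.

```python
# Hypothetical, simplified grouping of apolar residues for labeling only.
APOLAR = set("AVLIMFWC")


def helical_wheel(segment, start=1, angle_degrees=100.0):
    """Return (position, residue, angle) for each residue viewed down the helix axis."""
    wheel = []
    for i, residue in enumerate(segment):
        angle = (i * angle_degrees) % 360.0
        wheel.append((start + i, residue, angle))
    return wheel


segment = "GIVEECCFRSCDLALLETY"
# Sort by angle so that residues adjacent on the wheel are listed together.
for position, residue, angle in sorted(helical_wheel(segment), key=lambda row: row[2]):
    label = "apolar" if residue in APOLAR else "polar"
    print(f"{angle:5.0f}  {residue}{position:<3d} {label}")
```

If the apolar labels cluster within one range of angles, the segment has the one-sided hydrophobic face that marks an amphiphilic helix.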
Determine where the actual helical regions of your protein are by using the PDB secondary assignments found in the above section as your source of information and record them in your notes. Be sure to take the numbering discrepancy between the two database entries into consideration. Repeat with the sheet specifications if you would like.
With the locations of the alpha helical regions specified, run HelicalWheel on each of them and note any ordering of the helical surfaces shown, i.e. the clustering of polar or nonpolar amino acids on one side of the wheel. Use the example run given below as a guide for the program. Record your findings for the various helices tested or make PostScript plots of the results. If you want to test your sheet regions, repeat the analyses with the beta option.
% helicalwheel
HelicalWheel plots a peptide sequence as a helical wheel to help you
recognize amphiphilic regions.
Process set to plot with VERSATERM-TEK4105 attached to term:
using the tekd graphic interface.
HELICALWHEEL of what protein sequence ? humprp.pep
Begin (* 1 *) ? 176
End (* 231 *) ? 194
When your VERSATERM-TEK4105 attached to tty is ready, press <Return>. <rtn>
Rerun HelicalWheel on the very best, in your opinion, amphiphilic segment found in your Selected Molecule; however, add the option -figure=(your last name).wheel to the command line in order to create a Figure output file.
The seemingly best amphiphilic helix on the prion protein is shown next:

6) Using MSU to access Internet secondary structure predictions.
PredictProtein is an electronic mail service by the Protein Design Group at the European Molecular Biology Laboratory, Heidelberg, Germany. A multiple sequence alignment is performed by a weighted dynamic programming method (MaxHom, R.Schneider) and a secondary structure prediction is produced by the profile network method (PHD). The prediction is made by a new method rated at an expected 70.2% average accuracy for the three states helix, strand, and loop (Rost and Sander, PNAS).
NNPredict is a service of the San Francisco campus of the University of California which uses neural net technology to predict protein secondary structure. The basis of the prediction is a two-layer, feed-forward neural network. By adding neural network units that detect periodicities in the input sequence, they have modestly increased the secondary structure prediction accuracy. The use of predetermined tertiary structural classification causes a marked increase in accuracy. The best case prediction was 79% for the class of all-alpha proteins.
% msu

**************** MSU (Mail Server Utility) *********************
Version 1.4 (Apr 1994) - R. Fuchs, EMBL Data Library

Sequence loaded: None

Options:
L - Load sequence
H - Retrieve HELP file from server
R - Register with service
O - Other options
Q - Quit (exit program)

Enter L, H, R, O, or Q to quit: h

Retrieve HELP file on service: (press RETURN for next page)
1 - EMBL BLITZ server
2 - BIOCCELERATOR
3 - FLASH
4 - NCBI BLAST
5 - EMBL FASTA server
6 - NBRF/PIR FASTA server
7 - EMBL QuickSearch server
8 - CBRG (ETH Zuerich)
9 - BLOCKS server
10 - MotifFinder

Options:
M - Back to Main Menu

Enter 1-26 or M to go back to Main Menu: <rtn>

Retrieve HELP file on service: (press RETURN for next page)
11 - ProteinPredict
12 - nnpredict
13 - NetGene
14 - Grail
15 - GeneID (less than 20kb)
16 - GenMark
17 - PYTHIA (Rpts)
18 - PYTHIA (Alu)
19 - GenomeNet BLAST
20 - GenomeNet FASTA

Options:
M - Back to Main Menu

Enter 1-26 or M to go back to Main Menu: 11

Request mailed to PredictProtein@embl-heidelberg.de at Wed Sep 28 11:14:46 1994
The reply should soon arrive in your mailbox

PRESS <RETURN> TO CONTINUE... <rtn>
//////////////////////////////////////////////////////////////////////////////
Enter 1-26 or M to go back to Main Menu: 12

Request mailed to nnpredict@celeste.ucsf.edu at Wed Sep 28 11:21:43 1994
The reply should soon arrive in your mailbox
//////////////////////////////////////////////////////////////////////////////
Options:
M - Back to Main Menu

Enter 1-26 or M to go back to Main Menu: m

**************** MSU (Mail Server Utility) *********************
Version 1.4 (Apr 1994) - R. Fuchs, EMBL Data Library

Sequence loaded: None

Options:
L - Load sequence
H - Retrieve HELP file from server
R - Register with service
O - Other options
Q - Quit (exit program)

Enter L, H, R, O, or Q to quit: l

Enter file name: humprp.pir

**************** MSU (Mail Server Utility) *********************
Version 1.4 (Apr 1994) - R. Fuchs, EMBL Data Library

Sequence loaded: humprp.pir
Protein, 231 residues, 1-KKRPK...FLIVG-231

Services: (press RETURN for next page)
1 - EMBL BLITZ server
2 - BIOCCELERATOR
3 - FLASH
4 - NCBI BLAST
5 - EMBL FASTA server
6 - NBRF/PIR FASTA server
7 - CBRG (ETH Zuerich)
8 - BLOCKS server
9 - MotifFinder
10 - ProteinPredict

Options:
L - Load sequence
S - Set sequence limits (1 - 231)
H - Retrieve HELP file from server
R - Register with service
O - Other options
Q - Quit (exit program)

Enter 1-15, L, S, H, R, O, or Q to quit: <rtn>

**************** MSU (Mail Server Utility) *********************
Version 1.4 (Apr 1994) - R. Fuchs, EMBL Data Library

Sequence loaded: humprp.pep
Protein, 231 residues, 1-KKRPK...FLIVG-231

Services: (press RETURN for next page)
11 - nnpredict
12 - GenomeNet BLAST
13 - GenomeNet FASTA
14 - GenQuest (Q)
15 - ProDom

Options:
L - Load sequence
S - Set sequence limits (1 - 231)
H - Retrieve HELP file from server
R - Register with service
O - Other options
Q - Quit (exit program)

Enter 1-15, L, S, H, R, O, or Q to quit: 10

Service ProteinPredict
Neural network secondary protein structure prediction

Your name (or ?): Steve Thompson
Your address (part1) (or ?): WSU VADMS Center
Your address (part2) (or ?): Pullman WA 99164-4660
Your email address (or ?): thompson@ribozyme.vadms.wsu.edu
Enter sequence description (or ?): human prion protein

Request mailed to PredictProtein@embl-heidelberg.de at Wed Sep 28 11:42:36 1994
The reply should soon arrive in your mailbox

PRESS <RETURN> TO CONTINUE... <rtn>

**************** MSU (Mail Server Utility) *********************
Version 1.4 (Apr 1994) - R. Fuchs, EMBL Data Library

Sequence loaded: humprp.pep
Protein, 231 residues, 1-KKRPK...FLIVG-231
//////////////////////////////////////////////////////////////////////////////
Enter 1-15, L, S, H, R, O, or Q to quit: 11

Service nnpredict
Neural network secondary protein structure prediction

Select prediction options:
1 - n
2 - a
3 - b
4 - a/b
1-4 or ? [1]: <rtn>

Accept option 1 to specify that there is no experimental evidence for the structural class (all alpha, all beta, or alpha/beta) to which your protein belongs. Naturally, if you did have evidence that your protein was in one of these classes, you would specify the class and the program would provide more reliable estimates of protein secondary structure placement.

Request mailed to nnpredict@celeste.ucsf.edu at Wed Sep 28 14:41:24 1994
The reply should soon arrive in your mailbox

PRESS <RETURN> TO CONTINUE... <rtn>
//////////////////////////////////////////////////////////////////////////////
Options:
L - Load sequence
S - Set sequence limits (1 - 231)
H - Retrieve HELP file from server
R - Register with service
O - Other options
Q - Quit (exit program)

Enter 1-15, L, S, H, R, O, or Q to quit: q

Your prompt will be returned, and you can read the new HELP files in your mail to learn more about the services that you just used. Export the HELP file mail messages and print them out for your records. The predictions sometimes take overnight to come back, and PredictProtein should send two messages: a confirmation that the job was submitted, and the actual results. When the predictions have been returned to you, export them, print them out, and read them carefully in order to understand them. Getting your answer is a bit involved, at least in the PredictProtein output: it is very long, and the most relevant part isn't until near the end. Read through it all and it should make sense by the time you reach the most probable secondary structure estimate (they call these "subset" predictions) near the end. The NNPredict server's answer is more straightforward, but it has one small complication: there is no sequence numbering on it. My drastically abridged results from both servers follow. Notice that PredictProtein assembles a GCG style MSF alignment by default:
Date: Mon, 19 Feb 1996 09:28:17 +0100
To: Thompson@ribozyme.vadms.wsu.edu
From: <phd@stork.EMBL-Heidelberg.DE>
Message-Id: <199602190325.DAA04959@phenix.embl-heidelberg.de>
Subject: Predict-Protein
The following information has been received by the server:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# human prion protein
kkrpkpggwntggsrypgqgspggnryppqggggwgqphgggwgqphgggwgqphgggwg
qphgggwgqgggthsqwnkpskpktnmkhmagaaaagavvgglggymlgsamsrpiihfg
sdyedryyrenmhrypnqvyyrpmdeysnqnnfvhdcvnitikqhtvttttkgenftetd
vkmmervveqmcitqyeresqayyqrgssmvlfssppvillisfliflivg
///////////////////////////////////////////////////////////////////////////
--- MAXHOM ALIGNMENT: IN MSF FORMAT
MSF of: /home/phd/server/work/predict_e17121_23259.hssp from: 1 to: 231
/home/phd/server/work/predict_e17121_23259.ret_msf MSF: 231 Type: P
19-Feb-96 04:23:0 Check: 3969 ..
Name: predict_e171 Len: 231 Check: 414 Weight: 1.00
Name: prio_human Len: 231 Check: 414 Weight: 1.00
Name: prio_gorgo Len: 231 Check: 798 Weight: 1.00
Name: prio_pantr Len: 231 Check: 973 Weight: 1.00
//////////////////////////////////////////////////////////////////////////
Name: ch18_drome Len: 231 Check: 1221 Weight: 1.00
Name: roa1_drome Len: 231 Check: 3100 Weight: 1.00
Name: rin1_human Len: 231 Check: 8024 Weight: 1.00
Name: rhle_ecoli Len: 231 Check: 3590 Weight: 1.00
Name: k2c2_xenla Len: 231 Check: 9130 Weight: 1.00
Name: prpc_human Len: 231 Check: 9322 Weight: 1.00
//
1 50
predict_e171 KKRPKPGGWN TGGSRYPGQG SPGGNRYPPQ GGGGWGQPHG GGWGQPHGGG
prio_human KKRPKPGGWN TGGSRYPGQG SPGGNRYPPQ GGGGWGQPHG GGWGQPHGGG
prio_gorgo KKRPKPGGWN TGGSRYPGQG SPGGNRYPPQ GGGGWGQPHG GGWGQPHGGG
prio_pantr KKRPKPGGWN TGGSRYPGQG SPGGNRYPPQ GGGGWGQPHG GGWGQPHGGG
prio_ponpy KKRPKPGGWN TGGSRYPGQG SPGGNRYPPQ GGGGWGQPHG GGWGQPHGGG
prio_colgu KKRPKPGGWN TGGSRYPGQG SPGGNRYPPQ GGGGWGQPHG GGWGQPHGGG
//////////////////////////////////////////////////////////////////////////
ydh3_hsvsc .........S PGGPGGPGGP GGPGGPGGPG GPGGPGGPCG PGGPCGPGGP
k2c1_human ...RSGGGFS SGSAGitRRS GGGGGRFSSC GGGGGSFGAG GGFGSrrGGG
sala_drome ........IA QNGFGQVGQG GYGGQ..... ..GGFGGFGG IGGQAGFGGQ
ch18_drome RPRGGYGGAP VGGYAYQVQP ALTVKAIVPS YGGGYGGNHG GYGGasVPVP
roa1_drome KALPKQNDQQ GGGGGRGGPG GRAGGNRGNM GGGNYGNQNG GGNWNNGGNN
rin1_human .RRPIPGSDq sGGEGEPGEG EGDGEDVSSD SAPDSAP..G PAPKRPRGGG
rhle_ecoli .......... .......... .......... .....AEPIQ NGRQQRGGGG
k2c2_xenla .....GGGGG MGGGMGGGMG MGGGMGMGGG MGMGGGMGMG GGMGGGMGMG
prpc_human DDGPQQGPPQ QGGQQQQGPP PPQGKPqpPQ QGGHPPPPQG RPQGPPQQGG
//////////////////////////////////////////////////////////////////////////
201 231
predict_e171 QAYYQRGSSM VLFSSPPVIL LISFLIFLIV G
prio_human QAYYQRGSSM VLFSSPPVIL LISFLIFLIV G
prio_gorgo QAYYQRGSSM VLFSSPPVIL LISFLIFLIV G
prio_pantr QAYYQRGSSM VLFSSPPVIL LISFLIFLIV G
prio_ponpy QAYYQRGSSM VLFSSPPVIL LISFLIFLIV G
prio_colgu QAYYQRGSSM VLFSSPPVIL LISFLIFLIV G
prio_atege QAYYQRGSSM VLFSSPPVIL LISFLI.... .
prio_prefr QAYYQRGSSM VFFSSPPVIL LISFLIFLIV G
//////////////////////////////////////////////////////////////////////////
k2c1_human E......... .......... .......... .
sala_drome .......... .......... .......... .
ch18_drome .......... .......... .......... .
roa1_drome .......... .......... .......... .
rin1_human .......... .......... .......... .
rhle_ecoli .......... .......... .......... .
k2c2_xenla .......... .......... .......... .
prpc_human .......... .......... .......... .
//////////////////////////////////////////////////////////////////////////
About the network method
~~~~~~~~~~~~~~~~~~~~~~~
The network procedure is described in detail in:
1) Rost, Burkhard; Sander, Chris:
Prediction of protein structure at better than 70% accuracy.
J. Mol. Biol., 1993, 232, 584-599.
A brief description is given in:
Rost, Burkhard; Sander, Chris:
Improved prediction of protein secondary structure by use of se-
quence profiles and neural networks.
Proc. Natl. Acad. Sci. U.S.A., 1993, 90, 7558-7562.
The PHD mail server is described in:
2) Rost, Burkhard; Sander, Chris; Schneider, Reinhard:
PHD - an automatic mail server for protein secondary structure
prediction.
CABIOS, 1994, 10, 53-60.
The latest improvement steps (up to 72%) are explained in:
3) Rost, Burkhard; Sander, Chris:
Combining evolutionary information and neural networks to predict
protein secondary structure.
Proteins, 1994, 19, 55-72.
To be quoted for publications of PHD output:
Papers 1-3 for the prediction of secondary structure and the pre-
diction server.
About the input to the network
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The prediction is performed by a system of neural networks.
The input is a multiple sequence alignment. It is taken from an HSSP
file (produced by the program MaxHom:
Sander, Chris & Schneider, Reinhard: Database of Homology-Derived
Structures and the Structural Meaning of Sequence Alignment.
Proteins, 1991, 9, 56-68.
For optimal results the alignment should contain sequences with varying
degrees of sequence similarity relative to the input protein.
The following is an ideal situation:
+-----------------+----------------------+
| sequence:       | sequence identity    |
+-----------------+----------------------+
| target sequence | 100 %                |
| aligned seq. 1  |  90 %                |
| aligned seq. 2  |  80 %                |
| ...             | ...                  |
| aligned seq. 7  |  30 %                |
+-----------------+----------------------+
//////////////////////////////////////////////////////////////////////////
The resulting network (PHD) prediction is:
PredictProtein@EMBL-Heidelberg.DE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
PHD: Profile fed neural network systems from HeiDelberg
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Prediction of:
- secondary structure, by PHDsec
- solvent accessibility, by PHDacc
- and helical transmembrane regions, by PHDhtm
Author: Burkhard Rost
EMBL, Heidelberg, FRG
Meyerhofstrasse 1, 69 117 Heidelberg
Internet: Predict-Help@EMBL-Heidelberg.DE
All rights reserved.
The network systems are described in:
PHDsec: B Rost & C Sander: JMB, 1993, 232, 584-599.
B Rost & C Sander: Proteins, 1994, 19, 55-72.
PHDacc: B Rost & C Sander: Proteins, 1994, 20, 216-226.
PHDhtm: B Rost, R Casadio, P Fariselli & C Sander,
Prot. Science, 4, 521-533.
Some statistics
~~~~~~~~~~~~~~
Percentage of amino acids:
+--------------+--------+--------+--------+--------+--------+
| AA:          |    G   |    P   |    Q   |    S   |    Y   |
| % of AA:     |  18.6  |   7.4  |   6.5  |   6.1  |   5.6  |
+--------------+--------+--------+--------+--------+--------+
| AA:          |    V   |    T   |    R   |    N   |    M   |
| % of AA:     |   5.2  |   5.2  |   4.8  |   4.8  |   4.3  |
+--------------+--------+--------+--------+--------+--------+
| AA:          |    K   |    H   |    I   |    E   |    A   |
| % of AA:     |   4.3  |   4.3  |   3.9  |   3.9  |   3.5  |
+--------------+--------+--------+--------+--------+--------+
| AA:          |    W   |    L   |    F   |    D   |    C   |
| % of AA:     |   3.0  |   3.0  |   2.6  |   2.2  |   0.9  |
+--------------+--------+--------+--------+--------+--------+
Percentage of secondary structure predicted:
+--------------+--------+--------+--------+
| SecStr:      |    H   |    E   |    L   |
| % Predicted: |  11.7  |  15.6  |  72.7  |
+--------------+--------+--------+--------+
According to the following classes:
all-alpha: %H>45 and %E< 5; all-beta : %H<5 and %E>45
alpha-beta : %H>30 and %E>20; mixed: rest,
this means that the predicted class is: mixed class
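(An aside from the exercise, not part of the server's reply.) The class rule quoted just above is easy to apply by hand, but a few lines of Python make the order of the tests explicit. This is only a sketch of the thresholds as the reply states them; the function name is my own, and testing the classes in the order listed is an assumption.

```python
def predicted_class(pct_helix, pct_strand):
    """Apply the structural-class thresholds quoted in the PHD reply."""
    if pct_helix > 45 and pct_strand < 5:
        return "all-alpha"
    if pct_helix < 5 and pct_strand > 45:
        return "all-beta"
    if pct_helix > 30 and pct_strand > 20:
        return "alpha-beta"
    return "mixed"  # everything that fits none of the classes above


# Percentages reported for the prion protein: 11.7 % H and 15.6 % E.
print(predicted_class(11.7, 15.6))  # prints "mixed"
```

Note that "mixed" here simply means "none of the above"; with only 11.7 % helix and 15.6 % strand predicted, the prion protein falls well short of every named class.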
PHD output for your protein
~~~~~~~~~~~~~~~~~~~~~~~~~~
//////////////////////////////////////////////////////////////////////////
Abbreviations: PHDsec
~~~~~~~~~~~~~~~~~~~~
sequence:
AA : amino acid sequence
secondary structure:
HEL: H=helix, E=extended (sheet), blank=other (loop)
PHD: Profile network prediction HeiDelberg
Rel: Reliability index of prediction (0-9)
detail:
prH: 'probability' for assigning helix
prE: 'probability' for assigning strand
prL: 'probability' for assigning loop
note: the 'probabilites' are scaled to the interval 0-9, e.g.,
prH=5 means, that the first output node is 0.5-0.6
subset:
SUB: a subset of the prediction, for all residues with an expected
average accuracy > 82% (tables in header)
note: for this subset the following symbols are used:
L: is loop (for which above " " is used)
".": means that no prediction is made for this residue, as the
reliability is: Rel < 5
Abbreviations: PHDacc
~~~~~~~~~~~~~~~~~~~~
solvent accessibility:
3st: relative solvent accessibility (acc) in 3 states:
b = 0-9%, i = 9-36%, e = 36-100%.
PHD: Profile network prediction HeiDelberg
Rel: Reliability index of prediction (0-9)
P_3: predicted relative accessibility in 3 states
note: for convenience a blank is used intermediate (i).
10st:relative accessibility in 10 states:
= n corresponds to a relative acc. of n*n %
subset:
SUB: a subset of the prediction, for all residues with an expected
average correlation > 0.69 (tables in header)
note: for this subset the following symbols are used:
"I": is intermediate (for which above " " is used)
".": means that no prediction is made for this residue, as the
reliability is: Rel < 4
Abbreviations: PHDhtm
~~~~~~~~~~~~~~~~~~~~
secondary structure:
HL: T=helical transmembrane region, blank=other (loop)
PHD: Profile network prediction HeiDelberg
PHDF:filtered prediction, i.e., too long transmembrane segments
are split, too short ones are deleted
Rel: Reliability index of prediction (0-9)
detail:
prH: 'probability' for assigning helical transmembrane region
prL: 'probability' for assigning loop
note: the 'probabilites' are scaled to the interval 0-9, e.g.,
prH=5 means, that the first output node is 0.5-0.6
subset:
SUB: a subset of the prediction, for all residues with an expected
average accuracy > 82% (tables in header)
note: for this subset the following symbols are used:
L: is loop (for which above " " is used)
".": means that no prediction is made for this residue, as the
reliability is: Rel < 5
protein: predict length 231
....,....1....,....2....,....3....,....4....,....5....,....6
AA |KKRPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQPHGGGWG|
PHD sec | |
Rel sec |999998998778877788877777888876787654446776544457876444578655|
detail:
prH sec |000000001111111110111111110111111123332112233221112223211122|
prE sec |000000000000000000000000000000000000000000000000000000000000|
prL sec |999998988788887888887888888887887766667787666678887666788776|
subset: SUB sec |LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL...LLLLL...LLLLL...LLLLLL|
ACCESSIBILITY
3st: P_3 acc |eeeeeeebeeebbee eeeb eebee eeeebbb b eeb b b e bbb b eebbb b|
10st: PHD acc |997877906760076469705770775677900050576050505750005057600050|
Rel acc |995455311301031002411321331134300002131200021300000113001001|
subset: SUB acc |eeeeee............e..........e..............................|
....,....7....,....8....,....9....,....10...,....11...,....12
AA |QPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPIIHFG|
PHD sec | E |
Rel sec |447887655565544687999997776665643347865455432213332488411114|
detail:
prH sec |331111222222232211000001111212123321111122224442123210001100|
prE sec |000000000000000000000000000000000000012200010112211001244442|
prL sec |667887766777766788999988787776765568876666655334555688654446|
subset: SUB sec |..LLLLLLLLLLL..LLLLLLLLLLLLLLLL....LLLL.LL..........LL......|
ACCESSIBILITY
3st: P_3 acc | eebee bebbbe eeeeeeeeeeeebeebbbbbbbbb bbbbbb bbbebbbebbbebb|
10st: PHD acc |576067507000657777778777770760000000005000000400060006000600|
Rel acc |130102123110113324444544331412233012321232025101300011111132|
subset: SUB acc |.................eeeeeee...e................b...............|
....,....13...,....14...,....15...,....16...,....17...,....18
AA |SDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCVNITIKQHTVTTTTKGENFTETD|
PHD sec | EEEE EEE EEEEEEEEEEEEE HHH|
Rel sec |775553235641247895365237655689864521699999995763526987655245|
detail:
prH sec |012221121113200000000000110100000000000000000000000000001567|
prE sec |100012312122321002576530012000016754788998886776641011122100|
prL sec |776765456754468887312458766788872235100000112223257887776322|
subset: SUB sec |LLLLL...LL....LLLL.EE..LLLLLLLLL.E..EEEEEEEEEEE.E.LLLLLLL..H|
ACCESSIBILITY
3st: P_3 acc |ee eeeb eebbeebbb bbb bbee ee e bbbbbbbbbbeebbbebbbeeeebbebb|
10st: PHD acc |663676046700760005000400764665740000000000660006002676700600|
Rel acc |110121012301320000424100310100206624973859211152100231411112|
subset: SUB acc |..................b.b...........bb.bbb.bbb....b.......e.....|
....,....19...,....20...,....21...,....22...,....23...,....24
AA |VKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSPPVILLISFLIFLIVG|
PHD sec |HHHHHHHHHHHHHHHHHHHHHHHH EEEEE EEEEE EEEEE |
Rel sec |989999998853457732568653399479998399668875111125529|
detail:
prH sec |989999998875667765678765200000000000000012344432100|
prE sec |000000000012221000011122200378998300178876101456650|
prL sec |000000000001011134210001599610001699810000443101139|
subset: SUB sec |HHHHHHHHHHH..HHH..HHHHH..LL.EEEEE.LLLEEEEE.....EE.L|
ACCESSIBILITY
3st: P_3 acc |bebbe bbebbbbbebeeebeb eeeeebbbbbee ebbbbbbbbbbbbbe|
10st: PHD acc |060074007000006077707056779700000965700000000000009|
Rel acc |325630995398302047505511446312335401369999679887525|
subset: SUB acc |..bb..bbe.bb....eee.eb..eee.....be...bbbbbbbbbbbb.e|
PHDhtm Helical transmembrane prediction
note: PHDacc and PHDsec are reliable for water-
soluble globular proteins, only. Thus,
please take the predictions above with
particular caution wherever transmembrane
helices are predicted by PHDhtm!
PHDhtm
....,....1....,....2....,....3....,....4....,....5....,....6
AA |KKRPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQPHGGGWG|
PHD htm | |
Rel htm |999999999999999999999999999999999999999999999999999999999999|
detail:
prH htm |000000000000000000000000000000000000000000000000000000000000|
prL htm |999999999999999999999999999999999999999999999999999999999999|
subset: SUB htm |LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL|
....,....7....,....8....,....9....,....10...,....11...,....12
AA |QPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPIIHFG|
PHD htm | |
Rel htm |999999999999999999999999999999999999999999999999999999999999|
detail:
prH htm |000000000000000000000000000000000000000000000000000000000000|
prL htm |999999999999999999999999999999999999999999999999999999999999|
subset: SUB htm |LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL|
....,....13...,....14...,....15...,....16...,....17...,....18
AA |SDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCVNITIKQHTVTTTTKGENFTETD|
PHD htm | |
Rel htm |999999999999999999999999999999999999999999999999999999999999|
detail:
prH htm |000000000000000000000000000000000000000000000000000000000000|
prL htm |999999999999999999999999999999999999999999999999999999999999|
subset: SUB htm |LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL|
....,....19...,....20...,....21...,....22...,....23...,....24
AA |VKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSPPVILLISFLIFLIVG|
PHD htm | TTTTTTTTTTTTTTTTTTTT|
Rel htm |999999999999999999999999999986223467888888888887764|
detail:
prH htm |000000000000000000000000000001366788999999999998887|
prL htm |999999999999999999999999999998633211000000000001112|
subset: SUB htm |LLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....HHHHHHHHHHHHHHHH.|
Prediction of transmembrane regions (PHDhtm)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Note: The accuracy of predicting helical trans-membrane regions is
some 95%. In a test on 69 proteins only one was not predicted to be
a trans-membrane protein (2mlt). PHDsec for the prediction of glo-
bular proteins predicted this protein more accuractely, than PHDhtm
for trans-membrane proteins. Vice versa, about 5% out of 300 globu-
lar proteins were missclassified as trans-membrane molecules. These
results have two practical consequences:
(i) if you know that your sequence is partly in a membrane and
PHDhtm does not predict a clear membrane region:
-> try PHDsec, it may be more accurate although in general not
suited for membrane proteins.
(ii) if you assume your sequence is not at all in a membrane and
PHDhtm does predict a membrane segment:
-> ignore the trans-membrane prediction.
For residues predicted to be outside of the lipid bilayer (predicted
as loop, PHDsec should give reasonably accurate results, provided the
regions sticking out of the membrane or long enough.
//////////////////////////////////////////////////////////////////////////
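The "subset" rows are the part of the PHD reply worth copying into your notes, and the abbreviations in the reply say how they are derived: a residue keeps its predicted state only if its reliability index is high enough, a blank is written as L, and everything else becomes a dot. The sketch below rebuilds a SUB sec row from the PHD sec and Rel sec rows. The function and variable names are my own, and treating "Rel < 5 means no prediction" as "Rel of 5 or more is kept" is my reading of the reply rather than anything taken from the server's code.

```python
def phd_subset(pred, rel, cutoff=5, blank="L"):
    """Rebuild a PHD SUB row from a prediction row and its Rel row."""
    if len(pred) != len(rel):
        raise ValueError("prediction and reliability rows differ in length")
    out = []
    for state, r in zip(pred, rel):
        if int(r) < cutoff:
            out.append(".")  # too unreliable: no prediction is made
        else:
            out.append(blank if state == " " else state)
    return "".join(out)


# First 60 residues of the prion run. The "PHD sec" row is blank (all
# loop) for this stretch; writing it as 60 spaces is an assumption,
# because the spacing of that row did not survive in the listing.
pred_row = " " * 60
rel_row = "999998998778877788877777888876787654446776544457876444578655"
print(phd_subset(pred_row, rel_row))
```

Compare what it prints with the SUB sec row for residues 1-60 in the reply. According to the abbreviations, the PHDacc subset uses Rel < 4 as its limit and "I" for a blank, so for those rows you would call phd_subset(row, rel, cutoff=4, blank="I").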
From nnpredict@celeste.ucsf.edu Mon Feb 19 10:58:42 1996
Date: Sun, 18 Feb 1996 18:35:02 -0800 (PST)
From: the nnpredict server <nnpredict@celeste.ucsf.edu>
To: thompson@ribozyme.vadms.wsu.edu
Cc: nnpredict-request@celeste.ucsf.edu
Subject: Reply to your nnpredict query
Sequence:
>HUMPRP
KKRPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGG
THSQWNKPSKPKTNMKHMAGAAAAGAVVGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPM
DEYSNQNNFVHDCVNITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSP
PVILLISFLIFLIV
Secondary structure prediction for HUMPRP
(option n):
------------------------------------------------------------------------
--------------HHHHHHHHHHHHEE------EE------EEEE--------HHH---------------
---------------EEE--E-E-E--------HHHHHHHHHHH-HH------H--HHHHE----EEEE---
--EEEEHHHHEEE-
(H = helix, E = strand, - = no prediction)
nnpredict was written by Donald Kneller.
Copyright (C) 1991 Regents of the University of California.
Try the NEW nnpredict Web page:
http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html
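Because the nnpredict reply carries no sequence numbering, you have to count residues yourself before you can compare it with anything else. One way is to join the wrapped prediction lines and let a short script report each predicted segment with its first and last residue. This is a sketch only: the function name is my own, and the numbering it gives starts at the first residue of the sequence as the server echoed it, which may not match the numbering in other files.

```python
from itertools import groupby


def segments(prediction, skip="-"):
    """Return (state, first, last) for each run, counting from residue 1."""
    runs = []
    pos = 1
    for state, group in groupby(prediction):
        length = len(list(group))
        if state not in skip:
            runs.append((state, pos, pos + length - 1))
        pos += length
    return runs


# The four prediction lines from the nnpredict reply above, joined
# into a single string before counting.
nn_lines = [
    "------------------------------------------------------------------------",
    "--------------HHHHHHHHHHHHEE------EE------EEEE--------HHH---------------",
    "---------------EEE--E-E-E--------HHHHHHHHHHH-HH------H--HHHHE----EEEE---",
    "--EEEEHHHHEEE-",
]
for state, first, last in segments("".join(nn_lines)):
    print(state, first, last)
```

Short runs of one or two residues are unlikely to be real helices or strands, so treat them with caution when you transfer the ranges to your notes.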
7) Creating a comparison file.
Use all of the data generated in this exercise to create a listing of your Selected Molecule similar to that shown in the introduction for insulin-like growth factor II. Start with any appropriate file; the Gap comparison suggested below between your sequence and the corresponding entry from NRL_3D will do nicely. Use pico to add all of the relevant information you found to this file. Be sure to include the predictions obtained through MSU. One warning, as noted before: the numbering of the sequence that you've been using all along often will not correspond to the numbering in the PDB file, so you may want to run Gap between your peptide file and the NRL_3D entry to ensure that the two are aligned. When all of the information has been added, rename the file to have your last name; use the word comparison as an extension.
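If you would rather not line everything up by hand in pico, a small script can stack the sequence and each prediction in numbered blocks that you then paste into the comparison file. The sketch below is one possible layout, not a required format; the function name, the row labels, and the block width of 60 are all assumptions, and every row must already be the same length (pad or trim the predictions first).

```python
def listing(rows, width=60):
    """Stack equal-length annotation rows in blocks labelled by residue range."""
    lengths = {len(text) for text in rows.values()}
    if len(lengths) != 1:
        raise ValueError("all rows must have the same length")
    total = lengths.pop()
    label_w = max(len(name) for name in rows)
    out = []
    for start in range(0, total, width):
        end = min(start + width, total)
        # Header line: the residue range covered by this block.
        out.append(f"{'':<{label_w}}  {start + 1}-{end}")
        for name, text in rows.items():
            out.append(f"{name:<{label_w}}  {text[start:end]}")
        out.append("")  # blank line between blocks
    return "\n".join(out)


# Toy example with made-up rows; substitute your own sequence and predictions.
print(listing({"AA": "KKRPKPGGWN", "nnpredict": "---HHHH-EE"}, width=5))
```

Reading down a column then shows at a glance where the methods agree and where they conflict.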
8) Finishing up
Copy over a report form for this exercise, rename it to have your last name and then go into the file with the editor to fill in the report. Check to make sure your comparison file is complete and named with your last name. Finally, send all of the Figure files created in the exercise and the above two files, the report form and the comparison file, to the teacher account.
You've now gone all the way from probe design, through fragment entry, up to structural prediction. As you can see the further we get into theoretical realms, the more loosely we have to entertain the results -- reality and predictions don't always quite match. Oftentimes the resultant predictive data derived from sequence analysis will directly conflict with the known structural data, but methods also sometimes agree. Newly discovered genes usually have no structural information available; we must try and use whatever is available, always keeping in mind the reliability of the methods.
You began this exercise series on common ground with the molecular modellers, then split into separate fields, and are now merging back into the commonality of structure and function. Hopefully you will have gained a greater appreciation for each other's specialities. If your protein is very similar to another protein, as identified by searching algorithms, and belongs to a distinct family, then many parallels may be drawn. In fact, even three-dimensional modelling without crystal coordinates is possible. This is "homology modelling." It will often lead to remarkably accurate representations if the similarity is great enough between your protein and one in which the structure has been solved through experimental means. Automated homology modelling is even now available through the Web at Amos Bairoch's Expasy server in Switzerland.
"THINK about what you're doing; use your knowledge of the molecular system involved to guide both your interpretation of results and your direction of inquiry; use as much information as possible; and do NOT blindly accept everything the computer offers you."
Gunnar von Heijne in his very readable treatise, Sequence Analysis in Molecular Biology: Treasure Trove or Trivial Pursuit (1987), provides a very appropriate conclusion:
". . . if any lesson is to be drawn ... it surely is that to be able to make a useful contribution one must first and foremost be a biologist, and only second a theoretician . . . . We have to develop better algorithms, we have to find ways to cope with the massive amounts of data, and above all we have to become better biologists. But that's all it takes."
Internet database services provided by the EMBL Data Library:
All databases from EMBL are available via ftp, Gopher, the Web, or limited E-Mail server access.