'96 BC/BP 378

Week 4

Working with proteins - section 2. You will explore longer protein primary sequence information for insight on the hydrophobic nature of proteins as it relates to crossing membranes; check sequences for the possible locations of sequence patterns with functional significance and do simple pattern searching; gauge the usefulness of secondary structure predictions based on primary sequences; and use other ways to confirm secondary structure predictions for proteins.

Author:

Susan Jean Johns

Protein Background Information

Proteins are built from twenty amino acids, each of which varies in the size, shape, charge, hydrogen-bonding capacity and chemical reactivity of its side chain. These differences affect how the proteins composed of these amino acids fold or organize into both secondary and tertiary structures.

Hydrophobicity:
Individual amino acids react to water in set ways. Amino acids with philic or water loving side chains form hydrogen bonds with water. Those amino acids with phobic or water hating side chains cluster together, away from the water, rather than hydrogen bonding with it. When amino acids are linked together into a protein, the hydrophobic nature of their side chains (now the only regions of the protein capable of interaction with its surrounding environment - outside of the vastly overwhelmed charges at the ends of the protein chain) has a great deal of influence on how the protein behaves.

Proteins attempt to fold in response to their hydrophobic side chains. Phobic side chains tend to gather together in the inner region of the protein. Philic side chains tend to be on the outside surface of the protein where they can interact with the surrounding environment. There may be individual phobic or philic amino acids in places where they might not be expected due to a specific sequence; however, the overall hydrophobicity of the given region should match this general convention - phobic for the inner portions of a protein and philic for the surface.

As proteins increase in size they organize themselves into more complex structures. The longer a protein, the more likely it is to contain defined secondary structural elements such as helixes, sheets and turns. The function a protein performs dictates its spatial conformation.

Proteins that span membranes must have sequence section(s) that can exist within the nonpolar or phobic environment existing within the membrane. When the hydrophobic nature of proteins known to span membranes was first looked at, it was found that the actual sections of the protein spanning the membrane are hydrophobic in nature and suspected of having helical structures. Such helical structures were found to be between 18 and 28 amino acids long with the average length being about 21. Based on the data from known membrane spanning proteins, computer software has been developed to predict such transmembrane regions in a protein. Possible membrane spanning helical regions can also be spotted by examining the hydrophobicity values for the protein and locating regions which are phobic for a continuous span of about 21 amino acids. More recently, β-sheet structures have been found that also span membranes. Sheet spanning sections are not easy to detect and you need to see the structure to know that they are there.

Hydrogen bonding:
Hydrogen bonding helps stabilize the folded structure of a group of amino acids that are close to one another in a linear sequence. When the steric relationships of these close amino acids are regular, periodic structures are created. Such periodic structures are known as α-helixes, β-sheets and turns.

In 1948 Linus Pauling predicted that helixes and sheets would be found in protein structures. His prediction was based solely on the idea that the potential hydrogen bonding possible in such structures would increase their stability and therefore make them more probable. Later improvements in x-ray diffraction techniques made it possible to solve protein three-dimensional structures. The predicted structural elements were there.


Background on Functional Patterns

As the number of proteins whose structures have been solved grows, so does the understanding of how the protein's spatial conformation and its function are related. The actual active site of a protein may be a complex interrelationship between sections of the protein sequence that are very far apart from one another in the linear sequence, yet close to one another when the folded structure is taken into account. As more and more structures from the same protein family have been solved, it has been found that proteins in the same family share a spatial conformation as well as the functional process. This has led to the belief that similar protein functions will have similar spatial conformations. The entire process of homology modelling is based on this idea.

It takes a lot of time to solve a protein structure. There is a lot more information on protein primary sequences than there is on actual structures. Therefore, a great deal of analysis has been done on protein primary sequences to find out just how the members of a protein family and their active sites are related to one another. Do parts of the sequence remain constant, and if so, how do these sections relate to the function of that particular protein family? Found patterns are then checked against the protein databases to see if the generated patterns are specific enough for the desired function or too general to be of much use. Great effort has gone into developing characteristic patterns which can serve as the template for a given protein function. If there is any problem with these characteristic patterns or motifs, it is that they are generated from purely primary sequence data and therefore may only represent a small portion of the entire sequence necessary to form the complete spatial conformation for that biological function.

The development of such a motif begins with noticing an unusual section of sequence in a protein. This section might be as simple as having a high local concentration of a single amino acid. It might also be the simple repeating of a given amino acid at set intervals within the sequence. The initial noticed pattern may be only a small portion of the final motif. It depends on whether the observed pattern is really the part of a region defining a functional activity and just how complex that final pattern turns out to be. Not all unusual sequence regions define functional portions of the protein.

Once such a pattern is noticed, it is checked against a series of sequences belonging to that protein family. If the observed pattern is found in all the members of the test set, it is further refined and run against an expanded dataset to see just how specific it is. This process is repeated and expanded until checks are made against entire databases. Since proteins performing similar functions share functional motif(s), it is important to determine just what protein families contain the developed motif and why. Once a motif has passed all these tests it is included in a library of motifs and used in software to characterize newly determined unknown sequences.
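The pattern checking described above can be tried out on a small scale. What follows is only a rough sketch, not the software used in this course: it translates a small subset of the PROSITE-style pattern notation into a regular expression and reports where the pattern occurs. The function names prosite_to_regex and find_motif, and the toy sequence, are made up for illustration. The pattern itself is the N-glycosylation site pattern.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex.

    Only a subset of the notation is handled (assumption): single residues,
    x for any residue, [..] alternatives, {..} exclusions, (n) or (n,m) repeats.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split off an optional repeat count such as x(2) or x(2,4).
        match = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."                      # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any amino acid except these
        # [ST]-style alternatives are already valid regex character classes.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)

def find_motif(sequence: str, pattern: str):
    """Return (1-based start, matched text) for every hit, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]

# The sequence below is a made-up toy example, not a real protein.
hits = find_motif("MKNGTAANPSLNVSW", "N-{P}-[ST]-{P}")
print(hits)
```

A short pattern like this one hits a great many unrelated sequences, which is exactly why a candidate motif has to be run against whole databases before it is trusted.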


Secondary Structural Elements

The secondary structure of a protein is defined by the local conformation of its polypeptide backbone (the path the backbone travels through space). These local conformations have regular folding patterns known as helixes, sheets and turns. In order to better understand these folding patterns, you must know the nature of the peptide bond that links a protein's constituent amino acids together.

X-ray studies have shown that the carbon-nitrogen bond of the peptide linkage has considerable double-bond character. Therefore, rotations of groups attached to either the carbon or the nitrogen atoms about this bond are severely limited. This makes the NH-CO linkage very rigid. However, the alpha carbon atoms of the amino acid attached to this linkage have relatively free rotation, allowing the protein chain to form different conformations.

One such conformation is the α-helix. This is a rod-shaped structure. The tightly coiled protein backbone forms the inner part of the rod and the amino acid side chains extend outward from this core. This structure is stabilized by hydrogen bonding between NH and CO groups of the backbone. Each backbone CO group is hydrogen bonded to the NH group of the amino acid four residues down the linear sequence. In a standard α-helix each residue is related to the next one by a translation of 1.5 angstroms along the helical axis and a rotation of 100 degrees. A variety of helical structures have been found in proteins. Not all helixes have the same characteristics. Helixes can also be used to form more complex structures such as the helical coiled coils found in muscles.

Another conformation is that of the β-sheet. In the β-sheet the protein chain is almost fully extended. The axial distance between adjacent amino acids in a sheet is 3.5 angstroms as opposed to that of 1.5 angstroms in an α-helix. A sheet is stabilized by hydrogen bonding between different sections of a protein sequence. The adjacent chains comprising a sheet can run in the same direction (parallel) or in opposite directions (antiparallel).
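The helix and sheet figures quoted above can be turned into a few useful numbers with simple arithmetic. The short sketch below works out the residues per turn and the pitch of a standard helix, and compares how far a helix and an extended strand reach along their axes. The function names are made up for illustration; the three constants come from the text.

```python
HELIX_RISE = 1.5       # angstroms per residue along the helix axis (from the text)
HELIX_ROTATION = 100   # degrees of rotation per residue (from the text)
SHEET_RISE = 3.5       # angstroms per residue in a sheet strand (from the text)

def residues_per_turn(rotation_deg=HELIX_ROTATION):
    """A full turn is 360 degrees, so divide by the rotation per residue."""
    return 360.0 / rotation_deg

def helix_pitch(rise=HELIX_RISE, rotation_deg=HELIX_ROTATION):
    """Distance travelled along the axis in one full turn."""
    return residues_per_turn(rotation_deg) * rise

def span_length(n_residues, rise):
    """Axial length covered by n_residues at the given rise per residue."""
    return n_residues * rise

print(round(residues_per_turn(), 2))   # residues in one turn of the helix
print(round(helix_pitch(), 2))         # angstroms per turn
print(span_length(21, HELIX_RISE))     # a 21-residue helix
print(span_length(9, SHEET_RISE))      # a 9-residue extended strand
```

With these figures a 21-residue helix and a 9-residue extended strand cover the same axial distance, which is one reason sheet sections that span a membrane are so much shorter, and so much harder to spot in a sequence, than helical ones.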

A third conformation is that of the β-turn. Many proteins have compact globular shapes because the protein backbone makes numerous reversals in direction. Many of these changes in direction result from a structural element known as a β-turn. In a β-turn, the CO of one amino acid is hydrogen bonded to the NH of another 3 residues down the linear sequence. Thus the protein backbone can abruptly change direction. β-turns often connect antiparallel β-sheets. These structures are also known as reverse turns or hairpin bends.


Protein Secondary Structure Information

As more protein structures were determined, the beginnings of possible folding patterns were observed. Soluble globular proteins have started to be understood in general terms. "The principle underlying the structure of helices, sheets, and turns is the simultaneous formation of hydrogen bonds by buried peptide groups and the retention of single residue conformations close to those of minimum energy. The shape of the helix and sheet structures makes these structural elements pack together in a small number of relative orientations. The links between secondary structures tend to be right-handed and short, and do not form knots." As a result, globular proteins usually fold into a few common patterns. These proteins can roughly be grouped into four classes: all-alpha, all beta, mixed alpha/beta formed from beta-alpha-beta units, and alpha + beta where the helix and sheet units are segregated.

Whether a given section of a peptide folds to form a helix, a sheet or a turn primarily depends on the preferred conformations of the constituent residues and the packing quality of the surface formed. Prediction schemes have been devised, with relative success, based on only local or semi-local sequence patterns due to this local characteristic of folding forces. Once past these generalities, the detailed mechanisms of folding are only vaguely understood.

Even as the body of determined structures grows, questions remain as to the relationship between solved crystal structures and that of the proteins in solution. What effect do ionic conditions have on secondary structure? What effect does protein concentration have? Do crystals with different space groups produce the same or similar protein structures? Do x-ray and NMR (solution) structure determinations on the same protein agree with one another? If not, what are the causes of the differences?

Secondary Structure Prediction:
Indications of the possible structures of proteins came from initial studies on polypeptides. As protein structures were solved it appeared that the conformation of a residue in a protein was the same as in the homopolymeric form. This correlation is far from perfect, however.

After a number of protein structures had been solved, scientists attempted to do statistical studies on the data to determine any preference on the part of an individual amino acid for a given type of secondary structure. These efforts resulted in the empirical prediction schemes of Chou-Fasman and Garnier-Robson (GOR). The Chou-Fasman method is a group of rules applied to a given sequence. It is an ambiguous method that has proven difficult to automate. The Garnier-Robson method is based on consistent application of information theory with auxiliary information from circular dichroism used to bias its prediction. This method is unambiguous and easy to automate.

Another approach is to look for periodicity in regular secondary structures. Such information is best seen from helical wheel diagrams where the view down the helical axis groupings shows similar kinds of amino acids. The regular appearance of apolar residues spaced 3 or 4 residues apart could be a pattern that helixes recognize, while sheets might look for uniformly apolar sections - if completely buried within a protein - or alternating polar and apolar residues if on the surface. Some proteins display these patterns to a certain extent. Such studies have resulted in the prediction scheme of Lim and Eisenberg's hydrophobic moment technique.

Other scientists have looked at all the possible structural conformations for various sequence sections that exist in the known structures and tried to form prediction schemes based on their findings. The thought is that a similar sequence will have similar secondary structures wherever it is found. To do this, a measure of homology (or similarity) needs to be established between the studied sequences and a weighting of possible conformations found in order to form a final prediction. The algorithms of Nishikawa and Ooi, Levin, and Sweet are all based on this theme with the differences resulting from the comparison choices made and the scoring systems used.

Prediction Reliability:
Studies done on the reliability of the various prediction schemes show disheartening results. Depending on whether three or four secondary structural elements are used, random chance would result in either a 33% or a 25% chance of a prediction being correct. The different approaches touched on here only improve the chances of the prediction being correct to 45% to 55%. Reported higher percentages usually are the result of a biased data set, and not an actual improvement in the technique devised. The application of neural net analysis to the area of secondary structure prediction has increased the reliability to about 72%.
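To make these percentages concrete, here is a small sketch of how a per-residue reliability figure can be computed: the predicted state of each residue is compared with the observed state and the fraction that agree is reported. The one-letter state codes (H for helix, E for sheet, C for everything else) and the two strings are made up for illustration and do not come from any real protein or program.

```python
def percent_correct(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted state matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("prediction and observation must be the same length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)

def chance_level(n_states: int) -> float:
    """Expected percent correct when guessing equally among n_states."""
    return 100.0 / n_states

# Toy strings (assumption): one state letter per residue.
predicted = "HHHHHCCEEEECCHHHHHCC"
observed = "HHHHCCCEEEEECHHHHCCC"

print(percent_correct(predicted, observed))
print(round(chance_level(3), 1), chance_level(4))
```

Notice that every mismatch in the toy example falls at the end of a structural element. A score computed this way depends heavily on where the reference assignment places those boundaries.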

One possible problem is the length of the structural elements being sought. While helixes are usually at least 5 residues long and can be much longer, sheets tend to be relatively short. Sheet elements as small as a single residue have been reported in some structures. It takes at least two sheet elements to actually form a viable sheet structure. Most prediction schemes don't worry about this situation when they generate their prediction. Turns and helixes are very similar in their general appearance and can be confused for one another if not carefully examined.

Some of the problems with secondary structure prediction techniques lie in the data being used to check their results. There are no set rules for establishing the secondary structure of a solved protein structure. Authors of such data can be subjective in making their structural assignments. Ideas about what constitutes a helix or a sheet section in such a solved structure have changed over time. Some authors don't report turns, others report the same residue as belonging to two different structural elements at the same time. Some authors use computer analysis to make their secondary structural assignments, others subjectively determine these assignments by eye. With a floating reference set against which to check your results, it is no wonder that secondary structure prediction reliability is low.

As a user of such prediction schemes, be cautious in applying and interpreting results. It is best to use these predictions only in cases where other types of potential confirming evidence are available, such as the presence of antigen producing regions, secondary structure estimates derived from physical data, hydrophobic moment analysis, or location of functional patterns with known spatial conformations.


Background Information on Confirming Secondary Structure Predictions

Since the reliability of secondary structure prediction programs is so low, it is best to do additional analysis to confirm the existence of the predicted secondary structural elements. There are a number of ways of doing this. However, each of these techniques only firms up a small part of the initial total prediction, since they only look for a specific type of characteristic. The entire series of these additional analyses needs to be run in order to completely check out a prediction. At times the only thing that your prediction does is let you know that there are secondary structure elements, not which ones are where.

Antigenic confirmation methods:
There are two programs that look for portions of a protein capable of generating antigens. Each program is based on a different type of antigen generation. The antigen index given in the PEPTIDESTRUCTURE program is based on B-cell antigenic response of whole intact proteins and represents regions of high surface exposure and flexibility (loop regions). The program AMPHI looks for T-cell antigenic sites, small portions of the protein which result after a protein's cleavage or unfolding (amphipathic helixes). A positive hit in antigen index shown in PEPTIDESTRUCTURE is any value over 1.0. Likely regions from the AMPHI program have scores over 10.0.

Hydrophobic Moments:
Examining a protein sequence to determine regions of high helix or sheet hydrophobic moments is another way to confirm secondary structure predictions. In this technique the vector analysis is better at finding real regions of helix hydrophobic moments than it is for sheets. There are two programs that perform this mode of analysis. Of the two, the homegrown program MOM produces results which are more easily understood.
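The vector analysis behind a hydrophobic moment can be sketched in a few lines. Each residue is treated as a vector whose length is its hydrophobicity and whose direction advances by a fixed angle per residue (100 degrees for a helix, as given earlier); the moment is the length of the vector sum. This is only an illustration and not the MOM program: it substitutes the Kyte-Doolittle hydropathy values for whatever scale MOM uses, divides by the segment length as a normalization choice, and uses two made-up test segments.

```python
import math

# Kyte-Doolittle hydropathy values (assumption: swapped in for illustration;
# the scale used by the course software may differ).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydrophobic_moment(segment: str, angle_deg: float = 100.0) -> float:
    """Mean hydrophobic moment of a segment for a given angle per residue."""
    sin_sum = 0.0
    cos_sum = 0.0
    for i, residue in enumerate(segment):
        h = KYTE_DOOLITTLE[residue]
        angle = math.radians(i * angle_deg)
        sin_sum += h * math.sin(angle)
        cos_sum += h * math.cos(angle)
    # Length of the summed vector, averaged over the segment.
    return math.hypot(sin_sum, cos_sum) / len(segment)

# Made-up 18-residue segments: one with leucines on one face of the helix
# and lysines on the other, and one that is leucine all the way around.
amphipathic = "LKKLLKKLLKLLKKLLKK"
uniform = "L" * 18

print(round(hydrophobic_moment(amphipathic), 2))
print(round(hydrophobic_moment(uniform), 2))
```

The uniformly phobic segment gives a moment near zero because its vectors cancel around the helix, while the segment with one phobic face and one philic face gives a large moment. A larger angle per residue would be passed as angle_deg to look for the alternating pattern expected of a sheet strand.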

Exercise for week 4

In this series of exercises you will explore longer protein primary sequence information for insight on the hydrophobic nature of proteins as it relates to crossing membranes. Check for the possible locations of sequence patterns with functional significance. Gauge the usefulness of secondary structure predictions based on primary sequences and use other means to confirm these predictions. Instructions in bold should be entered followed by pressing the ENTER key. The <rtn> symbol given in program examples means to press the ENTER key as well.


1) Activate the computer.

Move the mouse to get out of screen saver mode. A screen appears that shows overlapping windows.

By this time you should know how to activate the machine you want to use, make connections with ribozyme and log into your account. If you still need help with these functions, refer to the beginning of the exercises for weeks 2 and 3 for step by step instructions.


2) Move to this week's subdirectory and copy over to it the necessary files.

% cd four

Now copy over all the files needed to do this week's exercise. They are located in the directory location $UGRAD_DIR/week4.

% cp $UGRAD_DIR/week4/* .

3) Run the demo that describes this week's activities.

This week's demo looks at the phobic nature of transmembrane helixes and protein secondary structural elements. Due to the nature of membranes it has proven very difficult to grow crystals of transmembrane proteins. What little data is available is partial in nature and only deals with the membrane spanning helixes themselves. This data will be examined to find insights into the organization of the residues of the helixes and how they pack with other helixes from the same protein. Proteins of various sizes and secondary structural element content will be examined to see what the secondary elements look like and how they are organized with respect to hydrogen bonding.

Graphical demos are actually run on a different computer. To reach this machine and get yourself in a directory location in an account from where you can run the demo, enter the following command.

% model1

Now get into MacroModel and view the demo for week four. Entering mmv30 starts up the program. Respond to the question about a script file with week4.log and that about doing a batch process with n.

$ mmv30

  week4.log

  n

The demo looks at the transmembrane protein bacteriorhodopsin. This data set only contains the transmembrane helixes displayed as a topographical model. The green lines represent the location of the surface of the membrane. After the image is shown in default colors, it is shown colored by charge. Notice how few of the residues are charged within the membrane section. This is followed by the actual x-ray data of bacteriorhodopsin color coded to show charge. The image is clipped down to the approximate section within the membrane and then rotated 90 degrees on the x-axis so that you can see down the pore. The purple object is the retinal molecule. Charged amino acids have been shown completely. Notice how they either point toward the retinal molecule or another helical bundle.

Next the subject of photosynthetic reaction centers is covered. A series of these molecules' structures have been determined. Only the backbones of these molecules and the necessary substrate groups are shown. Their membrane spanning section is composed of 11 helical segments. These segments are phobic in nature and contain few charged amino acids. Three such structures are shown. The first two resemble popped champagne corks. Their membrane spanning regions contain the substrates (colored green or purple) that convert electrons from water, energy from the sun and atmospheric CO2 into organic compounds. The third structure is more complex. In all three cases, the charged groups within the membrane spanning region appear to hold the substrate groups in place rather than having them poke out of the pore into the lipid membrane.

Not all membrane spanning proteins are composed of helical segments. Porins have their spanning regions composed of sheets. One of these structures is shown. It has been color coded to show amino acid charge. The charged amino acids in this protein have been shown completely. When the image is rotated 90 degrees in the x direction, notice how they all point inside the pore.

Protein secondary structure is next explored. Four proteins have been color coded to show their secondary structure. Random sections are colored white, helixes red, sheets yellow and turns aqua. The four proteins vary in size from 61 to 146 residues. Only the backbone information is given for these proteins. Each structure is shown, the type of protein given (helix, sheet, turn or mixed), and then the hydrogen bonding of the structural type explored. The demo ends with the helical protein. After a helical region has been shown close up, pick out a helical section and check it out for yourself. To do this select Clip, move the cursor to the lower left-hand corner of the region you want to zero in on. Press the space bar. Move off to the upper right-hand corner of the area and press the space bar again. The screen is then filled with only that portion of the structure. Look closely at this image. Can you see a helical section clearly? Does the image need to be moved around either the x, y or z axis in order to get a better view? Pick the appropriate Rot button and estimate the number of degrees it will take to improve the image. Repeat this step as needed until you have a good view of your chosen helix on the screen. You can either attempt to view the structure from the side or the end. The two possible images look like this.

After you have the view you want, select STOP, and respond with y to the two questions given by the program.

Log out of this machine by entering the command, logout, to the dollar sign prompt.

$ logout

Back on ribozyme, one of the files that was copied over at the beginning of the exercise contains images of some of the information shown in the demo. To see what this data looks like in a slightly different format use the week4.images file. First, rename this file to reflect your own lastname and then print it off on the teaching lab printer.

% mv week4.images (your lastname).images4

% lpr (your lastname).images4

Pick up your hardcopy at the printer. The images shown are those of a porin transmembrane protein, a helical protein, and a sheet protein. Save this information. Add it to your collection of molecular images.


4) Getting the data to do the exercises

First copy over one of the data sets you generated last week. The data set on the metallothionein-ii sequences is what you will be using for part of this week's work. To copy over this data, enter the following command.

% cp ../three/metal.look .

Now invoke the GCG package by entering gcg. You will be using a number of GCG programs or homegrown software developed using GCG subroutines during this week's exercise. You might as well set up the graphics environment by entering tek_plot after the prompt returns.

% gcg

The GCG welcome message appears on the screen.

                     Welcome to the WISCONSIN PACKAGE
                      Version 8.1-UNIX, August 1995

                             Installed on irix

  Copyright 1982, 1983, 1984, 1985, 1986, 1987, 1989, 1991, 1992, 1994, 1995
            Genetics Computer Group, Inc.  All rights reserved.

         Published research assisted by this software should cite:

                 Program Manual for the Wisconsin Package,
            Version 8, September 1994, Genetics Computer Group,
             575 Science Drive, Madison, Wisconsin, USA  53711

              Databases available:
                   GenBank            Release 94.0 ( 4/96)
                   EMBL (Abridged)    Release 43.0 ( 5/95)
                   PIR-Protein        Release 45.0 ( 6/95)
                   SWISS-PROT         Release 31.0 ( 3/95)
                   NRL_3D             Release 19.0 ( 6/95)
                   PROSITE            Release 12.2 ( 3/95)
                   Restriction Enzymes (REBASE)    ( 6/95)

               Help is available with the command % genhelp or by
            calling (608) 231-5200 or sending e-mail to Help@GCG.Com

% tek_plot

Now create some additional data sets to do the rest of the sections. Do a strings search of the SwissProtein database for the various sequences located there on lactose permease. Use the following example as a guide. In this example, the comma separating the two terms to be searched means that both terms need to be present in a line of text in order for a hit to be found.

% stringsearch

STRINGSEARCH identifies sequences by searching sequence documentation
with character patterns such as 'globin' or 'human'.

 STRINGSEARCH through what sequence(s) (* GenEMBL:* *) ?  sw:*<rtn>

 Do you want to search through:
     A) definitions
     B) complete sequence records

 Please choose one (* A *):  a <rtn>
 Search for what text patterns ?  lactose,permease<rtn>

 What should I call the output file (* sw.strings *) ?  lac.look<rtn>

//////////////////////////////////////////////////////////////////////

     Sequences searched:    43470
 Sequences with matches:       xx
        Patterns sought: lactose permease

            Output file: lac.look

%

Edit the resulting strings file and remove from the listing any sequences with the terms precursor or fragment in their definition line.
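If you would rather not do this kind of weeding by hand, the idea can be sketched as a simple line filter. The sketch assumes that each sequence occupies one line of the listing with its definition on that same line; the three entries shown are made-up placeholders, not real STRINGSEARCH output, and the function name drop_unwanted is hypothetical.

```python
def drop_unwanted(lines, unwanted=("precursor", "fragment")):
    """Keep only the lines that mention none of the unwanted terms."""
    kept = []
    for line in lines:
        lowered = line.lower()   # compare without regard to case
        if not any(term in lowered for term in unwanted):
            kept.append(line)
    return kept

# Placeholder entries (assumption): real listings will look different.
listing = [
    "entry_one    LACTOSE PERMEASE.",
    "entry_two    LACTOSE PERMEASE PRECURSOR.",
    "entry_three  LACTOSE PERMEASE (FRAGMENT).",
]

for line in drop_unwanted(listing):
    print(line)
```

Whichever way you do the edit, check the result by eye before going on. Any header lines in the file must be left alone.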

Create another dataset by doing a strings search of the SwissProtein database for rhodopsin sequences. Use the following example as a guide. The quotes and the space before the term rhodopsin ensure that you get rhodopsin hits and not hits in which rhodopsin is part of another word.

% stringsearch

STRINGSEARCH identifies sequences by searching sequence documentation
with character patterns such as 'globin' or 'human'.

 STRINGSEARCH through what sequence(s) (* GenEMBL:* *) ?  sw:*<rtn>

 Do you want to search through:
     A) definitions
     B) complete sequence records

 Please choose one (* A *):  a <rtn>
 Search for what text patterns ?  " rhodopsin" <rtn>

 What should I call the output file (* sw.strings *) ?  rho.look<rtn>

//////////////////////////////////////////////////////////////////////

     Sequences searched:    43470
 Sequences with matches:       xx
        Patterns sought:  rhodopsin

            Output file: rho.look

%

Edit the resulting strings file so it only contains lines with the term rhodopsin in the definition line of the sequence. Also remove any sequences with a length greater than 360 residues.


5) Examining proteins for transmembrane sections

Once a protein gets beyond 20 or so amino acids in length, there is a high probability that it will contain secondary structural elements. As the length of the protein increases over 100 the possibility that it may also contain transmembrane segment(s) starts to become real. Transmembrane segments can be checked for in a number of ways.

The first method is to check it with the hydrophobicity program you ran last week called PK23. As you know this program can be run with different window sizes. Run the program with a window size of 21, the average size of a transmembrane helix, and check the resulting plot for phobic regions of the protein that are at least 21 amino acids long to spot possible transmembrane regions.
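The window averaging idea can be sketched as follows. This is not the PK23 program, only an illustration of the technique under stated assumptions: it uses the Kyte-Doolittle hydropathy values with positive numbers meaning phobic (PK23 may plot with the opposite sign), and the test sequence is made up.

```python
# Kyte-Doolittle hydropathy values; positive means phobic here (assumption
# about sign convention, the course program may report hydrophilicity).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def window_averages(sequence: str, window: int = 21):
    """Average hydropathy of every full window along the sequence.

    Element i of the result covers residues i+1 through i+window (1-based).
    """
    values = [KYTE_DOOLITTLE[residue] for residue in sequence]
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]

# Made-up sequence: charged ends around a 21-residue phobic stretch.
toy = "DEKRDEKRDE" + "LIVALIVALIVALIVALIVAL" + "KRDEKRDEKR"

averages = window_averages(toy, window=21)
best = averages.index(max(averages))
print("most phobic window starts at residue", best + 1)
print("average hydropathy there:", round(averages[best], 2))
```

A window as wide as the feature you are looking for smooths out single residues that do not fit, which is why a window of 21 is suggested for transmembrane helixes and a smaller one for general use.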

Type off the lac.look file. Select two of the sequences listed there to do further work on. Record those names below.

% cat lac.look

lac sequence 1: ______________________________________________________________

lac sequence 2: ______________________________________________________________

Now type off the rho.look file. Select two of these sequences to do further work on. Record their names below.

% cat rho.look

rho sequence 1: ______________________________________________________________

rho sequence 2: ______________________________________________________________

Run the PK23 program on each of these four sequences using a window of 21. Examine the resulting plot closely to spot any phobic regions of over 21 amino acids. Record the results on the next page. Use the example given below and on the next page as a guide. The resulting plot is given on the next page.

% pk23
 Process set to plot with TEK4107 attached to term:
 using the tekd graphic interface.

Kyte - Doolittle plotting program

Please enter the filename.ext glyco.pep <rtn>

                  Begin (*   1 *) ? <rtn>
               End (*     131 *) ? <rtn>

 enter number of window sizes to be  tried: 1 <rtn>

 Average of hydrophilicity over how many acids (* 7 *) ?  21<rtn>

 values are -2.3285728 to 1.8428578

 When your TEK4107 attached to tty is ready, press <RETURN>.<rtn>

From this plot you can see the large phobic region from residue 73 through residue 94. This region is big enough to span a membrane. Use clearplot to clean the screen between runs.

Record the results of plotting the four chosen sequences with PK23 here. Record the relative locations and the number of the transmembrane sections you found.

lac sequence 1: ___________________________________________________________

___________________________________________________________________________

lac sequence 2: ___________________________________________________________

___________________________________________________________________________

rho sequence 1: ___________________________________________________________

___________________________________________________________________________

rho sequence 2: ___________________________________________________________

___________________________________________________________________________

Another piece of software useful in spotting transmembrane sections is MOM. This program produces a plot of Eisenberg's hydrophobic moment in a format similar to the one used in PK23. It can be run to determine helical moments, sheet moments or both. The results are color coded: the helical moment is shown in red, the sheet (beta) moment in blue. Record the results of the various sequence runs on the next page. Use clearplot to clean the screen between runs.

% mom

This is the program MOMENTPLOT
It will create plot(s) of MOMENT values

MOMENTPLOT of what protein sequence? glyco.pep <rtn>

                Begin (*   1 *) ? <rtn>
              End (*     131 *) ? <rtn>

Enter the type of plot desired
helical moment = 1
beta moment    = 2
both values    = 3  1 <rtn>

max is    0.67   min is     0.02
When your TEK4107 attached to tty is ready, press <RETURN>.<rtn>

From this plot, there are 4 possible areas. Three of these regions (as measured at the .45 line) 17-21, 65-70 and 97-101 are just too small. The range 81-88 at least has some potential to be a phobic region of interest, especially since this area lies in the middle of the region found with PK23. Notice that some techniques produce more obvious results than others.
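
For readers who want to see what a hydrophobic moment is numerically, here is a minimal Python sketch. It is not the MOM source. It assumes the usual definition, in which each residue's hydrophobicity is treated as a vector rotated by a fixed angle per residue (100 degrees for a helix; 160 to 180 degrees is commonly used for strands) and the length of the summed vector is divided by the window length. The window of 11, the function names and the two-letter toy scale are assumptions for illustration; the scale and window that MOM uses are not stated in this handout.

```python
import math

def hydrophobic_moment(values, angle_deg=100.0):
    """Mean hydrophobic moment of one window of hydrophobicity values."""
    delta = math.radians(angle_deg)
    sin_sum = sum(h * math.sin(i * delta) for i, h in enumerate(values))
    cos_sum = sum(h * math.cos(i * delta) for i, h in enumerate(values))
    return math.hypot(sin_sum, cos_sum) / len(values)

def moment_profile(seq, scale, window=11, angle_deg=100.0):
    """Return (center_position, moment) for every full window of seq."""
    values = [scale[aa] for aa in seq.upper()]
    half = window // 2
    return [
        (start + half + 1,
         hydrophobic_moment(values[start:start + window], angle_deg))
        for start in range(len(values) - window + 1)
    ]

if __name__ == "__main__":
    toy_scale = {"L": 1.0, "K": -1.0}   # hypothetical two-letter scale
    toy_seq = "LKKLLKKLLKKLLKKLLK"      # hypothetical sequence
    for pos, mu in moment_profile(toy_seq, toy_scale):
        print(pos, round(mu, 2))
```

A window whose phobic residues all face one side of the helix gives a large moment, while a uniformly phobic window gives a moment near zero. That is one reason a membrane-spanning stretch can be obvious in PK23 and much weaker in MOM.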

Record the results of plotting these 4 sequences with MOM here. Indicate those regions which seem most likely to lie in the middle of the possible regions found by PK23. Use clearplot to clean the screen between runs.

lac sequence 1: ___________________________________________________________

___________________________________________________________________________

lac sequence 2: ___________________________________________________________

___________________________________________________________________________

rho sequence 1: ___________________________________________________________

___________________________________________________________________________

rho sequence 2: ___________________________________________________________

___________________________________________________________________________

A third way to determine the location of transmembrane segments is to run the GCG program PEPTIDESTRUCTURE and examine the resulting output file for the information it contains on the hydrophobicity of the protein. This program does a number of different analysis runs on the protein sequence used. One of these tests is running the hydrophobicity window through the sequence. PEPTIDESTRUCTURE is run in the following manner.

% peptidestructure

PEPTIDESTRUCTURE makes secondary structure predictions for a peptide
sequence. The predictions include (in addition to alpha, beta, coil, and
turn) measures for antigenicity, flexibility, hydrophobicity, and surface
probability.  PLOTSTRUCTURE displays the predictions graphically.

 PEPTIDESTRUCTURE for what peptide sequence ?  glyco.pep<rtn>

                  Begin (* 1 *) ?  <rtn>
                End (*   131 *) ?  <rtn>

 Calculate hydrophilicity according to

     H)opp-Woods or
     K)yte-Doolittle

 Please choose one (* K *) :  <rtn>

 What should I call the output file (* humprp.p2s *) ? glyco.p2s <rtn>

%
The result of running this program is an output file with the extension p2s. In this file the various types of analysis results are shown in columns under their respective headings. The hydrophilicity values we are looking for are under the heading HyPhil. A partial example of one of these output files is given on the next page. This program reports hydrophilicity rather than hydrophobicity, so phobic regions of the sequence have negative values instead of the positive ones you are used to from the homegrown software.

PEPTIDESTRUCTURE of: glyco.pep  check: 8677  from: 1  to: 131

   P1;glyco

   Hydrophilicity (Kyte-Doolittle) averaged over a window of: 7
   Surface Probability according to Emini
   Chain Flexibility according to Karplus-Schulz
   Structure according to Chou-Fasman
   Secondary Structure according to Garnier-Osguthorpe-Robson
   Antigenicity Index according to Jameson-Wolf

                       Date: November 28, 1994 10:52

Pos  AA  GlycoS  HyPhil  SurfPr  FlexPr  CF-Pred GORPred AI-Ind..

1     S       .   0.750   1.401   1.000       .       .   0.900
2     S       .   0.680   1.085   1.000       .       .   0.900

///////////////////////////////////////////////////////////////

71    P       .   0.400   1.441   1.035       .       H   0.600
72    E       .   0.257   0.887   1.024       .       H   0.450
73    I       .  -0.500   0.359   0.995       B       H  -0.600
74    T       .  -1.643   0.163   0.961       B       H  -0.600
75    L       .  -2.271   0.081   0.935       B       H  -0.600
76    I       .  -2.714   0.115   0.913       B       H  -0.600
77    I       .  -2.671   0.059   0.904       B       B  -0.600
78    F       .  -3.043   0.071   0.900       B       B  -0.600
79    G       .  -2.757   0.102   0.899       B       B  -0.600
80    V       .  -2.057   0.144   0.900       B       B  -0.600
81    M       .  -2.014   0.124   0.898       B       B  -0.600
82    A       .  -2.257   0.088   0.906       B       .  -0.600
83    G       .  -2.257   0.117   0.917       B       .  -0.600
84    V       .  -1.557   0.170   0.937       B       B  -0.600
85    I       .  -1.929   0.118   0.954       B       B  -0.600
86    G       .  -2.214   0.098   0.967       B       B  -0.600
87    T       .  -2.814   0.109   0.968       B       B  -0.600
88    I       .  -2.857   0.109   0.950       B       B  -0.600
89    L       .  -2.100   0.148   0.936       B       B  -0.600
90    L       .  -1.971   0.161   0.923       B       B  -0.600
91    I       .  -2.014   0.227   0.917       B       B  -0.600
92    S       .  -2.014   0.193   0.929       t       B  -0.400
93    Y       .  -0.829   0.458   0.945       t       T   0.000
94    G       .   0.357   1.281   0.963       t       T   1.050
95    I       .   0.457   0.788   0.983       B       B   0.300

///////////////////////////////////////////////////////////////

130   D       .   2.400   2.095   1.000       T       .   1.300
131   Q       .   2.125   1.856   1.000       .       .   0.900
In the example given, notice the area from 73 through 93 where the hydrophilicity (HyPhil) values are all negative. In this output file, negative values mark a phobic region. This region is 21 amino acids long, long enough to span a membrane.
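
Scanning a long p2s file by eye is tedious, so the same check can be sketched in Python. This is a rough illustration that assumes the table rows look like the example above (position, residue, GlycoS, then HyPhil as the fourth column); the function names and the default minimum length of 20 are choices made for this sketch, not part of GCG.

```python
def read_hyphil(lines):
    """Yield (position, residue, HyPhil) from PEPTIDESTRUCTURE table rows."""
    for line in lines:
        fields = line.split()
        # Table rows start with a residue number; headers and '////' do not.
        if len(fields) >= 4 and fields[0].isdigit():
            try:
                yield int(fields[0]), fields[1], float(fields[3])
            except ValueError:
                continue

def negative_runs(rows, min_len=20):
    """Return (first, last) positions of consecutive rows with HyPhil < 0."""
    runs, current = [], []
    for pos, _aa, hyphil in rows:
        contiguous = bool(current) and pos == current[-1] + 1
        if hyphil < 0 and (contiguous or not current):
            current.append(pos)
        else:
            if len(current) >= min_len:
                runs.append((current[0], current[-1]))
            current = [pos] if hyphil < 0 else []
    if len(current) >= min_len:
        runs.append((current[0], current[-1]))
    return runs

if __name__ == "__main__":
    # glyco.p2s is the output file named in the example run above.
    with open("glyco.p2s") as handle:
        print(negative_runs(read_hyphil(handle)))
```

Keep in mind that the HyPhil column in the example is averaged over a window of 7, so the ends of a reported run are approximate.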

Run PEPTIDESTRUCTURE on each of the possible transmembrane proteins you are looking at. Print off the results of these analyses on the lab printer with the lpr command. In the example command, xxxxx.zzz represents the name of your output file. Study the results of each run and record your observations in the space provided.

% lpr xxxxx.zzz

Record the results of the PEPTIDESTRUCTURE runs here. Indicate those phobic regions that were found to be at least 20 residues long.

lac sequence 1: ___________________________________________________________

___________________________________________________________________________

___________________________________________________________________________

lac sequence 2: ___________________________________________________________

___________________________________________________________________________

___________________________________________________________________________

rho sequence 1: ___________________________________________________________

___________________________________________________________________________

___________________________________________________________________________

rho sequence 2: ___________________________________________________________

___________________________________________________________________________

___________________________________________________________________________


6) Determining Patterns in Protein Sequences

Last week when you determined trypsin cut sites, you were actually using software that does simple character searches. A number of functional characteristics have been defined as character patterns. To give you a taste of looking for patterns, use the strings list you generated last week on the metallothioneins, metal.look. These sequences contain a large number of cysteine residues, and so many cysteines could well occur in patterns. The GCG program used to search for such patterns is FINDPATTERNS. Look for the following list of possible patterns, where X stands for any amino acid: CC, CXC, CCC, CXCC, CXXC, and CXXXC. Use the example on the next page to guide your search. Replace the term (your lastname) with your own last name. These results will be printed, and you will need a way of telling your results from those of the other members of the class.

% findpatterns

FINDPATTERNS identifies sequences with short pattern queries like GAATTC
or YRYRYRYR.  You can define the patterns ambiguously and allow mismatches.
You can provide the patterns in a file or simply type them in from the
terminal.

 FINDPATTERNS in what sequence(s) ?  @metal.look <rtn>

 Enter patterns individually, one per line.
 End the list with a blank line.

                Pattern 1:  CC <rtn>
                Pattern 2:  CXC <rtn>
                Pattern 3:  CCC <rtn>
                Pattern 4:  CXCC <rtn>
                Pattern 5:  CXXC <rtn>
                Pattern 6:  CXXXC <rtn>
                Pattern 7:  <rtn>

 What should I call the output file (* Metal.find *) ? (your lastname).find<rtn>

          [names of the searched files are shown]

 FINDPATTERNS in what sequence(s) ? <rtn>

     Total finds:        276
    Total length:        782
 Total sequences:         13
        CPU time:      06.45

    Output file: /disk3/usr/local/people/expxx/four/(your lastname).find

% 
Now type off your results and see which of the patterns you have looked for are actually there. Is there any relationship between the patterns found? To help you answer this, a small section of the resulting find file is given here and explained to help you understand the generated output.

% cat (your lastname).find

! FINDPATTERNS on @metal.look allowing 0 mismatches

!        1 CC
!        2 CXC
!        3 CCC
!        4 CXCC
!        5 CXXC
!        6 CXXXC                              June 19, 1996 15:59..
[This is header information from the output file that lists the patterns being searched for and the date the pattern search was run.]

             Mt1i_Human  ck: 261   len: 61    ! METALLOTHIONEIN-II (MT-1I)
[The name of the sequence file being searched and the length of the sequence. This particular sequence is 61 residues long.]

1                     CC
            33: SCKKS CC SCCPV
            36: KSCCS CC PVGCA
            59: SEKCS CC A
[Information is given on the first pattern found in the sequence, pattern 1 - CC. The numbers at the start of the line indicate at what position in the sequence the pattern was actually found.]

2                     CXC
             5:  MDPN CSC AAGVS
            13: AAGVS CTC AGSCK
            19: TCAGS CKC KECKC
            24: CKCKE CKC TSCKK
            34: CKKSC CSC CPVGC
            48: KCAQG CIC KGASE
            57: GASEK CSC CA
[Information is given on the second pattern found in the sequence, pattern 2 - CXC. The numbers at the start of the line indicate at what position in the sequence the pattern was actually found.]

4                     CXCC
            34: CKKSC CSCC PVGCA
            57: GASEK CSCC A
[Information is given on the third pattern found in the sequence, pattern 4 - CXCC. Notice that since pattern 3 - CCC was not found in this sequence no information is produced on this pattern.]

The rest of the file is organized similarly. The sequence being searched is listed, followed by the information for only those patterns actually found in the sequence. At the end of the file, statistics are given on the total number of patterns found, the total number of residues searched [the sum of the lengths of the sequences searched] and the number of sequences used in the search.
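
The kind of search FINDPATTERNS performs for these simple patterns can be imitated with a regular expression. The sketch below is not how GCG implements it and handles only plain letters plus X for any residue; the function names are made up for the example, and reporting overlapping hits is a choice of this sketch. The fragment used is residues 28-42 of Mt1i_Human, pieced together from the flanking residues printed in the listing above.

```python
import re

def pattern_to_regex(pattern):
    """Translate a simple pattern in which X matches any residue."""
    return "".join("." if ch == "X" else re.escape(ch)
                   for ch in pattern.upper())

def find_pattern(seq, pattern, first_position=1):
    """Return start positions of all hits, overlapping hits included."""
    # A lookahead lets the regex engine report hits that overlap.
    regex = re.compile("(?=" + pattern_to_regex(pattern) + ")")
    return [m.start() + first_position for m in regex.finditer(seq.upper())]

if __name__ == "__main__":
    fragment = "SCKKSCCSCCPVGCA"   # residues 28-42, from the listing
    for pat in ("CC", "CXC", "CCC", "CXCC"):
        print(pat, find_pattern(fragment, pat, first_position=28))
```

Run on this fragment, the sketch reports CC at 33 and 36, CXC at 34, CXCC at 34 and no CCC, which agrees with the corresponding lines of the find file.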

Print off your results and look carefully at the patterns found. Is there any relationship between the found patterns? Do they form possible longer patterns? If so, record below what sort of a longer pattern is produced.

% lpr (your lastname).find

observed pattern(s)  ___________________________________________________________
Proteins have been extensively studied to find patterns that relate to functional activity. These patterns have been collected into a library, and special software has been created to use this pattern library to find possible functional characteristics in unknown protein sequences. MOTIFS is the GCG program that looks for these functional patterns. Run MOTIFS on the metal.look file and see if any of those sequences contain known functional patterns. Use the example given below to run this analysis. Running this program will take a few minutes, so be patient.

% motifs

MOTIFS looks for sequence motifs by searching through proteins for the
patterns defined in the PROSITE Dictionary of Protein Sites and
Patterns.  MOTIFS can display an abstract of the current literature on
each of the motifs it finds.

 MOTIFs from what protein sequence(s) ?  @metal.look <rtn>

 What should I call the output file (* metal.motifs *) ? <rtn>

       [names of the sequences being searched will be shown]

             Total finds:         14
            Total length:        731
         Total sequences:         12
          CPU time (sec):      09.70

            Output file: "/disk3/usr/local/people/expxx/four/metal.motifs"

%
Now type the resulting file off on the screen. Record a summary of the types of patterns these sequences contain. What is the purpose of this functional pattern?

% cat metal.motifs

___________________________________________________________________________

___________________________________________________________________________

___________________________________________________________________________

___________________________________________________________________________

___________________________________________________________________________

___________________________________________________________________________


7) Exploring Secondary Structure Predictions of Protein Sequence

Software has been developed to make secondary structure predictions based on primary protein sequence. A number of different approaches have been tried to predict secondary structure from this source of data (primary sequences). You will be using the GCG program PEPTIDESTRUCTURE to generate predictions and then using other programs to attempt to confirm these predictions. You will be assigned one of four possible unknown sequences on which to explore this process. Record your unknown sequence name below.

sequence to use: ______________________________________________________________
To give you an example of this process, another protein will be used to go through the steps of this type of analysis, cytochrome c from tuna.

First a run is made on the sequence with the program PEPTIDESTRUCTURE. This produces a P2S output file that contains secondary structure output data plus antigen index information. To help in this prediction process, a composite file has been created to store all the information in one place. After that the sequence was run through the programs, MOM and AMPHI to gain additional pieces of data to help firm up the predictions. Examples of running these programs are given on the next two pages using a cytochrome c protein sequence.

% mom

This is the program MOMENTPLOT
It will create plot(s) of MOMENT values

 MOMENTPLOT of what protein sequence ?  1cyc.nrl_3d <rtn>

                  Begin (* 1 *) ? <rtn>
                End (*   103 *) ? <rtn>
 Enter the type of plot desired
 helical moment = 1
 beta moment    = 2
 both values    = 3 1 <rtn>

 max is   0.74  min is   0.01
 When your TEK4107 attached to tty is ready, press <RETURN>. <rtn>

            [ an image is displayed on the screen] 
A helical moment plot was selected, because the program does a better job predicting possible helical sections than it does with sheets.

Looking at this plot shows five regions where the graph exceeds the .45 value. These are the regions of the most probable helical sections. The five regions are from about 12-15, 33-37, 65-74, 83-87 and 94-98.

% amphi

               Amphi

 This program is for the prediction of helper
 T-cell antigenic sites that correlate with
 amphipathic helices

 Enter input filename: 1cyc.nrl_3d <rtn>

 Enter output filename: 1cyc.amphi <rtn>

 Enter desired block length [7 or 11]: 11 <rtn>

                  Begin (* 1 *) ? <rtn>
                End (*   103 *) ? <rtn>

 length is the sequence is 103
 if you would like a detailed output
 type 1 else type 0 0 <rtn>

% 
In the example above the block size of 11 was chosen as it is the preferred one given in the paper that contained the program. The result of this amphi analysis run looks like this.

 Prediction of helper T cell antigenic sites for 1cyc.nrl_3d


 predicted amphipathic segments

      mid points  angles        as<
      of blocks
--------------------------------------

   K    12- 21    90.-130.    24.5
     P  31- 33   100.-100.     5.4
 * K    44- 52   125.-135.    16.8
   K P  64- 73    80.-110.    23.7
        97- 98   105.-110.     4.4
 No of predicted blocks   34
From the information given in the beginning of this week's exercise, those regions with scores greater than 10 are to be considered possible amphipathic regions. The most likely amphipathic regions are the three areas from 12-21, 44-52 and 64-73.
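
Picking out the rows that pass the score cutoff can be sketched as follows. The regular expression assumes the column layout printed in the example above (a block range, an angle range, then the score as the last number on the line); the function name and the pattern are assumptions for this sketch, not part of the amphi program.

```python
import re

# block range, angle range, score at the end of the line
ROW = re.compile(r"(\d+)-\s*(\d+)\s+[\d.]+-\s*[\d.]+\s+([\d.]+)\s*$")

def amphipathic_regions(lines, min_score=10.0):
    """Return (first, last, score) for rows scoring above min_score."""
    regions = []
    for line in lines:
        match = ROW.search(line)
        if match and float(match.group(3)) > min_score:
            regions.append((int(match.group(1)),
                            int(match.group(2)),
                            float(match.group(3))))
    return regions

if __name__ == "__main__":
    # 1cyc.amphi is the output file named in the example run above.
    with open("1cyc.amphi") as handle:
        for first, last, score in amphipathic_regions(handle):
            print(first, last, score)
```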

Here is some background information on the notations used by the PEPTIDESTRUCTURE program. The Chou-Fasman prediction scheme predicts areas which it considers to be highly likely as a given structure type and displays this fact by using an upper case letter to mark that position in the sequence. Areas which it is not as sure about for any given structure type are shown with lower case letters. GOR believes that all its predictions are highly likely and only uses upper case characters.

Pooling all this information together in a composite file results in the data file shown below. In this file some further collecting of data was also done. Areas where the Chou-Fasman and GOR secondary structure predictions matched were shown with an x. Areas which had a weak helical prediction in the Chou-Fasman scheme and a sheet prediction in the GOR were also shown with an x and are considered to be helical areas in the resulting composite of the prediction data. Likewise, areas of strong sheet predictions from Chou-Fasman and helical from GOR were noted with an x and will be converted to areas of helix in the final prediction for the structure.

On the next page is the resulting composite file for the cytochrome c data. The names for the lines given on the left-hand side of the file denote the data contained in that line.

     Found motif was that of Cytochrome C

              motifs:              xxxxxx
               amphi:            xxxxxxxxxx                      xxxxxxx
              AI-Ind:                     xxxxx                      xxx
                 MOM:            xxxx                 xxxxx
         Chou-Fasman:  hhhhhhhBBBBBBBBBBBBttTTtt...TTBBBBB.hhhhhhh.tt.tt
                 GOR: HHHHHHHHHHHHTTTTTT..TTTTT..........TTT.........TTT
               agree:  xxxxxxx            xxxxx                       xx
     weak helix-beta:
   strong beta-helix:         xxxx
       predicted 2nd:  HHHHHHHHHHH        TTTTT                       TT
            Sequence: GDVAKGKKTFVQKCAQCHTVENGGKHKVGPNLWGLFGRKTGQAEGYSYTD
                      12345678901234567890123456789012345678901234567890
                               1         2         3         4         5

              motifs:
               amphi: xx           xxxxxxxxxx
              AI-Ind: xxxx                xxx
                 MOM:               xxxxxxxxxx        xxxxx xxxxxx
         Chou-Fasman: TTTTtthhhhhhHHHHHHH.TTt..TTHHHHHHhhhhhhhhhhBBBBB..
                 GOR: TTT..BBBB....HHHHHHHTTT................HHHHHHHHHH.
               agree: xxx          xxxxxx                    xxxx
     weak helix-beta:       xxx
   strong beta-helix:                                            xxxxx
       predicted 2nd: TTT   HHH    HHHHHH                    HHHHHHHHH
            Sequence: ANKSKGIVWNENTLMEYLENPKKYIPGTKMIFAGIKKKGERQDLVAYLKS
                      12345678901234567890123456789012345678901234567890
                               6         7         8         9         0
                                                                       1

              motifs:
               amphi:
              AI-Ind:
                 MOM:
         Chou-Fasman: ...
                 GOR: ...
               agree:
     weak helix-beta:
   strong beta-helix:
       predicted 2nd:
            Sequence: ATS
                      123

Explanation of the lines in this file.

motifs:- 	The sequence contained the motif for cytochrome c's binding 
		of a heme group.  The location of this motif is shown on the 
		motif line and exists at positions 14-19 of the sequence.  
		Checking a found motif against structural data can determine 
		if enough examples of the motif are known to assign 3-d 
		conformations to that pattern or if enough constituent members 
		of the protein family have been solved to model the new 
		sequence on.  In this case the structure of the cytochrome c 
		family is very conserved and well represented in the 
		structural databases.  This actual area of the sequence is not 
		in a defined structural region.

amphi:- There are three regions found by amphi, at 12-21, 44-52 and 64-73. This would indicate possible helical or turn structures in these locations. Region 1 from 12-21 falls in the same area of the sequence as the located motif did, a region of undefined structure. Region 2 from 44-52 falls in a region considered to be a turn area by the two prediction methods. Helixes and turns can be difficult to distinguish. Region 3 from 64-73 falls in an area that is predicted to start out with a helix and move on into a turn.

AI-Ind:- There are three regions where the AI-Ind values are greater than or equal to a value of 1.0 in the p2s file. These are located at positions 21-25, 48-53 and 71-73. This would indicate possible loop or turn structures in these areas. All three of the AI regions fall in areas predicted to be turns by the two prediction techniques.

MOM:- There are five marked regions of interest from the MOM plot. Each of these has peaks over a value of .45. The five regions are from 12-15, 33-37, 65-74, 83-87 and 94-98. These should denote possible helical regions. The first MOM region from 12-15 falls in a region of the sequence that matches part of the motif area - an undefined area. The 2nd MOM area from 33-37 falls in an area where the predictions can't agree and shows sheet or the beginning of a turn, not helical predictions. MOM area 3 from 65-74 coincides with an area of predicted helix going to a turn - a strong confirmation of the predicted structure. MOM area 4 from 83-87 has a helical prediction only from the Chou-Fasman technique. Area 5 from 94-98 falls in a mixed prediction area. However, the type of match shown, strong sheet in Chou-Fasman and helix in GOR, has often turned out to be an actual helical region in other such secondary structure comparisons, so this area could very well be a helix.

Chou-Fasman:- The prediction symbols from the PEPTIDESTRUCTURE program's Chou-Fasman determination.

GOR:- The prediction symbols from the PEPTIDESTRUCTURE program's GOR (Garnier-Osguthorpe-Robson) determination.

agree:- Areas where the two prediction schemes match are shown with x's. A match is shown regardless of the case of the symbol.

weak helix-beta:- Areas where a Chou-Fasman weak helix prediction, h, matches with that of a sheet prediction from GOR are shown with x's. The final composite value will be for a helix.

strong beta-helix:- Areas where a Chou-Fasman strong sheet prediction, B, matches with that of a helix prediction from GOR are shown with x's. The final composite value will be for a helix.

predicted 2nd:- The resulting secondary structure prediction. Upper case letters were used in this line to show the predicted structure. This line is a simple composite of the data shown in the previous three lines: the structure is taken directly wherever the agree line is marked, and a helix is entered wherever either of the other two lines is marked.
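
The three combining rules (agree, weak helix-beta, strong beta-helix) can be written down as a short Python sketch. This is a literal reading of the rules as described here; the function name is made up for the example, and a hand-assembled composite file may not follow the rules in every column.

```python
def composite_prediction(chou_fasman, gor):
    """Combine Chou-Fasman and GOR symbol strings into one prediction.

    Positions with no combined prediction are returned as spaces.
    """
    predicted = []
    for cf, g in zip(chou_fasman, gor):
        if cf.isalpha() and cf.upper() == g.upper():
            predicted.append(cf.upper())   # agree: same type, any case
        elif cf == "h" and g.upper() == "B":
            predicted.append("H")          # weak helix-beta becomes helix
        elif cf == "B" and g.upper() == "H":
            predicted.append("H")          # strong beta-helix becomes helix
        else:
            predicted.append(" ")          # no usable agreement
    return "".join(predicted)

if __name__ == "__main__":
    # Hypothetical short strings, one symbol per residue.
    cf_line  = "hhhBBBtt..HH"
    gor_line = "HHHHHHTT..BB"
    print("[" + composite_prediction(cf_line, gor_line) + "]")
```

Positions where both methods show only a dot are left blank, since a shared lack of prediction is not counted as agreement.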

Since the crystal structure for this sequence is known, compare the predicted results with those known from the actual x-ray data.


    predicted 2nd:  HHHHHHHHHHH        TTTTT                       TT
     actual x-ray: HHHHHHHHHHH  TTTT                               HH
         Sequence: GDVAKGKKTFVQKCAQCHTVENGGKHKVGPNLWGLFGRKTGQAEGYSYTD
                   12345678901234567890123456789012345678901234567890
                            1         2         3         4         5

    predicted 2nd: TTT   HHH    HHHHHH                    HHHHHHHHH
     actual x-ray: HHTTTT   HHHHHHHHHH     TTTT           HHHHHHHHHHH
         Sequence: ANKSKGIVWNENTLMEYLENPKKYIPGTKMIFAGIKKKGERQDLVAYLKS
                   12345678901234567890123456789012345678901234567890
                            6         7         8         9         0
                                                                    1

    predicted 2nd:
     actual x-ray: HHH
         Sequence: ATS
                   123

Both the predicted and the actual data show the protein to be composed of helix and turn structures. They differ on which regions fall into each of these structures and on the total number of defined residues. The prediction shows a total of 45 residues with defined structure and the actual x-ray data 51 residues. Twenty-five of the predicted residues were actually correct, all of them helical, for a total accuracy of 49% (25 of the 51 residues defined in the x-ray data). It would appear from this example that this type of approach is much better at finding helixes than turns.
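
The bookkeeping behind these numbers can be sketched in Python. The function name is made up for this example, and it assumes two aligned strings with one symbol per residue and a space wherever no structure is defined. Accuracy is taken as correct residues divided by the residues defined in the x-ray data, which is the ratio that gives 25/51, or about 49%, here.

```python
def score_prediction(predicted, actual):
    """Return (correct, defined_predicted, defined_actual, accuracy)."""
    width = max(len(predicted), len(actual))
    predicted = predicted.ljust(width)   # pad the shorter line with blanks
    actual = actual.ljust(width)
    defined_pred = sum(ch != " " for ch in predicted)
    defined_actual = sum(ch != " " for ch in actual)
    correct = sum(p == a and p != " " for p, a in zip(predicted, actual))
    accuracy = correct / defined_actual if defined_actual else 0.0
    return correct, defined_pred, defined_actual, accuracy

if __name__ == "__main__":
    # Hypothetical short lines, not the cytochrome c data.
    print(score_prediction("HHH  TT", " HHHH T"))
```

Dividing by the number of predicted residues instead (25/45) would answer a different question: how often a prediction, once made, was right.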

Carry out a similar process on your unknown sequence. Replace the sequence name given in the previous examples with that of your own (unknown1.seq, unknown2.seq, unknown3.seq or unknown4.seq). All the unknown sequences were initially copied over to your account at the beginning of the exercise.

For each unknown there is a template file that has the Chou-Fasman and GOR values already entered. You will need to run MOTIFS, amphi, PEPTIDESTRUCTURE, and MOM on your assigned unknown and record the necessary information below in the appropriate places. Then rename the template file and fill in the data with the editor, pico. Study your results and come up with a prediction of its secondary structure.

motif found: _________________________________________________________________

amphi regions: _______________________________________________________________

AI-Ind regions: ______________________________________________________________

MOM areas: ___________________________________________________________________

% mv unknownx.template (your lastname).predict

8) Extra Credit (Optional)

Select another of the unknown sequences and go through the secondary structure prediction process with that sequence. This time use the extension extra for your composite prediction file.

% mv unknownx.template (your lastname).extra

optional motif found: __________________________________________________________

optional amphi regions: ________________________________________________________

optional AI-Ind regions: _______________________________________________________

optional MOM areas: ____________________________________________________________

9) Finishing up.

Rename the report form so that it carries your last name, fill it in with the editor, pico, and rcp it over to the teacher account. Rcp over your find, predict and extra (optional) files as well.

% mv week4.week4 (your lastname).week4

% pico (your lastname).week4

% rcp (your lastname).week4 teacher@ribozyme:receive

% rcp (your lastname).find teacher@ribozyme:receive

% rcp (your lastname).predict teacher@ribozyme:receive

% rcp (your lastname).extra teacher@ribozyme:receive (optional)

This concludes your computing session for this week. Log off ribozyme, get out of the emulator and back to the overlapping windows screen.

% logout

Press the alt and x keys together. This will cause the screen to ask if you really want to exit the program. Respond with y to get out of the teemtalk emulator and return to the overlapping windows screen.