BC/BP 578

Week 9

Modelling Series

Homology Modelling. Combining sequence analysis and molecular modelling skills to explore the theoretical modelling of unsolved sequences based on solved ones.

Author:

Susan Jean Johns

Homology Modelling Background Information

Homology modelling combines two sets of computing tools: sequence analysis and molecular modelling. A successful homology modelling project requires applying techniques from both tool sets.

To carry out this process, you must know as much as possible about the protein sequence to be modelled. What are its characteristics? Is it related to any other proteins? Does it have similar functions to other types of proteins? Have those proteins' structures been solved? Does it contain any established motifs?

In the past, homology modelling was a long drawn-out process. The sequence to be modelled had to be fully explored with respect to other possible similar proteins. What were their shared characteristics and/or functions? Did they contain the same motifs? Had any of these proteins' structures been solved? What was the predicted secondary structure of the sequence to be modelled and how did that relate to that of the solved structures? Were there regions of insertions or deletions to be dealt with? Did these areas occur within functional regions or outside of them?

Now there is an automated modelling tool available. While it does not generate a homology model for every sequence that is sent in (only 25% of submitted requests produce one), it does produce enough models to make it a reasonable first approach for any homology modelling task. This modelling tool exists on the web and requires a web browser that can handle forms. It can be used in one of two ways. The user can either give the SwissProtein access code for the sequence to be modelled or enter the primary sequence of interest directly.

To give you an idea of what is involved in homology modelling, read through the description below.

Homology modelling (description)

Analysis of the growing number of determined crystal structures shows that there are relatively few folding patterns for proteins. Members of the same family have the same folding patterns. This makes sense since proteins performing the same functions should have the same general structure. In general, functional sites also maintain the same folding pattern. Such observations have become the foundation of the area known as homology modelling. In this area the alignment between primary sequences is used to create models of new structures from either known coordinate sets or prediction data.

The keys to this endeavor are understanding the nature of the known data set and its relationship to the sequence to be modelled. This requires collecting all possible data on both sequences. The coordinate data, with its orientation of functional sites and their interactions with one another, needs to become second nature to the modeller. All possible forms of comparison must be used to find and understand regions of similarity between the two sequences. Serious attempts at homology modelling require the modeller to do extensive literature searches on both the known coordinate protein and the unknown protein, plus extensive sequence analysis determinations. This will provide the broadest informational base possible to ensure the creation of an accurate model.

Necessary starting point

In order to do this sort of modelling, a pair of sequences needs to be in hand. The first is the unknown sequence to be modelled. The second is a sequence that has coordinate data associated with it upon which to base the model. This second sequence is found by doing database searches, either locally or via the network, to find a suitable match. Doing a GCG FASTA search on the NRL_3D database would be one possible means of finding such a coordinate data set. Another would be doing a GCG BLAST search over the nets to the PDB resources at NCBI. When sequences with lower degrees of similarity are matched to one another, there needs to be a sound reason for doing so. One such reason would be that they are known to share similar disulfide bridge patterns or physiological effects.

Relevant data on which to base alignments

The collection of relevant data for alignments begins with examining the sequence files for the two proteins of interest. Most primary sequence files from a database contain information in them on the features of the protein. From this source, the presence of sections of the protein not present in its mature form can be determined. X-ray structures are only determined on mature proteins. Any precursor portions of the protein need to be removed from the affected sequence before modelling can take place.
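Removing a precursor portion amounts to keeping only the residue range that the sequence file's feature table reports for the mature chain. The lines below are a minimal sketch of that step, assuming the range is given as 1-based, inclusive residue numbers. The function name and the example sequence are made up for illustration and do not come from any course file.

```python
def extract_mature(precursor, first, last):
    """Return residues first..last (1-based, inclusive) of a precursor sequence."""
    if not 1 <= first <= last <= len(precursor):
        raise ValueError("mature range lies outside the precursor sequence")
    return precursor[first - 1:last]


# Hypothetical 10-residue precursor whose mature chain spans residues 4 to 8.
print(extract_mature("MKTAYIAKQR", 4, 8))
```

Remember that once the precursor portion is removed, residue 1 of the trimmed sequence corresponds to residue "first" of the database entry, so any feature positions taken from the entry must be shifted accordingly.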

The following types of information are needed to more fully understand a coordinate data set. First, what are the secondary structural assignments for that protein? Has the author of the data made these assignments, and do they agree with assignments made by structure analysis programs such as DSSP and Define_S? If not, where do they disagree and why? Are disulfide bridges important to the integrity of the structure? If so, where are they and which residues are connected in this way? Have functional sites been identified in the structure file? Where are they and what structural elements are they composed of?
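One way to answer the "where do they disagree" question is to write each set of assignments as a string with one character per residue and compare the strings position by position. The sketch below assumes that encoding (for example H for helix, E for sheet, T for turn and - for anything else). The encoding and the function name are illustrative choices, not the output format of DSSP or Define_S.

```python
def disagreements(first, second):
    """List runs where two per-residue assignment strings differ.

    Each run is (start, end, first_codes, second_codes) with 1-based,
    inclusive residue numbers.
    """
    if len(first) != len(second):
        raise ValueError("assignment strings must cover the same residues")
    runs = []
    start = None
    for i, (x, y) in enumerate(zip(first, second), start=1):
        if x != y:
            if start is None:
                start = i  # a new run of disagreement begins here
        elif start is not None:
            runs.append((start, i - 1, first[start - 1:i - 1], second[start - 1:i - 1]))
            start = None
    if start is not None:  # disagreement runs to the end of the sequence
        runs.append((start, len(first), first[start - 1:], second[start - 1:]))
    return runs


# Hypothetical author and program assignments for a 10-residue stretch.
author = "HHHHEEEE--"
program = "HHHEEEEE-T"
for run in disagreements(author, program):
    print(run)
```

Short disagreements at the ends of helices and sheets are common; long runs deserve a look at the structure itself.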

Not all coordinate data sets give this type of information. The author of the data may expect that anyone interested in the structure would already be familiar with its known characteristics from the literature. There are other means to determine this type of information rather than depending on the author.

What does visual examination of the structure reveal about the interplay of secondary structural elements with one another? Are certain parts of the molecule occupied with the functional aspects of the protein, while others seem to be spatial elements serving to get the desired functional parts of the protein into the proper positions for interactions to take place? Are there complex folding patterns present in the structure that are representative of a given family of proteins and need to be maintained? Are there any structural elements in the protein that don't belong to this folding pattern? Have substrates been determined in the data set and how do they affect the conformation of the given functional site? Is the protein a composite of different types of functional sites? How do they interact with one another? What functional sites are present in the sequence to be modelled and are they the same as those in the coordinate data? What sort of additional information is available on the data set based on its primary sequence?

Running sequence analysis programs from GCG on the primary sequence extracted from the coordinate data file can lead to information on functional site patterns, possible external location of helical regions and the hydrophobic nature of the protein. GCG's Motifs will provide functional site information based on site patterns found in the primary sequence. The homegrown program, Amphi, will provide information on possible surface helices. A number of other programs can provide information on hydrophobicity: GCG's Pepplot and PeptideStructure plus the homegrown pieces GES and PK23. GCG's Helicalwheel can be used to show whether a located helical region has its component residues organized or not. Server resources can be used to augment locally available analysis techniques.

All the analyses that were run on the primary sequence of the coordinate data set need to be repeated on the sequence to be modelled. Although secondary structural prediction is still not very accurate, these determinations too must be made on the sequence. They provide an idea of where structural elements may exist. That information can be used in the alignment process. The program PeptideStructure can supply the needed data.
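To make the idea of a hydrophobicity determination concrete, here is a minimal sketch of a sliding-window hydropathy profile using the published Kyte-Doolittle scale. It is not a reimplementation of Pepplot, PeptideStructure, GES or PK23. The window length of 9 and the function name are assumptions chosen for illustration.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Return (center_position, mean_hydropathy) for every full window.

    Positions are 1-based; the window length is an illustrative default.
    """
    sequence = sequence.upper()
    if window < 1 or window > len(sequence):
        raise ValueError("window must fit inside the sequence")
    half = window // 2
    profile = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        mean = sum(KYTE_DOOLITTLE[res] for res in segment) / window
        profile.append((start + half + 1, mean))
    return profile


# Hypothetical peptide: a hydrophobic stretch followed by a charged one.
for position, value in hydropathy_profile("LLIVAVLLKDEEKRDE", window=5):
    print(position, round(value, 2))
```

Runs of strongly positive values suggest buried or membrane-associated segments, while strongly negative runs are more likely to lie on the surface.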

Creating Alignments

The basis of the entire process is the alignment of primary sequences. The quality of the alignment determines the quality of the final output product. Therefore, it is extremely important to be able to produce quality alignments.

A number of GCG programs can produce sequence alignments. These programs are GAP and BESTFIT for pairwise alignments, while PILEUP can provide multiple pairwise alignments. While GAP and BESTFIT both do pairwise alignments, they approach the problem differently. GAP produces the best overall alignment between the two sequences, while BESTFIT produces the best alignment of local regions. Because of this subtle difference, both programs should be run on the desired sequences, and the result that produces the highest degree of similarity should be used as the basis of the alignment process.
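The difference between an overall (global) and a local alignment can be seen in a small dynamic programming sketch. GAP is based on the Needleman-Wunsch method and BESTFIT on the Smith-Waterman method. The code below is only a simplified illustration of those two ideas: it uses a flat match/mismatch score and a single per-residue gap penalty rather than a comparison table with separate gap creation and extension penalties, and all scoring values are assumptions.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Best alignment score of a and b.

    local=False scores the sequences end to end (global);
    local=True scores only the best-matching region (local).
    """
    # prev holds the scores for the previous row of the scoring matrix.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diagonal, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)  # a local alignment may restart anywhere
                best = max(best, score)
            curr.append(score)
        prev = curr
    return best if local else prev[-1]


# Hypothetical sequences: the short one matches the middle of the long one.
print("global:", alignment_score("ACDEFGHIK", "DEFG"))
print("local: ", alignment_score("ACDEFGHIK", "DEFG", local=True))
```

The global score is dragged down by the gaps needed to span the longer sequence, while the local score reflects only the matching region. This is why the two programs can give different pictures of the same pair of sequences.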

The quality of an alignment depends on the comparison table used by the software as well as the approach used. A comparison table has been created based on the analysis of amino acid substitutions after superpositioning of homologous protein structures. Running any alignment program with the command switch, -data=genmoredata:structgappep.cmp, has that program use this comparison table instead of the default one based on evolutionary substitutions. Using this table can greatly increase the degree of similarity between the two sequences and likewise alter the nature of the alignment. The results of using this comparison table need to be compared with the earlier results to determine whether the general nature of the alignment is the same or different. Even if the nature of the alignment is altered and gaps have been introduced where none existed before, if the degree of similarity has increased and later analysis shows alignment between functional sites, this alignment should be used over one giving a lower degree of similarity with no gaps.

Sometimes alignments need to be forced, for example when disulfide bonds are important to the structure of the proteins being studied and no CYS alignments can be produced with the normal comparison tables. This can be done by changing the values assigned in the comparison table to favor certain types of matches. In the example case, the value assigned to a CYS-CYS match in the table could be increased two- to ten-fold until the desired number of CYS alignments were produced. Changing the comparison table values can be used to force pairing of other amino acid matches as well if the need arises. Normally, this option is only used as a last resort. However, you should know that it exists for problem alignments.

At times the similarity between the desired proteins is relatively low, and the nature of the functional sites and their possible consensus with one another is not well understood. When this occurs, one way to improve your understanding of what is going on is to attempt multiple pairwise alignments with PILEUP and use the consensus regions developed there as a guide to the desired modelling alignment. PILEUP is run with a number of sequences which the modeller feels are related to the unknown sequence and with the coordinate sequence as a reference to see how the new sequences affect the alignment. The trick here is to use only enough sequences to clarify and not muddy the alignment issue. Including a sequence in this set that is not strongly enough related to the unknown sequence will only make matters worse.
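The sketch below shows the idea of raising one entry in a comparison table. It assumes the table is held as a Python dictionary keyed by residue pairs, which is purely an illustrative stand-in and not the GCG .cmp file format. The score values are made up.

```python
def boost_pair(table, res_a, res_b, factor):
    """Return a copy of a comparison table with one residue pair scaled.

    Both (res_a, res_b) and (res_b, res_a) are scaled when present, so a
    symmetric table stays symmetric. The original table is left untouched.
    """
    keys = {(res_a, res_b), (res_b, res_a)}
    if not any(key in table for key in keys):
        raise KeyError("pair not present in the comparison table")
    boosted = dict(table)
    for key in keys:
        if key in boosted:
            boosted[key] = boosted[key] * factor
    return boosted


# Hypothetical fragment of a comparison table (values are illustrative only).
table = {("C", "C"): 1.5, ("C", "S"): 0.2, ("S", "C"): 0.2, ("S", "S"): 1.0}
forced = boost_pair(table, "C", "C", 4)
print(forced[("C", "C")])
```

Working on a copy keeps the normal table available, so the forced alignment can always be compared against the unforced one.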
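A consensus region is simply a stretch of alignment columns in which most of the sequences carry the same residue. The sketch below derives a consensus line from already-aligned sequences of equal length. The 60% threshold, the use of "." for columns without a consensus, and the treatment of "-" and "." as gap characters are assumptions made for illustration, not PILEUP behavior.

```python
from collections import Counter


def consensus(aligned, threshold=0.6):
    """Return a consensus string for equal-length aligned sequences.

    A column yields its most common residue when that residue occurs in at
    least `threshold` of all sequences; otherwise the column yields '.'.
    """
    if not aligned or len({len(seq) for seq in aligned}) != 1:
        raise ValueError("need at least one sequence, all of the same length")
    result = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c not in "-.")  # ignore gaps
        if counts:
            residue, n = counts.most_common(1)[0]
            result.append(residue if n / len(column) >= threshold else ".")
        else:
            result.append(".")
    return "".join(result)


# Hypothetical aligned fragments.
print(consensus(["ACDE", "ACDF", "ACGF"]))
print(consensus(["ACDE", "ACDF", "ACGF"], threshold=1.0))
```

Adding a poorly related sequence lowers the fraction in many columns at once, which is one way to see how such a sequence muddies the alignment.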

Working with alignments

Once an alignment has been derived, it should be checked by looking at how the additional information about the two sequences relates to one another via the alignment. This is done by creating a file in which the various features found for a sequence are compiled (see week 5 for instructions or for the example data set alignment shown on page 17 of that exercise).

A similar collection of data must be created for the sequence to be modelled. Here the information types are the same as that for the coordinate data set with one exception: that from the coordinate data itself, i.e., no x-ray secondary structural assignments or Define_S and DSSP results. To make future information alignments easier, put the noted features below the sequence line and the numbering system above it.

Now append the file containing the information on the sequence to be modelled to that for the coordinate data. Using the desired alignment file as a reference, edit the combined files to reflect this alignment. For a usable alignment, there should be agreement on as many recorded features as possible. If prediction values for the coordinate data don't agree with the reported data, does the modelled sequence show the same type of predictions in these disputed areas? Do the motif areas line up with one another?
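Checking whether features line up means translating a residue number in one sequence into the residue number it is aligned with in the other. The sketch below builds that translation from two aligned strings. It assumes gaps are written as "-" or "."; the function name and example sequences are hypothetical.

```python
def map_positions(aligned_a, aligned_b, gap_chars="-."):
    """Map 1-based residue numbers of sequence A onto sequence B.

    Residues of A that sit opposite a gap in B map to None.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    mapping = {}
    pos_a = pos_b = 0
    for char_a, char_b in zip(aligned_a, aligned_b):
        in_a = char_a not in gap_chars
        in_b = char_b not in gap_chars
        if in_a:
            pos_a += 1
        if in_b:
            pos_b += 1
        if in_a:
            mapping[pos_a] = pos_b if in_b else None
    return mapping


# Hypothetical alignment; does a motif at residues 3-5 of A line up in B?
mapping = map_positions("ACD-EF", "A-DGEF")
print([mapping[i] for i in range(3, 6)])
```

If a motif found in the coordinate sequence maps onto the residues where the same motif was found in the sequence to be modelled, the alignment agrees on that feature.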

Oftentimes in doing this process, the lengths of the two sequences are exactly the same and there is very good similarity between the two sequences. When this happens it is possible to just overlay the new sequence over the backbone sequence of the coordinate data. For an accurate model, these results should then be minimized or at least subjected to a pass or two by a distance geometry program to get the side chains in more realistic positions.
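Before relying on a simple overlay, it helps to put a number on how similar two equal-length sequences are. The sketch below computes percent identity for that gap-free case. Note that identity is a stricter measure than the similarity reported by the alignment programs, which also credits conservative substitutions. The function name and sequences are illustrative.

```python
def percent_identity(seq_a, seq_b):
    """Percent of positions with identical residues in two equal-length sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100 * matches / len(seq_a)


# Hypothetical pair differing at one position out of five.
print(percent_identity("ACDEF", "ACDQF"))
```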

At other times, there are deletions and/or inserts to worry about. When a gap occurs in the to-be-modelled sequence that doesn't exist in the coordinate data, that area needs to be looked at. Is it a loop section that easily could be clipped out, and the gap in the sequence removed by rotating one or two existing residues on either side of the gap, close enough to one another to form a normal peptide bond length between the moved residues? If so, feel free to make such a change in the structure of the coordinate data set prior to doing an overlay.

When an insert is called for, look at the area that the insert would be placed in. Is it on the surface of the molecule? Is the insert hydrophilic or hydrophobic, and in what direction does it face? What are the rest of the secondary structural elements in that area? Could a similar type of structural unit be created that would be consistent with the rest of this area and match the hydrophobicity requirements for the insert? The structural elements involved in the functional sites must be kept in the same general spatial positions. If this insert is outside of that restricted area, almost any configuration that meets the hydrophobicity requirements can be created in the desired region.
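Locating the deletions and inserts is the first step in dealing with them. The sketch below scans an alignment of the sequence to be modelled against the coordinate sequence and lists each event with the coordinate residue number where it occurs. The gap characters, the event names and the example sequences are assumptions made for illustration.

```python
def find_indels(aligned_model, aligned_template, gap_chars="-."):
    """List (kind, template_residue, length) for each gap run.

    'deletion'  : gap in the model; template_residue is the first template
                  residue that has no partner in the model.
    'insertion' : gap in the template; template_residue is the template
                  residue after which the extra model residues must go
                  (0 means before the first template residue).
    """
    if len(aligned_model) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    events = []
    template_pos = 0
    kind, anchor, length = None, 0, 0
    for m, t in zip(aligned_model, aligned_template):
        m_gap, t_gap = m in gap_chars, t in gap_chars
        if not t_gap:
            template_pos += 1
        if m_gap and not t_gap:
            current = "deletion"
        elif t_gap and not m_gap:
            current = "insertion"
        else:
            current = None
        if current != kind:
            if kind is not None:
                events.append((kind, anchor, length))
            kind, anchor, length = current, template_pos, 0
        if current is not None:
            length += 1
    if kind is not None:
        events.append((kind, anchor, length))
    return events


# Hypothetical alignment with one insert and one deletion.
print(find_indels("ACDEF--KLMNP", "AC--FGHKLMNP"))
```

Each reported residue number can then be examined in the structure to see whether the event falls in a surface loop or inside a functional region.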

Model building that requires modifying structural members of the functional site should be carefully thought out. Changes in these areas need to be coupled with changes elsewhere in the molecule that will allow the functional site to basically remain intact even if component helices or sheets are now longer or shorter than in the original. If these changes can be made to keep the truly vital residues in the same spatial locations, then the modifications should be tried and will produce results that can be tested visually to see if they are realistic.

Selected Molecule Information

In this week's exercise you are to use your selected molecule for some of the determinations. In weeks three and five you collected data on your selected molecule. To ensure that the results you produce match those expected, please use the following data for your selected molecules. Information is given on any changes that were made to the respective PDB files. PDB files are located in the $GRAD_DIR/week9m location.

selected molecule 1 (small subunit of ribulose bisphosphate carboxylase/oxygenase)
nrl_3d access code to use is 4rubs
4rubs.bdt is the converted MacroModel file
4rubs.coords is the PDB coordinate file for the molecule of interest.
rbs_cylsn is the SwissProtein access code to use for the homology modelling attempt.

selected molecule 2 (ras P21 transforming protein - mammalian)
nrl_3d access code to use is 6q21
6q21.bdt is the converted MacroModel file
6q21.coords is the PDB coordinate file for the molecule of interest.
cc42_caeel is the SwissProtein access code to use for the homology modelling attempt.

selected molecule 3 (basic fibroblast growth factor - mammalian)
nrl_3d access code to use is 4fgf
4fgf.bdt is the converted MacroModel file
4fgf.coords is the PDB coordinate file for the molecule of interest.
fgfh_npvac is the SwissProtein access code to use for the homology modelling attempt.

selected molecule 4 (fungal superoxide dismutase)
nrl_3d access code to use is 1sdya
1sdy.bdt is the converted MacroModel file
1sdy.coords is the PDB coordinate file for the molecule of interest.
sodc_bruab is the SwissProtein access code to use for the homology modelling attempt.

Week 9 Exercise

This series of exercises acquaints you with a number of different skills needed to conduct homology modelling of protein structures. These skills include: surfing the Internet, sequence analysis determinations, manipulating PDB files, visualizing structural data, and effective editing. Modelling of structures should only be undertaken when there is some chance for success. Determining whether a project has the potential to succeed requires sequence analysis skills. Evaluating the results and visualizing the structure requires molecular modelling techniques.

Homology modelling has changed with the advent of the modelling server, Swiss Model. This server is a painless way to try getting a theoretical model of a protein structure. While not always successful, the amount of effort involved in making the attempt (minimal) makes this step an excellent time investment. Because this is a network process subject to all the problems on the net (i.e., sites and/or gateways going down), start with a visit to the Swiss Model web site. Confirmation of the request submission and the results (good or bad) are shipped back via e-mail. The determination of a structure can take several hours on the server. Since there is additional work needed in order to visualize these results, send requests early.

Enter the access codes of SwissProtein sequences to be modelled directly into the request form of the Swiss Model server. A protein sequence has been picked to be modelled that is related to your selected molecule. Everyone will be attempting to generate a model from the sw:pol_flv primary sequence.

1) Surfing the Internet to Make Modelling Requests

From the Launcher window screen, go to the Swiss Model web site. This site contains a form system for submitting homology modelling requests to their modelling server. Select the NETSCAPE icon with the arrow and press the mouse button. The arrow changes to an hourglass while the connection is being made to the VADMS home page. Use the arrow to select the Bookmarks menu, and the Swiss-Model: Automate...ein Modelling Server entry from this menu.

You are now connected to the Swiss Model home page. Depending on network traffic, it may take a moment for their logo to appear. This is a form driven system. Move down the page until you reach a section of the page entitled How to Access Swiss-Model:. Select the First Approach mode phrase. This will move you to another part of the web site. Once there move down the page until you reach the These fields MUST be completed section. Here there are three boxes into which you need to enter three different pieces of information: your e-mail address, your name and a title for the requested modelling job.

Move the cursor to the beginning of the address box. The arrow changes into a symbol that allows you to fill in the box. Enter your e-mail address. Your address is as follows: bcsxx@ribozyme.vadms.wsu.edu. Replace the xx with the actual numbers for your account. Move to the beginning of the next box and enter your name. In the final box, enter a short title for your modelling attempt. You will repeat this process twice. The title you enter depends on which one of the sequences you want to be used in this request.

Move down the page again until you reach the Swiss-Prot ID code to model: box. The SwissProtein codes for the various selected molecule homology sequences are repeated below. The code for the sequence that everyone has to do is pol_flv.

rbs_cylsn is the SwissProtein access code for the homology modelling for selected molecule 1
cc42_caeel is the SwissProtein access code for the homology modelling for selected molecule 2
fgfh_npvac is the SwissProtein access code for the homology modelling for selected molecule 3
sodc_bruab is the SwissProtein access code for the homology modelling for selected molecule 4

Move down past the space for entering your own sequence to the button for submitting the request. Use the arrow to select the Send Request button. The system should put up a new screen at this point informing you that your request has been sent off. Depending on the network traffic it may take some time for this screen to appear.

Recently the Swiss-Model server has added a new feature. When the requested model is one which the server has done before and considers to be a good modelled structure, the model is put in a repository. Then, when that SwissProtein access code is used in a modelling request, the request is routed to the repository instead of repeating the creation of the model. You are asked if you want to download the coordinates at that point. If you experience this, go ahead and download the model and then ask your instructor for additional assistance.

After the new screen has appeared, select the Back button from the top of the screen. This should return you to the forms screen again. Check to see that the information in the address and name boxes is still ok. Change the title box to reflect information on the next sequence you want to have modelled. Move down to the ID code box. Change the contents of this box to have the access code for the next sequence. Move down to the Send Request button and ship off another request.

To exit the program select the File menu from the control bar at the top of the screen and select its Exit option. This will return you to the Launcher window screen.

2) Log into ribozyme.

From the Launcher window, select the RIBOZYME icon and click the mouse button twice. Successful connection to ribozyme is denoted by the appearance of a ribozyme information line and a login: prompt.

3) Create a subdirectory to keep this week's work in.

Create the following subdirectory in your account to store this week's computing activities and move over to it.

% mkdir week9

% cd week9

Copy over the data needed for this week's activities. In the example given below xxxx represents the PDB access code and yyyy represents the NRL_3D access code for your selected molecule.

% cp $GRAD_DIR/week9m/xxxx.homo-template .

% cp $GRAD_DIR/week9m/xxxx.pdb .

% cp $GRAD_DIR/week9m/yyyy.nrl_3d .

% cp $GRAD_DIR/week9m/week9m.week9m .


4) Collect needed information on your selected molecule from previous sources.

In week 5 you collected data on your selected molecule with respect to its secondary structure and various means of predicting that type of information. Go back and extract this data from your week5 subdirectory or your lab manual for that week. If necessary, refer to pages 7, 9, 10, 13, 15 and 16 of the week 5 exercise where the required data was supposed to be recorded.

The data you need on your selected molecule is: author secondary structure assignments, CF and GOR predictions from the p2s file, the nnpredict results (these were sent back in a mail message), dssp and define_s results (these are from work done on model1). This is all the data you used to generate the Molscript output on your selected molecule that week. Record the desired information below.

selected molecule data:

author secondary structure assignments:

helix locations: ___________________________________________________________

sheet locations: ___________________________________________________________

turn locations: ____________________________________________________________

CF secondary structure predictions:

helix locations: ___________________________________________________________

sheet locations: ___________________________________________________________

turn locations: ____________________________________________________________

GOR secondary structure predictions:

helix locations: ___________________________________________________________

sheet locations: ___________________________________________________________

turn locations: ____________________________________________________________

nnpredict results:

helix locations: ___________________________________________________________

sheet locations: ___________________________________________________________

turn locations: ____________________________________________________________

dssp assignments:

helix locations: ___________________________________________________________

sheet locations: ___________________________________________________________

turn locations: ____________________________________________________________

define_s assignments:

helix locations: ___________________________________________________________

sheet locations: ___________________________________________________________

turn locations: ____________________________________________________________

Examine the actual PDB data file (xxxx.coords). Check to make sure that the entire sequence listed in the SEQRES section is actually contained in the coordinate data section. Not all PDB data files start at the beginning of the primary sequence because of problems with residue location uncertainty that often occur at the beginning and end of molecular structures. The first command line given below will cause the grep results to be displayed a screen's worth at a time. In these command lines xxxx represents the PDB access code of your selected molecule. Record any observations you make about unusual behavior in your PDB file (coordinate data that doesn't start with the first residue in the sequence, coordinate data that doesn't end with the last residue in the sequence, any chain designations).

% grep " CA " xxxx.coords | more

% grep SEQRES xxxx.coords

PDB file observations: _______________________________________________

__________________________________________________________________________

__________________________________________________________________________

__________________________________________________________________________
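The same check can be scripted. The sketch below reads the fixed-column PDB records and reports, for each chain, how many residues SEQRES declares and which residue numbers actually have an alpha carbon in the coordinate section. It relies on the standard PDB column layout (chain identifier in column 12 and residue count in columns 14-17 of SEQRES; atom name in columns 13-16, chain in column 22 and residue number in columns 23-26 of ATOM). The function names are illustrative, and insertion codes are ignored for simplicity.

```python
def seqres_lengths(lines):
    """Return {chain: number of residues declared in SEQRES}."""
    lengths = {}
    for line in lines:
        if line.startswith("SEQRES"):
            lengths[line[11]] = int(line[13:17])
    return lengths


def ca_residue_ranges(lines):
    """Return {chain: (first, last, count)} for residues that have a CA atom."""
    seen = {}
    for line in lines:
        # ATOM records only, so calcium HETATM entries are not counted.
        if line.startswith("ATOM") and line[12:16] == " CA ":
            seen.setdefault(line[21], set()).add(int(line[22:26]))
    return {chain: (min(nums), max(nums), len(nums)) for chain, nums in seen.items()}


def coverage_report(lines):
    """Return one summary line per chain found in the coordinate section."""
    declared = seqres_lengths(lines)
    report = []
    for chain, (first, last, count) in sorted(ca_residue_ranges(lines).items()):
        report.append(
            "chain %r: SEQRES declares %s residues, CA atoms found for %d (residues %d to %d)"
            % (chain, declared.get(chain, "no"), count, first, last)
        )
    return report


# To use it on your own file (xxxx is your PDB access code):
#   with open("xxxx.coords") as handle:
#       print("\n".join(coverage_report(handle.read().splitlines())))
```

A count smaller than the SEQRES length, or a first residue number other than 1, is exactly the kind of observation to record above.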

You will need the primary sequence for your selected molecule. Beware of assuming that the sequences given in the NRL_3D database exactly match those contained in the PDB data file. The NRL_3D sequence may be shorter than the actual sequence being reported because of the data uncertainties. NRL_3D only reports the primary sequence for the residues with actual coordinate data. Needless to say, the numbering of such sequences will be off. In the command below yyyy represents the NRL_3D access code of your selected molecule.

% more yyyy.nrl_3d

NRL_3D file observations: ____________________________________________

__________________________________________________________________________

__________________________________________________________________________

__________________________________________________________________________
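When the NRL_3D sequence is a contiguous piece of the full sequence, the numbering difference is a single offset. The sketch below finds it by locating the shorter sequence inside the longer one. It assumes the piece occurs only once and has no internal breaks; a sequence with missing residues in the middle would need a proper alignment instead. The names and sequences are illustrative.

```python
def numbering_offset(full_sequence, partial_sequence):
    """Return the number to add to partial-sequence numbering to get full numbering."""
    offset = full_sequence.find(partial_sequence)
    if offset < 0:
        raise ValueError("partial sequence is not a contiguous piece of the full sequence")
    return offset


# Hypothetical case: coordinates begin at residue 4 of the full sequence.
offset = numbering_offset("MKTAYIAKQR", "AYIAK")
print("residue 1 of the partial sequence is residue", 1 + offset, "of the full sequence")
```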


5) Collect functional site information.

To gather more information on your selected molecule, your homology molecule and the sw:pol_flv sequence, run the GCG program Motifs on your desired sequence files. This will provide you with additional data on whether these molecules contain known functional site patterns and where they are located within these molecules. Activate the GCG software package. Use the example given below as a guide to using the Motifs software. User input is shown in bold type. Replace the yyyy with the access code of your selected molecule on the first pass through the program. On the second pass replace the yyyy.nrl_3d term with sw:zzzz where zzzz is the SwissProtein access code for the homology sequence you are supposed to use. On the third pass replace yyyy.nrl_3d with sw:pol_flv. Record the results of each pass through the program on page 12. You will need to know where any found patterns start and how long they are.

% gcg

% motifs

   MOTIFS looks for sequence motifs by searching through proteins for the
   patterns defined in the PROSITE Dictionary of Protein Sites and
   Patterns.  MOTIFS can display an abstract of the current
   literature on each of the motifs it finds. 

    MOTIFs from what protein sequence(s) ?  yyyy.nrl_3d <rtn>

    What should I call the output file (* yyyy.motifs *) ? <rtn>  

                   yyyy len:         yy ....................

                Total finds:          1
               Total length:         yy
            Total sequences:          1
             CPU time (sec):      01.57
                Output file: "/disk3/usr/local/people/bcsxx/week9/yyyy.motifs"

An example of an output file is given below and on the next page. The file starts out giving background information about the sequence that is being searched for functional motifs. It then lists any located patterns. The number listed on the line with the actual pattern is the starting position of that pattern in the primary sequence. After each found pattern, information is given on the functional site: references for the site and whether the pattern is specific or not.

 MOTIFS from: nrl_3d:2fgf

 Mismatches: 0                February 15, 1996 19:53  ..

  2FGF  Check: 1531  Length: 126   ! basic fibroblast growth factor - human

_____________________________________________________________________________

Hbgf_Fgf       GxLx(S,T,A,G)x6(D,E)CxFxE
               GxLx(A)x{6}(E)CxFxE
          62: AMKED GRLLASKCVTDECFFFE RLESN

   *****************************
   * HBGF/FGF family signature *
   *****************************

 Heparin-binding growth factors I and II (HBGF) [1,2] (also known as acidic and
 basic fibroblast growth factors (FGF) are structurally related mitogens which
 stimulate growth or differentiation of a wide variety of cells of mesodermal
 or neuroectodermal origin. These two proteins belong to a family of growth
 factors and oncogenes which is currently known [3,4] to include:

  - FGF-3 (int-2), induced by the integration of mouse mammary tumor virus
    (MMTV).
  - FGF-4 (hst-1; KS3), a transforming protein independently isolated from a
    human stomach tumor (hst-1) and from Kaposi's sarcoma (KS3).
  - FGF-5, an oncogene expressed in neonatal brain.
  - FGF-6 (hst-2), a transforming protein that exhibits strong mitogenic and
    angiogenic properties.
  - FGF-7 or keratinocyte growth factor (KGF), a paracrine effector of normal
    epithelial cell proliferation.
  - FGF-9 or glia-activating factor (GAF), a heparin-binding growth factor that
    may have a role in glial cell growth and differentiation during
    development.

 From the sequences of these related proteins, we have derived a signature
 pattern which includes one of the two conserved cysteine residues.

 -Consensus pattern: G-x-L-x-[STAG]-x(6)-[DE]-C-x-F-x-E
 -Sequences known to belong to this class detected by the pattern: ALL.
 -Other sequence(s) detected in SWISS-PROT: NONE.
 -Last update: June 1994 / Text revised.

 [ 1] Burgess W.H., Maciag T.
      Annu. Rev. Biochem. 58:575-606(1989).
 [ 2] Thomas K.A.
      Trends Biochem. Sci. 13:327-328(1988).
 [ 3] Benharroch D., Birnbaum D.
      Isr. J. Med. Sci. 26:212-219(1990).
 [ 4] Miyamoto M., Naruo K.-I., Seko C., Matsumoto S., Kondo T., Kurokawa T.
      Mol. Cell. Biol. 13:4251-4259(1993).

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
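To see how a consensus pattern like the one in the example output locates a site, the pattern can be translated into a regular expression and searched against a sequence. The sketch below handles only the common PROSITE elements (single residues, x, [allowed] and {forbidden} sets, repeat counts, and the < and > anchors). It is an illustration of the idea, not a description of how the Motifs program is implemented. The sequence used is just the fragment shown in the example output, so the position found is relative to that fragment.

```python
import re

ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        found = ELEMENT.fullmatch(element)
        if found is None:
            raise ValueError("unsupported pattern element: %r" % element)
        start, core, low, high, end = found.groups()
        if core == "x":
            core = "."                         # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"     # any residue except these
        repeat = ""
        if low:
            repeat = "{%s,%s}" % (low, high) if high else "{%s}" % low
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


regex = prosite_to_regex("G-x-L-x-[STAG]-x(6)-[DE]-C-x-F-x-E")
fragment = "AMKEDGRLLASKCVTDECFFFERLESN"   # fragment from the example output
hit = re.search(regex, fragment)
print(regex)
print(hit.start() + 1, hit.group())        # 1-based start within the fragment
```

The length of the match gives the pattern length you are asked to record along with its starting position.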

Record the location of any motifs found within the sequences used.

selected molecule results: ________________________________________________

homology molecule results: ________________________________________________

pol_flv results: _________________________________________________________


6) Collect needed information on the homology sequence.

At the beginning of this exercise (page 6), you learned the SwissProtein access code for the protein you are to use for homology modelling. This protein is related to your selected molecule. Go back to page 6 and record that access code in the space provided below.

SwissProtein access code for protein to be used for homology modelling:

Now use this access code to generate the data needed for the comparisons by running the PeptideStructure program. Later in this step you will also fetch the sequence and convert it with readseq so that msu can send it to the nnpredict server for another secondary structure prediction. In the example, zzzz represents the access code for your homology sequence.

% peptidestructure

PeptideStructure makes secondary structure predictions for a peptide
sequence. The predictions include (in addition to alpha, beta, coil, and
turn) measures for antigenicity, flexibility, hydrophobicity, and surface
probability. PlotStructure displays the predictions graphically.

 PEPTIDESTRUCTURE for what peptide sequence ?  sw:zzzz<rtn>

                  Begin (* 1 *) ?  <rtn>
                End (*   xxx *) ?  <rtn>

 Calculate hydrophilicity according to H)opp-Woods or K)yte-Doolittle

 Please choose one (* K *) :  <rtn>

 What should I call the output file (* zzzz.p2s *) ?  <rtn>

Use the more command to display the results of this run one screen's worth at a time. Record the Chou-Fasman and GOR prediction results below and on the next page.

% more zzzz.p2s

homology sequence prediction results (Chou-Fasman):

helix locations: ___________________________________________________________

sheet locations: ___________________________________________________________

turn locations: ____________________________________________________________

homology sequence prediction results (GOR):

helix locations: ___________________________________________________________

sheet locations: ___________________________________________________________

turn locations: ____________________________________________________________

Now use readseq to convert the sequence file into one that can be used in the msu mail server program. First fetch the sequence over into your account. In the example aaaa represents the access code of your homology sequence.

% fetch -in=sw:aaaa

After the fetch process is completed, the homology sequence will be in your account in a file named aaaa.sw. Use that file as the input to the readseq program.

% readseq
readSeq (1Feb93), multi-format molbio sequence reader.

Name of output file (?=help, defaults to display):
aaaa.pro<rtn>
         1. IG/Stanford           10. Olsen (in-only)
         2. GenBank/GB            11. Phylip3.2
         3. NBRF                  12. Phylip
         4. EMBL                  13. Plain/Raw
         5. GCG                   14. PIR/CODATA
         6. DNAStrider            15. MSF
         7. Fitch                 16. ASN.1
         8. Pearson/Fasta         17. PAUP/NEXUS
         9. Zuker (in-only)       18. Pretty (out-only)

Choose an output format (name or #):
8<rtn>

Name an input sequence or -option:
aaaa.sw<rtn>

Name an input sequence or -option:
<rtn>

Now display the contents of the output file to ensure that the conversion has worked. There should be a single line of background information followed by the actual sequence in lines of 50 characters each.

% cat aaaa.pro
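
The same check can be sketched in code. The Python fragment below reads Pearson/Fasta text of the kind readseq writes and reports the header, the number of residues, and the longest sequence line. The function name and the header text in the sample are illustrative assumptions and not actual readseq output; only the 50 residues of the pol_flv fragment are taken from this exercise.

```python
def read_fasta(text):
    """Split single-record Pearson/Fasta text into (header, sequence, widest line)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("first line is not a '>' header line")
    body = lines[1:]
    if any(line.startswith(">") for line in body):
        raise ValueError("more than one record found")
    sequence = "".join(body)
    widest = max((len(line) for line in body), default=0)
    return lines[0][1:], sequence, widest


# Illustrative sample: the header wording is invented for this sketch.
sample = """>pol_flv, 50 residues (illustrative header)
kirrvrartpppepritlriggqpvtflvdtgaqhsvltrpdgplsdrta
"""
header, sequence, widest = read_fasta(sample)
print(header)
print(len(sequence), "residues; longest line:", widest)
```

A file that fails the header test, or whose residue count does not match the length reported by fetch, should be converted again before it is loaded into msu.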

Use the converted sequence in the msu utility to make the request for the secondary structure prediction from the nnpredict server. The example given below should be used as a guide for using the msu utility. In the guide xxxx.pro represents the name of your converted sequence file.

% msu

**************** MSU (Mail Server Utility) *********************
Version 1.4 (Apr 1994) - R. Fuchs, EMBL Data Library

Sequence loaded: None

Options:
  L - Load sequence
  H - Retrieve HELP file from server   R - Register with service
  O - Other options                    Q - Quit (exit program)
Enter L, H, R, O, or Q to quit: l<rtn>
Enter file name: xxxx.pro<rtn>


**************** MSU (Mail Server Utility) *********************
Version 1.4 (Apr 1994) - R. Fuchs, EMBL Data Library

Sequence loaded: xxxx.pro
Protein, zzz residues, 1-MRLTQ...EGGRY-zzz

Services: (press RETURN for next page)
  1 - EMBL BLITZ server
  2 - BIOCCELERATOR
  3 - FLASH
  4 - NCBI BLAST
  5 - EMBL FASTA server
  6 - NBRF/PIR FASTA server
  7 - CBRG (ETH Zuerich)
  8 - BLOCKS server
  9 - MotifFinder
 10 - ProteinPredict

Options:
  L - Load sequence                    S - Set sequence limits (1 - zzz)
  H - Retrieve HELP file from server   R - Register with service
  O - Other options                    Q - Quit (exit program)

Enter 1-15, L, S, H, R, O, or Q to quit:<rtn>


**************** MSU (Mail Server Utility) *********************
Version 1.4 (Apr 1994) - R. Fuchs, EMBL Data Library

Sequence loaded: xxxx.pro
Protein, zzz residues, 1-MRLTQ...EGGRY-zzz

Services: (press RETURN for next page)
 11 - nnpredict
 12 - GenomeNet BLAST
 13 - GenomeNet FASTA
 14 - GenQuest (Q)
 15 - ProDom

Options:
  L - Load sequence                    S - Set sequence limits (1 - zzz)
  H - Retrieve HELP file from server   R - Register with service
  O - Other options                    Q - Quit (exit program)

Enter 1-15, L, S, H, R, O, or Q to quit: 11<rtn>


Service nnpredict
Neural network secondary protein structure prediction

Select prediction options:
  1 - n
  2 - a
  3 - b
  4 - a/b

1-4 or ? [1]:<rtn>
Request mailed to nnpredict@celeste.ucsf.edu at Sat Feb 17 17:02:05 1996
The reply should soon arrive in your mailbox

PRESS <RETURN> TO CONTINUE...<rtn>

**************** MSU (Mail Server Utility) *********************
Version 1.4 (Apr 1994) - R. Fuchs, EMBL Data Library

Sequence loaded: xxxx.pro
Protein, zzz residues, 1-MRLTQ...EGGRY-zzz

Services: (press RETURN for next page)
 11 - nnpredict
 12 - GenomeNet BLAST
 13 - GenomeNet FASTA
 14 - GenQuest (Q)
 15 - ProDom

Options:
  L - Load sequence                    S - Set sequence limits (1 - zzz)
  H - Retrieve HELP file from server   R - Register with service
  O - Other options                    Q - Quit (exit program)

Enter 1-15, L, S, H, R, O, or Q to quit: q<rtn>

In a few minutes, check pine for a response to your nnpredict request. For the moment, ignore any responses you have received from the Swiss Model server. Record the results of the nnpredict request here when they come back.

homology sequence prediction results (nnpredict):

helix locations: ___________________________________________________________

sheet locations: ___________________________________________________________

turn locations: ____________________________________________________________


7) Collect another secondary structure prediction off the Internet.

There is another secondary structure prediction service available on the Internet. PredictProtein is an electronic mail service by the Protein Design Group at the European Molecular Biology Laboratory, Heidelberg, Germany. A multiple sequence alignment is performed by a weighted dynamic programming method (MaxHom, R.Schneider) and a secondary structure prediction is produced by the profile network method (PHD). The prediction is made by a new method rated at an expected 70.2% average accuracy for the three states helix, strand, and loop (Rost and Sander, PNAS).

This information will help in creating the final data collection file for all three of the proteins that you are interested in. Convert your selected molecule sequence and the pol_flv sequence into msu-usable files using the readseq instructions given in the previous section. Then get back into the msu utility and submit requests to the PredictProtein server (listed in the msu menu as ProteinPredict) for all three proteins: pol_flv, your selected molecule, and your homology protein. The following abridged screen trace shows this process with the pol_flv sequence. Replace the name and address information your instructor supplied with your own.

% msu

**************** MSU (Mail Server Utility) ********************
Version 1.4 (Apr 1994) - R. Fuchs, EMBL Data Library
Sequence loaded: None

Options:
  L - Load sequence
  H - Retrieve HELP file from server   R - Register with service
  O - Other options                    Q - Quit (exit program)

Enter L, H, R, O, or Q to quit: l <rtn>
Enter file name: pol_flv.pro<rtn>

**************** MSU (Mail Server Utility) *********************
Version 1.4 (Apr 1994) - R. Fuchs, EMBL Data Library

Sequence loaded: pol_flv.pro
Protein, 50 residues, 1-KIRRV...SDRTA-50

Services: (press RETURN for next page)
  1 - EMBL BLITZ server
  2 - BIOCCELERATOR
  3 - FLASH
  4 - NCBI BLAST
  5 - EMBL FASTA server
  6 - NBRF/PIR FASTA server
  7 - CBRG (ETH Zuerich)
  8 - BLOCKS server
  9 - MotifFinder
 10 - ProteinPredict

Options:
  L - Load sequence                    S - Set sequence limits (1 - 231)
  H - Retrieve HELP file from server   R - Register with service
  O - Other options                    Q - Quit (exit program)

Enter 1-15, L, S, H, R, O, or Q to quit: 10

Service ProteinPredict
Neural network secondary protein structure prediction

Your name (or ?): Susan Johns <rtn>
Your address (part1) (or ?): WSU VADMS Center <rtn>
Your address (part2) (or ?): Pullman WA 99164-4660 <rtn>
Your email address (or ?): prcadams@ribozyme.vadms.wsu.edu <rtn>
Enter sequence description (or ?): pol_flv protein <rtn>

Request mailed to PredictProtein@embl-heidelberg.de at Wed Sep 28 11:42:36 1996
The reply should soon arrive in your mailbox
PRESS <RETURN> TO CONTINUE... <rtn>

**************** MSU (Mail Server Utility) *********************
Version 1.4 (Apr 1994) - R. Fuchs, EMBL Data Library

Sequence loaded: pol_flv.pro
Protein, 50 residues, 1-KIRRV...SDRTA-50

//////////////////////////////////////////////////////////////////////////////////////

At this point, load each of the other two protein sequences in turn and send off a PredictProtein request for each. For every request you will need to load the sequence, select the desired service, and fill in the required information lines again. After the request for the third protein has been sent off, exit from the msu program by entering q at the selection prompt.

Enter 1-15, L, S, H, R, O, or Q to quit: q <rtn>

Your prompt will be returned, and you then wait for the results of your requests (sometimes this takes overnight). PredictProtein should send two messages for each request: one that simply confirms the job was submitted and another with the actual results. When the predictions have been returned to you, export them and print them off. You will need to read the predictions carefully in order to understand them. The output is very long, and the most relevant part does not appear until near the very end. Read through it all; it should make sense by the time you reach the most probable secondary structure estimate (they call these "subset" predictions) near the end. Abridged results for the pol_flv protein from the PredictProtein server follow. Notice that PredictProtein assembles a GCG-style MSF alignment by default:

From phd@EMBL-Heidelberg.de Mon Dec 16 10:38:24 1996
Date: Mon, 16 Dec 1996 18:06:19 GMT
From: phd@EMBL-Heidelberg.de
To: teacher@ribozyme.vadms.wsu.edu
Subject: Predict-Protein

The following information has been received by the server:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

________________________________________________________________________

reference predict_e12656 (Mon Dec 16 19:02:31 MET 1996)
from teacher@ribozyme.vadms.wsu.edu
resp MAIL
orig MAIL
# pol_flv protein
kirrvrartpppepritlriggqpvtflvdtgaqhsvltrpdgplsdrta
________________________________________________________________________

////////////////////////////////////////////////////////////////////////

--- MAXHOM ALIGNMENT: IN MSF FORMAT
MSF of: /home/phd/server/work/predict_e12656_12714.hssp from:    1 to: 50
 /home/phd/server/work/predict_e12656_12714.ret_msf  MSF:   50  Type: P 16-Dec-.


 Name: predict_e126    Len:    50  Check: 8529  Weight:  1.00
 Name: pol_flv         Len:    50  Check: 8529  Weight:  1.00
 Name: pol_mlvav       Len:    50  Check: 8518  Weight:  1.00
 Name: pol_mlvrd       Len:    50  Check: 8347  Weight:  1.00
 Name: pol_mlvmo       Len:    50  Check: 8011  Weight:  1.00
 Name: pol_mlvff       Len:    50  Check: 8474  Weight:  1.00
 Name: pol_mlvf5       Len:    50  Check: 8341  Weight:  1.00
 Name: pol_mlvfp       Len:    50  Check: 8341  Weight:  1.00
 Name: pol_baevm       Len:    50  Check: 8971  Weight:  1.00
 Name: pol_galv        Len:    50  Check: 5169  Weight:  1.00
 Name: pol_sivm1       Len:    50  Check: 6384  Weight:  1.00
 Name: pol_sivmk       Len:    50  Check: 6384  Weight:  1.00
 Name: pol_biv27       Len:    50  Check: 9366  Weight:  1.00
 Name: pol_biv06       Len:    50  Check: 9366  Weight:  1.00

//

              1                                                   50
predict_e126  KIRRVRARTP PPEPRITLRI GGQPVTFLVD TGAQHSVLTR PDGPLSDRTA
pol_flv       KIRRVRARTP PPEPRITLRI GGQPVTFLVD TGAQHSVLTR PDGPLSDRTA
pol_mlvav     .....QGQEP PPEPRITLTV GGQPVTFLVD TGAQHSVLTQ NPGPLSDRSA
pol_mlvrd     .....QGQEP PPEPRITLKV GGQPVTFLVD TGAQHSVLTQ NPGPLSDRSA
pol_mlvmo     .....QGQEP PPEPRITLKV GGQPVTFLVD TGAQHSVLTQ NPGPLSDKSA
pol_mlvff     ..QGGQGQEP PPEPRITLRV GGQPVTFLVD TGAQHSVLTQ NPGPLSDKSA
pol_mlvf5     ..QGGQGQEP PPEPRITLKV GGQPVTFLVD TGAQHSVLTQ NPGPLSDKSA
pol_mlvfp     ..QGGQGQEP PPEPRITLKV GGQPVTFLVD TGAQHSVLTQ NPGPLSDKSA
pol_baevm     .....QGSGA PPEPRLTLSV GGHPTTFLVD TGAQHSVLTK ANGPLSSRTS
pol_galv      .....QGSDP LPEPRVTLTV EGTPIEFLVD TGAEHSVLTQ PMGKVGSR..
pol_sivm1     .......... .RRPVVTAHI EGQPVEVLLD TGADDSIVTG ilGP......
pol_sivmk     .......... .RRPVVTAHI EGQPVEVLLD TGADDSIVTG ilGP......
pol_biv27     .......... DKQPFIKVFI GGRWVKGLVD TGADEVVL.. ..........
pol_biv06     .......... DKQPFIKVFI GGRWVKGLVD TGADEVVL.. ..........

____________________________________________________________________

////////////////////////////////////////////////////////////////////////

About the protein
~~~~~~~~~~~~~~~~

HEADER     /home/phd/server/work/predict_e12656_127
COMPND
SOURCE
AUTHOR
SEQLENGTH    50
NCHAIN        1 chain(s) in predict_e12656_12714 data se
NALIGN       13
(=number of aligned sequences in HSSP file)

Abbreviations: PHDsec
~~~~~~~~~~~~~~~~~~~~

sequence:
AA : amino acid sequence
secondary structure:
HEL: H=helix, E=extended (sheet), blank=other (loop)
PHD: Profile network prediction HeiDelberg
Rel: Reliability index of prediction (0-9)
detail:
prH: 'probability' for assigning helix
prE: 'probability' for assigning strand
prL: 'probability' for assigning loop
note: the 'probabilites' are scaled to the interval 0-9, e.g.,
prH=5 means, that the first output node is 0.5-0.6
subset:
SUB: a subset of the prediction, for all residues with an expected
average accuracy > 82% (tables in header)
note: for this subset the following symbols are used:
L: is loop (for which above " " is used)
".": means that no prediction is made for this residue, as the
reliability is:  Rel < 5

Abbreviations: PHDacc
~~~~~~~~~~~~~~~~~~~~

solvent accessibility:
3st: relative solvent accessibility (acc) in 3 states:
b = 0-9%, i = 9-36%, e = 36-100%.
PHD: Profile network prediction HeiDelberg
Rel: Reliability index of prediction (0-9)
P_3: predicted relative accessibility in 3 states
note: for convenience a blank is used intermediate (i).
10st:relative accessibility in 10 states:
= n corresponds to a relative acc. of n*n %
subset:
SUB: a subset of the prediction, for all residues with an expected
average correlation > 0.69 (tables in header)
note: for this subset the following symbols are used:
"I": is intermediate (for which above " " is used)
".": means that no prediction is made for this residue, as the
reliability is: Rel < 4

Abbreviations: PHDhtm
~~~~~~~~~~~~~~~~~~~~

secondary structure:
HL:  T=helical transmembrane region, blank=other (loop)
PHD: Profile network prediction HeiDelberg
PHDF:filtered prediction, i.e., too long transmembrane segments
are split, too short ones are deleted
Rel: Reliability index of prediction (0-9)
detail:
prH: 'probability' for assigning helical transmembrane region
prL: 'probability' for assigning loop
note: the 'probabilites' are scaled to the interval 0-9, e.g.,
prH=5 means, that the first output node is 0.5-0.6
subset:
SUB: a subset of the prediction, for all residues with an expected
average accuracy > 82% (tables in header)
note: for this subset the following symbols are used:
L: is loop (for which above " " is used)
".": means that no prediction is made for this residue, as the
reliability is:  Rel < 5


protein:       predict        length       50

                    ....,....1....,....2....,....3....,....4....,....5....,....6
           AA      |KIRRVRARTPPPEPRITLRIGGQPVTFLVDTGAQHSVLTRPDGPLSDRTA|
           PHD sec |              EEEEEE   EEEEEEE     EEE            |
           Rel sec |96589999999986499883794299998399985424269997458999|
   detail:
           prH sec |00000000000000000000000000000000000000000000000000|
           prE sec |01200000000011689886103589888600002656420001220000|
           prL sec |97789999999987300103796400000389887333579998678999|
   subset: SUB sec |LLLLLLLLLLLLLL.EEEE.LL..EEEEE.LLLLL....LLLLL.LLLLL|

   ACCESSIBILITY
   3st:    P_3 acc |eeeeeeeeeeeeeeebebebeeeebebbbebbbebbbbbeeeee eeeee|
   10st:   PHD acc |99789799977977606060796606000600070000087877478799|
           Rel acc |92687559734762152624442132665102341255154735036568|
   subset: SUB acc |e.eeeeeee.eee..b.b.bee....bbb....e..bb.eee.e..eeee|

_____________________________________________________________________

-------------------------------------------------------------------------
--- PredictProtein: NEWS from November, 1996                          ---
---                                                                   ---
--- You can now query the minimal waiting time before you may obtain  ---
--- a result from PredictProtein:                                     ---
---   http://www.embl-heidelberg.de/predictprotein/PPstatus.log       ---
---                                                                   ---
--- Note: in general weekends are relatively empty, Fridays relatively---
--- busy.                                                             ---

Record the results of the PredictProtein requests here when they come back. Continue on with the rest of the exercise while you wait.
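
Reading residue ranges off a prediction line by eye is error-prone, so here is a small Python sketch that does the bookkeeping. It rebuilds the SUB sec line for pol_flv from the PHD sec and Rel sec lines using the rule stated in the abbreviations (no prediction where Rel < 5, L for loop), then lists the residue ranges of each state. The function names are illustrative, and the PHD sec string is rebuilt from its run lengths so that the blanks are unambiguous. Note that PHD reports only helix, strand, and loop, so there is no separate turn state to extract.

```python
def subset_prediction(phd, rel, cutoff=5):
    """Keep residues with reliability >= cutoff; blank (loop) becomes 'L'."""
    if len(phd) != len(rel):
        raise ValueError("prediction and reliability lines differ in length")
    out = []
    for state, digit in zip(phd, rel):
        if int(digit) < cutoff:
            out.append(".")                 # too unreliable: no prediction
        else:
            out.append("L" if state == " " else state)
    return "".join(out)


def segments(states, symbol):
    """Return 1-based (start, end) residue ranges where `symbol` occurs."""
    runs, start = [], None
    for pos, state in enumerate(states, start=1):
        if state == symbol and start is None:
            start = pos
        elif state != symbol and start is not None:
            runs.append((start, pos - 1))
            start = None
    if start is not None:
        runs.append((start, len(states)))
    return runs


# PHD sec line for pol_flv: 14 blanks, 6 E, 3 blanks, 7 E, 5 blanks, 3 E, 12 blanks
phd_sec = " " * 14 + "E" * 6 + " " * 3 + "E" * 7 + " " * 5 + "E" * 3 + " " * 12
rel_sec = "96589999999986499883794299998399985424269997458999"

sub_sec = subset_prediction(phd_sec, rel_sec)
print(sub_sec)
print("sheet, all residues:   ", segments(phd_sec, "E"))
print("sheet, reliable subset:", segments(sub_sec, "E"))
print("helix, all residues:   ", segments(phd_sec, "H"))
```

For pol_flv the reliable subset shortens the first two strands and drops the third, which is worth noting when you fill in the sheet locations.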

pol_flv sequence prediction results (PredictProtein):

helix locations: ___________________________________________________________

sheet locations: ___________________________________________________________

turn locations: ____________________________________________________________

selected molecule sequence prediction results (PredictProtein):

helix locations: ___________________________________________________________

sheet locations: ___________________________________________________________

turn locations: ____________________________________________________________

homology sequence prediction results (PredictProtein):

helix locations: ___________________________________________________________

sheet locations: ___________________________________________________________

turn locations: ____________________________________________________________


8) Creating alignments

With the data collection phase mostly finished, move on to creating an alignment. This is done by running the GCG programs GAP and BESTFIT on the pair of sequences to be aligned. An example of using GAP is given below as a guide. User input is shown in bold type. In the example, xxxx.nrl_3d represents the selected molecule and yyyy.sw represents the homology sequence. The term zzzz.pair stands for the name of the output file. Name this file to reflect the program used; an appropriate filename would be rbs-gap.pair.

% gap

GAP uses the algorithm of Needleman and Wunsch to find the alignment of
two complete sequences that maximizes the number of matches and minimizes
the number of gaps. 

 GAP of what sequence 1 ?  xxxx.nrl_3d<rtn>
                 Begin (* 1 *) ? <rtn>
               End (*    zz *) ? <rtn>

 to what sequence 2 (* xxxx.nrl_3d *) ? yyyy.sw<rtn>
                 Begin (* 1 *) ?  <rtn>
               End (*    zz *) ?  <rtn>

 What is the gap creation penalty (* 12 *) ? <rtn>

 What is the gap extension penalty (* 4 *) ?  <rtn>

 What should I call the paired output display file (* xxxx.pair *) ? zzzz.pair<rtn>

 Aligning ...-.
 Aligning ...-.

          Gaps:     0
       Quality:  95.1
 Quality Ratio: 1.419
  % Similarity: 95.522
        Length:    zz
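
For readers who want to see the idea behind GAP, the following Python sketch implements a bare-bones Needleman and Wunsch global alignment. It is an illustration only and does not reproduce GAP's results: it scores +1 for a match, -1 for a mismatch, and a flat -2 for every gap position, whereas GAP uses a symbol comparison table together with separate gap creation and gap extension penalties. The function name and the scoring values are assumptions chosen for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return (score, aligned a, aligned b)."""

    def pair_score(x, y):
        return match if x == y else mismatch

    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]),
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                score[i][j] == score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1])):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(".")               # GCG output shows gaps as periods
            i -= 1
        else:
            out_a.append(".")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
```

Raising the gap penalty makes the method prefer mismatches over gaps, which is the same trade-off the gap creation and extension prompts control in GAP.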

Run the sequence pair through both programs. Repeat this process using the command switch -data=genmoredata:structgappep.cmp on each program. Record the percentage of identity given by each pass through the programs below:

regular gap %____________________ 	regular bestfit %_____________________

gap -data= %_____________________ 	bestfit -data= %______________________

Look at the actual output files for the four runs. At times, the only difference the special comparison table makes in the output is an increase in the number of : marks in the alignment. Even so, this slight change can raise the similarity percentage by 3 to 10%. At other times, using this comparison table greatly alters the alignment, introducing gaps and insertions and greatly increasing the degree of similarity between the two sequences. The effect produced depends entirely on the sequences being worked with. An example of an alignment output file is given below. Produce hard copy of your alignments to help in deciding which one to use as the basis of comparison between your selected and homology sequences.

GAP of: 1gf2.seq  check: 8740  from: 1  to: 67

//////////////////////////////////////////////////////////////////////////
 Symbol comparison table: /disk2/usr/local/soft/seq/gcg/gcgcore/data/rundata/nws
 CompCheck: 1254

///////////////////////////////////////////////////////////////////////////
 Percent Similarity: 95.522   Percent Identity: 91.045

 1gf2.seq x igf2-h.seq     October 6, 1992  07:53  ..

               .         .         .         .         .
    1 AYRPSETLCGGELVDTLQFVCGDRGFYFSRPASRVSRRSRGIVEECCFRS 50
      |||||||||||||||||||||::||||||||.||:.||||||||||||||
    1 AYRPSETLCGGELVDTLQFVCDGRGFYFSRPSSRINRRSRGIVEECCFRS 50
               .
   51 CDLALLETYCATPAKSE 67
      |||||||||||.|||||
   51 CDLALLETYCAAPAKSE 67

% lpr *.pair

Examine your four files carefully. The | symbol between the two aligned sequences stands for an identity, : for a high degree of similarity, . for a lower degree of similarity, and a blank space for no similarity. Pick as the basis for your comparison alignment the output file that has the highest percent identity with the fewest gaps in the selected molecule sequence. When the number of gaps is equal, pick the one with the shortest total gap length. Do not count as gaps the sections with periods that occur before and after the NRL_3D sequence in the alignment.
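
The percentages in a pair file can be reproduced from the alignment itself, which is a useful cross-check when comparing your four runs. The Python sketch below recomputes them for the 1gf2.seq x igf2-h.seq example. Identity is the share of aligned positions holding the same residue. For similarity the sketch counts the | and : positions of the match line; that reproduces the reported figure for this gap-free example, but treat the rule as an observation about this output and not as GCG's exact definition. The function names are illustrative.

```python
def percent_identity(seq1, seq2, gap="."):
    """Percent of gap-free aligned columns holding identical residues."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(seq1, seq2) if a != gap and b != gap]
    same = sum(1 for a, b in columns if a == b)
    return 100.0 * same / len(columns)


def percent_from_match_line(match_line, symbols):
    """Percent of columns whose match symbol is one of `symbols`.

    Assumes an alignment without gaps, as in the example used here.
    """
    hits = sum(1 for mark in match_line if mark in symbols)
    return 100.0 * hits / len(match_line)


seq1 = "AYRPSETLCGGELVDTLQFVCGDRGFYFSRPASRVSRRSRGIVEECCFRS" "CDLALLETYCATPAKSE"
seq2 = "AYRPSETLCGGELVDTLQFVCDGRGFYFSRPSSRINRRSRGIVEECCFRS" "CDLALLETYCAAPAKSE"
# Match line from the example, written in groups of ten columns.
match = ("||||||||||" "||||||||||" "|::|||||||" "|.||:.||||" "||||||||||"
         "||||||||||" "|.|||||")

print("identity:   %.3f" % percent_identity(seq1, seq2))
print("similarity: %.3f" % percent_from_match_line(match, "|:"))
```

The two printed values agree with the Percent Identity and Percent Similarity lines of the example output, 91.045 and 95.522.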

Record below the name of the pair file you would use as the basis for a comparison study between your selected molecule and its homology sequence. Give the reasons why you chose this particular pair file as well.

chosen pair file: _________________________________________________________

reasons for selection: ___________________________________________________

______________________________________________________________________________

______________________________________________________________________________


9) Combine all the data into one file.

Put all the collected data into a comparison file. Edit the homo-template file you copied over at the beginning of the exercise so that it contains all the gathered data. An example of the desired type of composite file is given below. After all the data has been entered, copy this file to a new file with your last name as the filename and comp as the extension. Produce hard copy of this file to examine it more closely.

 homology comparison template for 1gf2 study

        author:            TTHHHHHHHHHHHTTTT                HHHHHHHH
      define_s:    SSSSSSSSSXHHHHHHHHHHHH  SSSSSSS   SSSSS HHHHHHHSSS
          dssp:      S SS   THHHHHHHHHHHHHH S    S SSS   S  HHHHHTTS
        motifs:                                                 xxxxx
      GOR pred:       TTTTTTTT SSSSSSSSSTTTTTTT        TTTTTT TTTTTTH
       CF pred:       TTt  tTT     SSSSSTTTTSSSSSttt   ttttthhhhhhhht
     nnpredict:                  HHEEE      EE     H        EEE

                 1 AYRPSETLCGGELVDTLQFVCGDRGFYFSRPASRVSRRSRGIVEECCFRS
                   ||||||||||||||||||||::|||||||||:||::||||||||||||||
                 1 AYRPSETLCGGELVDTLQFVCDGRGFYFSRPSSRINRRSRGIVEECCFRS
     nnpredict:                  HHEEE      EE      E       EEE
       CF pred:       TTt  tTT SSSSSSSStTTTtSSSSSTTTt  TTttthhhhhhhht
      GOR pred:       TTTTTTTT SSSSSSSSSTTTTTT   TT TTTTTTTT TTTTTTTH 
        motifs:                                                 XXXXX


        author:      HHHHHH         
      define_s:    SXHHHHHH     SSSS
          dssp:    SSHHHHHHHSS SS S 
        motifs:    xxxxxxxxxx                          
      GOR pred:    HHHHHHHHHHH      
       CF pred:    ttssssssss  tt   
     nnpredict:      H HH HH        
                51 CDLALLETYCATPAKSE
                   |||||||||||:|||||                 
                51 CDLALLETYCAAPAKSE
      nnpredict:   H HHHHHHH
        CF pred:   ttsssssssHHHHHHHH
       GOR pred:   HHHHHHHHHHHHHHHHH 
         motifs:   XXXXXXXXXX

% lpr (your lastname).comp

Closely examine this file. Do the located motif patterns line up? If the predicted secondary structures are different from the coordinate data, are they consistent with each other for these molecules? Are the overall lengths of the sequences the same? Are there gaps or inserts to worry about? Consider these points while you continue this exercise.
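
One way to put numbers on "consistent with each other" is to compare the prediction tracks position by position. The Python sketch below measures the agreement between two tracks and builds a simple majority consensus across several. The three short tracks are invented for illustration and are not taken from the 1gf2 study; the function names and the "?" tie marker are likewise assumptions. Lower-case letters, as used for weaker calls in the CF pred rows, are treated the same as upper-case ones.

```python
from collections import Counter


def agreement(track_a, track_b):
    """Fraction of positions where both tracks make a call and agree."""
    if len(track_a) != len(track_b):
        raise ValueError("tracks must have the same length")
    called = [(a.upper(), b.upper())
              for a, b in zip(track_a, track_b) if a != " " and b != " "]
    if not called:
        return 0.0
    return sum(1 for a, b in called if a == b) / len(called)


def consensus(tracks):
    """Majority call per position; '?' marks a tie, blank means no call."""
    result = []
    for column in zip(*tracks):
        votes = Counter(c.upper() for c in column if c != " ")
        if not votes:
            result.append(" ")
            continue
        ranked = votes.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            result.append("?")
        else:
            result.append(ranked[0][0])
    return "".join(result)


# Hypothetical ten-residue tracks, for illustration only.
track_1 = "HHHH  SSSS"
track_2 = "HHHT  SSS "
track_3 = " HHH TTSSS"

print("agreement 1 vs 2: %.2f" % agreement(track_1, track_2))
print("consensus: [%s]" % consensus([track_1, track_2, track_3]))
```

When typing your own rows into such a script, keep every track padded to the full alignment length so that the columns stay in register.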


10) Checking on the results from the Swiss Model server.

Go into pine and check for any results from your modelling requests to the server. There should be a number of messages there from the server. Every modelling request is acknowledged with a mail message giving you the identification number that the server will use while processing your request. You should receive one of these acknowledgment messages for each job you have submitted. They are very similar to the one given below. In the example, the identification number and your request title appear in bold type. Record those terms on the next page to help you keep your results straight when they come back.

From:   SMTP%"swissmod@ggr.co.uk" 18-OCT-1995 13:23:12.13
To:     TEACHER
CC:
Subj:   Welcome_to_SwissModel

Date: Wed, 18 Oct 95 20:53:31 +0100
From: swissmod@ggr.co.uk
Message-Id: <9510181953.AA13812@ch0x01.gimb.glaxo>
Subject: Welcome_to_SwissModel

Apparently-To: <teacher@jaguar.csc.wsu.edu>

 
 /\=====================================================================/\
//=======================================================================\\
|                                                                         |
|   >>>>>   Welcome to the Swiss-Model Protein Modeling Server   <<<<<    |
|                                                                         |
\\=======================================================================//
 \/=====================================================================\/

     Experimental Swiss-Model Protein Modeling E-mail Server (GLAXO IMB)
     If results of this search are reported or published, please mention
     that the computation was performed at the GLAXO Institute for Molecular
     Biology SA using the Swiss-Model Automated Protein Modeling service.

     Full address:        Dr. Manuel C. Peitsch
                          GLAXO Institute for Molecular Biology S.A.
                          14, chemin des Aulx
                          Case Postale 674
                          1228 Plan-les-Ouates, Geneva
                          Switzerland

                          Phone :  +41 22 706 96 66
                          FAX   :  +41 22 794 69 65
                          e-mail:  mcp13936@ggr.co.uk

=============================================================================

Swiss-Model makes use of ProMod (PROtein MODeling tool) briefly described in:

          Peitsch, M. C., Jongeneel, C. V.  (1993)
          A 3-D model for the CD40 ligand predicts that it is
          a compact trimer similar to the tumor necrosis factors.
          Int. Immunol. 5,233-238.

          Peitsch, M. C.  (1995)
          Protein modeling by E-mail
          Bio/Technology  13,658-660.

=============================================================================
============================================================ MC. Peitsch ====
=============================================================================

Swiss-Model (ProServer Version 1.1) started on Wed Oct 18 20:53:28 MET 1995

Process identification is AAAa13736

The modelling procedure is now in progress, and its results should be
sent to you shortly.


Title of your Request

  defs try1


Swiss Model server submission information:

request #1   request code is: ____________________   title is: _________________

request #2   request code is: ____________________   title is: _________________
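
If you export an acknowledgment message to a file, the two items asked for above can be pulled out automatically. The Python sketch below looks for the "Process identification is" line and for the first non-blank line after "Title of your Request". The function name is illustrative, and the sample text is cut down from the example message; the sketch assumes the server wording stays exactly as shown there.

```python
import re


def parse_acknowledgment(body):
    """Return (process id, request title) from a Swiss-Model welcome message."""
    id_match = re.search(r"Process identification is\s+(\S+)", body)
    process_id = id_match.group(1) if id_match else None

    title = None
    lines = body.splitlines()
    for index, line in enumerate(lines):
        if line.strip() == "Title of your Request":
            # The title is the first non-blank line after the heading.
            for following in lines[index + 1:]:
                if following.strip():
                    title = following.strip()
                    break
            break
    return process_id, title


sample = """Swiss-Model (ProServer Version 1.1) started on Wed Oct 18 20:53:28 MET 1995

Process identification is AAAa13736

The modelling procedure is now in progress, and its results should be
sent to you shortly.


Title of your Request

  defs try1
"""
print(parse_acknowledgment(sample))
```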


After you have recorded the required information delete these mail messages from your pine inbox.

If a modelling process has been successful, three mail messages are sent back. Two of the three messages contain postscript files. These files are plots of the modelling process. One is a profile of the modelling process. The other relates to an energy view of the modelling process. The third message actually contains the coordinate results of the modelling process.

The following is the beginning of a profile postscript message. Only the first few lines are shown. When processing this message for later use, remove the top lines of the file down to the statement starting with %!. Notice that the identification number is given on the Subj: and Subject: lines.

From:   SMTP%"swissmod@ggr.co.uk" 16-OCT-1995 16:49:50.27
To:     PRCADAMS
CC:
Subj:   SwissModel-LastModelProfile-AAAa09537

Date: Tue, 17 Oct 95 00:24:54 +0100
From: swissmod@ggr.co.uk
Message-Id: <9510162324.AA09852@ch0x01.gimb.glaxo>
Subject: SwissModel-LastModelProfile-AAAa09537
Apparently-To: <prcadams@jaguar.csc.wsu.edu>

%!
/Helvetica findfont 10 scalefont setfont
50 400 translate
newpath
   1    1 moveto  500    1 lineto
4 4 div setlinewidth stroke
 500    1 moveto  500  200 lineto

Next is an example of an energy postscript message. Only the first few lines are shown. When processing this message for later use, remove the top lines of the file down to the statement starting with %!PS-Adobe. Notice that the identification number is given on the Subj: and Subject: lines. This is a more typical example of the start of a postscript file.

From:   SMTP%"swissmod@ggr.co.uk" 16-OCT-1995 16:50:30.24
To:     PRCADAMS
CC:
Subj:   SwissModel-LastModelProsaII-AAAa09537

Date: Tue, 17 Oct 95 00:24:55 +0100
From: swissmod@ggr.co.uk
Message-Id: <9510162324.AA09858@ch0x01.gimb.glaxo>
Subject: SwissModel-LastModelProsaII-AAAa09537
Apparently-To: <prcadams@jaguar.csc.wsu.edu>

%!PS-Adobe-2.0 EPSF-1.2
%%BoundingBox: 74 96 528 728
%%Page: 1 1
%%EndComments
72 300 div dup scale

Next is an example of a coordinate data message. Only the first few lines are shown. When processing this message for later use, remove the top lines of the file down to the statement starting with HEADER. Notice that the identification number is given in a REMARK line as well as in the Subj: and Subject: lines. The title of your request is also given in a REMARK line.

From:   SMTP%"swissmod@ggr.co.uk" 13-OCT-1995 11:28:36.67
To:     PRCADAMS
CC:
Subj:   SwissModel-LastModel-AAAa14478

Date: Fri, 13 Oct 95 19:02:31 +0100
From: swissmod@ggr.co.uk
Message-Id: <9510131802.AA15015@ch0x01.gimb.glaxo>
Subject: SwissModel-LastModel-AAAa14478
Apparently-To: <prcadams@jaguar.csc.wsu.edu>

HEADER    SWISS-MODEL (Automated Protein Modelling Server)
EXPDTA    THEORETICAL MODEL (Secondary)
AUTHOR    ProMod (SEE REFERENCE IN JRNL Records)
JRNL     1  AUTH   M.C.PEITSCH
JRNL     1  TITL   PROTEIN MODELING BY EMAIL
JRNL     1  REF    BIO/TECHNOLOGY                V.  13   258 1995
JRNL     1  REFN   ISSN 0733-222X
JRNL     2  AUTH   M.C.PEITSCH,C.V.JONGENEEL
JRNL     2  TITL   A 3-DIMENSIONAL MODEL FOR THE CD40 LIGAND REVEALS A
JRNL     2  TITL 2 CLOSE SIMILARITY TO THE TUMOR NECROSIS FACTORS
JRNL     2  REF    INT.IMMUNOL.                  V.   5   233 1993
JRNL     2  REFN   ASTM INIMEN  UK ISSN 0953-8178                  759
REMARK
REMARK     REFINEMENT of primary model with CHARMm
REMARK
REMARK     Your Request is: euggr
REMARK     Date : Fri Oct 13 18:58:20 MET 1995
REMARK     SMID : AAAa14478
REMARK

Not all requests are successful. If the process has not been successful, one message is sent back. It states that the attempt was unsuccessful and suggests that you cut down your sequence file to areas of good alignment before attempting to do any further modelling on the protein. The requirements for similarity quality are given in the message. Notice the No_Success phrase in the Subj: and Subject: lines along with the identification number. Only parts of this type of message are shown here.

From:   SMTP%"swissmod@ggr.co.uk"
To:     TEACHER
CC:
Subj:   SwissModel-No_Success-AAAa09250

Date: Tue, 17 Oct 95 22:30:31 +0100
From: swissmod@ggr.co.uk
Message-Id: <9510172130.AA09357@ch0x01.gimb.glaxo>
Subject: SwissModel-No_Success-AAAa09250

Apparently-To: <teacher@jaguar.csc.wsu.edu>

////////////////////////////////////////////////////////////////////////////

   Your modeling request could not be carried out.

    Please look at the other messages issued by the server.
    The degree of similarity of your sequence with proteins of
    known 3D structure may be to low.

    At present, Swiss-Model will generate models for sequences
    which respond to these criteria:

    BLAST search P value : < 0.0001

    FASTA search standard deviations above mean : > 9.0
    Global degree of sequence identity (SIM) : > 25 %
    spread of > 40% of the submitted sequence.

    This means that if a relatively short domain, within a long
    protein, may considered to low in similarity, even though a
    model could be built for it. So define the segment which you
    wish to model, and submit it in raw sequence format.

With these examples in mind, read the rest of your mail messages. When you find one from Swiss Model, extract it into a file. Figure out which of the various requests you submitted the response actually relates to. Use the example below to do this. This example assumes that you were reading a message containing the coordinates for the euggr modelling attempt. Use the following names for your files, homology and pol_flv. For extensions use pro-ps for the profile postscript files, use en-ps for the energy postscript files and swiss-pdb for the coordinate files. Use the extension bomb for any requests that failed.

In pine, pressing the e key when reading a mail message will prompt you for a filename for the message to be extracted. This file is placed in your current directory location. You will get a message back at the bottom of the screen showing that the e-mail message has been written to the designated file. Press the n key to read the next mail message or if you are at the end of your mail messages press q to quit. Extract all your Swiss Model server messages, then exit pine. Record the names of the files you created below.

file names: ______________________________________________________________
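The extension to use for each extracted message can be read off its Subject: line. The following is a minimal Python sketch of that bookkeeping, assuming the subject always has the three-part form seen in the examples above; the function name classify_subject is hypothetical and is not part of any course software.

```python
# Subject keywords taken from the example messages, mapped to the
# file extensions this exercise asks for.
EXTENSIONS = {
    "LastModelProfile": "pro-ps",   # profile postscript
    "LastModelProsaII": "en-ps",    # energy postscript
    "LastModel": "swiss-pdb",       # coordinate data
    "No_Success": "bomb",           # failed request
}


def classify_subject(subject):
    """Return (extension, request id) for a subject such as
    'SwissModel-LastModel-AAAa14478', or None if it does not match."""
    parts = subject.strip().split("-")
    if len(parts) != 3 or parts[0] != "SwissModel":
        return None
    kind, request_id = parts[1], parts[2]
    if kind not in EXTENSIONS:
        return None
    return EXTENSIONS[kind], request_id


print(classify_subject("SwissModel-LastModelProsaII-AAAa09537"))
print(classify_subject("SwissModel-No_Success-AAAa09250"))
```

Messages that share an identification number belong to the same request, so the second value returned can be used to group the profile, energy and coordinate files.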

With these files in hand, use the pico editor to process the data so you can use it. Remove the mail information from the top of these files. In the case of the profile and energy postscript files this is all that needs to be done. The bomb files do not need to be edited. Just record which of the modelling attempts worked and which didn't. The coordinate files will require further processing before they can be used in dssp, define_s and MacroModel.

worked: ____________________        didn't work: _____________________
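Removing the mail information by hand in pico is what this exercise expects. For comparison, the same step can be sketched in Python: keep everything from the first line that starts with the marker (%! for the postscript files, HEADER for the coordinate file). The function name strip_mail_header is hypothetical and the sample message is shortened from the coordinate example above.

```python
def strip_mail_header(lines, marker):
    """Return the lines from the first one starting with marker onward.

    An empty list is returned when the marker is never found.
    """
    for index, line in enumerate(lines):
        if line.startswith(marker):
            return lines[index:]
    return []


# Shortened sample based on the coordinate data message shown earlier.
message = [
    "Subject: SwissModel-LastModel-AAAa14478",
    "Apparently-To: <prcadams@jaguar.csc.wsu.edu>",
    "",
    "HEADER    SWISS-MODEL (Automated Protein Modelling Server)",
    "EXPDTA    THEORETICAL MODEL (Secondary)",
]

body = strip_mail_header(message, "HEADER")
print(body[0])
```

Because %!PS-Adobe also starts with %!, the marker %! works for both kinds of postscript message.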


11) Working with the successful modelling results.

The file of interest here is the swiss-pdb file. It contains the coordinate data for the generated model from the server. As it stands with the mail header information removed, it is ready to use in Molscript. There are some questions that exist though. How much of the sequence was modelled? Does the modelled section represent the critical portion of the molecule?

Check to see if the original sequence file gives any information as to the nature of the active site in the protein. In the command line below xxxx.xxx represents the name of the file you want to look at. Record any found active site information below.

% more xxxx.xxx

active site information: ________________________________________

_____________________________________________________________________

_____________________________________________________________________

Now check the swiss-pdb file for alpha carbon atoms. There is one alpha carbon for each residue in a PDB protein structure file. Use the command line given below to do this task. In the command line given xxxx.swiss-pdb represents the name of your successful homology modelling coordinate data file. Record the first five residues at the beginning of the modelled sequence and the last five residues in the space provided. Also note the length of the sequence modelled.

% grep " CA " xxxx.swiss-pdb | more 

first 5 residues: ________________________________________________________

last 5 residues: _________________________________________________________

modelled sequence length: ________________________________________________
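The three values asked for above can also be pulled out of the alpha carbon records with a short script. This is a sketch only, assuming the file follows the standard PDB column layout (atom name in columns 13-16, residue name in 18-20, residue number in 23-26). The helper sample_atom and the seven residues are made-up test data, not output from the server.

```python
def alpha_carbons(pdb_lines):
    """Return (residue name, residue number) for each alpha carbon ATOM record."""
    residues = []
    for line in pdb_lines:
        # Standard PDB columns: atom name 13-16, residue name 18-20,
        # residue number 23-26 (Python slices are zero-based).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            residues.append((line[17:20], int(line[22:26])))
    return residues


def sample_atom(serial, name, res, num):
    """Build a made-up ATOM record in PDB column layout (zero coordinates)."""
    return "ATOM  %5d %-4s %3s  %4d    %8.3f%8.3f%8.3f" % (
        serial, name, res, num, 0.0, 0.0, 0.0)


# Made-up fragment: two atoms per residue, numbering starts at 10.
names = ["MET", "LYS", "THR", "ALA", "TYR", "ILE", "ALA"]
lines = []
for i, res in enumerate(names, start=1):
    lines.append(sample_atom(2 * i - 1, " N  ", res, i + 9))
    lines.append(sample_atom(2 * i, " CA ", res, i + 9))

residues = alpha_carbons(lines)
print("first 5:", residues[:5])
print("last 5: ", residues[-5:])
print("length: ", len(residues))
```

Checking the record type as well as the atom name keeps calcium ions, which appear as HETATM records with the name CA, out of the count.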


From this information, it is obvious that the entire protein sequence was not modelled, only part of it. To determine just what part, do the following. Copy the sequence file for the request over to a second file called xxxx-part.sw, where xxxx represents the original name of the protein sequence.

% cp xxxx.sw xxxx-part.sw

Now use the pico editor to go into this xxxx-part.sw file and delete from it the parts of the original sequence that aren't in the resulting model. When finished, use reformat to make the resulting file usable in GCG again.

% pico xxxx-part.sw

% reformat -in=xxxx-part.sw

Create a fil file with the pico editor that contains the names of the original selected molecule file, the homology sequence file and the name of the file you just created. A fil file contains just one filename per line. You can have database:access_code as one of the entry lines if your desired file is still in a database and not in the current location of your account. Give your fil file the name check.fil. Use this fil file in the PILEUP program to find out where the modelled portion of the homology sequence is with respect to the other two sequences. Use the example given below as your guide. In this guide XXXX represents the selected molecule's access code in NRL_3D, xxx_xxxx the SwissProtein access code for the homology sequence and yy-part.sw the part of the homology sequence that was modelled by the Swiss Model server.

% pileup -in=@check.fil

PileUp creates a multiple sequence alignment from a group of related
sequences using progressive, pairwise alignments.  It can also plot a
tree showing the clustering relationships used to create the alignment.


   1            XXXX   xxx aa
   2        xxx_xxxx   xxx aa
   3      yy-part.sw    xx aa

 What is the gap creation penalty (* 12 *) ? <rtn>

 What is the gap extension penalty (* 4 *) ? <rtn>

 This program can display the clustering relationships graphically.
 Do you want to:

     A) Plot to a FIGURE file called "pileup.figure"
     B) Plot graphics on HP7550 attached to /dev/tty15
     C) Suppress the plot

 Please choose one (* A *):  c<rtn>

 What should I call the output file name (* check.msf *) ? <rtn>

 Determining pairwise similarity scores...

   1   x     2       0.55
   1   x     3       0.70
   2   x     3       1.50

 Aligning...

   1     ....-.
   2     ......-.


        Total sequences:          3
       Alignment length:        158
               CPU time:      00.49

            Output file:/disk3/usr/local/people/bcsxx/week9/check.msf

Either display this output file on the screen or produce a hard copy of it. Look closely at the alignment. Record below where the actual overlap occurs on the selected molecule sequence. You will need to have data on the starting and ending points plus one point somewhere in the middle of the alignment.

beginning point: _____________________________________________________

mid point: ___________________________________________________________

ending point: ________________________________________________________

homology sequence beginning point: ___________________________________

homology sequence mid point: _________________________________________

homology sequence ending point: ______________________________________
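For the homology sequence itself, the beginning, mid and ending points can be cross-checked without the alignment, because the modelled fragment is an exact piece of the full sequence. The sketch below assumes both sequences are available as plain one-letter strings. The function name locate_fragment and the two sequences are made up for illustration.

```python
def locate_fragment(full_seq, fragment):
    """Return 1-based (begin, mid, end) positions of fragment in full_seq.

    None is returned when the fragment is not an exact substring.
    """
    start = full_seq.find(fragment)
    if start < 0:
        return None
    begin = start + 1
    end = start + len(fragment)
    mid = (begin + end) // 2
    return begin, mid, end


# Made-up sequences for illustration only.
full_seq = "MKTAYIAKQRQISFVKSHFSRQ"
fragment = "AKQRQISF"
print(locate_fragment(full_seq, fragment))
```

These positions count residues in the ungapped sequence. Positions read from the PILEUP output count alignment columns and will differ wherever gaps have been inserted.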


12) Moving data to model1.

With the coordinate data in hand and information on where it might align with the original selected molecule structure, move the homology coordinate data file over to your account on model1. In the example given below xxxx.swiss-pdb represents the name of your coordinate data file and yyyyy your password on model1.

% ftp model1.vadms.wsu.edu
Connected to model1.vadms.wsu.edu.
220 model1.vadms.wsu.edu MultiNet FTP Server Process 3.4(14) at Sun 18-Feb-96 2:44PM-PST
Name (model1.vadms.wsu.edu:bcsxx):<rtn>
331 User name (bcsxx) ok. Password, please.
Password:yyyyy<rtn>
230 User BCSXX logged into DISK1:[BCSXX] at Sun 18-Feb-96 2:45PM-PST, job 14b.
Remote system type is VMS.
ftp> type ascii
200 Type A ok. 
ftp> put xxxx.swiss-pdb<rtn>
local: xxxx.swiss-pdb remote: xxxx.swiss-pdb
200 Port 18.195 at Host 134.121.43.151 accepted.
150 ASCII Store of DISK1:[BCSXX]XXXX.SWISSPDB;1 started.
226 Transfer completed.  55369 (8) bytes transferred.
55369 bytes sent in 0.02 seconds (2864.86 Kbytes/s)
ftp> quit<rtn>
221 QUIT command received. Goodbye.
% 


13) Move over to model1 to work on the data.

Telnet over to model1 to continue your computing tasks for this week.

% telnet model1.vadms.wsu.edu <rtn>
Trying 134.121.12.92...
Connected to model1.vadms.wsu.edu.
Escape character is '^]'.


        Welcome to OpenVMS VAX V6.1

Username: BCSXX<rtn>

Continue to log into the model1 platform. You have already moved over the required pdb file for the tasks at hand. While this file is ok for Molscript use, its current form will not work in the programs that are to be run on model1. Go through and correct the order in which the atoms of the residues appear in the file by using pdb_fix.

$ pdb_fix

 Program PDB_fix
 This program converts non-standard PDB files
 into ones that will work in MacroModel
 Enter name of file to work with: xxxx.swiss-pdb <rtn>

 Enter name of output file created: xxxx.swiss-pdb-fix <rtn>
$    

Edit this file with the eve editor and add a MASTER and END line to the very end of the swiss-pdb-fix file. This is best done by using the Do key and entering bot for bottom to go to the end of the file.

$ eve xxxx.swiss-pdb-fix

To get secondary structure assignments for the coordinate data, run this corrected data file through the dssp and define_s programs. These programs were written for real PDB data files, which have only a 4 character file name and a pdb extension. For your data to work in these programs, it will be necessary to copy the data to a different filename, one that matches this expected naming convention. In the example yyyy represents the 4 character code you decide to use for this file.

$ copy xxxx.swiss-pdb-fix yyyy.pdb

Run the define_s program first. Two prompts appear at the end of the run asking if certain generated files are to be deleted. Go ahead and delete them by responding with y.

$ @define_s yyyy

Search the created output file for the results in the following manner.

$ sea yyyy.sss elemnt

Displayed on the screen are the lines that contain the term elemnt. In these lines B denotes a sheet and @ a helix. Record the define_s results on your homology model.

define_s homology results:

helix locations: ___________________________________________________________

sheet locations: ___________________________________________________________


Now run the dssp program. This one will take a while to complete. The best way to get the results of this run when it is finished is to have them printed on the lab printer with the cpr command.

$ dssp yyyy

$ cpr yyyy.dssp

Record the dssp result on the next page in the space provided. This software uses the letters G and H to denote helices, B and E for sheets (isolated bridges and extended strands), T for turns and S for bends.

dssp homology results:

helix locations: ___________________________________________________________

sheet locations: ___________________________________________________________

turn locations: ____________________________________________________________
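Turning a long column of per-residue letters into the location ranges asked for above is tedious by hand. The sketch below groups a string of DSSP letters into runs. It assumes you have already copied the one-letter assignments, one per residue and in order, out of the dssp output into a single string. The function name segments and the sample string are made up. S, which DSSP itself defines as a bend, is left unclassified here.

```python
from itertools import groupby

# DSSP letters grouped into the three classes recorded in this exercise.
CLASSES = {"H": "helix", "G": "helix", "E": "sheet", "B": "sheet", "T": "turn"}


def segments(codes, first_residue=1):
    """Group a per-residue string of DSSP letters into (class, start, end) runs."""
    result = []
    position = first_residue
    for label, run in groupby(codes, key=lambda c: CLASSES.get(c, "other")):
        length = len(list(run))
        if label != "other":
            result.append((label, position, position + length - 1))
        position += length
    return result


# Made-up assignment string; blanks mean no assignment for that residue.
print(segments("  HHHHGG TT EEEE "))
# prints [('helix', 3, 8), ('turn', 10, 11), ('sheet', 13, 16)]
```

Set first_residue to the number of the first modelled residue so that the ranges match the numbering in your coordinate file.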


With the secondary structure assignments determined, run the xxxx.swiss-pdb-fix file through the bfiler program to convert it into MacroModel format.

$ bfiler
 BFiler (v 0.2)
18-FEB-96 15:26:41

BFiler: SELECT A MENU ITEM FROM BELOW--

     HELP=Information

     TAPE=Read Brookhaven format files Brookhaven tape and
            translate to MMOD format,

     COPY=Copy files from Brookhaven tape to disk
            without translation

     DISK=Translate Brookhaven format files to MMOD files

     BARE=Translate Bare Brookhaven atom table (from file(s)
            disk) to MMOD format file(s)

     EXIT=Exit BFiler

BFiler>disk<rtn>

BFILER-DISK:This routine attempts to translate Brookhaven
     format files which are on a disk
BFILER-DISK:Continue?(y)><rtn>

Default suffix is ".BRK"

Type in the names of the files you want to process,
       Hit return after each code name and
       a bare "." to finish>
xxxx.swiss-pdb-fix<rtn>
.<rtn>

Below is a list of names for files you want to translate --
 Options: (1) type in corrected entry;
          (2) type "i" to insert an entry,
          (3) type "x" to delete entry,
          (4) type "." to finish,
          (5) hit return to verify entry:
XXXX.SWISS-PDB-FIX<rtn>

Go back and re-edit the filecodes?(n)><rtn>

Looking for file XXXX.SWISS-PDB-FIX

Reading XXXX.SWISS-PDB-FIX
WARNING - invalid text in this file
EXPDTA    THEORETICAL MODEL (Secondary)
WARNING - invalid text in this file
JRNL     2  TITL 2 AUTOMATED COMPARATIVE PROTEIN MODELLING.

Reading atomic coordinates...
Typing atoms...
Creating bond entries...

BFiler: SELECT A MENU ITEM FROM BELOW--

     HELP=Information

     TAPE=Read Brookhaven format files Brookhaven tape and
            translate to MMOD format,

     COPY=Copy files from Brookhaven tape to disk
            without translation

     DISK=Translate Brookhaven format files to MMOD files

     BARE=Translate Bare Brookhaven atom table (from file(s)
            disk) to MMOD format file(s)

     EXIT=Exit BFiler

BFiler>exit<rtn>


14) Doing a classic homology alignment.

Many times the overall lengths of the two sequences being modelled are the same and there is a high degree of similarity between the two molecules. When this happens, the modelling process can be as simple as overlaying the sequence of the second protein upon the backbone coordinates of the first. Even without further work such as additional distance geometry runs on the modelled structure, the generated model can be used for visualizing how the various parts of the protein possibly interact with one another, the nature of the charge distribution and the general shape of the molecule.

An example of this type of homology modelling was work done for Dr. Wm. Trumble of the U of I on cardiotoxin. In this study, cardiotoxin was mapped to the coordinates of neurotoxin. Even though the similarity between the two was low (about 26%), there was a strong consensus that the disulfide bridges were the same in each of the protein families, and physical studies indicated that the two proteins performed similar functions. The presence of a nasty little gap region was handled by removing it from the structure in the model and replacing it with a GLY residue to bridge the gap and correspond to the neurotoxin's way of doing things. Using these assumptions, a model was created. Later the x-ray structure of cardiotoxin was determined.

Get into MacroModel and read in the file, cardio-models. This data file was created with the backbone structure of the original neurotoxin file on the left in red, the modelled backbone of the cardiotoxin structure in the middle in green and the determined cardiotoxin backbone structure on the right in aqua. The three-finger hand is present, if somewhat distorted in the fingers. The connection area where the bulge was shows distortion in the lower right-hand portion of the molecule; however, the general shape of the molecule holds true. Who knows how much more similar the model and the final structures would have been if distance geometry work had been carried out on the model? To view this file more effectively, once in MacroModel, select ANALYZ and then A LAB to turn off the labels on the atoms, and then read in the file.

Try this modelling technique yourself. Read in the complete structure for melittin by selecting READ, giving the access code for the melittin file, 2mlt, pressing RETURN for structure number, and responding with y to clear the working area.

Select one of the two chains on the screen to do further work with and delete the other one. This is done by using the DELT button of the INPUT portion of the program. Selecting this button twice allows for the deletion of a molecule. Once the phrase Molecule deletion appears at the top of the screen, move the cursor over to a location on the chain you want to remove and press the mouse button. The unwanted chain is removed. Remove the two sulfate groups in this manner also.

A good way to create a structure that allows the monitoring of the overlaying process is to strip the structure to its backbone and then color code the result via its various residue types. Select ANALYZ, SETS, MainS, DISPLA, Dis, then Rtype, responding with w to do just the working set data. If you feel comfortable identifying the backbone components with the atom labels, select A LAB to put them back on. If your structure appears less than perfect after this last action, select Updat to have the screen redrawn.

Now write the data you have created to a file, saving only the displayed fragment, giving it the extension, backbone. Once the file has been created, read the file back into the program. This is necessary in order to create a complete amino acid version of the structure and not just a backbone in the next step.

You are going to be overlaying the following sequence upon this coordinate set, YAGVALAVLALIIPSLLTWQSRKHNP. This is a portion of a tyrosine permease sequence from E. coli that shows a 39% similarity with the melittin sequence. The three-letter code version of this sequence is given below.

Tyr Ala Gly Val Ala Leu Ala Val Leu Ala Leu Ile Ile Pro Ser

Leu Leu Thr Trp Gln Ser Arg Lys His Asn Pro
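If you want to check the three-letter listing against the one-letter sequence, or prepare a different fragment the same way, the conversion is a simple table lookup. The following Python sketch uses the standard amino acid codes; the function name to_three_letter is hypothetical.

```python
# Standard one-letter to three-letter amino acid codes.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_three_letter(sequence):
    """Convert a one-letter protein sequence to space-separated three-letter codes."""
    return " ".join(THREE_LETTER[residue] for residue in sequence)


fragment = "YAGVALAVLALIIPSLLTWQSRKHNP"
print(len(fragment))  # prints 26
print(to_three_letter(fragment))
```

The fragment is 26 residues long, so each residue you pick in PEPTID replaces exactly one position of the single melittin chain you kept.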

To overlay the sequence, start at the beginning of the sequence and do the following. Select INPUT, PEPTID, and then move the cursor to the first amino acid in the sequence and press the mouse button. Then move the cursor to some point on the first residue of the structure and press the mouse button again; the new residue will be drawn in that position. Since this is a new residue, it has the standard atom coloration. You can use this change in color to mark your progress through the sequence.

You can use Clip to expand regions of the structure where individual residues are hard to determine. The process is the same. Once the screen has been expanded, pick the desired residue and move the cursor to a spot on the residue to be replaced.

In this process, it is best to do the replacements all in one sitting, for what you are really doing is establishing a second chain on the backbone of the first. They are two different chains that appear to be one. If an amino acid is the same in both sequences it still has to be replaced or the second chain that you are making will be incomplete.

Once you have finished your structure, write the data to a file called overl. While the structure created is not ideal, it will serve as a starting point to do other processes to improve the quality of the model.


15) Working with your homology data.

Now work with your homology data. Select ANALYZ. Check to make sure that the A LAB button is off (the color of the button should be white). The structures being worked with are too large to keep the labels on. Read in the selected molecule bdt file. Strip this data set down to its backbone and then color the entire structure red. Then use the Sets button to go through and find the beginning, mid and ending points of the alignments on this structure. It may be necessary to use the Clip button to zoom in on the desired parts of the structure. Color the beginning point blue, the mid point yellow and the ending point white. Save this data to a file.

Repeat this process with the homology bdt file. Color the backbone of this structure green. Then go through and color the corresponding beginning, mid and end points of the structure as before. Save the data to a file.

Read both the structures into the working window at the same time. Use the superpositioning function of the program to align the two structures. Select tack points from the red molecule first, as that is the orientation that is the default. Modelled structures often have a different orientation than the ones they were modelled on.

After the positioning is completed, delete the red molecule from the screen and save the data. Use the filename homo-corrected for the data. Get out of the modelling program.

Convert the homo-corrected.dat file into PDB format again through the use of the mmodpdb program. Use the guide given below for this task.

$ mmodpdb

THIS PROGRAM READS V1.5-2.0 MACROMODEL STRUCTURE FILES
AND PRODUCES FORMATTED PDB STYLE OUTPUT FILES

Enter MacroModel input filename:
homo-corrected.dat<rtn>
Enter .PDB output filename:
homo-corrected.pdb<rtn>
Charge file (.CHG) not found, charges set to 0.0


Enter MacroModel input filename:
<rtn>
FORTRAN STOP

Display the resulting file on the screen with the type command. You will notice that the chain is now called X and that the names of the atoms are not what you are used to in PDB files.

$ type homo-corrected.pdb

The homo-corrected.pdb file will have to be edited with eve since it needs to have alpha carbons in order to have secondary structure shown by the Molscript program. Edit the file and replace the "C02" terms with "CA ".
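The same replacement can be sketched in Python for comparison with the eve edit. Because "CA " is the same width as "C02", the fixed columns of the file are preserved. The sample lines below are an assumed layout for illustration only and are not actual mmodpdb output; the function name rename_alpha_carbons is hypothetical.

```python
def rename_alpha_carbons(pdb_lines):
    """Replace the first 'C02' on each ATOM line with 'CA ' (same width)."""
    fixed = []
    for line in pdb_lines:
        if line.startswith("ATOM"):
            line = line.replace("C02", "CA ", 1)
        fixed.append(line)
    return fixed


# Assumed layout for illustration; check it against your own file.
sample = [
    "REMARK atom names such as C02 come from the conversion program",
    "ATOM      1  N01 ALA X   1       0.000   0.000   0.000",
    "ATOM      2  C02 ALA X   1       0.000   0.000   0.000",
]

for line in rename_alpha_carbons(sample):
    print(line)
```

Only ATOM lines are touched, so any remark that happens to contain the text C02 is left alone. Look through the result before using it, since this sketch trusts that C02 marks the alpha carbon on every ATOM line where it appears.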

Once the file has been changed ftp it over to ribozyme where the Molscript runs will be done. In the example given below yyyyy represents your password.

$ ftp ribozyme.vadms.wsu.edu
model1.vadms.wsu.edu MultiNet FTP user process 3.4(111)
Connection opened (Assuming 8-bit connections)
<ribozyme.vadms.wsu.edu FTP server ready.
RIBOZYME.VADMS.WSU.EDU>l bcsxx<rtn>
<Password required for bcsxx.
Password:yyyyy<rtn>
<User bcsxx logged in.
RIBOZYME.VADMS.WSU.EDU>cd week9<rtn>
<CWD command successful.
RIBOZYME.VADMS.WSU.EDU>type ascii<rtn>
Type: Ascii (Non-Print), Structure: File, Mode: Stream
RIBOZYME.VADMS.WSU.EDU>put homo-corrected.pdb<rtn>
  To remote file:<rtn>
<Opening ASCII mode data connection for 'homo-corrected.pdb'.
<Transfer complete.
RIBOZYME.VADMS.WSU.EDU>quit<rtn>
<Goodbye.

Log out of model1 by entering logout as in a regular connection session.

$ logout


16) Molscript work with the homology data.

Back on ribozyme, use the pico editor to create the following Molscript input files. Use the examples given below as a guide for your own files. You will need one input file to display the original data that was sent back from the Swiss Model server, and a second one for the corrected pdb file with the determined secondary structure assignments in it.

guide for the initial data set

(In this file the entire structure will be shown as a simple coil. Replace the xxx term with a shortened form of the name of your selected molecule. The term yyyy.swiss-pdb is the name of the original data from the server. Replace zz with the length of this modelled sequence. You may need to adjust the label position in the image by changing the -15.0 value.)

! this is an attempt to plot the xxx homology model from Swiss Model
plot
   read mol "yyyy.swiss-pdb";
   transform atom * by centre position atom *;

   coil from 1 to zz;

   set depthcue 0.0, labelsize 10.0;
   label 0.0 -15.0 0.0 "xxx homology model";
end_plot      

For the construction of your final Molscript image, you will need to start by adding dssp and define_s information to your lastname.comp file. Do this by copying this file to lastname.final and then editing it with pico to add the following lines. Put a line below the motifs line on the homology sequence side of the alignment entitled model: and show by the placement of x's where the modelled portion of the sequence is. Below the model: line add a dssp: line for the modelled portion of the sequence. Use the symbols H, S and T to denote the location of secondary structural elements as found by this method. Beneath the dssp: line add a define_s: line. Use the same symbols as in the dssp: line to show the location of determined secondary structural elements. Print off your finished file.

% cp lastname.comp lastname.final

% pico lastname.final

% lpr lastname.final

Study these results carefully. Make your own assessment of where the secondary structural elements are in the modelled homology sequence fragment. Record those assignments below.

homology sequence fragment secondary structure assignments:

helix locations: ___________________________________________________________

sheet locations: ___________________________________________________________

turn locations: ____________________________________________________________


guide for the aligned data set

(In this file the entire structure will be shown with its secondary structure assignments. Replace the Xxx term with the position in the chain of that feature. Remember that in this file the chain is called X and so the locations are given as X1 etc. Copy the coil, strand, helix and turn lines and modify them as often as needed to form a complete structure. Again you may need to adjust the location of the label.)

! this is an attempt at displaying the corrected homology model
plot
   read mol "homo-corrected.pdb";
   transform atom *
     by centre position atom *;

   coil from Xx to Xxx;

   set planecolour red;
     strand from Xxx to Xxx;

   set planecolour green;
     helix from Xxx to Xxx;

   set planecolour blue;
     turn from Xxx to Xxx;

   set depthcue 0.0, labelsize 10.0;
   label 0.0 -15.0 0.0 "corrected homology model";
end_plot
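Copying and editing the coil, strand, helix and turn lines by hand works well for a short fragment. For a longer one, the lines can be generated from your list of assignments. The Python sketch below writes text that follows the guide above and nothing more. The function name molscript_input and the residue ranges are made up, and the colours are simply the ones used in the guide.

```python
def molscript_input(pdb_name, segment_list, title, chain="X"):
    """Build Molscript input text following the guide for the aligned data set.

    segment_list holds (kind, start, end) tuples, where kind is one of
    coil, strand, helix or turn.
    """
    colours = {"strand": "red", "helix": "green", "turn": "blue"}
    lines = [
        "! this is an attempt at displaying the corrected homology model",
        "plot",
        '   read mol "%s";' % pdb_name,
        "   transform atom * by centre position atom *;",
        "",
    ]
    for kind, start, end in segment_list:
        if kind in colours:
            lines.append("   set planecolour %s;" % colours[kind])
        lines.append("   %s from %s%d to %s%d;" % (kind, chain, start, chain, end))
    lines += [
        "",
        "   set depthcue 0.0, labelsize 10.0;",
        '   label 0.0 -15.0 0.0 "%s";' % title,
        "end_plot",
    ]
    return "\n".join(lines) + "\n"


# Made-up assignments for illustration only.
example = [("coil", 1, 5), ("helix", 5, 12), ("turn", 12, 15), ("strand", 15, 22)]
text = molscript_input("homo-corrected.pdb", example, "corrected homology model")
print(text)
```

Save the printed text to a file with pico, or redirect the output of the script to a file, and use that file as your Molscript input.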

Run your Molscript jobs in the following manner. Replace xxxxx.in with the name of your Molscript input file and yyyyy.ps with the desired name of your output file. Once the Molscript run successfully completes, send the output file off to the printer. Call the output file for the initial data set lastname-homo-initial.ps and the output file for the corrected orientation data lastname-corrected.ps.

% molscript < xxxxx.in > yyyyy.ps

% lpr yyyyy.ps


17) Figuring out the failed modelling attempt.

The pol_flv sequence failed to produce a homology model from the Swiss Model server. You got back a message stating that you would need to cut back the sequence to delete low similarity regions before sending the sequence off to be attempted again. This sequence is remotely related to the human HIV protease sequence whose structure has been predicted. The access code for this sequence in NRL_3D is 7hvpa.

Create an alignment with the pol_flv sequence and the nrl_3d:7hvpa sequence with the GAP program. Print off the generated output file, flv.pair.

% gap

Gap uses the algorithm of Needleman and Wunsch to find the alignment of
two complete sequences that maximizes the number of matches and minimizes
the number of gaps.

 GAP of what sequence 1 ?  nrl_3d:7hvpa<rtn>

                  Begin (* 1 *) ? <rtn>
                End (*    99 *) ? <rtn>

 to what sequence 2 (* nrl_3d:7hvpa *) ? sw:pol_flv<rtn>

                  Begin (* 1 *) ? <rtn>
                End (*   128 *) ? <rtn>

 What is the gap creation penalty (* 12 *) ? <rtn>

 What is the gap extension penalty (* 4 *) ? <rtn>

 What should I call the paired output display file (* .pair *) ? flv.pair<rtn>

 Aligning ......-.

          Gaps:      1
       Quality:   54.5
 Quality Ratio:  0.551
  % Similarity: 47.475
        Length:    128

% lpr flv.pair

Look closely at the output of this alignment process. The Swiss Model server appears to require a very high level of identity in order to work if it feels that the sequence being modelled doesn't belong to a well defined family of proteins with a number of already solved structures. Select a region of the pol_flv sequence that is at least 20 residues long and has approximately 50% identity in the alignment. Record this fragment on the next page.

high identity part of pol_flv sequence: _____________________________________
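Picking the region by eye from the printed alignment is the intended approach. As a cross-check, a sliding window over the two aligned lines finds the stretch with the most identical positions. The sketch below assumes the two sequences have been typed in as equal-length strings exactly as aligned, with gaps written as . or - characters. The function name best_window and the short sequences are made up.

```python
def best_window(aligned_a, aligned_b, width=20):
    """Return (start index, identity fraction) of the best window.

    The window is width alignment columns wide. The start index is
    zero-based. None is returned if the alignment is shorter than width.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # 1 where both lines carry the same residue, 0 for mismatches and gaps.
    matches = [
        1 if a == b and a not in ".-" else 0
        for a, b in zip(aligned_a, aligned_b)
    ]
    best_start, best_count = None, -1
    for start in range(len(matches) - width + 1):
        count = sum(matches[start:start + width])
        if count > best_count:
            best_start, best_count = start, count
    if best_start is None:
        return None
    return best_start, best_count / width


# Made-up ten-column alignment, scanned with a five-column window.
print(best_window("ACDEFGHIKL", "ACDQFGYIRL", width=5))  # prints (0, 0.8)
```

With the default width of 20, a returned fraction of about 0.5 or more marks a candidate fragment. The window counts alignment columns, so remove any gap characters from the fragment before submitting it.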

Log out of ribozyme and get back to the Launcher window screen. From here select the NETSCAPE icon and make another request to the Swiss Model server. This time enter your recorded fragment in the box below the box in the form for entering the SwissProtein access code. After the request has been made get out of the Netscape program and log back on to your account on ribozyme.


18) Finishing up

Copy over the report form for this exercise, rename it to have your last name, use the pico editor to fill in the report and send it over to the teacher account. After you have most of your report form filled out and have reached the question on your pol_flv fragment, exit the pico program and check your mail. Hopefully by this time a response will have come back from the Swiss Model server on your modelling request. If not, send over the required files to the teacher account and then check your mail again. Only complete your report form when you have received a response to your modelling request back from Swiss Model.

% mv week9m.week9m (your lastname).week9m

% pico (your lastname).week9m

% rcp (your lastname).week9m teacher@ribozyme:receive

% rcp (your lastname).comp teacher@ribozyme:receive

% rcp (your lastname).final teacher@ribozyme:receive

% rcp (your lastname)-homo-initial.ps teacher@ribozyme:receive

% rcp (your lastname)-corrected.ps teacher@ribozyme:receive


This concludes your computing session for this week. Log off the computer.

Now exit the emulator program by selecting Quit from the File menu of the control bar.

References

Define_S, F.M. Richards and C.E. Kundrot, Proteins 3: 71-84 (1988).

DSSP, W. Kabsch and C. Sander. Biopolymers 22: 2577-2637 (1983).

Per J. Kraulis, "MOLSCRIPT: a program to produce both detailed and schematic plots of protein structures", Journal of Applied Crystallography (1991) vol 24, pp 946-950.

Rost, Burkhard; Sander, Chris: "Prediction of protein secondary structure at better than 70% accuracy", J. Mol. Biol., 1993, 232, 584-599.

Peitsch, M.C. Protein Modelling by E-mail. Bio/Technology 1995, 13, 658-660.

Peitsch, M.C. ProMod and Swiss-Model: Internet-based tools for automated comparative protein modelling. Biochem Soc Trans 1996, 24, 274-279.

Internet resources used:

Swiss Model site:
http://expasy.hcuge.ch/swissmod/SWISS-MODEL.html

PredictProtein site:
http://www.embl-heidelberg.de/predictprotein/predictprotein.html

nnpredict site:
http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html