Homology Modelling. Combining sequence analysis and molecular modelling skills you will explore the theoretical modelling of unsolved sequences based on the structures of solved ones.
Author:
Susan Jean Johns
Homology modelling combines two sets of analysis tools, sequence analysis and molecular modelling. It takes skills from both tool sets in order to carry out a successful homology modelling project.
To do this process, you must know as much as possible about the protein sequence to be modelled. What are its characteristics? Is it related to any other proteins? Does it have similar functions to other types of proteins? Have those proteins' structures been solved? Does it contain any established motifs?
Five proteins will be used for this exercise with the following SwissProtein access codes and characteristics.
sw:def5_mouse - an antibiotic protein from mouse - 36 residues
sw:mi1a_human - an inflammatory protein from humans - 92 residues
sw:cyc_euggr - an unusual cytochrome c from Euglena gracilis - 102 residues
sw:id3_human - an HLH transcription protein - 119 residues
sw:sstn_rat - a protein strongly related to elongation factors - 463 residues
In the past, homology modelling could be a long, drawn-out process. The sequence to be modelled had to be fully explored with respect to other possible similar proteins. What were their shared characteristics and/or functions? Did they contain the same motifs? Had any of these proteins' structures been solved? What was the predicted secondary structure of the sequence to be modelled and how did that relate to that of the solved structures? Were there any gaps or inserts that had to be dealt with? Did these areas occur within functional regions or outside of them?
Now there is an automated modelling tool available. While it does not generate a homology model for every sequence that is sent in, it does produce enough models to make it a reasonable first approach to the homology modelling problem. This modelling tool exists on the web and requires a web browser that can handle forms. It can be used in one of two ways. The user can either give the SwissProtein access code for the sequence to be modelled or paste the sequence into the form provided for that purpose. This second mode of operation is beyond the available equipment's capabilities, so only the first one will be used.
To give you an idea of what was involved in homology modelling, read through the description below.
Analysis of the growing numbers of determined crystal structures shows that there are relatively few folding patterns for proteins. Members of the same family have the same folding patterns. This makes sense since proteins doing the same functions should have the same general structure. In general, function sites also maintain the same folding pattern. Such observations have become the foundation of the area known as homology modelling. In this area the alignment between primary sequences is used to create models of new structures from either known coordinate sets or prediction data.
The key to this endeavor is understanding the nature of the known data set and its relationship to the sequence to be modelled. This requires collecting all possible data on both sequences. The coordinate data, with its orientation of functional sites and their interactions with one another, needs to become second nature to the modeller. All possible forms of comparisons to find regions of similarity between the two sequences must be determined and understood. Serious attempts at homology modelling require that the modeller do extensive literature searches on both the known coordinate protein and the unknown protein, plus extensive sequence analysis determinations. This provides the broadest informational base possible to help create an accurate model.
In order to do this sort of modelling, one needs a pair of sequences. The first is the unknown sequence to be modelled. The second is a sequence that has coordinate data associated with it upon which to base the model. This second sequence is found by doing database searches, either locally or via the network, to find a suitable match. Doing a GCG FASTA search on the NRL_3D database is one way. Another is doing a GCG BLAST search over the network against the PDB resources at NCBI. One needs a good reason to match sequences with lower degrees of similarity to one another. Such a reason could be that they are known to share similar disulfide bridge pattern(s) or physiological effects.
Collect relevant data for alignments by examining the sequence files for the two proteins of interest. Most primary sequence files from a database contain information in them on the features of the protein. From this source, the presence of sections of the protein not present in its mature form can be determined. X-ray structures are only determined on mature proteins. Any precursor portions of the protein must be removed from the affected sequence before modelling can take place.
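The trimming step described above can be sketched in a few lines of Python. This is only an illustration: the function name mature_sequence, the example sequence and the residue range are all made up here, and in practice the range would come from the feature annotations in the sequence file.

```python
def mature_sequence(precursor: str, start: int, end: int) -> str:
    """Return residues start..end (1-based, inclusive) of a precursor sequence."""
    if not (1 <= start <= end <= len(precursor)):
        raise ValueError("mature range lies outside the precursor sequence")
    return precursor[start - 1:end]


# Hypothetical 20-residue precursor whose first 8 residues are assumed
# to be a signal peptide that is absent from the mature protein.
precursor = "MKTLLVAAGSCYCRIPACIA"
print(mature_sequence(precursor, 9, 20))
```

Only the trimmed sequence would then be carried forward into the alignment steps.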
What are the secondary structural assignments for the protein? Has the author of the data made these assignments, and do they agree with assignments made by structure analysis programs such as DSSP and Define_S? If not, where do they disagree and why? Are disulfide bridges important to the integrity of the structure? If so, where are they and which residues are connected in this way? Have functional sites been identified in the structure file? Where are they and what structural elements are they composed of?
Not all coordinate data sets give this type of information. The author of the data may expect that anyone interested in the structure would already be familiar with its known characteristics from the literature. There are other means to determine this type of information rather than depending on the author.
What does visual examination of the structure reveal about the interplay of secondary structural elements with one another? Are certain parts of the molecule occupied with the functional aspects of the protein, while others seem to be spatial elements serving to get the functional parts of the protein into the proper positions for interactions to take place? Are there complex folding patterns present in the structure representative of a given family of proteins that need to be maintained? Are there any structural elements in the protein that don't belong to this folding pattern? Have substrates been determined in the data set and how do they affect the conformation of the given functional site? Is the protein a composite of different types of functional sites? How do they interact with one another? What functional sites are present in the sequence to be modelled and are they the same as those in the coordinate data? What sort of additional information is available on the data set based on its primary sequence?
Running sequence analysis programs from GCG on the primary sequence extracted from the coordinate data file can lead to information on functional site patterns, the possible external location of helical regions and the hydrophobic nature of the protein. GCG's Motifs provides functional site information based on site patterns found in the primary sequence. The homegrown program, Amphi, provides information on possible surface helices. A number of other programs give information on hydrophobicity: GCG's Pepplot and PeptideStructure, plus the homegrown programs GES and PK23. GCG's HelicalWheel can show whether the residues of a located helical region are arranged in an organized pattern. Server resources can be used to augment locally available analysis techniques.
All the analyses that were run on the primary sequence of the coordinate data set must be repeated on the sequence to be modelled. Although secondary structural prediction is still not accurate, these determinations must be made on the sequence. They provide an idea of where structural elements may exist. That information can be used in the alignment process. The GCG program PeptideStructure can supply the needed data.
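To make the idea of a hydrophobicity analysis concrete, here is a minimal Python sketch of a sliding-window hydropathy profile. It uses the published Kyte-Doolittle scale as a stand-in; it is not a reimplementation of Pepplot, PeptideStructure, GES or PK23, and the window size of 7 is simply an assumed default.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=7):
    """Average hydropathy over each window of consecutive residues."""
    seq = seq.upper()
    if window < 1 or window > len(seq):
        raise ValueError("window must be between 1 and the sequence length")
    scores = [KD[res] for res in seq]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


# Made-up peptide: a hydrophobic stretch followed by a charged one.
for start, value in enumerate(hydropathy_profile("ILVAGLKDEKRD", 5), start=1):
    print(start, round(value, 2))
```

Stretches with high averages are candidates for buried regions, while low averages suggest surface exposure.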
The basis of the entire process is the alignment of primary sequences. The quality of the alignment is extremely important since it determines the quality of the final output product.
A number of GCG programs can produce sequence alignments: GAP and BESTFIT for pairwise alignments, and PILEUP for multiple sequence alignments built from progressive pairwise alignments. GAP and BESTFIT do pairwise alignments differently. GAP produces the best overall (global) alignment between the two sequences, while BESTFIT produces the best alignment of local regions. Because of this subtle difference, both programs should be run on the desired sequences; the result that produces the highest degree of similarity is used as the basis of the alignment process.
The quality of an alignment depends on the comparison table used by the software as well as the approach used. A comparison table has been created based on the analysis of amino acid substitutions after superpositioning of homologous protein structures. Running any alignment program with the command switch -DATa=genmoredata:structgappep.cmp makes that program use this comparison table instead of the default one based on evolutionary substitutions. Using this table can greatly increase the degree of similarity between the two sequences and likewise alter the nature of the alignment. Compare these results with the earlier results to see whether the general nature of the alignment is the same or different. Even if the nature of the alignment is altered and gaps have been introduced where none existed before, if the degree of similarity has increased and later analysis shows alignment between functional sites, this alignment should be used over one giving a lower degree of similarity with no gaps.
Sometimes alignments need to be forced, for example when disulfide bonds are important to the structure of the proteins being studied and no CYS alignments can be produced with the normal comparison tables. This can be done by changing the values assigned in the comparison table to favor certain types of matches. For example, the value assigned to a CYS-CYS match in the table could be increased two to ten-fold until the desired number of CYS alignments were produced. Changing the comparison table values can force pairing of other amino acid matches as well if the need arises. Normally, this option is only used as a last resort.
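The difference between an overall (global) and a local alignment can be seen in a small Python sketch. It scores two sequences with simple dynamic programming in the style of Needleman-Wunsch (global) and Smith-Waterman (local). The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for clarity; they are not the comparison tables or gap penalties that GAP and BESTFIT use.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Best alignment score of a and b: global by default, local if asked."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment pays for leading gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            if local:
                cell = max(cell, 0)          # a local alignment may restart
            score[i][j] = cell
            best = max(best, cell)
    return best if local else score[-1][-1]


# Toy sequences: a shared core flanked by unrelated residues.
print("global:", align_score("XXACDEYY", "ACDE"))
print("local: ", align_score("XXACDEYY", "ACDE", local=True))
```

The global score is pulled down by the unmatched flanking residues, while the local score reflects only the shared core, which is why the two programs can rank the same pair of sequences differently.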
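As a rough illustration of what changing a comparison table value means, the Python sketch below scales a single entry in a toy table. The dictionary layout, the function name boost_pair and the numeric values are all hypothetical; a real GCG comparison table is a text file with its own format, which this sketch does not attempt to read or write.

```python
def boost_pair(table, res_a, res_b, factor):
    """Return a copy of the table with the res_a/res_b score multiplied by factor."""
    boosted = dict(table)
    for key in ((res_a, res_b), (res_b, res_a)):
        if key in table:
            # Always scale from the original value so a symmetric
            # pair such as C-C is not multiplied twice.
            boosted[key] = table[key] * factor
    return boosted


# Hypothetical scores for a handful of residue pairs.
toy_table = {("C", "C"): 1.5, ("A", "A"): 1.0, ("C", "S"): 0.2, ("S", "C"): 0.2}

forced = boost_pair(toy_table, "C", "C", 5)
print(toy_table[("C", "C")], "->", forced[("C", "C")])
```

Working on a copy keeps the original table available, so the forced alignment can always be compared against the unforced one.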
At times, alignments between the desired proteins are relatively low and the nature of functional sites and their possible consensus with one another not well understood. When this occurs, one way to improve your understanding is to do multiple pairwise alignments with PILEUP and use the consensus regions developed there as a guide to a desired modelling alignment. PILEUP is run with a number of sequences which the modeller feels are related to the unknown sequence and with the coordinate sequence as a reference to see how the new sequences affect the alignment. The trick here is to use only enough sequences to clarify and not muddy the alignment issue. Including a sequence in this set that is not strongly enough related to the unknown sequence will only make matters worse.
Once an alignment has been derived, check it by looking at how the additional information about the two sequences relates via the alignment. This is done by creating a file in which the various features found for a sequence are compiled.
A similar collection of data must be created for the sequence to be modelled. Here the information types are the same as those for the coordinate data set, with one exception: the sequence to be modelled has no x-ray secondary structural assignments and no Define_S or DSSP results. To make future information alignments easier, put the noted features below the sequence line and the numbering system above it.
Now append the file containing the information on the sequence to be modelled to the file for the coordinate data. Using the desired alignment file as a reference, edit the combined files to reflect this alignment. For a usable alignment, there should be agreement on as many recorded features as possible. If prediction values for the coordinate data don't agree with the reported data, does the modelled sequence show the same type of predictions in these disputed areas? Do the motif areas line up with one another?
Oftentimes the lengths of the two sequences are exactly the same and there is good similarity between the two sequences. When this happens, just overlay the new sequence over the backbone sequence of the coordinate data. For an accurate model, these results should then be minimized or at least subjected to a pass or two by a distance geometry program to get the side chains in more realistic positions.
At other times, there are gaps and inserts to worry about. When a gap occurs in the to-be-modelled sequence that doesn't exist in the coordinate data, that area needs to be looked at. Is it a loop section that easily could be clipped out, and the gap in the sequence removed by rotating one or two existing residues on either side of the gap, close enough to one another to form a normal peptide bond length between the moved residues? If so, feel free to make such a change in the structure of the coordinate data set prior to doing an overlay.
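Whether two residues have been brought close enough to bond can be checked numerically. The Python sketch below compares the distance between the carbonyl carbon of one residue and the amide nitrogen of the next against a typical peptide C-N bond length of about 1.33 angstroms. The tolerance of 0.15 angstroms and the coordinates are assumptions for illustration only.

```python
import math

PEPTIDE_CN = 1.33  # typical peptide C-N bond length, in angstroms


def can_close_gap(c_atom, n_atom, tolerance=0.15):
    """True if the C...N distance is within tolerance of a peptide bond."""
    return abs(math.dist(c_atom, n_atom) - PEPTIDE_CN) <= tolerance


# Hypothetical coordinates (x, y, z) in angstroms.
before_rotation = ((10.0, 4.0, 2.0), (13.6, 4.5, 2.2))
after_rotation = ((10.0, 4.0, 2.0), (11.3, 4.2, 2.1))

print(can_close_gap(*before_rotation))
print(can_close_gap(*after_rotation))
```

A check like this only confirms the bond length; the geometry around the new bond still needs to be relaxed by minimization.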
When an insert is called for, look at the area that the insert would be placed in. Is it on the surface of the molecule? Is the insert hydrophilic or hydrophobic, and in what direction? What are the rest of the secondary structural elements in that area? Could a similar type of structural unit be created that would be consistent with the rest of this area and match the hydrophobicity requirements for the insert? The structural elements involved in the functional sites must be kept in the same general spatial positions. If this insert is outside of that restricted area, almost any configuration that meets the hydrophobicity requirements can be created in the desired region.
Model building that requires modifying structural members of the functional site should be carefully thought out. Changes in these areas need to be coupled with changes elsewhere in the molecule that will allow the functional site to basically remain intact even if component helices or sheets are now longer or shorter than in the original. If these changes can be made to keep the truly vital residues in the same spatial locations, then the modifications will produce results that can be tested visually to see if they are realistic.
Exercise for week 14
This series of exercises acquaints you with a number of different skills needed to conduct homology modelling of protein structures. These skills include: surfing the Internet, sequence analysis determinations, manipulating PDB files, visualizing structural data, and effective editing. Modelling of structures should only be undertaken when there is some chance for success. To determine if it has the potential to succeed requires sequence analysis skills. Evaluating the results and visualizing the structure requires molecular modelling techniques. Follow instructions in bold by pressing the ENTER key. The <rtn> symbol given in program examples means to press the ENTER key as well.
Homology modelling has changed with the advent of the modelling server, Swiss Model. This server is a painless way to try getting a theoretical model of a protein structure. While it is not always successful, the minimal effort involved in making the attempt makes this step an excellent time investment. Because this is a network process subject to all the problems of the net (such as sites and/or gateways going down), start with a visit to the Swiss Model web site. Confirmation of the request submission and the results (good or bad) are sent back via e-mail. The determination of a structure can take several hours on the server. Since there is additional work needed in order to visualize these results, send requests early.
Enter the access codes of SwissProtein sequences directly into the request form of Swiss Model. All the sequences to be used this week are in the current release of the SwissProtein database.
1) Surfing the Internet to Make Modelling Requests
From the windows screen, we will go to the Swiss Model web site. This site contains a form system for submitting homology modelling requests to their modelling server. Select the Netscape icon (the large N) with the arrow and press the left mouse button. The arrow changes to an hourglass while the connection is being made to the VADMS home page. Use the arrow to select the Bookmarks menu, and the Swiss-Model: Automate...ein Modelling Server entry from this menu.
You are now connected to the Swiss Model home page. Depending on network traffic, it may take a moment for their logo to appear. This is a form driven system. Move down the page until you reach a section of the page entitled How to Access Swiss-Model:. Select the First Approach mode phrase. This will move you to another part of the web site. Once there move down the page until you reach the These fields MUST be completed section. Here there are three boxes into which you need to enter three different pieces of information: your e-mail address, your name and a title for the requested modelling job.
Move the cursor to the beginning of the address box. The arrow changes into a symbol that allows you to fill in the box. Enter your e-mail address. Your address is as follows: expxx@ribozyme.vadms.wsu.edu. Replace the xx with the actual numbers for your account. Move to the beginning of the next box and enter your name. In the final box, enter a short title for your modelling attempt. You will do this process 4 or 5 times, depending on whether you do the extra credit. The title you enter depends on which of the sequences you want to use in this request. Refer back to page two for information on these proteins from which to extract a short title.
Move down the page again until you reach the Swiss-Prot ID code to model: box. The five SwissProtein codes in order are: cyc_euggr, def5_mouse, mi1a_human, id3_human and sttn_rat. Move down past the space for entering your own sequence to the button for submitting the request. Use the arrow to select the Send Request button. The system should put up a new screen at this point informing you that your request has been sent off. Depending on the network traffic it may take some time for this screen to appear.
After the new screen has appeared, select the BACK button from the top of the screen. This should return you to the forms screen again. Check that the information in the address and name boxes is still correct. Change the title box to reflect information on the next sequence you want to have modelled. Move down to the ID code box. Change the contents of this box to the access code for the next sequence. Move down to the Send Request button and send off another request.
Repeat this process as many times as needed to make all the modelling requests you want.
To exit the program select the File option from the top of the screen and select its Exit option. This will return you to the overlapping windows screen.
2) Activate the computer.
Activate the machine you want to use, make connections with ribozyme and log into your account.
3) Move to this week's subdirectory and copy over to it the necessary files.
% cd fourteen
Now copy over all the files needed to do this week's exercise. They are located in the directory location $UGRAD_DIR/week14.
% cp $UGRAD_DIR/week14/* .
4) Checking to see if the confirmations have arrived yet.
The Swiss Model server sends back a confirmation for any submitted job. This is in the form of a mail message. You should receive one of these for each job you have submitted. Go into the pine mailer and look at these messages. They are very similar to the one given below. In them is the request code number for your submission. In this example it appears in bold type. Record those codes on the next page to help you keep your results straight when they start coming back. The title you gave the request can also help to keep the resulting information sorted. Record that information as well.
From: SMTP%"swissmod@ggr.co.uk" 18-OCT-1995 13:23:12.13
To: TEACHER
CC:
Subj: Welcome_to_SwissModel
Date: Wed, 18 Oct 95 20:53:31 +0100
From: swissmod@ggr.co.uk
Message-Id: <9510181953.AA13812@ch0x01.gimb.glaxo>
Subject: Welcome_to_SwissModel
Apparently-To: <teacher@ribozyme.vadms.wsu.edu>
/\=====================================================================/\
//=======================================================================\\
| |
| >>>>> Welcome to the Swiss-Model Protein Modeling Server <<<<< |
| |
\\=======================================================================//
\/=====================================================================\/
Experimental Swiss-Model Protein Modeling E-mail Server (GLAXO IMB)
If results of this search are reported or published, please mention
that the computation was performed at the GLAXO Institute for Molecular
Biology SA using the Swiss-Model Automated Protein Modeling service.
Full address: Dr. Manuel C. Peitsch
GLAXO Institute for Molecular Biology S.A.
14, chemin des Aulx
Case Postale 674
1228 Plan-les-Ouates, Geneva
Switzerland
Phone : +41 22 706 96 66
FAX : +41 22 794 69 65
e-mail: mcp13936@ggr.co.uk
=============================================================================
Swiss-Model makes use of ProMod (PROtein MODeling tool) briefly described in:
Peitsch, M. C., Jongeneel, C. V. (1993)
A 3-D model for the CD40 ligand predicts that it is
a compact trimer similar to the tumor necrosis factors.
Int. Immunol. 5,233-238.
Peitsch, M. C. (1995)
Protein modeling by E-mail
Bio/Technology 13,658-660.
=============================================================================
============================================================ MC. Peitsch ====
=============================================================================
Swiss-Model (ProServer Version 1.1) started on Wed Oct 18 20:53:28 MET 1995
Process identification is AAAa13736
The modelling procedure is now in progress, and its results should be
sent to you shortly.
Title of your Request
defs try1
request #1 request code is: ________________ title is: ________________
request #2 request code is: ________________ title is: ________________
request #3 request code is: ________________ title is: ________________
request #4 request code is: ________________ title is: ________________
request #5 request code is: ________________ title is: ________________
5) Run the demo that describes this week's activities.
The demo for this week shows how to determine whether a modelling attempt should be made on a protein sequence. The steps in making a Swiss Model request are handled in text form. The rest of the steps are shown by working through the cytochrome c example. This is a demo like the one in week 9 where GCG materials are combined with modelling information. The demo is self-pacing in that it has a number of pause statements in the body of the demo which you can use to control the flow of information. GCG is automatically started and the graphics device set just by running the demo. To start this process enter the command given on the next page. After the first section on GCG analysis and PDB data modification, a MacroModel session is automatically launched. You will have to enter the name of the log file, week14.log, and respond with n to the batch processing question.
% demo14
Background information is given on the nature of the demo you are viewing. Textual information is given on the Swiss Model submission process. GCG is activated and the graphics device set. Information is given on the nature of the cytochrome c family of proteins. These proteins are very conserved, more so within evolutionary families than between them. A file is displayed showing the diversity of these proteins across the evolutionary spectrum.
The first step in any modelling attempt is to run a FASTA search with a sequence on the NRL_3D database. This points out if there are any similar sequences with known structures to the sequence in question. A small subset of the best alignments is put into a fil file to be used in other programs. A PILEUP run is carried out using this fil file and the results examined. If there is good agreement over vast regions of the resulting alignment, then the probability is high that the sequence can be modelled.
When the results from Swiss Model come back, the mail is checked to see if the process was successful or not. The cytochrome c sequence produces a model structure. The length of the modelled structure is checked to see just how much of the original sequence was actually used to generate the data. The coordinate data file is then modified to correct for an ordering problem and passed through the conversion program BFILER.
The converted data is displayed with MacroModel. This is a complete structure with all the side chains. To make it easier to look at, the data is stripped down to its backbone. This backbone structure is then compared to the backbone of one of the PDB files used to generate the model. Finish off this demo by selecting three atom sets to make the superpositioning possible. After the three atom sets have been chosen, select RigSp and plot the results. Select Stop to exit the demo, responding with y to the two prompts at the end of the program.
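Two of the checks the demo performs, counting how many residues were actually modelled and stripping a structure down to its backbone, can be approximated outside MacroModel with a short Python sketch. It assumes standard fixed-column PDB ATOM records. The helper atom_line exists only to build toy data for this example, and none of this reproduces what BFILER or MacroModel do internally.

```python
BACKBONE = {"N", "CA", "C", "O"}


def atom_line(serial, name, res, chain, resseq, x, y, z):
    """Hypothetical helper: format one fixed-column ATOM record for toy data."""
    return (f"ATOM  {serial:>5} {name:<4} {res:>3} {chain}{resseq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def atom_records(pdb_text):
    """All ATOM records in the text, in file order."""
    return [ln for ln in pdb_text.splitlines() if ln.startswith("ATOM")]


def residue_count(pdb_text):
    """Number of residues, counted as distinct alpha-carbon positions."""
    residues = {
        (ln[21], ln[22:27])              # chain ID, residue number + insertion code
        for ln in atom_records(pdb_text)
        if ln[12:16].strip() == "CA"     # atom name occupies columns 13-16
    }
    return len(residues)


def backbone_only(pdb_text):
    """Keep only the N, CA, C and O atoms of each residue."""
    return "\n".join(
        ln for ln in atom_records(pdb_text) if ln[12:16].strip() in BACKBONE
    )


# Toy two-residue fragment with made-up coordinates.
toy = "\n".join([
    atom_line(1, "N", "GLY", "A", 1, 0.0, 0.0, 0.0),
    atom_line(2, "CA", "GLY", "A", 1, 1.458, 0.0, 0.0),
    atom_line(3, "C", "GLY", "A", 1, 2.0, 1.4, 0.0),
    atom_line(4, "O", "GLY", "A", 1, 1.3, 2.4, 0.0),
    atom_line(5, "N", "ASP", "A", 2, 3.3, 1.5, 0.0),
    atom_line(6, "CA", "ASP", "A", 2, 4.0, 2.8, 0.0),
    atom_line(7, "CB", "ASP", "A", 2, 5.5, 2.6, 0.0),
])
print("residues modelled:", residue_count(toy))
print(backbone_only(toy))
```

Comparing the residue count with the length of the submitted sequence shows how much of the original sequence the server actually used.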
The demo moved you over to the modelling machine. To get back to ribozyme for the sequencing part of this week's exercise enter the term logout.
$ logout
6) Working through the modelling of an unusual cytochrome c sequence.
Cytochrome c is a very highly conserved protein. This is more true within evolutionary families than between them. There is a motif pattern for this protein, CXXCH. This motif works for all but four of the known cytochrome c proteins. Cyc_euggr is one of these unusual proteins.
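A motif such as CXXCH, where X stands for any residue, maps directly onto a regular expression. The Python sketch below reports the 1-based position of every occurrence. The function name find_motif and the test fragment are illustrative only; the fragment is not a database entry.

```python
import re

# C, any two residues, C, H. The lookahead lets overlapping hits be reported.
HEME_MOTIF = re.compile(r"(?=C..CH)")


def find_motif(seq):
    """Return the 1-based start position of each CXXCH occurrence."""
    return [m.start() + 1 for m in HEME_MOTIF.finditer(seq.upper())]


# Illustrative fragment containing one copy of the motif.
fragment = "GDVEKGKKIFVQKCAQCHTVEKGG"
print(find_motif(fragment))

# A sequence lacking the first cysteine of the pattern gives no hit.
print(find_motif("GDVEKGKKIFVQKAAQCHTVEKGG"))
```

A sequence that returns an empty list is not necessarily unrelated to the family; as noted above, a few known cytochrome c proteins do not fit the pattern.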
There are over 100 cytochrome c sequences currently in the databases on ribozyme. These have been located and analyzed. To show just how conserved this protein is within an evolutionary family, look at the file that has been prepared of the mammal cytochrome cs. These data were taken from the alignment of all 113 sequences, so the gaps (shown by periods) relate to that full alignment, not to an alignment of the mammal sequences themselves.
% more cc-family.data
There is much more diversity when all the sequences are taken into account. A subset of the data set was created to bring this point across. Look at this alignment and see how the level of consensus has dropped, although it is still there. The positions where all the sequences match are shown on the consensus line. Notice how the sequence lengths vary; however, the core of the protein shows very few gaps, indicating a highly conserved spatial conformation as well.
% more ccs-pretty.data
The first step in any modelling attempt is to see if there are any similar sequences to the sequence to be modelled in the NRL_3D database. This database is composed only of protein sequences with known structures. The database is fairly small, only a little over 5000 sequences, and any analysis runs made using it can be done without batch processing. Use the example given below to guide your FASTA run. Terms in bold type should be entered into the program.
% fasta
FASTA does a Pearson and Lipman search for similarity between a query
sequence and any group of sequences. For nucleotide database searches,
FASTA is more sensitive than BLAST.
FASTA with what query sequence ? sw:cyc_euggr <rtn>
Begin (* 1 *) ? <rtn>
End (* 102 *) ? <rtn>
Search for query in what sequence(s) (* SwissProt:* *) ? nrl_3d:* <rtn>
What word size (* 2 *) ? <rtn>
List how many best scores (* 40 *) ? 10 <rtn>
What should I call the output file (* cyc_euggr.fasta *) ? <rtn>
1 Sequences 255 aa searched NRL_3D:12CA
101 Sequences 26,760 aa searched NRL_3D:1CHG3
201 Sequences 41,409 aa searched NRL_3D:1RSM
////////////////////////////////////////////////////////////////////////
4,901 Sequences 839,101 aa searched NRL_3D:1R08A
5,001 Sequences 855,557 aa searched NRL_3D:1PTSB3
(Peptide) FASTA of: cyc_euggr from: 1 to: 102 October 18, 1995 11:16
TO: nrl_3d:* Sequences: 5,069 Symbols: 865,709 Word Size: 2
Scoring matrix: GenRunData:fastapep.Cmp
Variable pamfactor used
Gap creation penalty: 12.0 Gap extension penalty: 4.0
////////////////////////////////////////////////////////////////////////
How many more scores would you like to see (* 0 *) ? <rtn>
The list contains 10 entries.
How many alignments would you like to see (* 10 *) ? <rtn>
Aligning...
CPU time used:
Database scan: 0:00: 3.4
Post-scan processing: 0:00: 1.7
Total CPU time: 0:00: 5.2
Output File: cyc_euggr.fasta
Look closely at the results of this run. Select the two best alignments and create a fil file of them and the cytochrome c sequence being used. Record their access codes below along with the degree of similarity they had to the query sequence.
best alignments were with: _________________________________________________
quality of the alignments are: _____________________________________________
Use pico to create your fil file. Name this file cc-align.fil. The content of your fil file should look like that given below, where xxxx is the access code of the first similar structure and yyyy the second.
sw:cyc_euggr nrl_3d:xxxx nrl_3d:yyyy
With this fil file in hand, run an alignment to see just how well the three sequences align with one another. This is done with PILEUP.
% pileup
PILEUP creates a multiple sequence alignment from a group of related
sequences using progressive, pairwise alignments. It can also plot a
tree showing the clustering relationships used to create the alignment.
PileUp of what sequences ? @cc-align.fil <rtn>
1 cyc_euggr 102 aa
2 xxxx 104 aa
3 yyyy 103 aa
What is the gap weight (* 3.00 *) ? <rtn>
What is the gap length weight (* 0.10 *) ? <rtn>
This program can display the clustering relationships graphically.
Do you want to:
A) Plot to a FIGURE file called "PileUp.Figure"
B) Plot graphics on HP7550 attached to /dev/ttyx
C) Suppress the plot
Please choose one (* A *): c <rtn>
What should I call the output file name (* pileup.msf *) ? cc-align.msf <rtn>
Determining pairwise similarity scores...
1 x 2 1.01
1 x 3 0.96
2 x 3 1.32
Aligning...
1 .....-.
2 .....-.
.....-.
Total sequences: 3
Alignment length: 104
CPU time: 00.35
Output file:/disk3/usr/local/people/expxx/fourteen/cc-align.msf
Type the results of this run out on the screen.
Look at the contents of this output file. The order in which sequences appear is significant. The structure sequence listed just above the cytochrome c to be modelled is the one with the greatest degree of similarity to it.
Notice that there is only one gap in the alignments except at the end where two of the sequences are shorter than the other one. This is a good sign for the modelling process. Handling gaps or inserts is always tricky and a single residue gap shouldn't be any real problem for the automated modelling software. Print off this file and use a highlighter to mark the areas of agreement in the alignment. Determine the percentage of columns of residues that match in this file. Record that value below. When identical matches are over 40% in alignments of nearly the same length, the probability is very good that a model can be generated for the desired sequence.
% of identical matches: _________________________________________________
Based on the percentage of identical matches, will this structure be modelled or not? ________
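If you would like to cross-check your highlighter count, the percentage of identical columns can be computed with a short Python sketch. It assumes the aligned rows have already been copied out of the output file as equal-length strings, with gaps written as periods, dashes or tildes, and it divides by the full alignment length. The rows shown are toy data, not real PILEUP output.

```python
def percent_identical_columns(rows, gap_chars=".-~"):
    """Percentage of alignment columns in which every sequence has the same residue."""
    lengths = {len(row) for row in rows}
    if len(lengths) != 1:
        raise ValueError("aligned rows must all have the same length")
    length = lengths.pop()
    if length == 0:
        raise ValueError("alignment is empty")
    identical = sum(
        1
        for column in zip(*rows)
        if column[0] not in gap_chars and len(set(column)) == 1
    )
    return 100.0 * identical / length


# Toy three-sequence alignment; the period marks a gap.
rows = [
    "ACDE.",
    "ACDFG",
    "ACDEG",
]
print(percent_identical_columns(rows))
```

Columns containing a gap never count as identical, so end gaps in shorter sequences lower the percentage.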
Now you must wait for the results of the Swiss Model request to see if this is a correct assumption or not.
7) Working with a small sequence.
Not all sequences are as conserved as the cytochrome cs, which have been heavily studied, producing many x-ray structures from which to generate a model.
The small defensin proteins have not been studied much. Modelling techniques are based on similarities between sequence alignments. When very small proteins are aligned with larger ones, there are large regions of the alignment where no alignment actually exists. Always choose similar-sized proteins to do modelling with. The access code for the defensin sequence to be used in this study is def5_mouse.
Run a FASTA search with this sequence against the NRL_3D database. Use the example given below to guide your FASTA run. Terms in bold type should be entered into the program.
% fasta
FASTA does a Pearson and Lipman search for similarity between a query
sequence and any group of sequences. For nucleotide database searches,
FASTA is more sensitive than BLAST.
FASTA with what query sequence ? sw:def5_mouse <rtn>
Begin (* 1 *) ? <rtn>
End (* 36 *) ? <rtn>
Search for query in what sequence(s) (* SwissProt:* *) ? nrl_3d:* <rtn>
What word size (* 2 *) ? <rtn>
List how many best scores (* 40 *) ? 10 <rtn>
What should I call the output file (* def5_mouse.fasta *) ? <rtn>
1 Sequences 255 aa searched NRL_3D:12CA
101 Sequences 26,760 aa searched NRL_3D:1CHG3
201 Sequences 41,409 aa searched NRL_3D:1RSM
////////////////////////////////////////////////////////////////////////
4,901 Sequences 839,101 aa searched NRL_3D:1R08A
5,001 Sequences 855,557 aa searched NRL_3D:1PTSB3
(Peptide) FASTA of: def5_mouse from: 1 to: 36 October 18, 1995 11:16
TO: nrl_3d:* Sequences: 5,069 Symbols: 865,709 Word Size: 2
Scoring matrix: GenRunData:fastapep.Cmp
Variable pamfactor used
Gap creation penalty: 12.0 Gap extension penalty: 4.0
////////////////////////////////////////////////////////////////////////
How many more scores would you like to see (* 0 *) ? <rtn>
The list contains 10 entries.
How many alignments would you like to see (* 10 *) ? <rtn>
Aligning...
CPU time used:
Database scan: 0:00: 3.5
Post-scan processing: 0:00: 1.7
Total CPU time: 0:00: 5.2
Output File: def5_mouse.fasta
Look closely at the results of this run. Select the four best alignments from this output file. Record their access codes below along with the degree of similarity they had to the sequence that was searched with. In addition, record the length of the alignment overlap.
best alignments were with: __________________________________________________ quality of the alignments are: ______________________________________________ alignment overlaps: _________________________________________________________
Output files of this type don't give you the lengths of the sequences being compared, only the percent similarity and the region of the sequence in which that alignment takes place. Working with such a small protein requires care. Check the lengths of the compared sequences.
Use typedata on the four chosen files to determine the actual lengths of the sequences and if they are close to the length of the defensin or not. Record the respective lengths below. Use the example given as a guide. In this example, the xxxxx denotes the access code of the selected data set being looked at.
% typedata nrl_3d:xxxxx
file #1 length is: __________________ file #2 length is: __________________ file #3 length is: __________________ file #4 length is: __________________
The NRL_3D database contains a large number of duplicate sequences due to the way x-ray data is reported. When access codes have the same first four characters, it usually means that they are duplicates. This is almost a certainty when the sequence lengths match as well.
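The duplicate rule of thumb just described (same first four characters, same length) can be written down as a small Python sketch. It assumes you have already collected the access codes and the lengths reported by typedata into a list of pairs. The function name and the codes in the example are hypothetical placeholders, not real NRL_3D entries.

```python
from collections import defaultdict


def group_probable_duplicates(entries):
    """entries: iterable of (access_code, length) pairs.

    Returns the groups of codes that share both their first four
    characters and their sequence length, i.e. probable duplicates."""
    groups = defaultdict(list)
    for code, length in entries:
        groups[(code[:4].upper(), length)].append(code)
    return [codes for codes in groups.values() if len(codes) > 1]


if __name__ == "__main__":
    # Hypothetical codes and lengths, for illustration only.
    hits = [("1ABCA", 30), ("1ABCB", 30), ("2XYZ", 34), ("1ABCC", 62)]
    print(group_probable_duplicates(hits))
```

Keep only one member of each reported group in your fil file. Codes that share four characters but differ in length are left alone here, since the text calls them only likely duplicates.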
Having carried out these steps, how many of the top four files can you still use? Create a fil file of them and the defensin sequence being used. Use pico to create your fil file and call it def-align.fil.
With this fil file in hand, it is time to run an alignment to see how well the chosen sequences do align with one another. Use PILEUP for this.
% pileup
PILEUP creates a multiple sequence alignment from a group of related
sequences using progressive, pairwise alignments. It can also plot a
tree showing the clustering relationships used to create the alignment.
PileUp of what sequences ? @def-align.fil <rtn>
1 def5_mouse 36 aa
2 xxxx 30 aa
What is the gap weight (* 3.00 *) ? <rtn>
What is the gap length weight (* 0.10 *) ? <rtn>
This program can display the clustering relationships graphically.
Do you want to:
A) Plot to a FIGURE file called "PileUp.Figure"
B) Plot graphics on HP7550 attached to /dev/ttyx
C) Suppress the plot
Please choose one (* A *): c <rtn>
What should I call the output file name (* pileup.msf *) ? def-align.msf <rtn>
Determining pairwise similarity scores...
1 x 2 0.73
Aligning...
1 .-.
Total sequences: 2
Alignment length: 36
CPU time: 00.32
Output file:/disk3/usr/local/people/expxx/fourteen/def-align.msf
Type off on the screen the results of this run.
Notice that there are no gaps in the alignment except at the beginning where the sequences are of different lengths. This is a good sign. Print off this file and use a highlighter to mark the areas of agreement in the alignment. Determine the percentage of columns of residues that match in this file. Record that value below.
% of identical matches: _________________________________________________ Based on the percentage of identical matches, will this structure be modelled or not? ________
Now wait for the results of the Swiss Model request to see if this is a correct assumption or not.
8) Working with a slightly longer protein.
The inflammatory proteins have not been studied too much so far. The access code for the inflammatory protein sequence to be used in this study is mi1a_human.
Run a FASTA search with this sequence against the NRL_3D database. Use the example given below to guide your FASTA run. Terms in bold type should be entered into the program.
% fasta
FASTA does a Pearson and Lipman search for similarity between a query
sequence and any group of sequences. For nucleotide database searches,
FASTA is more sensitive than BLAST.
FASTA with what query sequence ? sw:mi1a_human <rtn>
Begin (* 1 *) ? <rtn>
End (* 92 *) ? <rtn>
Search for query in what sequence(s) (* SwissProt:* *) ? nrl_3d:* <rtn>
What word size (* 2 *) ? <rtn>
List how many best scores (* 40 *) ? 10 <rtn>
What should I call the output file (* mi1a_human.fasta *) ? <rtn>
1 Sequences 255 aa searched NRL_3D:12CA
101 Sequences 26,760 aa searched NRL_3D:1CHG3
201 Sequences 41,409 aa searched NRL_3D:1RSM
////////////////////////////////////////////////////////////////////////
4,901 Sequences 839,101 aa searched NRL_3D:1R08A
5,001 Sequences 855,557 aa searched NRL_3D:1PTSB3
(Peptide) FASTA of: mi1a_human from: 1 to: 92 October 18, 1995 11:16
TO: nrl_3d:* Sequences: 5,069 Symbols: 865,709 Word Size: 2
Scoring matrix: GenRunData:fastapep.Cmp
Variable pamfactor used
Gap creation penalty: 12.0 Gap extension penalty: 4.0
////////////////////////////////////////////////////////////////////////
How many more scores would you like to see (* 0 *) ? <rtn>
The list contains 10 entries.
How many alignments would you like to see (* 10 *) ? <rtn>
Aligning...
CPU time used:
Database scan: 0:00: 3.4
Post-scan processing: 0:00: 1.7
Total CPU time: 0:00: 5.1
Output File: mi1a_human.fasta
Type off on the screen the results of this run.
Look closely at the results. Select the four best alignments from this output file. Record their access codes on the next page, along with the degree of similarity they had to the query sequence.
best alignments were with: __________________________________________________ quality of the alignments are: ______________________________________________
The NRL_3D database contains a large number of duplicate sequences due to the way x-ray data is reported. When access codes have the same first four characters, it usually means that they are duplicates.
Having carried out these steps, how many of the top four files can you still use? Create a fil file of them and the mi1a sequence being used. Use pico to create your fil file and call it mi-align.fil.
With this fil file in hand, run an alignment to see how well the chosen sequences align with one another. Use PILEUP for this.
% pileup
PILEUP creates a multiple sequence alignment from a group of related
sequences using progressive, pairwise alignments. It can also plot a
tree showing the clustering relationships used to create the alignment.
PileUp of what sequences ? @mi-align.fil <rtn>
1 mi1a_human 92 aa
2 xxxx 69 aa
3 yyyy 69 aa
What is the gap weight (* 3.00 *) ? <rtn>
What is the gap length weight (* 0.10 *) ? <rtn>
This program can display the clustering relationships graphically.
Do you want to:
A) Plot to a FIGURE file called "PileUp.Figure"
B) Plot graphics on HP7550 attached to /dev/ttyx
C) Suppress the plot
Please choose one (* A *): c <rtn>
What should I call the output file name (* pileup.msf *) ? mi-align.msf <rtn>
Determining pairwise similarity scores...
1 x 2 1.18
1 x 3 1.18
2 x 3 1.50
Aligning...
1 ...-.
2 ...-.
Total sequences: 3
Alignment length: 92
CPU time: 00.56
Output file:/disk3/usr/local/people/expxx/fourteen/mi-align.msf
Type off on the screen the results of this run.
Notice that there are no gaps in the alignment except at the ends where the sequences are of different lengths. Notice also that even though the two NRL_3D data files have different codes, their sequences are identical. Print off this file and use a highlighter to mark the areas of agreement in the alignment. Determine the percentage of columns of residues that match in this file. Record that value below.
% of identical matches: _________________________________________________ Based on the percentage of identical matches, will this structure be modelled or not? __________________
Now wait for the results of the Swiss Model request to see if this is a correct assumption or not.
9) Working with a HLH transcription factor protein.
This type of protein has a very distinctive region, the HLH area. HLH stands for helix-loop-helix and this is a well characterized structural conformation.
Run a FASTA search with this sequence against the NRL_3D database. Use the example given below to guide your FASTA run. Terms in bold type should be entered into the program.
% fasta
FASTA does a Pearson and Lipman search for similarity between a query
sequence and any group of sequences. For nucleotide database searches,
FASTA is more sensitive than BLAST.
FASTA with what query sequence ? sw:id3_human <rtn>
Begin (* 1 *) ? <rtn>
End (* 119 *) ? <rtn>
Search for query in what sequence(s) (* SwissProt:* *) ? nrl_3d:* <rtn>
What word size (* 2 *) ? <rtn>
List how many best scores (* 40 *) ? 10 <rtn>
What should I call the output file (* id3_human.fasta *) ? <rtn>
1 Sequences 255 aa searched NRL_3D:12CA
101 Sequences 26,760 aa searched NRL_3D:1CHG3
201 Sequences 41,409 aa searched NRL_3D:1RSM
////////////////////////////////////////////////////////////////////////
4,901 Sequences 839,101 aa searched NRL_3D:1R08A
5,001 Sequences 855,557 aa searched NRL_3D:1PTSB3
(Peptide) FASTA of: id3_human from: 1 to: 119 October 18, 1995 11:16
TO: nrl_3d:* Sequences: 5,069 Symbols: 865,709 Word Size: 2
Scoring matrix: GenRunData:fastapep.Cmp
Variable pamfactor used
Gap creation penalty: 12.0 Gap extension penalty: 4.0
////////////////////////////////////////////////////////////////////////
How many more scores would you like to see (* 0 *) ? <rtn>
The list contains 10 entries.
How many alignments would you like to see (* 10 *) ? <rtn>
Aligning...
CPU time used:
Database scan: 0:00: 4.3
Post-scan processing: 0:00: 2.2
Total CPU time: 0:00: 7.5
Output File: id3_human.fasta
Look closely at the results of this run. Since the phrase helix-loop-helix comes up in some of the titles, they require closer looking into. None of the values given here are very high, suggesting short regions of similarity. Ignore the myosin fragment and select one of the hemoglobin sequences to give you three sequences to work with. [The helix-loop-helix sequences all had the same first four characters and probably are duplicates].
best alignments were with: ______________________________________________
Having carried out these steps, create a fil file of them and the id3 sequence being used. Use pico to create your fil file and call it id-align.fil.
With this fil file in hand, run an alignment to see how well the chosen sequences align with one another. Use PILEUP for this.
% pileup
PILEUP creates a multiple sequence alignment from a group of related
sequences using progressive, pairwise alignments. It can also plot a
tree showing the clustering relationships used to create the alignment.
PileUp of what sequences ? @id-align.fil <rtn>
1 id3_human 119 aa
2 xxxx 62 aa
3 yyyy 142 aa
What is the gap weight (* 3.00 *) ? <rtn>
What is the gap length weight (* 0.10 *) ? <rtn>
This program can display the clustering relationships graphically.
Do you want to:
A) Plot to a FIGURE file called "PileUp.Figure"
B) Plot graphics on HP7550 attached to /dev/ttyx
C) Suppress the plot
Please choose one (* A *): c <rtn>
What should I call the output file name (* pileup.msf *) ? id-align.msf <rtn>
Determining pairwise similarity scores...
1 x 2 0.52
1 x 3 0.35
2 x 3 0.39
Aligning...
1 ...-.
2 .......-.
Total sequences: 3
Alignment length: 165
CPU time: 00.53
Output file:/disk3/usr/local/people/expxx/fourteen/id-align.msf
The numbers shown during this run don't look good. The lengths of the chosen sequences are not very close to one another. Their pairwise similarity scores are low and the resulting alignment length is longer than any of the component sequences. Not a good sign. Type off the results of this run on the screen.
Notice that there are many gaps in the alignment. Almost none of the positions show any column matches. There are some areas of two-way matches between the id3_human sequence and the helix-loop-helix one.
There must be something here, but what? Re-edit the fil file for this protein and drop off the hemoglobin name from the list. Repeat the PILEUP run on the fil file. Examine the results of this run. Record your observations on this alignment below.
alignment comments: ____________________________________________________ ________________________________________________________________________ ________________________________________________________________________ ________________________________________________________________________
The alignment shown is real. The question is, what is it from? To see if these proteins contain any motif patterns, run the MOTIFS program on the fil file. Use the example below as a guide.
% motifs
MOTIFS looks for sequence motifs by searching through proteins for the
patterns defined in the Dictionary of Protein Sites and
Patterns. MOTIFS can display an abstract of the current literature on
each of the motifs it finds.
from what protein sequence(s) ? @id-align.fil <rtn>
What should I call the output file (* id-align.motifs *) ? <rtn>
ID3_HUMAN len: 119 .......................
xxxx len: 62 .......................
Total finds: 1
Total length: 181
Total sequences: 2
CPU time (sec): 3.19
Output file: "/disk3/usr/local/people/expxx/fourteen/id-align.motifs"
Examine the results of this run. Are there any motifs, and if so, in which protein? The two sequences did align. Go back and check the msf file again. Use your highlighter to mark the area of the motif and see if this is the region in which the two sequences show the highest degree of similarity.
The evidence looks very shaky. There is not a very high probability that the automated modeller software will be able to model this protein. Wait for the results of the Swiss Model request to see if this is a correct assumption.
10) Working with a protein related to elongation factors (Optional)
The elongation factor-related protein to be worked with has the access code sttn_rat. This protein is much larger than the others, at 463 residues. The same principles apply: take care to find similarities between protein sequences of similar lengths, just as you did when dealing with very small proteins.
By now you have gone through the screening steps enough to know them by heart. Do a FASTA run using this sequence on the NRL_3D database. Pick the top three sequences to use in your fil file for this determination. Use pico to create this fil file and call it long-fil.
When you have the fil file, do a PILEUP run on the data in the fil file. Examine the contents of the generated msf file. How do the overall lengths of the four sequences compare? Are there reasonable areas of similarity between all four sequences? Revise your fil file based on your findings and repeat the PILEUP run. How do things look this time? Record your observations below.
alignment comments: ___________________________________________________ _______________________________________________________________________ _______________________________________________________________________ _______________________________________________________________________
Wait for the Swiss Model results to see if your prediction is correct or not.
11) Doing an overlay on a short sequence.
While you wait for the Swiss Model results, here is an example of one of the types of homology modelling techniques used before this server came online, overlaying a sequence on existing coordinates. To do this you will need to be on the modelling machine.
% model1
Try this modelling technique. Get into the MacroModel program. Read in the complete structure for melittin by selecting READ, giving the access code for the melittin file, 2mlt, pressing ENTER for structure number.
Select one of the two chains on the screen to do further work with and delete the other one using the DELT button. Selecting this button twice allows for the deletion of a molecule. Once the phrase Molecule deletion appears at the top of the screen, move the cursor over to a location on the chain you want to remove and press the space bar. The unwanted chain is removed. Remove the two sulfate groups in this manner also.
A good way to create a structure that allows the monitoring of the overlaying process is to strip the structure to its backbone and then color code the result via its various residue types. Select ANALYZ, SETS, MainS, DISPLA, Dis, then Rtype, responding with w to do just the working set data. If you feel comfortable identifying the backbone components without the atom labels, select A LAB to remove them. If your structure appears less than perfect after this last action, select Updat to have the screen redrawn.
Now write the data you have created to a file, saving only the displayed fragment, giving it the extension, backbone. Once the file has been created, read the file back into the program. This is necessary in order to create a complete amino acid version of the structure and not just a backbone in the next step.
You will overlay the following sequence, YAGVALAVLALIIPSLLTWQSRKHNP, upon this coordinate set. This is a portion of a tyrosine permease sequence from E. coli that shows a 39% similarity with the melittin sequence. The three-letter code version of this sequence is given below.
Tyr Ala Gly Val Ala Leu Ala Val Leu Ala Leu Ile Ile Pro Ser Leu Leu Thr Trp Gln Ser Arg Lys His Asn Pro
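If you want to check a three-letter listing like the one above, or produce one for another sequence, the conversion is a simple table lookup. The sketch below uses the standard one-letter and three-letter amino acid codes. The function name is made up for this example.

```python
# Standard one-letter to three-letter amino acid codes.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_three_letter(sequence):
    """Convert a one-letter protein sequence to space-separated
    three-letter codes. Raises KeyError on a non-standard letter."""
    return " ".join(THREE_LETTER[residue] for residue in sequence.upper())


if __name__ == "__main__":
    print(to_three_letter("YAGVALAVLALIIPSLLTWQSRKHNP"))
```

Ticking off each residue against such a listing as you place it is one way to keep track of your position in the 26-residue sequence during the overlay.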
To overlay the sequence, start at the beginning of the sequence. Select PEPTID, move the cursor to the first amino acid in the sequence and press the space bar. Then move the cursor to some point on the first residue of the structure and press the space bar again. The new residue will be drawn in that position. Since this is a new residue, it has the standard atom coloration. You can use this change in color to mark your progress through the sequence.
Use Clip to expand regions of the structure where individual residues are hard to determine. The process is the same. Once the screen has been expanded, pick the desired residue and move the cursor to a spot on the residue to be replaced.
In this process, it is best to do the replacements all in one sitting, for what you are really doing is establishing a second chain on the backbone of the first. They are two different chains that appear to be one. If an amino acid is the same in both sequences it still has to be replaced or the second chain that you are making will be incomplete.
Once you have finished your structure, write the data to a file with your last name as the filename and the extension, overl. Get out of the modelling program. While the structure created is not ideal, it will serve as a starting point to do other processes to improve the quality of the model.
A model such as the one just created could be used to do further work. The produced structure would have to be minimized to adjust the side chains to their new environment. Return to ribozyme.
$ logout
12) Working with Swiss Model results.
Assuming that you now have some e-mail messages back from the Swiss Model server, here is how to use them. First, copy the request codes (actually called process identifications) that you got back from your original submissions and recorded on page 8 into the space provided below.
request #1 code is: ________________ request #4 code is: ________________ request #2 code is: ________________ request #5 code is: ________________ request #3 code is: ________________
You should have a number of mail messages waiting for you. If the modelling process has been successful, three mail messages are sent back. If the process has not been successful, one message is sent back.
Two of the three messages contain the following type of postscript files. These files are plots of the modelling process. One is a profile of the modelling process. The other relates to an energy view of the modelling process. The third message actually contains the coordinate results of the modelling process.
The following is the beginning of a profile postscript message. Only the first few lines are shown. When processing this message for later use, remove the top lines of the file down to the statement starting with %!. Notice that the process code is given on the Subj: and Subject: lines.
To: TEACHER
CC:
Subj: SwissModel-LastModelProfile-AAAa09537
Date: Tue, 17 Oct 95 00:24:54 +0100
From: swissmod@ggr.co.uk
Message-Id: <9510162324.AA09852@ch0x01.gimb.glaxo>
Subject: SwissModel-LastModelProfile-AAAa09537
Apparently-To: <teacher@ribozyme.vadms.wsu.edu>
%!
/Helvetica findfont 10 scalefont setfont
50 400 translate
newpath
1 1 moveto
500 1 lineto
4 4 div setlinewidth
stroke
500 1 moveto
500 200 lineto
4 4 div setlinewidth
stroke
Next is an example of the energy postscript message. Only the first few lines are shown. When processing this message for later use, remove the top lines of the file down to the statement starting with %!PS-Adobe. Notice that the process code is given on the Subj: and Subject: lines. This is a more typical example of the starting of a postscript file.
From: SMTP%"swissmod@ggr.co.uk" 16-OCT-1995 16:50:30.24
To: TEACHER
CC:
Subj: SwissModel-LastModelProsaII-AAAa09537
Date: Tue, 17 Oct 95 00:24:55 +0100
From: swissmod@ggr.co.uk
Message-Id: <9510162324.AA09858@ch0x01.gimb.glaxo>
Subject: SwissModel-LastModelProsaII-AAAa09537
Apparently-To: <teacher@ribozyme.vadms.wsu.edu>
%!PS-Adobe-2.0 EPSF-1.2
%%BoundingBox: 74 96 528 728
%%Page: 1 1
%%EndComments
72 300 div dup scale
90 rotate
Next is an example of the coordinate data message. Only the first few lines are shown. When processing this message for later use, remove the top lines of the file down to the statement starting with HEADER. Notice that the process code is given in two of the REMARK lines as well as in the Subj: and Subject: lines.
From: SMTP%"swissmod@ggr.co.uk" 13-OCT-1995 11:28:36.67
To: TEACHER
CC:
Subj: SwissModel-LastModel-AAAa14478
Date: Fri, 13 Oct 95 19:02:31 +0100
From: swissmod@ggr.co.uk
Message-Id: <9510131802.AA15015@ch0x01.gimb.glaxo>
Subject: SwissModel-LastModel-AAAa14478
Apparently-To: <teacher@ribozyme.vadms.wsu.edu>
HEADER SWISS-MODEL (Automated Protein Modelling Server)
EXPDTA THEORETICAL MODEL (Secondary)
AUTHOR ProMod (SEE REFERENCE IN JRNL Records)
JRNL 1 AUTH M.C.PEITSCH
JRNL 1 TITL PROTEIN MODELING BY EMAIL
JRNL 1 REF BIO/TECHNOLOGY V. 13 258 1995
JRNL 1 REFN ISSN 0733-222X
JRNL 2 AUTH M.C.PEITSCH,C.V.JONGENEEL
JRNL 2 TITL A 3-DIMENSIONAL MODEL FOR THE CD40 LIGAND REVEALS A
JRNL 2 TITL 2 CLOSE SIMILARITY TO THE TUMOR NECROSIS FACTORS
JRNL 2 REF INT.IMMUNOL. V. 5 233 1993
JRNL 2 REFN ASTM INIMEN UK ISSN 0953-8178 759
REMARK
REMARK REFINEMENT of primary model with CHARMm
REMARK
REMARK Your Request is: euggr
REMARK Date : Fri Oct 13 18:58:20 MET 1995
REMARK SMID : AAAa14478
REMARK
Not all requests are successful. In that case Swiss Model sends back only one message. It states that the attempt was unsuccessful and suggests that you cut down your sequence file to areas of good alignment before attempting any further modelling on the protein. The requirements for similarity quality are given in the message. Notice the No_Success phrase in the Subj: and Subject: lines along with the process code. Only parts of this type of message are shown here.
From: SMTP%"swissmod@ggr.co.uk"
To: TEACHER
CC:
Subj: SwissModel-No_Success-AAAa09250
Date: Tue, 17 Oct 95 22:30:31 +0100
From: swissmod@ggr.co.uk
Message-Id: <9510172130.AA09357@ch0x01.gimb.glaxo>
Subject: SwissModel-No_Success-AAAa09250
Apparently-To: <teacher@ribozyme.vadms.wsu.edu>
////////////////////////////////////////////////////////////////////////////
Your modeling request could not be carried out.
Please look at the other messages issued by the server.
The degree of similarity of your sequence with proteins of
known 3D structure may be to low.
At present, Swiss-Model will generate models for sequences
which respond to these criteria:
BLAST search P value : < 0.0001
FASTA search standard deviations above mean : > 9.0
Global degree of sequence identity (SIM) : > 25 % spread of > 40%
of the submitted sequence.
This means that if a relatively short domain, within a long protein,
may considered to low in similarity, even though a model could be
built for it. So define the segment which you wish to model, and
submit it in raw sequence format.
With these examples in mind go into the pine mailer. Read your messages. When you find one(s) from Swiss Model, extract them into file(s). Figure out which of the various requests you submitted the response actually relates to. Use the following names for your files, euggr, def, mi1a, id3 and sttn. For extensions use pro-ps for the profile postscript files, use en-ps for the energy postscript files and swiss-pdb for the coordinate files. Use the extension bomb for requests that failed.
Once you are in pine and reading a message, you can extract it by entering e and answering the prompt with your desired output name. Because of the way in which your account is set up, the extracted mail message will be in your current directory location. After exiting pine, you can work with the file in the pico editor to get it into usable shape.
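Sorting the replies by hand is easy to get wrong when several arrive at once. The Python sketch below reads the Subject: line of an extracted message and reports the message type, the process code and the file extension suggested in this exercise. It assumes the Subject: lines follow the pattern shown in the examples above (SwissModel-type-code). The function and variable names are invented for illustration. Matching a process code to a name such as euggr or def still depends on the request codes you recorded yourself.

```python
import re

# Extensions used in this exercise for each type of reply.
EXTENSIONS = {
    "LastModelProfile": "pro-ps",   # profile postscript plot
    "LastModelProsaII": "en-ps",    # energy postscript plot
    "LastModel": "swiss-pdb",       # coordinate data
    "No_Success": "bomb",           # failed request
}

SUBJECT_RE = re.compile(
    r"^Subject:\s*SwissModel-([A-Za-z_]+)-(\w+)\s*$", re.MULTILINE
)


def classify_reply(message_text):
    """Return (message_type, process_code, extension) for a Swiss Model
    reply, or None if no matching Subject: line is found. The extension
    is None for a message type that is not in EXTENSIONS."""
    match = SUBJECT_RE.search(message_text)
    if match is None:
        return None
    kind, process_code = match.groups()
    return kind, process_code, EXTENSIONS.get(kind)


if __name__ == "__main__":
    sample = (
        "To: TEACHER\n"
        "Subj: SwissModel-No_Success-AAAa09250\n"
        "Subject: SwissModel-No_Success-AAAa09250\n"
        "Apparently-To: <teacher@ribozyme.vadms.wsu.edu>\n"
    )
    print(classify_reply(sample))
```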
With these files in hand, go through and use the editor to process the data so you can use it. Remove the mail information from the top of the files. In the case of the profile and energy postscript files this is all that needs to be done. The bomb files really don't need to be edited if you don't want to. Just record which of the modelling attempts worked and which didn't. The coordinate files will require further processing in order to be able to use them in MacroModel.
euggr _____________________ id3 ________________________ def _______________________ sttn _______________________ mi1a ______________________
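Removing the mail information by hand in pico works, but the same step can be sketched in Python. The function below drops every line before the first line that starts with a given marker: "%!" for either kind of postscript file, "HEADER" for the coordinate files. The function name is invented for this example, and the sketch assumes the marker really does appear at the start of a line, as in the sample messages shown earlier.

```python
def strip_mail_header(text, marker):
    """Return text starting at the first line that begins with marker.

    Raises ValueError if no line starts with the marker, so that a
    message of the wrong type is not silently passed through."""
    lines = text.splitlines(keepends=True)
    for index, line in enumerate(lines):
        if line.startswith(marker):
            return "".join(lines[index:])
    raise ValueError("no line starts with %r" % marker)


if __name__ == "__main__":
    message = (
        "To: TEACHER\n"
        "Subject: SwissModel-LastModelProfile-AAAa09537\n"
        "%!\n"
        "/Helvetica findfont 10 scalefont setfont\n"
    )
    print(strip_mail_header(message, "%!"), end="")
```

Write the returned text to a new file rather than over the original, so you can go back to the extracted message if the wrong marker was used.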
Print off the resulting postscript files. Be sure that you have edited out the mail headers prior to doing this. If these headers are not removed, the result is not a plot on a single page of paper but many printed sheets containing the postscript plot instructions and no image. Use the example to guide your printing process. These files are not standard postscript and require a special command line to get them to print.
% lpr -Plps euggr.pro-ps
Now process the coordinate files. Yes, these structures have been modelled, but just how much of the sequence has been modelled? To answer this question use the search utility to look at the various coordinate files for the number of alpha carbons they have. There is only one alpha carbon per amino acid residue. Use the command below to do this. Record your results here and on the next page.
% grep CA def.swiss-pdb | wc -l
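The result is a single number, the count of lines in the file that contain the search term. Note that grep matches CA anywhere on a line, so a header line such as EXPDTA THEORETICAL MODEL is counted as well, and the number may be slightly higher than the number of residues.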
euggr's length 102 resulting model's length : ____________________________ def's length 36 resulting model's length : ____________________________ mi1a's length 92 resulting model's length : ____________________________ id3's length 119 resulting model's length : ____________________________ sttn's length 463 resulting model's length : ____________________________
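The grep count includes every line that contains the letters CA, and the word THEORETICAL on the EXPDTA line is one such case. A stricter count looks only at ATOM lines and only at the atom-name field. The Python sketch below does that, assuming the coordinate lines follow the fixed-column PDB layout in which the atom name occupies columns 13 to 16. The function name is invented for this example.

```python
def count_alpha_carbons(pdb_text):
    """Count ATOM records whose atom-name field (columns 13-16) is CA."""
    count = 0
    for line in pdb_text.splitlines():
        # line[12:16] is the atom-name field of a fixed-column PDB record.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            count += 1
    return count


if __name__ == "__main__":
    import sys

    # Usage (file name is whatever you extracted): python count_ca.py def.swiss-pdb
    with open(sys.argv[1]) as handle:
        print(count_alpha_carbons(handle.read()))
```

If this count and the grep count differ by one or two, the difference most likely comes from header lines rather than from the model itself.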
If you look closely at one of the coordinate files you will find that the x, y and z coordinate values are given for all the atoms in the structure. Standard PDB formatted files have a fixed format for the way these coordinate lines are entered. There is nothing wrong with the format of the lines, only the order in which they appear. The standard format requires that the backbone atoms coordinates be given first and then the rest of the amino acid residue. This is not the case with these files. There is a program that can fix that problem. It is called pdb_fix and is located on the modelling platform. To use it you will have to ftp your files over to that platform and run the program on it there.
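To make the ordering problem concrete, here is a Python sketch of the kind of reordering described above: within each residue the backbone atoms N, CA, C and O are moved to the front, and the remaining atoms keep their original relative order. This is not the pdb_fix program and makes no claim about how pdb_fix works internally. It assumes fixed-column ATOM lines, that all atoms of a residue are listed together, and it does not renumber the atom serial numbers. All names in it are invented for illustration.

```python
# Backbone atoms in the order they should appear within a residue.
BACKBONE_RANK = {"N": 0, "CA": 1, "C": 2, "O": 3}


def _rank(line):
    """Sort key: backbone atoms first, every other atom after them."""
    return BACKBONE_RANK.get(line[12:16].strip(), len(BACKBONE_RANK))


def backbone_first(atom_lines):
    """Return ATOM lines with backbone atoms first inside each residue.

    Residues are recognised by chain (column 22) plus residue number and
    insertion code (columns 23-27). sorted() is stable, so side-chain
    atoms keep the order they had in the input."""
    ordered = []
    residue = []
    current = None
    for line in atom_lines:
        key = (line[21:22], line[22:27])
        if residue and key != current:
            ordered.extend(sorted(residue, key=_rank))
            residue = []
        current = key
        residue.append(line)
    ordered.extend(sorted(residue, key=_rank))
    return ordered
```

For the exercise itself, use pdb_fix as instructed. The sketch is only meant to show what "backbone first, then the rest of the residue" means for the order of the lines.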
% model1
Now you need to ftp the data from your ribozyme account to here. Follow the example given below, replacing expxx with your actual account name. In this example you log into your ribozyme account, move over to the fourteen subdirectory, ensure that the data transfer will be in text format and then get the desired data files over to your working space on model1. Replace the xxxx in the example with the name of one of the files you want to bring over.
$ ftp ribozyme.vadms.wsu.edu
model1.vadms.wsu.edu MultiNet FTP user process 3.4(111)
Connection opened (Assuming 8-bit connections)
<ribozyme.vadms.wsu.edu FTP server ready.
RIBOZYME.VADMS.WSU.EDU>l expxx <rtn>
<Password required for expxx.
Password: [enter your password]
<User expxx logged in.
RIBOZYME.VADMS.WSU.EDU>cd fourteen <rtn>
<CWD command successful.
RIBOZYME.VADMS.WSU.EDU>type ascii <rtn>
Type: Ascii (Non-Print), Structure: File, Mode: Stream
RIBOZYME.VADMS.WSU.EDU>get xxxx <rtn>
To remote file: <rtn>
<Opening ASCII mode data connection for 'xxxx'.
<Transfer complete.
Now repeat the get portion of this example until you have moved all the files you want to work with.
RIBOZYME.VADMS.WSU.EDU>quit <rtn>
<Goodbye.
$
Run pdb_fix on all the coordinate files you have. An example of its operation is given below. Enter terms in bold type. xxxx represents the name of the file being used.
$ pdb_fix
Program PDB_fix
This program converts non-standard PDB files into ones that will work in MacroModel
Enter name of file to work with: xxxx.swiss-pdb <rtn>
Enter name of output file created: xxxx.swiss-fix <rtn>
$
These files next need to be run through the program BFILER, which expects two particular lines at the end of any file it works on. Add these two lines to the bottom of each fixed file before going on to the next step. A file containing the necessary lines has been created and is called ending. Use the append function to add it to each of the swiss-fix files you have.
$ append ending xxxx.swiss-fix
With the files properly modified, run them through BFILER. An example of this program's operation is given below. User input is shown in bold type. You can enter more than one data file into this program at a pass. You could get all your data files converted in just one running of the program.
$ bfiler
BFiler (v 0.2)
19-OCT-95 11:02:22
BFiler: SELECT A MENU ITEM FROM BELOW--
HELP=Information
TAPE=Read Brookhaven format files Brookhaven tape and
translate to MMOD format,
COPY=Copy files from Brookhaven tape to disk
without translation
DISK=Translate Brookhaven format files to MMOD files
BARE=Translate Bare Brookhaven atom table (from file(s)
disk) to MMOD format file(s)
EXIT=Exit BFiler
BFiler>disk <rtn>
BFILER-DISK:This routine attempts to translate Brookhaven
format files which are on a disk
BFILER-DISK:Continue?(y)> <rtn>
Default suffix is ".BRK"
Type in the names of the files you want to process,
Hit return after each code name and
a bare "." to finish>
xxxx.swiss-fix <rtn>
At this point in the program, you can enter all the files you want to have converted. They are entered one name to a line. When you are finished, tell the program so by entering a period on a line by itself and pressing the ENTER key, as on the next line of the example.
. <rtn>
Below is a list of names for files you want to translate --
Options: (1) type in corrected entry;
(2) type "i" to insert an entry,
(3) type "x" to delete entry,
(4) type "." to finish,
(5) hit return to verify entry:
XXXX.SWISS-FIX <rtn>
Go back and re-edit the filecodes?(n)> <rtn>
Looking for file XXXX.SWISS-FIX
The program will process each of the data files you told it to work on. For each one it will report a warning about invalid text in the file. This does not affect the operation of the program.
Reading XXXX.SWISS-FIX
WARNING - invalid text in this file
EXPDTA THEORETICAL MODEL (Secondary)>
Reading atomic coordinates...
Typing atoms...
Creating bond entries...
BFiler: SELECT A MENU ITEM FROM BELOW--
HELP=Information
TAPE=Read Brookhaven format files Brookhaven tape and
translate to MMOD format,
COPY=Copy files from Brookhaven tape to disk
without translation
DISK=Translate Brookhaven format files to MMOD files
BARE=Translate Bare Brookhaven atom table (from file(s)
disk) to MMOD format file(s)
EXIT=Exit BFiler
BFiler>exit <rtn>
$
Your structures are now ready to view in MacroModel. They all have the extension bdt.
13) Looking at the converted structures.
To visualize your results, get into MacroModel. The data files you have created are now usable within the program.
1) Activate MacroModel.
2) Select ANALYZ and then A Lab. Some of the structures you will read in are large, and labels will only cause confusion.
3) Select READ and enter the name of one of the coordinate files you have converted. These should be named euggr.bdt, def.bdt, mi1a.bdt, id3.bdt, or sttn.bdt. Respond to the prompt for a structure number by pressing the ENTER key. After the first file has been read in, you will need to delete that structure in order to examine the next one clearly, so respond to the deletion prompt with y.
4) Select Sets and MainS to read the protein backbone into memory. Select Set1, respond with d to deposit the backbone into a storage spot. Select DISPLA, Set1 again and respond with r for retrieve. Select Dis to display only the backbone atoms on the screen.
5) Write this data to a file in order to have it for a comparison later on. Select WRITE, save only the displayed fragment, and give the file a name similar to this: xxxxx-back. Replace the xxxxx with a name reflecting the protein being worked with. You could use euggr, def, mi1a, id3, and sttn to keep the naming consistent with the rest of the exercise.
6) Select READ, and enter one of the following names, euggr-ref, def-ref, mi1a-ref, id3-ref, sttn-ref, depending on which protein was originally read in at step 3. Delete the current image on the screen. The reference structure only contains the backbone atoms. It is colored white.
7) Select READ again and enter the name of the backbone file you just created. Keep the reference file. The two images on the screen are similar to one another, but not identical. If the modelling process was a simple overlaying of the sequence on a known set of coordinates, the two backbones would be identical.
8) Select GEOMTR and then SuprA. Pick three pairs of comparison points, selecting the point on the white (reference) structure first. When the three pairs have been chosen, select RigSp and plot the results of the superposition. Select Scale and increase the size of the image to fill the screen.
9) Select WRITE. Save your comparison data to a file. Give it a name similar to expxx-euggr.comp. Examine the results of this comparison and enter your observations below and on the next page in the space provided.
euggr observations: ______________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
def observations: ________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
mi1a observations: _______________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
id3 observations: ________________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
sttn observations: _______________________________________________________
_____________________________________________________________________________
_____________________________________________________________________________
10) Repeat this process, going back to step 3, until you have created a comparison file for each of the coordinate files you generated.
11) Select STOP and exit the program.
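The backbone comparison made in the steps above can also be expressed as a single number, the root-mean-square deviation (RMSD) between matched backbone atoms. The sketch below is not MacroModel's method. It assumes PDB-format input rather than the converted bdt files, takes N, CA, C and O as the backbone atoms (the MainS set in MacroModel may be defined differently), and assumes the two structures are already superimposed and list their backbone atoms in the same order. The function names are chosen for illustration only.

```python
import math

# Assumed definition of the backbone (main-chain) atoms.
BACKBONE_NAMES = {"N", "CA", "C", "O"}


def backbone_coords(pdb_text):
    """Return (x, y, z) tuples for backbone atoms in PDB-format text."""
    coords = []
    for line in pdb_text.splitlines():
        # In PDB ATOM records the atom name is in columns 13-16 and the
        # x, y, z coordinates are in columns 31-38, 39-46 and 47-54.
        if line.startswith("ATOM") and line[12:16].strip() in BACKBONE_NAMES:
            x = float(line[30:38])
            y = float(line[38:46])
            z = float(line[46:54])
            coords.append((x, y, z))
    return coords


def rmsd(model, reference):
    """RMSD between two equal-length lists of already superimposed points."""
    if not model or len(model) != len(reference):
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(model, reference):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(model))
```

An RMSD of zero would mean the two backbones are identical, which is what a simple overlaying of the sequence on a known set of coordinates would give. Larger values reflect the differences you see between the white reference and the modelled backbone.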
To get back to ribozyme for the final part of this week's exercise, enter the command logout.
$ logout
14) Working with images.
Rename the following two Molscript files, week14-1.images and week14-2.images, using the command lines given below and then print them off on the lab printer.
% mv week14-1.images (your lastname)-w14-1.images
% mv week14-2.images (your lastname)-w14-2.images
% lpr (your lastname)-w14-1.images
% lpr (your lastname)-w14-2.images
The images in these two Molscript files are comparisons of the modelled structures with their possible reference ones. Add these pages to your molecular images collection.
15) Finishing up.
Rename the report form to your last name, go into the file using the pico editor and fill it out. Rcp the form to the teacher account.
% mv wee14.week14 (your lastname).week14
% pico (your lastname).week14
% rcp (your lastname).week14 teacher@ribozyme:receive
This concludes your computing session for this week. Log off ribozyme, get out of the emulator and back to the overlapping windows screen.
% logout
Press the alt and x keys together. This will cause the screen to ask if you really want to exit the program. Respond with y to get out of the teemtalk emulator and return to the overlapping windows screen.
Peitsch, M.C. and Jongeneel, C.V. (1993) A 3-dimensional model for the CD40 ligand reveals a close similarity to the tumor necrosis factors. Int. Immunol. 5: 233.