An introduction to using biocomputing to analyze and explore database sequences and the information contained therein.
Document prepared by: Susan Jean Johns
Report any errors or problems with this material to her.
Biochemistry is the study of the molecular basis of life. In this study, nucleotides and proteins play a vital role. Nucleotides are the molecules of heredity in living systems. DNA provides this function in prokaryotic and eukaryotic organisms, while either DNA or RNA can be the genetic material in viruses. Proteins are involved in catalyzing reactions, and transporting and storing small molecules and ions. They also contribute to coordinated motion, mechanical support, immune protection, and generation and transmission of nerve impulses and growth control.
Nucleotide Background
The basic structural units of nucleotides are deoxyribonucleotides for DNA and ribonucleotides for RNA. Each is composed of a base, a sugar and a phosphate group. The bases carry the genetic information, while the sugar and phosphate groups perform a structural role. The sugar in DNA is deoxyribose, while in RNA it is ribose. The bases are either a derivative of purine or pyrimidine. The backbones of these molecules consist of the sugars linked by phosphate groups.
DNA is capable of forming a double stranded helix structure in which the two chains are coiled around a common axis and run in opposite directions to one another. The bases are on the inside of the helical structure. The planes of the sugars are nearly at right angles to those of the bases. The two chains are held together by hydrogen bonding between the bases. Adenine is always paired with a thymine and guanine with a cytosine.
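The pairing rules above, together with the fact that the two chains run in opposite directions, are enough to derive one strand from the other. The following is a minimal Python sketch of that idea; the function name reverse_complement is chosen here for illustration and is not part of any package used in this exercise.

```python
# Watson-Crick pairing rules from the text: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the partner strand, read in its own 5'->3' direction.

    The strand is reversed because the two chains of the helix run in
    opposite directions to one another.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


if __name__ == "__main__":
    print(reverse_complement("TAAG"))  # prints CTTA
```

Applying the function twice returns the original strand, which is a quick check that the pairing table is symmetric.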
There are a number of possible helical conformations for DNA. The standard or normal one is known as the B form. Under physiologic conditions, DNA is almost entirely in the B form. The B form contains a major and minor groove in its structure and is a right-handed double helix.
The primary sequences of nucleotides have been determined for a number of years. This data is being collected in databases. Along with the actual primary sequence is stored feature information about any regions that encode for proteins or other features of note, such as intron sites. These databases provide reported sequences for comparison and analysis work.
Protein Background
The basic structural units of proteins are amino acids. There are twenty different amino acids from which all proteins are constructed. An amino acid has the structure NH2-CH(R)-COOH, where R represents a side chain. Each of these twenty amino acids varies in the size, shape, charge, hydrogen-bonding capacity and chemical reactivity of its respective side chain. The individual amino acids are linked to one another via peptide bonds, -CO-NH-. Proteins have been determined to be precisely defined amino acid sequences whose composition is specified by genes.
Knowledge of protein sequences is important for the following reasons. First, such sequence information is essential in determining the mechanism of biological reactions. Second, relating sequence information and three-dimensional structure provides the rules that govern protein folding. The sequence serves as the link between the genetic message in DNA and the three-dimensional structure that performs the protein's biological function. Third, alterations in sequence can produce abnormal function and disease. Finally, a protein's sequence can reveal information about its evolutionary history.
Proteins have a primary structure, the order of the individual amino acids that comprise the protein and the location, if any, of its disulfide bonds. They also have a secondary structure. This refers to the spatial arrangement of amino acid residues that are near one another in the primary sequence. Such arrangements can produce periodic structures due to steric interactions and result in such constructions as helixes, sheets and turns. The tertiary structure of a protein refers to the spatial relationship of amino acid residues that are far apart from one another in the primary sequence. When a protein is composed of more than one primary sequence, or chain, there are additional relationships between these chains or sub-units with one another known as quaternary structures.
Proteins therefore have two different types of available data: data based on their primary sequence, and data derived from determining their actual structural conformation through x-ray crystallography. Since the understanding of proteins is so vital to the understanding of the molecular basis of life, both of these types of data are being collected into databases for use by the molecular biology community. These databases provide reported sequences and structures for comparison and analysis work.
By understanding the information contained in a nucleotide's primary sequence, its relationship to other nucleotides can be determined. Regions of the sequence that encode for proteins can be discovered and studied. The information contained in a protein's primary sequence allows its relationship to other proteins, its hydrophilic and hydrophobic regions, and its possible secondary structure to be determined. Nucleotides and proteins can be modelled and visualized to help understand how they function.
Biocomputing Background
The growth in the use of computers and the development of nucleotide and protein databases paralleled one another. Biocomputing arose from the need to be able to analyze the growing amounts of data being generated on biological materials. As the amount of material grew, it was assembled into databases whose contents could best be accessed through the use of computers. Ever increasing amounts of data meant that sequences could be compared with one another, common characteristics located, and functional patterns established. The use of database information can reduce the amount of time needed to characterize an unknown nucleotide or protein sequence.
The most commonly used software package for doing sequence analysis on this campus is from the Genetics Computer Group, GCG. This suite of programs allows a wide variety of analysis techniques to be applied to any sequence to determine information on its characteristics and possible function.
Molecules Used
A number of nucleotide and protein sequences will be used in today's computer exercises. They all contain information on a family of proteins known as protein kinases. Protein kinases are enzymes that specifically add phosphates to target proteins. One group of these kinases adds the phosphate to a tyrosine in a target protein; its members are known as tyrosine protein kinases. Another group, which adds the phosphate to either a serine or a threonine of a target protein, is called the serine/threonine protein kinases. These two types of protein kinases are responsible for the vast majority of phosphorylation events in the cell. Protein kinases perform other functions besides just phosphorylation. They therefore can be very large and contain a number of functional patterns or motifs.
Sequence Analysis Techniques
In order to do sequence analysis, it is necessary to find a sequence to work with. Sequences can either be entered directly into the computer or located in a database. The number of nucleotide sequences contained in databases has grown substantially over the years. In 1982, release 3 of GenBank contained 606 sequences with 680,338 bases. Release 103 of GenBank (Oct 1997) contains 1,765,847 sequences with 1,160,300,687 bases. This explosion in information makes it important for individuals working with nucleotides to know how to access and use sequence databases.
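To put the two GenBank releases quoted above side by side, the growth can be worked out directly. The short Python sketch below uses only the four figures given in the text; the variable and function names are illustrative.

```python
# Figures quoted in the text for the two GenBank releases.
release_3 = {"sequences": 606, "bases": 680_338}                 # 1982
release_103 = {"sequences": 1_765_847, "bases": 1_160_300_687}   # Oct 1997


def fold_increase(old: int, new: int) -> float:
    """How many times larger the new figure is than the old one."""
    return new / old


seq_growth = fold_increase(release_3["sequences"], release_103["sequences"])
base_growth = fold_increase(release_3["bases"], release_103["bases"])

print(f"sequences: about {seq_growth:,.0f}-fold increase")
print(f"bases:     about {base_growth:,.0f}-fold increase")
```

By this arithmetic the number of sequences grew roughly 2,900-fold and the number of bases roughly 1,700-fold over the fifteen years between the two releases.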
To locate a sequence in a database without knowing the actual primary sequence, some information about the sequence is needed, such as the species it comes from or its function. Using this information, a listing of sequences that contain this general term can be generated to serve as a basis for finding the materials of interest.
Once a sequence is located, analysis software can be used to find patterns of importance within a sequence or related sequences. One such pattern might be a short length of a sequence or a restriction site. Analysis can be run to determine where areas that might encode for proteins are located. Once such an area has been found, the sequence can be translated from a nucleotide sequence into a protein sequence and the results checked with the information reported in the literature to find out if the protein really exists or not. To help in sequence characterizations, comparisons can be run on the derived protein sequences to find other sequences similar to the one of interest.
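The idea of locating a possible coding region and translating it can be sketched in a few lines of Python. This is a simplified illustration, not the method used by GCG: it assumes the standard genetic code, scans only the three forward reading frames, and treats any stretch from ATG to the next in-frame stop codon as a candidate. The names translate and find_orfs are made up for this example.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate DNA codon by codon, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unknown codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def find_orfs(dna: str, min_codons: int = 2):
    """Return (start index, protein) for ATG...stop stretches, forward frames only."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == "ATG":
                j = i
                # Walk forward in this frame until a stop codon or the end.
                while j + 3 <= len(dna) and CODON_TABLE.get(dna[j:j + 3]) != "*":
                    j += 3
                if j + 3 <= len(dna):  # an in-frame stop codon was found
                    protein = translate(dna[i:j])
                    if len(protein) >= min_codons:
                        orfs.append((i, protein))
                    i = j + 3
                    continue
            i += 3
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGGCCAAGTAACC"))  # prints [(2, 'MAK')]
```

A real analysis would also examine the complementary strand and apply a much larger minimum length, since short ATG-to-stop stretches occur by chance.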
A number of proteins have had their 3D conformations solved by X-ray techniques or other means. These structures have been placed in the PDB structural database. By studying these conformations it has been found that the members of a given protein family share a similar structure since they are performing the same functions. This form-following-function idea has led to the development of homology modeling. In this technique a model is generated for an unsolved protein based on its relationship to a solved structure.
The sequence analysis techniques mentioned earlier will be used to study protein kinase sequences. Follow along in the handout as the computer analysis takes place. There will be space available to enter observations and comments needed to answer the questions at the end of the handout.
Computer Background Information
To conduct the sequence analysis portion of today's laboratory session an automatic shell will be run. The exercise is self-paced. You control the rate at which information is presented on the screen by responding to prompts. Whenever the phrase "Press <Return> to continue" appears, finish reading or studying the section on the screen and then press the RETURN key on the keyboard to move on. When "Continue reading on in the exercise, when finished press <Return> to continue" appears, return to the handout, read the next section for more information, and then press the RETURN key to move on. When the highlighted term --More-- appears at the bottom of the screen, press the space bar and another screen's worth of data comes up on the terminal. The shell is set up to ask the user if a section should be repeated. At this query, respond with either y and press RETURN or n and press RETURN. Items that appear in italics in this handout are to be entered on the keyboard followed by pressing the RETURN key.
Running an analysis process
To start the analysis process, enter the following.
% analyze
There is an introductory screen in which the software package being used, GCG, is activated.
Notice the listing of databases that appears on the screen. The first two of these contain nucleotide sequences (GenBank and EMBL) and the next two protein sequences (PIR-PROTEIN and SWISS-PROT). PROSITE is a dictionary of functional patterns or motifs. REBASE is a listing of restriction enzyme patterns supplied with the software.
Record the version number of the GCG package being used for this exercise: ____________
Pattern recognition
The human mind can distinguish visual patterns with remarkable accuracy. However, when the desired pattern is a set of alphanumeric characters from within a larger set of such characters, a computer is a much more effective and reliable means of locating such a pattern. To give you an idea of how this works, in the following section you will search for the character pattern TAAG in a nucleotide sequence of 1000 bases shown on the screen, first manually and then using the computer. Press RETURN to see the sequence.
Record your impressions of the problems encountered doing manual pattern searching.
____________________________________________________________________ ____________________________________________________________________ ____________________________________________________________________ ____________________________________________________________________
In this section, the computer is used to search for the same pattern, first in the sequence you used for your manual search and then in a second one that is over 48,000 bases long. The results of each run are shown at the end of each search. You are actually running these searches in real time as you watch the screen.
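For readers curious about what such a search involves, the following Python sketch finds every occurrence of a fixed pattern such as TAAG in a sequence. It is an illustration only and is not the program run by the shell; the function name find_pattern and the short demo sequence are invented for this example.

```python
def find_pattern(sequence: str, pattern: str) -> list[int]:
    """Return the 1-based start position of every occurrence of pattern.

    Overlapping occurrences are reported as well, because the search
    resumes one character after each hit rather than after its end.
    """
    sequence, pattern = sequence.upper(), pattern.upper()
    positions = []
    start = sequence.find(pattern)
    while start != -1:
        positions.append(start + 1)  # convert to 1-based numbering
        start = sequence.find(pattern, start + 1)
    return positions


if __name__ == "__main__":
    demo = "GGTAAGCCTTAAGTAAGA"  # invented 18-base example
    print(find_pattern(demo, "TAAG"))  # prints [3, 10, 14]
```

The same function works unchanged on a sequence of 1000 or 48,000 bases; only the running time grows with the length of the sequence.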
Based on your experiences doing both a manual and a computer assisted pattern search, which method would you rather use? Record your observations below.
____________________________________________________________________ ____________________________________________________________________ ____________________________________________________________________ ____________________________________________________________________
Database Searching
The computer uses the same pattern searching techniques when looking for data of interest in a database. Due to time limitations, a series of database searches has already been run on the various databases looking for protein kinase sequences to study. The output files generated by these searches will be used throughout the exercise.
The GCG program STRINGSEARCH was used to search the local nucleotide databases (GenBank and EMBL) for the phrase "protein kinase". This resulted in a listing of 1086 data files containing this phrase. A similarly conducted database search on the local version of SwissProtein resulted in a listing of 688 data files. When the sequences from the structural database were searched, the result contained 21 data files.
Just because the desired phrase shows up in a data file doesn't mean that the file contains the information being sought. An example of this is given next. Not only are there protein kinases in the search listing, but also protein kinase inhibitors, both of which would have been located in the searches that were run previously. To check that this is true, a search was done on the original output file for the term inhibitors. Press the RETURN key to see the results.
Record the number of inhibitor files in the data list: __________________
Closely examining the results of our database search shows that they contain all sorts of protein kinases. Below is a small sample of some of the types of data files found. Depending on what aspect of protein kinases you wanted to explore, further STRINGSEARCH runs would be done on this data listing to find exactly what you wanted to work with.
Gb_ba:Asu68034 U68034 Anabaena sp. strain PCC 7120 histidine protein k
Gb_in:Celcapkcc1 M37114 C.elegans cAMP-dependent protein kinase catalyti
Gb_in:Dmpftaire X99512 D.melanogaster PFTAIRE mRNA for serine/threonine
Gb_in:Drodg1a1 M27113 D.melanogaster cGMP-dependent protein kinase (DG
Gb_in:S65712 S65712 calcium/calmodulin-dependent protein kinase II {
Gb_in:Tcu69958 U69958 Trypanosoma cruzi cdc2-related protein kinase (t
Gb_om:Btu09405 U09405 Bos taurus putative protein kinase C inhibitor m
Gb_om:Ssu72970 U72970 Sus scrofa calcium/calmodulin-dependent protein
Gb_pl:Athrlpka M84658 A.thaliana receptor-like protein kinase mRNA, co
Gb_pl:Psu11553 U11553 Pisum sativum putative protein kinase (PsPK3) mR
Gb_pl:Psu83281 U83281 Pisum sativum protein kinase homolog PsPK4 mRNA,
Gb_pr:Humncsrc M34469 Human membrane-associated tyrosine protein kinas
To see such a process, press RETURN and you will be running the STRINGSEARCH program on a subset of the original data limited to bacterial sequences from GenBank. This search is looking for histidine protein kinases.
Record the number of histidine protein kinases found: ____________
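The narrowing step can be pictured as a simple keyword filter over the lines of a listing. The Python sketch below is a stand-in for the idea, not a reimplementation of STRINGSEARCH; it uses four lines copied from the sample listing in this handout (truncated exactly as they appear there), and it assumes that the Gb_ba: prefix marks the bacterial division of GenBank. The function names are invented for this example.

```python
# Four entries copied from the sample listing, truncated as in the handout.
ENTRIES = [
    "Gb_ba:Asu68034 U68034 Anabaena sp. strain PCC 7120 histidine protein k",
    "Gb_in:Celcapkcc1 M37114 C.elegans cAMP-dependent protein kinase catalyti",
    "Gb_om:Btu09405 U09405 Bos taurus putative protein kinase C inhibitor m",
    "Gb_pr:Humncsrc M34469 Human membrane-associated tyrosine protein kinas",
]


def string_search(entries, term):
    """Keep the entries whose text contains term, ignoring case."""
    term = term.lower()
    return [entry for entry in entries if term in entry.lower()]


def in_division(entries, prefix):
    """Keep the entries from one database division, e.g. 'Gb_ba:' (assumed bacterial)."""
    return [entry for entry in entries if entry.startswith(prefix)]


if __name__ == "__main__":
    # Entries that mention inhibitors even though they matched the kinase search.
    for entry in string_search(ENTRIES, "inhibitor"):
        print(entry)
    # Histidine protein kinases among the bacterial entries only.
    for entry in string_search(in_division(ENTRIES, "Gb_ba:"), "histidine"):
        print(entry)
```

Because the filter looks only at the text of each line, it shares the weakness discussed in this section: an entry is found only if its description happens to use the term being searched for.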
Even this narrowing of the data to a specific area might not capture all the information in the databases on the desired subject. This is because the information has been collected over a period of time, and the keywords and terms used to describe a material evolve with time. If you are new to an area of research, you may not know all the relevant terms needed to extract the required data to work with.
Nucleotide sequences can be analyzed to see if they contain regions that encode for protein(s). Nucleotide sequence entries in the databases often include this information in the feature section of the reference part of the data, along with the sequence(s) of the protein(s) encoded. In the GenBank database, this information is in the FEATURES section of a data file. The name of the protein that was found is given along with its sequence. These encoded protein sequences start with the term /translation= and are fairly easy to spot.
To see the information recorded within a data file requires the use of the TYPEDATA version of the GCG program FETCH. To page through two actual protein kinase nucleotide sequences press RETURN.
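Because the encoded protein is introduced by the term /translation=, it can also be picked out of a data file automatically. The Python sketch below shows one way this might be done with a regular expression. The record text is a hypothetical, heavily abridged example written for this illustration, not a real GenBank entry, and extract_translations is an invented name.

```python
import re

# Hypothetical, heavily abridged GenBank-style text; not a real entry.
RECORD = '''\
FEATURES             Location/Qualifiers
     CDS             1..63
                     /product="hypothetical example kinase"
                     /translation="MGSNKSKPKD
                     ASQRRRSLEP"
ORIGIN
'''

# The protein sequence sits between the quotes and may wrap across lines.
TRANSLATION = re.compile(r'/translation="([^"]*)"')


def extract_translations(record_text: str) -> list[str]:
    """Return each /translation= protein with line breaks and spaces removed."""
    return [re.sub(r"\s+", "", hit) for hit in TRANSLATION.findall(record_text)]


if __name__ == "__main__":
    for protein in extract_translations(RECORD):
        print(len(protein), protein)
```

An entry that encodes several proteins has several /translation= terms, which is why the function returns a list.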
At times like this the best means of locating the desired data is to run database searches based on actual sequence rather than reference information. This may mean using an actual sequence to find all the other sequences that are similar to it in the databases. Due to the fact that a number of different nucleotide codons can encode for the same amino acid, database searches of this sort are usually done on protein sequences when at all possible. These searches are time consuming and are best done in batch mode on the computer.
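The point about several codons encoding the same amino acid can be made concrete by counting them. The Python sketch below assumes the standard genetic code; the names codons_for and possible_dna_count are invented for this illustration.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid (or * for stop) to the list of codons that encode it.
codons_for = defaultdict(list)
for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    codons_for[aa].append("".join(codon))


def possible_dna_count(protein: str) -> int:
    """Number of distinct DNA sequences that encode the given protein."""
    count = 1
    for aa in protein.upper():
        count *= len(codons_for[aa])
    return count


if __name__ == "__main__":
    print(codons_for["L"])            # the six leucine codons
    print(possible_dna_count("MAK"))  # prints 8 (1 x 4 x 2)
```

Even a three-residue peptide can be written eight different ways at the nucleotide level, so two genes for closely related proteins may look much less alike as DNA than as protein.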
Functional Pattern Searching
Proteins are analyzed in a manner similar to that for nucleotides. The programs used reflect the differences in the primary sequences of the two types of materials and the way the two materials are characterized. A series of four protein kinase protein sequences have been selected to work with for this portion of the exercise. They have been named pk1, pk2, pk3 and pk4 to make the reporting of your results easier.
Once sequence information is located, analysis software can be used to find patterns of importance within a sequence. One such pattern might be a short length of sequence, an enzymatic cleavage site, or a pattern relating to function. Proteins have been extensively studied with respect to correlating sequence patterns with function. A dictionary or database of these patterns has been established and is known as PROSITE.
There are a number of protein kinase patterns in the PROSITE database. The documentation on these patterns contains the following note: "If a protein analyzed includes the two protein kinase signatures, the probability of it being a protein kinase is close to 100%." Therefore, it is possible to check the four supposed protein kinase sequences taken from our initial database search and confirm whether they are indeed protein kinases or not. The GCG program that determines if a protein sequence contains any of the PROSITE functional patterns or motifs is called MOTIFS. The beginning of a motif is denoted in the output file by a line going all the way across the page. A true protein kinase motif begins with the phrase "Protein_Kinase_". Run each of the sequences through this program and record your results below. Press RETURN to start this process.
pk1:
number of motifs found: ________ number of protein kinase motifs ________
a protein kinase (yes/no) ________
pk2:
number of motifs found: ________ number of protein kinase motifs ________
a protein kinase (yes/no) ________
pk3:
number of motifs found: ________ number of protein kinase motifs ________
a protein kinase (yes/no) ________
pk4:
number of motifs found: ________ number of protein kinase motifs ________
a protein kinase (yes/no) ________
One of the terms to watch out for when doing database searches is putative. This means that the researcher who characterized the nucleotide or protein and submitted the data entry to a database believes that the reported function is true. It may not be. Likewise, the authors of the protein kinase motifs may not have developed motifs for all the various possible types of protein kinases. If the protein sequence being used does have two protein kinase pattern hits, it is almost certainly a protein kinase. If it has only one, it may be.
The protein kinase patterns being used are very complex, not just a simple set of a few fixed characters like the pattern in the manual search you did at the beginning with a nucleotide sequence. There are positions that may vary among a given set of amino acids, and other positions which can't be any of a given set of amino acids. The spacing between portions of the pattern also varies. All these variations would be difficult for a person to keep track of, but for a computer it is easy. In the text you are about to see, the individual patterns have been broken up into short lines so they fit on the screen. To see the motifs in question, press RETURN.
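The three kinds of variation just described (a set of allowed residues, a set of forbidden residues, and variable spacing) map directly onto regular expressions. The Python sketch below converts a pattern written in PROSITE notation into a regular expression. It is an illustration rather than the method used by MOTIFS, and the example pattern is made up for this handout; it is not copied from a PROSITE entry.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert PROSITE pattern notation to a Python regular expression.

    Notation handled: elements separated by '-', x for any residue,
    [ABC] for allowed residues, {ABC} for forbidden residues,
    (n) or (n,m) for repeats, and < / > for the sequence ends.
    """
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        element = element.strip()
        prefix = suffix = repeat = ""
        if element.startswith("<"):
            prefix, element = "^", element[1:]
        if element.endswith(">"):
            suffix, element = "$", element[:-1]
        if element.endswith(")"):
            # Split off a repeat count such as x(2) or x(5,8).
            element, _, count = element[:-1].partition("(")
            repeat = "{" + count + "}"
        if element == "x":
            core = "."                          # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"   # forbidden residues
        else:
            core = element                      # one letter or an [ABC] set
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


if __name__ == "__main__":
    # Made-up pattern for illustration only.
    regex = prosite_to_regex("[LIV]-G-{P}-G-x(2)-[ST]-x(5,8)-K.")
    print(regex)  # prints [LIV]G[^P]G.{2}[ST].{5,8}K
    print(re.search(regex, "MMLGSGAASAAAAAKQQ").start())  # prints 2
```

Once the pattern is in this form, scanning a protein for it is the same kind of operation as the TAAG search at the start of the exercise, only with a more flexible pattern.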
One way to see how similar sequences are is to align them to one another. An alignment has been made of some representative protein kinase sequences. A number of protein families are very highly conserved, but the protein kinases aren't. Their various motifs occur at slightly different locations in the sequences. The alignment to be shown is that for the region containing both the protein_kinase_atp and protein_kinase_st motifs. The actual motif regions are marked off from the rest of the sequence with a dash so you can spot them easily. Gaps in the sequences are denoted by periods. These are needed to make the alignment possible. To see the alignment press RETURN.
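Once two sequences are aligned, their similarity can be summarized as a percent identity. The Python sketch below is one simple way to compute it, assuming that gaps are written as periods as in the alignment shown on the screen and that columns containing a gap are left out of the count. Other definitions of percent identity exist. The function name and the two short rows are invented for this example.

```python
def percent_identity(row_a: str, row_b: str, gap: str = ".") -> float:
    """Percent of gap-free alignment columns in which the residues match."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = matches = 0
    for a, b in zip(row_a.upper(), row_b.upper()):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    # Invented rows: 7 gap-free columns, 5 of them identical.
    print(round(percent_identity("GKGSFG.V", "GEGAFGKV"), 1))  # prints 71.4
```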
Members of the protein kinase family have had their 3D structures solved and this data has been placed in the PDB database. Postscript images of two of these structures, one showing a tyrosine protein kinase and another showing a serine/threonine protein kinase, have been generated from PDB data files using the program Molscript. This data was processed in the following manner. The protein_kinase_atp region is colored in red, and the protein_kinase_tyr or st region is shown in black. Secondary structural elements are also displayed. Known helixes are shown as coiled ribbons and colored purple when they do not occur in a motif region. Established sheet sections are shown as pointed arrows and colored aqua when they do not occur in a motif region. The images are labeled so you can tell which one you are looking at.
There should be a blinking yellow cursor in the dark blue window in the upper left hand portion of the screen. Enter the following term at the % prompt.
% gs pk1.ps
This launches the program Ghostscript which will be used to view the two images. A white window appears on the right hand side of the terminal screen and the protein picture is drawn there. Look closely at the image and record below the following pieces of information for each image: the type of protein kinase being displayed, the location of the atp motif and the structural elements making it up, and finally the location of the black region of the structure.
pk1:
type of protein kinase: __________________________________________
location of atp motif in the structure: _______________________________
structural elements making up the atp motif: _________________________
location of the black region: _____________________________________
Move the cursor with the mouse over into the dark blue window and click. This moves the working window back to the dark blue one and changes the cursor to a solid yellow rectangle. Press the RETURN key and then enter the term quit at the GS> prompt.
GS> quit
To view the second image, enter the following and record the desired information below.
% gs pk2.ps
pk2:
type of protein kinase: __________________________________________
location of atp motif in the structure: _______________________________
structural elements making up the atp motif: _________________________
location of the black region: _____________________________________
Move the cursor with the mouse over into the dark blue window and click. This moves the working window back to the dark blue one and changes the cursor to a solid yellow rectangle. Press the RETURN key and then enter the term quit at the GS> prompt.
GS> quit
There is yet another way to visualize information from the PDB database. This is with the program Rasmol. Rasmol allows the displaying of PDB data via a number of different coloring schemes and image types. It also allows the users to move the structure on the screen, rotating it to get a better understanding of its 3D shape.
To play with the first molecule enter the following.
% rasmol pk1.rasmol
A new window appears on the right-hand side of the screen and a wire frame image of the structure is displayed. To change the image to something close to the previous Molscript ones, use the mouse to move the cursor to the Display pull down menu and select the Cartoons option. The structure is redrawn in white with helical and sheet representations added. To change the colors, use the Structure option from the Colours pull down menu. The structure now has yellow sheets and purple helixes.
To move the structure on the screen, move the cursor to the small square in the middle of the scroll bar at the bottom of the window. Press down the mouse button and slowly move the square in the desired direction. The molecule moves along with the scroll bar indicator. Play around with the structure. When you are finished looking at this molecule, select Exit from the File pull down menu.
Start this process over again with the second protein kinase molecule by entering the following.
% rasmol pk2.rasmol
Use the mouse to move the cursor to the Display pull down menu and select the Cartoons option. Follow this by using the Structure option from the Colours pull down menu. The structure now is colored as the previous one was. Use the instructions given previously to move it around. When you are finished, select Exit from the File pull down menu.
This concludes your exercise for today. Fill out the report form on the last page of the exercise and hand it in to your lab instructor. If you are interested in following up on the biocomputing techniques presented in this lab session, there is a course, BC/BP 378, offered fall semester, that will allow you to do that.
References:
Per J. Kraulis, "MOLSCRIPT: a program to produce both detailed and schematic plots of protein structures", Journal of Applied Crystallography (1991) vol 24, pp 946-950.
Roger A. Sayle and E. J. Milner-White, "RasMol: Biomolecular graphics for all", Trends in Biochemical Sciences 20(Sept):374-376, 1995.
Wisconsin Package Version 9.0, Genetics Computer Group (GCG), Madison, Wisc.

BC/BP 366 Computer Analysis
Name: ________________________________________________
Lab session: ___________________________________________
Answer the following questions in the space provided.
1) What were your observations about using the computer for doing pattern searches on nucleotide sequences?
2) What concerns would you have about a listing of sequences obtained from a database?
3) Structural motifs can be very complex _________, with positions having a
great deal of variability.
4) To visualize the structure of a protein you need data from what database?