Learning about the concept of databases and how they are organized. Discovering the databases in the VADMS system and learning how to gather information from them. Exploring the nets for more biocomputing databases.
Author:
Susan Jean Johns
With the growing need to distribute information, one solution is the creation of databases containing information on a given subject. Since the need for various types of information changes with subject matter, many types of databases now exist. Each one fills a specific need for a specific audience.
Some databases are composed of only textual information such as that produced on a typewriter. These databases are known as ascii databases. Other databases are not textual, but binary in nature. These databases require special software to get at their information. Some databases are simply a long file containing information on a given subject. Others are a collection of files containing a certain type of data collected in a given format. Some data are stored in ascii form, others are not. The VADMS computing resource contains all these types of databases.
Simple ascii databases consist of one or more files, all in ascii or text format. This means that a user can read them directly from the component parts of the database. Depending on how the given database is organized, a user accesses one file or a number of them.
Some databases are just large versions of an ascii database. Information is collected into a small number of files which are in ascii format. Pointer files which bring all the database parts together are stored in a binary format. The size of the information base being worked with requires that software be used to locate the related parts of the database and produce the data in a complete form.
The nucleotide and peptide sequence databases are in this category. VADMS supports the following sequence databases: EMBL, GenBank, Pir, NRL_3D and SwissProtein. In the VADMS computing resource, these databases are stored in GCG format. This means that each raw database consists of seven files. Each database member has the same filename and a different extension denoting its role in the database structure.
The Cambridge Structural Database (CSD) is a non-ascii database. It is composed of the data generated on small molecules by x-ray crystallographic techniques and is stored in binary fashion. In order to access this data, special CSD software is necessary. There are over 150,000 entries in this database and very complex forms of analysis can be accomplished through use of the supplied software.
In general, the way to use this database is to create a file that is used as data input for a program called Quest. The output from this first search can be a listing of hits or the creation of a mini database on which additional analyses can be run.
By working with known data, additional information can be gathered about an unknown sequence by comparing possible regions of similarity. Similar regions could indicate similar functions or characteristics between the known and unknown sequences. This can lead to hypotheses about the unknown sequence that can be tested out in the wet lab. Using computers cuts down on the amount of time required to characterize a new sequence.
Known structural data can be used to provide a starting point to create models of similar materials. It can also be used to compare protein folding types or provide starting materials for docking studies. Studying existing data can provide information on spatial organization of molecules that can help model structures which have not been crystallized.
In order to increase the relevance of the class to a wider cross-section of students, a number of molecules have been selected to appeal to individuals working in diverse areas. These molecules are to be used when doing the remaining exercises in the course. Each entry has had at least one representative structure and a genomic sequence solved. Select the molecule that most closely fits the general type of work that you are doing in the wet lab or plan to use in your project, and continue to use it throughout the rest of the semester:
1) higher plant ribulose bisphosphate carboxylase/oxygenase, small subunit only
2) mammalian P21 ras proto-oncogene transforming protein
3) mammalian basic fibroblast growth factor
4) fungal superoxide dismutase
Week 3 Exercise
This series of exercises will acquaint you with searching various databases for information. These databases can either be local or located somewhere on the Internet. Items in these instructions which appear in bold should be entered followed by pressing the RETURN key.
1) Activate the computer
Pressing any key changes the terminal from screen saver mode to active.
2) Select the RIBOZYME icon
From the launcher bar, select the RIBOZYME icon by moving the cursor arrow with the mouse over to the RIBOZYME icon and pressing the left mouse button twice.
Successful connection to ribozyme is denoted by the appearance of a ribozyme information line and a login: prompt.
IRIX (ribozyme) login:
Once the login: prompt appears, log on to the machine by entering first your account name at the login: prompt, and then your password at the Password: prompt. If necessary, refer back to the exercise for week 2 on page 7 for your password.
Now that you are on ribozyme, it is time to explore local databases. To do this, you will go through a number of steps designed to give you insight into how these databases work.
4) Create a subdirectory to keep this week's work in.
To keep data in separate working areas, it is necessary to create subdirectories. This is done with the mkdir command as you know from your computing experiences last week. Create the following subdirectory in your account.
% mkdir week3
Now move into that location using the following command line.
% cd week3
From now on in the course you will create a new subdirectory to house the work required for each week's exercise series. This is to help keep the generated files separated and associated with the week in which they were generated. It will also aid in removing these files when you no longer need them. To further help in this process, now copy over from the $GRAD_DIR/week3 location all the files you will need to do the exercise. This includes the report form for the week.
% cp $GRAD_DIR/week3/*.* .
Just in case you didn't receive an e-mail message from Dr. Sundvall last week, check again. Get into the pine program. If you have a message, press the RETURN key to get to the INBOX screen. Pressing the RETURN key again will produce a listing of the messages waiting for you. Use the arrow keys to select the message you want to read. When the message line is highlighted, pressing the RETURN key will cause it to be displayed on the screen. After you have read the message, press the e key. This will export the mail message to an external file. You will be prompted for a name for the file. Call the file snake.answer. Then press the q key to exit the program and respond to the Really quit pine? prompt with y.
% pine
Use pico to remove the mail header information from the snake.answer file.
% pico snake.answer
Then get back into pine and send e-mail to the teacher account about the arrival of the snake sequence from Dr. Sundvall. Use the information at the end of the week 2 exercise to complete this part (pages 13-15).
6) Working with simple ascii databases.
There are various types of ascii databases. The simplest of these consist of a single ascii file looked at by using the UNIX search utility, grep. Unfortunately, VADMS no longer supports any simple ascii databases. However, the following section using a single text file will give you an idea of what this type of database searching is like.
To explore such a database, let's work with the file, ccc.mol_wt. You have already copied this file over to your current location in your account.
Look at the contents of this file with the cat command. It is the output file from an old program that generated molecular weights of selected sequences and will simulate a searchable simple ascii database.
% cat ccc.mol_wt
This file contains a listing of cytochrome c results. To ensure that we get only cytochrome c hits in our search, it is necessary to enclose our search term within quotes and to include a space after the final c in the search term. Use the following command line as a model for your search.
% grep "Cytochrome c " ccc.mol_wt
The following type of output appears on your screen.
P1;CCCH - Cytochrome c - Chicken . . .
P1;CCCK - Cytochrome c - Yeast (Issatchenkia orientalis) . . .
P1;CCCM - Cytochrome c - Arabian camel . . .
P1;CCCN - Cytochrome c - Sea-island cotton . . .
P1;CCCRCF - Cytochrome c - Crithidia fasciculata . . .
P1;CCCRCO - Cytochrome c - Crithidia oncopelti . . .
P1;CCCS - Cytochrome c - Castor bean . . .
P1;CCCZ - Cytochrome c - Chimpanzee (tentative sequence) . . .
To produce a file of this search, use the following variation of the previous command.
% grep "Cytochrome c " ccc.mol_wt > ccc.hits
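The effect of the trailing space in the search term can be seen in a small sketch. It is written for a Bourne-style shell (sh), so the syntax for variables would differ under csh. The sample file and the two non-matching entry lines in it are made up for illustration; the real ccc.mol_wt file will differ.

```shell
# Hypothetical sample lines; the c2 and c6 entries are invented for illustration.
workdir=$(mktemp -d)
cat > "$workdir/sample.mol_wt" <<'EOF'
P1;CCCH - Cytochrome c - Chicken
P1;CCCM - Cytochrome c - Arabian camel
P1;XXXX2 - Cytochrome c2 - made-up entry
P1;XXXX6 - Cytochrome c6 - made-up entry
EOF

# Without the trailing space, "Cytochrome c" also matches c2 and c6.
loose=$(grep -c "Cytochrome c" "$workdir/sample.mol_wt")

# With the trailing space, only the plain cytochrome c lines match.
strict=$(grep -c "Cytochrome c " "$workdir/sample.mol_wt")

echo "without space: $loose hits, with space: $strict hits"
rm -r "$workdir"
```

The quotes are what keep the space as part of the search term; without them the shell would treat the space as a separator between arguments.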
In your week 3 subdirectory, there is now a file called ccc.hits. Use this file and your newly acquired editing skills to put this listing into alphabetical order based on the source of the material. From the listing above, your edited file should start with the following line.
P1;CCCM - Cytochrome c - Arabian camel . . .
% pico ccc.hits
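Once you have edited the file by hand, the UNIX sort utility offers one way to check your ordering. The sketch below is an assumption about how such a check could be done, not part of the required exercise. It uses a Bourne-style shell (sh) and a shortened copy of the listing above, and it writes to a separate file so your own ccc.hits is not touched.

```shell
# Shortened sample of the hit listing, written to a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/ccc.hits" <<'EOF'
P1;CCCH - Cytochrome c - Chicken
P1;CCCK - Cytochrome c - Yeast (Issatchenkia orientalis)
P1;CCCM - Cytochrome c - Arabian camel
P1;CCCN - Cytochrome c - Sea-island cotton
P1;CCCS - Cytochrome c - Castor bean
EOF

# -t- splits each line at hyphens; -k3 sorts on everything from the
# third field to the end of the line, which is the source organism.
# LC_ALL=C gives plain byte-order comparison regardless of locale.
LC_ALL=C sort -t- -k3 "$workdir/ccc.hits" > "$workdir/ccc.sorted"

first=$(head -n 1 "$workdir/ccc.sorted")
cat "$workdir/ccc.sorted"
rm -r "$workdir"
```

Comparing the sorted output with your edited file, for example by eye or with diff, shows whether any line is out of place.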
Send this modified file to the teacher account. Everyone else in the course has a file with the same name in their accounts, so first copy the ccc.hits file to a new file named after yourself by entering a command line similar to that given below, where (your lastname) is your last name.
% cp ccc.hits (your lastname).hits
Now send off the file using the following command line. The (your lastname).hits represents the name of your file.
% rcp (your lastname).hits teacher@ribozyme:receive
The nucleotide and peptide databases are large ascii data files with binary indexes that point to the various parts of the database. These databases are stored in GCG format and are composed of seven data files.
Initial searches on these databases can be done using the GCG program STRINGSEARCH. It is important to remember in searches of these databases that STRINGSEARCH looks for exact character string matches. Not everyone uses the same naming conventions for things and it may take more than one try to locate what you are looking for. Try a number of searching terms before concluding that your data is not in the databases.
For a practice run, attempt to locate all the nucleotide sequences that come from the African clawed frog using the STRINGSEARCH program. Do a similar search on the SwissProtein peptide database. Use the instructions given below to carry out these tasks.
The GCG program STRINGSEARCH requires that you activate the software package by entering gcg. When you do this, a welcome message to the package appears on the screen. This is handy information that shows the latest release version numbers for the various sequence databases that VADMS supports. Activating the GCG software package only needs to be done once a computing session.
% gcg
Welcome to the WISCONSIN PACKAGE
Version 9.0-UNIX, December 1996
Installed on irix
Copyright 1982, 1983, 1984, 1985, 1986, 1987, 1989, 1991, 1992, 1994, 1996
Genetics Computer Group, Inc. All rights reserved.
Published research assisted by this software should cite:
Program Manual for the Wisconsin Package,
Version 9, December 1996, Genetics Computer Group,
575 Science Drive, Madison, Wisconsin, USA 53711
Databases available:
GenBank Release 98.0 (12/96)
EMBL (Abridged) Release 43.0 ( 5/95)
PIR-Protein Release 49.0 ( 6/96)
SWISS-PROT Release 33.0 ( 3/96)
NRL_3D Release 2x.0 ( 6/96)
PROSITE Release 12.2 ( 3/95)
Restriction Enzymes (REBASE) ( 6/95)
Help is available with the command % genhelp or by
calling (608) 231-5200 or sending e-mail to Help@GCG.Com
This process may seem to take a relatively long time. A large number of
assignments are being made so that you can easily use the package. Once this
is done, you only need to enter a program's name in order to run it. A copy of
the GCG manual is located inside the desk drawer of each carrel in the
lab.
To run the STRINGSEARCH program follow the example given below. In this example the term GenEMBL stands for the combined GenBank and EMBL nucleotide databases. With the recent explosive growth in the expressed sequence tags portion of these databases, GCG has removed these sequences from the GenEMBL definition to speed up database searches.
% stringsearch
StringSearch identifies sequences by searching for character patterns
such as "globin" or "human" in the sequence documentation.
STRINGSEARCH through what sequence(s) (* GenEMBL:* *) ? <rtn>
Do you want to search through:
A) definitions
B) complete sequence annotation
Please choose one (* A *): <rtn>
Search for what text patterns ? "african clawed frog" <rtn>
What should I call the output file (* genembl.strings *) ? frog_nuc.lis <rtn>
*** Gb_ov:S69724 ***
S69724 elongation factor 1 gamma {type 1} [Xenopus laevis=South African clawed f
rogs, oocytes, mRNA, 1441 nt]. 9/94 1,441bp
///////////////////////////// many lines of data///////////////////////////////
*** Gb_ov:Xelxlmyc2x ***
L11363 African clawed frog L-myc oncogene (xL-myc2) mRNA, complete cds. 6/93 1,0
57bp
Sequences searched: xxxxxx
Sequences with matches: x
Patterns sought: african clawed frog
Output file: frog_nuc.lis
Now repeat this process looking for african clawed frog sequences in the SwissProtein database using the instructions given below.
% stringsearch
StringSearch identifies sequences by searching for character patterns
such as "globin" or "human" in the sequence documentation.
STRINGSEARCH through what sequence(s) (* GenEMBL:* *) ? sw:* <rtn>
Do you want to search through:
A) definitions
B) complete sequence annotation
Please choose one (* A *): <rtn>
Search for what text patterns ? "african clawed frog" <rtn>
What should I call the output file (* genembl.strings *) ? frog_sw.lis <rtn>
///////////////////////////// many lines of data ///////////////////////////////
this will take a while since there are over 500 hits for this term in this database
*** Sw:Zo8i_Xenla ***
P18853 xenopus laevis (african clawed frog). oocyte zinc finger protein xlcof8.4
i (fragment). 2/94 145aa
Sequences searched: xxxxx
Sequences with matches: xxx
Patterns sought: african clawed frog
Output file: frog_sw.lis
Put these two files together and produce a mini dataset of all the current listings of sequence files that arise from this source. Such a listing can be used for further searching to find specific compounds of interest from African clawed frogs. Combine these two data files together using the following command line.
% cat frog_nuc.lis frog_sw.lis > frog_all.lis
Now use the UNIX grep function to search this mini database for the term histone. Record below in the space provided the access codes for three of these histone hits. The access codes are the terms immediately following the : at the beginning of the line. The database in which the sequence was found is given before the : and the access code for the sequence after it. Further to the right is the beginning of the title or definition line for the sequence. This line usually contains the accession number for the sequence, the source of the material, a short description of what it is and how long it is. If the information is longer than 80 characters you may not see the end of the line on your terminal. An example of this type of information line is given below.
% grep histone frog_all.lis
Sw:H1b_Xenla P06893 xenopus laevis (african clawed frog). histone h1b. 11/90 219aa
access code 1: __________________________ access code 2: _______________________
access code 3: __________________________
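The access codes can also be pulled out of the hit lines with standard UNIX tools. The following is a minimal sketch for a Bourne-style shell (sh). It assumes the lines in the listing file look like the example line above, and the sample file holds only two lines taken from this handout; your real frog_all.lis will be much longer and may contain other header lines.

```shell
# Two sample lines in the layout shown in this handout.
workdir=$(mktemp -d)
cat > "$workdir/frog_all.lis" <<'EOF'
Sw:H1b_Xenla P06893 xenopus laevis (african clawed frog). histone h1b. 11/90 219aa
Sw:Zo8i_Xenla P18853 xenopus laevis (african clawed frog). oocyte zinc finger protein xlcof8.4i (fragment). 2/94 145aa
EOF

# The first word of each matching line is "database:access code".
hits=$(grep histone "$workdir/frog_all.lis" | awk '{print $1}')

# Cutting at the colon leaves only the access code.
codes=$(printf '%s\n' "$hits" | cut -d: -f2)

echo "database and code: $hits"
echo "access code only:  $codes"
rm -r "$workdir"
```

Note that grep is case sensitive by default, so the search term must be typed in the same case as it appears in the file.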
The information stored in sequence databases needs pointers to allow you to extract the desired sequence information. A number of different terms are similar in name but different in purpose. A sequence's access code is composed of a group of 6 to 10 alphanumeric characters, depending on its database of origin. For well-established sequences with verified data, access codes are mainly letters. Unverified sequences have more numbers than letters in their access codes. In the PIR databases, questionable data has an * before its title or definition information.
At the time when a sequence is deposited into a database, it is given an accession number. For nucleotide data this number starts with a letter denoting one of the 14 major database collection areas followed by five numbers. Peptide sequences may have 2 letters at the beginning of their accession numbers followed by 4 numbers. Accession numbers do not change as the data is absorbed into different databases. Often the best way to search for newly published sequences is through their accession numbers because their final access codes may not have been determined before the paper went to press. Accession number searching is possible through GCG's STRINGSEARCH program as well. The accession number given at the time the sequence is deposited is known as the primary accession number. If the sequence was developed from work on earlier sequences, those numbers will also be given and they are known as secondary accession numbers.
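The nucleotide accession number layout described above, one letter followed by five digits, can be expressed as a pattern and tested with grep. This is only a sketch for a Bourne-style shell (sh): the function name is made up for illustration, and the pattern covers just the one-letter, five-digit form mentioned in the text.

```shell
# Hypothetical helper: succeeds if the argument is one letter followed
# by exactly five digits, the nucleotide form described in the text.
is_nuc_accession() {
    printf '%s\n' "$1" | grep -Eq '^[A-Za-z][0-9]{5}$'
}

# S69724 and L11363 appear as accession numbers in the frog search
# output; Xelxlmyc2x is an access code from the same output.
for candidate in S69724 L11363 Xelxlmyc2x; do
    if is_nuc_accession "$candidate"; then
        echo "$candidate: fits the accession number pattern"
    else
        echo "$candidate: does not fit the pattern"
    fi
done
```

A check like this can help you tell at a glance which kind of identifier a paper or listing is giving you.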
Unfortunately, each of the three major sources of databases uses its own slightly different form of data entry and access codes. GenBank uses a code with a maximum length of 10, EMBL has a maximum length of 8, the NBRF databases (PIR and NRL_3D) a length of 6 and the SwissProtein database a length of 10.
Now that you've found some files of interest, you can bring them into your account by using GCG's FETCH. One advantage of FETCH is that it can work from a file containing a listing of filenames. To show you just how this works, create such a file containing the access codes you recorded on the previous page. GCG programs can accept shortened names for the databases, such as sw for SwissProtein and p for PIR databases. Enter one access code per line. A good way to distinguish this type of file from other types is to give it the extension fil. Create such a fil file for yourself with pico that contains these access codes in it, one per line, using the example given below as a guide.
% pico frog-histone.fil
Enter your three lines of data into the file in a manner such as is given below.
sw:H1a_Xenla
sw:H1b_Xenla
sw:H1c1_Xenla
Exit from the editor and use your newly created file.
Most GCG programs can be run in a command line mode. This means that some programs that require simple input from the user can be given this information on the same line as the name of the program and the program automatically does whatever is needed. Fetch is one of these programs. It can also be run interactively, with the user giving all the necessary data when prompted for it. However, for practice, run fetch in automatic mode. Fetch creates the following type of filename when run this way. The database name becomes the extension and the access code for the sequence becomes the filename.
% fetch -in=@frog-histone.fil
The following type of information will be returned to your terminal screen.
FETCH copies GCG sequences or data files from the GCG database into your directory or displays them on your terminal screen.
h1a_xenla.sw
h1b_xenla.sw
h1c1_xenla.sw
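The naming rule FETCH follows, access code as filename and database name as extension, can be mimicked in a few lines. This sketch is written for a Bourne-style shell (sh). The function name is made up for illustration, and the lowercase conversion is an assumption based on the filenames shown above; the sketch does not call FETCH itself.

```shell
# Hypothetical helper: predict the filename for a "database:code" entry.
fetch_name() {
    db=$(printf '%s\n' "$1" | cut -d: -f1)      # part before the colon
    code=$(printf '%s\n' "$1" | cut -d: -f2)    # part after the colon
    # Access code first, database name as the extension, all lowercase.
    printf '%s.%s\n' "$code" "$db" | tr '[:upper:]' '[:lower:]'
}

for entry in sw:H1a_Xenla sw:H1b_Xenla sw:H1c1_Xenla; do
    fetch_name "$entry"
done
```

Knowing the names in advance makes it easier to find the new files with ls after FETCH has run.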
Look at these files and see how they are organized. The information at the beginning is known as header information and comes from the reference database file. The actual sequence data comes from the sequence database file. In GCG files there is a line between these two sections. It contains information used by GCG to check and make sure that data hasn't been corrupted in processing. In the example given below the xxxxx represents the access code for one of your files.
% cat xxxxx.sw
There are times when you just want to look at a sequence file and not have it take up space in your own account. You can do this with the version of fetch that outputs the information to the terminal screen instead of a file. On our system that version of fetch has the name typedata. To see what that process is like, enter the following command line. This example not only displays the output on the terminal screen but also uses an accession number to locate the desired sequence.
% typedata gb_ov:s69724
In looking at this example, you may have noticed the gb_ov term. GCG has given their own names to the various sub-divisions of the GenBank database. The term gb:* can be used to search all the GenBank sequences, or you can zero in on the desired sub-section of the database if you know the code.
GB_BA => gb_ba => bacterial sequences
GB_EST1 => gb_est1 => first part of the expressed sequence tag entries
GB_EST2 => gb_est2 => second part of the expressed sequence tag entries
GB_EST3 => gb_est3 => third part of the expressed sequence tag entries
GB_EST4 => gb_est4 => fourth part of the expressed sequence tag entries
GB_EST5 => gb_est5 => fifth part of the expressed sequence tag entries
GB_EST6 => gb_est6 => sixth part of the expressed sequence tag entries
GB_EST7 => gb_est7 => seventh part of the expressed sequence tag entries
GB_EST8 => gb_est8 => eighth part of the expressed sequence tag entries
GB_EST9 => gb_est9 => ninth part of the expressed sequence tag entries
GB_GSS => gb_gss => genome survey sequences
GB_HTG => gb_htg => high-throughput sequencing data
GB_IN => gb_in => invertebrate sequences
GB_OM => gb_om => other mammalian sequences
GB_OV => gb_ov => other vertebrate sequences
GB_PAT => gb_pat => patent sequences
GB_PH => gb_ph => phage sequences
GB_PL => gb_pl => plant sequences
GB_PR => gb_pr => primate sequences
GB_RO => gb_ro => rodent sequences
GB_ST => gb_st => structural RNA sequences
GB_STS => gb_sts => sequence tag site sequences
GB_SY => gb_sy => synthetic and chimeric sequences
GB_TAGS => points to the est and sts sequences
GB_UN => gb_un => unannotated sequences
GB_VI => gb_vi => viral sequences
By using STRINGSEARCH, information contained in the header portion of the data file can be searched. Two levels of this searching are available. The first is the fastest, but only looks at the definition information. Definitions in this case contain the name of the organism, name of material, the sequence length, and possibly the date. The definitions data for the GenBank, EMBL and SwissProtein databases also contain the primary accession number for the sequences. The second level is much slower; however, it looks through everything in the reference section to find the desired character string.
As practice, use the STRINGSEARCH program to look into the plant portion of GenBank, the gb_pl sub-division, for definition information on both maize and zein in the same sequence. Give the output file the name plant.look to match the example given below. User input is shown in bold type.
% stringsearch
StringSearch identifies sequences by searching for character patterns
such as "globin" or "human" in the sequence documentation.
STRINGSEARCH through what sequence(s) (* GenEMBL:* *) ? gb_pl:* <rtn>
Do you want to search through:
A) definitions
B) complete sequence annotation
Please choose one (* A *): <rtn>
Search for what text patterns ? maize,zein <rtn>
What should I call the output file (* gb_pl.strings *) ? plant.look <rtn>
///////////////////////////// many lines of data ///////////////////////////////
Sequences searched: xxxxx
Sequences with matches: xx
Patterns sought: maize zein
Output file: plant.look
Now use the output file from this search to locate additional hits with the term promoter in these sequences. Use the example given below as a guide.
% stringsearch
StringSearch identifies sequences by searching for character patterns
such as "globin" or "human" in the sequence documentation.
STRINGSEARCH through what sequence(s) (* GenEMBL:* *) ? @plant.look <rtn>
Do you want to search through:
A) definitions
B) complete sequence annotation
Please choose one (* A *): <rtn>
Search for what text patterns ? promoter <rtn>
What should I call the output file (* plant.strings *) ? plant.again <rtn>
/////////////////////////////a few lines of data ///////////////////////////////
Sequences searched: xx
Sequences with matches: x
Patterns sought: promoter
Output file: plant.again
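The idea of searching the results of an earlier search can be illustrated with grep on an ordinary text file. This is only an analogy, sketched for a Bourne-style shell (sh); it does not reproduce how STRINGSEARCH works internally, and the four definition lines are made up for illustration.

```shell
# Made-up definition lines standing in for a database listing.
workdir=$(mktemp -d)
cat > "$workdir/defs.txt" <<'EOF'
maize 19 kDa zein gene, promoter region
maize 22 kDa zein mRNA, complete cds
maize alcohol dehydrogenase gene
rice glutelin gene, promoter region
EOF

# Step 1: keep only lines that contain both maize and zein.
grep -i maize "$workdir/defs.txt" | grep -i zein > "$workdir/plant.look"

# Step 2: search the step 1 results, not the full file, for promoter.
grep -i promoter "$workdir/plant.look" > "$workdir/plant.again"

step1=$(wc -l < "$workdir/plant.look" | tr -d ' ')
step2=$(wc -l < "$workdir/plant.again" | tr -d ' ')
echo "maize and zein: $step1 lines; also promoter: $step2 line"
rm -r "$workdir"
```

Because the second search only reads the first output file, the rice promoter line is never considered, which is the same narrowing effect as giving STRINGSEARCH the @plant.look list.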
There are times when you need to conduct a search in a stepwise process. Start out with a very general search pattern and then do further searches on the results of that initial search to narrow down the data set until you have exactly what you are looking for. A case in point is to locate all the current protein sequences for cytochrome c in the SwissProtein database (sw:*). For your initial pass use the search term "cytochrome c" in the STRINGSEARCH program.
% stringsearch
StringSearch identifies sequences by searching for character patterns
such as "globin" or "human" in the sequence documentation.
STRINGSEARCH through what sequence(s) (* GenEMBL:* *) ? sw:* <rtn>
Do you want to search through:
A) definitions
B) complete sequence annotation
Please choose one (* A *): <rtn>
Search for what text patterns ? "cytochrome c" <rtn>
What should I call the output file (* sw.strings *) ? chrome.look <rtn>
///////////////////////////// many lines of data ///////////////////////////////
Sequences searched: xxxxx
Sequences with matches: xx
Patterns sought: "cytochrome c"
Output file: chrome.look
As the data scrolls by on the screen, it becomes obvious that your search has found a lot of other proteins besides the desired cytochrome c ones. There are cytochrome c reductases, cytochrome c-1's, flavocytochrome c proteins and many others. Use the more command to scroll through your results one screen's worth at a time and look very carefully at the actual cytochrome c protein entry lines. Record below the information needed to make the search term specific for the desired proteins.
% more chrome.look
search data: ________________________________________________________________
Repeat your search on this output file, this time getting more specific to narrow the data set down to only the desired cytochrome c proteins using the information you gathered above. Check your work to see that your approach works. If it doesn't, try again until you get the desired results.
% stringsearch
StringSearch identifies sequences by searching for character patterns
such as "globin" or "human" in the sequence documentation.
STRINGSEARCH through what sequence(s) (* GenEMBL:* *) ? @chrome.look <rtn>
Do you want to search through:
A) definitions
B) complete sequence annotation
Please choose one (* A *): <rtn>
Search for what text patterns ? (put in your own phrase) <rtn>
What should I call the output file (* sw.strings *) ? chrome2.look <rtn>
///////////////////////////// many lines of data ///////////////////////////////
Sequences searched: xxxxx
Sequences with matches: xx
Patterns sought: (own phrase)
Output file: chrome2.look
Record below just how effective your new search pattern was.
search effectiveness: __________________________________________________________
___________________________________________________________________________
___________________________________________________________________________
9) Exploring with a selected molecule.
Choose one of the four selected molecules from the Selected Molecule List on page 3 that best fits your interests and try your own searches. Create output files of your findings. After determining the extent of information available on this molecule in the local sequence databases, create a composite file of your findings. Name this file using local as the filename and looking as the extension.
10) An example of a non-ascii database
The Cambridge Structural Database is a non-ascii database. Special software must be used in order to get at the data contained within the database. There are two modes of searching the database. One is a batch-like mode and the other is a graphics method. The batch-like mode is the simplest to learn. It calls for the creation of a question file. To get this working, you must be able to edit effectively and use the appropriate Cambridge manual. While the graphics method allows you to select structures visually, it requires more experience with the software package to use it effectively than the batch mode does.
Here's an example of how to use this database. Assume that the structure of ferrocene is being sought. This means that there must be coordinates for the data, and that the name of the structure be ferrocene. A question file created using the editor to handle this job would look like this.
T1 *COORds.gt.0
T2 *XNAMe 'FERROCENE'
QUESTION T1.and.T2
The Cambridge Structural Database is located on ribozyme. Because of special set-up requirements, its use is restricted to modelling students. They will be exploring this database next week.
11) Working with databases on the nets.
There are times when the data you need is just not available locally. When this happens it is necessary to look for what you need on the nets. There are a number of locations that contain information of interest to molecular biology computer users. To explore this aspect of biocomputing, you will next use the gopher utility to get an x-ray data file.
To start this process enter the following command line.
% gopher
The following screen is then displayed on your terminal.
Internet Gopher Information Client v2.1.3
Home Gopher server: serval.net.wsu.edu
--> 1. About WSUinfo/
2. Student Information System/
3. WSU Campuses Information/
4. Desktop Resources/
5. Discussion Forums/
6. Library Resources/
7. Software Archives/
8. Gopher Tunnels/
9. News & Weather/
10. Internet Reference/
Press ? for Help, q to Quit Page: 1/1
At this screen press the v key. This loads a local set of bookmarks to use in the gopher utility. This list contains gopher sites with information of interest to molecular biology users.
Internet Gopher Information Client v2.1.3
Bookmarks
--> 1. Computational Biology (Welchlab - Johns Hopkins University)/
2. Brookhaven National Laboratory Protein Data Bank/
3. EMBnet BioInformation Resource EMBL (Germany)/
4. IUBio Biology Archive, Indiana University/
5. PIR Archive, University of Houston/
Press ? for Help, q to Quit, u to go up a menu Page:1/1
Use the arrow keys to move the pointer to option 2, Brookhaven National Laboratory Protein Data Bank. This is where x-ray data results are available. Once the pointer is on option 2, press the RETURN key.
Internet Gopher Information Client v2.1.3
Brookhaven National Laboratory Protein Data Bank
--> 1. Welcome to the Brookhaven PDB Gopher Hole!
2. An (almost) full text search of the PDB Bibliographic Headers <?>
3. Search by entry id only <?>
4. *NEW* Check the Status of a Pending Entry by ID, Tracking, Auth.. <?>
5. *NEW* Check the Status of Entries on "HOLD" by ID, Tracking, Au.. <?>
6. Raw access (Try the indexed searches instead)/
7. Important message for BNL INFORM users.
8. Documents/
9. Information about the PDB Mailing List (List Server)
10. Recent Announcements and Changes/
11. Recent PDB Newsletters/
12. Related Databases and the rest of Gopherville/
13. Software Available from PDB and friends/
14. Some hints for searching the Brookhaven PDB/
15. The PDB's Anonymous FTP /
Press ? for Help, q to Quit, u to go up a menu Page: 1/1
Notice that you can do text searching at this site with option 2; all you need is a term to search for. Try this out using the term crambin. Use the arrow keys to move the pointer to option 2 and press the RETURN key. The middle section of the screen is rewritten with a box to allow you to enter a search term.
Internet Gopher Information Client v2.1.3
Brookhaven National Laboratory Protein Data Bank
1. Welcome to the Brookhaven PDB Gopher Hole!
--> 2. An (almost) full text search of the PDB Bibliographic Headers <?>
3. Search by entry id only <?>
4. *NEW* Check the Status of a Pending Entry by ID, Tracking, Auth.. <?>
---------An (almost) full text search of the PDB Bibliographic Headers---------<
Words to search for
[Help: ^-] [Cancel: ^G]
------------------------------------------------------------------------------
13. Software Available from PDB and friends/
14. Some hints for searching the Brookhaven PDB/
15. The PDB's Anonymous FTP /
Press ? for Help, q to Quit, u to go up a menu Page: 1/1
Enter the term crambin in the highlighted portion of the box. All you
have to do is type in the term and press the RETURN key when you are
finished. In the bottom corner the screen shows you that the system is
searching and then the screen is cleared and a list of hits is displayed.
Internet Gopher Information Client v2.1.3
An (almost) full text search of the PDB Bibliographic Headers: crambin
--> 1. 1ccm : CRAMBIN (PRO 22/LEU 25) (NMR, 8 STRUCTURES)/
2. 1ccn : CRAMBIN (PRO 22/LEU 25) (NMR, MINIMIZED AVERAGE STRUCTURE)/
3. 1cbn : CRAMBIN/
4. 1cnr : CRAMBIN (PL FORM)/
5. 1crn : CRAMBIN/
6. 2plh : MOL_ID; MOLECULE: ALPHA-1-PUROTHIONIN; CHAIN: NULL; OTHE../
Press ? for Help, q to Quit, u to go up a menu Page: 1/1
To look at one of these entries, move the pointer to the desired line and press the RETURN key. If you were curious about the PL form of crambin, you would move the pointer down to option 4 and press RETURN. The screen would clear again, this time displaying a list of the information files available for that x-ray structure.
Internet Gopher Information Client v2.1.3
1cnr : CRAMBIN (PL FORM)
--> 1. 1cnr.biblio
2. 1cnr.full
3. 1cnr.gif <Picture>
Press ? for Help, q to Quit, u to go up a menu Page: 1/1
This is a standard listing of the types of information available at this site on any given x-ray structure. The biblio file contains the reference information on the structure. The full file is the complete x-ray structure file for the given material and the gif file is an image or picture file of the default orientation of that material. Normally, a user is interested in getting the full data set to a local computer. To do that move the pointer to option 2 and press RETURN. A screen similar to that given below is displayed.
1cnr.full (67k)                                                               1%
-------------------------------------------------------------------------------
HEADER    PLANT SEED PROTEIN                      15-JUL-93   1CNR      1CNR   2
COMPND    CRAMBIN (PL FORM)                                             1CNR   3
SOURCE    ABYSSINIAN CABBAGE (CRAMBE ABYSSINICA) SEED                   1CNR   4
AUTHOR    A.YAMANO,M.M.TEETER                                           1CNR   5
REVDAT   1   31-AUG-94 1CNR    0                                        1CNR   6
JRNL        AUTH   A.YAMANO,M.M.TEETER                                  1CNR   7
JRNL        TITL   CORRELATED DISORDER OF THE PURE PRO22(SLASH)LEU25    1CNR   8
JRNL        TITL 2 FORM OF CRAMBIN AT 150K REFINED TO 1.05 ANGSTROMS    1CNR   9
JRNL        TITL 3 RESOLUTION                                           1CNR  10
JRNL        REF    J.BIOL.CHEM.                  V. 269 13956 1994      1CNR  11
-------------------------------------------------------------------------------
[Help: ?] [Exit: u] [PageDown: Space]
Saving the file brings up a box, laid over the middle of the display, that asks for a file name:
1cnr.full (67k)                                                               1%
-------------------------------------------------------------------------------
HEADER    PLANT SEED PROTEIN                      15-JUL-93   1CNR      1CNR   2
COMPND    CRAMBIN (PL FORM)                                             1CNR   3
SOURCE    ABYSSINIAN CABBAGE (CRAMBE ABYSSINICA) SEED                   1CNR   4
-------------------------------------------------------------------------------
Save in file:                                                                  6
1cnr.full                                                                      7
[Help: ^-] [Cancel: ^G]                                                        8
-------------------------------------------------------------------------------
JRNL        TITL 2 FORM OF CRAMBIN AT 150K REFINED TO 1.05 ANGSTROMS    1CNR   9
JRNL        TITL 3 RESOLUTION                                           1CNR  10
JRNL        REF    J.BIOL.CHEM.                  V. 269 13956 1994      1CNR  11
-------------------------------------------------------------------------------
[Help: ?] [Exit: u] [PageDown: Space]
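Once a .full file is saved, its header records can be checked from the UNIX prompt without going back into gopher. The following is only a minimal sketch. It assumes a Bourne-style shell (sh) rather than whatever shell sits behind the % prompt, assumes the file was saved as 1cnr.full in the current directory, and assumes each record name starts at the beginning of its line, as in the screen shown above. The function name pdb_records is hypothetical.

```shell
# pdb_records TAG FILE: print every line of FILE that begins with TAG.
# Assumption: record names (HEADER, COMPND, SOURCE, ...) start in column 1.
pdb_records() {
    grep "^$1" "$2"
}

# Example: show the compound and source lines of the saved crambin entry.
# The guard keeps the sketch quiet if the file has not been saved yet.
if [ -f 1cnr.full ]; then
    pdb_records COMPND 1cnr.full
    pdb_records SOURCE 1cnr.full
fi
```

The same function works on the file you save for your selected molecule; only the file name changes.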
Now search this resource for information on your selected molecule. Use the preceding steps to guide you while you attempt to locate a PDB file for this molecule. When you find one, save it to your account.
With your data safely stored away, press the u key five times to return to the starting gopher screen. At this point press the q key to exit. The program will query you with Really quit (y/n) ? y . Just press RETURN to return to the ribozyme prompt.
Do a directory of your week3 subdirectory. Record below just how big the 1cnr.full file is plus the name and size of your selected molecule file.
% ls -la *.full
size of 1cnr.full file: ______________________________________________________
name of selected molecule file: _______________________________________________
size of selected molecule file: ________________________________________________
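If you would rather not pick the size out of the ls columns, the byte count can be printed directly with wc. This is a small sketch under assumptions: it is written for a Bourne-style shell (sh), it looks only at files ending in .full in the current directory, and the function name file_bytes is hypothetical.

```shell
# file_bytes FILE: print the size of FILE in bytes.
# tr removes the leading blanks that some versions of wc print.
file_bytes() {
    wc -c < "$1" | tr -d ' '
}

# Report every saved .full file in the current directory.
for f in *.full; do
    if [ -f "$f" ]; then
        echo "$f: $(file_bytes "$f") bytes"
    fi
done
```

The figure reported here is in bytes, so expect it to differ in form from the rounded size (67k) that the gopher client showed for 1cnr.full.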
Edit your local.looking file to reflect the results you obtained on your selected molecule at the PDB gopher site. Include the name of the x-ray file and information on the number of structure files there appear to be on this molecule.
% pico local.looking
This finishes the first part of the exercise that requires the use of the ribozyme computer. The next section uses software available on the clone. The exercises for this week expect you to become familiar with working on the clone as well as on ribozyme. Log off ribozyme now and return later to finish up the rest of the exercise series.
% logout
You are no longer on ribozyme, but still in the emulator program. Use the arrow to move to the File location of the control bar. This is a pull-down menu. From this listing select the QUIT option to exit the session and release the mouse button. This moves you back to the Launcher window screen.
12) Searching for information at NCBI with Entrez
NCBI's Network Entrez lets a Mac, Windows PC, or UNIX workstation with a direct Ethernet link to the Internet use a fast and powerful tool for searching sequence, reference, and related data. Entrez is an incredible `lookup' program consisting of five linked databases: a nonredundant protein sequence database, a nonredundant nucleotide sequence database, the PDB structural database from Brookhaven, a new genome linkage database, and those references from Medline that mention sequence data. Finding an entry in any one database automatically links you to the associated entries in the other databases. Furthermore, Entrez offers built-in `neighboring.' Finding an entry in any sequence database provides a list of all similar entries based on BLAST statistics. Finding a Medline entry neighbors all articles discussing similar concepts. Searches can be constructed using powerful nested Boolean logic and can be restricted to many different types of fields. For instance, one can sort to various taxonomic levels and search by journal, author, date, or even E.C. number, among many others.
To launch Entrez, double-click the ENTREZ icon in the Launcher window. This will activate the software, make connections with NCBI's dispatcher, which keeps the databases, and open a Query window. The default database to search is Medline; you may want to select one of the alternatives. Refer to your carrel drawer for an Entrez manual and tutorial pamphlet. Please play with the software until you get comfortable with the concepts of changing the Boolean logic of the search, linking databases, and neighboring entries. Change the Boolean operator from its default "and" to "or" by dragging a search word onto another. Select search words by clicking the entry in the lower window, then select Retrieve to see the finds. Link databases by selecting an entry after it has been found and by changing the target database selection near the bottom of the Document window. Neighbor by selecting any entry and choosing the Neighbor button in the Document window. Display an entry by double-clicking it. Use the NCBI tutorial as a guide to fully explore the program.
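The difference between "and" and "or" can be tried out on any local text file before working in the Query window. The sketch below only illustrates the Boolean idea; it says nothing about how Entrez itself is implemented. It assumes a Bourne-style shell (sh) and a plain text file with one title or description per line; the function names match_and and match_or are hypothetical.

```shell
# match_and A B FILE: print lines of FILE that contain both A and B.
# The second grep filters the output of the first, which gives "and".
match_and() {
    grep -i -F -e "$1" "$3" | grep -i -F -e "$2"
}

# match_or A B FILE: print lines of FILE that contain A or B (or both).
# Giving grep two patterns at once gives "or".
match_or() {
    grep -i -F -e "$1" -e "$2" "$3"
}
```

An "and" search can never return more lines than either term alone, while an "or" search returns at least as many as the larger of the two. The same holds when you switch the operator in Entrez, which is a quick check that a search is combined the way you intended.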
Once you understand how to use Entrez, find all the relevant entries for your "selected molecule." Try to find a cDNA GenBank entry, a genomic GenBank entry, an entry from a protein database, and an entry from Brookhaven's PDB. Do they match the selected molecule entries that you found with previous database text searching programs? Differences may be due to naming conventions used by NCBI, or you may be looking at completely different sequences. Try to figure out which is the case with your data. Entrez is usually not used as a data retrieval tool, since most of the sequence entries are stored locally on ribozyme and any data retrieved from NCBI would have to be reformatted. Rather, use it as a much more powerful alternative to GCG's STRINGSEARCH for finding the name or access code of sequences you may be interested in. The drawback is that it does not produce a list file that can be directly fed to GCG programs. Record your Entrez findings below. Note any differences in the data that you found.
____________________________________________________________________
____________________________________________________________________
____________________________________________________________________
____________________________________________________________________
____________________________________________________________________
____________________________________________________________________
13) Exploring a database on the web
Not all databases have to exist on the local system in order to be used. There are numerous databases that you can access on the web. For a listing of some of these and some experience in using them go through the following steps.
From the Launcher window select the NETSCAPE icon to get to the VADMS' web site. From the first page select Selected biotechnology hotlist. Select Bioinformatics Sites. From this page select Bioinformatics Databases & Searching to move to the proper section of this page so that you can select BioSCAN.
Once at the BioSCAN page, scroll down the page so you can select BioSCAN Online!. This is the part of the server that allows you to search the databases. Check out this page. It is the heart of this server's searching capabilities. Select Search for a Sequence to do a text search for information.
Now at the searching page you only need to enter a search phrase and click on the Submit Request button. Try the same search phrase that was successful for your cytochrome c search in section 8. Check out your results. Notice that the identification lines are different from what you saw before with the local copy of the SwissProtein database. Record below your impression of the search and its results.
search results: _________________________________________________________
Scroll down the output until you find some actual cytochrome c hits. Look carefully at these lines and devise a new search phrase that you can use to find only the proteins you want. Select the Back button from the browser menu bar and return to the previous page and try your search again with your revised phrase. How did these results compare with the first ones?
search 2 results: _______________________________________________________
There are advantages and disadvantages in doing database searching on the web. One advantage could be that their databases are more up-to-date than the local ones. Another is that there may be specific databases tailored to your exact area of interest.
But, as seen here, there are also disadvantages. You can't reuse one search result as the starting point for a second, more specific search over just that small section of the original database. To retrieve your results, you either have to print the results from the screen or find some means of getting the actual data over to your machine. This server has a Retrieve a Sequence option, but it only displays the desired sequence on the screen. From here you would have to use some sort of screen capture technique or view the source and save it to your local machine. There is also the question of the format of data acquired in this manner and of making it compatible with whatever sequence analysis system you are using. Most people don't have a sequence analysis package on their own computer, so transferring the data just gathered to another platform where it can be used is another concern.
Working on the web makes sense for very specific purposes where unique databases are required. Most of the time, though, it does not make sense when local databases are accessible.
14) Exploring possible project subjects in the databases (optional).
Now that you know how to poke around, check for information on possible project subjects. Use whatever data source you desire, local or Internet. A survey of what sort of data is available on a given subject can help you select a project or shape the direction a project can take. Use all the resources at your disposal. Doing some background work now can save headaches later.
15) Finishing up.
With your Entrez, web and possible project work finished, log back onto the ribozyme machine.
From the Launcher window, select the RIBOZYME icon and press the mouse button twice. Once the login: prompt appears, proceed to log on to the machine entering your account name and then your password.
Now move into the location where your data for week 3 is being stored.
% cd week3
Revise your local.looking file with the pico editor to reflect your Entrez findings. Once that is done, rename the file to reflect your last name and keep the looking extension. Copy over a report form for this exercise, rename it to have your last name, go into the file and use pico to fill in the report. Finally, send both files to the teacher account.
% pico local.looking
% cp local.looking (your lastname).looking
% cp week3.week3 (your lastname).week3
% pico (your lastname).week3
% rcp (your lastname).week3 teacher@ribozyme:receive
% rcp (your lastname).looking teacher@ribozyme:receive
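The two copy steps above can be wrapped in a small helper so that a missing file is caught before anything is sent. This is a sketch only. It assumes a Bourne-style shell (sh), assumes local.looking and week3.week3 are both in the current directory, and uses a hypothetical function name, prepare_report. It does not run pico or rcp for you.

```shell
# prepare_report LASTNAME: make the two renamed copies described above.
# Returns a non-zero status, with a message, if either source file is missing.
prepare_report() {
    name=$1
    for src in local.looking week3.week3; do
        if [ ! -f "$src" ]; then
            echo "missing $src" >&2
            return 1
        fi
    done
    cp local.looking "$name.looking" &&
    cp week3.week3 "$name.week3"
}
```

For example, prepare_report johns would leave johns.looking and johns.week3 in the directory. You would then fill in the report with pico and send both files with the rcp commands shown above.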
This concludes your computing session for this week. Log off the computer.
% logout
Now exit the emulator program by using the arrow to move to the File location of the control bar. This is a pull-down menu. From this listing select the QUIT option to exit the session and release the mouse button. This moves you back to the Launcher window screen.
Internet resources used:
VADMS Center Web site:
mosaic users: http://ribozyme.vadms.wsu.edu/~vadms
netscape users: http://www.vadms.wsu.edu
BioSCAN site:
http://genome.cs.unc.edu
Entrez:
While you used ENTREZ over the Internet, it requires that special software exist on your
local machine in order for it to work. Your local machine also has to have an IP address
(be on the campus backbone or some other network with access to the Internet) in order
to use the software. ENTREZ is available for DOS, Mac and UNIX platforms.