next up previous contents
Next: 4. SAM-T99 Up: SAM Sequence Alignment and Previous: 2. New to this

Subsections

  
3. Quick overview

The Sequence Alignment and Modeling (SAM) suite of programs includes several tools for modeling, aligning, and discriminating related DNA, RNA, and protein sequences. Given a set of related sequences, the system can automatically train and use a linear HMM representing the family. The SAM-T99 method, available in SAM 3.0 and higher, is a more automated method for building and using HMMs. Readers may wish to read the T99 example before the current section. See Section 4.


  
Figure 1: A linear hidden Markov model and example alignment.
\begin{figure}\begin{center}
\parbox{0pt}{\psfig{figure=hmmexample.eps,width=0.9\textwidth}}
\end{center}\end{figure}

SAM uses a linear hidden Markov model (Figure 1) to represent biological sequences. The model is a linear sequence of nodes, each of which includes match (square), insert (diamond), and delete (circle) states. Each match state has a distribution over the appropriate alphabet indicating which characters are most likely. The chain of match states forms a model of the family, or of columns of a multiple alignment. Some sequences may not have characters in specific positions -- delete states enable them to skip through a node without `using up' any characters. Other sequences may have extra characters, which are modeled with the insert states. Insertions are thus used when a small number of sequences have positions not found in most other sequences, while delete states are used when a small number of sequences do not have a character in a position found in most other sequences. As with the match states, all transition probabilities (the chance of having a delete or moving to an insertion state) are local, enabling, for example, the system to strongly penalize sequences that delete conserved regions.

The primary programs include:

buildmodel
Create a new model from a family of sequences, or refine an existing model.
align2model
Create a multiple alignment of sequences to an existing model. The prettyalign program will make align2model output more readable.
hmmscore
Calculate the negative log-likelihood (NLL) scores for a file of sequences given a model, as well log-odds scores and E-values in the case of reverse null models. This program is used for discrimination experiments. Sequences that score better than (or worse than) a threshold can be saved, as can their alignments or multiple domain alignments.
modelfromalign
Use an existing multiple alignment to create an initial model. Such a model is usually then refined using buildmodel.
target99
A script that uses SAM to iteratively create a model from a single protein sequence and its close homologues.


  
Figure 2: Overview of SAM.
\begin{figure}\centerline{\psfig{figure=overview.eps,width=0.8\textwidth}}
\end{figure}

A basic flowchart for using SAM is shown in Figure 2.

As a simple example, consider the task of modeling the 10 tRNAs included in the file trna10.seq of the distribution. For this experiment, default program settings will be used: the many adjustable parameters are described Sections 6 and 12.

3.1 Building a model

To start, we need to create a model from the sequence file using buildmodel. This program always requires a name for the run: if the name is test, the system will create the model output file test.mod, which will include parameter settings, iteration statistics, and CPU usage, as well as the initial and final model.

Parameters to buildmodel are specified with hyphens. For this experiment, first we try the command:

buildmodel test -train trna10.seq -randseed 0
This specifies the run name and where the sequences for training can be found. Here, to hopefully make this example reproducible, a seed for the random number generator has also been specified.

Buildmodel then prints out a line on standard output such as:

-90.87  -85.85  -87.25   1.40  50 2 75 
This is a brief summary of the statistics provided in the output model file, and is discussed in more detail in Section 11.1. If no random seed had been specified, buildmodel would use the process number, and the statistics line would be different for multiple runs.

The run has generated a file called test.mod. This file contains various statistics, described later, as well as the final model. Statistics are printed to the file after each re-estimation so that the progress of a run can be readily checked.

3.2 Aligning sequences

To generate a multiple alignment, a command such as the following is used:

align2model trna10 -i test.mod -db trna10.seq
prettyalign trna10.a2m -l90 > trna10.pretty
This aligns each sequence to the model, places the alignment in the file trna10.a2m, and then cleans up the output and places it in the file trna10.pretty, with 90 characters per line. The alignment will look something like:

;  SAM: ../src/prettyalign v3.0 (October 8, 1999) compiled 11/26/99_20:33:58
;  (c) 1992-1999 Regents of the University of California, Santa Cruz
;
;          Sequence Alignment and Modeling Software System
;         http://www.cse.ucsc.edu/research/compbio/sam.html
;
; ------- Citations (SAM, SAM-T99, HMMs) ---------
; R. Hughey, A. Krogh, Hidden Markov models for sequence analysis:
;  Extension and analysis of the basic method, CABIOS 12:95-107, 1996.
; K. Karplus, C. Barrett, R. Hughey, Hidden Markov models for detecting
;  remote protein homologies, Bioinformatics 14(10):846-856, 1999.
; A. Krogh et al., Hidden Markov models in computational biology:
;  Applications to protein modeling, JMB 235:1501-1531, Feb 1994.
; -------------------------------
               10            20         30        40        50        60        70      
                |             |          |         |         |         |         |      
TRNA1  GGGGAUGUAGCUCAG-.UGG...U.AGAGCGCAUGCUUCGCAUGUAUGAGGCCCCGGGUUCGAUCCCCGGCAUCU--CCA
TRNA2  GCGGCCGUCGUCUAGU.CUGgauU.AGGACGCUGGCCUCCCAAGCCAGCAAUCCCGGGUUCGAAUCCCGGCGGCC--GCA
TRNA3  GGCCCUGUGGC-UAGC.UGG...UcAAAGCGCCUGUCUAGUAAACAGGAGAUCCUGGGUUCGAAUCCCAGCGGGGCCUCCA
TRNA4  GGGCGAAUAGUGUCAG.CGG...G.AGCACACCAGACUUGCAAUCUGGUAGGGA-GGGUUCGAGUCCCUCUUUGUCCACCA
TRNA5  GCCGGGAUAGCUCAGU.UGG...U.AGAGCAGAGGACUGAAAAUCCUCGUGUCACCAGUUCAAAUCUGGUU--CCUGGCA
TRNA6  GGGGCCUUAGCUCAGC.UGG...G.AGAGCGCCUGCUUUGCACGCAGGAGGUCAGCGGU-CGA-CCCGCUAGGCUCCACCA
TRNA7  GGGCACAUGGCGCAGU.UGG...U.AGCGCGCUUCCCUUGCAAGGAAGAGGUCAUCGGUUCGAUUCCGGUUGCGU--CCA
TRNA8  GGGCCCGUGGCCUAGU.CUGga.U.ACGGCACCGGCCUUCUAAGCCGGGGAUCGGGGGUUCAAAUCCCUCCGGGU--CCG
TRNA9  CGGCACGUAGCGCAGCcUGG...U.AGCGCACCGUCCUGGGGUUGCGGGGGUCGGAGGUUCAAAUCCUCUCGUGCCGACCA
TRNA10 UCCGUCGUAGUCUAGG.UGGu..U.AGGAUACUCGGCUUUCACCCGAGAGA-CCCGGGUUCAAGUCCCGGCGACGGAACCA

Here, hyphens indicate deletes while lower-case letters, and the corresponding periods (`.') indicate inserts. The column numbers refer to match states in the model, not to column numbers, so that insertions are disregarded in calculating these index points. It is important to remember that insertions are not aligned among them selves: the fact that two insertion characters are in the same printed column only means that they were generated by the same insertion state, not that they should be aligned.


  
Figure 3: Output of drawmodel. Match state histograms are in AGCU order.
\begin{figure}\begin{center}
\latex{\makebox[0pt][c]{\parbox{6.5in}{\psfig{figur...
...cse.ucsc.edu/pub/protein/sam1.01_doc.ps.Z}
for a copy.}
\end{center}\end{figure}

3.3 Examining models

Model structure can be quite interesting. The drawmodel program generates postscript drawings of models that include match-state histograms and transition line styles that correspond to their frequency of use. These drawings are most useful when derived from frequency counts, values that can be optionally included in the output file:

buildmodel test -train trna10.seq -randseed 0 -print_frequencies 1
The command drawmodel test.mod test.ps will run the drawmodel program, which scans the file finding a model and frequency count data. By selecting the frequency count data in `overall' mode, the postscript drawing in Figure 3 is generated.1 The histograms in the match states correspond to the columns of the multiple alignment above, the numbers in the diamond insert states correspond to the average number of insertions for each sequence that uses that state, and the node numbers are given in the circular delete states. Transitions that are not used by a significant number of the sequences are not drawn.

Drawmodel is explained in more detail in Section 10.6.1.

  
3.4 Scoring sequences

The scoring program, hmmscore, generates a file of the negative log-likelihood minus NULL model (NLL-NULL, or log-odds) scores for each sequence given the model. Let's see how the 10 tRNA sequences fit the model (test.mod):

hmmscore test -i test.mod -db trna10.seq -sw 2
The arguments are the name of the run, the model file, the database sequence file, and a specifier to use fully-local scoring similar to that of the Smith & Waterman algorithm (this is suggested for all scoring runs unless semi-local, domain, or global scoring is specifically required).

hmmscore produces the test.dist file of Figure 4.


  
Figure 4: Distance output from hmmscore
\begin{figure}\begin{showfile}
\%~~SAM:~../src/hmmscore~v3.0~(October~8,~1999)~c...
...0~~~~~~~~~~~~~76~~~~~~~-34.58~~~~~~-28.35~~~~~4.9e-12
\end{showfile}\end{figure}

The score file contains six columns. The first is the sequence identifier, followed by sequence length, the `NLL-NULL' score using a simple null model, the reverse sequence `NLL-NULL' and an E-value based on the size of the database and the reverse sequence score. The simple null model is just the product of the character probabilities in each sequence multiplied together. Thus, the NLL-NULL score show how much improvement modeling with an HMM gained versus modeling with just a single insertion state from the HMM. The reverse sequence null model is more expensive to calculate, but a much better indicator of sequence similarity to the model. The reverse sequence NLL-NULL score is the difference between the raw negative log-likelihood (NLL) sequence score from that of the raw NLL score of the reversed sequence. Thus, there for sequence null model matches exactly the length and composition of the sequence, but does not match the ordering of the residues.

Other options of hmmscore enable the selective output of sequences according to their NLL scores, NLL per base scores, or E-value. The hmmscore program enters an interactive mode when called with no command line arguments. See Section 10.2.


next up previous contents
Next: 4. SAM-T99 Up: SAM Sequence Alignment and Previous: 2. New to this
SAM
sam-info@cse.ucsc.edu
UCSC Computational Biology Group