README for stand-alone BLAST

                     (last updated 7/30/99)

 

 

 

This document provides information on stand-alone BLAST.  Topics covered are

setting up stand-alone BLAST, command-line options for stand-alone BLAST,

and a release history of the different versions.

 

 

In order for Standalone BLAST to operate, you will need a .ncbirc file in your home directory that contains the following lines:

 

[NCBI]

Data="path/data/"

 

Where "path/data/" is the path to the location of the Standalone BLAST

"data" subdirectory. For SACS users on socrates this is:

 

Data=/home/socr/c/lib/ncbiblast/data

 

Make sure that your .ncbirc file is either in the directory that you

call the Standalone BLAST program from or in your home directory.
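
The .ncbirc layout described above is small enough to generate and check with a short script. The following is a minimal sketch, not part of the BLAST distribution: the function names ncbirc_text and read_data_dir are hypothetical, and the data directory is simply passed in by the caller.

```python
import configparser


def ncbirc_text(data_dir: str) -> str:
    """Return the .ncbirc content for a given data directory."""
    return f"[NCBI]\nData={data_dir}\n"


def read_data_dir(ncbirc: str) -> str:
    """Parse .ncbirc text and return the Data entry of the [NCBI] section."""
    # Interpolation is disabled so that a '%' in a path is taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ncbirc)
    return parser["NCBI"]["Data"]
```

The returned text could then be written to a file named .ncbirc with the usual file APIs, for example pathlib.Path.home() / ".ncbirc".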

 

The blast databases are preformatted by SACS and the location is set by the system.

 

The names of the local BLAST databases are as follows:

 


alu

ecoli.aa

ecoli.nt

epd

est

est_human

est_mouse

est_others

genpept

gss

htg

mito

month.aa

month.na

nr

nrdb90

nt


owl

pataa

patnt

pdbaa

pdbnt

pir

sts

swissprot

tfdaa

tfdnt

vector

yeast.aa

yeast.nt


 

 

SACS will add any customized database to this set, or show users how to add their own databases to it.

 

You can also request a web interface for BLAST searching of your custom dataset.  Please contact SACS at sacs@cgl.ucsf.edu or 476-5379 for more information.

 

The SACS local BLAST interface is located at www.sacs.ucsf.edu/Resources/sequenceweb.html under the BLAST or SACSGCG links.

 

Your query sequence must be in FASTA format to use the standalone BLAST tools.  For example, the following sequence could be saved as test.txt:

 

>Test

AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC

TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA

TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC

ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG

CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA

GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC

AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG

AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT
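
Before running a search it can be useful to confirm that a query file really is FASTA. The following is a minimal sketch of a FASTA reader, assuming only the basic layout shown above (a header line starting with ">" followed by sequence lines); the function name read_fasta is hypothetical and the sketch does not check the sequence alphabet.

```python
from typing import Iterator, List, Optional, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between sequence lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        else:
            if header is None:
                raise ValueError("sequence data found before a '>' header line")
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Calling list(read_fasta(open("test.txt").read())) on the example above would give a single record named Test.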

 

To run the first search enter the following command from the UNIX

command line in your BLAST directory:

 

blastall -p blastn -d ecoli.nt -i test.txt -o test.out

 

This should generate a results file called test.out in the Standalone

BLAST directory.

 

 

 

Blastall

--------

 

Blastall may be used to perform all five flavors of blast comparison. One

may obtain the blastall options by executing 'blastall -' (note the dash). A

typical use of blastall, to perform a blastn search (nucl. vs. nucl.)

of a file called QUERY, would be:

 

blastall -p blastn -d nr -i QUERY -o out.QUERY

 

The output is placed into the output file out.QUERY and the search is performed

against the 'nr' database.  If a protein vs. protein search is desired,

then 'blastn' should be replaced with 'blastp' etc.
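
When blastall is driven from a script, the same command line can be assembled as an argument list. The following is a minimal sketch using only the -p, -d, -i and -o options shown above; the helper names blastall_argv and run_blastall are hypothetical, and run_blastall assumes that blastall is on the PATH.

```python
import subprocess
from typing import List

# The five program names accepted by the -p option.
PROGRAMS = ("blastp", "blastn", "blastx", "tblastn", "tblastx")


def blastall_argv(program: str, database: str, query: str, output: str) -> List[str]:
    """Build the blastall argument list for one search."""
    if program not in PROGRAMS:
        raise ValueError(f"unknown program {program!r}; expected one of {PROGRAMS}")
    return ["blastall", "-p", program, "-d", database, "-i", query, "-o", output]


def run_blastall(argv: List[str]) -> None:
    """Run the search; raises CalledProcessError if blastall exits with an error."""
    subprocess.run(argv, check=True)
```

For example, run_blastall(blastall_argv("blastn", "nr", "QUERY", "out.QUERY")) corresponds to the command line shown above.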

 

Some of the most commonly used blastall options are:

 

blastall   arguments:

 

  -p  Program Name [String]

 

        Input should be one of "blastp", "blastn", "blastx", "tblastn", or "tblastx".

 

  -d  Database [String]

    default = nr

 

        The database specified must first be formatted with formatdb.

        Multiple database names (bracketed by quotations) will be accepted.

        An example would be

 

                -d "nr est"

 

        which will search both the nr and est databases, presenting the results as if one

        'virtual' database consisting of all the entries from both were searched.   The

        statistics are based on the 'virtual' database of nr and est. 

 

        The currently recommended method to search very large datasets (i.e., over 2 Gig)

        is to break the original file into files under 2 Gig, 'formatdb' each file separately,

        and run blastall, specifying all the files comprising the original dataset.

 

  -i  Query File [File In]

    default = stdin

 

        The query should be in FASTA format.  If multiple FASTA entries are in the input

        file, all queries will be searched.

 

  -e  Expectation value (E) [Real]

    default = 10.0

 

  -o  BLAST report Output File [File Out]  Optional

    default = stdout

 

  -F  Filter query sequence (DUST with blastn, SEG with others) [T/F]

    default = T

 

         BLAST 2.0 uses the dust low-complexity filter for blastn and seg for the

         other programs. Both 'dust' and 'seg' are integral parts of the NCBI toolkit

         and are accessed automatically.

 

         If one uses "-F T" then normal filtering by seg or dust (for blastn)

         occurs (likewise "-F F" means no filtering whatsoever).  The seg options

         can be changed by using:

 

         -F "S 10 1.0 1.5"

 

         which specifies a window of 10, locut of 1.0 and hicut of 1.5.  A coiled-coil filter,

         based on the work of Lupas et al. (Science, vol 252, pp. 1162-4 (1991)) and written by

         John Kuzio (Wilson et al., J Gen Virol, vol. 76, pp. 2923-32 (1995)), may be invoked

         by specifying:

 

         -F "C"

 

         There are three parameters for this: window, cutoff (prob of a coiled-coil), and

         linker (distance between two coiled-coil regions that should be linked

         together).  These are now set to

 

         window: 22

         cutoff: 40.0

         linker: 32

 

         One may also change the coiled-coil parameters in a manner analogous to

         that of seg:

 

         -F "C 28 40.0 32" will change the window to 28.

 

         One may also run both seg and coiled-coil together by using a ";":

 

         -F "C;S"

 

         Filtering by dust may also be specified by:

 

         -F "D"

 

         It is possible to specify that the masking should only be done during

         the process of building the initial words by starting the filtering

         command with 'm':

 

         -F "m S"

 

         which specifies that seg (with default arguments) should be used for masking,

         but that the masking should only be done when the words are being built. 

         This masking option is available with all filters.

 

  -S  Query strands to search against database (for blast[nx], and tblastx).  3 is both, 1 is top, 2 is bottom [Integer]

    default = 3

 

  -T  Produce HTML output [T/F]

    default = F

 

  -l  Restrict search of database to list of GI's [String]  Optional

 

      This option specifies that only a subset of the database should be

      searched, determined by the list of gi's (i.e., NCBI identifiers) in a

      file.  One can obtain a list of gi's for a given Entrez query from

      http://www.ncbi.nlm.nih.gov/Entrez/batch.html.  This file should

      be in the same directory as the database, or in the directory that

      BLAST is called from.
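
The -F filter strings described above follow a small, regular syntax, so a script can assemble them rather than hard-code them. The following is a minimal sketch covering only the forms shown in this document; the helper names are hypothetical, and combinations not shown above (for example, masking-only mode together with several filters) are not attempted.

```python
def seg_filter(window: int, locut: float, hicut: float) -> str:
    """Filter string for seg with explicit parameters, e.g. 'S 10 1.0 1.5'."""
    return f"S {window} {locut} {hicut}"


def coil_filter(window: int, cutoff: float, linker: int) -> str:
    """Filter string for the coiled-coil filter, e.g. 'C 28 40.0 32'."""
    return f"C {window} {cutoff} {linker}"


def combine(*filters: str) -> str:
    """Join filter strings with ';' so that they are run together."""
    return ";".join(filters)


def mask_only(filter_string: str) -> str:
    """Prefix with 'm' so masking applies only while the words are built."""
    return f"m {filter_string}"
```

The resulting string is passed as the single value of -F; in an argument list it needs no extra quoting, for example ["-F", seg_filter(10, 1.0, 1.5)].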

 

 

 

Blastpgp

--------

 

Blastpgp performs gapped blastp searches and can be used to perform

iterative searches in psi-blast and phi-blast mode. See the PSI-Blast and

PHI-BLAST sections (below) for a description of this binary. The options may be

obtained by executing 'blastpgp -'.

 

  -T  Produce HTML output [T/F]

    default = F

 

  -Q  Output File for PSI-BLAST Matrix in ASCII [File Out]  Optional

 

Bl2seq

------

 

Bl2seq performs a comparison between two sequences using either the blastn or

blastp algorithm.  Both sequences must be either nucleotides or proteins.

The options may be obtained by executing 'bl2seq -'.

 

  -i  First sequence [File In]

  -j  Second sequence [File In]

  -p  blastp? (blastn otherwise) [T/F]

    default = T

  -g  Gapped [T/F]

    default = T

  -o  alignment output file [File Out]

    default = stdout

  -d  theor. db size (zero is real size) [Integer]

    default = 0

  -a  SeqAnnot output file [File Out]  Optional

  -G  Cost to open a gap (zero invokes default behavior) [Integer]

    default = 0

  -E  Cost to extend a gap (zero invokes default behavior) [Integer]

    default = 0

  -X  X dropoff value for gapped alignment (in bits) (zero invokes default behavior) [Integer]

    default = 0

  -W  Wordsize (zero invokes default behavior) [Integer]

    default = 0

  -M  Matrix [String]

    default = BLOSUM62

  -q  Penalty for a nucleotide mismatch (blastn only) [Integer]

    default = -3

  -r  Reward for a nucleotide match (blastn only) [Integer]

    default = 1

  -F  Filter query sequence (DUST with blastn, SEG with others) [String]

    default = T

  -e  Expectation value (E) [Real]

    default = 10.0

  -S  Query strands to search against database (blastn only).  3 is both, 1 is top, 2 is bottom [Integer]

    default = 3

  -T  Produce HTML output [T/F]

    default = F

 

 

Fastacmd

--------

 

Fastacmd retrieves FASTA formatted sequences from a BLAST database, if it was formatted

using the '-o' option.  An example fastacmd call would be:

 

fastacmd -d nr -s p38398

 

The fastacmd options are:

 

fastacmd   arguments:

 

  -d  Database [String]

    default = nr

  -s  Search string: GIs, accessions and locuses may be used, delimited

      by comma or space [String]  Optional

  -i  Input file with GIs/accessions/locuses for batch retrieval [String]  Optional

  -a  Retrieve duplicated accessions [T/F]  Optional

    default = F

  -l  Line length for sequence [Integer]  Optional

    default = 80

 

PSI-Blast

---------

 

The blastpgp program can do an iterative search in which

sequences found in one round of searching are used to build

a score model for the next round of searching. In this usage,

the program is called Position-Specific Iterated BLAST, or PSI-BLAST.

As explained in the accompanying paper, the BLAST algorithm is

not tied to a specific score matrix. Traditionally, it has been

implemented using an AxA substitution matrix where A is the alphabet size.

PSI-BLAST instead uses a QxA matrix, where Q is the length of the query

sequence; at each position the cost of a letter depends on the position

w.r.t. the query and the letter in the subject sequence.
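
The difference between the two kinds of matrix can be seen in a toy ungapped scoring example. The following is a minimal sketch with a made-up two-letter alphabet and made-up scores; none of the numbers come from BLAST, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def score_with_substitution_matrix(
    query: str, subject: str, matrix: Dict[Tuple[str, str], int]
) -> int:
    """AxA scoring: each score depends on the (query letter, subject letter) pair."""
    if len(query) != len(subject):
        raise ValueError("ungapped scoring needs segments of equal length")
    return sum(matrix[(q, s)] for q, s in zip(query, subject))


def score_with_position_matrix(subject: str, pssm: List[Dict[str, int]]) -> int:
    """QxA scoring: each score depends on the query position and the subject letter."""
    if len(pssm) != len(subject):
        raise ValueError("the matrix needs one column of scores per query position")
    return sum(column[s] for column, s in zip(pssm, subject))


# Toy data over the alphabet {A, C}; the query is "AAC".
TOY_MATRIX = {("A", "A"): 2, ("A", "C"): -1, ("C", "A"): -1, ("C", "C"): 3}
TOY_PSSM = [{"A": 2, "C": -1}, {"A": 1, "C": 0}, {"A": -2, "C": 4}]
```

With the AxA matrix both A positions of the query score a C identically; with the QxA matrix the second position tolerates a C better than the first, which is the kind of position dependence PSI-BLAST exploits.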

 

The position-specific matrix for round i+1 is built from a constrained

multiple alignment among the query and the sequences found with

sufficiently low e-value in round i.  The top part of the output for

each round divides the sequences into two groups: sequences found

previously and used in the score model, and sequences not used in the

score model. The output currently includes lots of diagnostics

requested by users at NCBI. To skip quickly from the output of

one round to the next, search for the string "producing", which is

part of the header for each round and likely does not appear elsewhere

in the output. PSI-BLAST "converges" and stops if all sequences

found at round i+1 below the e-value threshold were already in

the model at the beginning of the round.
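
The convergence rule above amounts to a subset test. The following is a minimal sketch, assuming that each round's hits are available as a mapping from sequence identifier to e-value and that "below the threshold" means strictly less than; the function names are hypothetical and this is not the code used by blastpgp.

```python
from typing import Dict, Set


def sequences_below_threshold(hits: Dict[str, float], threshold: float) -> Set[str]:
    """Identifiers of hits whose e-value is below the inclusion threshold."""
    return {seq_id for seq_id, evalue in hits.items() if evalue < threshold}


def has_converged(model: Set[str], hits: Dict[str, float], threshold: float) -> bool:
    """True if every qualifying hit of this round was already in the model."""
    return sequences_below_threshold(hits, threshold) <= model
```

A driver loop would stop as soon as has_converged returns True or the maximum number of rounds is reached.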

 

There are several blastpgp parameters specifically for PSI-BLAST:

-j   is the maximum number of rounds (default 1; i.e., regular BLAST)

-e   is the e-value threshold for including sequences in the

     score matrix model (default 0.01)

-c   is the "constant" used in the pseudocount formula specified in the

     paper (default 10)

 

The -C and -R flags provide a "checkpointing" facility whereby

a score model can be stored and later reused.

   -C  stores the query and frequency count ratio matrix in a

                  file

   -R  restarts from a file stored previously.

When using -R, it is required that the query specified on the command line

match exactly the query in the restart file.

The checkpoint files are stored in a byte-encoded (not human readable)

format, so as to prevent roundoff error between writing and reading

the checkpoint.

Users who also develop their own sequence analysis software may wish

to develop their own scoring systems. For this purpose the code

in posit.c that writes out the checkpoint can be easily adapted to

write out scoring systems derived by other algorithms in such

a way that PSI-BLAST can read the files in later.

The checkpoint structure is general in the sense that it can handle

any position-specific matrix that fits in the Karlin-Altschul

statistical framework for BLAST scoring.

 

The -B flag provides a way to jump start PSI-BLAST from a master-slave

multiple alignment computed outside PSI-BLAST.  The multiple alignment

must include the query sequence as one of the sequences, but it need

not be the first sequence.  The multiple alignment must be specified

in a format that is derived from Clustal, but without some headers and

trailers.  See example below. The rules are also described by the

following words.  Suppose the multiple alignment has N sequences.  It

may be presented in 1 or more blocks, where each block presents a

range of columns from the multiple alignment.  E.g., the first block

might have columns 1-60, the second block might have columns 61-95,

the third block might have columns 96-128. Each block should have N

rows, 1 row per sequence.  The sequences should be in the same order

in every block.  Blocks are separated by 1 or more blank lines.

Within a block there are no blank lines, and each line consists of 1

sequence identifier followed by some white space followed by

characters (and gaps) for that sequence in the multiple alignment.  In

each column, all letters must be in upper case, or all letters must be

in lower case.  Upper case means that this column is to be given

position-specific scores. Lower-case means to use the underlying

matrix (specified by -M) for this column; e.g., if the query sequence

has an 'l' residue in the column, then the standard scores for

matching an L are used in the column.
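
The block rules above can be checked mechanically before an alignment file is handed to -B. The following is a minimal sketch of such a check; the function names are hypothetical, and the requirement that all rows of a block have the same length is an assumption made here (it follows from each block covering one range of columns) rather than something stated above.

```python
from typing import List, Tuple

Block = List[Tuple[str, str]]  # rows of (sequence identifier, residues and gaps)


def parse_blocks(text: str) -> List[Block]:
    """Split alignment text into blocks separated by one or more blank lines."""
    blocks: List[Block] = []
    current: Block = []
    for raw in text.splitlines():
        if not raw.strip():
            if current:
                blocks.append(current)
                current = []
            continue
        parts = raw.split()
        if len(parts) != 2:
            raise ValueError(f"expected 'identifier residues', got: {raw!r}")
        current.append((parts[0], parts[1]))
    if current:
        blocks.append(current)
    return blocks


def validate(blocks: List[Block]) -> None:
    """Raise ValueError if the blocks break one of the rules described above."""
    if not blocks:
        raise ValueError("no alignment blocks found")
    ids = [identifier for identifier, _ in blocks[0]]
    for number, block in enumerate(blocks, start=1):
        if [identifier for identifier, _ in block] != ids:
            raise ValueError(f"block {number}: sequences differ from block 1")
        rows = [residues for _, residues in block]
        if len({len(row) for row in rows}) != 1:
            raise ValueError(f"block {number}: rows have different lengths")
        for column, letters in enumerate(zip(*rows), start=1):
            alphabetic = [c for c in letters if c.isalpha()]
            has_upper = any(c.isupper() for c in alphabetic)
            has_lower = any(c.islower() for c in alphabetic)
            if has_upper and has_lower:
                raise ValueError(f"block {number}, column {column}: mixed case")
```

Gap characters are ignored in the case check, so a column holding lower-case letters and gaps is accepted.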

 

A sample usage would be:

 

  blastpgp -i seq1 -B align1 -j 2 -d nr

 

where seq1 is the query

      align1 is the alignment file

      -j 2 indicates to do 2 rounds

      -d nr indicates to use the nr database

 

The example files

    seq1

    align1

copied below were kindly supplied by L. Aravind from a paper

he and Chris Ponting published in Protein Science:

 

Aravind L, Ponting CP, Homologues of 26S proteasome subunits

are regulators of transcription and translation, Protein Science

7(1998) 1250-1254.

 

L. Aravind (aravind@ncbi.nlm.nih.gov) was the first user

and helped define how -B should work. Y. Wolf (wolf@ncbi.nlm.nih.gov)

helped design a more flexible input format for the alignments.

If you like how -B works, let them know.

If you do not like how -B works, complain to

A. Schaffer(schaffer@helix.nih.gov) who did the implementation.

 

seq1

----

> 26SPS9_Hs

IHAAEEKDWKTAYSYFYEAFEGYDSIDSPKAITSLKYMLLCKIMLNTPEDVQALVSGKLALRYAGRQTEA

LKCVAQASKNRSLADFEKALTDYRAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQIEHISSLIKL

SKADVERKLSQMILDKKFHGILDQGEGVLIIFDEPP

 

 

align1

------

26SPS9_Hs     IHAAEEKDWKTAYSYFYEAFEGYdsidspkaitslkymllckimlntpedvqalvsgklalryagrqtealkcvaqasknr

F57B9_Ce      LHAADEKDFKTAFSYFYEAFEGYdsvdekvsaltalkymllckvmldlpdevnsllsaklalkyngsdldamkaiaaaaqk

YDL097c_Sc    ILHCEDKDYKTAFSYFFESFESYhnltthnsyekacqvlkymllskimlnliddvknilnakytketyqsrgidamkavae

YMJ5_Ce       LYSAEERDYKTSFSYFYEAFEGFasigdkinatsalkymilckimlneteqlagllaakeivayqkspriiairsmadafr

FUS6_ARATH    KNYIRTRDYCTTTKHIIHMCMNAilvsiemgqfthvtsyvnkaeqnpetlepmvnaklrcasglahlelkkyklaarkfld

COS41.8_Ci    SLDYKLKTYLTIARLYLEDEDPVqaemyinrasllqnetadeqlqihykvcyarvldyrrkfleaaqrynelsyksaihet

644879        KCYSRARDYCTSAKHVINMCLNVikvsvylqnwshvlsyvskaestpeiaeqrgerdsqtqailtklkcaaglaelaarky

YPR108w_Sc    IHCLAVRNFKEAAKLLVDSLATFtsieltsyesiatyasvtglftlertdlkskvidspellslisttaalqsissltisl

eif-3p110_Hs  SKAMKMGDWKTCHSFIINEKMNGkvw-------------------------------------------------------

T23D8.4_Ce    SKAMLNGDWKKCQDYIVNDKMNQkvw-------------------------------------------------------

YD95_Sp       IYLMSIRNFSGAADLLLDCMSTFsstellpyydvvryavisgaisldrvdvktkivdspevlavlpqnesmssleacinsl

KIAA0107_Hs   LYCVAIRDFKQAAELFLDTVSTFtsyelmdyktfvtytvyvsmialerpdlrekvikgaeilevlhslpavrqylfslyec

F49C12.8_Hs   LYRMSVRDFAGAADLFLEAVPTFgsyelmtyenlilytvitttfaldrpdlrtkvircnevqeqltggglngtlipvreyl

Int-6_Mm      KFQYECGNYSGAAEYLYFFRVLVpatdrnalsslwgklaseilmqnwdaamedltrlketidnnsvssplqslqqrtwlih

 

26SPS9_Hs     sladfekaltdy-----------------------------------------------------------------------------------

F57B9_Ce      rslkdfqvafgsf----------------------------------------------------------------------------------

YDL097c_Sc    aynnrslldfntalkqy------------------------------------------------------------------------------

YMJ5_Ce       krslkdfvkalaeh---------------------------------------------------------------------------------

FUS6_ARATH    vnpelgnsyneviapqdiatygglcalasfdrselkqkvidninfrnflelvpdvrelindfyssryascleylasl------------------

COS41.8_Ci    eqtkalekalncailapagqqrsrmlatlfkdercqllpsfgilekmfldriiksdemeefar--------------------------------

644879        kqaakclllasfdhcdfpellspsnvaiygglcalatfdrqelqrnvissssfklflelepqvrdiifkfyeskyasclkmldem----------

YPR108w_Sc    yasdyasyfpyllety-------------------------------------------------------------------------------

eif-3p110_Hs  -----------------------------------------------------------------------------------------------

T23D8.4_Ce    -----------------------------------------------------------------------------------------------

YD95_Sp       ylcdysgffrtladve-------------------------------------------------------------------------------

KIAA0107_Hs   rysvffqslavv-----------------------------------------------------------------------------------

F49C12.8_Hs   esyydchydrffiqlaale----------------------------------------------------------------------------

Int-6_Mm      wslfvffnhpkgrdniidlflyqpqylnaiqtmcphilrylttavitnkdvrkrrqvlkdlvkviqqesytykdpitefveclyvnfdfdgaqkk

 

26SPS9_Hs     ----RAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQIEHISSLIKLSKADVERKLSQMILDKKFHGILDQGEGVLIIFDEPP

F57B9_Ce      ----PQELQMDPVVRKHFHSLSERMLEKDLCRIIEPYSFVQIEHVAQQIGIDRSKVEKKLSQMILDQKLSGSLDQGEGMLIVFEIAV

YDL097c_Sc    ----EKELMGDELTRSHFNALYDTLLESNLCKIIEPFECVEISHISKIIGLDTQQVEGKLSQMILDKIFYGVLDQGNGWLYVYETPN

YMJ5_Ce       ----KIELVEDKVVAVHSQNLERNMLEKEISRVIEPYSEIELSYIARVIGMTVPPVERAIARMILDKKLMGSIDQHGDTVVVYPKAD

FUS6_ARATH    ----KSNLLLDIHLHDHVDTLYDQIRKKALIQYTLPFVSVDLSRMADAFKTSVSGLEKELEALITDNQIQARIDSHNKILYARHADQ

COS41.8_Ci    ----QLMPHQKAITADGSNILHRAVTEHNLLSASKLYNNIRFTELGALLEIPHQMAEKVASQMICESRMKGHIDQIDGIVFFERRET

644879        ----KDNLLLDMYLAPHVRTLYTQIRNRALIQYFSPYVSADMHRMAAAFNTTVAALEDELTQLILEGLISARVDSHSKILYARDVDQ

YPR108w_Sc    ----ANVLIPCKYLNRHADFFVREMRRKVYAQLLESYKTLSLKSMASAFGVSVAFLDNDLGKFIPNKQLNCVIDRVNGIVETNRPDN

eif-3p110_Hs  ----DLFPEADKVRTMLVRKIQEESLRTYLFTYSSVYDSISMETLSDMFELDLPTVHSIISKMIINEELMASLDQPTQTVVMHRTEP

T23D8.4_Ce    ----NLFHNAETVKGMVVRRIQEESLRTYLLTYSTVYATVSLKKLADLFELSKKDVHSIISKMIIQEELSATLDEPTDCLIMHRVEP

YD95_Sp       ----VNHLKCDQFLVAHYRYYVREMRRRAYAQLLESYRALSIDSMAASFGVSVDYIDRDLASFIPDNKLNCVIDRVNGVVFTNRPDE

KIAA0107_Hs   ----EQEMKKDWLFAPHYRYYVREMRIHAYSQLLESYRSLTLGYMAEAFGVGVEFIDQELSRFIAAGRLHCKIDKVNEIVETNRPDS

F49C12.8_Hs   ----SERFKFDRYLSPHFNYYSRGMRHRAYEQFLTPYKTVRIDMMAKDFGVSRAFIDRELHRLIATGQLQCRIDAVNGVIEVNHRDS

Int-6_Mm      lrecESVLVNDFFLVACLEDFIENARLFIFETFCRIHQCISINMLADKLNMTPEEAERWIVNLIRNARLDAKIDSKLGHVVMGNNAV

 

 

 

 

 

PHI-Blast

---------

 

PHI-BLAST (Pattern-Hit Initiated BLAST) is a search

program that combines matching of regular expressions

with local alignments surrounding the match.

The most important features of the program have been

incorporated into the BLAST software framework

partly for user convenience and partly so that

PHI-BLAST may be combined seamlessly with PSI-BLAST.

Other features that do not fit into the BLAST framework

will be released later as a separate program and/or

separate Web page query options.

 

One very restrictive way to identify protein motifs

is by regular expressions that must contain each instance

of the motif. The PROSITE database is a compilation of

restricted regular expressions that describe protein motifs.

Given a protein sequence S and a regular expression pattern P

occurring in S, PHI-BLAST helps answer the question:

What other protein sequences both contain an occurrence of P

and are homologous to S in the vicinity of the pattern occurrences?

PHI-BLAST may be preferable to just searching for pattern occurrences

because it filters out those cases where the pattern occurrence is

probably random and not indicative of homology.

PHI-BLAST may be preferable to other flavors of BLAST because

it is faster and because it allows the user to express

a rigid pattern occurrence requirement.

 

The pattern search methods in PHI-BLAST are based on the

algorithms in:

 

R. Baeza-Yates and G. Gonnet, Communications of the ACM 35(1992), pp. 74-82.

S. Wu and U. Manber, Communications of the ACM 35(1992), pp. 83-91.

 

The calculation of local alignments is done using a method

very similar to (and much of the same code as) gapped BLAST.

However, the method of evaluating statistical significance is different, and

is described below.

 

In the stand-alone mode the typical PHI-BLAST usage looks like:

  blastpgp -i  -k  -p patseedp

 

  where -i is followed by the file containing the query in FASTA format

  where -k is followed by the file containing the pattern in a syntax given below

  and "patseedp" indicates the mode of usage; it does not name a file.

 

The syntax for the query sequence is FASTA format as for all other

BLAST queries. The syntax for patterns follows the rules of

PROSITE and is documented in detail below.

The specified pattern is not required to be in the PROSITE list.

Most of the other BLAST flags can be used with PHI-BLAST.

One important exception is that PHI-BLAST requires gapped

alignments (i.e. forbids -g F in the flags) because ungapped

alignments do not make sense for almost all patterns in PROSITE.

 

There is a second mode of PHI-BLAST usage that is important when

the specified pattern occurs more than 1 time in the query.

In this case, the user may be interested in restricting the

search for local alignments to a subset of the pattern occurrences.

This can be done with a search that looks like:

   blastpgp -i  -k  -p seedp

 

in which case the use of the "seedp" option requires the user to

specify the location(s) of the interesting pattern occurrence(s)

in the pattern file. The syntax for how to specify pattern

occurrences is below. When there are multiple pattern occurrences in the

query it may be important to decide how many are of interest because

the E-value for matches is effectively multiplied by the number

of interesting pattern occurrences.

 

The PHI-BLAST Web page supports only the "patseedp" option.

 

PHI-BLAST is integrated with PSI-BLAST. In the command-line

mode, PSI-BLAST can be invoked by using the -j option, as usual.

When this is done as:

   blastpgp -i  -k  -p patseedp -j

 

then the first round of searching uses PHI-BLAST and all subsequent

rounds use PSI-BLAST.

In the Web page setting, the user must explicitly invoke one round

at a time, and the PHI-BLAST Web page provides the option to

initiate a PSI-BLAST round with the PHI-BLAST results.

To describe a combined usage, use the term "PHI-PSI-BLAST"

(Pattern-Hit Initiated, Position-Specific Iterated BLAST).

 

Determining statistical significance.

 

When a query sequence Q matches a database sequence D in PHI-BLAST,

it is useful to subdivide Q and D into 3 disjoint pieces

    Qleft Qpattern Qright

    Dleft Dpattern Dright

 

The substrings Qpattern and Dpattern contain the pattern specified

in the pattern file. The pieces Qpattern and Dpattern are aligned

and that alignment is displayed as part of the PHI-BLAST output,

but the score for that alignment is mostly ignored.

The "reduced" score r of an alignment is the sum of the scores obtained

by aligning  Qleft with Dleft and by aligning Qright with Dright.

 

The expected number of alignments with a reduced score >= x

is given by:

       CN(Lambda*x + 1)e^(-Lambda *x)

where:

 

C and Lambda are "constants" depending on the score matrix and the

gap costs.

N is (number of occurrences of pattern in database) * (number of

      occurrences of pattern in Q)

e is the base of the natural logarithm.
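
The formula above can be evaluated directly. The following is a minimal sketch; the function and parameter names are hypothetical, and any values passed for C and Lambda are placeholders, since the real constants depend on the score matrix and gap costs.

```python
import math


def expected_alignments(
    x: float, c: float, lam: float, db_occurrences: int, query_occurrences: int
) -> float:
    """Expected number of alignments with reduced score >= x.

    Implements C * N * (Lambda * x + 1) * e^(-Lambda * x), where N is the
    product of the pattern occurrence counts in the database and in the query.
    """
    n = db_occurrences * query_occurrences
    return c * n * (lam * x + 1.0) * math.exp(-lam * x)
```

Because N is a product, doubling the number of pattern occurrences considered in the query doubles the expected count for the same reduced score.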

 

It is important to understand that this method of computing

the statistical significance of a PHI-BLAST alignment is mathematically

different from the method used for BLAST and PSI-BLAST alignments.

However, both methods provide E-values, so the E-values are

displayed with a similar output syntax.

 

Rules for pattern syntax for PHI-BLAST.

 

The syntax for patterns in PHI-BLAST follows the conventions

of PROSITE. When using the stand-alone program, it

is permissible to have multiple patterns in a file separated

by a blank line between patterns. When using the Web page,

only one pattern is allowed per query.

 

Valid protein characters for PHI-BLAST patterns:

    ABCDEFGHIKLMNPQRSTVWXYZU

 

Valid DNA characters for PHI-BLAST patterns:

    ACGT

 

Other useful delimiters:

    [ ]    means any one of the characters enclosed in the brackets

        e.g., [LFYT] means one occurrence of L or F or Y or T

    -      means nothing (this is a spacer character used by PROSITE)

    x with nothing following means any residue

    x(5)  means 5 positions in which any residue is allowed (and similarly for any other

          single number in parentheses after x)

    x(2,4) means 2 to 4 positions where any residue is allowed,

           and similarly for any other two numbers separated by a comma;

           the first number should be < the second number.

    >      can occur only at the end of a pattern and means nothing

           it may occur before a period

           (another spacer used by PROSITE)

 

    .      may be used at the end of the pattern and means nothing
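
The pattern rules above map closely onto ordinary regular expressions. The following is a minimal sketch of a converter for the elements listed above; the function name prosite_to_regex is hypothetical, this is not the matcher used inside PHI-BLAST, and trailing '>' and '.' characters are simply dropped, following the rule above that they mean nothing.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a pattern written with the rules above into a Python regex."""
    core = pattern.strip().rstrip(">.")  # trailing '>' and '.' carry no meaning
    parts = []
    for element in core.split("-"):  # '-' is only a spacer
        if not element:
            continue
        wildcard = re.fullmatch(r"x(?:\((\d+)(?:,(\d+))?\))?", element)
        if wildcard:
            low, high = wildcard.group(1), wildcard.group(2)
            if low is None:
                parts.append(".")  # x alone: any one residue
            elif high is None:
                parts.append(".{" + low + "}")  # x(5): exactly 5 residues
            else:
                parts.append(".{" + low + "," + high + "}")  # x(2,4)
        elif re.fullmatch(r"\[[A-Z]+\]", element) or re.fullmatch(r"[A-Z]", element):
            parts.append(element)  # a bracketed set or a single residue
        else:
            raise ValueError(f"unsupported pattern element: {element!r}")
    return "".join(parts)
```

The result can be used with re.finditer to locate occurrences of a pattern in a sequence held as a Python string.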

 

When using the stand-alone program, the pattern should

be in a file, with the first line starting:

 ID

followed by 2 spaces and a text string giving the pattern a name.

 

There should also be a line starting

 PA

followed by 2 spaces followed by the pattern description.

 

All other PROSITE codes in the first two columns are allowed,

but only the HI code, described below, is relevant to PHI-BLAST.

 

Here is an example from PROSITE.

 

ID   CNMP_BINDING_2; PATTERN.

AC   PS00889;

DT   OCT-1993 (CREATED); OCT-1993 (DATA UPDATE); NOV-1995 (INFO UPDATE).

DE   Cyclic nucleotide-binding domain signature 2.

PA   [LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV].

NR   /RELEASE=32,49340;

NR   /TOTAL=57(36); /POSITIVE=57(36); /UNKNOWN=0(0); /FALSE_POS=0(0);

NR   /FALSE_NEG=1; /PARTIAL=1;

CC   /TAXO-RANGE=??EP?; /MAX-REPEAT=2;

 

The line starting

    ID

gives the pattern a name.

The lines starting

     AC, DT, DE, NR, NR, CC

are relevant to PROSITE users, but irrelevant to PHI-BLAST.

These lines are tolerated, but ignored by PHI-BLAST.

 

The line starting

     PA

describes the pattern as:

      one of LIVMF

followed by

      G

followed by

      E

followed by

      any single character

followed by

      one of GAS

followed by

      one of LIVM

followed by

      any 5 to 11 characters

followed by

      R

followed by

      one of STAQ

followed by

      A

followed by

      any single character

followed by

      one of LIVMA

followed by

      any single character

followed by

      one of STACV

 

In this case the pattern ends with a period.

It can end with nothing after the last specifying symbol

or any number of > signs or periods or combination thereof.

 

Here is another example, illustrating the use of an HI line.

 

ID    ER_TARGET; PATTERN.

PA  [KRHQSA]-[DENQ]-E-L>.

HI (19 22)

HI (201 204)

 

In this example, the HI lines specify that the pattern

occurs twice, once from positions 19 through 22 in the

sequence and once from positions 201 through 204 in the

sequence.

These specifications are relevant when stand-alone PHI-BLAST is

used with the

     seedp

option, in which the interesting occurrences of the pattern

in the sequence are specified. In this case the

HI lines specify which occurrence(s) of the pattern

should be used to find good alignments.
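
A pattern file of this kind is easy to read in a script, for example to check the HI positions before a seedp run. The following is a minimal sketch, assuming the layout of the examples above (a two-letter code, spaces, then the content, with blank lines between patterns); the function name parse_patterns and the dictionary keys are hypothetical.

```python
import re
from typing import Dict, List

# An HI line carries two positions in parentheses, e.g. "HI (19 22)".
HI_CONTENT = re.compile(r"\(\s*(\d+)\s+(\d+)\s*\)")


def parse_patterns(text: str) -> List[Dict[str, object]]:
    """Return one dict per pattern with its name, pattern and HI occurrences."""
    entries: List[Dict[str, object]] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            current = None  # a blank line separates patterns
            continue
        code, _, rest = line.partition(" ")
        rest = rest.strip()
        if current is None:
            current = {"name": None, "pattern": None, "occurrences": []}
            entries.append(current)
        if code == "ID":
            current["name"] = rest
        elif code == "PA":
            current["pattern"] = rest
        elif code == "HI":
            match = HI_CONTENT.fullmatch(rest)
            if match is None:
                raise ValueError(f"cannot read HI line: {raw!r}")
            current["occurrences"].append((int(match.group(1)), int(match.group(2))))
        # Any other PROSITE code (AC, DT, DE, NR, CC, ...) is ignored.
    return entries
```

For the ER_TARGET example above, the single entry returned would list the occurrences (19, 22) and (201, 204).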

 

In general, the seedp option is more useful than the

standard patseedp option ONLY when the

pattern occurs K > 1 times in the sequence AND

the user is interested in matching to J < K of those

occurrences.

Then using the HI lines enables the user to specify which

occurrences are of interest.

 

Additional functionality related to PHI-BLAST.

 

PHI-BLAST takes as input both a sequence and a pattern occurring in

that sequence and searches a sequence database for

other sequences containing the same pattern and having a good alignment.

One may be interested in asking two related, simpler questions:

 

1. Given a sequence and a database of patterns, which patterns occur

in the sequence and where?

 

2. Given a pattern and a sequence database, which sequences contain the

pattern and where?

 

These queries can be answered with software closely related to PHI-BLAST,

but they do not fit into the output framework of BLAST because the

answers are simple lists without alignments and with no notion of

statistical significance.

 

The NCBI toolbox includes another program, currently called

     seedtop

to answer the two queries above.

 

Query 1 can be asked with:

  seedtop -i  -k  -p patmatchp

 

Query 2 can be asked with:

  seedtop -d  -k  -p patternp

 

The -k argument is used similarly in all queries and the file

format is always the same. The standard pattern database is

PROSITE, but others (or a subset) can be used.

There are plans afoot to offer the patmatchp query (number 1) on

the PHI-BLAST web page or in its vicinity, but this would

be restricted to having PROSITE as the pattern database.

 

BLAST OPTIONS

-------------

 

Formatdb

--------

 

Formatdb should be used to format FASTA databases, both protein and DNA, for BLAST 2.0. This must be done before blastall or blastpgp can be run locally. The format of the databases has been changed substantially from the BLAST 1.4 release. A major improvement in this format over the old one is that ambiguity information for DNA sequences is now retrieved from the files produced by formatdb, rather than from the original FASTA file. The original FASTA file is no longer needed for the BLAST runs.  This both saves disk space and improves performance, as the large FASTA file no longer needs to be accessed by BLAST.

 

The input for formatdb may be either ASN.1 or FASTA.  Use of ASN.1 is

advantageous for those sites that might also wish to format the ASN.1

in different ways, such as a GenBank report. Usage of formatdb may be

obtained by executing formatdb and a dash (note that additional comments

have been added here as indented paragraphs):

 

formatdb   arguments:

 

  -t  Title for database file [String]  Optional

  -i  Input file for formatting (this parameter must be set) [File In]

  -l  Logfile name: [File Out]  Optional

    default = formatdb.log

  -p  Type of file

         T - protein

         F - nucleotide [T/F]  Optional

 

    default = T

 

         The "-p" option has two different meanings depending on whether the input

         database is in FASTA or ASN.1 format. In the case of FASTA, "-p" specifies the

         type of the input database. In the case of ASN.1, the option specifies the type of

         sequence to be indexed for BLAST.

 

  -o  Parse options

         T - True: Parse SeqId and create indexes.

         F - False: Do not parse SeqId. Do not create indexes.

 

         If the "-o" option is TRUE (and the input database is in FASTA format), then

         the database identifiers in the FASTA definition line must follow the

         convention described in the appendices of

         ftp://ncbi.nlm.nih.gov/blast/db/README.  Also, see further explanation below.

 

 [T/F]  Optional

    default = F

  -a  Input file is database in ASN.1 format (otherwise FASTA is expected)

         T - True,

         F - False.

 [T/F]  Optional

    default = F

  -b  ASN.1 database in binary mode

         T - binary,

         F - text mode.

 

         An input ASN.1 database may be represented in two formats: ASCII text and

         binary. The "-b" option, if TRUE, specifies that the input ASN.1 database is in

         binary format. The option is ignored for a FASTA input database.

 

 [T/F]  Optional

    default = F

  -e  Input is a Seq-entry [T/F]  Optional

    default = F

 

         An input ASN.1 database (either ASCII text or binary) may contain a

         Bioseq-set or just one Bioseq. In the latter case the "-e" switch should be

         set to TRUE.

 

  -n  Base name for BLAST files [String]  Optional

 

         This option allows one to produce BLAST databases with a different name

         than the original FASTA file.  One could have a file named 'nt' and

         format it as 'nr':

 

         formatdb -i nt -p F -o T -n nr

 

         One could also uncompress the original FASTA file on the fly and send it to

         formatdb through the 'stdin' (under UNIX):

 

         uncompress -c nr.Z | formatdb -i stdin -o T -n nr

 

         This is useful when the original FASTA file is not required other than

         by formatdb, for example when disk space is tight.

 

  -v  Number of sequence bases to be created in the volume [Integer]  Optional

    default = 0

      This option breaks up large FASTA files into 'volumes' (each with a maximum

      size of 2 billion letters).  As part of the creation of the volumes, formatdb

      writes a new type of BLAST database file, called an alias file, with the

      extension 'nal' or 'pal'.
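
      To get a rough idea of how many volumes a given "-v" value would produce, one
      can count the sequence letters in the FASTA file and divide, rounding up.  The
      following is only a sketch: the function names and the file name mydb.fasta are
      made up for illustration, it assumes the "-v" value is given in letters, and the
      number of volumes formatdb actually writes may differ from this estimate.

```shell
# Count sequence letters in a FASTA file.
# Definition lines (starting with '>') and all whitespace are ignored.
count_letters() {
    grep -v '^>' "$1" | tr -d ' \t\r\n' | wc -c | tr -d ' '
}

# Ceiling division: volumes needed for $1 letters at $2 letters per volume.
estimate_volumes() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Example use with a hypothetical file name:
#   letters=$(count_letters mydb.fasta)
#   estimate_volumes "$letters" 2000000000
```

      If the estimate is 1, the database fits in a single volume and "-v" is
      probably not needed.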

 

  -s  Create indexes limited only to accessions - sparse [T/F]  Optional

    default = F

 

      This option limits the indices for the string identifiers (used by formatdb)

      to accessions (i.e., no locus names).  This is especially useful for sequence sets

      like the ESTs, where the accession and locus names are identical.  Formatdb runs

      faster and produces smaller temporary files if this option is used.  It is strongly

      recommended for ESTs, STSs, GSSs, and HTGSs.
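
      The options listed above can be combined on one command line.  The sketch
      below only prints the commands instead of running them, so it can be tried
      without formatdb installed.  The function names and the file names (est.fasta,
      mydb.asn) are hypothetical; the flags are the ones documented in the usage above.

```shell
# Print a formatdb command for a nucleotide FASTA file with parsed,
# accession-only (sparse) indexes, as recommended above for ESTs.
#   $1 = input FASTA file (hypothetical name)
#   $2 = base name for the BLAST files
sparse_nt_cmd() {
    echo "formatdb -i $1 -p F -o T -s T -n $2"
}

# Print a formatdb command for a protein database supplied as binary ASN.1.
#   $1 = input ASN.1 file (hypothetical name)
binary_asn_cmd() {
    echo "formatdb -i $1 -a T -b T -p T"
}

sparse_nt_cmd est.fasta est
binary_asn_cmd mydb.asn
```

      Once the printed command looks right, it can be pasted into the shell to
      run the real formatdb.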

 

 

 

 

FORMATDB NOTES:

It is always advantageous to use the '-o' option if the database identifiers

are in the format specified at ftp://ncbi.nlm.nih.gov/blast/db/README.  If

the database identifiers are in the parseable format, formatdb produces additional

indices allowing retrieval from the databases by identifier. The databases

on the NCBI FTP site contain parseable identifiers. It is sufficient if the

first word on the FASTA definition line is a unique identifier (e.g.,

">3091 Alcoho de..."). It is necessary to use parseable identifiers for the following

cases:

 

1.) If ASN.1 is to be produced from blastall or blastpgp, then "-o" must be TRUE.

 

2.) Master-slave alignments are desired (i.e., the '-m' option with a non-zero value is used).

 

3.) The gi's are desired as part of the output (i.e., '-I' is used).

 

4.) fastacmd is used to fetch sequences from the database by accession or gi.
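
Before running formatdb with '-o T' on a locally built FASTA file, it can help
to confirm that the first word of every definition line really is unique.  The
function below is a minimal sketch using standard UNIX tools; its name and the
file name mydb.fasta are made up for illustration.  It checks uniqueness only
and does not check the identifier syntax described in the NCBI README.

```shell
# Print every first word of a FASTA definition line that occurs more than once.
# No output means all identifiers are unique.
duplicate_ids() {
    grep '^>' "$1" | sed 's/^>//' | awk '{ print $1 }' | sort | uniq -d
}

# Example use with a hypothetical file name:
#   duplicate_ids mydb.fasta
```

Any identifier printed should be made unique in the FASTA file before the
database is formatted.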

 

 
