README for
stand-alone BLAST
(last updated 7/30/99)
This document provides
information on stand-alone BLAST.
Topics covered are
setting up stand-alone
BLAST, command-line options for stand-alone BLAST,
and a release history of
the different versions.
In order for Standalone BLAST to operate, you will need a .ncbirc file in
your home directory that contains the following lines:
[NCBI]
Data="path/data/"
Where "path/data/" is the path to the location of the Standalone BLAST "data"
subdirectory. For SACS users on socrates this is:
Data=/home/socr/c/lib/ncbiblast/data
Make sure that your .ncbirc file is either in the directory that you call the
Standalone BLAST program from or in your home directory.
The blast databases are preformatted
by SACS and the location is set by the system.
The names of the local
BLAST databases are as follows:
alu
ecoli.aa
ecoli.nt
epd
est
est_human
est_mouse
est_others
genpept
gss
htg
mito
month.aa
month.na
nr
nrdb90
nt
owl
pataa
patnt
pdbaa
pdbnt
pir
sts
swissprot
tfdaa
tfdnt
vector
yeast.aa
yeast.nt
*SACS will add any
customized database to this set or show users how to add their own databases to
this set.
You can also request a web
interface for BLAST searching your
custom dataset. Please contact
SACS at sacs@cgl.ucsf.edu or 476-5379 for more info.
The SACS local BLAST interface is located at
www.sacs.ucsf.edu/Resources/sequenceweb.html under the BLAST or SACSGCG links.
Your query sequence must
be in FASTA format to use the standalone BLAST tools.
>Test
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG
AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT
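The FASTA layout shown above (a '>' definition line followed by sequence
lines) can be checked before a query is handed to BLAST. The following Python
sketch is only an illustration; the function name read_fasta is hypothetical
and is not part of the BLAST distribution.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

A file with several '>' entries yields one tuple per entry, which mirrors the
way a multi-entry query file is searched entry by entry.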
To run the first search, enter the following command from the UNIX command
line in your BLAST directory:
blastall -p blastn -d ecoli.nt -i test.txt -o test.out
This should generate a results file called test.out in the Standalone BLAST
directory.
Blastall
--------
Blastall may be used to perform all five flavors of blast comparison. One
may obtain the blastall options by executing 'blastall -' (note the dash). A
typical use of blastall would be to perform a blastn search (nucl. vs. nucl.)
of a file called QUERY:
blastall -p blastn -d nr -i QUERY -o out.QUERY
The output is placed into the output file out.QUERY and the search is
performed against the 'nr' database. If a protein vs. protein search is
desired, then 'blastn' should be replaced with 'blastp', etc.
Some of the most commonly
used blastall options are:
blastall arguments:
-p Program Name [String]
Input should be one of "blastp",
"blastn", "blastx", "tblastn", or
"tblastx".
-d Database [String]
default = nr
The database specified must first be formatted with
formatdb.
Multiple database names (bracketed by quotations) will be accepted.
An example would be
-d "nr est"
which will search both the nr and est databases, presenting the results as if
one 'virtual' database consisting of all the entries from both were searched.
The statistics are based on the 'virtual' database of nr and est.
The currently recommended method to search very large datasets (i.e., over
2 Gig) is to break the original file into files under 2 Gig, 'formatdb' each
file separately, and run blastall, specifying all the files comprising the
original dataset.
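Breaking a large FASTA file into smaller files must not cut a record in half.
The Python sketch below shows one way to do the grouping, under the assumption
that the limit is expressed in bytes of FASTA text. The names group_records
and pack_records are hypothetical. Writing each volume to its own file and
running formatdb on it is left out.

```python
def group_records(lines):
    """Group FASTA lines into records, one list of lines per '>' entry."""
    records = []
    for line in lines:
        if line.startswith(">") or not records:
            records.append([])
        records[-1].append(line)
    return records


def pack_records(records, max_bytes):
    """Pack whole records into volumes of at most max_bytes each.

    A single record larger than max_bytes still gets a volume of its own,
    because records are never split.
    """
    volumes, current, size = [], [], 0
    for record in records:
        record_size = sum(len(line.encode()) for line in record)
        if current and size + record_size > max_bytes:
            volumes.append(current)
            current, size = [], 0
        current.extend(record)
        size += record_size
    if current:
        volumes.append(current)
    return volumes
```

Each returned volume would then be written to a separate file, formatted
separately, and all of the resulting database names passed together to -d.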
-i Query File [File In]
default = stdin
The query should be in FASTA format. If multiple FASTA entries are in the input
file, all queries will be searched.
-e Expectation value (E)
[Real]
default = 10.0
-o BLAST report Output
File [File Out] Optional
default = stdout
-F Filter query sequence
(DUST with blastn, SEG with others) [T/F]
default = T
BLAST 2.0 uses the dust low-complexity filter for blastn
and seg for the
other programs. Both 'dust' and 'seg' are integral parts
of the NCBI toolkit
and are accessed automatically.
If one uses "-F T" then normal filtering by seg
or dust (for blastn)
occurs (likewise "-F F" means no filtering
whatsoever). The seg options
can be changed by using:
-F "S 10 1.0 1.5"
which specifies a window of 10, locut of 1.0 and hicut of 1.5. A coiled-coil
filter, based on the work of Lupas et al. (Science, vol 252, pp. 1162-4
(1991)) and written by John Kuzio (Wilson et al., J Gen Virol, vol. 76, pp.
2923-32 (1995)), may be invoked by specifying:
-F "C"
There are three parameters for this: window, cutoff (prob of a coiled-coil),
and linker (distance between two coiled-coil regions that should be linked
together). These are now set to
window: 22
cutoff: 40.0
linker: 32
One may also change the coiled-coil parameters in a manner analogous to that
of seg:
-F "C 28 40.0 32" will change the window to 28.
One may also run both seg and coiled-coil together by using a ";":
-F "C;S"
Filtering by dust may also be specified by:
-F "D"
It is possible to specify that the masking should only be
done during
the process of building the initial words by starting the
filtering
command with 'm':
-F "m S"
which specifies that seg (with default arguments) should
be used for masking,
but that the masking should only be done when the words
are being built.
This masking option is available with all filters.
-S Query strands to
search against database (for blast[nx], and tblastx). 3 is both, 1 is top, 2 is bottom [Integer]
default = 3
-T Produce HTML output
[T/F]
default = F
-l Restrict search of
database to list of GI's [String]
Optional
This option specifies that only a subset of the database should
be
searched, determined by the list of gi's (i.e., NCBI
identifiers) in a
file. One can obtain a
list of gi's for a given Entrez query from
http://www.ncbi.nlm.nih.gov/Entrez/batch.html. This file should
be in the same directory as the database, or in the directory
that
BLAST is called from.
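As a rough illustration of how the options above fit together, the following
Python sketch assembles a blastall argument list. The helper name
blastall_args is hypothetical and is not part of the BLAST distribution; it
uses only flags described in this section (-p, -d, -i, -e, -o) and does not
run anything.

```python
PROGRAMS = ("blastp", "blastn", "blastx", "tblastn", "tblastx")


def blastall_args(program, databases, query, output=None, expect=None):
    """Build an argument list for blastall without executing it."""
    if program not in PROGRAMS:
        raise ValueError("unknown program: %r" % program)
    # Several database names are passed as ONE argument separated by
    # spaces, which is what the quoted form -d "nr est" does in a shell.
    args = ["blastall", "-p", program, "-d", " ".join(databases), "-i", query]
    if expect is not None:
        args += ["-e", str(expect)]
    if output is not None:
        args += ["-o", output]
    return args
```

The list could be handed to subprocess.run on a machine where blastall is
installed; passing a list rather than a single string avoids shell quoting
problems with the multi-database form.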
Blastpgp
--------
Blastpgp performs gapped
blastp searches and can be used to perform
iterative searches in
psi-blast and phi-blast mode. See the PSI-Blast and
PHI-BLAST sections (below)
for a description of this binary. The options may be
obtained by executing
'blastpgp -'.
-T Produce HTML output
[T/F]
default = F
-Q Output File for
PSI-BLAST Matrix in ASCII [File Out]
Optional
Bl2seq
------
Bl2seq performs a
comparison between two sequences using either the blastn or
blastp algorithm. Both sequences must be either nucleotides or
proteins.
The options may be
obtained by executing 'bl2seq -'.
-i First sequence [File
In]
-j Second sequence [File
In]
-p blastp? (blastn
otherwise) [T/F]
default = T
-g Gapped [T/F]
default = T
-o alignment output file
[File Out]
default = stdout
-d theor. db size (zero
is real size) [Integer]
default = 0
-a SeqAnnot output file
[File Out] Optional
-G Cost to open a gap
(zero invokes default behavior) [Integer]
default = 0
-E Cost to extend a gap
(zero invokes default behavior) [Integer]
default = 0
-X X dropoff value for
gapped alignment (in bits) (zero invokes default behavior) [Integer]
default = 0
-W Wordsize (zero invokes
default behavior) [Integer]
default = 0
-M Matrix [String]
default = BLOSUM62
-q Penalty for a
nucleotide mismatch (blastn only) [Integer]
default = -3
-r Reward for a
nucleotide match (blastn only) [Integer]
default = 1
-F Filter query sequence
(DUST with blastn, SEG with others) [String]
default = T
-e Expectation value (E)
[Real]
default = 10.0
-S Query strands to
search against database (blastn only).
3 is both, 1 is top, 2 is bottom [Integer]
default = 3
-T Produce HTML output
[T/F]
default = F
Fastacmd
--------
Fastacmd retrieves FASTA formatted sequences from a BLAST database, if it was
formatted using the '-o' option. An example fastacmd call would be:
fastacmd -d nr -s p38398
The fastacmd options are:
fastacmd arguments:
-d Database [String]
default = nr
-s Search string: GIs, accessions and locuses may be used (delimited
by comma or space) [String]
Optional
-i Input file with GIs/accessions/locuses for batch retrieval [String]
Optional
-a Retrieve duplicated
accessions [T/F] Optional
default = F
-l Line length for
sequence [Integer] Optional
default = 80
PSI-Blast
---------
The blastpgp program can
do an iterative search in which
sequences found in one
round of searching are used to build
a score model for the next
round of searching. In this usage,
the program is called
Position-Specific Iterated BLAST, or PSI-BLAST.
As explained in the
accompanying paper, the BLAST algorithm is
not tied to a specific
score matrix. Traditionally, it has been
implemented using an AxA
substitution matrix where A is the alphabet size.
PSI-BLAST instead uses a
QxA matrix, where Q is the length of the query
sequence; at each position
the cost of a letter depends on the position
w.r.t. the query and the
letter in the subject sequence.
The position-specific
matrix for round i+1 is built from a constrained
multiple alignment among
the query and the sequences found with
sufficiently low e-value
in round i. The top part of the output
for
each round distinguishes
the sequences into: sequences found
previously and used in the
score model, and sequences not used in the
score model. The output
currently includes lots of diagnostics
requested by users at
NCBI. To skip quickly from the output of
one round to the next,
search for the string "producing", which is
part of the header for
each round and likely does not appear elsewhere
in the output. PSI-BLAST
"converges" and stops if all sequences
found at round i+1 below
the e-value threshold were already in
the model at the beginning
of the round.
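The convergence rule just described can be stated compactly. The Python
sketch below is only an illustration of that rule, not code from blastpgp;
the function name and the dictionary layout (sequence identifier mapped to
e-value) are assumptions.

```python
def psi_blast_converged(model_ids, round_evalues, threshold):
    """Return True if round i+1 found nothing new below the threshold.

    model_ids     -- identifiers already in the model when the round began
    round_evalues -- dict mapping identifier -> e-value found in round i+1
    threshold     -- e-value threshold for inclusion in the model
    """
    below = {seq_id for seq_id, e in round_evalues.items() if e < threshold}
    # Converged when every qualifying hit was already part of the model.
    return below <= set(model_ids)
```

Hits above the threshold do not prevent convergence, since they would not be
added to the model in any case.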
There are several blastpgp
parameters specifically for PSI-BLAST:
-j is the maximum number of rounds (default 1;
i.e., regular BLAST)
-e is the e-value threshold for including
sequences in the
score matrix model (default 0.01)
-c is the "constant" used in the
pseudocount formula specified in the
paper (default 10)
The -C and -R flags
provide a "checkpointing" facility whereby
a score model can be
stored and later reused.
-C stores the query and
frequency count ratio matrix in a
file
-R restarts from a file
stored previously.
When using -R, it is
required that the query specified on the command line
match exactly the query in
the restart file.
The checkpoint files are
stored in a byte-encoded (not human readable)
format, so as to prevent
roundoff error between writing and reading
the checkpoint.
Users who also develop their own sequence analysis software may wish
to develop their own scoring systems. For this purpose the code
in posit.c that writes out the checkpoint can be easily adapted to
write out scoring systems derived by other algorithms in such
a way that PSI-BLAST can read the files in later.
The checkpoint structure
is general in the sense that it can handle
any position-specific
matrix that fits in the Karlin-Altschul
statistical framework for
BLAST scoring.
The -B flag provides a way
to jump start PSI-BLAST from a master-slave
multiple alignment
computed outside PSI-BLAST. The
multiple alignment
must include the query
sequence as one of the sequences, but it need
not be the first
sequence. The multiple alignment must
be specified
in a format that is
derived from Clustal, but without some headers and
trailers. See example below. The rules are also described by the
following words. Suppose the multiple alignment has N sequences. It
may be presented in 1 or more blocks, where each block presents a
range of columns from the
multiple alignment. E.g., the first
block
might have columns 1-60,
the second block might have columns 61-95,
the third block might have
columns 96-128. Each block should have N
rows, 1 row per
sequence. The sequences should be in
the same order
in every block. Blocks are separated by 1 or more blank
lines.
Within a block there are
no blank lines, and each line consists of 1
sequence identifier
followed by some white space followed by
characters (and gaps) for
that sequence in the multiple alignment.
In
each column, all letters
must be in upper case, or all letters must be
in lower case. Upper case means that this column is to be
given
position-specific scores.
Lower-case means to use the underlying
matrix (specified by -M)
for this column; e.g., if the query sequence
has an 'l' residue in the
column, then the standard scores for
matching an L are used in
the column.
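The block and case rules above lend themselves to a mechanical check. The
following Python sketch is a hedged illustration of those rules only; it is
not the parser used by blastpgp, and the function names and the two mode
labels are made up for this example. A column that holds nothing but gap
characters is reported as position-specific here, which is an arbitrary
choice.

```python
def parse_blocks(text):
    """Split alignment text into blocks of (identifier, segment) rows."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            identifier, segment = line.split()  # exactly two fields per row
            current.append((identifier, segment))
        elif current:
            blocks.append(current)  # a blank line ends the current block
            current = []
    if current:
        blocks.append(current)
    return blocks


def check_alignment(text):
    """Return (identifiers, per-column modes) or raise ValueError."""
    blocks = parse_blocks(text)
    if not blocks:
        raise ValueError("no alignment rows found")
    names = [identifier for identifier, _ in blocks[0]]
    for block in blocks:
        if [identifier for identifier, _ in block] != names:
            raise ValueError("every block must list the same sequences "
                             "in the same order")
        if len({len(segment) for _, segment in block}) != 1:
            raise ValueError("rows of a block must cover the same columns")
    # Concatenate the segments of each sequence across all blocks.
    rows = ["".join(block[i][1] for block in blocks)
            for i in range(len(names))]
    modes = []
    for column in zip(*rows):
        letters = [c for c in column if c.isalpha()]
        if all(c.isupper() for c in letters):
            modes.append("position-specific")
        elif all(c.islower() for c in letters):
            modes.append("underlying-matrix")
        else:
            raise ValueError("a column mixes upper and lower case")
    return names, modes
```

Running such a check on an alignment file before passing it to -B may catch
rows that were wrapped or reordered by an editor.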
A sample usage would be:
blastpgp -i seq1 -B align1 -j 2 -d nr
where seq1 is the query
align1 is the alignment file
-j 2 indicates to do 2 rounds
-d nr indicates to use the nr database
The example files
seq1
align1
copied below were kindly
supplied by L. Aravind from a paper
he and Chris Ponting
published in Protein Science:
Aravind L, Ponting CP,
Homologues of 26S proteasome subunits
are regulators of
transcription and translation, Protein Science
7(1998) 1250-1254.
L. Aravind
(aravind@ncbi.nlm.nih.gov) was the first user
and helped define how -B
should work. Y. Wolf (wolf@ncbi.nlm.nih.gov)
helped design a more
flexible input format for the alignments.
If you like how -B works,
let them know.
If you do not like how -B
works, complain to
A.
Schaffer(schaffer@helix.nih.gov) who did the implementation.
seq1
----
> 26SPS9_Hs
IHAAEEKDWKTAYSYFYEAFEGYDSIDSPKAITSLKYMLLCKIMLNTPEDVQALVSGKLALRYAGRQTEA
LKCVAQASKNRSLADFEKALTDYRAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQIEHISSLIKL
SKADVERKLSQMILDKKFHGILDQGEGVLIIFDEPP
align1
------
26SPS9_Hs     IHAAEEKDWKTAYSYFYEAFEGYdsidspkaitslkymllckimlntpedvqalvsgklalryagrqtealkcvaqasknr
F57B9_Ce      LHAADEKDFKTAFSYFYEAFEGYdsvdekvsaltalkymllckvmldlpdevnsllsaklalkyngsdldamkaiaaaaqk
YDL097c_Sc    ILHCEDKDYKTAFSYFFESFESYhnltthnsyekacqvlkymllskimlnliddvknilnakytketyqsrgidamkavae
YMJ5_Ce       LYSAEERDYKTSFSYFYEAFEGFasigdkinatsalkymilckimlneteqlagllaakeivayqkspriiairsmadafr
FUS6_ARATH    KNYIRTRDYCTTTKHIIHMCMNAilvsiemgqfthvtsyvnkaeqnpetlepmvnaklrcasglahlelkkyklaarkfld
COS41.8_Ci    SLDYKLKTYLTIARLYLEDEDPVqaemyinrasllqnetadeqlqihykvcyarvldyrrkfleaaqrynelsyksaihet
644879        KCYSRARDYCTSAKHVINMCLNVikvsvylqnwshvlsyvskaestpeiaeqrgerdsqtqailtklkcaaglaelaarky
YPR108w_Sc    IHCLAVRNFKEAAKLLVDSLATFtsieltsyesiatyasvtglftlertdlkskvidspellslisttaalqsissltisl
eif-3p110_Hs  SKAMKMGDWKTCHSFIINEKMNGkvw-------------------------------------------------------
T23D8.4_Ce    SKAMLNGDWKKCQDYIVNDKMNQkvw-------------------------------------------------------
YD95_Sp       IYLMSIRNFSGAADLLLDCMSTFsstellpyydvvryavisgaisldrvdvktkivdspevlavlpqnesmssleacinsl
KIAA0107_Hs   LYCVAIRDFKQAAELFLDTVSTFtsyelmdyktfvtytvyvsmialerpdlrekvikgaeilevlhslpavrqylfslyec
F49C12.8_Hs   LYRMSVRDFAGAADLFLEAVPTFgsyelmtyenlilytvitttfaldrpdlrtkvircnevqeqltggglngtlipvreyl
Int-6_Mm      KFQYECGNYSGAAEYLYFFRVLVpatdrnalsslwgklaseilmqnwdaamedltrlketidnnsvssplqslqqrtwlih

26SPS9_Hs     sladfekaltdy-----------------------------------------------------------------------------------
F57B9_Ce      rslkdfqvafgsf----------------------------------------------------------------------------------
YDL097c_Sc    aynnrslldfntalkqy------------------------------------------------------------------------------
YMJ5_Ce       krslkdfvkalaeh---------------------------------------------------------------------------------
FUS6_ARATH    vnpelgnsyneviapqdiatygglcalasfdrselkqkvidninfrnflelvpdvrelindfyssryascleylasl------------------
COS41.8_Ci    eqtkalekalncailapagqqrsrmlatlfkdercqllpsfgilekmfldriiksdemeefar--------------------------------
644879        kqaakclllasfdhcdfpellspsnvaiygglcalatfdrqelqrnvissssfklflelepqvrdiifkfyeskyasclkmldem----------
YPR108w_Sc    yasdyasyfpyllety-------------------------------------------------------------------------------
eif-3p110_Hs  -----------------------------------------------------------------------------------------------
T23D8.4_Ce    -----------------------------------------------------------------------------------------------
YD95_Sp       ylcdysgffrtladve-------------------------------------------------------------------------------
KIAA0107_Hs   rysvffqslavv-----------------------------------------------------------------------------------
F49C12.8_Hs   esyydchydrffiqlaale----------------------------------------------------------------------------
Int-6_Mm      wslfvffnhpkgrdniidlflyqpqylnaiqtmcphilrylttavitnkdvrkrrqvlkdlvkviqqesytykdpitefveclyvnfdfdgaqkk

26SPS9_Hs     ----RAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQIEHISSLIKLSKADVERKLSQMILDKKFHGILDQGEGVLIIFDEPP
F57B9_Ce      ----PQELQMDPVVRKHFHSLSERMLEKDLCRIIEPYSFVQIEHVAQQIGIDRSKVEKKLSQMILDQKLSGSLDQGEGMLIVFEIAV
YDL097c_Sc    ----EKELMGDELTRSHFNALYDTLLESNLCKIIEPFECVEISHISKIIGLDTQQVEGKLSQMILDKIFYGVLDQGNGWLYVYETPN
YMJ5_Ce       ----KIELVEDKVVAVHSQNLERNMLEKEISRVIEPYSEIELSYIARVIGMTVPPVERAIARMILDKKLMGSIDQHGDTVVVYPKAD
FUS6_ARATH    ----KSNLLLDIHLHDHVDTLYDQIRKKALIQYTLPFVSVDLSRMADAFKTSVSGLEKELEALITDNQIQARIDSHNKILYARHADQ
COS41.8_Ci    ----QLMPHQKAITADGSNILHRAVTEHNLLSASKLYNNIRFTELGALLEIPHQMAEKVASQMICESRMKGHIDQIDGIVFFERRET
644879        ----KDNLLLDMYLAPHVRTLYTQIRNRALIQYFSPYVSADMHRMAAAFNTTVAALEDELTQLILEGLISARVDSHSKILYARDVDQ
YPR108w_Sc    ----ANVLIPCKYLNRHADFFVREMRRKVYAQLLESYKTLSLKSMASAFGVSVAFLDNDLGKFIPNKQLNCVIDRVNGIVETNRPDN
eif-3p110_Hs  ----DLFPEADKVRTMLVRKIQEESLRTYLFTYSSVYDSISMETLSDMFELDLPTVHSIISKMIINEELMASLDQPTQTVVMHRTEP
T23D8.4_Ce    ----NLFHNAETVKGMVVRRIQEESLRTYLLTYSTVYATVSLKKLADLFELSKKDVHSIISKMIIQEELSATLDEPTDCLIMHRVEP
YD95_Sp       ----VNHLKCDQFLVAHYRYYVREMRRRAYAQLLESYRALSIDSMAASFGVSVDYIDRDLASFIPDNKLNCVIDRVNGVVFTNRPDE
KIAA0107_Hs   ----EQEMKKDWLFAPHYRYYVREMRIHAYSQLLESYRSLTLGYMAEAFGVGVEFIDQELSRFIAAGRLHCKIDKVNEIVETNRPDS
F49C12.8_Hs   ----SERFKFDRYLSPHFNYYSRGMRHRAYEQFLTPYKTVRIDMMAKDFGVSRAFIDRELHRLIATGQLQCRIDAVNGVIEVNHRDS
Int-6_Mm      lrecESVLVNDFFLVACLEDFIENARLFIFETFCRIHQCISINMLADKLNMTPEEAERWIVNLIRNARLDAKIDSKLGHVVMGNNAV
PHI-Blast
---------
PHI-BLAST (Pattern-Hit
Initiated BLAST) is a search
program that combines
matching of regular expressions
with local alignments
surrounding the match.
The most important
features of the program have been
incorporated into the
BLAST software framework
partly for user
convenience and partly so that
PHI-BLAST may be combined
seamlessly with PSI-BLAST.
Other features that do not
fit into the BLAST framework
will be released later as
a separate program and/or
separate Web page query
options.
One very restrictive way
to identify protein motifs
is by regular expressions
that must contain each instance
of the motif. The PROSITE
database is a compilation of
restricted regular
expressions that describe protein motifs.
Given a protein sequence S
and a regular expression pattern P
occurring in S, PHI-BLAST
helps answer the question:
What other protein
sequences both contain an occurrence of P
and are homologous to S in
the vicinity of the pattern occurrences?
PHI-BLAST may be
preferable to just searching for pattern occurrences
because it filters out
those cases where the pattern occurrence is
probably random and not
indicative of homology.
PHI-BLAST may be
preferable to other flavors of BLAST because
it is faster and because
it allows the user to express
a rigid pattern occurrence
requirement.
The pattern search methods
in PHI-BLAST are based on the
algorithms in:
R. Baeza-Yates and G.
Gonnet, Communications of the ACM 35(1992), pp. 74-82.
S. Wu and U. Manber,
Communications of the ACM 35(1992), pp. 83-91.
The calculation of local
alignments is done using a method
very similar to (and much
of the same code as) gapped BLAST.
However, the method of
evaluating statistical significance is different, and
is described below.
In the stand-alone mode
the typical PHI-BLAST usage looks like:
blastpgp -i <query_file> -k <pattern_file> -p patseedp
where -i is followed by the file containing the query in FASTA format,
-k is followed by the file containing the pattern in a syntax given below,
and "patseedp" indicates the mode of usage, not representing any file.
The syntax for the query
sequence is FASTA format as for all other
BLAST queries. The syntax
for patterns follows the rules of
PROSITE and is documented
in detail below.
The specified pattern is
not required to be in the PROSITE list.
Most of the other BLAST
flags can be used with PHI-BLAST.
One important exception is
that PHI-BLAST requires gapped
alignments (i.e. forbids
-g F in the flags) because ungapped
alignments do not make
sense for almost all patterns in PROSITE.
There is a second mode of
PHI-BLAST usage that is important when
the specified pattern
occurs more than 1 time in the query.
In this case, the user may
be interested in restricting the
search for local
alignments to a subset of the pattern occurrences.
This can be done with a
search that looks like:
blastpgp -i <query_file> -k <pattern_file> -p seedp
in which case the use of
the "seedp" option requires the user to
specify the location(s) of
the interesting pattern occurrence(s)
in the pattern file. The
syntax for how to specify pattern
occurrences is below. When
there are multiple pattern occurrences in the
query it may be important
to decide how many are of interest because
the E-value for matches is
effectively multiplied by the number
of interesting pattern
occurrences.
The PHI-BLAST Web page
supports only the "patseedp" option.
PHI-BLAST is integrated
with PSI-BLAST. In the command-line
mode, PSI-BLAST can be
invoked by using the -j option, as usual.
When this is done as:
blastpgp -i <query_file> -k <pattern_file> -p patseedp -j <rounds>
then the first round of
searching uses PHI-BLAST and all subsequent
rounds use PSI-BLAST.
In the Web page setting,
the user must explicitly invoke one round
at a time, and the
PHI-BLAST Web page provides the option to
initiate a PSI-BLAST round
with the PHI-BLAST results.
To describe a combined
usage, use the term "PHI-PSI-BLAST"
(Pattern-Hit Initiated,
Position-Specific Iterated BLAST).
Determining statistical
significance.
When a query sequence Q
matches a database sequence D in PHI-BLAST,
it is useful to subdivide
Q and D into 3 disjoint pieces
Qleft Qpattern Qright
Dleft Dpattern Dright
The substrings Qpattern
and Dpattern contain the pattern specified
in the pattern file. The
pieces Qpattern and Dpattern are aligned
and that alignment is
displayed as part of the PHI-BLAST output,
but the score for that
alignment is mostly ignored.
The "reduced"
score r of an alignment is the sum of the scores obtained
by aligning Qleft with Dleft and by aligning Qright with
Dright.
The expected number of
alignments with a reduced score >= x
is given by:
CN(Lambda*x + 1)e^(-Lambda *x)
where:
C and Lambda are
"constants" depending on the score matrix and the
gap costs.
N is (number of
occurrences of pattern in database) * (number of
occurrences of pattern in Q)
e is the base of the
natural logarithm.
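The formula above is easy to evaluate directly. The Python sketch below
simply transcribes it; the function name is hypothetical, and the values of C
and Lambda used in any call are placeholders, since the real constants depend
on the score matrix and gap costs and are not given in this document.

```python
import math


def phi_blast_expect(x, C, Lambda, N):
    """Expected number of alignments with reduced score >= x.

    Implements C * N * (Lambda*x + 1) * e^(-Lambda*x), where N is the
    number of pattern occurrences in the database multiplied by the
    number of pattern occurrences in the query.
    """
    return C * N * (Lambda * x + 1.0) * math.exp(-Lambda * x)


def pattern_pairs(db_occurrences, query_occurrences):
    """N as defined above: database occurrences times query occurrences."""
    return db_occurrences * query_occurrences
```

For non-negative x the value decreases as the reduced score grows, and it
scales linearly with N, which is why the number of pattern occurrences taken
into account in the query matters.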
It is important to
understand that this method of computing
the statistical
significance of a PHI-BLAST alignment is mathematically
different from the method
used for BLAST and PSI-BLAST alignments.
However, both methods provide E-values, so the E-values are
displayed with a similar output syntax.
Rules for pattern syntax
for PHI-BLAST.
The syntax for patterns in
PHI-BLAST follows the conventions
of PROSITE. When using the
stand-alone program, it
is permissible to have
multiple patterns in a file separated
by a blank line between
patterns. When using the Web-page
only one pattern is
allowed per query.
Valid protein characters
for PHI-BLAST patterns:
ABCDEFGHIKLMNPQRSTVWXYZU
Valid DNA characters for
PHI-BLAST patterns:
ACGT
Other useful delimiters:
[ ] means any one of
the characters enclosed in the brackets
e.g., [LFYT] means one occurrence of L or F or Y or T
- means nothing
(this is a spacer character used by PROSITE)
x with nothing following means any residue
x(5) means 5 positions
in which any residue is allowed (and similarly for any other
single number in parentheses after x)
x(2,4) means 2 to 4 positions where any residue is allowed,
and similarly for any other two numbers separated by a
comma;
the first number should be < the second number.
> can occur only
at the end of a pattern and means nothing
it may occur before a period
(another spacer used by PROSITE)
. may be used at
the end of the pattern and means nothing
When using the stand-alone
program, the pattern should
be in a file, with the
first line starting:
ID
followed by 2 spaces and a
text string giving the pattern a name.
There should also be a
line starting
PA
followed by 2 spaces
followed by the pattern description.
All other PROSITE codes in
the first two columns are allowed,
but only the HI code,
described below is relevant to PHI-BLAST.
Here is an example from
PROSITE.
ID CNMP_BINDING_2; PATTERN.
AC PS00889;
DT OCT-1993 (CREATED); OCT-1993 (DATA UPDATE);
NOV-1995 (INFO UPDATE).
DE Cyclic nucleotide-binding domain signature
2.
PA
[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV].
NR /RELEASE=32,49340;
NR /TOTAL=57(36); /POSITIVE=57(36);
/UNKNOWN=0(0); /FALSE_POS=0(0);
NR /FALSE_NEG=1; /PARTIAL=1;
CC /TAXO-RANGE=??EP?; /MAX-REPEAT=2;
The line starting
ID
gives the pattern a name.
The lines starting
AC, DT, DE, NR, NR, CC
are relevant to PROSITE
users, but irrelevant to PHI-BLAST.
These lines are tolerated,
but ignored by PHI-BLAST.
The line starting
PA
describes the pattern as:
one of LIVMF
followed by
G
followed by
E
followed by
any single character
followed by
one of GAS
followed by
one of LIVM
followed by
any 5 to 11 characters
followed by
R
followed by
one of STAQ
followed by
A
followed by
any single character
followed by
one of LIVMA
followed by
any single character
followed by
one of STACV
In this case the pattern
ends with a period.
It can end with nothing
after the last specifying symbol
or any number of >
signs or periods or combination thereof.
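The syntax rules above map closely onto ordinary regular expressions. The
following Python sketch translates a pattern into a Python regular
expression; it is an illustration under the rules as stated in this document
(in particular, trailing '>' and '.' are treated as meaning nothing) and is
not the matcher used inside PHI-BLAST. The function name prosite_to_regex is
hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    # Trailing '.' and '>' characters carry no meaning and are dropped.
    body = pattern.strip().rstrip(".>")
    parts = []
    for element in body.split("-"):  # '-' is only a spacer
        wildcard = re.fullmatch(r"x(?:\((\d+)(?:,(\d+))?\))?", element)
        if wildcard:
            low, high = wildcard.groups()
            if low is None:
                parts.append(".")                       # x
            elif high is None:
                parts.append(".{%s}" % low)             # x(5)
            else:
                parts.append(".{%s,%s}" % (low, high))  # x(2,4)
        elif re.fullmatch(r"[A-Z]|\[[A-Z]+\]", element):
            parts.append(element)  # a single residue or a [..] set
        else:
            raise ValueError("unsupported pattern element: %r" % element)
    return "".join(parts)
```

The result can be used with re.search or re.finditer to locate occurrences of
a pattern in a sequence string.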
Here is another example,
illustrating the use of an HI line.
ID ER_TARGET; PATTERN.
PA [KRHQSA]-[DENQ]-E-L>.
HI (19 22)
HI (201 204)
In this example, the HI
lines specify that the pattern
occurs twice, once from
positions 19 through 22 in the
sequence and once from
positions 201 through 204 in the
sequence.
These specifications are
relevant when stand-alone PHI-BLAST is
used with the
seedp
option, in which the
interesting occurrences of the pattern
in the sequence are
specified. In this case the
HI lines specify which
occurrence(s) of the pattern
should be used to find
good alignments.
In general, the seedp option is more useful than the standard patseedp
option ONLY when the pattern occurs K > 1 times in the sequence AND
the user is interested in matching to J < K of those occurrences.
Then using the HI lines
enables the user to specify which
occurrences are of
interest.
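A pattern file with ID, PA and HI lines can be read with very little code.
The Python sketch below is an assumption-laden illustration, not the reader
used by PHI-BLAST: it keeps only the three codes discussed here, ignores all
other two-letter codes, and expects HI lines of the form shown in the example
above. The function name parse_pattern_file is hypothetical.

```python
def parse_pattern_file(text):
    """Return a list of dicts with keys 'id', 'pattern' and 'hits'."""
    patterns, current = [], None
    for line in text.splitlines():
        code, value = line[:2], line[2:].strip()
        if code == "ID":
            # An ID line starts a new pattern entry.
            current = {"id": value, "pattern": "", "hits": []}
            patterns.append(current)
        elif current is None:
            continue  # nothing to attach to before the first ID line
        elif code == "PA":
            current["pattern"] += value
        elif code == "HI":
            # "(19 22)" -> (19, 22)
            start, end = value.strip("()").split()
            current["hits"].append((int(start), int(end)))
    return patterns
```

With the seedp mode in mind, the 'hits' list of an entry holds exactly the
occurrences the user has marked as interesting.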
Additional functionality
related to PHI-BLAST.
PHI-BLAST takes as input both a sequence and a pattern occurring in
that sequence and searches a sequence database for other sequences
containing the same pattern and having a good alignment.
One may be interested in
asking two related, simpler questions:
1. Given a sequence and a
database of patterns, which patterns occur
in the sequence and where?
2. Given a pattern and a
sequence database, which sequences contain the
pattern and where?
These queries can be answered with software closely related to PHI-BLAST,
but they do not fit into the output framework of BLAST because the
answers are simple lists without alignments and with no notion of
statistical significance.
The NCBI toolbox includes
another program, currently called
seedtop
to answer the two queries
above.
Query 1 can be asked with:
seedtop -i <sequence_file> -k <pattern_file> -p patmatchp
Query 2 can be asked with:
seedtop -d <database> -k <pattern_file> -p patternp
The -k argument is used
similarly in all queries and the file
format is always the same.
The standard pattern database is
PROSITE, but others (or a
subset) can be used.
There are plans afoot to
offer the patmatchp query (number 1) on
the PHI-BLAST web page or
in its vicinity, but this would
be restricted to having
PROSITE as the pattern database.
BLAST OPTIONS
-------------
Formatdb
--------
Formatdb should be used to format the FASTA databases, both protein and DNA,
for BLAST 2.0. This must be done before blastall or blastpgp can be run
locally. The format of the databases has been changed substantially from the
BLAST 1.4 release. A major improvement in this format over the old one is
that ambiguity information for DNA sequences is now retrieved from the files
produced by formatdb, rather than from the original FASTA file. The original
FASTA file is no longer needed for the BLAST runs. This both saves disk space
and improves performance, as the large FASTA file no longer needs to be
accessed by BLAST.
The input for formatdb may
be either ASN.1 or FASTA. Use of ASN.1
is
advantageous for those
sites that might also wish to format the ASN.1
in different ways, such as
a GenBank report. Usage of formatdb may be
obtained by executing
formatdb and a dash (note that additional comments
have been added here as
indented paragraphs):
formatdb arguments:
-t Title for database
file [String] Optional
-i Input file for
formatting (this parameter must be set) [File In]
-l Logfile name: [File
Out] Optional
default = formatdb.log
-p Type of file
T - protein
F - nucleotide [T/F]
Optional
default = T
The "-p" option has two different meanings depending on whether the input
database is in FASTA or ASN.1 format. In case of FASTA, the "-p" specifies
the type of input database. In case of ASN.1, the option specifies the type
of sequence to be indexed for BLAST.
-o Parse options
T - True: Parse SeqId and create indexes.
F - False: Do not parse SeqId. Do not create indexes.
If the "-o" option is TRUE (and the input
database is in FASTA format), then
the database identifiers in the FASTA definition line must
follow the
convention described in the appendices of
ftp://ncbi.nlm.nih.gov/blast/db/README. Also, see further explanation below.
[T/F] Optional
default = F
-a Input file is database
in ASN.1 format (otherwise FASTA is expected)
T - True,
F - False.
[T/F] Optional
default = F
-b ASN.1 database in
binary mode
T - binary,
F - text mode.
An input ASN.1 database may be represented in two formats
- ascii text and
binary. The "-b" option, if TRUE, specifies that
input ASN.1 database is in
binary format. The option is ignored in case of FASTA
input database.
[T/F] Optional
default = F
-e Input is a Seq-entry
[T/F] Optional
default = F
An input ASN.1 database (either text ascii or binary) may contain a
Bioseq-set or just one Bioseq. In the latter case the "-e" switch should be
set to TRUE.
-n Base name for BLAST
files [String] Optional
This option allows one to produce BLAST databases with a different name
than the original FASTA file. One could have a file named 'nt' and
format it as 'nr':
formatdb -i nt -p F -o T -n nr
One could also uncompress the original FASTA file on the fly and send it to
formatdb through the 'stdin' (under UNIX):
uncompress -c nr.z | formatdb -i stdin -o T -n nr
This can be used in situations where the original FASTA file is not required
other than by formatdb, which can help when disk space is tight.
-v Number of sequence
bases to be created in the volume [Integer]
Optional
default = 0
This option breaks up large FASTA files into 'volumes' (each with a maximum
size of 2 billion letters). As part of the creation of a volume, formatdb
writes a new type of BLAST database file, called an alias file, with the
extension 'nal' or 'pal'.
-s Create indexes limited
only to accessions - sparse [T/F]
Optional
default = F
This option limits the indices for the string identifiers (used by formatdb)
to accessions (i.e., no locus names). This is especially useful for sequence
sets like the EST's where the accession and locus names are identical.
Formatdb runs faster and produces smaller temporary files if this option is
used. It is strongly recommended for EST's, STS's, GSS's, and HTGS's.
FORMATDB NOTES:
It is always advantageous to use the '-o' option if the database identifiers
are in the format specified at ftp://ncbi.nlm.nih.gov/blast/db/README. If
the database identifiers are in the parseable format, formatdb produces
additional indices allowing retrieval from the databases by identifier. The
databases on the NCBI FTP site contain parseable identifiers. It is
sufficient if the first word on the FASTA definition line is a unique
identifier (e.g., ">3091 Alcoho de..."). It is necessary to use parseable
identifiers for the following cases:
1.) If ASN.1 is to be produced from blastall or blastpgp, then "-o" must be
TRUE.
2.) Master-slave alignments are desired (i.e., the '-m' option with a
non-zero value is used).
3.) The gi's are desired as part of the output (i.e., '-I' is used).
4.) fastacmd is used to fetch sequences from the database by accession or gi.
References
Zhang, Zheng, Alejandro A. Schäffer, Webb Miller, Thomas L. Madden,
David J. Lipman, Eugene V. Koonin, and Stephen F. Altschul (1998),
"Protein sequence similarity searches using patterns as seeds",
Nucleic Acids Res. 26:3986-3990.

Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database
search programs", Nucleic Acids Res. 25:3389-3402.

Karlin, Samuel and Stephen F. Altschul (1990), "Methods for
assessing the statistical significance of molecular sequence
features by using general scoring schemes", Proc. Natl. Acad.
Sci. USA 87:2264-68.

Karlin, Samuel and Stephen F. Altschul (1993), "Applications
and statistics for multiple high-scoring segments in molecular
sequences", Proc. Natl. Acad. Sci. USA 90:5873-7.