BLAST Command-line Reference
The command-line options for the Blast+ CLI Tools. Taken from the User Manual because the tools lack individual manpages and information is spread all over.
NOTE: the documentation below applies to version 2.2, which is current in Ubuntu 16.04. NCBI released 2.5 in late 2016, when it switched to https for remote queries, shortly after moving from the GI to the accession as the primary identifier. Features such as -gilist may have been deprecated along with the GI.
BLAST Links
- All BLAST Documentation
- Blast+ Command Line Applications User Manual (PDF)
- NCBI C++ Toolkit book and code on GitHub; the examples describe id1_fetch.
Search Programmes
The BLAST search programmes are:
- blastn: nucleotide - nucleotide (regular, mega, dc-mega, short)
- blastp: protein - protein (regular, fast, short)
- blastx: nucleotide - protein (regular, fast)
- tblastn: protein - nucleotide translated (regular, fast)
- tblastx: nucleotide translated - nucleotide translated (ungapped only)
- rpsblast+: protein - conserved domain profiles (CDD)
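- rpstblastn: nucleotide translated - conserved domain profiles (CDD)
- psiblast: protein - protein, position-specific iterated search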
- deltablast: sensitive protein sequence search based on psiblast
The BLAST search programmes share the following set of options.
Common search options
option | type | default | description |
---|---|---|---|
db | string | none | BLAST database name, searched in BLASTDB path (see Configuring BLAST+). |
query | string | stdin | Query file name, contents autodetected and autoresolved: fasta, GIs, accession numbers. |
query_loc | string | none | Location on the query sequence (Format: start-stop). |
out | string | stdout | Output file name. |
evalue | real | 10.0 | Expect value (E) for saving hits. |
subject | string | none | File with subject sequence(s) to search, contents autodetected and autoresolved: fasta, GIs, accession numbers. |
subject_loc | string | none | Location on the subject sequence (Format: start-stop). |
show_gis | flag | N/A | Show NCBI GIs in report. |
num_descriptions | integer | 500 | Show one-line descriptions for this number of database sequences. |
num_alignments | integer | 250 | Show alignments for this number of database sequences. |
max_target_seqs | integer | 500 | Number of aligned sequences to keep. Use with report formats that do not have separate definition line and alignment sections such as tabular (all outfmt > 4). Not compatible with num_descriptions or num_alignments. |
html | flag | N/A | Produce HTML output. |
gilist | string | none | Restrict search of database to GIs listed in this file. Local searches only. Note that blastdb_aliastool can (1) optimise a GI list to binary and (2) create a named alias to a database filtered on a GI list. |
negative_gilist | string | none | Restrict search of database to everything except the GIs listed in this file. Local searches only. |
entrez_query | string | none | Restrict search with the given Entrez query. Remote searches only. |
culling_limit | integer | none | Delete a hit that is enveloped by at least this many higher-scoring hits. |
best_hit_overhang | real | none | Best Hit algorithm overhang value (recommended value: 0.1). |
best_hit_score_edge | real | none | Best Hit algorithm score edge value (recommended value: 0.1). |
dbsize | integer | none | Effective size of the database. |
searchsp | integer | none | Effective length of the search space. |
import_search_strategy | string | none | Search strategy file to read. |
export_search_strategy | string | none | Record search strategy to this file. |
parse_deflines | flag | N/A | Parse query and subject bar delimited sequence identifiers (e.g., gi|129295). |
num_threads | integer | 1 | Number of threads (CPUs) to use in blast search. |
remote | flag | N/A | Execute search on NCBI servers. |
outfmt | string | 0 | Alignment view options (see below). |
Alignment View options (outfmt values)
Note: options 6, 7, and 10 can be additionally configured to produce a custom format specified by space delimited format specifiers (see format specifiers below).
value | format |
---|---|
0 | pairwise (default) |
1 | query-anchored showing identities |
2 | query-anchored no identities |
3 | flat query-anchored, show identities |
4 | flat query-anchored, no identities |
5 | XML Blast output |
6 | tabular, see output specifiers |
7 | tabular with comment lines, see output specifiers |
8 | Text ASN.1 |
9 | Binary ASN.1 |
10 | Comma-separated values, see output specifiers |
11 | BLAST archive format (ASN.1) |
12 | JSON Seqalign output |
13 | JSON Blast output |
14 | XML2 Blast output |
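A custom tabular format is passed as a single -outfmt argument: the format number followed by space-delimited specifiers. The sketch below only assembles such a command line in Python and does not run it. The helper name blast_command and the file names are made up for illustration; the flags are the ones listed in the common options table.

```python
import shlex

def blast_command(program, db, query, columns, evalue=10.0, out=None):
    """Assemble an argv list for a search with custom tabular (outfmt 6) output."""
    # The format number and its specifiers travel as ONE argument,
    # e.g. "6 qseqid sseqid evalue".
    outfmt = " ".join(["6", *columns])
    argv = [program, "-db", db, "-query", query,
            "-evalue", str(evalue), "-outfmt", outfmt]
    if out is not None:
        argv += ["-out", out]
    return argv

# Hypothetical database and query names, for illustration only.
argv = blast_command("blastn", "nt", "query.fa",
                     ["qseqid", "sseqid", "evalue", "bitscore"], evalue=1e-5)
print(shlex.join(argv))  # shell-quoted form, safe to paste into a terminal
```

Passing the list to subprocess.run(argv) avoids shell quoting problems with the space inside the -outfmt value.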
Search Output Specifiers
These apply to the tabular alignment view options (outfmt 6, 7, 10). Asterisks mark the defaults, which correspond to keyword std.
specifier | description |
---|---|
qseqid* | Query Seq-id |
qgi | Query GI |
qacc | Query accession |
qaccver | Query accession.version |
qlen | Query sequence length |
sseqid* | Subject Seq-id |
sallseqid | All subject Seq-id(s), separated by semicolon |
sgi | Subject GI |
sallgi | All subject GIs |
sacc | Subject accession |
saccver | Subject accession.version |
sallacc | All subject accessions |
slen | Subject sequence length |
qstart* | Start of alignment in query |
qend* | End of alignment in query |
sstart* | Start of alignment in subject |
send* | End of alignment in subject |
qseq | Aligned part of query sequence |
sseq | Aligned part of subject sequence |
evalue* | Expect value |
bitscore* | Bit score |
score | Raw score |
length* | Alignment length |
pident* | Percentage of identical matches |
nident | Number of identical matches |
mismatch* | Number of mismatches |
positive | Number of positive-scoring matches |
gapopen* | Number of gap openings |
gaps | Total number of gaps |
ppos | Percentage of positive-scoring matches |
frames | Query and subject frames separated by a ‘/’ |
qframe | Query frame |
sframe | Subject frame |
btop | Blast traceback operations (BTOP) |
staxids | Unique Subject Taxonomy ID(s), separated by semicolon (in numerical order) |
sscinames | Unique Subject Scientific Name(s), separated by semicolon |
scomnames | Unique Subject Common Name(s), separated by semicolon |
sblastnames | Unique Subject Blast Name(s), separated by semicolon (in alphabetical order) |
sskingdoms | Unique Subject Super Kingdom(s), separated by semicolon (in alphabetical order) |
stitle | Subject Title |
salltitles | All Subject Title(s), separated by a ‘<>’ |
sstrand | Subject Strand |
qcovs | Query Coverage Per Subject |
qcovhsp | Query Coverage Per HSP |
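Tabular output is easy to post-process. The following is a minimal sketch of reading outfmt 6 or 7 output into dictionaries, assuming the default std column order (qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore). The function name parse_tabular and the sample line are made up for illustration.

```python
import csv
import io

# Column order of the default "std" keyword.
STD_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
               "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
INT_COLUMNS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_COLUMNS = {"pident", "evalue", "bitscore"}

def convert(name, value):
    """Turn a raw text field into int or float where the column is numeric."""
    if name in INT_COLUMNS:
        return int(value)
    if name in FLOAT_COLUMNS:
        return float(value)
    return value

def parse_tabular(handle, columns=STD_COLUMNS):
    """Yield one dict per hit line; skip blank lines and outfmt 7 comments."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        yield {name: convert(name, value) for name, value in zip(columns, row)}

# Made-up sample: one comment line (as outfmt 7 emits) and one hit.
sample = io.StringIO(
    "# BLASTN\n"
    "q1\ts1\t98.50\t200\t3\t0\t1\t200\t11\t210\t1e-100\t370\n"
)
hits = list(parse_tabular(sample))
print(hits[0]["sseqid"], hits[0]["evalue"])
```

For a custom format, pass the same specifier list that was given to -outfmt as the columns argument; specifiers outside the two numeric sets stay as strings.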
Identifier Auto-resolution
The -query and -subject files can contain sequences or identifiers (GIs, accession numbers), which will be resolved locally and remotely. For remote resolution, DATA_LOADERS, BLASTDB_PROT_DATA_LOADER, and BLASTDB_NUCL_DATA_LOADER must be set (see Configuring BLAST+).
Best-Hit Filtering
Returns only the best matches for each query region that reports matches. Given -best_hit_overhang H and -best_hit_score_edge E, this selects hit A over a hit B when:
- B’s query region extends neither end of A’s query region by more than H times A’s query region length.
- A’s e-value is no worse than that of B, i.e. evalue(A) <= evalue(B).
- A’s score over length has an E edge over that of B, i.e. score(B)/length(B) < (1-E) * score(A)/length(A).
Suggested ranges are H in 0.1 .. 0.25 (larger is more filtering but longer runtime) and E in 0.05 .. 0.25 (larger is less filtering).
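The three conditions can be written as a small predicate. This is a sketch of the rule only, not of the BLAST implementation. The edge test is written as score(B)/length(B) < (1 - E) * score(A)/length(A), which is how the BLAST+ user manual words it. The Hit record, the function name beats, and the use of alignment length for length() are assumptions of this example.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    qstart: int    # start of the alignment on the query
    qend: int      # end of the alignment on the query
    evalue: float
    score: float
    length: int    # alignment length (assumed meaning of length() in the rule)

def beats(a, b, overhang=0.1, score_edge=0.1):
    """Return True if hit `a` is selected over hit `b` under the rule above."""
    a_region = a.qend - a.qstart + 1
    margin = overhang * a_region
    # b may stick out of a's query region by at most `margin` on either end.
    within = (a.qstart - b.qstart) <= margin and (b.qend - a.qend) <= margin
    no_worse = a.evalue <= b.evalue
    denser = b.score / b.length < (1.0 - score_edge) * a.score / a.length
    return within and no_worse and denser

strong = Hit(qstart=1, qend=100, evalue=1e-50, score=200, length=100)
weak = Hit(qstart=5, qend=105, evalue=1e-10, score=60, length=101)
elsewhere = Hit(qstart=1, qend=300, evalue=1e-40, score=500, length=300)

print(beats(strong, weak))       # weak overlaps strong and scores far lower
print(beats(strong, elsewhere))  # elsewhere extends well past strong's region
```

With these made-up hits the first call is true and the second is false, because the third hit covers query positions that the first one does not.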
BLASTN options
The blastn application searches a nucleotide query against nucleotide subject sequences or a nucleotide database.
It has four -task options, which preset default values for specific types of search:
- megablast: for very similar sequences (e.g., sequencing errors),
- dc-megablast: discontiguous megablast, typically used for inter-species comparisons,
- blastn: the traditional program used for inter-species comparisons,
- blastn-short: optimized for sequences less than 50 nucleotides.
In addition to the Common Search Options, these are the blastn options. Defaults that differ per task are marked (m) megablast, (d) dc-megablast, (n) blastn, (s) blastn-short:
blastn option | type | default | description |
---|---|---|---|
word_size | integer | 28(m) 11(d,n) 7(s) | Length of initial exact match. Note: dc allows non-consecutive letters to match. |
gapopen | integer | 0(m) 5(d,n,s) | Cost to open a gap. See below. |
gapextend | integer | -(m) 2(d,n,s) | Cost to extend a gap. This default is a function of the reward/penalty value. See below. |
reward | integer | 1(m,s) 2(d,n) | Reward for a nucleotide match. |
penalty | integer | -1(m) -3(d,n,s) | Penalty for a nucleotide mismatch. |
ungapped | flag | N/A | Perform ungapped alignment. |
strand | string | both | Query strand(s) to search against database/subject. Choice of both, minus, or plus. |
perc_identity | integer | 0 | Percent identity cutoff. |
dust | string | 20 64 1 | Filter query sequence with dust. |
filtering_db | string | none | Mask query using the sequences in this database. |
window_masker_taxid | integer | none | Enable WindowMasker filtering using a Taxonomic ID. |
window_masker_db | string | none | Enable WindowMasker filtering using this file. |
soft_masking | boolean | true | Apply filtering locations as soft masks (i.e., only for finding initial matches). |
lcase_masking | flag | N/A | Use lower case filtering in query and subject sequence(s). |
db_soft_mask | integer | none | Filtering algorithm ID to apply to the BLAST database as soft mask (i.e., only for finding initial matches). |
db_hard_mask | integer | none | Filtering algorithm ID to apply to the BLAST database as hard mask (i.e., sequence is masked for all phases of search). |
xdrop_ungap | real | 20 | Heuristic value (in bits) for ungapped extensions. |
xdrop_gap | real | 30 | Heuristic value (in bits) for preliminary gapped extensions. |
xdrop_gap_final | real | 100 | Heuristic value (in bits) for final gapped alignment. |
min_raw_gapped_score | integer | none | Minimum raw gapped score to keep an alignment in the preliminary gapped and trace-back stages. Normally set based upon expect value. |
Megablast specific
blastn option | type | default | description |
---|---|---|---|
use_index | boolean | false | Use MegaBLAST database index. Indices may be created with the makembindex application. |
index_name | string | none | MegaBLAST database index name. |
no_greedy | flag | N/A | Use non-greedy dynamic programming extension. |
DC-Megablast specific
blastn option | type | default | description |
---|---|---|---|
template_type | string | coding | Discontiguous MegaBLAST template type. Allowed values are coding, optimal and coding_and_optimal. |
template_length | integer | 18 | Discontiguous MegaBLAST template length. |
window_size | integer | 40 | Multiple hits window size, use 0 to specify 1-hit algorithm. |
BLASTP options
The blastp application searches a protein sequence against protein subject sequences or a protein database. This table reflects the 2.2.31 BLAST+ release.
Tasks
- blastp: standard protein-protein comparisons
- blastp-short: optimized for query sequences shorter than 30 residues
- blastp-fast: uses a larger word size for the initial word matching, as described in PMID: 17921491.
In addition to the Common Search Options, these are the blastp options. Defaults that differ per task are marked (p) blastp, (s) blastp-short, (f) blastp-fast:
blastp option | type | default | description |
---|---|---|---|
word_size | integer | 3(p) 2(s) 6(f) | Word size of initial match. Valid word sizes are 2-7. |
gapopen | integer | 11(p,f) 9(s) | Cost to open a gap. |
gapextend | integer | 1 | Cost to extend a gap. |
matrix | string | BLOSUM62(p,f) PAM30(s) | Scoring matrix name. |
threshold | integer | 11(p) 16(s) 21(f) | Minimum score to add a word to the BLAST lookup table. |
seg | string | no | Filter query sequence with SEG (Format: ‘yes’, ‘window locut hicut’, or ‘no’ to disable). |
soft_masking | boolean | false(p,f) N/A(s) | Apply filtering locations as soft masks (i.e., only for finding initial matches). Not for blastp-short. |
lcase_masking | flag | N/A | Use lower case filtering in query and subject sequence(s). |
db_soft_mask | integer | none | Filtering algorithm ID to apply to the BLAST database as soft mask (i.e., only for finding initial matches). |
db_hard_mask | integer | none | Filtering algorithm ID to apply to the BLAST database as hard mask (i.e., sequence is masked for all phases of search). |
xdrop_gap_final | real | 25 | Heuristic value (in bits) for final gapped alignment. |
window_size | integer | 40(p,f) 15(s) | Multiple hits window size, use 0 to specify 1-hit algorithm. |
use_sw_tback | flag | N/A | Compute locally optimal Smith-Waterman alignments. |
comp_based_stats | string | 2(p,f) 0(s) | Use composition-based statistics (see below). |
Composition-based stats option
value | meaning |
---|---|
D,d | Default |
F,f,0 | No composition-based statistics |
1 | Composition-based statistics as in NAR 29:2994-3005, 2001 |
T,t,2 | Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties |
3 | Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally |
BLASTX options
The blastx application translates a nucleotide query and searches it against protein subject sequences or a protein database. It has two tasks:
- blastx: standard searches
- blastx-fast: uses a larger word size for the initial word matching, as described in PMID: 17921491.
In addition to the Common Search Options, these are the blastx options. Defaults that differ per task are marked (x) blastx, (f) blastx-fast:
blastx option | type | default | description |
---|---|---|---|
word_size | integer | 3(x) 6(f) | Valid word sizes are 2-7. |
gapopen | integer | 11 | Cost to open a gap. |
gapextend | integer | 1 | Cost to extend a gap. |
matrix | string | BLOSUM62 | Scoring matrix name. |
threshold | integer | 12(x) 21(f) | Minimum score to add a word to the BLAST lookup table. |
seg | string | 12 2.2 2.5 | Filter query sequence with SEG (Format: ‘yes’, ‘window locut hicut’, or ‘no’ to disable). |
soft_masking | boolean | false | Apply filtering locations as soft masks (i.e., only for finding initial matches). |
lcase_masking | flag | N/A | Use lower case filtering in query and subject sequence(s). |
db_soft_mask | integer | none | Filtering algorithm ID to apply to the BLAST database as soft mask (i.e., only for finding initial matches). |
db_hard_mask | integer | none | Filtering algorithm ID to apply to the BLAST database as hard mask (i.e., sequence is masked for all phases of search). |
xdrop_gap_final | real | 25 | Heuristic value (in bits) for final gapped alignment. |
window_size | integer | 40 | Multiple hits window size, use 0 to specify 1-hit algorithm. |
strand | string | both | Query strand(s) to search against database/subject. Choice of both, minus, or plus. |
query_genetic_code | integer | 1 | Genetic code to translate query, see ftp://ftp.ncbi.nih.gov/entrez/misc/data/gc.prt |
max_intron_length | integer | 0 | Length of the largest intron allowed in a translated nucleotide sequence when linking multiple distinct alignments (a negative value disables linking). |
comp_based_stats | integer | 2 | Use composition-based statistics (see Composition-based stats option under BLASTP options). |
TBLASTN options
The tblastn application searches a protein query against nucleotide subject sequences or a nucleotide database translated at search time. It has two tasks:
- tblastn: for standard searches,
- tblastn-fast: uses a larger word size for the initial word matching, as described in PMID: 17921491.
In addition to the Common Search Options, these are the tblastn options. Defaults that differ per task are marked (n) tblastn, (f) tblastn-fast:
tblastn option | type | default | description |
---|---|---|---|
word_size | integer | 3(n) 6(f) | Word size for initial match. Valid word sizes are 2-7. |
gapopen | integer | 11 | Cost to open a gap. |
gapextend | integer | 1 | Cost to extend a gap. |
matrix | string | BLOSUM62 | Scoring matrix name. |
threshold | integer | 13(n) 21(f) | Minimum score to add a word to the BLAST lookup table. |
seg | string | 12 2.2 2.5 | Filter query sequence with SEG (Format: ‘yes’, ‘window locut hicut’, or ‘no’ to disable). |
soft_masking | boolean | false | Apply filtering locations as soft masks (i.e., only for finding initial matches). |
lcase_masking | flag | N/A | Use lower case filtering in query and subject sequence(s). |
db_soft_mask | integer | none | Filtering algorithm ID to apply to the BLAST database as soft mask (i.e., only for finding initial matches). |
db_hard_mask | integer | none | Filtering algorithm ID to apply to the BLAST database as hard mask (i.e., sequence is masked for all phases of search). |
xdrop_gap_final | real | 25 | Heuristic value (in bits) for final gapped alignment. |
window_size | integer | 40 | Multiple hits window size, use 0 to specify 1-hit algorithm. |
db_gen_code | integer | 1 | Genetic code to translate subject sequences, see ftp://ftp.ncbi.nih.gov/entrez/misc/data/gc.prt |
max_intron_length | integer | 0 | Length of the largest intron allowed in a translated nucleotide sequence when linking multiple distinct alignments (a negative value disables linking). |
comp_based_stats | string | 2 | Use composition-based statistics (see Composition-based stats option under BLASTP options). |
TBLASTX options
The tblastx application searches a translated nucleotide query against translated nucleotide subject sequences or a translated nucleotide database. Only ungapped searches are supported for tblastx.
In addition to the Common Search Options, these are the tblastx options:
tblastx option | type | default | description |
---|---|---|---|
word_size | integer | 3 | Word size for initial match. |
matrix | string | BLOSUM62 | Scoring matrix name. |
threshold | integer | 13 | Minimum word score to add the word to the BLAST lookup table. |
seg | string | 12 2.2 2.5 | Filter query sequence with SEG (Format: ‘yes’, ‘window locut hicut’, or ‘no’ to disable). |
soft_masking | boolean | false | Apply filtering locations as soft masks (i.e., only for finding initial matches). |
lcase_masking | flag | N/A | Use lower case filtering in query and subject sequence(s). |
db_soft_mask | integer | none | Filtering algorithm ID to apply to the BLAST database as soft mask (i.e., only for finding initial matches). |
db_hard_mask | integer | none | Filtering algorithm ID to apply to the BLAST database as hard mask (i.e., sequence is masked for all phases of search). |
strand | string | both | Query strand(s) to search against database subject sequences. Choice of both, minus, or plus. |
query_genetic_code | integer | 1 | Genetic code to translate query, see ftp://ftp.ncbi.nih.gov/entrez/misc/data/gc.prt |
db_gen_code | integer | 1 | Genetic code to translate subject sequences, see ftp://ftp.ncbi.nih.gov/entrez/misc/data/gc.prt |
max_intron_length | integer | 0 | Length of the largest intron allowed in a translated nucleotide sequence when linking multiple distinct alignments (a negative value disables linking). |
RPSBLAST options
The rpsblast application searches a protein query against the conserved domain database (CDD), which is a set of protein profiles. Many of the common options such as matrix or word threshold are set when the CDD is built and cannot be changed by the rpsblast application. A search ready CDD can be downloaded from ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/
In addition to the Common Search Options, these are the rpsblast options:
rpsblast option | type | default | description |
---|---|---|---|
window_size | integer | 40 | Multiple hits window size, use 0 to specify 1-hit algorithm. |
xdrop_ungap | real | 15 | Heuristic value (in bits) for ungapped extensions. |
xdrop_gap | real | 25 | Heuristic value (in bits) for preliminary gapped extensions. |
xdrop_gap_final | real | 40 | Heuristic value (in bits) for final gapped alignment. |
seg | string | no | Filter query sequence with SEG (Format: ‘yes’, ‘window locut hicut’, or ‘no’ to disable). |
soft_masking | boolean | false | Apply filtering locations as soft masks (i.e., only for finding initial matches). |
comp_based_stats | string | 1 | Use composition-based statistics (see Composition-based stats option under BLASTP options). |
DELTABLAST options
DELTA-BLAST uses RPS-BLAST to search for conserved domains matching a query, constructs a PSSM from the sequences associated with the matching domains, and searches a sequence database. Its sensitivity is comparable to that of PSI-BLAST, but it does not require several iterations of searches against a large sequence database.
Database Management
The BLAST database & index management related tools:
- makeblastdb: create BLAST+ database
- makembindex: create index for megablast or (new) srsearch
- makeprofiledb: create profile database for rpsblast+
- blastdbcmd: read and report from BLAST+ database
- blastdb_aliastool: manage subsetted and multipart databases
- blastdbcheck
- blastdbcp
MAKEBLASTDB options
Makeblastdb application options. This application builds a BLAST database. Note that blastdb_aliastool can create a ‘virtual’ database by subsetting a database with a GI list, or ‘supersetting’ across a number of databases.
makeblastdb option | type | default | description |
---|---|---|---|
in | string | stdin | Input file/database name. |
out | string | input file name | Name of BLAST database to be created. Input file name is used if none provided. This field is required if input consists of multiple files. |
input_type | string | fasta | Input file type, it may be any of fasta, blastdb, asn1_txt, asn1_bin. |
dbtype | string | prot | Molecule type of input, values can be nucl or prot. |
title | string | none | Title for BLAST database. If not set, the input file name will be used. |
parse_seqids | flag | N/A | Parse bar delimited sequence identifiers (e.g., gi|129295) in FASTA input, so they can be used to filter queries on identifier lists. See section About Sequence Identifiers below. |
hash_index | flag | N/A | Create index of sequence hash values. |
mask_data | string | none | Comma-separated list of input files containing masking data as produced by NCBI masking applications (e.g. dustmasker, segmasker, windowmasker). |
max_file_size | string | 1GB | Maximum file size to use for BLAST database. |
taxid | integer | none | Taxonomy ID to assign to all sequences. |
taxid_map | string | none | File with two columns mapping sequence ID to the taxonomy ID. The second column should be the NCBI taxonomy ID (e.g., 9606 for human). The first column is the sequence ID in one of these forms: fasta with accessions (emb|X17276.1|), fasta with GI (gi|4), GI as a bare number (4), or a local ID which must be prefixed with lcl (lcl|4). See also the section About Sequence Identifiers. |
logfile | string | none | Program log file (default is stderr). |
MAKEMBINDEX options
The indexed databases created by makembindex are used by production MegaBLAST software and by a new srsearch utility designed to quickly search for nearly exact matches (up to one mismatch) of short queries against a genomic database. When a FASTA formatted file is used as the input, then masking by lower case letters is incorporated in the index. Makembindex can currently build two types of indices, called old style and new style indexing. The NCBI offers full support for the new style and has deprecated the old style. A MegaBLAST search with a new style index requires that both the index and the corresponding BLAST database be present. The index structure is described in PMID: 18567917. Please cite this paper in any publication that uses makembindex.
makembindex option | type | default | description |
---|---|---|---|
input | string | stdin | Input file name or BLAST database name, depending on the value of the iformat parameter. For FASTA formatted input, this parameter is optional and defaults to the program’s standard input stream. |
output | string | none | The resulting index name. The index itself can consist of multiple files, called volumes. |
iformat | string | fasta | The input format selector. Possible values are ‘fasta’ and ‘blastdb’. |
old_style_index | boolean | true | If set to ‘false’ the new style index is created. New style indices require a BLAST database as input (use -iformat blastdb), which can be downloaded from the NCBI FTP site or created with makeblastdb. The option -output is ignored for a new style index. New style indices are always created at the same location as the corresponding BLAST database. |
db_mask | integer | none | Exclude masked regions of BLAST db from the index. Use makeblastdb to discover the algorithm ID to be used as input for this argument. |
legacy | boolean | true | This is a compatibility feature to support current production MegaBLAST. If true, then -stride, -nmer, and -ws_hint are ignored. The legacy format must be used for BLAST. |
nmer | integer | 12 | N-mer size to use. Ignored if -legacy is specified. |
ws_hint | integer | 28 | This is an optimization hint for makembindex that indicates an expected minimum match size in searches that use the index. If n is the value of the -nmer parameter and s is the value of the -stride parameter, then the value of -ws_hint must be at least n + s - 1. |
stride | integer | 5 | makembindex will index every stride-th N-mer of the database. |
volsize | integer | 1536 | Target index volume size in megabytes. |
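The ws_hint constraint from the table (at least n + s - 1, with n the -nmer value and s the -stride value) is simple arithmetic; with the defaults 12 and 5 the minimum is 16, so the default hint of 28 satisfies it. A small sketch for checking a parameter combination before building an index, with made-up function names:

```python
def min_ws_hint(nmer=12, stride=5):
    """Smallest -ws_hint allowed for the given -nmer and -stride values."""
    return nmer + stride - 1

def check_index_params(nmer=12, stride=5, ws_hint=28):
    """Raise ValueError if ws_hint is below n + s - 1; return the minimum otherwise."""
    required = min_ws_hint(nmer, stride)
    if ws_hint < required:
        raise ValueError(
            f"ws_hint={ws_hint} is too small: need at least {required} "
            f"for nmer={nmer}, stride={stride}"
        )
    return required

print(check_index_params())  # defaults from the table above
```

Note that the table states these three options are ignored when -legacy is true, so the check only matters for non-legacy indices.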
MAKEPROFILEDB options
This application builds an RPS-BLAST database (which includes the files for a standard protein BLAST database). COBALT (a multiple sequence alignment program) and DELTA-BLAST both use RPS-BLAST searches as part of their processing, but use specialized versions of the database. This application can build databases for COBALT, DELTA-BLAST, and a standard RPS-BLAST search. The dbtype option (see entry in table) determines which flavor of the database is built.
makeprofiledb option | type | default | description |
---|---|---|---|
in | string | stdin | Input file that contains a list of scoremat files (delimited by space, tab, or newline). |
binary | flag | N/A | The scoremat files are binary ASN.1. |
title | string | none | Title for RPS-BLAST database. If not set, the input file name will be used. |
threshold | real | 9.82 | Threshold for RPSBLAST lookup table. |
out | string | input file name | Name of BLAST database to be created. Input file name is used if none provided. |
max_file_size | string | 1GB | Maximum file size to use for BLAST database. |
dbtype | string | rps | Specifies use for RPSBLAST db. One of rps, cobalt, or delta. |
index | boolean | true | Creates index files for the standard BLAST database (equivalent to parse_seqids with makeblastdb). |
gapopen | integer | none | Cost to open a gap. Used only if scoremat files do not contain PSSM scores, otherwise ignored. |
gapextend | integer | none | Cost to extend a gap by one residue. Used only if scoremat files do not contain PSSM scores, otherwise ignored. |
scale | real | 100 | PSSM scale factor. |
matrix | string | BLOSUM62 | Matrix to use in constructing PSSM. One of BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, BLOSUM90, PAM250, PAM30 or PAM70. Used only if scoremat files do not contain PSSM scores, otherwise ignored. |
obsr_threshold | real | 6 | Exclude domains with maximum number of independent observations below this value (for use in DELTA-BLAST searches). |
exclude_invalid | boolean | true | Exclude domains that do not pass validation test (for use in DELTA-BLAST searches). |
logfile | string | none | Program log file (default is stderr). |
BLASTDBCMD options
This application reads a BLAST database and produces reports. For every format except ‘%f’ (default), each line of output will correspond to a sequence.
blastdbcmd option | type | default | description |
---|---|---|---|
db | string | nr | BLAST database name. |
dbtype | string | guess | Molecule type stored in BLAST database, one of nucl, prot, or guess. |
entry | string | none | Comma-delimited search string(s) of sequence identifiers, e.g. 555, AC147927, gnl|dbname|tag, or all to select all sequences in the database. See also section About Sequence Identifiers below. |
entry_batch | string | none | Input file for batch processing. The format requires one entry per line; each line should begin with the sequence ID followed by any of the following optional specifiers (in any order): range (format: from-[to], 1-based inclusive), strand (‘plus’ or ‘minus’), or masking algorithm ID (integer value representing the available masking algorithm). |
pig | integer | none | PIG (protein identity group) to retrieve. |
info | flag | N/A | Print BLAST database information. |
range | string | none | Range of sequence to extract (Format: start-stop). |
strand | string | plus | Strand of nucleotide sequence to extract. Choice of plus or minus. |
mask_sequence_with | string | none | Produce lower-case masked FASTA using the algorithm IDs specified. |
out | string | stdout | Output file name. |
target_only | flag | N/A | Definition line should contain target GI or accession only. |
get_dups | flag | N/A | Retrieve duplicate accessions. |
line_length | integer | 80 | Line length for output. |
ctrl_a | flag | N/A | Use Ctrl-A as the non-redundant definition line separator. |
outfmt | string | %f | Output format (see Database output specifiers). |
Database output specifiers
Note that all specifiers except %f (default) produce a single line per result.
specifier | meaning |
---|---|
%f | sequence in FASTA format |
%s | sequence data (without defline) |
%a | accession |
%g | gi |
%o | ordinal id (OID) |
%t | sequence title |
%l | sequence length |
%T | taxid |
%L | common taxonomic name |
%S | scientific name |
%P | PIG |
%m X | sequence masking data, where X is an optional comma-separated list of integers to specify the algorithm ID(s) to display (or all masks if absent or invalid specification). Masking data will be displayed as a series of ‘N-M’ values separated by ‘;’ or the word ‘none’ if none are available. |
BLASTDB_ALIASTOOL options
Optimise a textual GI list into a binary list:
blastdb_aliastool -gi_file_in gilist.txt -gi_file_out gilist.bin
Create an aliased database using a GI list (use a binary GI list or the tool will create it; note that the GI list must stay with the generated .nal file):
blastdb_aliastool -db nt -dbtype nucl -title "Subset database" -gilist gilist.bin -out subsetdb
To create a subset database for a specific taxonomy ID, generate the GI list first. Ways of doing this:
- Use the gi_taxid_{nucl,prot}.dmp GI-taxid mapping files from ftp://ftp.ncbi.nih.gov/pub/taxonomy/ (see the readme).
- Perform an Entrez query with query txidXXXX[ORGN] where XXXX is the taxid, then choose “Send to File” and select GI List.
- Trawl recursively through the taxdump nodes and names files from the archive at ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
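The first route can be sketched in a few lines of Python. This assumes the mapping file has two whitespace-separated columns, GI then taxid (check the readme before relying on it), and that a textual GI list is one GI per line. The function name and the sample values are made up.

```python
import io

def gis_for_taxid(dump_lines, taxid):
    """Yield the GI of every line whose taxid column equals `taxid`."""
    wanted = str(taxid)
    for line in dump_lines:
        fields = line.split()
        # Assumed layout: two whitespace-separated columns, GI then taxid.
        if len(fields) == 2 and fields[1] == wanted:
            yield fields[0]

# Tiny in-memory stand-in for gi_taxid_nucl.dmp (made-up values).
sample_dump = io.StringIO("2\t9913\n3\t9913\n4\t9646\n5\t9913\n")
gi_list = list(gis_for_taxid(sample_dump, 9913))

# One GI per line, ready to be written to a file such as gilist.txt.
gilist_text = "\n".join(gi_list) + "\n"
print(gilist_text, end="")
```

This matches the exact taxid only. Sequences assigned to child taxa are not picked up; covering those needs the nodes file from the taxdump route in the last bullet.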
Create an alias for a multi-volume BLAST database:
blastdb_aliastool -dblist ... -num_volumes ...
seqdb_perf
TBD.
Filtering and Masking
dustmasker
Identifies and masks regions of low complexity in nucleotide sequences. See ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/dustmasker/README.dustmasker
segmasker
Identifies and masks regions of low complexity in protein sequences. See ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/segmasker/README.segmasker
windowmasker
Identifies sequences occurring too often to be of interest to most users. See ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/windowmasker/README.windowmasker
convert2blastmask
TBD.
Miscellaneous Tools & Topics
Configuring BLAST+
The blast+ toolkit reads .ncbirc in the current directory, the user's $HOME, or the directory $NCBI. On Ubuntu, package ncbi-data installs /etc/ncbi/.ncbirc, so set export NCBI=/etc/ncbi and add a [BLAST] section having at least BLASTDB pointing to the directory/ies containing BLAST databases, and optionally DATA_LOADERS, BLASTDB_PROT_DATA_LOADER, BLASTDB_NUCL_DATA_LOADER for identifier auto-resolution.
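The file is a plain INI-style file with KEY=value lines under the section header. As a rough sketch, the [BLAST] section can be generated with Python's configparser; the function name ncbirc_text and the directory /data/blastdb are placeholders, and the values for the optional loader keys are left to the caller.

```python
import configparser
import io

def ncbirc_text(settings):
    """Render a [BLAST] section from a dict of KEY -> value strings."""
    config = configparser.ConfigParser(interpolation=None)
    config.optionxform = str  # keep keys upper-case instead of lower-casing them
    config["BLAST"] = dict(settings)
    buffer = io.StringIO()
    # Write KEY=value without spaces around the delimiter.
    config.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()

# Placeholder path; point BLASTDB at the directory holding your databases.
text = ncbirc_text({"BLASTDB": "/data/blastdb"})
print(text, end="")
```

The returned text can then be written to .ncbirc in one of the locations listed above.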
About Sequence Identifiers
Defined here as in the tables below. The “lcl” and “general database reference” (gnl|MYDB|123) identifiers (at least) work for parse_seqids and will be indexed in the resulting database. Sequence identifiers can be concatenated, separated by a vertical bar (who makes this up?). So for instance gi|1234|ref|NP_5432.1 would make the sequence retrievable (or filterable with -gilist) using any of gi|1234|ref|NP_5432.1, 1234, ref|NP_5432, NP_5432.1, etc. Any string following the space after an identifier will be the (unparsed) title for the sequence.
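To make the layout concrete, here is a small sketch that splits a FASTA definition line into its identifier string, the bar-separated fields, and the title. The helper split_defline and the example title are invented for illustration. It only mimics the layout described above and is not how BLAST+ itself parses identifiers.

```shell
# split_defline LINE: print the identifier string, its fields, and the title.
split_defline() {
    line=${1#>}                      # drop the leading '>' if present
    ids=${line%% *}                  # everything up to the first space
    case $line in
        *' '*) title=${line#* } ;;   # text after the first space
        *)     title= ;;             # no space, so no title
    esac
    printf 'ids=%s\n' "$ids"
    # One bar-separated field per line, each prefixed with "field=".
    printf '%s\n' "$ids" | tr '|' '\n' | sed 's/^/field=/'
    printf 'title=%s\n' "$title"
}

split_defline '>gi|1234|ref|NP_5432.1 hypothetical protein'
```

For the example line this prints the full identifier string, then the fields gi, 1234, ref and NP_5432.1, then the title.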
Flattened Sequence ID format
A flattened sequence ID has one of the following three formats, where square brackets surround optional elements. The type is a number, indicating who assigned the ID (numbers in table below):
type([name or locus][,[accession][,[release][,version]]])
type=accession[.version]
type:number
FASTA Sequence ID format
Type | Description | Format | Example |
---|---|---|---|
1 | Local | lcl|[integer|string] | lcl|123, lcl|hmm271 |
2 | GenInfo backbone sequence ID | bbs|integer | bbs|123 |
3 | GenInfo backbone molecule type | bbm|integer | bbm|123 |
4 | GenInfo import ID | gim|integer | gim|123 |
5 | GenBank | gb|accession|locus | gb|M73307|AGMA13GT |
6 | European Mol Biol Lab (EMBL) | emb|accession|locus | emb|CAM43271.1| |
7 | Protein Info Resource (PIR) | pir|accession|name | pir||G36364 |
8 | SWISS-PROT | sp|accession|name | sp|P01013|OVAX_CHICK |
9 | Patent | pat|country|patent|sequence | pat|US|RE33188|1 |
- | Pre-grant patent | pgp|country|application-number|seq-number | pgp|EP|0238993|7 |
10 | RefSeq | ref|accession|name | ref|NM_010450.1| |
11 | General database reference | gnl|database|[integer|string] | gnl|taxon|9606, gnl|PID|e1632 |
12 | GenInfo integrated database (GI) | gi|integer | gi|21434723 |
13 | DNA Bank of Japan (DDBJ) | dbj|accession|locus | dbj|BAC85684.1| |
14 | Protein Research Foundation (PRF) | prf|accession|name | prf||0806162C |
15 | Protein Database (PDB) | pdb|entry|chain | pdb|1I4L|D |
16 | Third-party annot to GenBank | tpg|accession|name | tpg|BK003456| |
17 | Third-party annot to EMBL | tpe|accession|name | tpe|BN000123| |
18 | Third-party annot to DDBJ | tpd|accession|name | tpd|FAA00017| |
19 | TrEMBL | tr|accession|name | tr|Q90RT2|Q90RT2_9HIV1 |
- | Genome pipeline (internal) | gpp|accession|name | gpp|GPC_123456789| |
- | Named annotation track (internal) | nat|accession|name | nat|AT_123456789.1| |
BLASTN reward/penalty values
BLASTN uses a simple approach to score alignments, with matching bases assigned a reward and mismatching bases assigned a penalty. It is important to choose reward/penalty values appropriate to the sequences being aligned, with the (absolute) reward/penalty ratio increasing for more divergent sequences. Rules of thumb for the reward/penalty ratio are:
- 0.33 (1/-3) is appropriate for sequences that are about 99% conserved;
- 0.5 (1/-2) is best for sequences that are 95% conserved;
- about unity (1/-1) is best for sequences that are 75% conserved.
For each reward/penalty pair, a number of different gap costs are supported. A gap cost consists of a value to open a gap and a value to extend it by one base. Default costs are:
- MegaBLAST: opening cost is 0; the extension cost is half the absolute value of (two mismatch scores minus one match score), i.e. reward/2 + |penalty|, which gives 3.5 for 1/-3.
- Other tasks of blastn: 5 to open a gap and 2 to extend it by one base.
The table below presents the supported reward/penalty values and gap costs. blastn also supports gap costs more stringent than those listed. Default megaBLAST gap costs are shown in the right-most column. Accurate statistics for these default megaBLAST gap costs can only be calculated for the most stringent reward/penalty values, but the values listed in the middle column can always be used.
reward/penalty | gap costs (open/extend) | default MegaBLAST gap costs (open/extend) |
---|---|---|
1/-5 | 3/3 | 0/5.5 |
1/-4 | 1/2, 0/2, 2/1, 1/1 | 0/4.5 |
2/-7 | 2/4, 0/4, 4/2, 2/2 | 0/8 |
1/-3 | 2/2, 1/2, 0/2, 2/1, 1/1 | 0/3.5 |
2/-5 | 2/4, 0/4, 4/2, 2/2 | 0/6 |
1/-2 | 2/2, 1/2, 0/2, 3/1, 2/1, 1/1 | 0/2.5 |
2/-3 | 4/4, 2/4, 0/4, 3/3, 6/2, 5/2, 4/2, 2/2 | 0/4 |
3/-4 | 6/3, 5/3, 4/3, 6/2, 5/2, 4/2 | N/A |
4/-5 | 6/5, 5/5, 4/5, 3/5 | N/A |
1/-1 | 3/2, 2/2, 1/2, 0/2, 4/1, 3/1, 2/1 | N/A |
3/-2 | 5/5 | N/A |
5/-4 | 10/6, 8/6 | N/A |
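The right-most column can be reproduced from the rule for the default MegaBLAST extension cost. A small sketch, assuming the rule reads as reward/2 minus the (negative) penalty, which agrees with every row of the table that lists a value. megablast_gapextend is a made-up helper name, not a BLAST+ tool.

```shell
# megablast_gapextend REWARD PENALTY: default MegaBLAST gap extension cost.
# PENALTY is given as a negative number, as in the table above.
megablast_gapextend() {
    awk -v r="$1" -v p="$2" 'BEGIN { print r / 2 - p }'
}

megablast_gapextend 1 -3    # prints 3.5
megablast_gapextend 2 -7    # prints 8
```

When overriding the defaults, these values correspond to the blastn options -reward, -penalty, -gapopen and -gapextend.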
legacy_blast
Runs a legacy BLAST command line using BLAST+. Use --print_only to see which BLAST+ command would be invoked instead.
gene_info_reader
Convert between GI and Gene IDs.
blast_formatter
Reformat the output of a blast job without needing to rerun the query.
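A sketch of the usual workflow, assuming the search was first saved in BLAST archive format (-outfmt 11). The wrapper function reformat_archive and all file names are placeholders for illustration.

```shell
# reformat_archive ARCHIVE: re-render a saved search as tabular and XML output.
# Output names are derived from ARCHIVE by replacing a trailing ".asn".
reformat_archive() {
    blast_formatter -archive "$1" -outfmt 6 -out "${1%.asn}.tsv"
    blast_formatter -archive "$1" -outfmt 5 -out "${1%.asn}.xml"
}

# Run the search once, saving an archive, then reformat at will:
#   blastn -query query.fa -db nt -outfmt 11 -out result.asn
#   reformat_archive result.asn
```

The search itself only runs once. Each blast_formatter call just renders the saved archive in another output format.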
seedtop+
TODO