• [ ] SRA search functions!
  • [ ] Overview of All NCBI Software and Data, and HOWTO’s
  • [ ] How to find SNPs using Blast
  • [ ] API’s and download methods
  • [ ] The Entrez Programming Utilities, Chapter 6 is Unix:
  • [ ] Check out more Ubuntu docs (e.g. the scoring.pdf in /usr/share/doc/blast2/…), which explains exactly the L, K, etc.
  • [ ] New WGS Blast search, limited to taxonomy


The BLAST Suite

BLAST is the Basic Local Alignment Search Tool. blast+ is the C++ rewrite that supersedes the original C implementation.

Ubuntu packages

  • ncbi-blast+ is the package to install
    • binaries: psiblast rpsblast+ tblastn blast_formatter blastx blastdb_aliastool dustmasker makembindex blastdbcmd blastdbcheck makeprofiledb segmasker blastp makeblastdb seedtop+ tblastx convert2blastmask windowmasker gene_info_reader rpstblastn update_blastdb seqdb_perf legacy_blast deltablast blastn windowmasker_2.2.22_adapter blastdbcp
    • documentation: one man page only: ncbi-blast+, referring to (stored in Zotero)
  • blast2 deprecated by blast+
    • legacy_blast [--print-only] ... invokes or shows equivalent from blast+
    • documentation at /usr/share/doc/blast2/index.html mentions blast+
    • binaries: megablast blast2 blastcl3 fastacmd formatdb makemat blastall bl2seq blastall_old taxblast formatrpsdb blastpgp rpsblast impala copymat blastclust seedtop
  • ncbi-tools-bin seems to contain the developer tools (why is it not called ncbi-blast+-dev?)
    • binaries: errhdr asn2asn idfetch asn2idx insdseqget nps2gps asn2xml getmesh gil2bin asn2gb getpub gene2xml subfuse gbseqget vecscreen trna2tbl tbl2asn asnval debruijn trna2sap cleanasn asntool asn2all asn2fsa asndisc sortbyquote findspl asn2ff checksub indexpub asndhuff spidey fa2htgs asnmacro makeset
  • ncbi-data is package required by all three aforementioned packages
    • configuration: /etc/ncbi/{.ncbirc,.nlmstmanrc} note they’re dotfiles (silly)
      • .ncbirc has a lot of config and sets [NCBI] DATA=/usr/share/ncbi/data
      • .ncbirc sets [BLAST] BLASTDB=/data/genomics/ncbi/blast/db but this doesn’t seem to get picked up?
    • binaries: vibrate # preloads the vibrate library so that omitted arguments are prompted for (but the library is not here)
    • data: /usr/share/ncbi/data has 16SCore but e.g. also the genetic code gc.prt

Configuring blast+

  • BLASTDB: environment variable that points to the directory with the databases
  • /etc/ncbi/.ncbirc # sheesh why dotfile?
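
The two configuration sources above can be inspected with a small script. This is only a sketch: the helper name resolve_blastdb is made up for illustration, and the precedence it applies (environment variable first, then the [BLAST] section of .ncbirc) is an assumption, not verified behaviour of blast+.

```python
import configparser
import os


def resolve_blastdb(ncbirc_text, environ=None):
    """Return the database directory, or None if neither source sets it.

    Assumption of this sketch: the BLASTDB environment variable wins over
    the [BLAST] BLASTDB entry in .ncbirc.
    """
    environ = os.environ if environ is None else environ
    if environ.get("BLASTDB"):
        return environ["BLASTDB"]
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the upper-case key names as written
    parser.read_string(ncbirc_text)
    return parser.get("BLAST", "BLASTDB", fallback=None)


# Sample content mirroring the two settings mentioned in these notes.
sample = """\
[NCBI]
DATA=/usr/share/ncbi/data

[BLAST]
BLASTDB=/data/genomics/ncbi/blast/db
"""

print(resolve_blastdb(sample, environ={}))
```

Passing the file content as text keeps the sketch independent of whether /etc/ncbi/.ncbirc exists on the machine; on a real system one would read that file first.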

BLAST Programs

  • blastn - search nucleotide database using nucleotide sequence
    • blastn: classical
    • megablast: intra-species identification (fast, precise)
    • discontiguous megablast: cross-species, search with coding sequences
    • blastn-short: short sequences, cross-species
  • blastp - search protein database using protein query
    • psi-blast: iterative search for position-specific score matrix (PSSM) construction, identify remote relatives for protein family
    • phi-blast: protein alignment with input pattern as anchor / constraint
    • delta-blast: protein similarity search, higher sensitivity than blastp
  • blastx - search protein database using translated nucleotide: identify potential protein products encoded by sequence
  • tblastn - search translated nucleotide using protein query: identify sequences encoding products similar to protein query
  • tblastx - search translated nucleotide database using translated nucleotide query: as tblastn, but identifies sequences encoding products similar to those the nucleotide query itself could encode
  • blast2 - align two sequences (bl2seq)
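
The program list above boils down to a lookup on query type and database type. A minimal sketch follows, with hypothetical helper names (choose_program, blastn_command) and hypothetical file and database names; the -task, -query and -db options are the standard blast+ ones, but nothing is executed here, the command is only assembled.

```python
# (query type, database type) -> program, as in the list above.
PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}

# The blastn variants listed above, spelled as blastn -task values.
BLASTN_TASKS = ("blastn", "megablast", "dc-megablast", "blastn-short")


def choose_program(query_type, db_type, translate_both=False):
    """Pick the BLAST program; translate_both selects tblastx."""
    if translate_both:
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    return PROGRAMS[(query_type, db_type)]


def blastn_command(query_path, db_name, task="megablast"):
    """Build (not run) a blastn command line as an argument list."""
    if task not in BLASTN_TASKS:
        raise ValueError(f"unknown blastn task: {task}")
    return ["blastn", "-task", task, "-query", query_path, "-db", db_name]


print(choose_program("nucleotide", "protein"))
print(" ".join(blastn_command("query.fa", "nt", task="dc-megablast")))
```

The argument list can be handed to subprocess.run once the databases are in place.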

Some other programs at NCBI

  • Standalone Blast+ - with remote option
  • QBlast, URLAPI - RESTful BLAST Sample:
  • blastn_vdb - Search in the SRA (SRR, WGS and TSA files are stored in vdb)
  • MOLE-BLAST - Take number of input sequences, cluster with nearest neighbours in database
  • CD-Search/CDART - Find conserved domains using the curated Conserved Domain Database (CDD), and find other sequences containing these domains
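
For the URLAPI entry above, a request is just a set of form parameters. The sketch below only builds the query strings and sends nothing. The endpoint and the parameter names (CMD, PROGRAM, DATABASE, QUERY, RID, FORMAT_TYPE) are written from memory of the URL API and should be checked against the NCBI documentation; the RID value is a made-up placeholder.

```python
from urllib.parse import urlencode

# Assumed endpoint of the BLAST URL API; verify before use.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def put_params(query, program="blastn", database="nt"):
    """Query string that submits a search; the reply contains a request ID (RID)."""
    return urlencode(
        {"CMD": "Put", "PROGRAM": program, "DATABASE": database, "QUERY": query}
    )


def get_params(rid, format_type="XML"):
    """Query string that fetches the results for a previously returned RID."""
    return urlencode({"CMD": "Get", "RID": rid, "FORMAT_TYPE": format_type})


print(BLAST_URL + "?" + put_params("ACGTACGTACGT"))
print(BLAST_URL + "?" + get_params("EXAMPLERID"))  # placeholder RID
```

A real client would submit the first request, parse the RID from the reply, and poll with the second until the search has finished.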

The BLAST Databases

Explained in the How To BLAST Guide, also see the README on the FTP Server

  • Representative_Genomes (local copy): Archaea and Bacteria Representative and reference genomes from refseq
  • nr: default database: all GenBank + EMBL (Europe) + DDBJ (Japan) + PDB (Protein Data Bank), excluding:
    • PAT (patent division),
    • STS (Sequence Tagged Sites),
    • GSS (Genomic Survey Sequences)
    • HTGS (Unfinished High Throughput Sequences, phases 0,1,2)
    • EST (Expressed Sequence Tag = cDNA, i.e. mRNA or other transcript),
    • TSA (Transcriptome Shotgun Assemblies, assembled from RNAseq SRA),
    • WGS (Whole Genome Shotgun Assemblies, from SRA)
  • nt: default nucleotide database
  • refseq_rna: curated (NM, NR) and predicted (XM, XR) sequences from RefSeq project
  • refseq_genomic: genomic sequences from RefSeq project
  • chromosome: complete genomes and chromosomes from RefSeq project
  • human, mouse G+T: genomic sequences, curated and predicted RNA for current build for human / mouse
  • 16S microbial (local copy): archaea & bacteria 16S rRNA sequences from Targeted Loci Project
  • SRA: Sequence Read Archive: raw sequence data from NGS, also DRA (DDBJ) and ERA (EMBL). To search these use blastn_vdb, and SRA Toolkit:, also documented here:

Reference & representative Prokaryotic Genomes

About (includes explanation of distinction): Browse:


What is GenBank

GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. It is part of the INSDC (GenBank, EMBL - European Mol Bio Lab, DDBJ - DNA Data Bank of Japan). It comprises:

  • NucCore
  • NucEst: Expressed Sequence Tags
  • Nucgss: Genome Survey Sequences

Download locations

Web Pages

  • (Reads) (SRA)
  • (Assembled)
  • (Mapped)
  • (Representative)


New organisation of the download, read the FAQ:

          links to /genomes/all/acc.accver_assver

More confusing: there is also /genomes/Bacteria/Genus_species_ETC_uid12345 # but it is old (THIS LOCAL COPY)

First check the reference (and representative?): wget wget # the accession link is in the summary file
It is also linked from {latest,all}_assembly_versions.
Then check the genbank copy (s/refseq/genbank/ in the URL).

BLAST scores

See also /usr/share/doc/blast2/scoring.pdf.gz

  • S = [lambda * R - ln(K)] / ln(2) is the bit score, where R is the raw score and lambda and K are statistical parameters that depend on the scoring system (substitution matrix and gap penalties)
  • R = match scores + mismatch penalties + gap penalties (for nucleotides; for proteins it is the sum of the scores from the BLOSUM matrix used, plus gap penalties)
  • E = Q * D * 2^-S is the number of alignments with a score of at least S expected by chance, where Q is the query length and D is the database length
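
A small worked example of the two formulas above. The values of lambda and K, the raw score and the lengths are illustrative assumptions only; real values are printed in the statistics footer of a BLAST report.

```python
import math


def bit_score(raw_score, lam, k):
    """S = (lambda * R - ln K) / ln 2"""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """E = Q * D * 2^-S"""
    return query_len * db_len * 2.0 ** (-bits)


# Illustrative parameters, not taken from a real search.
LAMBDA, K = 1.28, 0.46
RAW, QUERY_LEN, DB_LEN = 50, 100, 1_000_000

s = bit_score(RAW, LAMBDA, K)
e = e_value(s, QUERY_LEN, DB_LEN)
print(f"bit score = {s:.1f}, E-value = {e:.3g}")
```

Substituting the first formula into the second gives E = K * Q * D * exp(-lambda * R), which is a handy cross-check: the bit score is the raw score rescaled so that the E-value no longer needs lambda and K.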

Feature Table Format

For NCBI-GenBank, EBI-EMBL and DDBJ: INSDC = International Nucleotide Sequence Database Collaboration