NCBI Notes • io.zwets.it

TODO

SRA search functions!
Overview of all NCBI software and data, and HOWTOs: https://www.ncbi.nlm.nih.gov/guide/data-software/ http://www.ncbi.nlm.nih.gov/home/tutorials.shtml
How to find SNPs using Blast https://www.ncbi.nlm.nih.gov/guide/howto/view-all-snps/ ftp://ftp.ncbi.nih.gov/pub/factsheets/HowTo_Finding_SNP_by_BLAST.pdf
APIs and download methods ftp://ftp.ncbi.nih.gov/pub/factsheets/Factsheet_bulk_download.pdf https://www.ncbi.nlm.nih.gov/guide/howto/automate-blast-searches-ncbi-server
The Entrez Programming Utilities http://www.ncbi.nlm.nih.gov/books/NBK25501/ Chapter 6 is Unix: http://www.ncbi.nlm.nih.gov/books/n/helpeutils/chapter6/
Check out more Ubuntu docs (e.g. the scoring.pdf in /usr/share/doc/blast2/…); it explains exactly what L, K, etc. are
New WGS BLAST search, limited to taxonomy: ftp://ftp.ncbi.nlm.nih.gov/blast/WGS_TOOLS/README_BLASTWGS.txt

NCBI Links

Overall

Blast Links

The BLAST Suite

BLAST is the Basic Local Alignment Search Tool. blast+ is the C++ rewrite that supersedes the original C implementation.

Ubuntu packages

ncbi-blast+ is the package to install
- binaries: psiblast rpsblast+ tblastn blast_formatter blastx blastdb_aliastool dustmasker makembindex blastdbcmd blastdbcheck makeprofiledb segmasker blastp makeblastdb seedtop+ tblastx convert2blastmask windowmasker gene_info_reader rpstblastn update_blastdb seqdb_perf legacy_blast deltablast blastn windowmasker_2.2.22_adapter blastdbcp
- documentation: one man page only: ncbi-blast+, referring to http://www.ncbi.nlm.nih.gov/books/NBK1763/ (stored in Zotero)
blast2 is deprecated by blast+
- legacy_blast [--print-only] ... invokes or shows equivalent from blast+
- documentation at /usr/share/doc/blast2/index.html mentions blast+
- binaries: megablast blast2 blastcl3 fastacmd formatdb makemat blastall bl2seq blastall_old taxblast formatrpsdb blastpgp rpsblast impala copymat blastclust seedtop
ncbi-tools-bin seems to be the developer tools (why not called ncbi-blast+-dev?)
- binaries: errhdr asn2asn idfetch asn2idx insdseqget nps2gps asn2xml getmesh gil2bin asn2gb getpub gene2xml subfuse gbseqget vecscreen trna2tbl tbl2asn asnval debruijn trna2sap cleanasn asntool asn2all asn2fsa asndisc sortbyquote findspl asn2ff checksub indexpub asndhuff spidey fa2htgs asnmacro makeset
ncbi-data is a package required by all three aforementioned packages
- configuration: /etc/ncbi/{.ncbirc,.nlmstmanrc}; note they’re dotfiles (silly)
  - .ncbirc has a lot of config and sets [NCBI] DATA=/usr/share/ncbi/data
  - .ncbirc sets [BLAST] BLASTDB=/data/genomics/ncbi/blast/db but that doesn’t seem to get picked up?
- binaries: vibrate # preloads vibrate library so omitted arguments will be prompted (but library not here)
- data: /usr/share/ncbi/data has 16SCore but e.g. also the genetic code gc.prt

Configuring blast+

BLASTDB points to the directory with the databases
/etc/ncbi/.ncbirc # sheesh why dotfile?
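
A minimal sketch of looking up the BLASTDB setting, assuming .ncbirc is plain INI syntax with a [BLAST] section as noted above. The helper name `find_blastdb` and the choice to let the environment variable win over the file are assumptions of this sketch, not documented blast+ behaviour.

```python
import configparser
import os


def find_blastdb(ncbirc_text, environ=None):
    """Return the BLASTDB directory from the environment or ncbirc-style text.

    Assumption of this sketch: the environment variable takes precedence.
    """
    environ = os.environ if environ is None else environ
    if environ.get("BLASTDB"):
        return environ["BLASTDB"]
    parser = configparser.ConfigParser()
    parser.read_string(ncbirc_text)
    # Option names are case-insensitive in configparser, so BLASTDB matches.
    return parser.get("BLAST", "BLASTDB", fallback=None)


# Sample text built from the two settings mentioned in these notes.
sample = (
    "[NCBI]\n"
    "DATA=/usr/share/ncbi/data\n"
    "\n"
    "[BLAST]\n"
    "BLASTDB=/data/genomics/ncbi/blast/db\n"
)
print(find_blastdb(sample, environ={}))
```

Passing `environ={}` keeps the example independent of whatever BLASTDB is set to in the current shell.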

BLAST Programs

blastn - search nucleotide database using nucleotide sequence
- blastn: classical
- megablast: intra-species identification (fast, precise)
- discontiguous megablast: cross-species, search with coding sequences
- blastn-short: short sequences, cross-species
blastp - search protein database using protein query
- psi-blast: iterative search for position-specific score matrix (PSSM) construction, identify remote relatives for protein family
- phi-blast: protein alignment with input pattern as anchor / constraint
- delta-blast: protein similarity search, higher sensitivity than blastp
blastx - search protein database using translated nucleotide: identify potential protein products encoded by sequence
tblastn - search translated nucleotide using protein query: identify sequences encoding products similar to protein query
tblastx - search translated nucleotide database using translated nucleotide query: identify sequences encoding products similar to those the nucleotide query could itself encode
blast2 - align two sequences (bl2seq)
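
The blastn variants listed above are selected with the `-task` option of the blastn binary. A small sketch that builds (but does not run) such a command line; the function name `blastn_command` and the file name `query.fna` are made up for illustration.

```python
import shlex

# Task names for `blastn -task`, matching the variants listed above.
BLASTN_TASKS = {"blastn", "megablast", "dc-megablast", "blastn-short"}


def blastn_command(query, db, task="megablast", outfmt=6):
    """Build a blastn command line as a list of arguments, without running it."""
    if task not in BLASTN_TASKS:
        raise ValueError(f"unknown blastn task: {task}")
    return ["blastn", "-task", task, "-query", query,
            "-db", db, "-outfmt", str(outfmt)]


# Hypothetical query file, searched cross-species against nt.
cmd = blastn_command("query.fna", "nt", task="dc-megablast")
print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` on a machine where ncbi-blast+ and the database are installed.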

Some other programs at NCBI

Standalone Blast+ - with remote option
QBlast, URLAPI - RESTful BLAST http://www.ncbi.nlm.nih.gov/blast/Doc/urlapi.html Sample: http://www.ncbi.nlm.nih.gov/blast/docs/web_blast.pl
blastn_vdb - Search in the SRA (SRR, WGS and TSA files are stored in vdb)
MOLE-BLAST - Take number of input sequences, cluster with nearest neighbours in database
CD-Search/CDART - Find conserved domains in a sequence using the curated domain database / find other sequences containing these domains
WGS BLAST
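
For the QBlast/URLAPI entry above, a rough sketch of the two requests involved: a Put that submits the search and a Get that fetches the result by request ID (RID). It only builds the request strings and contacts no server. The endpoint in `BLAST_CGI` is assumed to be the current one (the links above point at older pages), and the helper names and the RID `ABC123` are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint; check the URLAPI documentation before relying on it.
BLAST_CGI = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def put_body(query, program="blastn", database="nt"):
    """Form-encoded body for the request that submits a search."""
    params = {"CMD": "Put", "PROGRAM": program,
              "DATABASE": database, "QUERY": query}
    return urlencode(params)


def get_url(rid, format_type="Text"):
    """URL that retrieves the results for a previously returned RID."""
    params = {"CMD": "Get", "RID": rid, "FORMAT_TYPE": format_type}
    return BLAST_CGI + "?" + urlencode(params)


print(put_body("ACGTACGTACGT"))
print(get_url("ABC123"))  # hypothetical RID
```

In a real client the Put body is POSTed to `BLAST_CGI`, the RID is parsed out of the response, and the Get URL is polled until the search has finished, as in the web_blast.pl sample.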

The BLAST Databases

Explained in the How To BLAST Guide, also see the README on the FTP Server

Representative_Genomes (local copy): Archaea and Bacteria Representative and reference genomes from refseq
nr: default database: all GenBank + EMBL (Europe) + DDBJ (Japan) + PDB (Protein Data Bank) sequences, excluding:
- PAT (patent division),
- STS (Sequence Tagged Sites),
- GSS (Genomic Survey Sequences)
- HTGS (Unfinished High Throughput Sequences, phases 0,1,2)
- EST (Expressed Sequence Tag = cDNA, i.e. mRNA or other transcript),
- TSA (Transcriptome Shotgun Assemblies, assembled from RNAseq SRA),
- WGS (Whole Genome Shotgun Assemblies, from SRA)
nt: default nucleotide database
refseq_rna: curated (NM, NR) and predicted (XM, XR) sequences from RefSeq project
refseq_genomic: genomic sequences from RefSeq project
chromosome: complete genomes and chromosomes from RefSeq project
human, mouse G+T: genomic sequences, curated and predicted RNA for current build for human / mouse
16S microbial (local copy): archaea & bacteria 16S rRNA sequences from Targeted Loci Project
SRA: Sequence Read Archive: raw sequence data from NGS, also DRA (DDBJ) and ERA (EMBL) http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi To search these, use blastn_vdb and the SRA Toolkit: ftp://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_Local_SRA_BLAST.pdf, also documented here: ftp://ftp.ncbi.nlm.nih.gov/blast/WGS_TOOLS/README_BLASTWGS.txt

Reference & representative Prokaryotic Genomes

About (includes explanation of distinction): http://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/ Browse: http://www.ncbi.nlm.nih.gov/genome/browse/reference/#

GenBank

What is GenBank

GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. It is part of the INSDC (GenBank, EMBL - European Molecular Biology Laboratory, DDBJ - DNA Data Bank of Japan). It comprises:

NucCore (http://www.ncbi.nlm.nih.gov/nuccore/)
NucEst (http://www.ncbi.nlm.nih.gov/nucest/) Expressed Sequence Tags
Nucgss (http://www.ncbi.nlm.nih.gov/nucgss/) Genome Survey Sequences

Download locations

Web Pages

(Reads) http://www.ncbi.nlm.nih.gov/sra (SRA)
(Assembled) http://www.ncbi.nlm.nih.gov/genbank
(Mapped) http://www.ncbi.nlm.nih.gov/assembly
(Representative) http://www.ncbi.nlm.nih.gov/refseq

FTP

New organisation of the download, read the FAQ: http://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/

/genomes
  /genbank
    /archaea
      /Genus_species
        assembly_summary.txt
        all_assembly_versions
          links to /genomes/all/acc.accver_assver
        latest_assembly_versions
    /bacteria
    /fungi
    ...
  /refseq
    /archaea
    /bacteria
    ...
  /all
    /GCA_000000000.1_ASM000
      GCA_000000000.1_ASM000_assembly_{report,stats,structure}.txt
      GCA_000000000.1_ASM000_feature_table.txt.gz
      GCA_000000000.1_ASM000_genomic.{fna,gff,gbff}.gz
      GCA_000000000.1_ASM000_protein.{faa,gpff}.gz
      GCA_000000000.1_ASM000_wgsmaster.gbff.gz

More confusing: there is also /genomes/Bacteria/Genus_species_ETC_uid12345 # but is old (THIS LOCAL COPY)

First check the reference (and representative?):
  wget ftp://ftp.ncbi.nih.gov/genomes/refseq/bacteria/Aeromonas_hydrophila/assembly_summary.txt
  wget ftp://ftp.ncbi.nih.gov/genomes/all/GCF_ # the accession link is in the summary file
But also linked from {latest,all}_assembly_versions.
Then check the genbank (s/refseq/genbank/ in the URL).
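
The lookup described above can be sketched as follows, assuming assembly_summary.txt is tab-separated with `#` comment lines, the last of which holds the column names. The rows in `SAMPLE` are made up, and the `_genomic.fna.gz` suffix is what I believe the FTP site uses for the genomic FASTA file; check both against a real download.

```python
import csv
import io

# Made-up rows in the layout of assembly_summary.txt; the real file has many
# more columns, which is why columns are looked up by name.
SAMPLE = (
    "#   See the README for a description of the columns in this file.\n"
    "# assembly_accession\trefseq_category\torganism_name\tftp_path\n"
    "GCF_000000001.1\treference genome\tExample organism A\t"
    "ftp://ftp.ncbi.nih.gov/genomes/all/GCF_000000001.1_ASM0001v1\n"
    "GCF_000000002.1\tna\tExample organism B\t"
    "ftp://ftp.ncbi.nih.gov/genomes/all/GCF_000000002.1_ASM0002v1\n"
)


def read_summary(text):
    """Parse assembly_summary.txt content into a list of dicts."""
    lines = text.splitlines()
    # The last comment line before the data carries the column names.
    header = [l for l in lines if l.startswith("#")][-1].lstrip("# ").split("\t")
    data = [l for l in lines if l and not l.startswith("#")]
    reader = csv.DictReader(io.StringIO("\n".join(data)), fieldnames=header,
                            delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def genomic_fasta_url(row):
    """Build the genomic FASTA URL from the ftp_path column (assumed naming)."""
    base = row["ftp_path"].rstrip("/")
    name = base.rsplit("/", 1)[-1]
    return f"{base}/{name}_genomic.fna.gz"


rows = read_summary(SAMPLE)
wanted = [r for r in rows
          if r["refseq_category"] in ("reference genome", "representative genome")]
for r in wanted:
    print(r["assembly_accession"], genomic_fasta_url(r))
```

Filtering on `refseq_category` is what answers the "reference (and representative?)" question per organism; the same code applies to the genbank summary after swapping the URL.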

BLAST scores

See also /usr/share/doc/blast2/scoring.pdf.gz

S = [lambda * R - ln(K)] / ln(2), where S is the bit score, R is the raw score, and lambda and K are constants that depend on the scoring system (so they are not fixed)
R = match scores + mismatch penalties + gap penalties (for nucleotides; for proteins it is the sum of the scores from the BLOSUM matrix used)
E = number of alignments with score at least S expected by chance = Q * D * 2^-S, where Q is the query length and D is the database length
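
A worked sketch of the two formulas above. The values used for lambda and K are illustrative placeholders only; the values that apply to a given search are printed in the statistics section of the BLAST report.

```python
import math


def bit_score(raw_score, lam, k):
    """S = (lambda * R - ln K) / ln 2"""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(query_len, db_len, bits):
    """E = Q * D * 2^-S"""
    return query_len * db_len * 2.0 ** (-bits)


# Placeholder constants, not taken from any particular search.
LAMBDA, K = 1.28, 0.46

s = bit_score(50, LAMBDA, K)
print(round(s, 1), e_value(1000, 1_000_000, s))
```

Because E halves for every extra bit of score, doubling the database length D costs exactly one bit of significance.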

Feature Table Format

For NCBI-GenBank, EBI-EMBL and DDBJ: http://www.insdc.org/files/feature_table.html
INSDC = International Nucleotide Sequence Database Collaboration
