bcdatabaser

A pipeline to create reference databases for arbitrary markers and taxonomic groups from NCBI data

This project is maintained by molbiodiv


Classification examples using different tools

USEARCH/SINTAX

The databases created with BCdatabaser can be used directly in USEARCH, although you might consider applying automated and manual post-processing first. Also consult the USEARCH manual for parameter and algorithm choices: https://drive5.com/usearch/manual/

Direct global alignment searches:

usearch -usearch_global sequence-of-interest.fa -db sequences.tax.fa -id 0.97 -uc zotus.directglobal.uc -strand both

Direct local alignment searches:

usearch -usearch_local sequence-of-interest.fa -db sequences.tax.fa -id 0.97 -uc zotus.directlocal.uc -strand both
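
The searches above write their results in the tab-separated UC format. As a minimal sketch, assuming the ten-column UC layout described in the USEARCH manual (record type in column 1, percent identity in column 4, query and target labels in columns 9 and 10), the hits can be reduced to a short table with awk. The function name `uc_hits` and the sample rows are purely illustrative and not part of BCdatabaser or USEARCH.

```shell
# Reduce UC records on stdin to: query <TAB> target <TAB> percent identity.
# Assumption: 10-column UC layout; "H" rows are hits, "N" rows are queries without a hit.
uc_hits() {
  awk -F '\t' '$1 == "H" { print $9 "\t" $10 "\t" $4 }'
}

# Made-up example rows (not real results), built with printf to get literal tabs.
{
  printf 'H\t0\t280\t98.9\t+\t0\t0\t280M\tzotu1\tLS453445;tax=k:Metazoa,p:Arthropoda;\n'
  printf 'N\t*\t*\t*\t.\t*\t*\t*\tzotu2\t*\n'
} | uc_hits
```

On real output this could be called as `uc_hits < zotus.directglobal.uc`. Only the hit rows are kept, so queries without a match above the identity threshold are dropped.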

Hierarchical SINTAX classification (USEARCH v9+):

usearch -sintax sequence-of-interest.fa -db sequences.tax.fa -tabbedout zotus.directglobal.sintax -strand both -sintax_cutoff 0.8

References:

VSEARCH/SINTAX

As with USEARCH, the databases are directly compatible with VSEARCH.

Direct global alignment searches:

vsearch --usearch_global sequence-of-interest.fa --db sequences.tax.fa --id 0.97 --uc zotus.directglobal.uc --strand both

Hierarchical SINTAX classification (requires a VSEARCH release that includes the sintax command):

vsearch --sintax sequence-of-interest.fa --db sequences.tax.fa --tabbedout zotus.directglobal.sintax --strand both --sintax_cutoff 0.8
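
The tabbed SINTAX output can be post-processed with standard tools. The following is a minimal sketch, assuming the four-column layout documented for SINTAX (query label, prediction with bootstrap support, strand, and the prediction after applying the cutoff). It extracts the species-level assignment, if one survived the cutoff. The function name `sintax_species` and the sample rows are hypothetical.

```shell
# Print: query label <TAB> species name (or "unclassified") from SINTAX tabbed output on stdin.
# Assumption: column 4 holds the cutoff-filtered prediction, e.g. "k:Metazoa,...,s:Molops_piceus".
sintax_species() {
  awk -F '\t' '{
    species = "unclassified"
    n = split($4, ranks, ",")
    for (i = 1; i <= n; i++) {
      if (ranks[i] ~ /^s:/) species = substr(ranks[i], 3)
    }
    print $1 "\t" species
  }'
}

# Made-up rows for illustration (not real results).
{
  printf 'zotu1\tk:Metazoa(1.00),p:Arthropoda(1.00),s:Molops_piceus(0.93)\t+\tk:Metazoa,p:Arthropoda,s:Molops_piceus\n'
  printf 'zotu2\tk:Metazoa(1.00),p:Arthropoda(0.55),s:Molops_piceus(0.20)\t+\tk:Metazoa\n'
} | sintax_species
```

On real output this could be called as `sintax_species < zotus.directglobal.sintax`.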

Reference: Rognes T, Flouri T, Nichols B, Quince C, Mahé F. (2016) VSEARCH: a versatile open source tool for metagenomics. PeerJ 4:e2584. doi: 10.7717/peerj.2584

BLAST

Direct local alignments can also be run immediately with BLAST:

blastn -query sequence-of-interest.fa -max_target_seqs 1 -outfmt 6 -subject sequences.tax.fa > tabular.out
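
With `-outfmt 6`, BLAST writes twelve tab-separated columns by default (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). As a minimal sketch, hits can be filtered by percent identity and alignment length with awk. The function name `blast_filter`, the thresholds, and the sample rows are illustrative assumptions, not recommendations.

```shell
# Keep tabular BLAST hits on stdin with pident >= $1 and alignment length >= $2.
# Column 3 is percent identity and column 4 is alignment length in default -outfmt 6.
blast_filter() {
  awk -F '\t' -v min_id="$1" -v min_len="$2" \
    '($3 + 0) >= (min_id + 0) && ($4 + 0) >= (min_len + 0)'
}

# Made-up rows for illustration (not real results).
{
  printf 'zotu1\tLS453445\t99.3\t280\t2\t0\t1\t280\t1\t280\t1e-140\t505\n'
  printf 'zotu2\tLS453445\t88.1\t120\t14\t1\t1\t120\t5\t124\t2e-35\t150\n'
} | blast_filter 97 200
```

On real output this could be called as `blast_filter 97 200 < tabular.out`.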

Reference: Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) “Basic local alignment search tool.” J. Mol. Biol. 215:403-410

RDP classifier

The RDP classifier expects a FASTA header syntax that is similar to, but not identical with, the one used by the tools above. The headers therefore need slight modification:

FASTA syntax used by BCdatabaser and the tools above:

>LS453445;tax=k:Metazoa,p:Arthropoda,c:Insecta,o:Coleoptera,f:Carabidae,g:Molops,s:Molops_piceus;
ACATCCTGAAGTTTATATTTTAATTCTCCCAGGATTTGGAATAATTTCCCATATTATTAGACAAGAAAGA
GGTAAAAAAGAAACATTTGGTTCATTAGGAATAATTTATGCTATATTAGCTATTGGTTTATTAGGATTTG
TAGTATGAGCTCATCATATATTTACAGTAGGAATAGATGTGGATACTCGAGCTTATTTTACATCAGCTAC
TATAATTATTGCTGTTCCTACAGGAATTAAGATCTTTTCTTGGCTTGCAACTTTACACGGAACTCAGTTA

RDP fasta syntax:

>LS453445	Metazoa;Arthropoda;Insecta;Coleoptera;Carabidae;Molops;Molops_piceus
ACATCCTGAAGTTTATATTTTAATTCTCCCAGGATTTGGAATAATTTCCCATATTATTAGACAAGAAAGA
GGTAAAAAAGAAACATTTGGTTCATTAGGAATAATTTATGCTATATTAGCTATTGGTTTATTAGGATTTG
TAGTATGAGCTCATCATATATTTACAGTAGGAATAGATGTGGATACTCGAGCTTATTTTACATCAGCTAC
TATAATTATTGCTGTTCCTACAGGAATTAAGATCTTTTCTTGGCTTGCAACTTTACACGGAACTCAGTTA

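To convert the headers accordingly, the following sed substitutions can be applied (GNU sed; the g flag replaces every rank prefix, and the last expression drops the trailing semicolon):

sed -e "s/;tax=k:/\t/" -e "s/,[^:]:/;/g" -e 's/;$//' sequences.tax.fa > sequences.tax.rdp.fa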
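
As a quick sanity check of the header conversion, the following minimal sketch applies the substitutions to the example record shown above and prints the result. It inserts the tab via `printf`, so it does not rely on `\t` being understood by sed. The function name `to_rdp_header` is hypothetical, and the sequence line is shortened for illustration.

```shell
# Convert BCdatabaser/SINTAX-style FASTA headers on stdin to the RDP-style header shown above.
to_rdp_header() {
  tab=$(printf '\t')
  # 1) ";tax=k:" becomes a tab, 2) every ",<rank>:" becomes ";", 3) trailing ";" is removed.
  sed -e "s/;tax=k:/${tab}/" -e 's/,[^:]:/;/g' -e 's/;$//'
}

# Example record from this page (sequence shortened); sequence lines pass through unchanged.
printf '%s\n' \
  '>LS453445;tax=k:Metazoa,p:Arthropoda,c:Insecta,o:Coleoptera,f:Carabidae,g:Molops,s:Molops_piceus;' \
  'ACATCCTGAAGTTTATATTTTAATTCTCCC' | to_rdp_header
```

The header line should come out as the accession, a tab, and the semicolon-separated lineage without a trailing semicolon.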

Reference: Wang, Qiong, et al. “Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy.” Appl. Environ. Microbiol. 73.16 (2007): 5261-5267.