A pipeline to create reference databases for arbitrary markers and taxonomic groups from NCBI data
This project is maintained by molbiodiv
The databases created with BCdatabaser can be used directly in USEARCH, although you may want to apply automated or manual post-processing first. Consult the USEARCH manual for parameter and algorithm choices: https://drive5.com/usearch/manual/
Direct global alignment searches:
usearch -usearch_global sequence-of-interest.fa -db sequences.tax.fa -id 0.97 -uc zotus.directglobal.uc -strand both
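The resulting .uc file can be summarised with standard command-line tools. The following is a minimal sketch, assuming the usual 10-column tab-separated UC layout (column 1 record type, column 4 percent identity, column 9 query label, column 10 target label). The function name `uc_hits` and the demo records are hypothetical and only serve as illustration.

```shell
# List query label, percent identity, and target label for every hit ("H") record.
# Assumption: standard tab-separated UC layout with 10 columns.
uc_hits() {
  awk -F'\t' '$1 == "H" { print $9 "\t" $4 "\t" $10 }' "$1"
}

# Hypothetical demo input: one query with a hit, one query without ("N" record).
demo_uc=$(mktemp)
printf 'H\t0\t280\t98.9\t+\t0\t0\t280M\tZotu1\tLS453445;tax=k:Metazoa,p:Arthropoda;\n' > "$demo_uc"
printf 'N\t*\t*\t*\t.\t*\t*\t*\tZotu2\t*\n' >> "$demo_uc"

uc_hits "$demo_uc"   # prints only the Zotu1 record
rm -f "$demo_uc"
```

On real data the call would be `uc_hits zotus.directglobal.uc`.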
Direct local alignment searches:
usearch -usearch_local sequence-of-interest.fa -db sequences.tax.fa -id 0.97 -uc zotus.directlocal.uc -strand both
Hierarchical SINTAX classification (USEARCH v9+):
usearch -sintax sequence-of-interest.fa -db sequences.tax.fa -tabbedout zotus.directglobal.sintax -strand both -sintax_cutoff 0.8
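To get a compact table of classifications from the SINTAX output, something like the following may help. It is a sketch that assumes the tabbed output has four tab-separated columns (query label, prediction with confidence values, strand, prediction after applying the cutoff). The function name `sintax_calls` and the label "unclassified" are hypothetical choices, not part of USEARCH.

```shell
# Print query label and the taxonomy that passed -sintax_cutoff.
# Assumption: column 1 = query label, column 4 = thresholded prediction.
# Queries with an empty column 4 are reported as "unclassified".
sintax_calls() {
  awk -F'\t' '{ tax = ($4 == "") ? "unclassified" : $4; print $1 "\t" tax }' "$1"
}

# Hypothetical demo input: one confident call, one query below the cutoff.
demo_sintax=$(mktemp)
printf 'Zotu1\tk:Metazoa(1.00),p:Arthropoda(0.95)\t+\tk:Metazoa,p:Arthropoda\n' > "$demo_sintax"
printf 'Zotu2\tk:Metazoa(0.40)\t+\t\n' >> "$demo_sintax"

sintax_calls "$demo_sintax"
rm -f "$demo_sintax"
```

On real data the call would be `sintax_calls zotus.directglobal.sintax`.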
References:
As with USEARCH, the databases are directly compatible with VSEARCH.
Direct global alignment searches:
vsearch --usearch_global sequence-of-interest.fa --db sequences.tax.fa --id 0.97 --uc zotus.directglobal.uc --strand both
Hierarchical SINTAX classification:
vsearch --sintax sequence-of-interest.fa --db sequences.tax.fa --tabbedout zotus.directglobal.sintax --strand both --sintax_cutoff 0.8
Reference: Rognes T, Flouri T, Nichols B, Quince C, Mahé F. (2016) VSEARCH: a versatile open source tool for metagenomics. PeerJ 4:e2584. doi: 10.7717/peerj.2584
Direct local alignments can also be run immediately with BLAST:
blastn -query sequence-of-interest.fa -max_target_seqs 1 -outfmt 6 -subject sequences.tax.fa > tabular.out
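The tabular output (-outfmt 6) has twelve default columns: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore. As a sketch of how hits could be filtered afterwards, the hypothetical helper `blast_filter` below keeps rows above a minimum percent identity and alignment length. The thresholds in the demo are arbitrary examples, not recommendations.

```shell
# Keep BLAST tabular rows with pident >= $1 and alignment length >= $2.
# Usage: blast_filter MIN_IDENTITY MIN_LENGTH FILE
blast_filter() {
  awk -F'\t' -v minid="$1" -v minlen="$2" '$3 >= minid && $4 >= minlen' "$3"
}

# Hypothetical demo input: one strong hit and one weak hit.
demo_blast=$(mktemp)
printf 'Zotu1\tLS453445;tax=k:Metazoa;\t99.1\t280\t2\t0\t1\t280\t1\t280\t1e-140\t500\n' > "$demo_blast"
printf 'Zotu2\tLS453445;tax=k:Metazoa;\t88.0\t120\t14\t0\t1\t120\t1\t120\t1e-30\t150\n' >> "$demo_blast"

blast_filter 97 200 "$demo_blast"   # prints only the Zotu1 row
rm -f "$demo_blast"
```

On real data the call would be, for example, `blast_filter 97 200 tabular.out`.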
Reference: Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) “Basic local alignment search tool.” J. Mol. Biol. 215:403-410
The RDP classifier uses a FASTA header syntax that is similar, but not identical, to that of the tools above. Slight modifications are therefore needed:
FASTA syntax used by BCdatabaser and the tools above:
>LS453445;tax=k:Metazoa,p:Arthropoda,c:Insecta,o:Coleoptera,f:Carabidae,g:Molops,s:Molops_piceus;
ACATCCTGAAGTTTATATTTTAATTCTCCCAGGATTTGGAATAATTTCCCATATTATTAGACAAGAAAGA
GGTAAAAAAGAAACATTTGGTTCATTAGGAATAATTTATGCTATATTAGCTATTGGTTTATTAGGATTTG
TAGTATGAGCTCATCATATATTTACAGTAGGAATAGATGTGGATACTCGAGCTTATTTTACATCAGCTAC
TATAATTATTGCTGTTCCTACAGGAATTAAGATCTTTTCTTGGCTTGCAACTTTACACGGAACTCAGTTA
RDP fasta syntax:
>LS453445 Metazoa;Arthropoda;Insecta;Coleoptera;Carabidae;Molops;Molops_piceus
ACATCCTGAAGTTTATATTTTAATTCTCCCAGGATTTGGAATAATTTCCCATATTATTAGACAAGAAAGA
GGTAAAAAAGAAACATTTGGTTCATTAGGAATAATTTATGCTATATTAGCTATTGGTTTATTAGGATTTG
TAGTATGAGCTCATCATATATTTACAGTAGGAATAGATGTGGATACTCGAGCTTATTTTACATCAGCTAC
TATAATTATTGCTGTTCCTACAGGAATTAAGATCTTTTCTTGGCTTGCAACTTTACACGGAACTCAGTTA
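To reformat the headers accordingly, the following sed command can be applied (GNU sed, which interprets \t as a tab). The g flag replaces every rank prefix on the line, and the last expression strips the trailing semicolon:
sed -e "s/;tax=k:/\t/" -e "s/,[^:]:/;/g" -e "s/;$//" sequences.tax.fa > sequences.tax.rdp.fa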
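Before converting a full database, the header conversion can be checked on a single record. The sketch below wraps a variant of the conversion in a hypothetical function `to_rdp`. It restricts the substitutions to header lines, replaces every rank prefix, removes the trailing semicolon, and generates the tab with printf so that it does not depend on GNU sed. These choices are assumptions about the intended RDP header shown above, not the pipeline's own code.

```shell
# Convert BCdatabaser/SINTAX-style FASTA headers to RDP-style headers.
# Reads the given files, or standard input when no file is given.
to_rdp() {
  tab=$(printf '\t')
  sed -e "/^>/ s/;tax=k:/${tab}/" \
      -e "/^>/ s/,[a-z]:/;/g" \
      -e "/^>/ s/;\$//" "$@"
}

# Demo on the header from the example above, with a shortened sequence line.
printf '>LS453445;tax=k:Metazoa,p:Arthropoda,c:Insecta,o:Coleoptera,f:Carabidae,g:Molops,s:Molops_piceus;\nACATCCTGAAG\n' | to_rdp
```

On real data the call would be `to_rdp sequences.tax.fa > sequences.tax.rdp.fa`.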
Reference: Wang, Qiong, et al. “Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy.” Appl. Environ. Microbiol. 73.16 (2007): 5261-5267.