Targeted Loci Database

Analyzing "barcode" marker genes, such as 16S and ITS, on the One Codex platform

Written by Christopher Smith
Updated over a week ago

This page summarizes the Targeted Loci Database, our tool for analyzing data from the 16S locus and other marker genes (e.g. ITS, 5S). It includes:

  • Our approach to 16S analysis

  • How we build the Targeted Loci Database

  • The accuracy and robustness of the results

  • How to run the Targeted Loci Database

Background

Researchers who study the microbiome generally use one of two genomic methods to analyze samples: sequencing all of the DNA in a sample (WGS) or targeting a specific marker gene (e.g. 16S rDNA). While WGS provides a high-resolution taxonomic and functional characterization of microbiome samples, 16S sequencing is a cost-effective technique for broad community surveys across large collections of samples.

The Targeted Loci Database on the One Codex platform is specifically designed for marker gene analysis, with a large curated database that includes 16S, ITS, and more. Analyzing 16S data against this database provides:

  • Highly accurate microbial identification from 16S, ITS, etc.

  • Community diversity measures that are robust to sequencing depth

  • Completely reproducible analysis using a stable, versioned reference database

  • Large-scale cross-comparison across samples in the One Codex platform

Our approach to 16S – enabling robust, scalable, and portable analysis

In the microbiome research community, tools that analyze 16S data can be grouped into three categories:

  • "Closed reference": where sequences are compared against a fixed reference database of 16S sequences from known organisms

  • "De novo": where individual sequences are clustered by their pairwise similarity

  • "Open reference": a combination of closed reference and de novo, where individual sequences are first compared against a reference, and then any non-matching sequences are grouped into novel OTUs

It's important to note that for de novo and open reference analysis, adding new samples to a dataset requires the entire set of samples to be re-analyzed as a batch, which can be computationally intensive and introduce analytical artifacts. We chose to take a closed reference approach for two reasons:

  1. Sample analysis is more robust to changes in sequencing depth and parameters used for analysis (e.g. the percent identity threshold used to create OTUs)

  2. Every result conforms to a standardized taxonomy, which enables us to perform large-scale cross-sample comparison that does not change as new samples are added

We believe that this approach will enable researchers to easily analyze and compare across many thousands of samples, rapidly incorporating new datasets within a reproducible and deterministic analysis platform.
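To make the closed reference idea concrete, here is a minimal sketch under simplifying assumptions. The reference sequences, taxon names, the position-by-position identity measure, and the 0.9 threshold are all hypothetical and chosen only for illustration; this is not how One Codex aligns reads. The point it shows is that each sample is profiled against a fixed reference on its own, so adding a new sample never changes an earlier result.

```python
from typing import Dict, Iterable, Optional

# Hypothetical toy reference (sequence -> taxon); not real 16S data.
REFERENCE: Dict[str, str] = {
    "ACGTACGTAC": "Genus_A species_1",
    "TTGGCCAATT": "Genus_B species_2",
}


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def assign(read: str, min_identity: float = 0.9) -> Optional[str]:
    """Return the taxon of the best-matching reference, or None if no match."""
    best = max(REFERENCE, key=lambda ref: identity(read, ref))
    if identity(read, best) >= min_identity:
        return REFERENCE[best]
    return None


def profile(reads: Iterable[str]) -> Dict[str, int]:
    """Count reads per taxon for a single sample, independently of other samples."""
    counts: Dict[str, int] = {}
    for read in reads:
        taxon = assign(read)
        if taxon is not None:
            counts[taxon] = counts.get(taxon, 0) + 1
    return counts
```

Because `profile` only reads from the fixed `REFERENCE`, the result for one sample is the same whether it is analyzed alone or alongside thousands of others. De novo clustering does not have this property, since clusters are defined by the full set of reads in the batch.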

The One Codex Targeted Loci Database

The One Codex Targeted Loci Database is specifically designed for marker gene sequencing and built using the most commonly used genes for microbial surveys, including 16S, ITS, 18S, and others [1]. The Targeted Loci Database contains ~250,000 curated gene records spanning the known microbial world – bacteria, archaea, fungi, protists, algae, etc.

Database curation

The Targeted Loci Database builds on resources such as the NCBI Target Loci Project, as well as manually and automatically curated sequences from the broader NCBI nucleotide collection. We've sought to strike a balance between including the broadest possible range of microbes and avoiding contamination, mis-annotation, and other issues that can confound microbiome analysis [2-4]. The database is constructed with the following steps:

  1. Full length sequence records are selected by filtering on minimum record length [5]

  2. All reference sequences must have a valid species name and NCBI Tax ID

  3. For all records outside of the NCBI Target Loci Project, each species must be represented by at least two independent sequence records, and be >= 75% similar to any other sequence at that locus

  4. Mislabeled references that align with >= 99% identity across different genera are excluded

These straightforward steps are used to assemble a set of reference sequences that best reflect the biological diversity that can be captured by "barcode" sequencing using this set of marker genes. You can find the complete set of organisms in the database here.
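As a rough illustration of steps 1 through 3, the sketch below filters a list of records. The `Record` layout and the `curate` function are hypothetical, and several details are assumptions: a species name is treated as "valid" if it is simply present, and the two-record rule is counted per locus and Tax ID among the records that survive steps 1 and 2. The similarity check in step 3 and the mislabeling check in step 4 both require sequence alignment and are not shown.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

# Minimum record length per locus, in bp, taken from note [5].
MIN_LENGTH = {"16S": 1000, "18S": 500, "23S": 500, "28S": 500,
              "5S": 100, "ITS": 100, "gyrB": 500, "rpoB": 500}


@dataclass
class Record:  # hypothetical record layout, for illustration only
    locus: str
    sequence: str
    species: Optional[str]
    tax_id: Optional[int]
    in_target_loci_project: bool


def curate(records: List[Record]) -> List[Record]:
    # Steps 1 and 2: minimum length, plus a species name and an NCBI Tax ID.
    kept = [r for r in records
            if r.locus in MIN_LENGTH
            and len(r.sequence) >= MIN_LENGTH[r.locus]
            and r.species and r.tax_id is not None]
    # Step 3 (record count only): outside the NCBI Target Loci Project,
    # require at least two records for the same species at a locus.
    support = Counter((r.locus, r.tax_id) for r in kept)
    return [r for r in kept
            if r.in_target_loci_project or support[(r.locus, r.tax_id)] >= 2]
```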

To analyze samples against the Targeted Loci Database, we align every read with high sensitivity and identify the best alignments to the database. Each read is assigned to the most specific taxonomic grouping that the data supports, down to the species level where appropriate. We then perform abundance-based filtering to minimize the number of false positive assignments that may be introduced by sequencing error. In this filtering step, any organism whose readcount (with children) makes up less than 0.00005x (0.005%) of the total is reassigned to its parent. In addition, any organism at the genus level or below whose readcount (with children) makes up less than 0.01x (1%) of its immediate parent is also reassigned to its parent. Users are presented with the filtered data by default, and also have the option of viewing and analyzing the unfiltered results.
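The abundance-based filtering step can be sketched as follows. This is one possible reading of the rules above, not the platform's actual implementation. The `Taxon` layout, the function names, and the set of rank names counted as "genus level or below" are assumptions, as is the choice to compare against the parent's readcount with children and to move a filtered organism's whole clade (its own reads plus those of its children) to the nearest ancestor that passes both thresholds.

```python
from typing import Dict, NamedTuple, Optional

MIN_FRACTION_OF_TOTAL = 0.00005   # threshold quoted in the text
MIN_FRACTION_OF_PARENT = 0.01     # threshold quoted in the text
GENUS_OR_BELOW = {"genus", "species", "subspecies", "strain"}  # assumed rank names


class Taxon(NamedTuple):  # hypothetical layout, for illustration only
    parent: Optional[str]  # None for the root
    rank: str
    reads: int             # reads assigned directly to this taxon


def clade_counts(taxa: Dict[str, Taxon]) -> Dict[str, int]:
    """Readcount 'with children' for every taxon."""
    totals = {name: t.reads for name, t in taxa.items()}
    for t in taxa.values():
        ancestor = t.parent
        while ancestor is not None:
            totals[ancestor] += t.reads
            ancestor = taxa[ancestor].parent
    return totals


def filter_abundance(taxa: Dict[str, Taxon]) -> Dict[str, int]:
    """Return direct readcounts per taxon after both filters are applied."""
    clade = clade_counts(taxa)
    total = sum(t.reads for t in taxa.values())

    def fails(name: str) -> bool:
        t = taxa[name]
        if t.parent is None:
            return False  # the root has no parent to receive its reads
        if clade[name] < MIN_FRACTION_OF_TOTAL * total:
            return True
        return (t.rank in GENUS_OR_BELOW
                and clade[name] < MIN_FRACTION_OF_PARENT * clade[t.parent])

    filtered = {name: 0 for name in taxa}
    for name, t in taxa.items():
        # Reads end up on the parent of the failing taxon closest to the root
        # on the path from this taxon upward, or stay put if nothing fails.
        target, node = name, name
        while node is not None:
            if fails(node):
                target = taxa[node].parent
            node = taxa[node].parent
        filtered[target] += t.reads
    return {name: n for name, n in filtered.items() if n > 0}
```

Reassigning reads to a parent never changes any readcount "with children" higher up the tree, so both thresholds can be evaluated once against the unfiltered counts, and the total number of reads is preserved.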

Accurate, robust analysis with the Targeted Loci Database

In order to measure the performance of the Targeted Loci Database, we analyzed a mock community constructed by Bokulich, et al. [6] and analyzed in depth by Kopylova, et al. [7]. Mock communities are particularly useful for benchmarking because we know which organisms are truly present in the sample, and so every organism detected is either a true positive or a false positive. Moreover, the authors [7] compared results from QIIME [8] and mothur [9], which we use here as a point of comparison.

Accurate detection (left):

On the left, you can see that the Targeted Loci Database provides more accurate results across all three of the mock communities. The F1 score is the harmonic mean of precision and recall, and summarizes the overall accuracy of detection. The three mock communities varied in size and complexity and were designed to test how accurately real-world microbiome samples can be characterized.
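For reference, precision, recall, and F1 can be computed from the set of organisms known to be in a mock community and the set reported by a tool. The sketch below uses the standard definitions; the function name is hypothetical, and the taxonomic level at which organisms are compared in the benchmark is not specified here.

```python
from typing import Iterable, Tuple


def detection_scores(expected: Iterable[str],
                     detected: Iterable[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for a set of detected organisms."""
    expected, detected = set(expected), set(detected)
    true_pos = len(expected & detected)    # present and detected
    false_pos = len(detected - expected)   # detected but not present
    false_neg = len(expected - detected)   # present but missed
    precision = true_pos / (true_pos + false_pos) if detected else 0.0
    recall = true_pos / (true_pos + false_neg) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

A tool that reports many spurious organisms loses precision, and one that misses true members loses recall; F1 is high only when both are high.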

Robust output (right):

On the right, you can see that the Targeted Loci Database provides robust estimates of community diversity across different levels of sequencing depth. Across a group of 20 replicates each at 50K, 75K, 100K, and 250K reads per sample, the One Codex Targeted Loci analysis (blue) provided an estimate of community diversity that was more consistent between replicates, relative to the output of QIIME (green). Moreover, the community diversity reported by QIIME increased as more reads were added to a sample, while the Targeted Loci Database was robust to variation in sequencing depth, and closely matched the true number of organisms in the sample.
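One simple way to quantify this kind of robustness is to summarize the number of organisms detected per replicate at each sequencing depth. The sketch below assumes that "community diversity" is measured as the count of detected organisms (richness); the function name and input layout are hypothetical, and the actual metric behind the figure may differ.

```python
from statistics import mean, pstdev
from typing import Dict, List, Set, Tuple


def richness_by_depth(
    profiles: Dict[int, List[Set[str]]]
) -> Dict[int, Tuple[float, float]]:
    """Map each depth to (mean, standard deviation) of organisms detected.

    `profiles` maps a sequencing depth (reads per sample) to one set of
    detected organisms per replicate.
    """
    summary = {}
    for depth, replicates in profiles.items():
        richness = [len(organisms) for organisms in replicates]
        summary[depth] = (mean(richness), pstdev(richness))
    return summary
```

A robust method shows a small standard deviation within each depth and a mean that stays flat as depth increases, ideally near the true number of organisms in the mock community.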

Identifying microbial species

While 16S analysis is often performed at the genus level, it is vastly preferable to know which species are present in a community whenever possible. Although some species cannot be distinguished by 16S, we believe that a well-curated database and a sensitive detection algorithm will provide users with a greater ability to perform species-level detection. Looking at the species present in Mock Community A, we found that the One Codex Targeted Loci analysis was able to detect 15 out of the 22 total, while QIIME [8] only detected 7. While marker gene analysis doesn't always contain enough data for species-level assignment, we believe that the One Codex Targeted Loci analysis does a good job of identifying those species that do have a distinct 16S gene.

Using the Targeted Loci Database

  • To run the Targeted Loci Database on your One Codex samples, go to the Run Analysis page, select your samples of interest using the menu on the left, and then click the Run button for the Targeted Loci Database on the right. This analysis runs at no additional cost for all samples uploaded to the One Codex platform. You can find more details on starting new analyses here.

We analyze samples against the Targeted Loci Database using a strictly versioned computational environment, and so every result is completely (bitwise) reproducible. In addition, you can instantly perform cross-comparison across any collection of samples (including both WGS and 16S) without having to downsample or re-cluster. This rapid analysis and scalability allows you to quickly analyze new batches of samples as they come in, with full confidence that your results will be accurate and reproducible.

References

[1] 16S rDNA, 18S, 23S, 28S, 5S, ITS (Internal Transcribed Spacer), rpoB, and gyrB (DNA gyrase subunit B).
[2] Salter SJ, et al. BMC Biol. 2014, 12:87. DOI: 10.1186/s12915-014-0087-z
[3] Lusk RW. PLoS One. 2014, 9(10):e110808. DOI: 10.1371/journal.pone.0110808
[4] Merchant S, et al. PeerJ. 2014, 2:e675. DOI: 10.7717/peerj.675
[5] 16S: 1kb; 18S: 500bp; 23S: 500bp; 28S: 500bp; 5S: 100bp; ITS: 100bp; gyrB: 500bp; rpoB: 500bp
[6] Bokulich NA, et al. Nature Methods 2013, 10(1) 57-59; DOI: 10.1038/nmeth.2276
[7] Kopylova E, et al. mSystems 2016, 1(1) e00003-15; DOI: 10.1128/mSystems.00003-15
[8] QIIME (pick_open_reference_otus.py v1.9.1) was run with default settings, which include uclust for clustering and gg_13_8_otus/rep_set/97_otus.fasta as the default reference database.
[9] Mothur was run using "furthest neighbor" clustering (Kopylova, et al., 2016). That analysis pre-dated the release of OptiClust as the default option for clustering. Kopylova, et al., 2016 only presented mothur results for two of the three available test datasets due to time and memory constraints.
