One Codex Database

What's behind the One Codex Database

Written by Denise Lynch
Updated over a month ago

The One Codex Database (as of November 2024) consists of ~134K complete microbial assemblies (across 118K organisms), including approximately:

  • 75K viral assemblies

  • 55K bacterial assemblies

  • 1.6K fungal assemblies

  • 2K archaeal assemblies

  • and 11.3K Metagenome-Assembled Genomes (MAGs).

In the 2024 database, we've added chicken and rat genomes as hosts, in addition to the existing human and mouse genomes. We've also added the new telomere-to-telomere (T2T CHM13v2.0) human genome assembly along with the existing hg38 assembly.

You can find the complete list of organisms in our list of references. The database is assembled from both public and private sources, with a combination of automated and manual curation steps to remove low-quality or mislabeled records. Analysis against the One Codex Database provides:

  1. Highly accurate identification of microbes from genomic sequence data

  2. Precise quantification of microbial abundance using whole-genome shotgun (WGS) sequencing

  3. Community-wide characterization of complex microbial mixtures, including the human microbiome

Analyzing samples against the One Codex Database

Comparing a microbial sample against the One Codex Database consists of three sequential steps:

  1. K-mer based classification. Every individual sequence (NGS read or contig) is compared against the One Codex Database by exact alignment using k-mers where k=31 (see Ames et al., 2013 and Wood and Salzberg, 2014 for details on k-mer based classification).

  2. Artifact filtering. Based on the relative frequency of unique k-mers in the sample, sequencing artifacts are filtered out of the sample. This filtering should run automatically on most WGS data and does not eliminate low abundance or low confidence hits, only probable sequencing or reference genome artifacts.[1]

  3. Species-level abundance estimation. The relative abundance of each microbial species is estimated based on the depth and coverage of sequencing across every available reference genome.
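As a rough illustration of step 1, the sketch below shows exact k-mer matching against a tiny in-memory index. It is a toy under stated assumptions, not the One Codex implementation: the function names (`kmers`, `build_index`, `classify`) and the simple "most unambiguous hits wins" rule are hypothetical, and real classifiers also handle reverse complements and resolve shared k-mers through the taxonomy.

```python
from collections import Counter

K = 31  # k-mer length stated in the article


def kmers(seq, k=K):
    """Yield every overlapping k-mer in seq (uppercased)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references, k=K):
    """Map each k-mer to the set of reference labels that contain it."""
    index = {}
    for label, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(label)
    return index


def classify(read, index, k=K):
    """Return the label with the most unambiguous exact k-mer hits, or None.

    K-mers found in more than one reference are ignored here; this is a
    simplification made for the sketch.
    """
    hits = Counter()
    for kmer in kmers(read, k):
        labels = index.get(kmer)
        if labels and len(labels) == 1:
            hits[next(iter(labels))] += 1
    if not hits:
        return None
    return hits.most_common(1)[0][0]


if __name__ == "__main__":
    # Tiny made-up references; k=4 keeps the example readable.
    refs = {"species_A": "AAAACCCC", "species_B": "GGGGTTTT"}
    idx = build_index(refs, k=4)
    print(classify("AAACC", idx, k=4))  # species_A
```

A read shorter than k, or one with no k-mer in the index, is returned as unclassified (`None`).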

[1] Note: Users can access results without artifact filtering on the individual analysis pages by clicking "view unfiltered results". These raw results are not recommended and should only be used for diagnostic purposes or for comparison to pure read-level classifiers, e.g., Kraken. Please feel free to contact us if you believe you have a sample where an important taxon is not displayed in the filtered result set.
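To make the idea behind step 3 more concrete, here is a minimal sketch of one common, simplified way to turn per-genome read counts into relative abundances: estimate sequencing depth per genome, then normalize. This is an assumption for illustration only; the article does not specify the actual estimator, which also takes coverage into account. The function name, the parameters, and the fixed `read_length` are hypothetical.

```python
def relative_abundance(read_counts, genome_lengths, read_length=150):
    """Estimate relative abundance from depth (bases sequenced / genome length).

    read_counts:    {taxon: number of reads assigned to that taxon}
    genome_lengths: {taxon: reference genome length in bases}
    read_length:    assumed fixed read length in bases (hypothetical default)
    """
    depths = {}
    for taxon, n_reads in read_counts.items():
        # Average depth: total assigned bases divided by genome length.
        depths[taxon] = n_reads * read_length / genome_lengths[taxon]

    total = sum(depths.values())
    if total == 0:
        return {taxon: 0.0 for taxon in depths}
    # Normalize so the abundances sum to 1.
    return {taxon: depth / total for taxon, depth in depths.items()}
```

Normalizing by genome length matters because, at equal cell counts, an organism with a larger genome contributes more reads than one with a smaller genome.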

Increased accuracy with the One Codex Database

The figure below compares the latest version of One Codex against Kraken and MetaPhlAn using an in silico simulated sample from Segata et al. (2012). It shows how the latest version of the platform provides extremely accurate relative abundance estimates, while also substantially limiting the number of false positives when compared to previous k-mer based methods like Kraken.
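For readers who want to run a similar comparison on their own simulated samples, the sketch below scores an estimated abundance profile against a known truth. The metrics chosen here (false positives, false negatives, and summed absolute error) are assumptions for illustration and are not necessarily those used in the figure; `evaluate_profile` and `min_abundance` are hypothetical names.

```python
def evaluate_profile(truth, estimate, min_abundance=0.0):
    """Compare an estimated abundance profile with a known truth.

    truth, estimate: {taxon: relative abundance}
    min_abundance:   estimated taxa at or below this value are not counted as called
    """
    called = {t for t, a in estimate.items() if a > min_abundance}
    present = {t for t, a in truth.items() if a > 0}

    # Summed absolute difference over every taxon seen in either profile.
    taxa = set(truth) | set(estimate)
    l1_error = sum(abs(truth.get(t, 0.0) - estimate.get(t, 0.0)) for t in taxa)

    return {
        "false_positives": sorted(called - present),
        "false_negatives": sorted(present - called),
        "l1_error": l1_error,
    }
```

For example, with a truth of `{"A": 0.5, "B": 0.5}` and an estimate of `{"A": 0.5, "B": 0.25, "C": 0.25}`, taxon `C` is reported as a false positive and the summed absolute error is 0.5.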

References

Ames SK, Hysom DA, Gardner SN, Lloyd GS, Gokhale MB, Allen JE. Scalable metagenomic taxonomy classification using a reference genome database. Bioinformatics. 2013; 29(18):2253-60.
Segata N, Waldron L, Ballarini A, Narasimhan V, Jousson O, Huttenhower C. Metagenomic microbial community profiling using clade-specific marker genes. Nat Methods. 2012 Jun 10;9(8):811-4. doi: 10.1038/nmeth.2066.
Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014 Mar 3;15(3):R46. doi: 10.1186/gb-2014-15-3-r46.
