The One Codex Database (as of July 2023) consists of ~148K complete microbial genomes, including approximately:
25K viral species (72K viral genomes)
42K bacterial species (71K bacterial genomes)
1.3K eukaryote species including fungi (2K eukaryote genomes)
2K archaeal species (2.2K archaeal genomes)
and 1.6K mouse gut Metagenome-Assembled Genomes (MAGs).
The human and mouse genomes are included to screen out host reads, and you can find the complete list of organisms in our list of references. The database is assembled from both public and private sources, with a combination of automated and manual curation steps to remove low-quality or mislabeled records. Analysis against the One Codex Database provides:
Highly accurate identification of microbes from genomic sequence data
Precise quantification of microbial abundance using whole-genome shotgun (WGS) sequencing
Community-wide characterization of complex microbial mixtures, including the human microbiome
Analyzing samples against the One Codex Database
Comparing a microbial sample against the One Codex Database consists of three sequential steps:
K-mer based classification. Every individual sequence (NGS read or contig) is compared against the One Codex Database by exact alignment using k-mers where k=31 (see Ames et al., 2013 and Wood et al., 2014 for details on k-mer based classification).
Artifact filtering. Based on the relative frequency of unique k-mers in the sample, sequencing artifacts are filtered out of the sample. This filtering should run automatically on most WGS data and does not eliminate low abundance or low confidence hits, only probable sequencing or reference genome artifacts.[1]
Species-level abundance estimation. The relative abundance of each microbial species is estimated based on the depth and coverage of sequencing across every available reference genome.
[1] Note: Users can access results without artifact filtering on the individual analysis pages by clicking "view unfiltered results". These raw results are not recommended and should only be used for diagnostic purposes or for comparison to pure read-level classifiers, e.g., Kraken. Please feel free to contact us if you believe you have a sample where an important taxon is not displayed in the filtered result set.
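To make the first step more concrete, here is a minimal Python sketch of exact k-mer matching with k=31. It is an illustration only, not the One Codex implementation: the function names (kmers, build_index, classify), the toy reference sequences, and the rule of counting only k-mers unique to a single reference are all assumptions made for this example. Production classifiers such as Kraken also handle reverse complements and resolve shared k-mers through the taxonomy, which this sketch omits.

```python
from collections import Counter

K = 31  # k-mer length stated in the text


def kmers(seq, k=K):
    """Yield every overlapping k-mer of seq, skipping k-mers that contain N."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" not in kmer:
            yield kmer


def build_index(references, k=K):
    """Map each k-mer to the set of reference labels that contain it."""
    index = {}
    for label, genome in references.items():
        for kmer in kmers(genome, k):
            index.setdefault(kmer, set()).add(label)
    return index


def classify(read, index, k=K):
    """Return the label with the most exact k-mer hits unique to it.

    Simplification (assumption): k-mers shared by several references are
    ignored. Returns None when the read has no unambiguous hit.
    """
    hits = Counter()
    for kmer in kmers(read, k):
        labels = index.get(kmer)
        if labels and len(labels) == 1:
            hits[next(iter(labels))] += 1
    if not hits:
        return None
    return hits.most_common(1)[0][0]


# Toy, made-up sequences standing in for reference genomes.
references = {"species_A": "ACGT" * 20, "species_B": "AACCGGTT" * 10}
index = build_index(references)
print(classify(references["species_A"][5:50], index))  # species_A
```

A read that shares no 31-mer with any reference comes back as None, which corresponds to an unclassified sequence in this toy setup.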
Increased accuracy with the One Codex Database
The figure below compares the latest version of One Codex against Kraken and MetaPhlAn using an in silico simulated sample from Segata et al. (2012). It shows how the latest version of the platform provides extremely accurate relative abundance estimates, while also substantially limiting the number of false positives when compared to previous k-mer based methods like Kraken.
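As a rough illustration of how a comparison like this can be scored on a simulated sample with known composition, the sketch below computes two simple quantities: the summed absolute error in relative abundance and the list of reported taxa that are absent from the true mixture. The metric choice, the function names (l1_error, false_positives), and the numbers are assumptions for this example; they are not the values or the exact method behind the figure.

```python
def l1_error(estimated, truth):
    """Sum of absolute differences in relative abundance over all taxa."""
    taxa = set(estimated) | set(truth)
    return sum(abs(estimated.get(t, 0.0) - truth.get(t, 0.0)) for t in taxa)


def false_positives(estimated, truth, min_abundance=0.0):
    """Taxa reported above min_abundance that are not in the true mixture."""
    return sorted(
        taxon
        for taxon, abundance in estimated.items()
        if abundance > min_abundance and taxon not in truth
    )


# Made-up abundances for illustration only (fractions summing to 1).
truth = {"A": 0.5, "B": 0.3, "C": 0.2}
estimated = {"A": 0.48, "B": 0.32, "C": 0.15, "D": 0.05}

print(round(l1_error(estimated, truth), 2))  # 0.14
print(false_positives(estimated, truth))     # ['D']
```

Lower error and a shorter false-positive list both indicate a profile closer to the known composition of the simulated sample.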
References
Ames SK, Hysom DA, Gardner SN, Lloyd GS, Gokhale MB, Allen JE. Scalable metagenomic taxonomy classification using a reference genome database. Bioinformatics. 2013;29(18):2253-60.
Segata N, Waldron L, Ballarini A, Narasimhan V, Jousson O, Huttenhower C. Metagenomic microbial community profiling using clade-specific marker genes. Nat Methods. 2012 Jun 10;9(8):811-4. doi: 10.1038/nmeth.2066.
Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014 Mar 3;15(3):R46. doi: 10.1186/gb-2014-15-3-r46.