We are often asked about the difference between readcounts and abundances, and what "with children" means. These are key metrics in your classification results, so we will break them down here.
Readcounts
When a sample is uploaded to One Codex, we automatically classify each read in the sample as specifically as we can, using the One Codex Database. This uses a k-mer-based approach: we classify substrings (k-mers) of each read, and then aggregate the k-mer classifications to find the most specific and parsimonious classification for that read.
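To make the idea of aggregating k-mer classifications concrete, here is a minimal sketch. It is not the One Codex classifier: the toy taxonomy, the `kmer_db` lookup table, and the aggregation rule (take the deepest taxon when all k-mer hits agree on one lineage, otherwise fall back to their lowest common ancestor) are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "Bacteroidaceae": None,
    "Bacteroides": "Bacteroidaceae",
    "Bacteroides fragilis": "Bacteroides",
    "Bacteroides intestinalis": "Bacteroides",
}


def lineage(taxon: str) -> List[str]:
    """Return the taxon followed by its ancestors, up to the root."""
    path: List[str] = []
    node: Optional[str] = taxon
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def kmers(read: str, k: int) -> List[str]:
    """Return every overlapping substring of length k in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def classify_read(read: str, kmer_db: Dict[str, str], k: int) -> Optional[str]:
    """Aggregate per-k-mer hits into one taxon for the read (illustrative rule)."""
    hits = {kmer_db[km] for km in kmers(read, k) if km in kmer_db}
    if not hits:
        return None  # no k-mer in the read was recognised
    deepest = max(hits, key=lambda t: len(lineage(t)))
    if all(h in lineage(deepest) for h in hits):
        return deepest  # every hit lies on one lineage: keep the most specific
    # Hits disagree: fall back to the lowest common ancestor of all hits.
    shared = set(lineage(deepest))
    for h in hits:
        shared &= set(lineage(h))
    return max(shared, key=lambda t: len(lineage(t)))


# Hypothetical k-mer database with k = 4.
kmer_db = {
    "ACGT": "Bacteroides",
    "CGTA": "Bacteroides fragilis",
    "GGGG": "Bacteroides intestinalis",
}
species_read = classify_read("ACGTA", kmer_db, k=4)
ambiguous_read = classify_read("CGTAGGGG", kmer_db, k=4)
```

In this toy example the first read is assigned to the species, while the second contains k-mers from two different species and can only be assigned to their shared genus.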
Not all reads can be classified to species or strain rank. This is usually because the part of the genome from which they were sequenced is conserved between different organisms (e.g. housekeeping genes are often highly conserved between species). Where a read cannot be classified as specifically as species-rank, we will instead classify that read to the most-specific rank that we can (e.g. Genus, Family, or even higher taxonomic ranks).
In a sample, the "Readcounts" for a taxon are the number of reads classified to exactly that taxon at its rank, and no more specifically.
Readcounts with Children
Given that "Readcounts" reports only the reads that map to the taxon at the chosen rank, you may also want to include reads that could be classified more specifically. The "Readcounts with Children" metric is the sum of all of the reads that were classified to the specified taxon at its rank, plus all of the reads that were classified to more specific ranks belonging to that taxon (i.e. its children).
For instance, in Figure 1, for the genus Bacteroides, we see that 60 reads were classified at the genus rank (i.e. Readcounts). However, a total of 1500 reads were classified to Bacteroides and all of the species and strains that belong to that genus (i.e. Readcounts with Children). The species "Bacteroides intestinalis" and "Bacteroides fragilis", as well as the strains "Bacteroides fragilis HMW 610" and "Bacteroides fragilis 638R", are all "children" of the Bacteroides genus.
Figure 1: Readcount and Readcount with Children example.
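The relationship between the two metrics can be sketched as a sum over a taxonomy tree. In the sketch below, only the 60 genus-level reads and the total of 1500 come from the example above; the per-species and per-strain counts are hypothetical values chosen so that they add up to that total.

```python
from collections import defaultdict
from typing import Dict, List

# Taxonomy from the Bacteroides example: child -> parent.
PARENT: Dict[str, str] = {
    "Bacteroides intestinalis": "Bacteroides",
    "Bacteroides fragilis": "Bacteroides",
    "Bacteroides fragilis HMW 610": "Bacteroides fragilis",
    "Bacteroides fragilis 638R": "Bacteroides fragilis",
}

# Reads classified to exactly each taxon ("Readcounts").
# Only the genus value (60) is from the example; the rest are hypothetical.
READCOUNTS: Dict[str, int] = {
    "Bacteroides": 60,
    "Bacteroides intestinalis": 400,
    "Bacteroides fragilis": 700,
    "Bacteroides fragilis HMW 610": 200,
    "Bacteroides fragilis 638R": 140,
}

# Invert the parent map so we can walk down the tree.
CHILDREN: Dict[str, List[str]] = defaultdict(list)
for child, parent in PARENT.items():
    CHILDREN[parent].append(child)


def readcount_with_children(taxon: str) -> int:
    """Reads assigned to the taxon itself plus all of its descendants."""
    own = READCOUNTS.get(taxon, 0)
    return own + sum(readcount_with_children(c) for c in CHILDREN.get(taxon, []))


genus_total = readcount_with_children("Bacteroides")
fragilis_total = readcount_with_children("Bacteroides fragilis")
```

With these numbers, the genus has a Readcount of 60 but a Readcount with Children of 1500, and Bacteroides fragilis has a Readcount of 700 but a Readcount with Children of 1040 once its two strains are included.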
Abundances
Readcounts and Readcounts with Children are very useful metrics, particularly when determining whether we can detect low-abundance organisms. However, genome sizes vary. An extreme example is the difference between viral genomes (very small) and fungal genomes (very large). In a scenario where there are exactly 100 fungal cells (species A) and 100 viruses (species B), you would expect many more reads from the fungus, on account of its much larger genome (more DNA to sequence!).
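As a rough illustration of the genome-size effect, the sketch below computes the expected share of reads for each organism. The genome sizes (40 Mb for the fungus, 10 kb for the virus) are hypothetical round numbers, and the calculation assumes reads are sampled in proportion to the total amount of DNA each organism contributes, ignoring extraction and library-preparation biases.

```python
from typing import Dict


def expected_read_fractions(
    cells: Dict[str, int], genome_size_bp: Dict[str, int]
) -> Dict[str, float]:
    """Expected share of reads, assuming reads are proportional to total DNA."""
    dna = {name: cells[name] * genome_size_bp[name] for name in cells}
    total = sum(dna.values())
    return {name: amount / total for name, amount in dna.items()}


# Equal numbers of cells/particles, very different (hypothetical) genome sizes.
cells = {"fungus (species A)": 100, "virus (species B)": 100}
genome_size_bp = {"fungus (species A)": 40_000_000, "virus (species B)": 10_000}

fractions = expected_read_fractions(cells, genome_size_bp)
for name, share in fractions.items():
    print(f"{name}: {share:.4%} of reads")
```

Under these assumptions the fungus accounts for 4000 of every 4001 reads, even though both organisms are present in equal numbers.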
The ability to distinguish which reads belong to a given species also varies. For instance, some species differ very little in their genomic content, so there will be very few k-mers that allow you to differentiate species A from species B. In a different genus, the species might be more distinguishable, so there will be more k-mers that can be used to identify species C and differentiate it from species D.
With that in mind, we have developed an Abundance estimate. This takes into account the approximate genome sizes and the number of k-mers that uniquely identify a species. We adjust the readcounts based on the number of unique k-mers in that species' genome. This gives us an estimate of the proportion of cells in a sample that belong to that species.
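The adjustment described above can be illustrated with a deliberately simplified sketch: divide each species' readcount by its number of unique k-mers, then rescale so the values sum to 1. This is an assumption about the general shape of the calculation, not the actual One Codex model, and the readcounts and k-mer counts are hypothetical.

```python
from typing import Dict


def estimate_abundances(
    readcounts: Dict[str, int], unique_kmers: Dict[str, int]
) -> Dict[str, float]:
    """Normalise readcounts by unique k-mer count, then rescale to sum to 1."""
    adjusted = {sp: readcounts[sp] / unique_kmers[sp] for sp in readcounts}
    total = sum(adjusted.values())
    return {sp: value / total for sp, value in adjusted.items()}


# Hypothetical inputs: species A has more reads, but also far more unique k-mers.
readcounts = {"species A": 9_000, "species B": 1_000}
unique_kmers = {"species A": 3_000_000, "species B": 1_000_000}

abundances = estimate_abundances(readcounts, unique_kmers)
read_share_a = readcounts["species A"] / sum(readcounts.values())
```

In this toy case species A has 90% of the reads but an estimated abundance of about 75%, because each of its cells contributes more identifiable sequence than a cell of species B.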
It's not always possible to estimate abundances for a sample, or for some species within a sample. See below for further details.
Note that abundances are estimated at the species level. The abundance at a higher taxonomic rank (genus, family, and so on) is the sum of the species-level abundances that belong to that taxon.
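A minimal sketch of this roll-up is shown below. The species-level abundances and the species-to-genus mapping are hypothetical.

```python
from collections import defaultdict
from typing import Dict

# Hypothetical species-level abundance estimates (fractions of the sample).
species_abundance: Dict[str, float] = {
    "Bacteroides fragilis": 0.5,
    "Bacteroides intestinalis": 0.25,
    "Escherichia coli": 0.25,
}

# Hypothetical mapping from each species to its genus.
genus_of: Dict[str, str] = {
    "Bacteroides fragilis": "Bacteroides",
    "Bacteroides intestinalis": "Bacteroides",
    "Escherichia coli": "Escherichia",
}


def roll_up(abundance: Dict[str, float], parent_of: Dict[str, str]) -> Dict[str, float]:
    """Sum species-level abundances into their parent taxa."""
    totals: Dict[str, float] = defaultdict(float)
    for species, value in abundance.items():
        totals[parent_of[species]] += value
    return dict(totals)


genus_abundance = roll_up(species_abundance, genus_of)
```

Here the genus Bacteroides receives the combined abundance of its two species (0.75), and the same function could be applied again with a genus-to-family mapping to reach higher ranks.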
Filtered vs unfiltered reads
Our classification pipeline employs a read-level filter to remove species that appear to be spurious or low-quality. This filter is based on statistical modelling of the coverage and depth of reads across a genome, with the assumption that the data was generated using whole-genome sequencing (WGS). For example, genomes with coverage that looks incomplete or "spiky" may be excluded. The statistical model also takes sequencing depth into account, to avoid filtering out species that are genuinely present at low proportions in the sample.
By default, if filtering was successful, we will display the filtered reads on the classification results page. You can view unfiltered reads by clicking "Switch to view the unfiltered reads" at the top-right of the results page.
If you don't see the option to view unfiltered reads, this means that filtering was not successful. In such samples, you are already viewing unfiltered results.
Q: Why don't I see abundances in my results?
There are a number of reasons why our abundance estimates don't show for some samples.
The Abundance Estimate is a measure of sample composition that accounts for biases due to genome size and taxonomic uncertainty. Abundance estimates more closely resemble the composition of cells in the sample. The One Codex classifier builds a model of species-level composition from read-level data, and then evaluates that model to determine fit. If the fit is determined to be poor, then abundance estimates are rejected. In this case, the proportion of total reads is used (which still benefits from the above filtering steps).
If abundance estimates do not meet the quality threshold, then we also display the unfiltered results as the WGS assumption may not be valid in this case.
Abundances Estimated, but not for Some Species
If you see abundances for some species but not others, this is usually because we don't have enough reads or k-mers mapping to those species. Alternatively, the reads may be spread very unevenly across the genome, such that the data doesn't fit the model used for abundance estimation. We only report abundance estimates for species at or above 0.01%.
Amplicon Sequencing, Enrichment Panels and Other Library Preparation / Sequencing Approaches
Some examples of sample types that typically do not meet our WGS criteria include samples prepared using amplicons (e.g., 16S rRNA and/or ITS), targeted enrichment panels such as the Twist Comprehensive Pan Viral Panel, or RNA sequencing. For those sample types, alternate analysis workflows may be more suitable.
What's next?
Learn more about the One Codex Database.


