One question we often get regards the differences between readcounts and abundances, and what "with children" means. These are key metrics in your classification results, so we will break them down here.
Readcounts
When a sample is uploaded to One Codex, we automatically classify each read in the sample as specifically as we can, using the One Codex Database. Classification uses a k-mer-based approach: we classify the sub-strings (k-mers) of each read, then aggregate those k-mer classifications to find the most specific and parsimonious classification for the read.
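To make the idea of aggregating k-mer classifications concrete, here is a minimal sketch. It is not the One Codex classifier: the taxonomy, the k-mer index, the value of k, and the aggregation rule (drop hits that are ancestors of other hits, then take the lowest common ancestor of what remains) are all assumptions chosen for illustration.

```python
# Toy taxonomy: child -> parent (hypothetical, for illustration only).
PARENT = {
    "B. fragilis": "Bacteroides",
    "B. intestinalis": "Bacteroides",
    "Bacteroides": "root",
}

# Toy k-mer index: k-mer -> most specific taxon it identifies (made-up data).
KMER_INDEX = {
    "ACGT": "B. fragilis",
    "CGTA": "B. fragilis",
    "GTAC": "Bacteroides",
    "TTTT": "B. intestinalis",
}


def lineage(taxon):
    """Return the path from a taxon up to the root, starting with the taxon."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lowest_common_ancestor(a, b):
    """Return the most specific taxon whose lineage contains both a and b."""
    ancestors_of_a = set(lineage(a))
    for taxon in lineage(b):
        if taxon in ancestors_of_a:
            return taxon
    return "root"


def classify_read(read, k=4):
    """Classify a read from its k-mer hits; return None if nothing matches."""
    kmers = (read[i:i + k] for i in range(len(read) - k + 1))
    hits = {KMER_INDEX[kmer] for kmer in kmers if kmer in KMER_INDEX}
    if not hits:
        return None
    # Assumed rule: a genus-level hit is consistent with a species-level hit
    # from the same genus, so drop hits that are ancestors of another hit.
    tips = [t for t in hits
            if not any(t != other and t in lineage(other) for other in hits)]
    # Conflicting tips are resolved to their lowest common ancestor.
    result = tips[0]
    for taxon in tips[1:]:
        result = lowest_common_ancestor(result, taxon)
    return result


print(classify_read("ACGTAC"))   # species and genus k-mers agree
print(classify_read("ACGTTTT"))  # k-mers from two species conflict
```

In this toy setup the first read resolves to a species, while the second read, whose k-mers point to two different species, can only be placed at the genus they share.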
Not all reads can be classified to species or strain rank. This is usually because the part of the genome from which they were sequenced is conserved between different organisms (e.g. housekeeping genes are often highly conserved between species). Where a read cannot be classified as specifically as species-rank, we will instead classify that read to the most-specific rank that we can (e.g. Genus, Family, or even higher taxonomic ranks).
In a sample, the "Readcounts" for a taxon is the number of reads classified to exactly that taxon at its own rank. Reads classified more specifically, to one of its descendants, are not included.
Readcounts with Children
Given that "Readcounts" reports only the reads that map to the taxon at its own rank, you may also want to include reads that were classified more specifically. The "Readcounts with Children" metric is the sum of all of the reads that were classified to the specified taxon at its rank, plus all of the reads that were classified to more specific ranks belonging to that taxon (i.e. its children).
For instance, in Figure 1, for the genus Bacteroides, we see that 60 reads were classified at the genus rank (i.e. Readcounts). However, a total of 1500 reads were classified to Bacteroides and all of the species and strains that belong to that genus (i.e. Readcounts with Children). The species "Bacteroides intestinalis" and "Bacteroides fragilis", as well as the strains "Bacteroides fragilis HMW 610" and "Bacteroides fragilis 638R", are all "children" of the Bacteroides genus.
Figure 1: Readcounts and Readcounts with Children example.
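The relationship between the two metrics can be sketched as a recursive sum over the taxonomy. In the example below, only the genus-level readcount (60) and the genus total (1500) come from the Bacteroides example; the species and strain counts are invented so that the totals add up, and the dictionary layout is an assumption rather than the format of One Codex results.

```python
# Reads classified directly to each taxon. Only the genus value (60) is taken
# from the text; the species and strain counts are hypothetical.
READCOUNTS = {
    "Bacteroides": 60,
    "Bacteroides intestinalis": 400,
    "Bacteroides fragilis": 640,
    "Bacteroides fragilis HMW 610": 250,
    "Bacteroides fragilis 638R": 150,
}

# Parent -> direct children in the taxonomy.
CHILDREN = {
    "Bacteroides": ["Bacteroides intestinalis", "Bacteroides fragilis"],
    "Bacteroides fragilis": [
        "Bacteroides fragilis HMW 610",
        "Bacteroides fragilis 638R",
    ],
}


def readcount_with_children(taxon):
    """Reads classified to the taxon itself plus all of its descendants."""
    own = READCOUNTS.get(taxon, 0)
    return own + sum(readcount_with_children(child)
                     for child in CHILDREN.get(taxon, []))


print(readcount_with_children("Bacteroides"))           # 1500
print(readcount_with_children("Bacteroides fragilis"))  # 1040
```

A taxon with no children has the same value for both metrics, which is why the two numbers typically differ only above the strain rank.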
Abundances
Readcounts and Readcounts with Children are very useful metrics, particularly when determining whether we can detect low-abundance organisms. However, genome sizes vary. An extreme example is comparing virus genomes (very small) with fungal genomes (very large). In a scenario where there are exactly 100 fungal cells (species A) and 100 viruses (species B), you would expect many more reads for the fungus, on account of its much larger genome (more DNA to sequence!).
The ability to distinguish which reads belong to a given species also varies. For instance, some species have very few differences in their genomic content, so there will be very few k-mers that allow you to differentiate species A from species B. In a different genus, the species might be more distinguishable, so there will be more k-mers that can be used to identify species C and differentiate it from species D.
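A rough calculation shows how large this effect can be. The sketch below assumes reads are sampled uniformly from all DNA in the sample, so each organism's expected share of reads is proportional to its count multiplied by its genome size. The genome sizes (30 Mb for the fungus, 30 kb for the virus) are hypothetical round numbers, not measurements.

```python
def expected_read_fractions(counts, genome_size_bp):
    """Expected share of reads per organism under uniform sampling of DNA."""
    dna = {name: counts[name] * genome_size_bp[name] for name in counts}
    total = sum(dna.values())
    return {name: amount / total for name, amount in dna.items()}


# 100 fungal cells and 100 viruses, as in the scenario above.
counts = {"fungus (species A)": 100, "virus (species B)": 100}
# Hypothetical genome sizes in base pairs.
genome_size_bp = {"fungus (species A)": 30_000_000, "virus (species B)": 30_000}

for name, share in expected_read_fractions(counts, genome_size_bp).items():
    print(f"{name}: {share:.4%} of reads")
```

With these assumed sizes the fungus is expected to yield about 1000 times as many reads as the virus, even though both are present in equal numbers.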
With that in mind, we have developed an Abundance estimate. It takes into account the approximate genome sizes and the number of k-mers that uniquely identify a species: we adjust the readcounts based on the number of unique k-mers in that species' genome. This gives us an estimate of the proportion of cells in a sample that belong to that species.
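The general idea of adjusting readcounts can be illustrated with a deliberately simplified sketch. This is not the One Codex abundance model: it assumes that dividing each species' readcount by its number of unique k-mers and then rescaling to sum to 1 is a reasonable stand-in, and the readcounts and k-mer counts are made up.

```python
def relative_abundance(readcounts, unique_kmers):
    """Scale readcounts by unique k-mer counts, then normalize to sum to 1.

    Simplified illustration only; the real model is more involved.
    """
    per_kmer = {sp: readcounts[sp] / unique_kmers[sp] for sp in readcounts}
    total = sum(per_kmer.values())
    return {sp: value / total for sp, value in per_kmer.items()}


# Hypothetical inputs: species A has more reads, but also more unique k-mers.
readcounts = {"species A": 9000, "species B": 1000}
unique_kmers = {"species A": 3_000_000, "species B": 1_000_000}

for species, abundance in relative_abundance(readcounts, unique_kmers).items():
    print(f"{species}: {abundance:.1%}")
```

In this example species A accounts for 90% of the reads but only about 75% of the estimated abundance, because it has three times as many unique k-mers available to attract reads.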
Note that abundances are estimated at the species level. The abundance of a taxon at a higher rank (genus, family, and so on) is the sum of the abundances of the species that belong to it.
Our abundance estimate uses a proprietary mathematical model. If the reads (or k-mers) that classify to a species are unevenly distributed across the species' genome (i.e. we do not see a good spread or coverage across the genome), then we may not be able to estimate that species' abundance. We also will not report abundances for species below 0.01% of the sample.
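Rolling species-level abundances up to a higher rank is a simple grouped sum. The sketch below assumes a hypothetical species-to-genus mapping and made-up abundance values; it does not reflect the structure of One Codex result files.

```python
from collections import defaultdict


def abundance_by_genus(species_abundance, genus_of):
    """Sum species-level abundances within each genus."""
    totals = defaultdict(float)
    for species, abundance in species_abundance.items():
        totals[genus_of[species]] += abundance
    return dict(totals)


# Hypothetical species-level abundances (as proportions of the sample).
species_abundance = {
    "Bacteroides fragilis": 0.40,
    "Bacteroides intestinalis": 0.10,
    "Escherichia coli": 0.50,
}
# Hypothetical lookup from species to genus.
genus_of = {
    "Bacteroides fragilis": "Bacteroides",
    "Bacteroides intestinalis": "Bacteroides",
    "Escherichia coli": "Escherichia",
}

print(abundance_by_genus(species_abundance, genus_of))
```

The same function works for family or any other rank if the lookup maps each species to its taxon at that rank.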
Due to copy number differences for various common targeted loci, we do not estimate abundances in the Targeted Loci classification pipeline.
What's next?
Learn more about the One Codex Database.