Bioinformatics details of ATCC Microbiome Standards

A detailed description of the bioinformatics workflow used for analyzing your ATCC® standards

Written by Denise Lynch
Updated over a week ago

UPDATE: 25th March, 2021

We have just launched an update to this analysis!

  • Samples are now compared against the genomic sequences of the isolates in each Microbiome standard, as sequenced and assembled by ATCC in conjunction with One Codex. Previous versions of the pipeline used public assemblies for these species.

  • We've updated how we calculate the Relative Abundance score for each control. The score is now based on a scaled Aitchison distance between the abundances detected in your sample and the expected abundances of the microbes in the control. Previous versions of the pipeline used Pearson correlation to calculate the Relative Abundance score.

If you need to run the previous analysis job on any of your new samples, or if you would like to run the new job on your old samples, reach out to us at support@onecodex.com, and we will be more than happy to help.

Bioinformatics Workflow

Every dataset is aligned against the set of organisms in the input mixtures, as well as the complete set of references in the One Codex Database (for WGS samples) or Targeted Loci Database (for 16S samples). The abundance of each detected organism is used to calculate the scores (ranging from 0-100%) for True Positives, Relative Abundance, and False Positives.

Bioinformatics Details

Read Alignment: Reads are aligned against the set of organisms in the input mixtures using BWA (v0.7.15). Paired-end reads are analyzed with the -p flag, and the resulting SAM file is processed with samtools fixmate (Samtools v1.4).
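
As a rough illustration of the alignment step, the sketch below builds and runs a two-stage pipeline from Python. It is not the One Codex pipeline itself. It assumes the `bwa mem` subcommand (the article names only BWA and the -p flag), an interleaved paired-end FASTQ file, a reference already indexed with `bwa index`, and both tools on the PATH. The function names and file names are hypothetical.

```python
import subprocess


def build_alignment_commands(reference_fasta, interleaved_fastq, output_bam):
    """Return argument lists for the aligner and the fixmate step.

    Assumption: `bwa mem` is the subcommand in use; -p tells it that the
    single FASTQ input contains interleaved paired-end reads.
    """
    bwa_cmd = ["bwa", "mem", "-p", reference_fasta, interleaved_fastq]
    # "-" makes samtools read the alignments from standard input.
    fixmate_cmd = ["samtools", "fixmate", "-", output_bam]
    return bwa_cmd, fixmate_cmd


def run_alignment(reference_fasta, interleaved_fastq, output_bam):
    """Pipe the aligner output straight into samtools fixmate."""
    bwa_cmd, fixmate_cmd = build_alignment_commands(
        reference_fasta, interleaved_fastq, output_bam
    )
    bwa = subprocess.Popen(bwa_cmd, stdout=subprocess.PIPE)
    subprocess.run(fixmate_cmd, stdin=bwa.stdout, check=True)
    bwa.stdout.close()
    if bwa.wait() != 0:
        raise RuntimeError("bwa exited with a non-zero status")
```

Calling `run_alignment("standard_genomes.fasta", "sample.fastq", "sample.fixmate.bam")` would write the mate-fixed alignments to the named output file; the aligner emits read pairs together, which is the grouping that fixmate expects.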

Reference Databases: More details can be found here for the One Codex Database (used to analyze WGS files) and the Targeted Loci Database (used to analyze 16S files).

Scoring Details

  • True Positives: The True Positive Score is the percentage of the organisms in the control that are marked as "Present" in your sample. An organism is marked as "Present" if it is detected within two logs of its true abundance. The analysis also reports two figures for each organism: the "proportion of true positives", which is the number of reads mapping to an individual genome divided by the number of reads mapping to any true positive organism, and the "proportion of reads", which is the proportion of total reads in the sample mapping to a genome, adjusted for genome size.

  • Relative Abundance: A score between 0 and 100%, based on a scaled Aitchison distance between the detected organism abundances and the known input abundances (based on genome-size adjusted read counts).

  • False Positives: The False Positive Score is 100% less 10 percentage points for each "High" abundance false positive, 5 points for each "Moderate" one, and 1 point for each "Low" one. "Trace" false positives do not count against the score, and the minimum possible score is 0%.
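
The three scores above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production scoring code. "Genome-size adjusted" is taken to mean dividing read counts by genome length before normalizing, "two logs" is taken as a factor of 100 (log10), and a small pseudocount stands in for undetected organisms in the Aitchison distance. The scaling that turns the distance into a 0 to 100% score and the cutoffs for the "High", "Moderate", "Low", and "Trace" levels are not described in this article, so the sketch stops at the raw distance and takes the levels as input. All function names are hypothetical.

```python
import math

# Penalty in percentage points for each false positive level, as listed above.
FALSE_POSITIVE_PENALTIES = {"High": 10, "Moderate": 5, "Low": 1, "Trace": 0}


def genome_adjusted_abundances(read_counts, genome_sizes):
    """Assumption: divide reads by genome size, then normalize to sum to 1."""
    adjusted = {org: read_counts[org] / genome_sizes[org] for org in read_counts}
    total = sum(adjusted.values())
    return {org: value / total for org, value in adjusted.items()}


def true_positive_score(observed, expected, max_log10_diff=2.0):
    """Percentage of expected organisms detected within two logs of expected."""
    present = 0
    for org, expected_abundance in expected.items():
        observed_abundance = observed.get(org, 0.0)
        if observed_abundance <= 0:
            continue  # not detected at all
        if abs(math.log10(observed_abundance / expected_abundance)) <= max_log10_diff:
            present += 1
    return 100.0 * present / len(expected)


def aitchison_distance(observed, expected, pseudocount=1e-6):
    """Euclidean distance between centered log-ratio (CLR) transforms.

    Assumption: the pseudocount keeps undetected organisms from producing log(0).
    """
    organisms = sorted(expected)

    def clr(abundances):
        logs = [math.log(abundances.get(org, 0.0) + pseudocount) for org in organisms]
        mean_log = sum(logs) / len(logs)
        return [value - mean_log for value in logs]

    pairs = zip(clr(observed), clr(expected))
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs))


def false_positive_score(levels):
    """Start at 100 and subtract the penalty for each false positive, floor at 0."""
    penalty = sum(FALSE_POSITIVE_PENALTIES[level] for level in levels)
    return max(0, 100 - penalty)
```

For example, `false_positive_score(["High", "Moderate", "Low", "Trace"])` returns 84, since 10 + 5 + 1 + 0 points are deducted from 100.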
