In order to be published on the portal, genomes must go through an extensive quality control (QC) process that includes both sequencing and assembly QC. To learn more about our quality control process, view this article.
To pass sequencing QC, a genome must have at least 1,000,000 Illumina reads with a median Q score of 30 or greater across all bases and a median Q score of 25 or greater per base. Additionally, ambiguous content (“N” bases) must be less than 5%.
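As an illustration only, the sequencing thresholds above could be checked with a small Python sketch like the one below. This is not the portal's actual pipeline: the names `SequencingStats` and `passes_sequencing_qc` are hypothetical, and the sketch assumes that "per base" means the median Q score at each base position along the reads and that the summary statistics have already been computed by an upstream tool.

```python
from dataclasses import dataclass
from typing import Sequence

# Thresholds taken from the sequencing QC requirements described above.
MIN_READS = 1_000_000
MIN_MEDIAN_Q_OVERALL = 30
MIN_MEDIAN_Q_PER_BASE = 25
MAX_N_FRACTION = 0.05  # ambiguous ("N") content must be below 5%


@dataclass
class SequencingStats:
    """Hypothetical container for precomputed read statistics."""
    read_count: int
    median_q_overall: float
    # Assumption: one median Q score per base position along the reads.
    median_q_per_position: Sequence[float]
    n_fraction: float  # fraction of bases called as "N", from 0.0 to 1.0


def passes_sequencing_qc(stats: SequencingStats) -> bool:
    """Return True only if every sequencing threshold is met."""
    return (
        stats.read_count >= MIN_READS
        and stats.median_q_overall >= MIN_MEDIAN_Q_OVERALL
        and all(q >= MIN_MEDIAN_Q_PER_BASE for q in stats.median_q_per_position)
        and stats.n_fraction < MAX_N_FRACTION
    )


if __name__ == "__main__":
    example = SequencingStats(
        read_count=2_500_000,
        median_q_overall=34,
        median_q_per_position=[36, 35, 33, 28],
        n_fraction=0.002,
    )
    print(passes_sequencing_qc(example))
```

Keeping the thresholds as named constants makes each one easy to compare against the written requirement.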
For assembly QC, we use CheckM, a tool that relies on a set of Hidden Markov Models (HMMs) from phylogenetically close reference genomes. CheckM determines whether the query assembly contains all the HMMs expected from the reference genomes (a percentage called “CheckM completeness”) and what percentage of the query’s HMMs differ in copy number or come from phylogenetically distant reference genomes (called “CheckM contamination”). We require final assemblies to have completeness values ≥ 95% and contamination values ≤ 5%. Additionally, all assemblies have an average of 100X Illumina coverage across the entire span of the genome.
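The assembly thresholds can be expressed in the same spirit. The following Python sketch is a hedged illustration, not the portal's implementation: `passes_assembly_qc` and `mean_coverage` are hypothetical helper names, the completeness and contamination percentages are assumed to have been read from CheckM's report beforehand, and coverage is estimated with the common approximation of total sequenced bases divided by genome length. Coverage is reported rather than used as a pass/fail criterion, since the text above states it as a property of the assemblies.

```python
# Thresholds taken from the assembly QC requirements described above.
MIN_COMPLETENESS = 95.0   # CheckM completeness, in percent
MAX_CONTAMINATION = 5.0   # CheckM contamination, in percent


def passes_assembly_qc(completeness: float, contamination: float) -> bool:
    """Check CheckM percentages (assumed already parsed) against the thresholds."""
    return completeness >= MIN_COMPLETENESS and contamination <= MAX_CONTAMINATION


def mean_coverage(total_sequenced_bases: int, genome_length: int) -> float:
    """Approximate average coverage as sequenced bases per genome base."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return total_sequenced_bases / genome_length


if __name__ == "__main__":
    # Illustrative values only, not real data.
    print(passes_assembly_qc(completeness=98.7, contamination=1.2))
    print(mean_coverage(total_sequenced_bases=500_000_000, genome_length=5_000_000))
```

For example, 500,000,000 sequenced bases over a 5,000,000 bp genome works out to an average of 100X.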
Learn how to perform a sequence search.