In order to be published on the portal, genomes must go through an extensive quality control (QC) process that includes both sequencing and assembly QC. To learn more about our quality control process, view this article.
Sequence Quality Control
To pass sequencing QC, a genome must have a minimum of 1,000,000 Illumina reads with a median Q score of 30 or greater across all bases and a median Q score of 25 or greater per base. Additionally, ambiguous content (“N” bases) must make up less than 5% of the sequence data.
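The thresholds above can be expressed as a simple pass/fail check. The following is a minimal sketch, not the portal's actual implementation: the function name, its parameters, and the reading of "per base" as "per read position" are assumptions made for illustration.

```python
# Thresholds taken from the text above.
MIN_READS = 1_000_000
MIN_MEDIAN_Q_ALL_BASES = 30
MIN_MEDIAN_Q_PER_BASE = 25
MAX_N_FRACTION = 0.05  # ambiguous content must be below 5%


def passes_sequencing_qc(read_count, median_q_all_bases,
                         median_q_per_position, n_fraction):
    """Return True if a sequencing run meets every threshold.

    Hypothetical inputs:
      read_count            -- number of Illumina reads
      median_q_all_bases    -- median Q score over all bases in the run
      median_q_per_position -- median Q score at each read position
                               (assumed meaning of "per base")
      n_fraction            -- fraction of bases called as "N" (0.0 to 1.0)
    """
    return (
        read_count >= MIN_READS
        and median_q_all_bases >= MIN_MEDIAN_Q_ALL_BASES
        and all(q >= MIN_MEDIAN_Q_PER_BASE for q in median_q_per_position)
        and n_fraction < MAX_N_FRACTION
    )
```

For example, a run with 2,000,000 reads, an overall median Q of 34, per-position medians that never drop below 27, and 0.1% ambiguous bases would pass this check.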
Genome Assembly Quality Control
Bacterial Completeness and Contamination
For bacterial assembly QC, we use CheckM, a tool that uses a set of Hidden Markov Models (HMMs) from phylogenetically close reference genomes. CheckM determines whether the query assembly contains all of the HMMs expected from the reference genomes (a percentage called “CheckM completeness”) and what percentage of the query’s HMMs differ in copy number or come from reference genomes that are phylogenetically distant (called “CheckM contamination”). We require final assemblies to have completeness values ≥ 95% and contamination values ≤ 5%. Additionally, all assemblies have an average of 100X Illumina coverage across the entire span of the genome.
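As an illustration, the completeness and contamination thresholds can be applied to values already reported by CheckM. This sketch does not call CheckM itself; the function names are hypothetical, and the coverage helper uses the common approximation of total sequenced bases divided by genome length, which is an assumption rather than the portal's stated method.

```python
def passes_checkm_qc(completeness, contamination,
                     min_completeness=95.0, max_contamination=5.0):
    """Apply the stated thresholds to CheckM percentages (0 to 100)."""
    return completeness >= min_completeness and contamination <= max_contamination


def mean_coverage(total_sequenced_bases, genome_length):
    """Approximate average depth as sequenced bases per genome base.

    Assumption: a simple ratio, ignoring unmapped or filtered reads.
    """
    return total_sequenced_bases / genome_length
```

Under this approximation, 500 Mbp of Illumina data over a 5 Mbp genome corresponds to an average of 100X coverage.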
Mycology Completeness and Quality Assessment
For mycology genomes, we estimate completeness using BUSCO. BUSCO is a combined tool and database, widely used in the mycology field, that checks for the presence of a selection of universal single-copy orthologs to calculate completeness quantitatively. We use fungi-specific databases in which each ortholog must be identified in at least 90% of the fungal species, and no single-copy ortholog can be entirely missing from any sub-clade in the databases. Unlike CheckM, BUSCO does not calculate percent contamination. We require fungal assemblies to have a completeness value of ≥ 80%.
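BUSCO reports its results as a one-line summary such as `C:98.5%[S:98.0%,D:0.5%],F:0.5%,M:1.0%,n:758`, where C is the percentage of complete orthologs. The sketch below parses that line and applies the 80% threshold. It assumes this summary format, which may vary between BUSCO versions; the function names are hypothetical and the example values are made up for illustration.

```python
import re


def parse_busco_summary(line):
    """Extract the percentage fields (C, S, D, F, M) from a BUSCO summary line.

    Assumes the short-summary notation, e.g.
    "C:98.5%[S:98.0%,D:0.5%],F:0.5%,M:1.0%,n:758".
    """
    fields = re.findall(r"([CSDFM]):([\d.]+)%", line)
    if not fields:
        raise ValueError(f"no BUSCO percentages found in: {line!r}")
    return {key: float(value) for key, value in fields}


def passes_busco_qc(line, min_completeness=80.0):
    """Return True if the complete (C) percentage meets the threshold."""
    return parse_busco_summary(line)["C"] >= min_completeness
```

Only the C value is compared against the threshold here; the single-copy (S), duplicated (D), fragmented (F), and missing (M) percentages are parsed for reference but not used in the decision.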
Viral Genome Assembly and Quality Assessment
Because viruses are co-cultured with their host, viral DNA or RNA sequencing data may contain reads from both the host and the virus, and de novo assemblies may contain contigs from both the host and viral genomes. To produce an assembly containing contigs of a single virus, host reads can be removed or contigs can be binned taxonomically. Taxonomic binning is performed by aligning reads or contigs to One Codex’s curated NCBI Reference Sequence Database. Reads or contigs that align to “cellular organisms” are binned as nonviral, while those that do not are binned as viral. In our approach, high-quality viral reads are assembled de novo using SPAdes. To obtain complete assemblies for a single virus, we use the contig binning approach. Contigs that align to the Escherichia coli bacteriophage Phi-X 174 genome are excluded because Phi-X 174 is used as a DNA spike-in for Illumina sequencing.
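The binning rule described above can be sketched as follows. This is an illustrative outline only: it assumes each contig has already been aligned and annotated with a taxonomic lineage and a Phi-X 174 flag, and the `ContigHit` structure, its field names, and the example lineages are hypothetical rather than part of the One Codex pipeline.

```python
from dataclasses import dataclass


@dataclass
class ContigHit:
    """Hypothetical record of one contig's best alignment."""
    name: str
    lineage: tuple          # taxonomic names from the root down
    is_phix: bool = False   # True if the contig aligns to Phi-X 174


def bin_contig(hit):
    """Assign a contig to 'excluded', 'nonviral', or 'viral'."""
    if hit.is_phix:
        # Phi-X 174 is a sequencing spike-in, not part of the sample.
        return "excluded"
    if "cellular organisms" in hit.lineage:
        return "nonviral"
    return "viral"


def viral_contigs(hits):
    """Keep only the names of contigs binned as viral."""
    return [hit.name for hit in hits if bin_contig(hit) == "viral"]
```

Checking the Phi-X 174 flag first matters because a bacteriophage does not fall under “cellular organisms” and would otherwise be binned as viral.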
In addition to the problem of taxonomic binning, viral genomes are diverse in structure, and many viruses have genomes divided into multiple segments; the genome of Influenza A virus, for example, consists of 8 separate RNA segments. To determine whether an assembly contains all the necessary segments, we constructed a curated database of complete viral genomes and segment information. After taxonomic binning, contigs are aligned to the Viral Genomes-NCBI-NIH database to assign segment labels, segment depth, and percent identity to the closest reference.
As further quality control, any viral assemblies that do not align to any reference in the Viral Genomes-NCBI-NIH database, or any viral assemblies whose segments align to different references (which may indicate contamination), are not published as is. They may undergo additional manual assembly and curation to become publishable.
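The two conditions above, no reference match or segments matching different references, can be sketched as a single check. This is a simplified illustration under stated assumptions: the input is a hypothetical mapping from each segment label to the identifier of the reference genome it aligned to (or `None` if it did not align), and partially aligned assemblies are judged only on the segments that did align. The identifiers in the usage example are placeholders.

```python
def needs_manual_curation(segment_refs):
    """Return True if an assembly should be held for manual curation.

    segment_refs: hypothetical dict of segment label -> reference genome
    identifier, with None for segments that aligned to no reference.
    """
    references = {ref for ref in segment_refs.values() if ref is not None}
    # Zero references: nothing aligned. More than one: segments disagree,
    # which may indicate contamination.
    return len(references) != 1


def missing_segments(segment_refs, expected_segments):
    """List expected segment labels that have no aligned contig."""
    found = {seg for seg, ref in segment_refs.items() if ref is not None}
    return sorted(set(expected_segments) - found)
```

For instance, `needs_manual_curation({"PB2": "refA", "PB1": "refB"})` flags the assembly because its two segments point to different references, while `missing_segments({"PB2": "refA"}, ["PB1", "PB2"])` reports that `PB1` is absent.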
Next Steps
Learn how to perform a sequence search.