Hybrid assembly is a state-of-the-art technique that combines highly accurate Illumina short reads with ultra-long ONT reads, which serve as a scaffold. In general, the process begins with an optimized Illumina assembly. The longest of the resulting contigs are then assembled alongside the ONT reads, and this combined assembly undergoes multiple rounds of both long-read and short-read polishing. Because occasional sequencing and assembly artifacts appear as small contigs in the assembly (so-called “chaff” contigs), non-contiguous contigs shorter than 1,000 bp with low relative coverage are then removed to produce the final assembly.
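The chaff-removal step can be illustrated with a minimal Python sketch. It is not the pipeline's actual implementation: the `Contig` record, the `remove_chaff` function, the choice of the longest contig as the coverage baseline, and the 0.1 relative-coverage cutoff are all assumptions made for illustration (the text above states only the 1,000 bp length limit and "low relative coverage"). The "non-contiguous" criterion is not modeled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Contig:
    """Hypothetical contig record used only for this illustration."""
    name: str
    length: int      # contig length in bp
    coverage: float  # mean read depth across the contig


def remove_chaff(contigs: List[Contig],
                 min_length: int = 1000,
                 min_relative_coverage: float = 0.1) -> List[Contig]:
    """Drop contigs that are both short and poorly covered.

    Relative coverage is measured against the longest contig, which is
    assumed to be the chromosome. The 0.1 cutoff is an assumed value.
    """
    if not contigs:
        return []
    baseline = max(contigs, key=lambda c: c.length).coverage
    kept = []
    for contig in contigs:
        relative = contig.coverage / baseline if baseline > 0 else 0.0
        is_chaff = contig.length < min_length and relative < min_relative_coverage
        if not is_chaff:
            kept.append(contig)
    return kept


assembly = [
    Contig("chromosome", 5_000_000, 120.0),
    Contig("plasmid", 45_000, 300.0),
    Contig("artifact", 600, 4.0),      # short and low coverage: removed
    Contig("short_high", 800, 150.0),  # short but well covered: kept
]
final_assembly = remove_chaff(assembly)
print([c.name for c in final_assembly])
```

Requiring both conditions keeps short but well-supported contigs, which may be real sequence, while discarding short contigs that few reads support.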
Genome Assembly Quality Control
Genetic Element Contiguity
Most bacterial genome assemblies available in public databases are of “draft” quality, and thus single genetic elements (e.g., chromosomes, plasmids, and other types of mobile genetic elements) are split between multiple contigs with little to no structural arrangement data. To be published to the ATCC Genome Portal, each genetic element in an ATCC reference-grade genome must be assembled into a single contig. When the assembly process supports circularization of a contig (as in the case of most bacterial chromosomes and some mobile genetic elements), the contig is reported as circular. Genome assemblies that contain multiple contigs that the assembly process recognizes as contiguous, but whose structural relationships remain unresolved, are currently excluded from the ATCC Genome Portal.
Illumina Read Set Coverage
Although the depth of Illumina reads required is influenced by numerous factors (including, but not limited to, bacterial strain) [3, 4], Illumina read sets should be deep enough to cover the entire genome so that each base can be determined as accurately as possible. To account for variance in the distribution of coverage per base, we require a minimum of 100X average coverage for Illumina reads.
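As a worked example of the 100X requirement, the sketch below estimates average coverage as total sequenced bases divided by genome length. This is the standard back-of-the-envelope estimate, not necessarily how coverage is computed in the actual pipeline, which may measure depth from read alignments. The function names and the example read counts are hypothetical.

```python
def average_coverage(total_read_bases: int, genome_length: int) -> float:
    """Estimate average coverage as sequenced bases per genome base."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return total_read_bases / genome_length


def meets_coverage_requirement(total_read_bases: int,
                               genome_length: int,
                               minimum: float = 100.0) -> bool:
    """Check the estimate against the 100X minimum stated in the text."""
    return average_coverage(total_read_bases, genome_length) >= minimum


# Hypothetical run: 2,000,000 read pairs of 2 x 150 bp on a 5 Mb genome.
read_pairs = 2_000_000
read_length = 150
genome_length = 5_000_000

total_bases = read_pairs * 2 * read_length  # 600,000,000 bases
coverage = average_coverage(total_bases, genome_length)
print(f"{coverage:.0f}X", meets_coverage_requirement(total_bases, genome_length))
```

In this hypothetical run the estimate is 120X, which clears the 100X minimum. Halving the number of read pairs would give 60X and fail it.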
CheckM Completeness and Contamination
To ensure that our assembly process has captured the entirety of a given strain’s genome, and to confirm the absence of contamination in the assembly, we pass finalized assemblies through CheckM. Briefly, CheckM uses a set of Hidden Markov Models (HMMs) from phylogenetically close reference genomes to determine whether the query assembly contains all expected HMMs as predicted by the reference genomes (a percentage called “CheckM completeness”), and what percentage of the query’s HMMs differ in copy number or come from reference genomes that are phylogenetically distant (called “CheckM contamination”). We require final assemblies to have completeness values ≥ 95% and contamination values ≤ 5% (i.e., within the margin of error for 100% completeness and 0% contamination, which designates them as “excellence reference sequences” according to the authors of CheckM).
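The acceptance rule above reduces to a simple threshold check, sketched here in Python. The function name `passes_checkm` is hypothetical, and the completeness and contamination percentages are assumed to have already been read from CheckM's report. No CheckM API or output format is modeled.

```python
def passes_checkm(completeness: float,
                  contamination: float,
                  min_completeness: float = 95.0,
                  max_contamination: float = 5.0) -> bool:
    """Apply the thresholds from the text: completeness >= 95%, contamination <= 5%.

    Both arguments are percentages as reported by CheckM.
    """
    return completeness >= min_completeness and contamination <= max_contamination


# Hypothetical results for three assemblies: (completeness %, contamination %).
results = {
    "strain_A": (99.6, 0.4),   # passes both thresholds
    "strain_B": (93.2, 0.0),   # fails on completeness
    "strain_C": (98.9, 7.5),   # fails on contamination
}
accepted = [name for name, (comp, cont) in results.items()
            if passes_checkm(comp, cont)]
print(accepted)
```

Both conditions must hold, so an assembly with high completeness is still rejected if its contamination exceeds 5%.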
1. Salzberg SL, et al. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Research, 22(3): 557–567, 2012. PubMed: 22147368
2. Land M, et al. Insights from 20 years of bacterial genome sequencing. Functional & Integrative Genomics, 15(2): 141–161, 2015. PubMed: 25722247
3. Desai A, et al. Identification of Optimum Sequencing Depth Especially for De Novo Genome Assembly of Small Genomes Using Next Generation Sequencing Data. PLoS ONE, 8(4): e60204, 2013. PubMed: 23593174
4. Pightling AW, Petronella N, Pagotto F. Choice of Reference Sequence and Assembler for Alignment of Listeria monocytogenes Short-Read Sequence Data Greatly Influences Rates of Error in SNP Analyses. PLoS ONE, 9(8): e104579, 2014. PubMed: 25144537
5. Wick RR, et al. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLOS Computational Biology, 13(6): e1005595, 2017. PubMed: 28594827
6. Parks DH, et al. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research, 25(7): 1043–1055, 2015. PubMed: 25977477