Hybrid assembly is a state-of-the-art technique that combines highly accurate Illumina short reads with ultra-long ONT reads that serve as scaffolds. In general, the technique begins with an optimized Illumina assembly. The longest of the resulting contigs are then assembled alongside the ONT reads, and this combined assembly undergoes multiple rounds of both long-read and short-read polishing. Because occasional sequencing and assembly artifacts appear as small contigs in the final assembly (so-called “chaff” contigs), non-contiguous contigs shorter than 1000 bp with low relative coverage are then removed to produce the final assembly.
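The chaff-removal step can be illustrated with a minimal sketch. It is not the production pipeline: the `Contig` record, the function name, and the 0.1 relative-coverage cutoff are assumptions made for illustration (the text only specifies the 1000 bp length limit), and the contiguity criterion is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    length: int      # contig length in bp
    coverage: float  # mean read depth over the contig


def remove_chaff(contigs, min_length=1000, min_relative_coverage=0.1):
    """Drop contigs that are both short and poorly covered.

    Relative coverage is measured against the length-weighted mean depth of
    the whole assembly. The 0.1 cutoff is an illustrative assumption.
    """
    total_bp = sum(c.length for c in contigs)
    if total_bp == 0:
        return []
    mean_depth = sum(c.length * c.coverage for c in contigs) / total_bp

    kept = []
    for c in contigs:
        is_short = c.length < min_length
        is_low = c.coverage < min_relative_coverage * mean_depth
        # A contig is treated as chaff only when both conditions hold.
        if not (is_short and is_low):
            kept.append(c)
    return kept
```

Requiring both conditions means a short but well-covered contig (for example, a small high-copy plasmid) is retained.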
For fungal assemblies, we down-sample the reads as for the bacterial assemblies and then use the MaSuRCA pipeline, a hybrid assembly algorithm that combines Illumina and ONT reads to construct long and accurate mega-reads, together with the Flye assembler. MaSuRCA was chosen for its strengths with large genomes.
Viral Genome Assembly and Quality Assessment
As viruses are co-cultured with their host, viral DNA or RNA sequencing data may contain reads from both the host and the virus, and de novo assemblies may contain contigs from both the host and viral genomes. To produce an assembly containing contigs of a single virus, host reads can be removed or contigs can be binned taxonomically. Taxonomic binning is performed by aligning reads or contigs to One Codex’s curated NCBI Reference Sequence Database. Reads or contigs that align to “cellular organisms” are binned as non-viral, while those that do not are binned as viral. In our approach, high-quality reads are assembled de novo using SPAdes, and the contig binning approach is then used to obtain a complete assembly for a single virus. Contigs that align to the Escherichia coli bacteriophage Phi-X 174 genome are excluded, as this genome is used as a DNA spike-in for Illumina sequencing.
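The binning rule described above can be sketched as follows. The sketch assumes each contig already has a taxonomic lineage (root to leaf) from its best alignment, with an empty lineage when nothing aligned; the function name, the input shape, and the `phiX174` label match are hypothetical and do not represent the One Codex API.

```python
def bin_contigs(lineages, spike_in="phiX174"):
    """Sort contigs into viral, non-viral, and spike-in bins.

    lineages maps a contig name to a tuple of taxon names for its best hit.
    An empty tuple means the contig did not align to anything.
    """
    bins = {"viral": [], "non_viral": [], "spike_in": []}
    for name, lineage in lineages.items():
        if any(spike_in.lower() in taxon.lower() for taxon in lineage):
            # Phi-X 174 is an Illumina spike-in control, so it is excluded.
            bins["spike_in"].append(name)
        elif "cellular organisms" in lineage:
            bins["non_viral"].append(name)
        else:
            # Anything that does not align to cellular organisms is viral.
            bins["viral"].append(name)
    return bins
```

Note that under this rule a contig with no alignment at all is also placed in the viral bin, which is one reason a later reference-based check is useful.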
In addition to the problem of taxonomic binning, viral genomes are diverse in structure, and many viruses have multipartite (segmented) genomes; the genome of Influenza A virus, for example, consists of 8 separate strands of RNA. To determine whether an assembly contains all the necessary segments, a curated database of complete viral genomes and segment information was constructed. After taxonomic binning, contigs are aligned to the Viral Genomes-NCBI-NIH database to assign segment labels, segment depth, and percent identity to the closest reference.
As further quality control, viral assemblies that do not align to any reference in the Viral Genomes-NCBI-NIH database, and viral assemblies whose segments align to different references (i.e., that may be contaminated), are not published as is, but may undergo additional manual assembly and curation in order to become publishable.
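A minimal sketch of this review logic is shown below. The input shapes, status strings, and function name are assumptions for illustration: each contig is taken to carry either `None` (no alignment) or a `(reference_id, segment_label)` pair for its closest reference, and the expected segments per reference are supplied as a plain dictionary standing in for the curated database.

```python
def review_viral_assembly(hits, expected_segments):
    """Classify a binned viral assembly for publication review.

    hits maps contig name -> (reference_id, segment_label), or None if the
    contig did not align. expected_segments maps reference_id -> set of
    segment labels that a complete genome must contain.
    """
    aligned = [hit for hit in hits.values() if hit is not None]
    if not aligned:
        return "no_reference"        # nothing matched the database
    references = {reference for reference, _ in aligned}
    if len(references) > 1:
        return "mixed_references"    # possible contamination
    (reference,) = references
    found = {segment for _, segment in aligned}
    missing = expected_segments.get(reference, set()) - found
    return "incomplete" if missing else "pass"
```

Any status other than `"pass"` would correspond to an assembly that is held back for manual assembly and curation rather than published directly.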
Genome Assembly Quality Control
Genetic Element Contiguity
Most bacterial genome assemblies available in public databases are of “draft” quality, and thus single genetic elements (e.g., chromosomes, plasmids, and other types of mobile genetic elements) are split between multiple contigs with little to no structural arrangement data. To be published to the ATCC Genome Portal, each genetic element in an ATCC reference-grade genome must be assembled into a single contig. When the assembly process supports circularization of a contig (as in the case of most bacterial chromosomes and some mobile genetic elements), the contig is reported as circular. Genome assemblies with multiple contigs that the assembly process recognizes as contiguous, but whose structural relationships remain unresolved, are currently excluded from the genome portal.
Illumina Read Set Coverage
Although the depth of Illumina reads required is influenced by numerous factors (including, but not limited to, bacterial strain) [7, 8], Illumina read sets should be deep enough to cover the entire genome so that bases are determined as accurately as possible. To account for variance in the distribution of coverage per base, we require a minimum of 100X average coverage for Illumina reads.
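As a worked illustration of the 100X requirement, average coverage can be estimated as total sequenced bases divided by genome length. The function names below are hypothetical, and this simple estimate assumes all reads map to the genome; a depth computed from actual alignments may be lower.

```python
def mean_coverage(total_bases, genome_length):
    """Estimate average depth as sequenced bases per genome position."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return total_bases / genome_length


def passes_coverage(total_bases, genome_length, minimum=100.0):
    """Apply the minimum average Illumina coverage requirement."""
    return mean_coverage(total_bases, genome_length) >= minimum
```

For example, 2,000,000 reads of 150 bp against a 3 Mb genome give 300,000,000 / 3,000,000 = 100X, which just meets the threshold.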
Bacterial Completeness and Contamination
To ensure our assembly process has correctly captured the entirety of a given strain’s genome, and to confirm the absence of contamination in the assembly, we pass finalized assemblies through CheckM. Briefly, CheckM uses a set of Hidden Markov Models (HMMs) from phylogenetically close reference genomes to determine whether the query assembly contains all expected HMMs as predicted by the reference genomes (a percentage called “CheckM completeness”), and what percentage of the query’s HMMs differ in copy number or come from reference genomes that are phylogenetically distant (called “CheckM contamination”). We require final assemblies to have completeness values ≥ 95% and contamination values ≤ 5% (i.e., within the margin of error for 100% completeness and 0% contamination, which indicates them as “excellence reference sequences” according to the authors of CheckM).
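The acceptance rule can be expressed as a small gate. This is a sketch only: it assumes the completeness and contamination percentages have already been read from a CheckM report, and the function name and message wording are hypothetical.

```python
def checkm_failures(completeness, contamination,
                    min_completeness=95.0, max_contamination=5.0):
    """Return the list of failed criteria; an empty list means the assembly passes."""
    failures = []
    if completeness < min_completeness:
        failures.append(f"completeness {completeness}% < {min_completeness}%")
    if contamination > max_contamination:
        failures.append(f"contamination {contamination}% > {max_contamination}%")
    return failures
```

Both thresholds are inclusive, so an assembly at exactly 95% completeness and 5% contamination is accepted.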
Mycology Completeness and Contamination
For mycology genomes, we estimate completeness using BUSCO. BUSCO is a combined tool and database, widely used in the mycology field, that quantifies completeness by checking for the presence of a selection of universal single-copy orthologs. We use fungi-specific databases in which each ortholog must be identified in at least 90% of the fungal species, and no single-copy ortholog can be entirely missing from any sub-clade in the databases. Unlike CheckM, BUSCO does not calculate % contamination. We require fungal assemblies to have a completeness value of ≥ 80%.
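To make the completeness figure concrete, the sketch below computes it from the four BUSCO result categories (complete single-copy, complete duplicated, fragmented, missing). Counting duplicated orthologs as complete follows BUSCO's own summary convention; whether the 80% threshold is applied to exactly this figure is an assumption, and the function names are hypothetical.

```python
def busco_completeness(single, duplicated, fragmented, missing):
    """Percentage of searched orthologs found complete (single + duplicated)."""
    total = single + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("no orthologs were searched")
    return 100.0 * (single + duplicated) / total


def passes_busco(single, duplicated, fragmented, missing, minimum=80.0):
    """Apply the minimum completeness requirement for fungal assemblies."""
    return busco_completeness(single, duplicated, fragmented, missing) >= minimum
```

With 600 single-copy, 40 duplicated, 80 fragmented, and 80 missing orthologs, 640 of 800 are complete, giving 80% and meeting the threshold.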
Salzberg SL, et al. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Research, 22(3): 557–567, 2012. PubMed: 22147368
Miller JR, et al. A host subtraction database for virus discovery in human cell line sequencing data. F1000Research, 7: 98, 2018.
Daly GM, et al. Host Subtraction, Filtering and Assembly Validations for Novel Viral Discovery Using Next Generation Sequencing Data. PLoS ONE, 10(6): e0129059, 2015.
Bankevich A, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology, 19(5): 455–477, 2012.
Mahmoudabadi G, Phillips R. A comprehensive and quantitative exploration of thousands of viral genomes. eLife, 7, 2018.
Land M, et al. Insights from 20 years of bacterial genome sequencing. Functional & Integrative Genomics, 15(2): 141–161, 2015. PubMed: 25722247
Desai A, et al. Identification of Optimum Sequencing Depth Especially for De Novo Genome Assembly of Small Genomes Using Next Generation Sequencing Data. PLoS ONE, 8(4): e60204, 2013. PubMed: 23593174
Pightling AW, Petronella N, Pagotto F. Choice of Reference Sequence and Assembler for Alignment of Listeria monocytogenes Short-Read Sequence Data Greatly Influences Rates of Error in SNP Analyses. PLoS ONE, 9(8): e104579, 2014. PubMed: 25144537
Wick RR, et al. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLOS Computational Biology, 13(6): e1005595, 2017. PubMed: 28594827
Parks DH, et al. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research, 25(7): 1043–1055, 2015. PubMed: 25977477
Seppey M, Manni M, Zdobnov EM. BUSCO: Assessing Genome Assembly and Annotation Completeness. Methods in Molecular Biology, 1962: 227–245, 2019.