Hybrid Assembly

Hybrid assembly is a state-of-the-art technique that uses both highly accurate Illumina short reads and ultra-long scaffolding ONT reads. In general, this technique begins with an optimized Illumina assembly. The longest of these resultant contigs are then assembled alongside the ONT reads; this combined assembly then undergoes multiple rounds of both long-read and short-read polishing. Because occasional sequencing and assembly artifacts appear as small contigs in the final assembly (so-called “chaff” contigs [1]), non-contiguous contigs less than 1000 bp with low relative coverage are then removed to produce the final assembly. For all bacteriology, mycology, and most DNA virus collection items, a hybrid approach was employed using Illumina and ONT reads.
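The chaff-removal step can be sketched as follows. This is a minimal illustration rather than the production pipeline: the `Contig` record, the `remove_chaff` function, the use of the median coverage of longer contigs as the baseline, and the 0.1 relative-coverage cutoff are all assumptions. The text specifies only the 1000 bp length limit and "low relative coverage", and the contiguity criterion is not modeled here.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Contig:
    name: str
    length: int      # contig length in bp
    coverage: float  # mean read depth over the contig


def remove_chaff(contigs, min_length=1000, min_rel_coverage=0.1):
    """Drop contigs that are both short and low in coverage relative to the assembly.

    min_rel_coverage is a hypothetical cutoff chosen for illustration.
    """
    if not contigs:
        return []
    # Baseline depth: median coverage of contigs at or above the length limit,
    # falling back to all contigs if none are long enough.
    long_cov = [c.coverage for c in contigs if c.length >= min_length]
    baseline = median(long_cov or [c.coverage for c in contigs])
    kept = []
    for c in contigs:
        is_short = c.length < min_length
        is_low = c.coverage < min_rel_coverage * baseline
        if not (is_short and is_low):
            kept.append(c)
    return kept


contigs = [
    Contig("chromosome", 4_800_000, 100.0),
    Contig("plasmid", 52_000, 240.0),
    Contig("chaff_1", 412, 3.0),
    Contig("small_plasmid", 900, 310.0),
]
print([c.name for c in remove_chaff(contigs)])
```

Requiring both conditions keeps short but well-supported elements, such as a small high-copy plasmid, in the assembly.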

For fungal assemblies, we down-sample each read set to 100X, as is done for the bacterial assemblies, and then use the MaSuRCA pipeline with the Flye assembler. MaSuRCA is a hybrid assembly algorithm that combines Illumina and ONT reads to construct long, accurate mega-reads; it was chosen for its strengths with large genomes.
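As a rough illustration of down-sampling to a target depth, the fraction of reads to retain can be derived from the total sequenced bases and an estimated genome size. The function name and the assumption that reads are subsampled uniformly at random are ours; the text does not state which tool or strategy performs this step.

```python
def downsample_fraction(total_bases, genome_size, target_depth=100.0):
    """Return the fraction of reads to keep so mean depth is about target_depth.

    Assumes uniform random subsampling and a known (or estimated) genome size.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    depth = total_bases / genome_size
    if depth <= target_depth:
        return 1.0  # already at or below the target; keep every read
    return target_depth / depth


# Example: 12 Gbp of reads over an estimated 30 Mbp genome is 400X,
# so one quarter of the reads would be kept.
print(downsample_fraction(12_000_000_000, 30_000_000))
```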

Viral Genome Assembly and Quality Assessment

As viruses are co-cultured with their host, viral DNA or RNA sequencing data may contain reads from both the host and the virus, and de novo assemblies may contain contigs from both the host and viral genomes [2]. To produce an assembly containing contigs of a single virus, host reads can be removed or contigs can be binned taxonomically [3]. Taxonomic binning is performed by aligning reads or contigs to a curated list of eukaryotic host genomic DNA. Reads or contigs that map to cell-line hosts are binned as non-viral, while those that do not are binned as viral. In our approach, high-quality de-hosted reads are assembled de novo using SPAdes [4]. To obtain a complete assembly for a single virus, the contig binning approach is then used to extract the contigs that map to the anticipated viral tax ID.
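The binning logic described above can be sketched as below. This is a simplified illustration under stated assumptions: the function name, the input structures (a contig-to-tax-ID mapping and a set of contigs that aligned to a host genome), and the integer tax IDs are hypothetical placeholders, and the alignment step itself is not shown.

```python
def extract_target_contigs(contig_taxids, host_mapped, target_taxid):
    """Bin contigs as non-viral (host-mapped) or viral, then keep the target virus.

    contig_taxids: dict of contig name -> assigned tax ID (hypothetical input)
    host_mapped:   set of contig names that aligned to a host genome
    """
    non_viral = sorted(name for name in contig_taxids if name in host_mapped)
    viral = sorted(name for name in contig_taxids if name not in host_mapped)
    target = [name for name in viral if contig_taxids[name] == target_taxid]
    return {"non_viral": non_viral, "viral": viral, "target": target}


# Placeholder tax IDs for illustration only; these are not real NCBI identifiers.
HOST_TAXID, TARGET_TAXID, OTHER_TAXID = 1, 2, 3
contig_taxids = {
    "contig_1": TARGET_TAXID,
    "contig_2": HOST_TAXID,
    "contig_3": TARGET_TAXID,
    "contig_4": OTHER_TAXID,
}
host_mapped = {"contig_2"}

bins = extract_target_contigs(contig_taxids, host_mapped, TARGET_TAXID)
print(bins["target"])
```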

In addition to the problem of taxonomic binning, viral genomes are diverse in structure with many viruses having multipartite genome segments; the genome of Influenza A virus, for example, consists of 8 separate strands of RNA [5]. To determine whether an assembly contains all the necessary segments, a curated database of complete viral genomes and segment information was constructed. After taxonomic binning, contigs are then aligned to the Viral Genomes-NCBI-NIH database to apply segment labels, segment depth, and percent identity to the closest reference.
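A segment-completeness check of this kind might look like the following sketch. The function name and input format are assumptions; the segment names are the conventional labels for the eight Influenza A segments, and the labels actually stored in the curated database are not described in the text.

```python
def missing_segments(expected, contig_labels):
    """Return the expected segments that no contig was labeled with.

    expected:      ordered list of segment names for the virus
    contig_labels: dict of contig name -> segment label from the closest reference
    """
    found = set(contig_labels.values())
    return [seg for seg in expected if seg not in found]


INFLUENZA_A_SEGMENTS = ["PB2", "PB1", "PA", "HA", "NP", "NA", "M", "NS"]

# Hypothetical labeling result in which the NA segment was not recovered.
labels = {
    "contig_1": "PB2",
    "contig_2": "PB1",
    "contig_3": "PA",
    "contig_4": "HA",
    "contig_5": "NP",
    "contig_6": "M",
    "contig_7": "NS",
}
print(missing_segments(INFLUENZA_A_SEGMENTS, labels))
```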

As further quality control, viral assemblies that do not align to any reference in the Viral Genomes-NCBI-NIH database, or whose segments align to different references (i.e., assemblies that may be contaminated), are not published as is, but may undergo additional manual assembly and curation in order to become publishable. For additional QC, CheckV [6] is run on the extracted contigs to determine completeness of the genome by verifying the presence of expected HMMs from close relatives.
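The reference-consistency rule can be expressed as a small decision function. This is a sketch under assumptions: the status strings, the function name, and the idea that each contig carries a single genome-level reference identifier (or `None` when unaligned) are ours, and the handling of assemblies where only some contigs are unaligned is not specified in the text.

```python
def reference_qc_status(closest_reference):
    """Classify an assembly by the references its contigs align to.

    closest_reference: dict of contig name -> reference genome ID, or None if
    the contig did not align to any reference (hypothetical input format).
    """
    aligned = {ref for ref in closest_reference.values() if ref is not None}
    if not aligned:
        return "no_reference_match"   # candidate for manual curation
    if len(aligned) > 1:
        return "mixed_references"     # possible contamination
    return "pass"


print(reference_qc_status({"seg_1": "ref_A", "seg_2": "ref_A"}))
print(reference_qc_status({"seg_1": "ref_A", "seg_2": "ref_B"}))
print(reference_qc_status({"seg_1": None}))
```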

Bacterial Genome Assembly Quality Control

Genetic Element Contiguity

Most bacterial genome assemblies available in public databases are of “draft” quality [7], meaning that single genetic elements (e.g., chromosomes, plasmids, and other types of mobile genetic elements) are split across multiple contigs with little to no structural arrangement data. To be published to the ATCC Genome Portal, genetic elements in ATCC reference-grade bacterial genomes must be assembled into 15 or fewer contigs as output by the Unicycler assembler [10]. When the assembly process supports circularization of a contig (as in the case of most bacterial chromosomes and some mobile genetic elements), such assemblies are reported as “Gold” assemblies. Genome assemblies with multiple contigs that the assembly process recognizes as contiguous, but whose structural relationships remain unresolved, are currently listed on the genome portal as “beta” assemblies.

Illumina Read Set Coverage

Although the depth of Illumina reads required is influenced by numerous factors (including, but not limited to, bacterial strain) [8, 9], Illumina read sets should be sufficient to cover the entire genome to obtain the most accurate base determination [10]. To account for variance in distribution of coverage per base, we require a minimum of 100X average coverage for Illumina reads.
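To make the coverage requirement concrete, the sketch below summarizes a per-base depth profile. The function name and the reporting of breadth (the fraction of positions covered by at least one read) are our additions for illustration; the text states only the 100X minimum average coverage.

```python
def coverage_summary(depths, min_mean_depth=100.0):
    """Summarize per-base read depth and test the average-coverage requirement.

    depths: sequence of read depths, one value per genome position
    """
    n = len(depths)
    if n == 0:
        raise ValueError("depths must not be empty")
    mean_depth = sum(depths) / n
    # Breadth: fraction of positions covered by at least one read.
    breadth = sum(1 for d in depths if d > 0) / n
    return {
        "mean_depth": mean_depth,
        "breadth": breadth,
        "passes": mean_depth >= min_mean_depth,
    }


# Toy five-position profile: the mean reaches 100X even though one position
# has no coverage, which is why per-base variance matters.
print(coverage_summary([120, 95, 130, 0, 155]))
```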

Bacterial Completeness and Contamination

To ensure our assembly process has correctly captured the entirety of a given strain’s genome, and to confirm the absence of contamination in the assembly, we pass finalized assemblies through CheckM [11]. Briefly, CheckM uses a set of Hidden Markov Models (HMMs) from phylogenetically close reference genomes to determine whether the query assembly contains all expected HMMs as predicted by the reference genomes (a percentage called “CheckM completeness”), and what percentage of the query’s HMMs differ in copy number or come from reference genomes that are phylogenetically distant (called “CheckM contamination”). We require final assemblies to have completeness values ≥ 95% and contamination values ≤ 5% (i.e., within the margin of error for 100% completeness and 0% contamination, which indicates them as “excellence reference sequences” according to the authors of CheckM).
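Applied to a batch of assemblies, the acceptance criteria amount to a simple threshold filter. The function name, the input dictionary, and the example values are hypothetical, and the sketch assumes completeness and contamination percentages have already been taken from CheckM output.

```python
def passes_checkm(completeness, contamination,
                  min_completeness=95.0, max_contamination=5.0):
    """Return True if both percentages meet the stated acceptance thresholds."""
    return completeness >= min_completeness and contamination <= max_contamination


# Hypothetical (completeness %, contamination %) values per assembly.
checkm_results = {
    "assembly_A": (99.6, 0.4),
    "assembly_B": (93.1, 0.2),   # fails on completeness
    "assembly_C": (98.7, 6.3),   # fails on contamination
}
accepted = [name for name, (comp, cont) in checkm_results.items()
            if passes_checkm(comp, cont)]
print(accepted)
```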

Mycology Completeness and Contamination

For mycology genomes, we estimate completeness using BUSCO [12]. BUSCO is a tool and accompanying database, widely used in the mycology field, that examines the presence of a selection of universal single-copy orthologs to calculate completeness quantitatively. We use fungi-specific databases in which orthologs must be identified in at least 90% of the fungal species, and no single-copy ortholog can be entirely missing from any sub-clade in the databases. Unlike CheckM, BUSCO does not calculate % contamination. We require fungal assemblies to have a completeness value of ≥ 80%.
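BUSCO reports orthologs as complete single-copy, complete duplicated, fragmented, or missing, and completeness is the complete fraction of the total. The sketch below computes that percentage and applies the 80% requirement; the function names and the example counts are illustrative assumptions, not values from a real run.

```python
def busco_completeness(single, duplicated, fragmented, missing):
    """Percent of BUSCO orthologs found complete (single-copy plus duplicated)."""
    total = single + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("no BUSCO orthologs were evaluated")
    return 100.0 * (single + duplicated) / total


def passes_busco(single, duplicated, fragmented, missing, min_completeness=80.0):
    """Apply the completeness threshold stated for fungal assemblies."""
    return busco_completeness(single, duplicated, fragmented, missing) >= min_completeness


# Illustrative counts: 920 of 1000 orthologs are complete.
print(busco_completeness(900, 20, 30, 50))
print(passes_busco(900, 20, 30, 50))
```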

References

[1] Salzberg SL, et al. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Research, 22(3): 557–567, 2012. PubMed: 22147368
[2] Miller JR, et al. A host subtraction database for virus discovery in human cell line sequencing data. F1000Research, 7: 98, 2018.
[3] Daly GM, et al. Host Subtraction, Filtering and Assembly Validations for Novel Viral Discovery Using Next Generation Sequencing Data. PLoS ONE, 10(6): e0129059, 2015.
[4] Bankevich A, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology, 19(5): 455–477, 2012.
[5] Mahmoudabadi G, Phillips R. A comprehensive and quantitative exploration of thousands of viral genomes. eLife, 7, 2018.
[6] Nayfach S, Camargo AP, Schulz F, Eloe-Fadrosh E, Roux S, Kyrpides NC. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nature Biotechnology, 39(5): 578–585, 2021.
[7] Land M, et al. Insights from 20 years of bacterial genome sequencing. Functional & Integrative Genomics, 15(2): 141–161, 2015. PubMed: 25722247
[8] Desai A, et al. Identification of Optimum Sequencing Depth Especially for De Novo Genome Assembly of Small Genomes Using Next Generation Sequencing Data. PLoS ONE, 8(4): e60204, 2013. PubMed: 23593174
[9] Pightling AW, Petronella N, Pagotto F. Choice of Reference Sequence and Assembler for Alignment of Listeria monocytogenes Short-Read Sequence Data Greatly Influences Rates of Error in SNP Analyses. PLoS ONE, 9(8): e104579, 2014. PubMed: 25144537
[10] Wick RR, et al. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLOS Computational Biology, 13(6): e1005595, 2017. PubMed: 28594827
[11] Parks DH, et al. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research, 25(7): 1043–1055, 2015. PubMed: 25977477
[12] Seppey M, Manni M, Zdobnov EM. BUSCO: Assessing Genome Assembly and Annotation Completeness. Methods in Molecular Biology, 1962: 227–245, 2019.