To generate the highest-quality sequencing data for our genome assemblies, we perform a single DNA extraction and sequence that DNA on both Illumina and Oxford Nanopore Technologies (ONT) sequencing platforms.

Illumina Sequencing

Illumina libraries are prepared using the latest library preparation kits available. Libraries are subsequently sequenced on an Illumina instrument, producing a paired-end read set per sample. The degree of sample multiplexing is chosen from the estimated genome size of a given organism, so that the Illumina read set provides at least 100X coverage of the genome. Resultant reads are adapter trimmed using the adapter trimming option on the Illumina instrument. Instrument software is updated whenever the manufacturer releases a new version, so that basecalling and adapter trimming on any given sequencing date use the latest version available at that time.
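The relationship between genome size, target coverage, and multiplexing can be sketched with simple arithmetic. The example below is illustrative only and is not the pipeline's actual planning tool: the function names, the run yield, and the read length are hypothetical values chosen for the example, and only the 100X target comes from the text above.

```python
def required_bases(genome_size_bp: int, target_coverage: int = 100) -> int:
    """Bases needed to reach the target coverage for one genome."""
    return genome_size_bp * target_coverage


def max_samples_per_run(run_yield_bp: int, genome_size_bp: int,
                        target_coverage: int = 100) -> int:
    """Largest number of equally sized samples that still reach the target.

    Assumes the run yield is split evenly across samples, which real
    multiplexed runs only approximate.
    """
    return run_yield_bp // required_bases(genome_size_bp, target_coverage)


def expected_coverage(read_pairs: int, read_length_bp: int,
                      genome_size_bp: int) -> float:
    """Coverage from a paired-end read set: two reads per pair."""
    return 2 * read_pairs * read_length_bp / genome_size_bp


# Hypothetical inputs: a 5 Mb bacterial genome and a 7.5 Gb run yield.
genome = 5_000_000
print(required_bases(genome))                      # 500000000
print(max_samples_per_run(7_500_000_000, genome))  # 15
print(expected_coverage(1_000_000, 150, genome))   # 60.0
```

In this example, one million 150 bp read pairs give only 60X coverage of a 5 Mb genome, so such a sample would fall short of the 100X target.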

Oxford Nanopore Technologies Sequencing

ONT libraries are prepared using the latest DNA sequencing kits available, then sequenced on an ONT instrument with the latest flow cell version available. The degree of sample multiplexing is based on the estimated genome size of a given organism. Flow cells are run on the instrument for at least 12 hours. Instrument software is updated whenever the manufacturer releases a new version, so that sequencing and basecalling on any given sequencing date use the latest version of ONT software available at that time.

After basecalling, all resultant FASTQs are combined and then demultiplexed using either porechop or qcat, with barcode removal settings turned on.

Illumina Data Quality Control

Illumina read sets commonly contain flanking low-quality regions and portions of Illumina adapter sequence; removing these regions can substantially improve genome assemblies [1]. To accomplish this, we perform a round of adapter removal and quality filtering, which also removes adapter sequences missed by the Illumina software. After quality and adapter trimming, we assess the quality of each Illumina read set using FastQC. Illumina read sets must meet the following quality control thresholds:

Median Q score, all bases > 30
Median Q score, per base > 25
Ambiguous content (% N bases) < 5%
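The three thresholds above can be expressed as a small check. This is a minimal sketch, not the FastQC implementation: it assumes that "per base" means the median quality at each read position (as in FastQC's per-base sequence quality module), and the function name and input layout are hypothetical.

```python
from statistics import median


def illumina_qc(reads):
    """Evaluate the three thresholds on a list of (sequence, phred_scores).

    `reads` is a hypothetical in-memory layout: each item pairs a sequence
    string with a list of integer Phred scores of the same length.
    """
    all_quals = [q for _, quals in reads for q in quals]

    # Median quality at each read position, over reads long enough to reach it.
    longest = max(len(quals) for _, quals in reads)
    per_position = [
        median(quals[i] for _, quals in reads if len(quals) > i)
        for i in range(longest)
    ]

    total_bases = sum(len(seq) for seq, _ in reads)
    n_percent = 100 * sum(seq.upper().count("N") for seq, _ in reads) / total_bases

    return {
        "median_q_all": median(all_quals) > 30,
        "median_q_per_base": min(per_position) > 25,
        "ambiguous_n": n_percent < 5,
    }


# Tiny made-up read set; real read sets contain millions of reads.
example = [
    ("ACGT", [35, 36, 34, 30]),
    ("ACGN", [36, 35, 33, 28]),
    ("ACGT", [34, 37, 35, 26]),
]
result = illumina_qc(example)
print(result, "PASS" if all(result.values()) else "FAIL")
```

The made-up read set passes both quality thresholds but fails on ambiguous content, because one N in twelve bases is about 8.3%.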

Oxford Nanopore Technologies Data Quality Control

ONT ultra-long reads are critical for scaffolding over the low-complexity regions of bacterial genomes during hybrid assembly, but they have limited influence in determining base identity given enough Illumina coverage [2, 3]. Given the lower quality of ONT sequencing data, all ONT data are trimmed and filtered to remove low-quality regions. The quality control metrics applied to all ONT read sets are:

Minimum mean Q score, per read > 10
Minimum read length > 5,000 bp

To perform this quality control step, we employ Filtlong on demultiplexed ONT read sets, in addition to barcode sequence removal during demultiplexing.
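The per-read quality and length thresholds can be illustrated with a short filter. This sketch does not reproduce Filtlong's internal scoring. It assumes the mean Q score is derived from the mean per-base error probability rather than the arithmetic mean of the Phred values, and the function names are hypothetical.

```python
import math


def mean_q(phred_scores):
    """Mean read quality from the average per-base error probability.

    Assumption: Phred scores are converted to error probabilities,
    averaged, and converted back. A plain arithmetic mean of the scores
    would overstate quality for reads with a few very poor regions.
    """
    mean_error = sum(10 ** (-q / 10) for q in phred_scores) / len(phred_scores)
    return -10 * math.log10(mean_error)


def passes_ont_qc(phred_scores, min_mean_q=10.0, min_length=5000):
    """Keep a read only if it exceeds both thresholds."""
    return len(phred_scores) > min_length and mean_q(phred_scores) > min_mean_q


# Made-up quality strings for illustration.
print(passes_ont_qc([12] * 6000))                # True: long enough, Q ~ 12
print(passes_ont_qc([12] * 4000))                # False: shorter than 5,000 bp
print(passes_ont_qc([20] * 3001 + [5] * 3000))   # False: low-quality half dominates
```

The third read has an arithmetic mean Phred score of about 12.5, yet its error-probability-based mean Q is below 8, which is why the choice of averaging method matters when applying a per-read threshold.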

Read-Based Contamination Quality Control with One Codex

ATCC employs state-of-the-art methods to detect contamination during the growth phase of our product production. To complement this approach, we use the One Codex microbial genomics platform [4] to perform read-level k-mer–based taxonomic classification and estimation of strain abundances on our processed Illumina read sets, which represent a highly accurate snapshot of a given DNA extraction. A minimum of 1,000,000 Illumina reads per sequenced sample is required for this analysis; Illumina read sets that otherwise pass the quality control criteria but contain fewer than 1,000,000 reads are sent for re-sequencing. When an Illumina read set is confirmed as an isolate by the One Codex platform, all read sets from that extraction continue to genome assembly. Please note that the results of this read-based analysis are not currently presented on the portal, but all published genomes have passed our stringent thresholds for purity.
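The gating logic described above can be summarized as a small decision function. This is a hedged sketch: it does not call the One Codex platform, the function name is hypothetical, and the "fail_qc" and "not_confirmed" outcomes are placeholders for cases the text does not describe.

```python
MIN_READS = 1_000_000  # minimum Illumina reads required for the analysis


def next_step(passes_qc: bool, read_count: int, confirmed_isolate: bool) -> str:
    """Route a processed Illumina read set to its next step."""
    if not passes_qc:
        return "fail_qc"        # placeholder: handling not described in the text
    if read_count < MIN_READS:
        return "resequence"     # passes QC but has too few reads
    if confirmed_isolate:
        return "assemble"       # all read sets from the extraction proceed
    return "not_confirmed"      # placeholder: handling not described in the text


print(next_step(True, 850_000, False))   # resequence
print(next_step(True, 2_400_000, True))  # assemble
```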

References

1. Del Fabbro C, et al. An Extensive Evaluation of Read Trimming Effects on Illumina NGS Data Analysis. PLoS ONE, 8(12): e85024, 2013. PubMed: 24376861
2. Wick RR, et al. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Computational Biology, 13(6): e1005595, 2017. PubMed: 28594827
3. Shomorony I, Courtade T, Tse D. Do read errors matter for genome assembly? 2015 IEEE International Symposium on Information Theory (ISIT), 919–923, 2015.
4. Minot SS, Krumm N, Greenfield NB. One Codex: A Sensitive and Accurate Data Platform for Genomic Microbial Identification. bioRxiv, 27607, 2015.