Assembly, QC and annotation pipeline versions

View the changelogs for the pipelines used to assemble and annotate ATCC genomes

Written by Denise Lynch
Updated over a week ago

A series of pipelines was used to generate the assemblies, QC results, and annotations available on the ATCC Genome Portal. Information about each pipeline version can be found in the “Notes” field of the “Quality Control” tab for a genome on the ATCC Genome Portal. We are currently unifying the pipeline versions displayed in the “Notes” field. In the meantime, the absence of a “Notes” field denotes a One Codex-derived assembly, while an Oatmeal version in the notes identifies the pipeline version used to curate the genome. A short “manual assembly” disclaimer denotes an assembly produced with an earlier, pre-production version of the Oatmeal pipeline. Here, we provide a change log for all pipelines. If you are unsure which pipeline or version was used for your genome of interest, please contact us through the message box at the bottom-right of your screen.
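
The rules above for reading the “Notes” field can be sketched as a small lookup. This is only an illustration: the function name, the returned labels, and the exact substrings matched ("manual assembly", "oatmeal") are assumptions, since the precise wording used in the portal's notes is not specified here.

```python
from typing import Optional


def pipeline_from_notes(notes: Optional[str]) -> str:
    """Guess the pipeline family from a genome's "Notes" text.

    Hypothetical helper: the matched substrings and returned labels
    are assumptions, not strings defined by the ATCC Genome Portal.
    """
    # No "Notes" field (or an empty one) denotes a One Codex-derived assembly.
    if notes is None or not notes.strip():
        return "One Codex"
    text = notes.lower()
    # A "manual assembly" disclaimer denotes a pre-production Oatmeal version.
    if "manual assembly" in text:
        return "Oatmeal (pre-production)"
    # Otherwise an Oatmeal version string names the production pipeline.
    if "oatmeal" in text:
        return "Oatmeal"
    # Anything else cannot be resolved from the notes alone.
    return "unknown"
```

If the result is "unknown", the message box mentioned above remains the reliable route.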

Assembly and Annotation Pipelines

Oatmeal

“Oatmeal” is a benchmarked microbial assembly pipeline built and maintained by the ATCC Bioinformatics team. The pipeline is designed to produce reliable and authentic genome assemblies across each generalized microbial kingdom. Effective from the deployment dates listed below, each new genome on the ATCC Genome Portal has been assembled and curated through the Oatmeal pipeline. As the pipeline evolves over time, the change log will reflect updates to the software and methodology.

Oatmeal v1.0

Bacteriology

  • July 01, 2023 - Current

Assembly

  • Fastp (v0.23.2) read trimming and filtering

  • Nanofilt (v2.8.0) filtering of long reads – Phred Q10 and >1000nt

  • Kraken (v2.1.2) short read classification and read binning

  • Seqkit (v2.1.0) deduplication and downsampling to 100X

  • bbnorm (v38.62) - Illumina

  • Filtlong (v0.2.1) - ONT

  • Unicycler (v0.4.8) hybrid assembly

  • Polishing of the assembly with Polypolish (v0.5.0)

Quality Control

  • CheckM (v1.1.3) completeness ≥ 95%

  • CheckM contamination ≤ 5%

  • Number of contigs ≤ 30

  • Illumina and ONT coverage ≥ 100X

  • Must pass graph QC (≤ 15 connecting contigs)

  • Assembly status

    • GOLD – Passes all QC criteria listed above and all contigs circularized

    • BETA – Passes all QC criteria listed above but not all contigs circularized
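
The bacterial QC thresholds and status labels listed above can be expressed as a short check. This is a minimal sketch, not the Oatmeal implementation: the class and field names and the "FAIL" label are hypothetical, and it assumes the 100X coverage threshold applies to Illumina and ONT depth separately.

```python
from dataclasses import dataclass


@dataclass
class BacterialAssembly:
    completeness: float       # CheckM completeness, in percent
    contamination: float      # CheckM contamination, in percent
    n_contigs: int            # number of contigs in the assembly
    illumina_depth: float     # Illumina coverage (X)
    ont_depth: float          # ONT coverage (X)
    connecting_contigs: int   # connecting contigs counted by graph QC
    all_circular: bool        # True if every contig is circularized


def assembly_status(a: BacterialAssembly) -> str:
    """Return "GOLD", "BETA", or the hypothetical label "FAIL"."""
    passes_qc = (
        a.completeness >= 95
        and a.contamination <= 5
        and a.n_contigs <= 30
        # Assumption: each platform must reach 100X on its own.
        and a.illumina_depth >= 100
        and a.ont_depth >= 100
        and a.connecting_contigs <= 15
    )
    if not passes_qc:
        return "FAIL"
    # GOLD and BETA differ only in whether all contigs are circularized.
    return "GOLD" if a.all_circular else "BETA"
```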

Annotation

  • PGAP (2022-12-13.build6494) bacterial annotator

Virology

  • June 01, 2023 - Current

Assembly

  • Fastp (v0.23.2) read trimming and filtering

  • BWA (v0.7.17) Illumina mapping and removal of eukaryotic host reads

    • (DNA Viruses) - Minimap2 (2.23-r1111) ONT mapping and removal of host reads

  • SPAdes (v3.14.1) assembly of non-host reads

    • (DNA Viruses) – Hybrid assembly of non-host reads

  • Retention of desired contigs over 500bp by taxID-specific Blast (v2.13)

Quality Control

  • CheckV (v1.0.1) completeness ≥ 80%

  • Assembly length within 10% of expected length of reference genome

  • All viral segments must be present, with no more than 9 contigs beyond the expected contig count
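
The three viral QC criteria above can be combined into one pass/fail check. The sketch below is illustrative only: the function and parameter names are hypothetical, "within 10%" is read as inclusive, and the contig rule is read as allowing at most 9 contigs beyond the expected count.

```python
from typing import Set


def viral_qc_pass(
    completeness: float,          # CheckV completeness, in percent
    assembly_len: int,            # total assembly length (bp)
    reference_len: int,           # expected reference genome length (bp)
    found_segments: Set[str],     # segment names identified in the assembly
    expected_segments: Set[str],  # segment names expected for the virus
    n_contigs: int,               # contigs in the assembly
    expected_contigs: int,        # expected contig count
    max_extra_contigs: int = 9,
) -> bool:
    """Return True if the assembly meets all viral QC criteria."""
    completeness_ok = completeness >= 80
    # Assumption: a deviation of exactly 10% still passes.
    length_ok = abs(assembly_len - reference_len) <= 0.10 * reference_len
    # Every expected segment must be among those found.
    segments_ok = expected_segments <= found_segments
    contigs_ok = (n_contigs - expected_contigs) <= max_extra_contigs
    return completeness_ok and length_ok and segments_ok and contigs_ok
```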

Annotation

  • VIGA (v0.11.2) viral annotator

Mycology

  • Rollout date anticipated 01 August 2023

One Codex Pipelines

Many of the assemblies on the ATCC Genome Portal were assembled and curated through the One Codex assembly pipelines (documented below) before the Oatmeal pipeline was deployed.

Assembly Pipelines

Bacteriology

  • Date: April 25, 2019

    • Initial bacterial hybrid assembly pipeline

    • Runs readsQC to quality trim both Illumina and Oxford Nanopore Technologies (ONT) reads

    • Runs Unicycler (v0.4.4) to assemble genome

  • Date: August 1, 2019

    Changes:

    • Runs fastp to trim Illumina reads

    • Runs filtlong to trim ONT reads

    • Downsamples Illumina reads to 150X genome depth and ONT reads to 60X

    • Updates Unicycler to v0.4.8

  • Date: December 11, 2019

    Changes:

    • Downsamples ONT reads to 30X genome depth

Virology

  • Date: July 15, 2020

    • Runs fastp on Illumina reads

    • Uses SPAdes to assemble

    • minimap2 aligns trimmed reads to assembly

    • Masks low depth (<10X) regions

  • Date: April 29, 2021

    Changes:

    • Trims terminal masked regions from assemblies

    • Adds modification to Unicycler to raise exception if Racon runs out of memory
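
The low-depth masking and terminal trimming steps listed above can be illustrated on a single sequence. This is a sketch under stated assumptions, not the One Codex code: the function names are hypothetical, depth is given per base, and masked positions are written as "N".

```python
from typing import Sequence


def mask_low_depth(seq: str, depth: Sequence[int], min_depth: int = 10) -> str:
    """Replace bases covered by fewer than min_depth reads with "N"."""
    if len(seq) != len(depth):
        raise ValueError("seq and depth must have the same length")
    return "".join(
        base if d >= min_depth else "N" for base, d in zip(seq, depth)
    )


def trim_terminal_mask(seq: str) -> str:
    """Remove masked bases from both ends; internal masks are kept."""
    return seq.strip("N")


masked = mask_low_depth("ACGTACGT", [2, 5, 12, 30, 40, 11, 9, 3])
trimmed = trim_terminal_mask(masked)
```

Here `masked` is "NNGTACNN" and `trimmed` is "GTAC". A masked position in the interior of a contig would survive trimming.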

Mycology

  • Date: July 10, 2020

    • Initial assembly pipeline for hybrid fungal assemblies

    • Runs fastp on Illumina reads

    • Runs Filtlong on ONT reads

    • Runs MaSuRCA with FLYE on filtered read sets

  • Date: March 23, 2021

    Changes:

    • Estimates genome size on Illumina reads

    • Adds downsampling of filtered Illumina reads to 150X depth of estimated genome size

    • Adds downsampling of ONT reads to 30X depth of estimated genome size
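
Downsampling to a target depth against an estimated genome size comes down to simple arithmetic: current depth is total sequenced bases divided by genome size, and the fraction of reads to keep is the target depth divided by the current depth. The sketch below shows that calculation only. The function name is hypothetical, and how the pipeline actually selects reads is not described here.

```python
def downsample_fraction(
    total_bases: int, genome_size: int, target_depth: float
) -> float:
    """Fraction of reads to keep so that depth is about target_depth.

    Returns 1.0 when the data set is already at or below the target.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    current_depth = total_bases / genome_size
    if current_depth <= target_depth:
        return 1.0
    return target_depth / current_depth


# Example: a 40 Mb estimated genome with 12 Gb of Illumina data (300X)
# and 2.4 Gb of ONT data (60X).
illumina_keep = downsample_fraction(12_000_000_000, 40_000_000, 150)
ont_keep = downsample_fraction(2_400_000_000, 40_000_000, 30)
```

In this example both fractions are 0.5, since each read set is at twice its target depth.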

Assembly QC Pipelines

Bacteriology

  • Date: October 30, 2018

    • Initial bacterial hybrid assembly QC pipeline

    • Uses CheckM to assess assembly quality, completion and contamination

  • Date: June 7, 2019

    Changes:

    • Maps trimmed ONT reads to assembly using minimap2 to calculate ONT depth

    • Maps trimmed Illumina reads to assembly using BWA

    • Adds custom script to calculate other assembly statistics

Virology

  • Date: July 13, 2020

    • Initial virology assembly QC pipeline

    • Aligns contigs to reference database to identify the best reference species

    • Checks if all segments in each reference species are present in assembly, using GenBank segment information

    • Reports alignment quality

  • Date: August 05, 2020

    Changes:

    • Includes sub-species sequences in reference database

  • Date: December 10, 2020

    Changes:

    • Uses alignment results to identify segments, in place of GenBank segment information

  • Date: January 21, 2021

    Changes:

    • Calculates assembly completeness score (assembly length / reference length)
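
The completeness score above is a plain ratio of assembly length to reference length. As a minimal sketch, assuming that lengths are summed across contigs and reference segments for segmented viruses (the source does not say how multi-segment genomes are handled), and using a hypothetical function name:

```python
from typing import Iterable


def completeness_score(
    assembly_lengths: Iterable[int], reference_lengths: Iterable[int]
) -> float:
    """Total assembly length divided by total reference length."""
    reference_total = sum(reference_lengths)
    if reference_total <= 0:
        raise ValueError("reference length must be positive")
    return sum(assembly_lengths) / reference_total


# Two contigs of 900 bp and 450 bp against references of 1000 bp and 500 bp.
score = completeness_score([900, 450], [1000, 500])
```

A score above 1 means the assembly is longer than the reference.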

Mycology

  • Date: September 24, 2020

    • Initial mycology assembly QC pipeline

    • Maps raw reads to assembly to calculate depth

    • Runs BUSCO 4.1.2 with BUSCO database Fungi_ODB10 to calculate assembly completeness score

    • Calculates additional assembly statistics

Annotation Pipelines

Bacteriology

  • Date: September 7, 2018

    • Initial bacterial annotation pipeline

    • Uses Prokka with genus-specific BLAST database

  • Date: September 18, 2019

    Changes:

    • Does not use genus-specific BLAST database

Virology (Variant calling)

  • Date: October 18, 2020

    • blastn aligns reference genome to assembly to identify matching segments

    • Uses MAFFT to align matching reference and assembly segments

    • Custom script examines MAFFT alignment results for variant types

  • Date: November 24, 2020

    Changes:

    • Improves method for identifying variant types

Mycology

  • Date: October 09, 2020

    • Initial annotation pipeline for fungal assemblies

    • Runs BUSCO 4.1.2 with BUSCO database Fungi_ODB10 for annotations of Universal Single-Copy Orthologs
