Bacterial Genome Annotation

Several approaches to bacterial genome annotation are currently in use [1, 2, 3]. For this reason, we make our finalized genome assembly FASTA files available for download from our genome portal and encourage customers to run their own custom annotations of the ATCC reference-grade genomes if they so choose. We also recognize the need for a readily accessible annotation in a common format for those who want to begin gene-level analysis immediately. To address this need, we provide a default annotation of ATCC reference-grade genomes generated with Prokka [2]. Briefly, Prokka relies on a number of tools to annotate coding sequences (CDSs), rRNA, tRNA, signal leader peptides, and non-coding RNA. For CDSs, Prokka uses the UniProt [4], RefSeq [5], Pfam [6], and TIGRFAM [7] databases to assign protein identity. On the genome portal, all annotated CDSs include their EC number and UniProt ID as reported by Prokka.
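For readers who download an annotation and want the EC number and UniProt ID for each CDS, the following is a minimal Python sketch. It assumes a Prokka-style GFF3 file in which the EC number is stored in an `eC_number` attribute and the UniProt accession appears inside the `inference` attribute as `UniProtKB:<accession>`; the function name `parse_cds_line` and the example record are hypothetical, and the files served by the genome portal may be laid out differently.

```python
import re
from typing import Dict, Optional

# Assumption: UniProt accessions appear as "UniProtKB:<accession>" in the
# GFF3 "inference" attribute, as in typical Prokka output.
UNIPROT_RE = re.compile(r"UniProtKB:([A-Z0-9]+)")


def parse_cds_line(line: str) -> Optional[Dict[str, Optional[str]]]:
    """Return selected attributes of one GFF3 CDS record, or None otherwise."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[2] != "CDS":
        return None  # comment, FASTA section, or non-CDS feature
    # Column 9 holds semicolon-separated key=value pairs.
    attrs = dict(
        field.split("=", 1) for field in cols[8].split(";") if "=" in field
    )
    match = UNIPROT_RE.search(attrs.get("inference", ""))
    return {
        "locus_tag": attrs.get("locus_tag"),
        "product": attrs.get("product"),
        "ec_number": attrs.get("eC_number"),
        "uniprot_id": match.group(1) if match else None,
    }


if __name__ == "__main__":
    # Hypothetical record, for illustration only.
    example = (
        "contig_1\tProdigal:2.6\tCDS\t1\t1101\t.\t+\t0\t"
        "ID=X_00001;eC_number=2.7.7.7;Name=dnaN;"
        "inference=ab initio prediction:Prodigal:2.6,"
        "similar to AA sequence:UniProtKB:P0A988;"
        "locus_tag=X_00001;product=Beta sliding clamp"
    )
    print(parse_cds_line(example))
```

CDSs without a database match simply yield `None` for the missing fields, so the same function can be applied to every line of a file.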

Mycology Genome Annotation

During completeness calculations for mycology genomes, BUSCO [8] generates annotations of universal single-copy orthologs, which we make available in the genome portal. BUSCO uses Augustus (trained on BUSCO databases), tBLASTn, and HMMER3 to automatically predict and annotate single-copy coding regions of mycology genomes according to their closest relatives in fungi-specific databases.

Viral Genome Annotation and Variant Detection

Gene annotations for viral assemblies are transferred from the closest reference sequence in the NCBI Viral Genomes database by using a custom Python script.

To call genomic variants, the depth-masked SPAdes de novo assembly is aligned to a reference sequence using MAFFT [9], a multiple sequence alignment tool. Briefly, the resulting alignment is converted to a table of variants by using custom scripts. The table of variants is then joined to the reference assembly's genome annotation to produce a table of variants and their overlapping annotations. Variants that extend to the end of a segment sequence are excluded because they are likely truncations rather than true biological variants. Variants whose reference or alternate sequence consists entirely of ambiguous nucleotides are also excluded from reporting.
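The custom scripts themselves are not published here, so the following Python sketch only illustrates the kind of logic described above; it is not the production implementation. It assumes the two aligned rows (reference and assembly) have already been extracted from the MAFFT output as equal-length strings with `-` for gaps, that annotations are simple `(name, start, end)` tuples in 1-based inclusive reference coordinates, and that a pure insertion is reported at the reference base immediately after it. The names `Variant`, `call_variants`, and `annotate` are hypothetical.

```python
from typing import List, NamedTuple, Tuple

AMBIGUOUS = set("NRYKMSWBDHV")  # IUPAC ambiguity codes


class Variant(NamedTuple):
    pos: int  # 1-based reference coordinate of the first affected base
    ref: str  # reference allele, "-" for a pure insertion
    alt: str  # assembly allele, "-" for a pure deletion


def _all_ambiguous(allele: str) -> bool:
    return bool(allele) and all(base in AMBIGUOUS for base in allele)


def call_variants(ref_row: str, alt_row: str) -> List[Variant]:
    """Group adjacent mismatching alignment columns into variants."""
    if len(ref_row) != len(alt_row):
        raise ValueError("aligned rows must have equal length")
    ref_row, alt_row = ref_row.upper(), alt_row.upper()
    n = len(ref_row)
    variants: List[Variant] = []
    ref_pos = 0  # number of reference bases consumed so far
    col = 0
    while col < n:
        if ref_row[col] == alt_row[col]:
            if ref_row[col] != "-":
                ref_pos += 1
            col += 1
            continue
        start_col, pos = col, ref_pos + 1
        # Extend the run while the two rows keep disagreeing.
        while col < n and ref_row[col] != alt_row[col]:
            if ref_row[col] != "-":
                ref_pos += 1
            col += 1
        ref = ref_row[start_col:col].replace("-", "")
        alt = alt_row[start_col:col].replace("-", "")
        if start_col == 0 or col == n:
            continue  # reaches a sequence end: likely a truncation
        if _all_ambiguous(ref) or _all_ambiguous(alt):
            continue  # one allele is entirely ambiguous nucleotides
        variants.append(Variant(pos, ref or "-", alt or "-"))
    return variants


def annotate(
    variants: List[Variant], features: List[Tuple[str, int, int]]
) -> List[Tuple[int, str, str, str]]:
    """Join each variant to the names of the features it overlaps."""
    rows = []
    for v in variants:
        ref_len = 0 if v.ref == "-" else len(v.ref)
        v_end = v.pos + max(ref_len, 1) - 1
        hits = [name for name, start, end in features
                if start <= v_end and end >= v.pos]
        rows.append((v.pos, v.ref, v.alt, ";".join(hits)))
    return rows


if __name__ == "__main__":
    found = call_variants("ACGTACGT-ACG", "ACCTAC-TTAC-")
    for row in annotate(found, [("geneA", 1, 5), ("geneB", 6, 12)]):
        print(*row, sep="\t")
```

In this toy alignment the trailing gap in the assembly row is dropped because it reaches the end of the sequence, which mirrors the truncation filter described above.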


  1. Overbeek R, et al. The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Research, 42(D1): D206–D214, 2014. PubMed: 24293654

  2. Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics, 30(14): 2068–2069, 2014. PubMed: 24642063

  3. Zhao Y, et al. PGAP: pan-genomes analysis pipeline. Bioinformatics, 28(3): 416–418, 2012. PubMed: 22130594

  4. UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Research, 43(D1): D204–D212, 2015. PubMed: 25348405

  5. Tatusova T, et al. RefSeq microbial genomes database: New representation and annotation strategy. Nucleic Acids Research, 42(D1): 3872, 2014. PubMed: 25824943

  6. Finn RD, et al. Pfam: the protein families database. Nucleic Acids Research, 42(D1): D222–D230, 2014. PubMed: 24288371

  7. Haft DH, Selengut JD, White O. The TIGRFAMs database of protein families. Nucleic Acids Research, 31(1): 371–373, 2003. PubMed: 12520025

  8. Seppey M, Manni M, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness. Methods in Molecular Biology, 1962: 227–245, 2019.

  9. Katoh K, Rozewicki J, Yamada KD. MAFFT online service: multiple sequence alignment, interactive sequence choice and visualization. Briefings in Bioinformatics, 20(4): 1160–1166, 2019.

