Bacterial Genome Annotation
There are currently several approaches to bacterial genome annotation [1, 2, 3]. We therefore make our finalized genome assembly FASTA files available for download from our genome portal (for ATCC Supporting Members and for those who have purchased a physical product of the selected organism), and we encourage customers to conduct their own custom annotations of the ATCC reference-grade genomes if they so choose. However, we also recognize the need for a rapidly accessible annotation in a common format for those looking to perform immediate data analysis at the gene level. To address this need, we provide a default annotation of ATCC reference-grade genomes generated with Prokka [2]. Briefly, Prokka relies on several external tools to annotate coding sequences (CDSs), rRNA, tRNA, signal leader peptides, and non-coding RNA. For CDSs, Prokka uses the UniProt [4], RefSeq [5], Pfam [6], and TIGRFAM [7] databases to assign protein identity. On the genome portal, every annotated CDS includes its EC number and UniProt ID as reported by Prokka.
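For readers who download a Prokka annotation and want the same two identifiers shown on the portal, the following is a minimal sketch of how they could be pulled from a GFF3 file. It assumes the usual Prokka layout, in which the EC number sits in an `eC_number` attribute and the UniProt accession appears inside the `inference` attribute after `UniProtKB:`; files produced by other Prokka versions or settings may differ. The function names and the example records (locus tags, accession `P00000`) are hypothetical and for illustration only.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def cds_identifiers(gff_lines):
    """Yield (locus_tag, ec_number, uniprot_id) for each CDS feature."""
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # assumed: sequences follow this marker, no more features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = parse_attributes(cols[8])
        uniprot_id = None
        # The inference attribute is a comma-separated list of evidence strings.
        for evidence in attrs.get("inference", "").split(","):
            if "UniProtKB:" in evidence:
                uniprot_id = unquote(evidence.rsplit("UniProtKB:", 1)[1])
        yield attrs.get("locus_tag"), attrs.get("eC_number"), uniprot_id


# Hypothetical records, not taken from any ATCC genome.
EXAMPLE = [
    "##gff-version 3\n",
    "contig_1\tProdigal:2.6\tCDS\t100\t1200\t.\t+\t0\t"
    "ID=DEMO_00001;eC_number=1.1.1.1;"
    "inference=ab initio prediction:Prodigal:2.6,"
    "similar to AA sequence:UniProtKB:P00000;"
    "locus_tag=DEMO_00001;product=alcohol dehydrogenase\n",
    "contig_1\tProdigal:2.6\tCDS\t1300\t1900\t.\t-\t0\t"
    "ID=DEMO_00002;inference=ab initio prediction:Prodigal:2.6;"
    "locus_tag=DEMO_00002;product=hypothetical protein\n",
    "contig_1\tAragorn:1.2\ttRNA\t2000\t2075\t.\t+\t.\t"
    "ID=DEMO_00003;locus_tag=DEMO_00003;product=tRNA-Ala\n",
]

for locus_tag, ec_number, uniprot_id in cds_identifiers(EXAMPLE):
    print(locus_tag, ec_number, uniprot_id)
```

CDSs without a database match, such as hypothetical proteins, simply yield `None` for both identifiers, and non-CDS features are skipped.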
We are currently transitioning to the NCBI Prokaryotic Genome Annotation Pipeline (PGAP) [3] for bacterial annotation in the Oatmeal pipeline. PGAP combines ab initio gene prediction algorithms with homology-based methods and uses the Protein Family Models collection for structural and functional annotation. This collection comprises hidden Markov models (HMMs), BLAST-based rules (BlastRules), and Conserved Domain Database (CDD) architectures, which are used to assign names, gene symbols, publications, and EC numbers to proteins that meet the criteria for inclusion in a protein family.
Mycology Genome Annotation
During completeness calculations for mycology genomes, BUSCO [8] generates annotations of universal single-copy orthologs, which we make available in the genome portal. BUSCO uses Augustus (trained on BUSCO databases), tBLASTn, and HMMER3 to automatically predict and annotate single-copy coding regions of fungal genomes according to their closest relatives in fungi-specific databases.
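As an illustration of how a completeness figure relates to the per-ortholog annotations, the sketch below tallies BUSCO statuses from a `full_table.tsv`-style file. It assumes the common tab-separated layout in which the first two columns are the BUSCO id and its status (`Complete`, `Duplicated`, `Fragmented`, or `Missing`) and in which duplicated orthologs occupy several rows; the column layout can vary between BUSCO versions, and the function name and example rows are hypothetical. This is not necessarily how the Oatmeal pipeline computes its own figure.

```python
from collections import Counter


def summarize_busco(lines):
    """Count BUSCO ids per status and return (counts, percent complete)."""
    status_by_id = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        busco_id, status = line.rstrip("\n").split("\t")[:2]
        # Duplicated ids repeat on several rows; count each id once.
        status_by_id[busco_id] = status
    counts = Counter(status_by_id.values())
    total = len(status_by_id)
    # BUSCO reports "complete" as single-copy plus duplicated orthologs.
    complete = counts["Complete"] + counts["Duplicated"]
    percent = 100.0 * complete / total if total else 0.0
    return counts, percent


# Hypothetical rows, not taken from any ATCC genome.
EXAMPLE = [
    "# Busco id\tStatus\tSequence\tGene Start\tGene End\tStrand\tScore\tLength\n",
    "1001at4751\tComplete\tcontig_1\t100\t900\t+\t512.3\t266\n",
    "1002at4751\tDuplicated\tcontig_1\t1500\t2400\t+\t430.1\t300\n",
    "1002at4751\tDuplicated\tcontig_2\t200\t1100\t-\t428.7\t300\n",
    "1003at4751\tFragmented\tcontig_3\t50\t400\t+\t120.0\t116\n",
    "1004at4751\tMissing\n",
]

counts, percent = summarize_busco(EXAMPLE)
print(dict(counts), f"{percent:.1f}% complete")
```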
Viral Genome Annotation
Viral annotations are not currently included on the ATCC Genome Portal. We are working to add viral annotation to the Oatmeal pipeline by running the VIGA program [9] on the finalized viral assembly. While we encourage customers to download the reference-grade genome FASTA assembly and conduct their own annotations to retain complete control, we will also make the VIGA-generated annotation files available for download for ease of use and immediate data analysis. Note that downloads are available to ATCC Supporting Members and to those who have purchased a physical product of the selected organism.
To become an ATCC Supporting Member, log in to your ATCC Genome Portal account and view the membership options here. You can also learn more about Supporting Memberships here.
References
1. Overbeek R, et al. The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Research, 42(D1): D206–D214, 2014. PubMed: 24293654
2. Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics, 30(14): 2068–2069, 2014. PubMed: 24642063
3. Tatusova T, et al. NCBI prokaryotic genome annotation pipeline. Nucleic Acids Research, 44(14): 6614–6624, 2016. PubMed: 27342282
4. UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Research, 43(D1): D204–D212, 2015. PubMed: 25348405
5. Tatusova T, et al. RefSeq microbial genomes database: New representation and annotation strategy. Nucleic Acids Research, 42(D1): 3872, 2014. PubMed: 25824943
6. Finn RD, et al. Pfam: the protein families database. Nucleic Acids Research, 42(D1): D222–D230, 2014. PubMed: 24288371
7. Haft DH, Selengut JD, White O. The TIGRFAMs database of protein families. Nucleic Acids Research, 31(1): 371–373, 2003. PubMed: 12520025
8. Seppey M, Manni M, Zdobnov EM. BUSCO: Assessing genome assembly and annotation completeness. Methods in Molecular Biology, 1962: 227–245, 2019.
9. González-Tortuero E, et al. VIGA: a sensitive, precise and automatic de novo VIral Genome Annotator. bioRxiv, 277509, 2018.