Whole genome shotgun (WGS) metagenomic sequencing is useful not only for analyzing the taxonomic composition of a microbial community but also for other types of analysis. One of these is the study of genes coding for biochemical pathways that may be active in the sample, which gives insight into the physiological traits of the community. This is known as functional analysis. Example applications of metagenomic functional analysis include:
Identifying new biomarkers for disease diagnostics (Hollister et al., 2019)
Epidemiology and biogeography of antimicrobial resistance (Danko et al., 2021)
Discovering the role of microbial metabolic pathways in human disease physiology, including cancer, IBD, and mental health (Liu et al., 2021; Spichak et al., 2021; Thomas et al., 2019)
Studying functional underpinnings of taxonomic shifts in microbial communities (Casaburi et al., 2021; Oliver et al., 2021)
Therapeutics discovery (Young et al., 2021)
Bioprospecting (Vuong et al., 2022)
One Codex Functional Analysis
One Codex functional analysis leverages the HUMAnN 3 pipeline. It analyzes whole-genome shotgun sequencing data to identify genes and to profile their function and contribution to metabolic pathways. The first step characterizes the species composition of the sample. Traditionally, this step uses MetaPhlAn; however, the One Codex functional analysis has been modified to use the One Codex database and metagenomic classifier. After classification, reads are mapped to a custom annotated database of those species’ pangenomes to identify gene families and taxonomy (with Bowtie2 + ChocoPhlAn). A translated search is then run on any unmapped reads (using DIAMOND and the UniRef90 database) to identify further gene families. The gene family data are then annotated in several different gene annotation systems and mapped to metabolic pathways. Metabolic pathways are further characterized by their abundance and completeness (pathway coverage).
The input is a One Codex sample FASTQ from WGS metagenomic sequencing.
There are three main types of output from HUMAnN 3. Like taxonomic abundances inferred from metagenomic sequencing, these results are compositional by nature and do not represent absolute quantities of biological entities (Gloor et al., 2017).
Reads Per Kilobase (RPK) and Copies Per Million (CPM) abundance of gene families in the sample, including stratification by taxonomy.
Abundance (in RPK) of biochemical pathways as a function of the abundance of gene copies coding for their constituent enzymes.
Completeness (coverage) of biochemical pathways, expressed as a probability (from zero to one) that all of a pathway’s enzymes are present in the sample.
Launching a Functional Analysis
1. Click Run Analyses on the left sidebar.
2. Search for and select the samples you want to analyze in the Samples panel. Note that you can also add a sample to the Samples panel from that sample’s classification run results page. Click on the pulldown to select an analysis (“One Codex Database”) and select “Run Analysis”.
3. Click Run for the Functional Analysis in the Available Jobs panel.
The first section gives a Summary view of the functional analysis, including the total number of functional gene families discovered by the analysis and the number and percentage of total reads mapped to those families.
This shows the top five pathways with coverage equal to one. These are the biochemical pathways with the highest probability of being present in the sample.
The primary functional analysis identifies gene families present in the metagenomic data. This can result in a large number of gene families. For this example, there are over sixty thousand gene families in the primary results. To make these data more tractable, the gene families are further grouped using various annotations, here called Functional Groups.
This pulldown menu allows you to choose which Functional Group annotation results to view in the Annotated Results panel:
Gene Family Results
This button makes the full gene family results available to download.
This diagram shows the highest RPK annotated gene groups and their relationship to taxonomy. There is a practical limit to the number of annotation nodes that can be shown in the diagram, which determines the minimum RPK shown in the slider.
This section shows results for functional groupings according to the annotation system selected above in the Summary section. Bolded functional names show the main functional grouping name, which is further stratified taxonomically in the rows below each group. The abundances of gene families in each grouping are expressed in both RPK (reads per kilobase) and CPM (copies per million).
RPK values are read counts normalized by reference gene length in kilobases. RPK values for different gene families (annotated entities; for KO, for example, a KEGG functional ortholog such as K00424, cytochrome bd-I ubiquinol oxidase subunit X) are therefore comparable within a sample. The values for a given gene family are not comparable between samples, because they have not been corrected for the overall read depth of each sample. RPK values are useful if you plan to sum data from multiple samples before normalizing for total depth.
CPM is the relative copy depth (gene copies, not read counts) in the sample: RPK values rescaled so that they total one million. Because it is normalized across the sample, CPM is the more appropriate value to use (rather than RPK) when comparing different samples. Note that, like taxonomic data, these data are compositional and do not represent an absolute quantitation of gene families in the sample.
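The relationship between read counts, RPK, and CPM can be sketched in a few lines of Python. This is a minimal illustration of the arithmetic described above, not the pipeline's own code: the family names, read counts, and gene lengths are made up, and the helper functions `rpk` and `cpm` are hypothetical.

```python
def rpk(read_count, gene_length_bp):
    """Reads per kilobase: read count divided by gene length in kilobases."""
    return read_count / (gene_length_bp / 1000.0)


def cpm(rpk_by_family):
    """Rescale a sample's RPK values so that they sum to one million."""
    total = sum(rpk_by_family.values())
    if total == 0:
        return {family: 0.0 for family in rpk_by_family}
    return {family: value / total * 1_000_000
            for family, value in rpk_by_family.items()}


# Hypothetical data: (reads mapped, gene length in bp) per gene family.
# Sample B has the same composition as sample A, sequenced ten times deeper.
sample_a = {"family_1": (300, 1500), "family_2": (100, 500), "family_3": (60, 1000)}
sample_b = {"family_1": (3000, 1500), "family_2": (1000, 500), "family_3": (600, 1000)}

rpk_a = {f: rpk(n, length) for f, (n, length) in sample_a.items()}
rpk_b = {f: rpk(n, length) for f, (n, length) in sample_b.items()}

# RPK scales with sequencing depth, so it differs between the two samples;
# CPM removes the depth effect, so it matches.
cpm_a = cpm(rpk_a)
cpm_b = cpm(rpk_b)

# Pooling samples: sum RPK per family first, then normalize the pooled totals.
pooled_rpk = {f: rpk_a[f] + rpk_b[f] for f in rpk_a}
pooled_cpm = cpm(pooled_rpk)

for family in rpk_a:
    print(family, rpk_a[family], rpk_b[family], round(cpm_a[family], 1))
```

In this made-up example the RPK of every family is ten times higher in sample B, while the CPM values of the two samples agree, which is why CPM is the value to compare across samples.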
Uncheck Show taxonomic stratification to show only total CPM and RPK for functional grouping across all taxa. Tabular results from a given annotation can be downloaded via the Save pulldown button.
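If you work with a downloaded table, the same split between totals and taxonomic strata can be made in code. The sketch below assumes the table follows the HUMAnN convention in which a stratified row appends `|<taxon>` to the feature name; the exact layout and column names of the One Codex download are an assumption here, and the example rows are invented.

```python
import csv
import io

# Hypothetical table, assumed to follow the HUMAnN-style layout: total rows
# carry the bare feature name, stratified rows append "|<taxon>" to it.
EXAMPLE_TSV = """feature\tRPK
K00001\t30.0
K00001|g__Bacteroides.s__Bacteroides_vulgatus\t20.0
K00001|unclassified\t10.0
K00002\t5.0
K00002|g__Escherichia.s__Escherichia_coli\t5.0
"""


def split_stratified(tsv_text):
    """Return (totals, by_taxon) parsed from a tab-separated abundance table."""
    totals = {}
    by_taxon = {}
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader)  # skip the header row
    for row in reader:
        if not row:
            continue  # ignore blank lines
        feature, value = row
        name, separator, taxon = feature.partition("|")
        if separator:
            # Stratified row: contribution of one taxon to this feature.
            by_taxon.setdefault(name, {})[taxon] = float(value)
        else:
            # Unstratified row: total across all taxa.
            totals[name] = float(value)
    return totals, by_taxon


totals, by_taxon = split_stratified(EXAMPLE_TSV)
print(totals)
print(by_taxon["K00001"])
```

Keeping only `totals` corresponds to unchecking Show taxonomic stratification in the results view.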
Pathways are analyzed in two ways: for abundance and for coverage. In this analysis, reactions and pathways are defined by the MetaCyc database. MinPath is used to compute the minimal pathway reconstruction given the set of gene families in the sample.
Abundance represents the total abundance of the pathway’s component reactions (the summed copy numbers of the reactions’ constituent enzymes), in units of RPK.
Coverage is a probabilistic measure of a complete metabolic pathway being detected, where 1 = high confidence that the full pathway is present and 0 = low confidence that the full pathway is present.
Note that it is possible to have high pathway abundance but low coverage, where only some of the constituent reactions of a pathway are detected. For further details on interpreting these metrics, see the HUMAnN 3 documentation.
Uncheck Show taxonomic stratification to show only total abundance and coverage for pathways across all taxa. Tabular results for pathways can be downloaded via the Save pulldown button.
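When working with downloaded pathway results, abundance and coverage are best read together. The following sketch ranks fully covered pathways by abundance and flags pathways that are abundant but incompletely covered. The pathway names, values, and thresholds are all hypothetical, chosen only for illustration; they are not recommended cutoffs.

```python
# Hypothetical per-pathway totals: name -> (abundance in RPK, coverage 0 to 1).
pathways = {
    "PWY-A": (120.0, 1.0),
    "PWY-B": (95.0, 0.2),
    "PWY-C": (3.0, 0.9),
    "PWY-D": (0.5, 0.1),
}


def rank_confident(pathways, min_coverage=1.0, top_n=5):
    """Pathways at or above min_coverage, sorted by descending abundance."""
    kept = [(name, abundance)
            for name, (abundance, coverage) in pathways.items()
            if coverage >= min_coverage]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_n]


def abundant_but_incomplete(pathways, min_abundance, max_coverage):
    """Pathways with high abundance but low coverage, sorted by name."""
    return sorted(name
                  for name, (abundance, coverage) in pathways.items()
                  if abundance >= min_abundance and coverage <= max_coverage)


print(rank_confident(pathways))
print(abundant_but_incomplete(pathways, min_abundance=50.0, max_coverage=0.5))
```

A pathway such as the made-up "PWY-B" above would deserve a closer look before being reported as present: its abundance may be driven by a few reactions that are shared with other pathways.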