Background

Whole genome shotgun (WGS) metagenomic sequencing is useful not only for analyzing the taxonomic composition of microbial communities but also for other types of analysis. These include studying genes that code for biochemical pathways that may be active in the sample, which gives insight into the physiological traits of the community. This is known as functional analysis. Example applications of metagenomic functional analysis include:

Identifying new biomarkers for disease diagnostics (Hollister et al., 2019)
Epidemiology and biogeography of antimicrobial resistance (Danko et al., 2021)
Discovering the role of microbial metabolic pathways in human disease physiology, including cancer, IBD, and mental health (Liu et al., 2021; Spichak et al., 2021; Thomas et al., 2019)
Studying functional underpinnings of taxonomic shifts in microbial communities (Casaburi et al., 2021; Oliver et al., 2021)
Therapeutics discovery (Young et al., 2021)
Bioprospecting (Vuong et al., 2022)

One Codex Functional Analysis

One Codex functional analysis leverages the Humann3 pipeline, which analyzes whole-genome shotgun sequencing data to identify genes and profile their function and contribution to metabolic pathways. The first step characterizes the species composition of the sample. Traditionally, this step uses MetaPhlAn; however, the One Codex functional analysis has been modified to use the One Codex database and metagenomic classifier. After classification, reads are mapped to a custom annotated database of those species’ pangenomes to identify gene families and taxonomy (with bowtie2 + ChocoPhlAn). Any unmapped reads then go through a translated search (using Diamond and the UniRef90 database) to identify additional gene families. The gene family data is then annotated in several different gene annotation systems and grouped into metabolic pathways. Metabolic pathways are further characterized by their abundance and completeness (pathway coverage).

Input

Input consists of a One Codex Sample FASTQ from WGS Metagenomic sequencing.

Outputs

There are three main types of output from Humann3. Like taxonomic abundance inferred from metagenomic sequencing, these results are by nature compositional, and do not represent absolute quantities of biological entities (Gloor et al., 2017).

Gene Families

Reads Per Kilobase (RPK) and Copies Per Million (CPM) abundance of gene families in the sample, including stratification by taxonomy.

Pathway Abundance

Abundance (in RPK) of biochemical pathways as a function of the abundance of gene copies coding for their constituent enzymes.

Pathway Coverage

Completeness of biochemical pathways, expressed as a probability (from zero to one) that all of a pathway’s enzymes are present in the sample.

Launching a Functional Analysis

Click Run Workflows on the left sidebar.
Search for and select the samples you want to analyze in the Samples panel.
Click Run for the Functional Analysis in the Available Workflows panel.

If you are viewing a sample's results page, you can also click on the database name at the top-right of the page. This will show you the workflows that have already been run on that sample, to allow you to view the results from them. At the bottom of the list, you will find a "Run Workflows" button. This will bring you to the "Run Workflows" page, with that sample pre-selected.

Lastly, if you want to launch the functional analysis job for a group of samples, you can do so with the following:

On your Samples page, make sure that you are viewing your samples as a table.
Select the samples that you want to launch the workflow on.
Then within the "Actions" menu, choose "Run Workflows".

This will bring you to the Run Workflows page, with your samples pre-selected.

Results Walkthrough

Summary

The first section gives a Summary view of the functional analysis, including the total number of functional gene families discovered by the analysis and the number and percentage of total reads mapped to those families.

Complete Pathways

This shows the top five pathways with coverage equal to one. These are the biochemical pathways with the highest probability of being present in the sample.
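
As an illustration of this selection, the sketch below filters a small pathway table to entries with coverage equal to one and keeps the top five. It assumes the ranking is by abundance, which the text above does not state, and the pathway names and values are hypothetical.

```python
def top_complete_pathways(pathways, n=5):
    """Return up to n pathway names with coverage of exactly 1.

    `pathways` maps a pathway name to (abundance, coverage).
    Ranking by descending abundance is an assumption of this sketch.
    """
    complete = [
        (name, abundance)
        for name, (abundance, coverage) in pathways.items()
        if coverage == 1.0
    ]
    complete.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in complete[:n]]


# Hypothetical pathway table: name -> (abundance in RPK, coverage).
pathways = {
    "pathway_1": (120.0, 1.0),
    "pathway_2": (900.0, 0.4),
    "pathway_3": (450.0, 1.0),
}
top = top_complete_pathways(pathways)
```

Here pathway_2 is excluded despite having the highest abundance, because its coverage is below one.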

Functional Groups

The primary functional analysis identifies gene families present in the metagenomic data. This can result in a large number of genes. For this example, there are over sixty thousand gene families in the primary results. In order to make these data more tractable, the gene families are further grouped using various annotations, here called Functional Groups.

Functional Annotations

This pulldown menu allows you to choose which Functional Group annotation results to view in the Annotated Results panel.

Gene Family Results

This button makes the full gene family results available to download.

Flow Diagram

This diagram shows the highest RPK annotated gene groups and their relationship to taxonomy. There is a practical limit to the number of annotation nodes that can be shown in the diagram, which determines the minimum RPK shown in the slider.

Annotated Results

This section shows results for functional groupings according to the annotation system selected above in the Summary section. Bolded functional names show the main functional grouping, which is further stratified taxonomically in the rows below each group. Abundances of gene families in each grouping are expressed in both RPK (reads per kilobase) and CPM (copies per million).

RPK values are read counts normalized by reference gene length in kilobases. RPK values for different gene families (annotated entities, e.g., for KO, a KEGG functional ortholog such as K00424, cytochrome bd-I ubiquinol oxidase subunit X [EC:7.1.1.7]) are therefore comparable within a sample. The values for a given gene family are not comparable between samples, because they have not been corrected for the overall read depth of each sample. RPK values are useful if you plan to sum data from multiple samples before normalizing for total depth.
CPM is relative copy depth (gene copies, not read counts) in the sample: RPK values rescaled so that they sum to one million. Because it is normalized across the sample, CPM is the appropriate value to use (rather than RPK) when comparing different samples. Note that, like taxonomic data, these data are compositional and do not represent an absolute quantitation of gene families in the sample.
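
The relationship between the two units can be sketched with a small calculation. This is a simplified illustration of the definitions above, not the pipeline's actual implementation; the gene family names, read counts, and gene lengths are hypothetical.

```python
def rpk(read_count, gene_length_bp):
    """Reads per kilobase: read count divided by gene length in kilobases."""
    return read_count / (gene_length_bp / 1000.0)


def cpm(rpk_by_family):
    """Rescale RPK values so that they sum to one million within a sample."""
    total = sum(rpk_by_family.values())
    if total == 0:
        return {family: 0.0 for family in rpk_by_family}
    return {
        family: value / total * 1_000_000
        for family, value in rpk_by_family.items()
    }


# Hypothetical data: gene family -> (mapped reads, gene length in bp).
sample = {
    "family_A": (300, 1500),
    "family_B": (100, 500),
    "family_C": (600, 1000),
}
sample_rpk = {name: rpk(reads, length) for name, (reads, length) in sample.items()}
sample_cpm = cpm(sample_rpk)
```

In this example family_A and family_B have different read counts but the same RPK, because the longer gene attracts proportionally more reads. The CPM values depend only on each family's share of the sample total, which is why they can be compared between samples sequenced to different depths.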

Uncheck Show taxonomic stratification to show only total CPM and RPK for functional grouping across all taxa. Tabular results from a given annotation can be downloaded via the Save pulldown button.
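
The same separation of totals from per-taxon values can be done on a downloaded table. The sketch below assumes that per-taxon rows append a "|" and a taxon label to the feature name, as in standard Humann3 output; the layout of a particular download may differ, and the rows and values shown are hypothetical.

```python
def split_stratified(rows):
    """Separate community totals from per-taxon rows.

    Assumes each row is (feature_id, value) and that per-taxon rows
    use the form "<feature>|<taxon>" for the feature ID.
    """
    totals = {}
    by_taxon = {}
    for feature_id, value in rows:
        name, separator, taxon = feature_id.partition("|")
        if separator:
            by_taxon.setdefault(name, {})[taxon] = value
        else:
            totals[name] = value
    return totals, by_taxon


# Hypothetical rows for one KO group.
rows = [
    ("K00424", 30.0),
    ("K00424|g__Escherichia.s__Escherichia_coli", 20.0),
    ("K00424|unclassified", 10.0),
]
totals, by_taxon = split_stratified(rows)
```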

Pathway Results

Pathways are analyzed in two ways, for abundance and coverage. In this analysis, reactions and pathways are defined by the MetaCyc database. MinPath is used to compute the minimal pathway reconstruction given the set of gene families in the sample.

Abundance represents the total abundance of the pathway’s component reactions, computed from the summed copy numbers of the reactions’ constituent enzymes, in units of RPK.
Coverage is a probabilistic measure of a complete metabolic pathway being detected, where 1 = high confidence that the full pathway is present, 0 = low confidence that the full pathway is covered.

Note that it is possible to have high pathway abundance but low coverage, where only some of the constituent reactions of a pathway are detected. For further details on interpreting these metrics, see the Humann3 documentation.
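
One way to spot such cases in downloaded pathway results is to filter on both metrics together. The following is a minimal sketch; the thresholds, pathway names, and values are hypothetical and should be chosen to suit your data.

```python
def flag_partial_pathways(pathways, min_abundance, max_coverage):
    """Return names of pathways that are abundant but poorly covered.

    `pathways` maps a pathway name to (abundance in RPK, coverage from 0 to 1).
    """
    return sorted(
        name
        for name, (abundance, coverage) in pathways.items()
        if abundance >= min_abundance and coverage <= max_coverage
    )


# Hypothetical pathway table.
pathways = {
    "pathway_1": (850.0, 1.0),
    "pathway_2": (640.0, 0.2),
    "pathway_3": (15.0, 0.1),
}
flagged = flag_partial_pathways(pathways, min_abundance=100.0, max_coverage=0.5)
```

Only pathway_2 is flagged here: pathway_1 is abundant and fully covered, and pathway_3 has low coverage but also low abundance.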

Uncheck Show taxonomic stratification to show only total abundance and coverage for pathways across all taxa. Tabular results for pathways can be downloaded via the Save pulldown button.