The microbiome has become an increasingly important area of research in the life sciences, and a growing number of researchers now include microbiome data in their studies and trials. Analyzing microbiome data comes with a host of new challenges and introduces new risk factors for your research. One Codex helps reduce this risk and expedite your research with a fast, easy-to-use platform for microbiome data analysis.

We've put this document together to introduce the most common terms and concepts used in microbiome research and analysis. We hope it proves a valuable resource for anybody hoping to integrate the microbiome into their research.


To begin, we're going to define a handful of terms you'll see used to describe different aspects of the microbiome.


The "microbiome", like other biomes, encompasses the entire habitat. This includes the microorganisms, their genomes, and the environmental conditions in which they live. Many people will use the terms "microbiome" and "microbiota" interchangeably.


The "microbiota" is the collection of microorganisms themselves that are present in a defined environment.


The "metagenome" is specifically the collection of genes and genomes present in the microbiome.

You can find more details on these and related terms in this publication [1].

Approaches used to study the microbiome

We're going to focus specifically on microbiome research involving analysis of the DNA collected from a microbiome sample, or microbiome sequencing. These days, this type of analysis is largely done using Next Generation Sequencing (NGS) technology. The most common NGS approaches are whole genome sequencing, also known as shotgun sequencing, and amplicon (e.g. 16S) sequencing. One Codex provides metagenomics analysis pipelines for both types of sequencing data.

Whole Genome Sequencing

Whole Genome Sequencing, also called shotgun sequencing, is a library preparation and sequencing strategy that sequences large numbers of random fragments of DNA from the entire sample. Applied to the microbiome, it captures the DNA sequences of all of the microbes in the sample. In this context it is also commonly known as "whole metagenome sequencing", and is abbreviated as WGS or WMGS.

The One Codex Database uses shotgun sequencing data to identify all of the microbes present in your sample, including bacteria, viruses, and fungi, in a single rapid analysis. Learn how you can get started here!

Amplicon Sequencing

Another common, albeit older, approach to identifying the microbes in a sample involves amplifying a specific marker gene or set of genes. The premise of amplicon sequencing is based on identifying a gene that is present in most or all of the microbes you want to examine, but which also has enough variability from species to species to be able to distinguish between them.

The bacterial 16S rRNA gene is the most common amplicon target for identifying the bacteria present in microbial samples. This gene has nine highly variable (V) regions, which distinguish different bacteria. Unless they use long-read sequencing technology such as PacBio or Oxford Nanopore, most researchers amplify one to three variable regions for sequencing with Illumina technology, since Illumina read lengths cannot yet cover the whole 16S rRNA gene. Your choice of variable regions and primer sets will affect your results, since some species are too similar in some regions to be distinguished reliably.

Researchers frequently choose primer sets to span the V1-V3 region, the V4-V5 region, or even just the V4 region, but the most commonly used is the V3-V4 region. This is partly because these two regions together are more variable than some other regions, and partly because many of the microbes found in the human gut are more easily distinguished from each other in the V3-V4 region.

The Targeted Loci Database on One Codex is designed to analyze 16S microbiome sequencing data from any of the variable regions, or even the full 16S gene, in a single metagenomics analysis.

What should I choose?

As amplicon sequencing usually targets just one gene, you don't need a lot of sequence data to get a high-level view of the microbes in a sample. However, you are limited to a specific domain (e.g. with 16S, you can examine bacteria, but you won't be able to identify fungi). In addition, because species can be highly similar in some V regions, amplicon sequencing can have lower precision and recall than whole metagenome shotgun sequencing.
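To make "precision" and "recall" concrete, here is a minimal Python sketch that scores a set of detected species against a sample of known composition. The function name `precision_recall` and both species lists are invented for illustration; they are not One Codex output or benchmark results.

```python
def precision_recall(expected, detected):
    """Score detected taxa against the taxa known to be in the sample.

    precision = fraction of detected taxa that are really present
    recall    = fraction of truly present taxa that were detected
    """
    expected, detected = set(expected), set(detected)
    true_positives = len(expected & detected)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical sample of known composition (illustrative names only).
expected = {
    "Escherichia coli",
    "Bacillus subtilis",
    "Staphylococcus aureus",
    "Listeria monocytogenes",
}

# Hypothetical classifier output: two correct calls, one false positive,
# and two species missed entirely.
detected = {"Escherichia coli", "Bacillus subtilis", "Shigella flexneri"}

precision, recall = precision_recall(expected, detected)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case, two of the three detected species are correct (precision of about 0.67) and two of the four expected species were found (recall of 0.5).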

As well as increased precision and taxonomic resolution, whole metagenome shotgun sequencing allows you to sequence all organisms present in the sample, not just those from one domain such as bacteria. WMGS also gives you the ability to look at the entire gene complement of a sample (the metagenome), which can give insights into the functional capabilities of the microbes within that sample, such as the metabolic pathways that these microbes can execute. With WMGS, you also have the ability to assemble your reads into genomes from the microbes in the sample. The resulting genomes are referred to as metagenome-assembled genomes (MAGs), which open up the possibility of identifying genome rearrangements and horizontal gene transfer events, or even new organisms within your sample. WMGS is therefore a far more informative choice, and it is widely used.

Commonly-used Metrics

Once you've chosen your sequencing approach and classified the sequence reads in your sample, there are some additional metrics that researchers use to assess the microbiota of their sample. Below are some of the most commonly used approaches.

Alpha Diversity

Many studies have shown differences in alpha diversity between cohorts of healthy humans compared with those suffering from certain diseases. But what does that mean?

Alpha diversity is a term used to describe the amount of microbial diversity within a single sample. It usually summarizes the number of different species (or other taxonomic levels). Some alpha diversity metrics include estimations of how abundant each species is relative to others. With these "evenness" based metrics, a sample dominated by one species will be represented as less diverse than other samples where the same species are found at more even abundances.
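As a rough illustration of the difference between counting species and accounting for evenness, the Python sketch below computes richness, the Shannon index, and Pielou's evenness for two samples. These are standard metrics chosen here as examples, not necessarily the ones One Codex reports, and the read counts are invented.

```python
import math


def richness(counts):
    """Number of taxa observed with at least one read."""
    return sum(1 for c in counts.values() if c > 0)


def shannon(counts):
    """Shannon index H = -sum(p * ln p) over taxa with non-zero counts."""
    total = sum(counts.values())
    h = 0.0
    for c in counts.values():
        if c > 0:
            p = c / total
            h -= p * math.log(p)
    return h


def pielou_evenness(counts):
    """Shannon index divided by its maximum, ln(richness); ranges from 0 to 1."""
    s = richness(counts)
    return shannon(counts) / math.log(s) if s > 1 else 0.0


# Invented read counts: the same four genera in both samples.
even_sample = {"Bacteroides": 25, "Prevotella": 25,
               "Faecalibacterium": 25, "Roseburia": 25}
dominated_sample = {"Bacteroides": 97, "Prevotella": 1,
                    "Faecalibacterium": 1, "Roseburia": 1}

for name, sample in [("even", even_sample), ("dominated", dominated_sample)]:
    print(name, richness(sample), round(shannon(sample), 3),
          round(pielou_evenness(sample), 3))
```

Both samples have the same richness (four genera), but the dominated sample gets a much lower Shannon index and evenness, which is the behavior described above.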

For more information on alpha diversity, and some of the common ways of calculating it, take a look at our alpha diversity document.

Beta Diversity

Beta diversity examines the relationship between two samples. It is a measure of how similar or different two samples are, based on the species they have in common. Most beta diversity metrics are weighted by how similar the abundance vectors are for each sample. Like alpha diversity, there are a number of different ways to measure beta diversity. We describe some of the most common metrics in our beta diversity document, along with a number of ways to plot these differences.
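The Python sketch below contrasts an abundance-weighted metric (Bray-Curtis dissimilarity) with a presence/absence metric (Jaccard distance). Both are widely used, but they are picked here only as examples, and the read counts are invented.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity on relative abundances.

    0 means identical composition; 1 means no shared taxa.
    Assumes each sample has at least one read.
    """
    total_a, total_b = sum(a.values()), sum(b.values())
    taxa = set(a) | set(b)
    # With both samples scaled to sum to 1, the dissimilarity reduces to
    # 1 minus the summed per-taxon minimum abundance.
    shared = sum(min(a.get(t, 0) / total_a, b.get(t, 0) / total_b)
                 for t in taxa)
    return 1.0 - shared


def jaccard_distance(a, b):
    """Jaccard distance on presence/absence; abundances are ignored."""
    present_a = {t for t, c in a.items() if c > 0}
    present_b = {t for t, c in b.items() if c > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1.0 - len(present_a & present_b) / len(union)


# Invented read counts for three samples.
sample_a = {"Bacteroides": 50, "Prevotella": 50}
sample_b = {"Bacteroides": 90, "Prevotella": 10}
sample_c = {"Escherichia": 100}

print(bray_curtis(sample_a, sample_b), jaccard_distance(sample_a, sample_b))
print(bray_curtis(sample_a, sample_c), jaccard_distance(sample_a, sample_c))
```

Samples A and B contain the same genera, so their Jaccard distance is 0, but Bray-Curtis still separates them because the abundances differ. Samples A and C share nothing, so both metrics return 1.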

Differential Abundance Analyses

Both alpha and beta diversity analyses give a single metric based on the whole microbiome of a sample or pair of samples. We often want to see finer-grained details of what drives those differences. A common approach, which you'll see on our Quick Compare page, is to plot a stacked bar chart to look at differential abundance. This gives you a view of the microbes that dominate a sample, and lets you compare those abundances between samples.
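The values behind a stacked bar chart are per-sample relative abundances. The Python sketch below shows one way to build that table from read counts. The function names, sample names, and counts are hypothetical, and this is not a description of how the Quick Compare page is implemented.

```python
def relative_abundance(counts):
    """Convert read counts into fractions that sum to 1 for the sample."""
    total = sum(counts.values())
    if total == 0:
        return {taxon: 0.0 for taxon in counts}
    return {taxon: c / total for taxon, c in counts.items()}


def abundance_table(samples):
    """Build {sample: {taxon: fraction}} over the union of all taxa.

    Taxa missing from a sample get 0.0, so every sample has the same
    set of bar segments and can be compared side by side.
    """
    taxa = sorted({t for counts in samples.values() for t in counts})
    table = {}
    for name, counts in samples.items():
        fractions = relative_abundance(counts)
        table[name] = {t: fractions.get(t, 0.0) for t in taxa}
    return table


# Invented read counts for two hypothetical samples.
samples = {
    "sample_1": {"Bacteroides": 600, "Prevotella": 300, "Roseburia": 100},
    "sample_2": {"Bacteroides": 200, "Escherichia": 800},
}

table = abundance_table(samples)
for name, row in table.items():
    print(name, {taxon: round(frac, 2) for taxon, frac in row.items()})
```

Each row of `table` corresponds to one bar, and each taxon to one colored segment. Working in fractions rather than raw counts keeps samples with different sequencing depths comparable.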


[1] Marchesi, J.R. and Ravel, J. The vocabulary of microbiome research: a proposal. Microbiome. 2015; 3: 31. doi: 10.1186/s40168-015-0094-5
