Beta diversity is a term used to express the differences between samples or environments. It is very often used in microbiome studies to help researchers see whether there are major differences between two groups, such as treatment and control groups. There are many ways of measuring beta diversity, as well as a number of ways to visualize and represent those differences. We'll walk you through some of the most commonly-used metrics and representations, to help you interpret your results. We'll use this example to describe the metrics:

## Beta Diversity Metrics

### Bray-Curtis Dissimilarity

Bray-Curtis dissimilarity [1] examines the abundances of microbes that are shared between two samples, and the number of microbes found in each. Bray-Curtis dissimilarity ranges from 0 to 1. If both samples share the same number of microbes, at the same abundance, their "dissimilarity" will equal zero. If they have absolutely no shared microbes, they will have maximum dissimilarity, i.e. 1.

Bray-Curtis dissimilarity is calculated as:

In our example above the blue, yellow, and orange species are shared between the two samples. To calculate *Cij*, we sum up the lesser of those species.

*Cij* = 5 (blue) + 3 (yellow) + 5 (orange) = 13

*Si* (sample A) = 13 + 3 + 6 = 22

*Sj* (sample B) = 5 + 5 + 5 + 5 + 4 = 24

BC = 1 - (2*13) / (22 + 24) = 1 - 26/46 = 0.4348

This is a commonly-used metric, as the bounded scale (i.e. 0 - 1) allows for an immediate and simplified idea of the similarity between the two samples. One thing to keep in mind, however, is that the bounded scale can result in amplified or contracted differences, depending on the number of microbes in the samples. For instance, if you have a pair of samples with very few microbes in each, any small changes in abundance will have a large affect on the Bray-Curtis dissimilarity. If, however, you have two samples with large numbers of species, similar changes in abundance would have less of an impact on the overall score. So it is good to consider alpha diversity as a possible driver in dissimilarity.

### Jaccard Distance

Like Bray-Curtis dissimilarity, the Jaccard distance [2] is also bounded between 0 and 1. Also like Bray-Curtis dissimilarity, it should be considered a "dissimilarity" as opposed to a true "distance". However Jaccard distance does not take abundances into account; just the presence of microbes in one or both samples. It's calculated using this formula.

In our example, Jaccard distance would be calculated as:

Jaccard = 1 - ( | {blue, yellow, orange} | / | {blue, yellow, orange, green, red} | )

= 1 - ( 3 / 5) = 0.4

### City Block Distance

City Block distance has multiple names, including **Manhattan Distance** and **Taxicab distance** [3]. It's so-named because the metric involves adding up the absolute differences of Cartesian coordinates. Imagine how a taxi drives through Manhattan. It must take multiple turns to get from point A to point B, diagonally across the city. This is how City Block distance works. It adds all of the incremental distances between each turn.

In our example, we can calculate the city block distance as:

City block = | 13 - 5 | + | 3 - 5 | + | 6 - 5 | + | 0 - 5 | + | 0 - 4 | = 20.

### Unweighted UniFrac

"UniFrac" considers the phylogenetic relationships between the microbes found in two samples [4]. The premise of this metric is that closely related microbes are often able to perform similar functions, so two samples with different species from the same genus would be more similar to each other than two samples where the different species came from different genera or other taxonomic ranks.

Unweighted UniFrac is the fraction of branch lengths between all microbes in both samples that is different (unshared) between the samples.

### Weighted UniFrac

Weighted UniFrac [5] follows the same logic as Unweighted UniFrac, but also takes the abundances of microbes in the samples into account.

Weighted UniFrac is largely impacted by the abundances of the microbes, whereas unweighted UniFrac does not take abundance into account at all. A third UniFrac metric, generalized UniFrac [6], was developed to strike a balance between the biases of weighted and unweighted UniFrac.

### Aitchison Distance

One of the concerns when dealing with microbiome data is that many people treat read counts as absolute abundances of microbes, and apply statistical tests that are designed for absolute abundances to the data. However, read counts from microbiome data are not absolute abundances.

When we take a sample from an environment, we only obtain a proportion of what is there. Each processing step takes a smaller subset of what we obtained in the preceding step. And as the sequencers are limited to an estimated number of possible DNA strands, what we observe per species is only a proportion of what the sequencer was capable of producing for this sample. We therefore need to treat microbiome data as proportional, or compositional [7].

As such, some microbiome researchers have started to incorporate Aitchison Distance into their studies. The first step involves normalizing the read counts using centered log-ratio (clr) transformations. These normalized counts can now be treated with statistical approaches that are designed for data in Euclidean space. Aitchison distance is the application of Euclidean distance to clr-transformed data. The clr-transformation for species *1 - n* in sample *x* is performed as:

Aitchison distance between samples *x* and *y* is therefore calculated as:

## Beta Diversity Representation Approaches

Applying any of the above metrics to a set of *n* samples returns a symmetrical distance matrix of size *n* x *n*. It's difficult to get a sense of the population-wide differences from a table like this, so there are a number of ways to visualize these results to make these differences a little clearer.

### Heat Maps

One way of viewing how similar or distant samples are from one another is using heat maps. Color scales make it instantly clear how distant samples are from each other. Heat maps also cluster your samples, giving you not just a score of distance or dissimilarity between pairs of samples, but potentially highlighting groups of similar samples.

### Multi-Dimensional Scaling (MDS)

While heat maps give a quick view of how close or distance pairs of samples are, we often want to see how the different clusters relate to one another, and if our metadata is driving those differences. But when you have many samples, this becomes difficult to do.

Imagine the case where you have two samples. You can represent the distance between these two samples with a line, where the length of that line indicates distance. This means we require 1 dimension to view the relationship between 2 samples. Now imagine you have a third sample. You can present the differences between three samples with a triangle, or 2 dimensional plot. If you add a fourth sample, you now need a third dimension to capture all of the distances between all samples. The more samples you have (n), the more dimensions you need (n-1) to fully capture all of the distances between all pairs. Our brains can't really comprehend too many dimensions!

Multi-dimensional scaling (MDS) helps us to work with many samples at once, by using algorithms to allow us to maintain a view of the distances between samples as much as possible, in a reduced number of dimensions. The dimensions are usually sorted in order of how much of the variance they explain, where dimension 1 explains more variance between all samples than dimension 2 and so on. For this reason, we tend to plot just 2-3 dimensions, depending on how much of the variance is explained by these first few dimensions alone.

Below are just a few ways of plotting data like this.

*Principal Component Analysis (PCA)*

Principal Component Analysis works by finding the best-fit line between all of your data points in multi-dimensional space - i.e. the line that minimizes the squared distance from each point to the line. This line, or vector, is your first component. You can think of this as a combination of variables (for instance, microbes and their abundances) that explain the maximum amount of difference between your samples. That amount of difference is your eigenvalue. By rotating and stretching your data in many different directions, PCA finds the next best-fit lines or eigenvectors, until all variance between your points can be explained.

One of the benefits of PCA is that you can identify which variables contributed most to the principal components. With this information you can generate a bi-plot, which means you can see not only how your samples cluster or spread, but also which variables contribute most to how well they cluster or spread, such as in the below example.

*Principal Coordinates Analysis (PCoA)*

Another commonly-used MDS approach is Principal Coordinates Analysis (PCoA), also known as classical multidimensional scaling. It relies on eigenvalue decomposition in its calculations.

We've made example data available in our blog post, and here, where we've plotted the first two principal coordinates for infant microbiome data from an auto-immunity study. Samples are colored by country, and you'll see that PC1 explains 50.37% of the variance between all samples, whereas PC2 explains a further 15.01% of the variance.

*What's the difference between PCA and PCoA?*

A question we often see! There are a couple of points to note:

- As we shown above, PCA allows you to examine the contributions of some of your variables (microbes) to the plot.
- PCoA takes in a distance or dissimilarity matrix, whereas PCA takes in the underlying data. If you've passed a Euclidean distance matrix to a PCoA, PCoA and PCA are essentially identical.
- Given that PCoA starts with a distance or dissimilarity matrix, it seems to better handle missing data than PCA does. Microbiome data is often quite sparse, with many microbes found at low abundance, or missing from many samples. For this reason, PCoA may be better suited than PCA for microbiome studies.

*Scaling by MAjorizing a COmplicated Function (SMACOF)*

SMACOF is another MDS approach. It calculates "stress" - a function assessing the squared differences between ideal distances and actual distances. "Stress" is minimized using an iterative majorization function. This results in a plot where closely related samples are plotted near to one another.

## Beta Diversity Calculations and Plots using One Codex

At One Codex, we have implemented some of the popular beta diversity metrics and plotting techniques in our `onecodex`

Python library, making it easier to apply these techniques to your One Codex samples. The python library is available through GitHub, and accessible through the Notebooks service we have implemented on our platform.

### Calculating Beta Diversity

To calculate a beta diversity matrix for your samples, you just need to call the following.

`samples.beta_diversity(metric='braycurtis')`

The available metrics with this command include `braycurtis`

, `jaccard`

, and `cityblock`

. By default, these analyses are performed on the abundances of species in your samples. However you can choose to change the taxonomic rank by adding the below parameter.

`samples.beta_diversity(metric='braycurtis', rank='family')`

The available ranks include `kingdom`

, `phylum`

, `class`

, `order`

, `family`

,

`genus`

, and `species`

.

UniFrac distances can be calculated using a separate command, as it takes the taxonomic tree into account.

`samples.unifrac(weighted=True, rank='family')`

This allows you to choose weighted or unweighted (with `weighted=False`

), and as with the above `beta_diversity`

command, you have the option to choose a taxonomic rank, or default to species abundances.

### Plotting Beta Diversity

We've also made a number of functions available in our python library to allow you to visualize your beta diversity results.

** **

** Heat maps ** are easily generated using the following command:

`samples.plot_distance(metric='braycurtis')`

Here, you have the metric choices of ` braycurtis `

, ` cityblock `

(or the interchangeable ` manhatten `

), ` jaccard `

, ` unifrac `

(which represents

` weighted_unifrac `

), and ` unweighted_unifrac `

.

As with the above commands, you can also choose the taxonomic rank. An additional option allows you to choose the linkage used for the hierarchical clustering.

`samples.plot_distance(metric='braycurtis', linkage='average')`

The default linkage is ` average `

, however ` single `

, ` complete `

, ` weighted `

,

` centroid `

, and ` median `

are also available choices.

** PCoA plots ** (and smacof plots, with

` method='smacof' `

) are available through the ` plot_mds `

command. The same metric and rank options are available for this.

`samples.plot_mds(method='pcoa',metric='braycurtis')`

** **

** PCA plots ** have their own command, with additional options.

- You can choose to plot a specified number of the top-contributing eigenvectors using the
`org_vectors`

parameter. - You can scale those line lengths using
`org_vectors_scale`

. - You can choose to have metadata fields show up when you hover over a point, by providing the metadata field as a string (or a list of strings for multiple metadata fields) to
`tooltip`

. - You can also choose to
`color`

or`label`

your points by a specified metadata field.

You'll see an example in our Advanced Tutorial Notebook and below, where we colored the samples by geographical location, we adjusted the point sizes by abundance of Bacteroides, and included the abundance of Firmicutes in the hover metadata.

`samples.plot_pca(color='geo_loc_name', size='Bacteroides', tooltip='Firmicutes')`

## Next Steps

Explore using notebooks on our platform to generate the results and plots you're interested in. Here's a handy place to get started!

## References

[1] Bray JR, Curtis JT (1957) An ordination of the upland forest communities of southern Wisconsin. Ecol Monogr 27(4):325–349.

[2] Jaccard P (1900) Contribution au problème de l’immigration post-glaciaire de la flore alpine. Bulletin de la Société Vaudoise des Sciences Naturelles 36:87–130.

[3] Krause, EF. (1987) Taxicab Geometry. Dover. ISBN 978-0-486-25202-5.

[4] Lozupone, C., Hamady, M. & Knight, R. UniFrac – An online tool for comparing microbial community diversity in a phylogenetic context. *BMC Bioinformatics* (2006) **7, **371. https://doi.org/10.1186/1471-2105-7-371

[5] Lozupone, C., Hamady, M., Kelley, S. T., & Knight, R. Quantitative and Qualitative β Diversity Measures Lead to Different Insights into Factors That Structure Microbial Communities. *Appl Environ Microbiol. *(2007) **73(5)**: 1576-1585. https://dx.doi.org/10.1128/AEM.01996-06

[6] Chen, J., Bittinger, K., Charlson, E.S., Hoffman, C., Lewis, J., Wu, G.D., Collman, R.G., Bushman, F.D., & Li, H. Associating microbiome composition with environmental covariates using generalized UniFrac distances. *Bioinformatics* (2012) **28(16)**: 2106-2113. https://dx.doi.org/10.1093/bioinformatics/bts342

[7] Gloor, G.B., Macklaim, J.M., Pawlowsky-Glahn, V., Egozcue, J.J. Microbiome Datasets Are Compositional: And This Is Not Optional. *Front. Microbiol.* (2017) **8:** 2224. https://dx.doi.org/10.3389/fmicb.2017.02224