Unsupervised learning of microbial units

Hierarchical decomposition of the TARA Ocean dataset using the plnHDP. SRF samples are shown in green, DCM samples in blue, MES samples in orange, and MIX samples in brown. Two other sample types appear: MES samples from the marine oxygen minimum zone, shown in pink, and a DCM sample from the marine oxygen minimum zone, shown in purple. Six clusters are shown: three are named after their main depth constituents (SRF, DCM, MES) and three after their location, namely CH for the coast of Chile, SO for the Southern Ocean, and AR/SWM for the Arabian Sea and two stations far off the southwest coast of Mexico.

In this project I am interested in extracting and organizing the full heterogeneity of responses, across samples and across clades along a phylogeny, present in a traditional 16S dataset. By a response we mean the difference in relative abundance between sister clades.
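To make the notion of a response concrete, here is a minimal sketch in Python for a single sample. The toy tree, the clade names, and the counts are all hypothetical, and the exact formula (each clade's share of the sample total, subtracted between the two sisters) is only one assumed reading of "difference in relative abundance"; the model itself may define responses differently.

```python
from typing import Dict, Tuple

# Hypothetical toy phylogeny: each internal node maps to its two sister clades.
# Any name that is not a key of TREE is treated as a leaf.
TREE: Dict[str, Tuple[str, str]] = {
    "root": ("A", "B"),
    "A": ("a1", "a2"),
}


def clade_count(node: str, leaf_counts: Dict[str, int]) -> int:
    """Total read count of a clade: the sum over all leaves below it."""
    if node not in TREE:
        return leaf_counts.get(node, 0)
    left, right = TREE[node]
    return clade_count(left, leaf_counts) + clade_count(right, leaf_counts)


def sister_responses(leaf_counts: Dict[str, int]) -> Dict[str, float]:
    """For each internal node, the relative abundance of its left child
    minus that of its right child (assumed definition of a response)."""
    total = sum(leaf_counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    responses = {}
    for parent, (left, right) in TREE.items():
        diff = clade_count(left, leaf_counts) - clade_count(right, leaf_counts)
        responses[parent] = diff / total
    return responses


# Made-up read counts for one sample.
sample = {"a1": 30, "a2": 10, "B": 60}
responses = sister_responses(sample)
```

Computing this for every sample gives one response per internal node per sample, which is the kind of sample-by-clade heterogeneity the project sets out to organize.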

Without any metadata, that is, without prior information on factors such as temperature, salinity, or other variables along an environmental gradient, the generative model, called a path-limited nested hierarchical Dirichlet process (plnHDP), is able to reconstruct and organize samples and clades into a hierarchical representation of clusters and sub-clusters of responses. Reconstructing the whole heterogeneity of responses amounts to discovering the structure of all confounding factors present in the dataset.

This analysis takes inspiration from the field of computational linguistics, namely corpus analysis and topic modeling.

In the figure above, which uses the TARA Ocean expedition dataset, we see how the model naturally recovers the known partition of microbial communities along ocean depth and discovers a putative novel type of microbial community present in the Scotia Sea and on the coast of South Africa.

This analysis naturally leads to the enumeration of novel microbial units, which can be thought of as a coarse-grained alternative to OTUs or ESVs. A unit is any set of clades that spans the set of sub-clusters in the hierarchical tree. One such set of clades is shown below. These alternative microbial units serve to simplify downstream correlation and ordination analyses in a way that goes beyond typical taxon-based analyses.

Hierarchical decomposition of a zebrafish gut microbiome dataset under different zinc diets
Putative microbial units in a dataset of zebrafish guts under different zinc diets
Alice Doucet Beaupré
Postdoctoral Fellow
