Unsupervised learning of microbial units
In this project, I am interested in extracting and organizing the full heterogeneity of responses across samples and clades along a phylogeny, as present in a traditional 16S dataset. By a response, I mean the difference in relative abundance between sister clades.
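To make the notion of a response concrete, here is a minimal sketch of how it could be computed from a sample-by-taxon count table. The toy tree, the count values, and the helper names (`clade_leaves`, `sister_responses`) are hypothetical and chosen only for illustration; the sketch uses the plain difference in relative abundance stated above and is not the project's actual pipeline.

```python
import numpy as np

# Hypothetical toy phylogeny: each internal node maps to its two children.
# Leaves are integer column indices into the count table.
tree = {
    "root": ("A", "B"),
    "A": (0, 1),
    "B": (2, 3),
}

def clade_leaves(node, tree):
    """Return the leaf indices that descend from a node."""
    if node not in tree:  # anything that is not an internal node is a leaf
        return [node]
    left, right = tree[node]
    return clade_leaves(left, tree) + clade_leaves(right, tree)

def sister_responses(counts, tree):
    """For each internal node, compute (per sample) the difference in
    relative abundance between its two child (sister) clades."""
    # Normalize each sample (row) so that its abundances sum to 1.
    rel = counts / counts.sum(axis=1, keepdims=True)
    responses = {}
    for node, (left, right) in tree.items():
        left_abund = rel[:, clade_leaves(left, tree)].sum(axis=1)
        right_abund = rel[:, clade_leaves(right, tree)].sum(axis=1)
        responses[node] = left_abund - right_abund
    return responses

# Made-up counts: 2 samples (rows) x 4 taxa (columns).
counts = np.array([[10, 30, 40, 20],
                   [50,  0, 25, 25]], dtype=float)
responses = sister_responses(counts, tree)
```

Each entry of `responses` is a vector with one value per sample, so the full collection forms a samples-by-nodes matrix of the kind the model is meant to organize.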
Without any metadata, that is, without prior information on factors such as temperature, salinity, or other variables along an environmental gradient, the generative model, called a path-limited nested hierarchical Dirichlet process (plnHDP), is able to reconstruct and organize samples and clades into a hierarchy of clusters and sub-clusters of responses. Reconstructing the whole heterogeneity of responses in this way amounts to discovering the structure of all confounding factors present in the dataset.
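As a rough intuition for how a nonparametric prior can produce clusters and sub-clusters without fixing their number in advance, the sketch below samples paths from a nested Chinese restaurant process, a simpler related construction from the topic-modeling literature. It is not the plnHDP itself: the function name `sample_ncrp_paths` and the parameters `depth`, `gamma`, and `seed` are assumptions made for illustration, and the path-limiting and inference steps of the actual model are not represented.

```python
import random
from collections import defaultdict

def sample_ncrp_paths(n_items, depth, gamma, seed=0):
    """Assign each item a root-to-leaf path in a tree of fixed depth.

    At every level an item joins an existing child with probability
    proportional to the number of earlier items that chose it, or opens
    a new child with probability proportional to gamma.
    """
    rng = random.Random(seed)
    # Maps a path prefix (tuple) to {child index: number of items through it}.
    counts = defaultdict(dict)
    paths = []
    for _ in range(n_items):
        path = ()
        for _ in range(depth):
            children = counts[path]
            total = sum(children.values())
            u = rng.random() * (total + gamma)
            choice = len(children)  # default: open a new child
            acc = 0.0
            for child, c in children.items():
                acc += c
                if u < acc:
                    choice = child
                    break
            children[choice] = children.get(choice, 0) + 1
            path = path + (choice,)
        paths.append(path)
    return paths

# 20 hypothetical items placed in a three-level hierarchy.
paths = sample_ncrp_paths(n_items=20, depth=3, gamma=1.0)
```

Items that share a path prefix belong to the same cluster at that level, and the remaining entries of the path place them in sub-clusters. Larger values of `gamma` tend to produce more branches.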
This analysis takes inspiration from the field of computational linguistics, namely corpus analysis and topic modeling.
In the figure above, which uses the Tara Oceans expedition dataset, we see how the model recovers the known partition of microbial communities along ocean depth and discovers a putative novel type of microbial community present in the Scotia Sea and on the coast of South Africa.
This analysis leads naturally to the enumeration of novel microbial units, which can be thought of as a coarse-grained alternative to OTUs or ESVs. Each unit is a set of clades that spans the set of sub-clusters in the hierarchical tree. One such set of clades is shown below. These alternative microbial units simplify downstream correlation and ordination analyses in a way that goes beyond typical taxon-based analyses.