ABSTRACT
Recent advances in DNA sequencing technology have rapidly improved our understanding of how the human microbiome contributes to many aspects of normal human physiology and disease. A major goal of human microbiome studies is to identify important groups of microbes that are predictive of host phenotypes. However, the large number of bacterial taxa and the compositional nature of the data make this goal difficult to achieve with traditional approaches. Furthermore, microbiome data are structured: bacterial taxa are not independent of one another but are related evolutionarily through a phylogenetic tree. To address these challenges, we introduce the concept of variable fusion for high-dimensional compositional data and propose a novel tree-guided variable fusion method. Our method is based on the linear regression model with tree-guided penalty functions. It incorporates the tree information node by node and can build predictive models composed of bacterial taxa at different taxonomic levels. A gut microbiome data analysis and simulation studies illustrate the good performance of the proposed method. Supplementary materials for this article are available online.
Supplementary Materials
The online supplementary materials include the following:
Section A: Additional tree information
Section B: Additional simulation results
Section C: Additional simulations
Section D: Impact of the tree structure
Section E: Issue of tree rooting
Acknowledgments
The authors gratefully acknowledge Hongzhe Li and Jun Chen for providing the data.
Funding
This work was supported by the Natural Science Foundation of China (11601326) and NIH grant R01 GM59507.