ABSTRACT
Cross-cohort validation is essential for gut-microbiome-based disease stratification but has been performed for only a limited number of diseases. Here, we systematically evaluated the cross-cohort performance of gut-microbiome-based machine-learning classifiers for 20 diseases. Using single-cohort classifiers, we obtained high predictive accuracies in intra-cohort validation (~0.77 AUC) but low accuracies in cross-cohort validation, except for intestinal diseases (~0.73 AUC). We then built combined-cohort classifiers, trained on samples pooled from multiple cohorts, to improve validation performance for non-intestinal diseases, and estimated the sample size required to achieve validation accuracies of >0.7. In addition, for intestinal diseases, we observed higher validation performance for classifiers using metagenomic data than for those using 16S amplicon data. We further quantified cross-cohort marker consistency using a Marker Similarity Index and observed similar trends. Together, our results support the gut microbiome as an independent diagnostic tool for intestinal diseases and reveal strategies to improve cross-cohort performance based on the identified determinants of consistent cross-cohort gut microbiome alterations.
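The distinction between intra-cohort and cross-cohort validation can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two cohorts are synthetic, the effect sizes are arbitrary, and the mean-difference scorer is a simple stand-in rather than the classifiers used in this study. It only shows why a model that relies on cohort-specific markers can score well on held-out samples from its own cohort yet drop toward chance on another cohort.

```python
import random

def auc(scores, labels):
    """Rank-based AUC: fraction of (case, control) pairs in which the case
    receives the higher score; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def make_cohort(rng, n_per_class, case_shift):
    """Hypothetical cohort: controls ~ N(0, 1) per taxon; cases are shifted
    by case_shift[j] on taxon j."""
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            X.append([rng.gauss(label * s, 1.0) for s in case_shift])
            y.append(label)
    return X, y

def fit_mean_difference(X, y):
    """Stand-in classifier: weight vector = case mean minus control mean."""
    n_taxa = len(X[0])
    def class_mean(label):
        rows = [x for x, lab in zip(X, y) if lab == label]
        return [sum(r[j] for r in rows) / len(rows) for j in range(n_taxa)]
    case, ctrl = class_mean(1), class_mean(0)
    return [c - k for c, k in zip(case, ctrl)]

def predict(w, X):
    """Score each sample by its projection onto the weight vector."""
    return [sum(wj * xj for wj, xj in zip(w, x)) for x in X]

rng = random.Random(0)
# Assumed scenario: the disease signal sits on different taxa in each cohort.
X_a, y_a = make_cohort(rng, 100, [1.5, 1.5, 0.0, 0.0, 0.0])
X_b, y_b = make_cohort(rng, 100, [0.0, 0.0, 1.5, 1.5, 0.0])

# Intra-cohort validation: train and test on disjoint halves of cohort A.
idx = list(range(len(y_a)))
rng.shuffle(idx)
train, test = idx[:100], idx[100:]
w = fit_mean_difference([X_a[i] for i in train], [y_a[i] for i in train])
intra_auc = auc(predict(w, [X_a[i] for i in test]), [y_a[i] for i in test])

# Cross-cohort validation: train on all of cohort A, test on cohort B.
w_full = fit_mean_difference(X_a, y_a)
cross_auc = auc(predict(w_full, X_b), y_b)

print(f"intra-cohort AUC: {intra_auc:.2f}")
print(f"cross-cohort AUC: {cross_auc:.2f}")
```

In this toy setting the gap arises purely because the two cohorts share no markers. When the shifted taxa overlap between cohorts, the cross-cohort AUC rises accordingly.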
Disclosure statement
No potential conflict of interest was reported by the authors.
Author contributions
W.H.C. and X.M.Z. designed and directed the research. J.Z., H.W., C.S. and N.L.G. helped with sample collection. M.L. and J.L. analyzed the data, performed the modeling and wrote the paper with results from all authors. W.H.C. and X.M.Z. polished the manuscript through multiple rounds of discussion with all authors. All authors read and approved the final manuscript.
Data availability statement
The processed data and code that support the findings of this study are available in the GitHub repository at https://github.com/whchenlab/GMModels. These data were derived from the following resources available in the public domain: NCBI (https://www.ncbi.nlm.nih.gov/sra), ENA (https://www.ebi.ac.uk/ena/browser/), MGnify (https://www.ebi.ac.uk/metagenomics/), and GMrepo v2 (https://gmrepo.humangut.info). The accession codes are listed in Table S1.
Ethics approval
This study neither received nor required ethics approval, as it reused publicly available data.
Supplementary material
Supplemental data for this article can be accessed online at https://doi.org/10.1080/19490976.2023.2205386.