Research Paper

Ordering taxa in image convolution networks improves microbiome-based machine learning accuracy

Article: 2224474 | Received 15 Dec 2022, Accepted 08 Jun 2023, Published online: 21 Jun 2023

Figures & data

Table 1. Different approaches to the microbiome ML limitations discussed in the introduction.

Table 2. Table of datasets.

Figure 1. iMic’s and gMic’s architectures and AUCs.

(a) gMic+v architecture: We position all observed taxa at the leaves of the taxonomy tree (cladogram) and set the value of each leaf to its preprocessed frequency. Each internal node is the average of its direct descendants. These values are the input to a GCN layer with the adjacency matrix of the cladogram. The GCN layer is followed by two fully connected layers with a binary output. (b) iMic's architecture: The values in the cladogram are as in gMic+v. The cladogram is then used to populate a 2-dimensional matrix. Each row in the image represents a taxonomic level. The order within each row is based on a recursive hierarchical clustering of the sample values that preserves the structure of the tree. The image is the input of a CNN followed by two fully connected layers with a binary output. (c) Comparison of model performance: The average AUC is measured on the external test set for nine different phenotypes. Each subplot is a phenotype. The stars represent the significance of the p-value (after Benjamini-Hochberg correction) on the external test set. If the significance differed between the 10 CVs and the external test set, the corrected p-value of the 10 CVs is reported in brackets; * p < 0.05, ** p < 0.01, *** p < 0.001. For the parallel results on the 10 CVs, see Supp. Mat. Fig. S2. The rightmost set of plots is the baseline. The green bars are the current best baseline. The light blue bar to the right is the best baseline obtained using MIPMLP. The central pink bars are the iMic AUC using either a one- or two-dimensional CNN. The leftmost bars are for gMic (either gMic or gMic+v). We also added the iMic results to allow for a comparison.
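As a rough illustration of the image construction described in panel (b), the following Python sketch (not the authors' implementation; the semicolon-separated lineage input format, the function names, and the simple depth-first column ordering that stands in for the paper's dendrogram-based reordering are all assumptions) builds a level-by-taxon matrix in which each internal node is the mean of its direct descendants and every subtree occupies a contiguous block of columns.

```python
# Minimal sketch of turning taxonomy-keyed abundances into a 2D "microbiome image":
# rows are taxonomic levels, each internal node is the mean of its children,
# and columns follow a depth-first order that keeps subtrees contiguous.
import numpy as np

def build_image(abundances: dict[str, float], n_levels: int = 7) -> np.ndarray:
    # Build the cladogram: a node key is a lineage prefix, children are longer prefixes.
    children: dict[tuple, set] = {(): set()}
    leaf_value: dict[tuple, float] = {}
    for lineage, value in abundances.items():
        taxa = tuple(lineage.split(";"))
        leaf_value[taxa] = value
        for depth in range(len(taxa)):
            parent, child = taxa[:depth], taxa[:depth + 1]
            children.setdefault(parent, set()).add(child)
            children.setdefault(child, set())

    # Post-order pass: a node's value is its own abundance if it is a leaf,
    # otherwise the mean of its direct descendants.
    node_value: dict[tuple, float] = {}
    def fill(node: tuple) -> float:
        kids = sorted(children[node])
        if not kids:
            node_value[node] = leaf_value.get(node, 0.0)
        else:
            node_value[node] = float(np.mean([fill(k) for k in kids]))
        return node_value[node]
    fill(())

    # Depth-first traversal assigns each leaf a column, so every subtree
    # occupies a contiguous block of columns (a simplified stand-in for the
    # paper's hierarchical-clustering reordering).
    leaves = [n for n in children if not children[n] and n]
    image = np.zeros((n_levels, len(leaves)))
    col = 0
    def place(node: tuple):
        nonlocal col
        kids = sorted(children[node])
        if not kids:
            for depth in range(1, len(node) + 1):
                image[depth - 1, col] = node_value[node[:depth]]
            col += 1
        for k in kids:
            place(k)
    place(())
    return image

img = build_image({"k__Bacteria;p__Firmicutes;c__Bacilli": 0.4,
                   "k__Bacteria;p__Bacteroidetes;c__Bacteroidia": 0.6})
print(img.shape)  # (7, 2)
```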

Table 3. Details of the sequential datasets.

Table 4. Mean performance over 10 CVs with standard deviation on the external test sets; the standard deviation is computed across CV folds.

Table 5. Features can be added to iMic's learning. Average AUCs of iMic-CNN2 with and without non-microbial features, as well as the average results of naive models with non-microbial features. The results are the average AUCs on an external test set over 10 CVs ± their standard deviations (stds).
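A minimal sketch of how such non-microbial covariates could be appended, assuming a PyTorch model in the spirit of iMic-CNN2 (the class name, layer sizes, and exact architecture are illustrative, not the authors' code): the microbiome image passes through a CNN, is flattened, concatenated with the extra features, and fed to two fully connected layers with a binary output.

```python
# Hedged sketch: concatenating non-microbial covariates with the CNN features.
import torch
import torch.nn as nn

class IMicWithCovariates(nn.Module):
    def __init__(self, n_levels: int, n_taxa: int, n_extra: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),  # hypothetical channel/kernel sizes
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        flat = 8 * (n_levels // 2) * (n_taxa // 2)
        self.fc = nn.Sequential(
            nn.Linear(flat + n_extra, 64),
            nn.ReLU(),
            nn.Linear(64, 1),  # binary phenotype logit
        )

    def forward(self, image: torch.Tensor, extra: torch.Tensor) -> torch.Tensor:
        h = self.conv(image).flatten(start_dim=1)
        return self.fc(torch.cat([h, extra], dim=1))

model = IMicWithCovariates(n_levels=8, n_taxa=256, n_extra=3)
logit = model(torch.randn(4, 1, 8, 256), torch.randn(4, 3))
print(logit.shape)  # torch.Size([4, 1])
```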

Figure 2. iMic copes with the ML challenges above better than other methods.

(a) Average test AUC (over 10 CVs) as a function of the simulated sparsity level, where the first point is the AUC at the original sparsity level (72%, "baseline") on the Cirrhosis dataset. iMic has the highest AUC at all simulated sparsity levels (purple line). The error bars represent the standard errors. (b) Average change in AUC (AUC minus baseline AUC) as a function of the sparsity level on the Cirrhosis dataset. (c) Overall average AUC change over all the other datasets apart from Cirrhosis. (d) Average AUC as a function of the number of samples in the training set (Cirrhosis dataset). The error bars represent the standard errors of each model over the 10 CVs. (e) Average change in AUC (AUC minus baseline AUC) as a function of the percentage of samples in the training set. (f) Overall average AUC change over all the algorithms that managed to learn (baseline AUC > 0.55) as a function of the percentage of samples in the training set. (g) Importance of ordering taxa. The x-axis represents the average AUC over 10 CVs and the y-axis represents the different datasets used. The dark purple bars represent the AUC on the images without taxa reordering, while the light purple bars represent the AUC on the images with the dendrogram reordering, with standard errors. All the differences between the AUCs are significant after Benjamini-Hochberg correction (p-value < 0.001). All the AUCs are calculated on an external test set for each CV. Similar results were obtained on the 10 CVs.
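The two robustness checks in panels (a-f) can be sketched as follows. This is an illustrative protocol only: a random forest stands in for the compared models, the data are synthetic stand-ins for a preprocessed ASV table, and the function names are hypothetical, not the authors' code.

```python
# Sketch of the robustness protocol: (i) raise the sparsity of the count table
# by zeroing random non-zero entries, (ii) shrink the training set, then
# re-measure the test AUC in each setting.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def add_sparsity(X: np.ndarray, target_sparsity: float) -> np.ndarray:
    """Zero random non-zero entries until the requested fraction of zeros is reached."""
    X = X.copy()
    current = (X == 0).mean()
    if target_sparsity <= current:
        return X
    nz = np.argwhere(X != 0)
    n_drop = int((target_sparsity - current) * X.size)
    drop = nz[rng.choice(len(nz), size=min(n_drop, len(nz)), replace=False)]
    X[drop[:, 0], drop[:, 1]] = 0
    return X

def auc_at(X_tr, y_tr, X_te, y_te, sparsity=None, train_frac=1.0):
    if sparsity is not None:
        X_tr, X_te = add_sparsity(X_tr, sparsity), add_sparsity(X_te, sparsity)
    n = max(2, int(train_frac * len(y_tr)))
    idx = rng.choice(len(y_tr), size=n, replace=False)
    clf = RandomForestClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

# Toy data standing in for a preprocessed ASV table and a binary phenotype.
X = rng.poisson(0.5, size=(200, 100)).astype(float)
y = (X[:, :5].sum(1) + rng.normal(0, 1, 200) > 2).astype(int)
X_tr, X_te, y_tr, y_te = X[:150], X[150:], y[:150], y[150:]
for s in (None, 0.8, 0.9):
    print("sparsity", s, "AUC", round(auc_at(X_tr, y_tr, X_te, y_te, sparsity=s), 3))
```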

Figure 3. Interpretation of iMic’s results.

(a, b) Cladogram projections: To visualize the taxa contributing to each class, the healthy class (a) and the CD class (b), we projected the most significant microbes back onto the cladogram. The purple points on the cladograms represent taxa in the top decile of the gradients. Taxa in bold are important taxa that are consistent with the literature. (c, d) Grad-Cam images: Each image represents the average contribution of each input value to the gradients of the neural network back-propagation, as computed by the Grad-Cam algorithm. Grad-Cam was applied after the first CNN layer. The results presented here are from the CD dataset. (c) represents the average gradients for the healthy subjects of the cohort and (d) represents the average gradients for the CD subjects. The color reflects the average value of the gradients: blue represents low gradients and yellow represents high gradients, using the 'viridis' colormap. The differences between the two heatmaps represent the contribution of different taxa to the prediction of different phenotypes. Note that the main contribution to the classification is at the genus and family levels (rows 6 and 5). Similar results were obtained for the other datasets (Supp. Mat. Fig. S11). (e-h) Interpretation tests on the CD dataset (e), the IBD dataset (f), the Cirrhosis dataset (g), and the Ravel dataset (h). Average AUC values over 10 CVs on the external test set. The x-axis represents the fraction of removed columns. The dark bars represent the performance when all of the columns with Grad-Cam scores below this fraction were removed, and the light bars represent the performance when the columns with scores above this fraction were removed. The black line represents the average AUC over 10 CVs of the original model with all the input columns. Results from the other datasets were similar; see Supp. Mat. Fig. S11. Removing the top-scoring columns always reduced the performance, while removing the bottom-scoring columns increased or did not change the AUC.
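A minimal sketch of the interpretation procedure, assuming a PyTorch model whose first convolutional layer preserves the spatial dimensions of the input image and which returns a single logit (function names and layout are hypothetical, not the authors' code): Grad-Cam is taken after the first CNN layer, averaged into a per-column score, and columns below or above a chosen quantile are zeroed out before re-evaluating the trained model.

```python
# Hedged sketch: Grad-Cam after the first conv layer, plus column ablation.
import torch
import torch.nn as nn

def gradcam_after_first_conv(model: nn.Module, first_conv: nn.Module,
                             image: torch.Tensor) -> torch.Tensor:
    """Return a (height, width) heatmap of gradient-weighted activations."""
    feats = {}
    def hook(_, __, output):
        output.retain_grad()          # keep the gradient of this intermediate tensor
        feats["a"] = output
    handle = first_conv.register_forward_hook(hook)
    logit = model(image)              # assumes a single-logit output
    logit.sum().backward()
    handle.remove()
    a, g = feats["a"], feats["a"].grad
    weights = g.mean(dim=(2, 3), keepdim=True)   # channel weights: GAP of gradients
    cam = torch.relu((weights * a).sum(dim=1))   # weighted sum over channels
    return cam.mean(dim=0)                       # average over the batch

def mask_columns(images: torch.Tensor, cam: torch.Tensor,
                 frac: float, keep_top: bool) -> torch.Tensor:
    """Zero out taxon columns below (or above) the given Grad-Cam quantile."""
    col_score = cam.mean(dim=0)                  # one score per taxon column
    cutoff = torch.quantile(col_score, frac)
    keep = col_score >= cutoff if keep_top else col_score < cutoff
    return images * keep.view(1, 1, 1, -1)
```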

Figure 4. 3D learning.

(a) iMic 3D architecture: The ASV frequencies of each snapshot are preprocessed and combined into images as in the static iMic. The images from the different time points are combined into a 3D image, which is the input of a 3-dimensional CNN followed by two fully connected layers that return the predicted phenotype. (b) Performance of 3D learning vs phyLoSTM: The AUCs of 3D-iMic are consistently higher than those of phyLoSTM, the current state-of-the-art for these datasets, on all the tags and datasets we checked (n=5; two-sided t-test, p-value <0.0005). The standard errors among the CVs are also shown. To visualize the three-dimensional gradients (as in Figure 3), we studied a CNN with a time window of 3 (i.e., 3 consecutive images combined using convolution). We projected the Grad-Cam images onto the R, G, and B channels of an image. Each channel represents a different time point, where R = earliest, G = middle, and B = latest time point. (c, d) Images after Grad-Cam: Each pixel represents the value of the back-propagated gradients after the CNN layer. The 2-dimensional image is the combination of the three channels above (i.e., the gradients of the first/second/third time step are in red/green/blue). The left image is for normal-birth subjects in the DiGiulio dataset, and the right image is for pre-term-birth subjects. (e, f) Grad-Cam projection: Projection of the above heatmaps onto the cladogram as in Figure 3. Taxa in bold are important taxa that are consistent with the literature.
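A hedged sketch of the 3D variant in panel (a), written in PyTorch with illustrative layer sizes and names (not the paper's exact architecture): the per-time-point microbiome images are stacked along a depth axis and passed through a Conv3d block followed by two fully connected layers that return the phenotype logit.

```python
# Sketch of a 3D-CNN over stacked per-time-point microbiome images.
import torch
import torch.nn as nn

class IMic3D(nn.Module):
    def __init__(self, n_times: int, n_levels: int, n_taxa: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),   # pool only over the taxonomy image, not time
        )
        flat = 8 * n_times * (n_levels // 2) * (n_taxa // 2)
        self.fc = nn.Sequential(nn.Linear(flat, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, time, taxonomic levels, taxon columns)
        return self.fc(self.conv(x).flatten(start_dim=1))

model = IMic3D(n_times=3, n_levels=8, n_taxa=128)
logit = model(torch.randn(2, 1, 3, 8, 128))
print(logit.shape)  # torch.Size([2, 1])
```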

Table 6. Notations.

Supplemental material

Supplemental Material (MS Word, 3.6 MB)

Data availability statement

All datasets are available at https://github.com/oshritshtossel/iMic/tree/master/Raw_data.