
Gradient Tree Boosting for Hierarchical Data

Pages 911-937 | Published online: 05 Jan 2023
 

Abstract

Gradient tree boosting is a powerful machine learning technique that has shown good performance in predicting a variety of outcomes. However, when applied to hierarchical (e.g., longitudinal or clustered) data, the predictive performance of gradient tree boosting may be harmed by ignoring the hierarchical structure, and may be improved by accounting for it. Tree-based methods such as regression trees and random forests have already been extended to hierarchical data settings by combining them with the linear mixed effects model (MEM). In the present article, we add to this literature by proposing two algorithms to estimate a combination of the MEM and gradient tree boosting. We report on two simulation studies that (i) investigate the predictive performance of the two MEM boosting algorithms and (ii) compare them to standard gradient tree boosting, standard random forest, and other existing methods for hierarchical data (MEM, MEM random forests, model-based boosting, Bayesian additive regression trees [BART]). We found substantial improvements in the predictive performance of our MEM boosting algorithms over standard boosting when the random effects were non-negligible. MEM boosting as well as BART showed a predictive performance similar to the correctly specified MEM (i.e., the benchmark model), and overall outperformed the model-based boosting and random forest approaches.
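
To make the general idea concrete, the following is a rough R sketch of an MERF-style estimation loop in which gradient tree boosting (here via the gbm package) acts as the fixed-effects learner and a linear mixed effects model supplies the random intercepts. It illustrates the kind of combination described in the abstract, not the authors' exact MEM boosting algorithms; the names mem_boost_sketch, dat, x_names, and cluster are hypothetical, and dat is assumed to contain the outcome y, the predictors, and a cluster identifier.

```r
library(gbm)    # gradient tree boosting
library(nlme)   # linear mixed-effects model for the random intercepts

mem_boost_sketch <- function(dat, x_names, n_iter = 20) {
  b_hat <- rep(0, nrow(dat))                       # current random-effect predictions
  for (k in seq_len(n_iter)) {
    # Step 1: boosting on the outcome with the current random effects removed
    fit_boost <- gbm.fit(x = dat[, x_names], y = dat$y - b_hat,
                         distribution = "gaussian", n.trees = 100,
                         interaction.depth = 3, shrinkage = 0.1, verbose = FALSE)
    f_hat <- predict(fit_boost, newdata = dat[, x_names], n.trees = 100)
    # Step 2: random-intercept model on the boosting residuals updates the random effects
    dat$resid <- dat$y - f_hat
    fit_lme <- lme(resid ~ 1, random = ~ 1 | cluster, data = dat)
    re      <- ranef(fit_lme)                      # one predicted intercept per cluster
    b_hat   <- re[as.character(dat$cluster), 1]
  }
  list(boosting = fit_boost, random_effects = fit_lme)
}
```

In each iteration the current random-effect predictions are subtracted from the outcome before boosting, and the boosting residuals are then used to re-estimate the random effects, mirroring the alternating scheme used by MERF-type methods for hierarchical data.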

Article information

Conflict of interest disclosures: Each author signed a form for disclosure of potential conflicts of interest. No authors reported any financial or other conflicts of interest in relation to the work described.

Ethical principles: The authors affirm having followed professional ethical guidelines in preparing this work. These guidelines include obtaining informed consent from human participants, maintaining ethical treatment and respect for the rights of human or animal participants, and ensuring the privacy of participants and their data, such as ensuring that individual participants cannot be identified in reported results or from publicly available original or archival data.

Funding: This work was not supported.

Role of the funders/sponsors: None of the funders or sponsors of this research had any role in the design and conduct of the study; collection, management, analysis, and interpretation of data; preparation, review, or approval of the manuscript; or decision to submit the manuscript for publication.

Acknowledgments: Ideas and opinions expressed herein are those of the authors alone, and endorsement by the authors’ institution is not intended and should not be inferred.

Open Scholarship

This article has earned the Center for Open Science badges for Open Data and Open Materials through Open Practices Disclosure. The data and materials are openly accessible at https://osf.io/kuzf6/. To obtain the author’s disclosure form, please contact the Editor.

Notes

1 For loss functions other than L2, Equation 9 is difficult to minimize directly. Therefore, a gradient descent-based procedure is used instead. This procedure consists of two steps: First, the new tree is fit via least squares to the current negative gradient of the loss function, $\tilde{y}_{j,m} = -\left[\frac{\partial L(y_j, f(x_j))}{\partial f(x_j)}\right]_{f(x)=f_{m-1}(x)}$. Next, given the tree regions $\{R_{g,m}\}_{g=1}^{G_m}$, the optimal constant per region is found by minimizing the original loss function, that is, by solving $\hat{\gamma}_{g,m} = \operatorname*{arg\,min}_{\gamma_{g,m}} \sum_{x_j \in R_{g,m}} L(y_j, f_{m-1}(x_j) + \gamma_{g,m})$. Hence, this approach “permits the replacement of the difficult function minimization problem [Equation 9] by least-squares function minimization, followed by only a single parameter optimization based on the original criterion” (Friedman, 2001, p. 1193). Note that for the Huber loss $L_{H,\delta}$, the latter optimization problem does not have a closed-form solution. Thus, the leaf predictions are found by using an approximation as suggested by Friedman (2001, p. 1198).
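
As a minimal illustration of these two steps for the L2 loss, where the negative gradient is simply the residual and the optimal constant in each region is the region mean, the following R sketch performs one boosting iteration with an rpart base learner. The names one_boosting_step, X, y, f_current, and nu are illustrative only and not taken from the article.

```r
library(rpart)

one_boosting_step <- function(X, y, f_current, nu = 0.1, maxdepth = 3) {
  # Pseudo-responses: the negative gradient of the L2 loss is the residual y - f
  dat  <- data.frame(X, resid = y - f_current)
  tree <- rpart(resid ~ ., data = dat,
                control = rpart.control(maxdepth = maxdepth, cp = 0))
  # For the L2 loss the optimal constant per terminal region is the region mean,
  # which is exactly what rpart's leaf predictions return; for other losses
  # (e.g., the Huber loss) a separate per-leaf optimization would be needed here.
  f_current + nu * predict(tree, newdata = dat)
}
```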

2 Hajjem et al. (2014) additionally investigated a setting with uncorrelated predictor variables, but found no difference between setting ρ to 0 and to 0.4.

3 In another simulation, we compared our self-implemented gradient tree boosting algorithm with the gbm package and obtained very similar results.
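
For reference, a standard gradient tree boosting model of the kind used in such a comparison could be fit with gbm roughly as follows; the hyper-parameter values are illustrative rather than the settings used in the article, and dat and newdat are assumed data frames containing the outcome y and the predictors.

```r
library(gbm)

fit_gbm <- gbm(y ~ ., data = dat, distribution = "gaussian",
               n.trees = 500, interaction.depth = 3,
               shrinkage = 0.1, cv.folds = 5)
best_m <- gbm.perf(fit_gbm, method = "cv")                      # optimal number of trees
pred   <- predict(fit_gbm, newdata = newdat, n.trees = best_m)  # predictions on new data
```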

4 In rpart, the maximal tree depth limits (but does not fix) the maximal number of terminal nodes: for a maximal tree depth of d, at most $2^d$ terminal nodes can result.
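
A small example of this behavior, using the iris data merely for illustration:

```r
library(rpart)

# With maxdepth = 3 the tree can have at most 2^3 = 8 terminal nodes,
# but may end up with fewer if some branches stop splitting earlier.
fit <- rpart(Sepal.Length ~ ., data = iris,
             control = rpart.control(maxdepth = 3, cp = 0))
sum(fit$frame$var == "<leaf>")   # number of terminal nodes (<= 8)
```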

5 For the standard random forest and MERF, the default settings for the main hyper-parameters were: number of trees = 500, number of variables randomly sampled as candidates at each split = p/3, and minimum size of terminal nodes = 5. For the model-based boosting models, the main default values were: initial number of boosting iterations = 100 (as mentioned, the optimal number of iterations was then chosen using the cvrisk function), shrinkage = 0.1, and maximum tree depth = 1 in the case of the model using a tree base-learner.
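
These defaults correspond roughly to the following calls. This is a sketch of the default configurations rather than the authors' exact code; X, y, and dat are assumed objects (a predictor data frame, a numeric outcome, and a data frame containing both).

```r
library(randomForest)
library(mboost)

p  <- ncol(X)                                  # number of predictors (assumed)
rf <- randomForest(x = X, y = y,
                   ntree = 500,                # number of trees
                   mtry = floor(p / 3),        # candidate variables per split
                   nodesize = 5)               # minimum size of terminal nodes

# Model-based boosting with a depth-1 tree base-learner
bb <- blackboost(y ~ ., data = dat,
                 control = boost_control(mstop = 100, nu = 0.1),
                 tree_controls = partykit::ctree_control(maxdepth = 1))
cv <- cvrisk(bb)                               # cross-validated risk per iteration
bb <- bb[mstop(cv)]                            # keep the optimal number of iterations
```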

6 We also examined the predictive performance of the REEMforest of Capitaine et al. (2021). However, since the algorithm took over 3 hours per replication to converge in the simulation conditions in which the training data consisted of 250 clusters, even after we adapted the respective R function to use C++, we only investigated the simulation condition using 100 clusters with 30 observations. Furthermore, we restricted the maximum number of iterations to 3 instead of the default value of 100 (even then, the average computation time of the REEMforest algorithm was still around 12 minutes in both simulations, more than 11 times higher than that of MERF). The results showed that in both simulations the predictive performance of REEMforest was very similar to that of MERF (Simulation I: PMSE = 2.55, 2.60, and 2.62 in the known-clusters case and PMSE = 2.64, 2.88, and 3.55 in the unknown-clusters case for ICC = 0.05, 0.25, and 0.50, respectively; Simulation II: PMSE = 2.68 (5.24), 2.97 (6.26), and 3.02 (6.59) in the known-clusters case and PMSE = 2.98 (6.40), 6.64 (20.15), and 15.43 (53.08) in the unknown-clusters case for α = 0.05 (0.20) and ICC = 0.05, 0.25, and 0.50, respectively).
