Abstract
The default variable-importance measure in random forests, Gini importance, has been shown to suffer from the bias of the underlying Gini-gain splitting criterion. While the alternative permutation importance is generally accepted as a reliable measure of variable importance, it is also computationally demanding and suffers from other shortcomings. We propose a simple solution to the untrustworthy Gini importance, which can be viewed as an overfitting problem: we compute the loss reduction on the out-of-bag instead of the in-bag training samples.
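The idea in the abstract can be sketched as follows. This is a minimal illustration, not the paper's reference implementation: for each tree we recover its out-of-bag (OOB) rows, route them through the fitted tree, and recompute every node's Gini gain from the OOB class counts instead of the in-bag counts used at training time. The bootstrap reconstruction below assumes scikit-learn's internal convention (each tree's integer `random_state` seeds `randint` draws with replacement), which is an implementation detail that may change across versions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

def gini(counts):
    """Gini impurity from a vector of class counts."""
    n = counts.sum()
    if n == 0:
        return 0.0
    p = counts / n
    return 1.0 - np.sum(p ** 2)

def oob_gini_importance(forest, X, y):
    """Accumulate per-feature Gini gains evaluated on each tree's OOB samples."""
    n, n_features = X.shape
    classes = np.unique(y)
    imp = np.zeros(n_features)
    for tree in forest.estimators_:
        # Reconstruct this tree's bootstrap sample to find its OOB rows
        # (assumption: mirrors sklearn's internal sampling; version-dependent).
        rng = np.random.RandomState(tree.random_state)
        sampled = rng.randint(0, n, n)
        oob = np.setdiff1d(np.arange(n), sampled)
        if oob.size == 0:
            continue
        y_oob = y[oob]
        t = tree.tree_
        # node_mat[i, v] is True if OOB sample i passes through node v
        node_mat = tree.decision_path(X[oob]).toarray().astype(bool)

        def counts(node):
            return np.array([(y_oob[node_mat[:, node]] == c).sum()
                             for c in classes])

        for v in range(t.node_count):
            left, right = t.children_left[v], t.children_right[v]
            if left == -1:  # leaf: no split, no gain
                continue
            c_p, c_l, c_r = counts(v), counts(left), counts(right)
            n_p = c_p.sum()
            if n_p == 0:
                continue
            # OOB gain can be negative for overfitted splits --
            # that is precisely the signal the abstract exploits.
            gain = gini(c_p) - (c_l.sum() * gini(c_l)
                                + c_r.sum() * gini(c_r)) / n_p
            imp[t.feature[v]] += n_p * gain
    return imp / len(forest.estimators_)

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(oob_gini_importance(rf, X, y))
```

On the iris data the petal features (columns 2 and 3) dominate, in line with both standard Gini importance and permutation importance, while overfitted splits on weak features receive small or negative OOB gains.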
Acknowledgments
I gratefully acknowledge the Berlin School of Economics and Law for granting me a research sabbatical without which this work would have been difficult to complete.
Notes
3 It turns out that is equivalent to the measure defined in Zhou and Hooker (2019), while a slightly modified version of leads to the same OOB-based MDI score defined in Li et al. (2019), as shown in Appendix A2.
4 For ease of notation we have (i) left out the multiplier 2 and (ii) omitted an index for the class membership.