ABSTRACT
Studies investigating individual differences in reading ability often involve data sets containing a large number of collinear predictors and a small number of observations. In this article, we discuss the method of Random Forests and demonstrate its suitability for addressing the statistical concerns raised by such data sets. The method is contrasted with other methods of estimating relative variable importance, especially Dominance Analysis and Multimodel Inference. All methods were applied to a data set that gauged eye movements during reading and offline comprehension in the context of multiple ability measures with high collinearity due to their shared verbal core. We demonstrate that the Random Forests method surpasses the other methods in its ability to handle model overfitting and accounts for a comparable or larger amount of variance in reading measures.
Supplemental Material
Supplemental data for this article can be accessed at www.tandfonline.com/hssr.
Notes
1 The term “relative importance” is used here in line with the existing statistical literature, in which “importance” denotes a statistic associated with a variable rather than the variable’s interpretational value for theory building or policymaking.
2 In Matsuki, Kuperman, and Van Dyke (2015) we use the Random Forests technique to establish relative variable importance in the joint pool of participant variables, which gauge individual differences, and text variables at the level of the word (length, frequency, contextual predictability), the sentence (word position in a sentence, word’s syntactic role), and the passage (text complexity).
3 Text complexity is not included as a predictor in these models because it requires 4 regression coefficients.