Abstract
Integrating machine learning into educational data analysis raises the problem of insufficient training data, especially when missing data ratios are high. These challenges arise from data partitioning practices, which yield smaller training datasets and less precise models. Behavioral scientists have increasingly incorporated machine learning into propensity score estimation, which calls for an investigation of the most effective training and testing partitioning methods for machine learning-based imputation. To address this gap in the literature, our Monte Carlo experiment examines the impact of partitioning methods and missing data ratios. Simulated datasets with missing ratios of 10%, 30%, 50%, and 70% are divided into training and testing sets, with splits ranging from 80–20 to 20–80. Results indicate that each imputation method delivers highly accurate estimates of the average treatment effect. However, in maintaining covariate balance across diverse conditions, complex ensemble methods outperform artificial neural networks. A real-data comparison (Study II) further underscores that adopting sophisticated machine learning techniques significantly enhances covariate balance. This research contributes insights into the development of machine learning-based imputation methods for educational data analysis, with a specific focus on scenarios characterized by high missing data ratios.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Notes
1 A covariate is a confounder when it relates to Z and Y. All covariates in this article are confounders. For simplicity, we use the term “covariates” in the remainder of the article.
2 Different strategies have been suggested. The MIps technique, for instance, aggregates the PSs from all of the imputed datasets. A third multiple imputation strategy combines the PS parameters rather than the PSs themselves, and a missingness pattern approach employs a distinct PS model for each pattern of missingness. For more information on other approaches, see Leite et al. (2021).