Research Article

Optimizing Imputation for Educational Data: Exploring Training Partition and Missing Data Ratios

Published online: 01 Jan 2024
 

Abstract

Integrating machine learning into educational data analysis raises the problem of obtaining sufficient training data, especially when missing data ratios are high. The problem stems from data partitioning practices, which leave smaller datasets for training and yield less precise models. Behavioral scientists have increasingly incorporated machine learning into propensity score estimation, which calls for investigation into the most effective training and testing partitioning methods for machine learning-based imputation. To address this gap in the literature, our Monte Carlo experiment examines the impact of partitioning methods and missing data ratios. Simulated datasets, featuring missing ratios of 10%, 30%, 50%, and 70%, are divided into training and testing sets, with splits ranging from 80–20 to 20–80. Results indicate that each imputation method delivers highly accurate estimates of the average treatment effect. However, when it comes to maintaining covariate balance across diverse conditions, complex ensemble methods outperform artificial neural networks. A real-data comparison (Study II) further underscores that the adoption of sophisticated machine learning techniques significantly enhances covariate balance. This research contributes insights into the development of machine learning-based imputation methods for educational data analysis, with a specific focus on scenarios characterized by high missing data ratios.
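
To make the sample-size concern concrete, the following is a minimal sketch of how a missing data ratio and a training partition jointly shrink the data available for fitting an imputation model. It is not the authors' simulation code. The sample size of 1,000, the missing-completely-at-random masking, the choice to partition only the observed rows, and the name `partition_observed` are all assumptions made for illustration; only the four missing ratios and the two endpoint splits (80–20 and 20–80) come from the abstract.

```python
import numpy as np

# Missing ratios and endpoint splits named in the abstract.
MISSING_RATIOS = (0.10, 0.30, 0.50, 0.70)
TRAIN_FRACTIONS = (0.80, 0.20)  # intermediate splits are not listed in the abstract

N_ROWS = 1000  # assumed sample size, for illustration only


def partition_observed(n_rows, missing_ratio, train_fraction, rng):
    """Mask rows completely at random (an assumption), then split the
    remaining observed rows into training and testing index arrays."""
    n_missing = int(round(n_rows * missing_ratio))
    missing_idx = rng.choice(n_rows, size=n_missing, replace=False)

    observed_mask = np.ones(n_rows, dtype=bool)
    observed_mask[missing_idx] = False

    # Shuffle the observed rows before cutting them into train and test.
    observed_idx = rng.permutation(np.flatnonzero(observed_mask))
    n_train = int(round(observed_idx.size * train_fraction))
    return observed_idx[:n_train], observed_idx[n_train:], missing_idx


rng = np.random.default_rng(seed=2024)
for ratio in MISSING_RATIOS:
    for frac in TRAIN_FRACTIONS:
        train_idx, test_idx, missing_idx = partition_observed(N_ROWS, ratio, frac, rng)
        print(
            f"missing={ratio:.0%} train_fraction={frac:.0%} "
            f"n_train={train_idx.size} n_test={test_idx.size} "
            f"n_missing={missing_idx.size}"
        )
```

Under these assumptions, a 70% missing ratio combined with a 20–80 split leaves only 60 of the 1,000 rows for training, which illustrates why the combination of high missingness and small training partitions is the demanding case.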

Disclosure statement

No potential conflict of interest was reported by the author(s).

Notes

1 A covariate is a confounder when it relates to both Z and Y. All covariates in this article are confounders. For simplicity, we use the term "covariates" in the remainder of the article.

2 Different strategies have been suggested. The MIps technique, for instance, aggregates the PSs from all of the imputed datasets. A third multiple imputation strategy combines the PS parameters rather than the PSs themselves, and a missingness pattern approach employs a distinct PS model for each pattern of missingness. For more information on other approaches, see Leite et al. (2021).
