Missing Data Imputation with High-Dimensional Data: The American Statistician: Vol 78, No 2

Abstract

Imputation of missing data in high-dimensional datasets with more variables $P$ than samples $N$ ($P \gg N$) is hampered by the data dimensionality. For multivariate imputation, the covariance matrix is ill-conditioned and cannot be properly estimated. For fully conditional imputation, the regression models for imputation cannot include all the variables. Thus, the high dimension requires special imputation approaches. In this article, we provide an overview and realistic comparisons of imputation approaches for high-dimensional data when applied to a linear mixed modeling (LMM) framework. We examine approaches from three different classes using simulation studies: multiple imputation with penalized regression; multiple imputation with recursive partitioning and predictive mean matching; and multiple imputation with principal component analysis (PCA). We illustrate the methods on a real case study where a multivariate outcome (i.e., an extracted set of correlated biomarkers from human urine samples) was collected and monitored over time, and we discuss the proposed methods alongside more standard imputation techniques that could be applied by ignoring either the multivariate or the longitudinal dimension. Our simulations demonstrate the superiority of the recursive partitioning and predictive mean matching algorithm over the other methods in terms of bias, mean squared error, and coverage of the LMM parameter estimates when compared to those obtained from a data analysis without missingness, although it comes at the expense of high computational costs. It is worthwhile reconsidering much faster methodologies like the one relying on PCA.
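
The statement that the covariance matrix cannot be properly estimated when $P \gg N$ can be illustrated with a small numerical sketch. This is not the authors' code (their supplement uses R); it is a minimal Python example on simulated Gaussian data, and the function name `sample_covariance_rank` as well as the sizes $N = 20$, $P = 100$ are assumptions chosen only for illustration. It shows that the $P \times P$ sample covariance matrix has rank at most $N - 1$, so it is singular and cannot be inverted for multivariate imputation.

```python
import numpy as np


def sample_covariance_rank(n_samples, n_features, seed=0):
    """Return (rank, smallest eigenvalue) of the sample covariance matrix.

    Hypothetical helper for illustration: the data are simulated i.i.d.
    standard normal values, not the case-study data.
    """
    rng = np.random.default_rng(seed)
    # Rows are samples (N), columns are variables (P).
    X = rng.standard_normal((n_samples, n_features))
    # P x P sample covariance matrix; centering removes one degree of freedom.
    S = np.cov(X, rowvar=False)
    rank = int(np.linalg.matrix_rank(S))
    # eigvalsh is appropriate because S is symmetric.
    min_eig = float(np.linalg.eigvalsh(S).min())
    return rank, min_eig


# Assumed illustrative sizes: N = 20 samples, P = 100 variables.
rank, min_eig = sample_covariance_rank(n_samples=20, n_features=100)
print(f"rank = {rank} (P = 100); smallest eigenvalue = {min_eig:.2e}")
```

Because the rank cannot exceed $N - 1$, at least $P - N + 1$ eigenvalues are numerically zero, which is one way to see why the approaches examined in the article rely on penalization, recursive partitioning, or dimension reduction.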

Supplementary Materials

The supplement provides further details on the MI algorithms of Section 3, together with R code for their implementation. It also gives the simulation parameter settings and code, and shows additional plots for the simulation and case study analyses.

Acknowledgments

The authors would like to thank the two reviewers, the editor and associate editors for their insightful comments and suggestions for the improvement of the paper.

Disclosure Statement

No potential conflict of interest was reported by the author(s).

Data availability statement

Data for the case study are available upon request from Edwin R. van den Heuvel ([email protected]).
