Handling high-dimensional data with missing values by modern machine learning techniques
Pages 786-804 | Received 28 Sep 2020, Accepted 16 Apr 2022, Published online: 01 May 2022