Handling high-dimensional data with missing values by modern machine learning techniques
Pages 786-804 | Received 28 Sep 2020, Accepted 16 Apr 2022, Published online: 01 May 2022