Robust Estimation of Multivariate Location and Scatter in the Presence of Missing Data

Mike Danilov Quantitative Analyst at Google, 1600 Amphitheatre Parkway, Mountain View, CA, 94043, USA

Víctor J. Yohai Departamento de Matemáticas, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Ciudad Universitaria, Pabellón 1, 1428 Buenos Aires, Argentina

Ruben H. Zamar Department of Statistics, University of British Columbia, 333-6356 Agricultural Road, Vancouver, BC, V6T 1Z2, Canada

Abstract

Two main issues regarding data quality are data contamination (outliers) and data completion (missing data). These two problems have attracted much attention and research but surprisingly, they are seldom considered together. Popular robust methods such as S-estimators of multivariate location and scatter offer protection against outliers but cannot deal with missing data, except for the obviously inefficient approach of deleting all incomplete cases. We generalize the definition of S-estimators of multivariate location and scatter to simultaneously deal with missing data and outliers. We show that the proposed estimators are strongly consistent under elliptical models when data are missing completely at random. We derive an algorithm similar to the Expectation-Maximization algorithm for computing the proposed estimators. This algorithm is initialized by an extension for missing data of the minimum volume ellipsoid. We assess the performance of our proposal by Monte Carlo simulation and give some real data examples. This article has supplementary material online.
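To give a rough sense of why deleting incomplete cases is inefficient, the sketch below computes the fraction of fully observed cases one would expect to keep. It assumes that each cell is missing independently with the same probability, which is one simple form of missing completely at random and not necessarily the mechanism used in the article's simulations. The function names and the parameter values (dimension 10, missing rate 10%) are illustrative choices, not taken from the authors' code.

```python
import random


def expected_complete_fraction(p, miss_rate):
    """Probability that a p-dimensional case has no missing entry,
    assuming (hypothetically) independent cell-wise missingness."""
    return (1.0 - miss_rate) ** p


def simulated_complete_fraction(n, p, miss_rate, seed=0):
    """Monte Carlo check: draw n cases and count the fully observed ones."""
    rng = random.Random(seed)
    complete = 0
    for _ in range(n):
        # A case is complete when none of its p cells is flagged as missing.
        if all(rng.random() >= miss_rate for _ in range(p)):
            complete += 1
    return complete / n


if __name__ == "__main__":
    p, miss_rate = 10, 0.10
    print(expected_complete_fraction(p, miss_rate))       # about 0.35
    print(simulated_complete_fraction(100_000, p, miss_rate))
```

Under these assumptions only about 35% of the cases survive complete-case deletion, so a sample of 100 would shrink to roughly 35 usable observations before any robust estimator is applied.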

Acknowledgments

This research was partially supported by grants X-018 and X-447 from the University of Buenos Aires, PIP 5505 from CONICET, PICT 00899 from ANPCyT, and a Discovery grant from NSERC. We thank the Associate Editor and two referees for their comments and suggestions, which resulted in several important improvements on the first version of this article.

Notes

NOTES. Gaussian LRT efficiency (relative to EM) for some robust scatter estimates. We consider clean 10-dimensional samples of size 100 with 10% missing values.

NOTES. (*) Average obtained from 19 replicates because ERTBS crashed on one occasion. (**) Average obtained from seven replicates.
