Abstract
Two main issues regarding data quality are data contamination (outliers) and data completion (missing data). These two problems have attracted much attention and research but, surprisingly, they are seldom considered together. Popular robust methods such as S-estimators of multivariate location and scatter offer protection against outliers but cannot deal with missing data, except through the obviously inefficient approach of deleting all incomplete cases. We generalize the definition of S-estimators of multivariate location and scatter to deal with missing data and outliers simultaneously. We show that the proposed estimators are strongly consistent under elliptical models when data are missing completely at random. We derive an algorithm similar to the Expectation-Maximization algorithm for computing the proposed estimators. This algorithm is initialized by an extension of the minimum volume ellipsoid to missing data. We assess the performance of our proposal by Monte Carlo simulation and give some real data examples. This article has supplementary material online.
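As an illustration only, and not the authors' implementation, the following sketch shows one ingredient that an extension of S-estimators to incomplete data plausibly needs: a squared Mahalanobis distance computed on the observed coordinates of each case only. It assumes that missing entries are coded as NaN. The function names `solve_linear` and `partial_mahalanobis_sq` are hypothetical and do not come from the article.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular scatter submatrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def partial_mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance restricted to the observed entries of x.

    Assumption: missing entries of x are coded as float('nan').
    """
    obs = [j for j, v in enumerate(x) if not math.isnan(v)]
    if not obs:
        return 0.0  # a fully missing case carries no distance information
    diff = [x[j] - mean[j] for j in obs]
    sub = [[cov[i][j] for j in obs] for i in obs]
    z = solve_linear(sub, diff)  # z = sub^{-1} diff
    return sum(d * zi for d, zi in zip(diff, z))
```

For a complete case this reduces to the usual squared Mahalanobis distance, so complete and incomplete cases can be handled by the same function instead of deleting the incomplete ones.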
Acknowledgments
This research was partially supported by grants X-018 and X-447 from the University of Buenos Aires, PIP 5505 from CONICET, PICT 00899 from ANPCyT, and a Discovery grant from NSERC. We thank the Associate Editor and two referees for their comments and suggestions, which resulted in several important improvements on the first version of this article.
Notes
Gaussian LRT efficiency (relative to EM) for some robust scatter estimates. We consider clean 10-dimensional samples of size 100 with 10% of missing values.
(*) Average obtained from 19 replicates because ERTBS crashed on one occasion. (**) Average obtained from seven replicates.