Abstract
Ordinary least squares (OLS) estimation of a linear regression model is well known to be highly sensitive to outliers. It is common practice to (1) identify and remove outliers by looking at the data and (2) fit OLS and form confidence intervals and p-values on the remaining data as if this were the original data collected. This standard “detect-and-forget” approach has been shown to be problematic: in this article we highlight the fact that it can lead to invalid inference, and we show how recently developed tools in selective inference can be used to properly account for outlier detection and removal. Our inferential procedures apply to a general class of outlier removal procedures that includes several of the most commonly used approaches. We conduct simulations to corroborate the theoretical results, and we apply our method to three real datasets to illustrate how our inferential results can differ from those of the traditional detect-and-forget strategy. A companion R package, outference, implements these new procedures with an interface that matches the functions commonly used for inference with lm in R. Supplementary materials for this article are available online.
Supplementary Materials
Supplementary materials for this manuscript: Proofs of most theoretical results, additional simulation results, and implementation details, collected online for brevity. (.pdf file)
R package outference: R package containing code to perform the inferential methods described in this article; a brief usage sketch follows this list. (available at https://github.com/shuxiaoc/outference)
R scripts: R scripts to reproduce all figures and simulation results in this article. (.zip file)
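To illustrate the intended workflow, the sketch below shows how the package might be called. The outference() function name, its method and cutoff arguments, and the Cook's distance cutoff convention are assumptions based on the description above and the package's stated lm-like interface; consult the package documentation for the actual API.

    # Minimal usage sketch. The outference() call and its 'method' and
    # 'cutoff' arguments are assumptions based on the package description
    # above, not a confirmed API.
    # remotes::install_github("shuxiaoc/outference")
    library(outference)

    # Fit a regression, detecting outliers via Cook's distance (assumed
    # option); observations with distance exceeding cutoff/n are flagged
    # (assumed convention).
    fit <- outference(mpg ~ wt + hp, data = mtcars,
                      method = "cook", cutoff = 4)

    # Inference corrected for the detect-and-remove step, mirroring lm():
    summary(fit)   # p-values accounting for outlier removal
    confint(fit)   # selective confidence intervals

If the sketch is faithful, the key design choice is that downstream inference reuses the familiar summary() and confint() generics, so existing lm-based analysis scripts need only swap the fitting call.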
Acknowledgments
The authors thank an associate editor for pointing them to the green rating dataset.