Theory and Methods

Information-Based Optimal Subdata Selection for Big Data Linear Regression

Pages 393-405 | Received 01 Jun 2016, Published online: 28 Jun 2018
 

ABSTRACT

Extraordinary amounts of data are being produced in many branches of science. Proven statistical methods are no longer applicable with extraordinarily large datasets due to computational limitations. A critical step in big data analysis is data reduction. Existing investigations in the context of linear regression focus on subsampling-based methods. However, not only is this approach prone to sampling errors, it also leads to a covariance matrix of the estimators that is typically bounded from below by a term that is of the order of the inverse of the subdata size. We propose a novel approach, termed information-based optimal subdata selection (IBOSS). Compared to leading existing subdata methods, the IBOSS approach has the following advantages: (i) it is significantly faster; (ii) it is suitable for distributed parallel computing; (iii) the variances of the slope parameter estimators converge to 0 as the full data size increases even if the subdata size is fixed, that is, the convergence rate depends on the full data size; (iv) data analysis for IBOSS subdata is straightforward and the sampling distribution of an IBOSS estimator is easy to assess. Theoretical results and extensive simulations demonstrate that the IBOSS approach is superior to subsampling-based methods, sometimes by orders of magnitude. The advantages of the new approach are also illustrated through analysis of real data. Supplementary materials for this article are available online.
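
The abstract does not spell out the selection rule, so the following is only a minimal sketch of one way an information-based selection can work, assuming the D-optimality-motivated rule in which, for each covariate in turn, the rows with the smallest and largest values are kept. The function names `iboss_select` and `ols`, the per-tail count `r = k // (2 * p)`, and the use of a full sort instead of a faster partial selection are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def iboss_select(X, k):
    """Return indices of about k rows chosen by extreme covariate values.

    Hypothetical helper: for each of the p columns, take the r rows with the
    smallest and the r rows with the largest values among rows not yet taken,
    where r = k // (2 * p).
    """
    n, p = X.shape
    r = k // (2 * p)  # rows taken from each tail of each covariate
    if r < 1 or 2 * r * p > n:
        raise ValueError("need 1 <= k // (2 * p) and 2 * r * p <= n")
    available = np.ones(n, dtype=bool)
    chosen = []
    for j in range(p):
        idx = np.flatnonzero(available)          # rows still unselected
        order = idx[np.argsort(X[idx, j])]       # sorted by covariate j
        picked = np.concatenate([order[:r], order[-r:]])
        chosen.append(picked)
        available[picked] = False                # never pick a row twice
    return np.concatenate(chosen)


def ols(X, y):
    """Ordinary least squares with an intercept; returns (intercept, slopes)."""
    Z = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, p, k = 100_000, 3, 300
    X = rng.standard_t(df=3, size=(n, p))        # heavy-tailed covariates
    beta = np.array([1.0, 2.0, -1.0, 0.5])
    y = beta[0] + X @ beta[1:] + rng.normal(size=n)

    sel = iboss_select(X, k)
    uni = rng.choice(n, size=k, replace=False)   # uniform subsample baseline
    print("selected subdata :", ols(X[sel], y[sel]))
    print("uniform subsample:", ols(X[uni], y[uni]))
```

Because the selection depends only on the covariates and not on the responses, ordinary least squares can be applied to the selected rows without reweighting, which is consistent with advantage (iv) above.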

Supplementary Materials

The supplementary materials include an appendix with technical details and a document with additional numerical evidence for the performance of the IBOSS method.

Acknowledgments

The authors are grateful for the comments from three referees, an associate editor, and the editor, which helped to improve the article.

Additional information

Funding

Wang’s research was supported by a Microsoft Azure for Research Award and a Simons Foundation Collaboration Grant for Mathematicians (515599). Yang’s research was supported by NSF grant DMS-140751. Stufken’s research was supported by NSF grant DMS-1506125.
