Feature Screening for Massive Data Analysis by Subsampling

Xuening Zhu (a), School of Data Science, Fudan University, Shanghai, China

https://orcid.org/0000-0001-5824-5279

Rui Pan (b), School of Statistics and Mathematics, Central University of Finance and Economics, Beijing, China

https://orcid.org/0000-0002-6007-4285

Shuyuan Wu (c), Guanghua School of Management, Peking University, Beijing, China. Correspondence: [email protected]

Hansheng Wang (c), Guanghua School of Management, Peking University, Beijing, China

https://orcid.org/0000-0003-2386-0209

Abstract

Modern statistical analysis often encounters massive datasets with ultrahigh-dimensional features. In this work, we develop a subsampling approach for feature screening with massive datasets. The approach is implemented by repeatedly subsampling the massive data and is suited to analysis tasks with memory constraints. To conduct the procedure, we first calculate an R-squared screening measure (and related sample moments) on each subsample. Second, we consider three methods to combine these local statistics. In addition to the simple average method, we design a jackknife debiased screening measure and an aggregated moment screening measure. Both of these reduce the bias of the subsampling screening measure and therefore increase the accuracy of the feature screening. Third, we consider a novel sequential sampling method that is more computationally efficient than the traditional random sampling method. The theoretical properties of the three screening measures under both sampling schemes are rigorously discussed. Finally, we illustrate the usefulness of the proposed method with an airline dataset containing 32.7 million records.
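The three ways of combining subsample statistics can be illustrated with a minimal Python sketch. It is not the authors' implementation (that is in Code.zip), and it rests on several assumptions: the R-squared screening measure is taken to be the squared marginal correlation between each feature and the response, the jackknife step is the textbook leave-one-out bias correction applied within each subsample, and the aggregated moment measure averages the five sample moments across subsamples before forming the R-squared. The function names `screen`, `_terms`, and `_r2` are hypothetical, and only plain random subsampling is shown, not the sequential scheme.

```python
import numpy as np

def _terms(x, y):
    # Per-observation terms whose sample means are the moments
    # E[x], E[x^2], E[xy], E[y], E[y^2] for every feature.
    yy = np.broadcast_to(y[:, None], x.shape)
    return np.stack([x, x**2, x * yy, yy, yy**2])  # shape (5, n, p)

def _r2(m):
    # Marginal R-squared (squared correlation) computed from moments.
    mx, mxx, mxy, my, myy = m
    return (mxy - mx * my) ** 2 / ((mxx - mx**2) * (myy - my**2))

def screen(X, y, n_sub, k, rng):
    """Return (simple average, jackknife debiased, aggregated moment) measures."""
    N, p = X.shape
    avg, jack, mom = np.zeros(p), np.zeros(p), np.zeros((5, p))
    for _ in range(k):
        idx = rng.choice(N, size=n_sub, replace=False)  # one random subsample
        t = _terms(X[idx], y[idx])
        m = t.mean(axis=1)                               # subsample moments, (5, p)
        r2 = _r2(m)
        # Leave-one-out moments for every observation, then leave-one-out R-squared.
        loo = (n_sub * m[:, None, :] - t) / (n_sub - 1)  # (5, n, p)
        r2_loo = _r2(loo)                                # (n, p)
        avg += r2 / k
        jack += (n_sub * r2 - (n_sub - 1) * r2_loo.mean(axis=0)) / k
        mom += m / k
    return avg, jack, _r2(mom)

# Synthetic illustration (assumed setup): only the first three features are active.
rng = np.random.default_rng(0)
N, p = 20_000, 20
X = rng.standard_normal((N, p))
y = X[:, :3].sum(axis=1) + rng.standard_normal(N)
avg, jack, agg = screen(X, y, n_sub=200, k=50, rng=rng)
print("top features by simple average:", np.argsort(avg)[::-1][:3])
```

For an inactive feature, the squared sample correlation on a subsample of size n is positive with expectation of order 1/n, so the simple average inherits that bias no matter how many subsamples are drawn. In this sketch, the other two measures should sit closer to zero on the inactive features, which is the kind of bias reduction the abstract describes.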

Supplementary Materials

Supplementary_Material.pdf: This document provides extensions of the proposed method, the proofs of the theoretical results in the main text, and some additional simulation results. Appendix A reports some extensions and discussions of the proposed method. Appendix B contains the detailed proofs of the theoretical results for the AVS measure. Appendix C contains the detailed proofs of the main theorems and lemmas developed in Sections 3.1-3.3 of the main text. In particular, it contains the proofs of Theorems 1, 2, 3, 4, and 5 and Lemmas 1 and 2 of the main text. Appendix D contains the detailed proofs of the screening consistency results developed in Section 3.4 of the main text. In particular, it contains the proofs of Theorems 5 and 6 and Lemma 3 of the main text. Appendix E provides technical lemmas that are useful for proving the results in Section 3 of the main text. Finally, Appendix F contains some additional numerical results.

Code.zip: This file contains the Python code for the proposed method. See the “README.md” in the archive for instructions on using the code.

Additional information

Funding

Xuening Zhu is supported by the National Natural Science Foundation of China (nos. 11901105, 71991472, U1811461) and the Shanghai Sailing Program for Youth Science and Technology Excellence (19YF1402700). The research of Rui Pan is supported by the National Natural Science Foundation of China (NSFC, 11601539, 11631003) and the Emerging Interdisciplinary Project of Central University of Finance and Economics. Hansheng Wang’s research is partially supported by the National Natural Science Foundation of China (No. 11831008).
