Dimension Reduction and Sparse Modeling

Big Data Model Building Using Dimension Reduction and Sample Selection

Pages 435-447 | Received 24 Mar 2022, Accepted 08 Sep 2023, Published online: 15 Nov 2023
 

Abstract

The extraordinary data volumes generated in many fields are difficult to handle with current computational resources and techniques, which makes applying conventional statistical methods to big data especially challenging. A common approach is to partition the full data into smaller subdata for purposes such as training, testing, and validation. The primary purpose of the training data is to represent the full data, so the selection of the training subdata is pivotal for retaining the essential characteristics of the full data. Recently, several procedures have been proposed to select "optimal design points" as training subdata under pre-specified models, such as linear regression and logistic regression. However, such subdata are no longer "optimal" if the assumed model is inappropriate, nor are they useful for building alternative models, because they are not a representative sample of the full data. In this article, we propose a novel algorithm for better model building and prediction via the selection of a "good" training sample. The proposed subdata retain most characteristics of the original big data and are more robust, in that one can fit various response models and select the optimal one. Supplementary materials for this article are available online.
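To make the contrast in the abstract concrete, the following is a minimal sketch of the kind of model-dependent "optimal design point" selection the authors critique: an information-based subsampling scheme for linear regression that keeps the rows with extreme values in each covariate. This is an illustrative toy (function name `extreme_point_subdata` and all parameters are our own, not the authors' proposed algorithm), showing why such subdata are tied to the assumed model rather than being a representative sample of the full data.

```python
import numpy as np

def extreme_point_subdata(X, k):
    """Select rows with extreme values in each covariate.

    For each column of X, keep the k smallest and k largest rows not
    already chosen. Under a linear model, extreme design points carry
    the most information about the slope; but the result is clearly
    not a representative sample of the full data, so it is of little
    use for fitting alternative models.
    """
    n, p = X.shape
    chosen = set()
    for j in range(p):
        order = np.argsort(X[:, j])
        # take k smallest values in column j, skipping duplicates
        taken = 0
        for i in order:
            if i not in chosen:
                chosen.add(i)
                taken += 1
                if taken == k:
                    break
        # take k largest values in column j, skipping duplicates
        taken = 0
        for i in order[::-1]:
            if i not in chosen:
                chosen.add(i)
                taken += 1
                if taken == k:
                    break
    return np.array(sorted(chosen))

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 3))
idx = extreme_point_subdata(X, 50)
print(len(idx))  # at most 2 * 50 * 3 = 300 rows out of 10,000
```

Because every selected row sits in the tails of some covariate, the subdata's marginal distributions differ sharply from the full data's, which is precisely the robustness problem that motivates selecting a "good" representative training sample instead.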

Supplementary Materials

R Markdown for Empirical Evaluation and Simulation: The R Markdown file "Eva_Sim.Rmd" contains R code to perform the empirical evaluations and simulations in Section 3. It also contains code to load the datasets used as examples in the article.

Disclosure Statement

The authors declare that they have no known competing interests or personal relationships that could have appeared to influence the work reported in this article.

Additional information

Funding

This research was sponsored by the National Science Foundation under the award The Learner Data Institute (award #1934745). The opinions, findings, and results are solely the authors' and do not reflect those of the funding agencies. The research of Dennis K. J. Lin was partially supported by the National Science Foundation via Grant DMS-18102925. The research of HHS Lu was partially supported by the National Science and Technology Council, Taiwan, under grant numbers 110-2118-M-A49-002-MY3 and 111-2634-F-A49-014-.

