Abstract
An explosion in the availability of rich data from technological advances is hindering efforts at statistical analysis due to constraints on time and memory storage, regardless of whether researchers employ simple methods (e.g., linear regression) or complex models (e.g., Gaussian processes). Recent approaches to overcoming these limits include information-based optimal subdata selection and Latin hypercube subagging. In the current study, we develop a novel subdata selection method for large-scale computer models based on expected improvement optimization. During the optimization procedure, the proposed scheme employs the geometry of the input feature region as well as information related to output values. The data points associated with the largest improvement in prediction accuracy are combined to construct a subdataset that can be used to formulate predictions within affordable computing time. Numerical and empirical analyses using real-world data are used to assess how well the selected subdata support accurate predictions. Supplementary materials for this article, including proofs of theorems and additional numerical results, are available online.
Supplementary Materials
The supplementary materials include proofs of theorems and additional numerical results.
Appendix: Section S1: proofs of (3) and of Theorems 1 and 2. Section S2: numerical results for the Piston function (d = 7) and the Wing Weight function (d = 10). Sections S3 and S4: numerical studies for the Bias-Prediction and Larger-d investigations.
R code: R programs that can be used to replicate the numerical results in this article.
Acknowledgments
We thank the editor, associate editor, and two anonymous referees for their constructive comments and suggestions, which have helped us to improve the article.