
A Subsampling Method for Regression Problems Based on Minimum Energy Criterion

Pages 192-205 | Received 31 May 2021, Accepted 10 Sep 2022, Published online: 31 Oct 2022

References

  • Alaoui, A. E., and Mahoney, M. W. (2015), “Fast Randomized Kernel Ridge Regression with Statistical Guarantees,” in Proceedings of the 28th International Conference on Neural Information Processing Systems (Vol. 1), pp. 775–783, MIT Press.
  • Ba, S., and Joseph, V. R. (2012), “Composite Gaussian Process Models for Emulating Expensive Functions,” The Annals of Applied Statistics, 6, 1838–1860. DOI: 10.1214/12-AOAS570.
  • Bachem, O., Lucic, M., and Krause, A. (2017), “Practical Coreset Constructions for Machine Learning,” arXiv e-prints, arXiv:1703.06476.
  • Banerjee, S., Gelfand, A. E., Finley, A. O., and Sang, H. (2008), “Gaussian Predictive Process Models for Large Spatial Data Sets,” Journal of the Royal Statistical Society, Series B, 70, 825–848. DOI: 10.1111/j.1467-9868.2008.00663.x.
  • Barbian, M. H., and Assunção, R. M. (2017), “Spatial Subsemble Estimator for Large Geostatistical Data,” Spatial Statistics, 22, 68–88. DOI: 10.1016/j.spasta.2017.08.004.
  • Campbell, T., and Broderick, T. (2018), “Bayesian Coreset Construction via Greedy Iterative Geodesic Ascent,” in Proceedings of the 35th International Conference on Machine Learning (Vol. 80), pp. 698–706, PMLR.
  • Cressie, N., and Johannesson, G. (2008), “Fixed Rank Kriging for Very Large Spatial Data Sets,” Journal of the Royal Statistical Society, Series B, 70, 209–226. DOI: 10.1111/j.1467-9868.2007.00633.x.
  • Dette, H., and Pepelyshev, A. (2010), “Generalized Latin Hypercube Design for Computer Experiments,” Technometrics, 52, 421–429. DOI: 10.1198/TECH.2010.09157.
  • Donoho, D. L., and Johnstone, I. M. (1994), “Ideal Spatial Adaptation by Wavelet Shrinkage,” Biometrika, 81, 425–455. DOI: 10.1093/biomet/81.3.425.
  • Drineas, P., Mahoney, M. W., Muthukrishnan, S., and Sarlós, T. (2011), “Faster Least Squares Approximation,” Numerische Mathematik, 117, 219–249. DOI: 10.1007/s00211-010-0331-6.
  • Fithian, W., and Hastie, T. (2014), “Local Case-Control Sampling: Efficient Subsampling in Imbalanced Data Sets,” The Annals of Statistics, 42, 1693–1724. DOI: 10.1214/14-AOS1220.
  • Furrer, R., Genton, M. G., and Nychka, D. (2006), “Covariance Tapering for Interpolation of Large Spatial Datasets,” Journal of Computational and Graphical Statistics, 15, 502–523. DOI: 10.1198/106186006X132178.
  • Gramacy, R. B., and Apley, D. W. (2015), “Local Gaussian Process Approximation for Large Computer Experiments,” Journal of Computational and Graphical Statistics, 24, 561–578. DOI: 10.1080/10618600.2014.914442.
  • Griffiths, W. E., Skeels, C., and Chotikapanich, D. (2002), Handbook of Applied Econometrics and Statistical Inference, pp. 575–590, New York: Marcel Dekker.
  • Gu, C. (2013), Smoothing Spline ANOVA Models (Vol. 297), New York: Springer.
  • Hajian-Tilaki, K. (2014), “Sample Size Estimation in Diagnostic Test Studies of Biomedical Informatics,” Journal of Biomedical Informatics, 48, 193–204. DOI: 10.1016/j.jbi.2014.02.013.
  • Hamid, R., Xiao, Y., Gittens, A., and Decoste, D. (2014), “Compact Random Feature Maps,” in Proceedings of the 31st International Conference on Machine Learning (Vol. 32), pp. 19–27, PMLR.
  • Han, L., Yang, T., and Zhang, T. (2020), “Local Uncertainty Sampling for Large-Scale Multi-Class Logistic Regression,” The Annals of Statistics, 48, 1770–1788.
  • Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, New York: Springer.
  • He, L., and Hung, Y. (2020), “Gaussian Process Prediction using Design-based Subsampling,” Statistica Sinica, 32, 1165–1186.
  • Houser, J. (2007), “How Many are Enough? Statistical Power Analysis and Sample Size Estimation in Clinical Research,” Journal of Clinical Research Best Practices, 3, 1–5.
  • Huang, C., and Joseph, V. R. (2021), Supercompress: Supervised Compression of Big Data, R package version 1.0.
  • Huang, H., and Sun, Y. (2018), “Hierarchical Low Rank Approximation of Likelihoods for Large Spatial Datasets,” Journal of Computational and Graphical Statistics, 27, 110–118. DOI: 10.1080/10618600.2017.1356324.
  • Johnson, M. E., Moore, L. M., and Ylvisaker, D. (1990), “Minimax and Maximin Distance Designs,” Journal of Statistical Planning and Inference, 26, 131–148. DOI: 10.1016/0378-3758(90)90122-B.
  • Joseph, V. R., Dasgupta, T., Tuo, R., and Wu, C. J. (2015), “Sequential Exploration of Complex Surfaces using Minimum Energy Designs,” Technometrics, 57, 64–74. DOI: 10.1080/00401706.2014.881749.
  • Joseph, V. R., and Hung, Y. (2008), “Orthogonal-Maximin Latin Hypercube Designs,” Statistica Sinica, 18, 171–186.
  • Joseph, V. R., and Mak, S. (2021), “Supervised Compression of Big Data,” Statistical Analysis and Data Mining: The ASA Data Science Journal, 14, 217–229. DOI: 10.1002/sam.11508.
  • Joseph, V. R., Wang, D., Gu, L., Lyu, S., and Tuo, R. (2019), “Deterministic Sampling of Expensive Posteriors using Minimum Energy Designs,” Technometrics, 61, 297–308. DOI: 10.1080/00401706.2018.1552203.
  • Katzfuss, M., and Hammerling, D. (2017), “Parallel Inference for Massive Distributed Spatial Data using Low-Rank Models,” Statistics and Computing, 27, 363–375. DOI: 10.1007/s11222-016-9627-4.
  • Kaufman, C. G., Schervish, M. J., and Nychka, D. W. (2008), “Covariance Tapering for Likelihood-based Estimation in Large Spatial Data Sets,” Journal of the American Statistical Association, 103, 1545–1555. DOI: 10.1198/016214508000000959.
  • Kaya, H., Tüfekci, P., and Uzun, E. (2019), “Predicting CO and NOx Emissions from Gas Turbines: Novel Data and a Benchmark PEMS,” Turkish Journal of Electrical Engineering and Computer Science, 27, 4783–4796.
  • Kleiner, A., Talwalkar, A., Sarkar, P., and Jordan, M. I. (2014), “A Scalable Bootstrap for Massive Data,” Journal of the Royal Statistical Society, Series B, 76, 795–816. DOI: 10.1111/rssb.12050.
  • Li, M., Kwok, J. T., and Lu, B.-L. (2010), “Making Large-Scale Nyström Approximation Possible,” in Proceedings of the 27th International Conference on International Conference on Machine Learning, Omnipress, ICML’10, pp. 631–638.
  • Liang, F., Cheng, Y., Song, Q., Park, J., and Yang, P. (2013), “A Resampling-based Stochastic Approximation Method for Analysis of Large Geostatistical Data,” Journal of the American Statistical Association, 108, 325–339. DOI: 10.1080/01621459.2012.746061.
  • Lin, N., and Xi, R. (2011), “Aggregated Estimating Equation Estimation,” Statistics and Its Interface, 4, 73–83. DOI: 10.4310/SII.2011.v4.n1.a8.
  • Liu, H., Lafferty, J., and Wasserman, L. (2009), “The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs,” Journal of Machine Learning Research, 10, 2295–2328.
  • Luo, Z., and Wahba, G. (1997), “Hybrid Adaptive Splines,” Journal of the American Statistical Association, 92, 107–116. DOI: 10.1080/01621459.1997.10473607.
  • Ma, P., Huang, J. Z., and Zhang, N. (2015), “Efficient Computation of Smoothing Splines via Adaptive Basis Sampling,” Biometrika, 102, 631–645. DOI: 10.1093/biomet/asv009.
  • Ma, P., Mahoney, M. W., and Yu, B. (2015), “A Statistical Perspective on Algorithmic Leveraging,” Journal of Machine Learning Research, 16, 861–911.
  • Ma, P., and Sun, X. (2015), “Leveraging for Big Data Regression,” Wiley Interdisciplinary Reviews Computational Statistics, 7, 70–76. DOI: 10.1002/wics.1324.
  • Mak, S. (2021), support: Support Points, R package version 0.1.5.
  • Mak, S., and Joseph, V. R. (2017), “Projected Support Points: A New Method for High-Dimensional Data Reduction,” arXiv e-prints.
  • Mak, S., and Joseph, V. R. (2018), “Support Points,” The Annals of Statistics, 46, 2562–2592.
  • Meng, C., Zhang, X., Zhang, J., Zhong, W., and Ma, P. (2020), “More Efficient Approximation of Smoothing Splines via Space-Filling Basis Selection,” Biometrika, 107, 723–735. DOI: 10.1093/biomet/asaa019.
  • Müller, H.-G. (1984), “Optimal Designs for Nonparametric Kernel Regression,” Statistics & Probability Letters, 2, 285–290.
  • Park, J., and Liang, F. (2012), “Bayesian Analysis of Geostatistical Models with an Auxiliary Lattice,” Journal of Computational and Graphical Statistics, 21, 453–475. DOI: 10.1080/10618600.2012.679228.
  • Rudi, A., Camoriano, R., and Rosasco, L. (2015), “Less is More: Nyström Computational Regularization,” in Proceedings of the 28th International Conference on Neural Information Processing Systems (Vol. 1), pp. 1657–1665.
  • Rue, H., and Held, L. (2005), Gaussian Markov Random Fields: Theory and Applications, London: Taylor and Francis.
  • Rue, H., and Tjelmeland, H. (2002), “Fitting Gaussian Markov Random Fields to Gaussian Fields,” Scandinavian Journal of Statistics, 29, 31–49. DOI: 10.1111/1467-9469.00058.
  • Sang, H., and Huang, J. Z. (2012), “A Full Scale Approximation of Covariance Functions for Large Spatial Data Sets,” Journal of the Royal Statistical Society, Series B, 74, 111–132. DOI: 10.1111/j.1467-9868.2011.01007.x.
  • Santner, T. J., Williams, B. J., and Notz, W. I. (2003), The Design and Analysis of Computer Experiments, New York: Springer.
  • Ting, D., and Brochu, E. (2018), “Optimal Subsampling with Influence Functions,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3654–3663.
  • Tzeng, S., and Huang, H.-C. (2018), “Resolution Adaptive Fixed Rank Kriging,” Technometrics, 60, 198–208. DOI: 10.1080/00401706.2017.1345701.
  • Vakayil, A., Joseph, V. R., and Mak, S. (2021), SPlit: Split a Dataset for Training and Testing, R package version 1.0.
  • Varin, C., Reid, N., and Firth, D. (2011), “An Overview of Composite Likelihood Methods,” Statistica Sinica, 21, 5–42.
  • Vecchia, A. V. (1988), “Estimation and Model Identification for Continuous Spatial Processes,” Journal of the Royal Statistical Society, Series B, 50, 297–312. DOI: 10.1111/j.2517-6161.1988.tb01729.x.
  • Wang, D., and Joseph, V. R. (2021), mined: Minimum Energy Designs, R package version 1.0-3.
  • Wang, H. (2019a), “Divide-and-Conquer Information-based Optimal Subdata Selection Algorithm,” Journal of Statistical Theory and Practice, 13, 46. DOI: 10.1007/s42519-019-0048-5.
  • Wang, H. (2019b), “More Efficient Estimation for Logistic Regression with Optimal Subsamples,” Journal of Machine Learning Research, 20, 1–59.
  • Wang, H., and Ma, Y. (2020), “Optimal Subsampling for Quantile Regression in Big Data,” Biometrika, 108, 99–112. DOI: 10.1093/biomet/asaa043.
  • Wang, H., Yang, M., and Stufken, J. (2019), “Information-based Optimal Subdata Selection for Big Data Linear Regression,” Journal of the American Statistical Association, 114, 393–405. DOI: 10.1080/01621459.2017.1408468.
  • Wang, H., Zhu, R., and Ma, P. (2018), “Optimal Subsampling for Large Sample Logistic Regression,” Journal of the American Statistical Association, 113, 829–844. DOI: 10.1080/01621459.2017.1292914.
  • Wang, L., Elmstedt, J., Wong, W. K., and Xu, H. (2021), “Orthogonal Subsampling for Big Data Linear Regression,” The Annals of Applied Statistics, 15, 1273–1290. DOI: 10.1214/21-AOAS1462.
  • Wang, Z., Zhu, H., Dong, Z., He, X., and Huang, S.-L. (2020), “Less is Better: Unweighted Data Subsampling via Influence Function,” in The Thirty-Fourth AAAI Conference on Artificial Intelligence (Vol. 34), pp. 6340–6347. DOI: 10.1609/aaai.v34i04.6103.
  • Williams, C. K. I., and Seeger, M. (2000), “Using the Nyström Method to Speed Up Kernel Machines,” in Proceedings of the 13th International Conference on Neural Information Processing Systems, pp. 661–667, MIT Press.
  • Xu, D., and Wang, Y. (2018), “Divide and Recombine Approaches for Fitting Smoothing Spline Models with Large Datasets,” Journal of Computational and Graphical Statistics, 27, 677–683. DOI: 10.1080/10618600.2017.1402775.
  • Yang, J., Sindhwani, V., Avron, H., and Mahoney, M. (2014), “Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels,” in Proceedings of the 31st International Conference on Machine Learning, (Vol. 32), pp. 485–493, PMLR.
  • Yang, Y., Pilanci, M., and Wainwright, M. (2017), “Randomized Sketches for Kernels: Fast and Optimal Non-parametric Regression,” The Annals of Statistics, 45, 991–1023. DOI: 10.1214/16-AOS1472.
  • Zhang, B., Sang, H., and Huang, J. (2019), “Smoothed Full-Scale Approximation of Gaussian Process Models for Computation of Large Spatial Datasets,” Statistica Sinica, 29, 1711–1737. DOI: 10.5705/ss.202017.0008.
  • Zhao, Y., Amemiya, Y., and Hung, Y. (2018), “Efficient Gaussian Process Modeling using Experimental Design-based Subagging,” Statistica Sinica, 28, 1459–1479.
  • Zhu, Z., and Stein, M. L. (2005), “Spatial Sampling Design for Parameter Estimation of the Covariance Function,” Journal of Statistical Planning and Inference, 134, 583–603. DOI: 10.1016/j.jspi.2004.04.017.
  • Zhu, Z., and Stein, M. L. (2006), “Spatial Sampling Design for Prediction with Estimated Parameters,” Journal of Agricultural, Biological, and Environmental Statistics, 11, 24–44.
  • Zimmerman, D. L. (2006), “Optimal Network Design for Spatial Prediction, Covariance Parameter Estimation, and Empirical Prediction,” Environmetrics, 17, 635–652. DOI: 10.1002/env.769.
