Research Article

Active sampling: A machine-learning-assisted framework for finite population inference with optimal subsamples

Received 20 Dec 2022, Accepted 20 Jun 2024, Accepted author version posted online: 02 Jul 2024

References

  • Ai, M., Wang, F., Yu, J., and Zhang, H. (2021a). Optimal subsampling for large-scale quantile regression. Journal of Complexity, 62:101512.
  • Ai, M., Yu, J., Zhang, H., and Wang, H. (2021b). Optimal subsampling algorithms for big data regressions. Statistica Sinica, 31:749–772.
  • Bach, F. R. (2007). Active learning for misspecified generalized linear models. In Advances in Neural Information Processing Systems 19.
  • Bärgman, J., Lisovskaja, V., Victor, T., Flannagan, C., and Dozza, M. (2015). How does glance behavior influence crash and injury risk? A ‘what-if’ counterfactual simulation using crashes and near-crashes from SHRP2. Transportation Research Part F: Traffic Psychology and Behaviour, 35:152–169.
  • Batsch, F., Daneshkhah, A., Palade, V., and Cheah, M. (2021). Scenario optimisation and sensitivity analysis for safe automated driving using Gaussian processes. Applied Sciences, 11(2):775.
  • Beaumont, J.-F. and Haziza, D. (2022). Statistical inference from finite population samples: A critical review of frequentist and Bayesian approaches. Canadian Journal of Statistics, 50(4):1186–1212.
  • Beygelzimer, A., Dasgupta, S., and Langford, J. (2009). Importance weighted active learning. In Proceedings of the 26th International Conference on Machine Learning.
  • Breidt, F. J. and Opsomer, J. D. (2017). Model-assisted survey estimation with modern prediction techniques. Statistical Science, 32(2):190–205.
  • Breiman, L. (2001). Random forests. Machine Learning, 45:5–32.
  • Brown, B. M. (1971). Martingale central limit theorems. The Annals of Mathematical Statistics, 42(1):59–66.
  • Bucher, C. G. (1988). Adaptive sampling — an iterative fast Monte Carlo procedure. Structural Safety, 5(2):119–126.
  • Cassel, C. M., Särndal, C.-E., and Wretman, J. H. (1976). Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika, 63(3):615–620.
  • Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  • Cioppa, T. M. and Lucas, T. W. (2007). Efficient nearly orthogonal and space-filling latin hypercubes. Technometrics, 49(1):45–55.
  • Cohn, D. A. (1996). Neural network exploration using optimal experiment design. Neural Networks, 9(6):1071–1083.
  • Dai, W., Song, Y., and Wang, D. (2022). A subsampling method for regression problems based on minimum energy criterion. Technometrics, 65(2):192–205.
  • Davison, A. C. and Hinkley, D. V. (1997). Bootstrap Methods and Their Applications. Cambridge University Press, Cambridge, UK.
  • Deville, J.-C. and Särndal, C.-E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association, 87(418):376–382.
  • Duoba, M. and Baby, T. V. (2023). Tesla Model 3 autopilot on-road data. Technical report, Livewire Data Platform; National Renewable Energy Laboratory; Pacific Northwest National Laboratory, Richland, WA.
  • Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1):1–26.
  • Ettinger, S., Cheng, S., Caine, B., Liu, C., Zhao, H., Pradhan, S., Chai, Y., Sapp, B., Qi, C. R., Zhou, Y., Yang, Z., Chouard, A., Sun, P., Ngiam, J., Vasudevan, V., McCauley, A., Shlens, J., and Anguelov, D. (2021). Large scale interactive motion forecasting for autonomous driving: The Waymo open motion dataset. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
  • Feng, S., Feng, Y., Yu, C., Zhang, Y., and Liu, H. X. (2020). Testing scenario library generation for connected and automated vehicles, part I: Methodology. IEEE Transactions on Intelligent Transportation Systems, 22(3):1573–1582.
  • Feng, S., Yan, X., Sun, H., Feng, Y., and Liu, H. X. (2021). Intelligent driving intelligence test for autonomous vehicles with naturalistic and adversarial environment. Nature Communications, 12(748):1–14.
  • Fishman, G. S. (1996). Monte Carlo. Springer, New York, NY.
  • Fuller, W. A. (2009). Sampling Statistics. Wiley, Hoboken, NJ.
  • Gramacy, R. B. and Apley, D. W. (2015). Local Gaussian process approximation for large computer experiments. Journal of Computational and Graphical Statistics, 24(2):561–578.
  • Hansen, M. H. and Hurwitz, W. N. (1943). On the theory of sampling from finite populations. The Annals of Mathematical Statistics, 14(4):333–362.
  • Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685.
  • Imberg, H., Jonasson, J., and Axelson-Fisk, M. (2020). Optimal sampling in unbiased active learning. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics.
  • Imberg, H., Lisovskaja, V., Selpi, and Nerman, O. (2022). Optimization of two-phase sampling designs with application to naturalistic driving studies. IEEE Transactions on Intelligent Transportation Systems, 23(4):3575–3588.
  • Isaksson-Hellman, I. and Norin, H. (2005). How thirty years of focused safety development has influenced injury outcome in Volvo cars. Annual Proceedings. Association for the Advancement of Automotive Medicine, 49:63–77.
  • Kern, C., Klausch, T., and Kreuter, F. (2019). Tree-based machine learning methods for survey research. Survey Research Methods, 13(1):73–93.
  • Kott, P. S. (2016). Calibration weighting in survey sampling. WIREs Computational Statistics, 8(1):39–53.
  • Lee, J. Y., Lee, J. D., Bärgman, J., Lee, J., and Reimer, B. (2018). How safe is tuning a radio? Using the radio tuning task as a benchmark for distracted driving. Accident Analysis & Prevention, 110:29–37.
  • Lei, B., Kirk, T. Q., Bhattacharya, A., Pati, D., Qian, X., Arroyave, R., and Mallick, B. K. (2021). Bayesian optimization with adaptive surrogate models for automated experimental design. npj Computational Materials, 7(194):1–12.
  • Leledakis, A., Lindman, M., Östh, J., Wågström, L., Davidsson, J., and Jakobsson, L. (2021). A method for predicting crash configurations using counterfactual simulations and real-world data. Accident Analysis & Prevention, 150:105932.
  • Lim, Y.-F., Ng, C. K., Vaitesswar, U., and Hippalgaonkar, K. (2021). Extrapolative Bayesian optimization with Gaussian process and neural network ensemble surrogate models. Advanced Intelligent Systems, 3(11):2100101.
  • Liu, K., Mei, Y., and Shi, J. (2015). An adaptive sampling strategy for online high-dimensional process monitoring. Technometrics, 57(3):305–319.
  • Lookman, T., Balachandran, P. V., Xue, D., and Yuan, R. (2019). Active learning in materials science with emphasis on adaptive sampling using uncertainties for targeted design. npj Computational Materials, 5(21):1–17.
  • Ma, P., Chen, Y., Zhang, X., Xing, X., Ma, J., and Mahoney, M. W. (2022). Asymptotic analysis of sampling estimators for randomized numerical linear algebra algorithms. Journal of Machine Learning Research, 23(177):1–45.
  • Ma, P., Mahoney, M. W., and Yu, B. (2015). A statistical perspective on algorithmic leveraging. Journal of Machine Learning Research, 16:861–911.
  • MacKay, D. J. C. (1992). Information-based objective functions for active data selection. Neural Computation, 4(4):590–604.
  • McConville, K. S. and Toth, D. (2019). Automated selection of post-strata using a model-assisted regression tree estimator. Scandinavian Journal of Statistics, 46(2):389–413.
  • Meng, C., Xie, R., Mandal, A., Zhang, X., Zhong, W., and Ma, P. (2021). LowCon: A design-based subsampling approach in a misspecified linear model. Journal of Computational and Graphical Statistics, 30(3):694–708.
  • Oh, M.-S. and Berger, J. O. (1992). Adaptive importance sampling in Monte Carlo integration. Journal of Statistical Computation and Simulation, 41(3–4):143–168.
  • Pan, Q., Byon, E., Ko, Y. M., and Lam, H. (2020). Adaptive importance sampling for extreme quantile estimation with stochastic black box computer models. Naval Research Logistics, 67(7):524–547.
  • R Core Team (2023). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
  • Ren, P., Xiao, Y., Chang, X., Huang, P.-Y., Li, Z., Gupta, B. B., Chen, X., and Wang, X. (2021). A survey of deep active learning. ACM Computing Surveys, 54(9):1–40.
  • Sande, L. and Zhang, L. (2021). Design-unbiased statistical learning in survey sampling. Sankhya A, 83:714–744.
  • Sauer, A., Gramacy, R. B., and Higdon, D. (2023). Active learning for deep Gaussian process surrogates. Technometrics, 65(1):4–18.
  • Sen, A. (1953). On the estimate of the variance in sampling with varying probabilities. Journal of the Indian Society of Agricultural Statistics, 5:119–127.
  • Sen, P. and Singer, J. (1993). Large Sample Methods in Statistics: An Introduction with Applications. CRC Press, Boca Raton, FL.
  • Settles, B. (2012). Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1–114.
  • Seyedi, M., Koloushani, M., Jung, S., and Vanli, A. (2021). Safety assessment and a parametric study of forward collision-avoidance assist based on real-world crash simulations. Journal of Advanced Transportation, 2021:1–24. Article ID 4430730.
  • Sun, F., Gramacy, R. B., Haaland, B., Lawrence, E. C., and Walker, A. C. (2017). Emulating satellite drag from large simulation experiments. SIAM/ASA Journal on Uncertainty Quantification, 7:720–759.
  • Sun, J., Zhou, H., Xi, H., Zhang, H., and Tian, Y. (2021). Adaptive design of experiments for safety evaluation of automated vehicles. IEEE Transactions on Intelligent Transportation Systems, 23(9):14497–14508.
  • Särndal, C.-E., Swensson, B., and Wretman, J. (2003). Model Assisted Survey Sampling. Springer, New York, NY.
  • Ta, T., Shao, J., Li, Q., and Wang, L. (2020). Generalized regression estimators with high-dimensional covariates. Statistica Sinica, 30(3):1135–1154.
  • Tillé, Y. (2006). Sampling Algorithms. Springer, New York, NY.
  • Wang, H., Zhu, R., and Ma, P. (2018). Optimal subsampling for large sample logistic regression. Journal of the American Statistical Association, 113(522):829–844.
  • Wang, X., Peng, H., and Zhao, D. (2021). Combining reachability analysis and importance sampling for accelerated evaluation of highway automated vehicles at pedestrian crossing. ASME Letters in Dynamic Systems and Control, 1(1):011017.
  • World Health Organization (2018). Global status report on road safety 2018. URL https://www.who.int/publications/i/item/9789241565684.
  • Xian, X., Wang, A., and Liu, K. (2018). A nonparametric adaptive sampling strategy for online monitoring of big data streams. Technometrics, 60(1):14–25.
  • Yao, Y. and Wang, H. (2019). Optimal subsampling for softmax regression. Statistical Papers, 60:585–599.
  • Yates, F. and Grundy, P. M. (1953). Selection without replacement from within strata with probability proportional to size. Journal of the Royal Statistical Society. Series B (Methodological), 15(2):253–261.
  • Yu, J., Wang, H., Ai, M., and Zhang, H. (2022). Optimal distributed subsampling for maximum quasi-likelihood estimators with massive data. Journal of the American Statistical Association, 117(537):265–276.
  • Zhang, J., Meng, C., Yu, J., Zhang, M., Zhong, W., and Ma, P. (2023). An optimal transport approach for selecting a representative subsample with application in efficient kernel density estimation. Journal of Computational and Graphical Statistics, 32(1):329–339.
  • Zhang, M., Zhou, Y., Zhou, Z., and Zhang, A. (2024). Model-free subsampling method based on uniform designs. IEEE Transactions on Knowledge and Data Engineering, 36(3):1210–1220.
  • Zhang, T., Ning, Y., and Ruppert, D. (2021). Optimal sampling for generalized linear models under measurement constraints. Journal of Computational and Graphical Statistics, 30(1):106–114.
  • Zhou, Z., Yang, Z., Zhang, A., and Zhou, Y. (2024). Efficient model-free subsampling method for massive data. Technometrics, 66(2):240–252.
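Many of the works listed above (e.g. Hansen and Hurwitz, 1943; Horvitz and Thompson, 1952; Wang et al., 2018) build on the same core idea behind the article's title: draw a subsample with unequal, informative probabilities and correct the estimate by inverse-probability weighting. As a minimal illustration only — not the authors' method — the sketch below uses a hypothetical population `y`, a hypothetical auxiliary proxy `s` (standing in for a machine-learning prediction), probability-proportional-to-size sampling with replacement, and the Hansen–Hurwitz estimator of the population total:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite population: y is the study variable (observed only
# for sampled units); s is an auxiliary size proxy available for every
# unit, e.g. a machine-learning prediction of the magnitude of y.
N = 10_000
y = rng.gamma(shape=2.0, scale=5.0, size=N)
s = np.clip(y + rng.normal(scale=2.0, size=N), 0.1, None)  # imperfect proxy

# Probability-proportional-to-size (PPS) sampling with replacement:
# draw probabilities proportional to the auxiliary proxy.
p = s / s.sum()
n = 500
idx = rng.choice(N, size=n, replace=True, p=p)

# Hansen-Hurwitz estimator of the population total:
# average of y_i / p_i over the n draws.
t_hat = np.mean(y[idx] / p[idx])
t_true = y.sum()
```

Because the proxy `s` tracks `y` closely, the weighted values `y[idx] / p[idx]` are nearly constant across draws, so the estimator's variance is far smaller than under simple random sampling of the same size — the basic motivation for optimal subsampling designs.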