References
- Ayres-de Campos, D., Bernardes, J., Garrido, A., Marques-de Sa, J., and Pereira-Leite, L. (2000), “SisPorto 2.0: A Program for Automated Analysis of Cardiotocograms,” Journal of Maternal-Fetal Medicine, 9, 311–318.
- Blanco, J. L., and Rai, P. K. (2014), “nanoflann: a C++ header-only fork of FLANN, a library for nearest neighbor (NN) with kd-trees,” available at https://github.com/jlblancoc/nanoflann.
- Bowden, G. J., Maier, H. R., and Dandy, G. C. (2002), “Optimal Division of Data for Neural Network Models in Water Resources Applications,” Water Resources Research, 38, 2-1–2-11. DOI: https://doi.org/10.1029/2001WR000266.
- Breiman, L. (2001), “Random Forests,” Machine Learning, 45, 5–32. DOI: https://doi.org/10.1023/A:1010933404324.
- Brooks, T. F., Pope, D. S., and Marcolini, M. A. (1989), Airfoil Self-noise and Prediction, Washington, DC: NASA.
- Chen, W. Y., Mackey, L., Gorham, J., Briol, F.-X., and Oates, C. (2018), “Stein Points,” in Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research (Vol. 80), eds. J. Dy and A. Krause, Stockholm, Sweden: PMLR, pp. 844–853.
- Chen, Y., Welling, M., and Smola, A. (2010), “Super-samples From Kernel Herding,” Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, pp. 109–116.
- Dua, D., and Graff, C. (2017), “UCI Machine Learning Repository,” available at http://archive.ics.uci.edu/ml.
- Elo, I., Rodriguez, G., and Lee, H. (2001), “Racial and Neighborhood Disparities in Birthweight in Philadelphia,” paper presented at the Annual Meeting of the Population Association of America, Washington, DC.
- Evett, I. W., and Spiehler, E. J. (1989), “Rule Induction in Forensic Science,” in Knowledge Based Systems, ed. P. H. Duffin, New York, NY: Halsted Press, pp. 152–160.
- Fang, K. T., and Wang, Y. (1994), Number-Theoretic Methods in Statistics, Boca Raton, FL: Chapman & Hall.
- Faraway, J. J. (2015), Linear Models with R (2nd ed.), Boca Raton, FL: CRC Press.
- Fisher, R. A. (1936), “The Use of Multiple Measurements in Taxonomic Problems,” Annals of Eugenics, 7, 179–188. DOI: https://doi.org/10.1111/j.1469-1809.1936.tb02137.x.
- Flury, B. (1990), “Principal Points,” Biometrika, 77, 33–41. DOI: https://doi.org/10.1093/biomet/77.1.33.
- Friedman, J., Hastie, T., and Tibshirani, R. (2010), “Regularization Paths for Generalized Linear Models Via Coordinate Descent,” Journal of Statistical Software, 33, 1–22. DOI: https://doi.org/10.18637/jss.v033.i01.
- Galvão, R. K. H., Araujo, M. C. U., José, G. E., Pontes, M. J. C., Silva, E. C., and Saldanha, T. C. B. (2005), “A Method for Calibration and Validation Subset Partitioning,” Talanta, 67, 736–740. DOI: https://doi.org/10.1016/j.talanta.2005.03.025.
- Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, New York: Springer.
- Hickernell, F. J. (1999), “Goodness-of-fit Statistics, Discrepancies and Robust Designs,” Statistics and Probability Letters, 44, 73–78. DOI: https://doi.org/10.1016/S0167-7152(98)00293-4.
- Joseph, V. R., Dasgupta, T., Tuo, R., and Wu, C. F. J. (2015), “Sequential Exploration of Complex Surfaces Using Minimum Energy Designs,” Technometrics, 57, 64–74. DOI: https://doi.org/10.1080/00401706.2014.881749.
- Kennard, R. W., and Stone, L. A. (1969), “Computer Aided Design of Experiments,” Technometrics, 11, 137–148. DOI: https://doi.org/10.1080/00401706.1969.10490666.
- Liaw, A., and Wiener, M. (2002), “Classification and Regression by randomForest,” R News, 2, 18–22.
- Mak, S. (2019), support: Support Points, R package version 0.1.4, available at https://cran.r-project.org/src/contrib/Archive/support.
- Mak, S., and Joseph, V. R. (2018a), “Projected Support Points: A New Method for High-dimensional Data Reduction,” arXiv preprint: 1708.06897.
- Mak, S., and Joseph, V. R. (2018b), “Support Points,” The Annals of Statistics, 46, 2562–2592.
- May, R., Maier, H., and Dandy, G. (2010), “Data Splitting for Artificial Neural Networks Using SOM-based Stratified Sampling,” Neural Networks, 23, 283–294. DOI: https://doi.org/10.1016/j.neunet.2009.11.009.
- Nash, W. J., Sellers, T. L., Talbot, S. R., Cawthorn, A. J., and Ford, W. B. (1994), The Population Biology of Abalone (Haliotis Species) in Tasmania. I. Blacklip Abalone (H. rubra) From the North Coast and Islands of Bass Strait, Technical Report, 48, Tasmania: Sea Fisheries Division, p. 411.
- Niederreiter, H. (1992), Random Number Generation and Quasi-Monte Carlo Methods, Philadelphia, PA: SIAM.
- Owen, A. B. (2013), Monte Carlo Theory, Methods and Examples, available at https://statweb.stanford.edu/~owen/mc/.
- Reitermanová, Z. (2010), “Data Splitting,” WDS’10 Proceedings of Contributed Papers, Part I, pp. 31–36.
- Snee, R. D. (1977), “Validation of Regression Models: Methods and Examples,” Technometrics, 19, 415–428. DOI: https://doi.org/10.1080/00401706.1977.10489581.
- Stevens, A., and Ramirez-Lopez, L. (2020), An Introduction to the prospectr Package, R package version 0.2.1.
- Stone, M. (1974), “Cross-validatory Choice and Assessment of Statistical Predictions,” Journal of the Royal Statistical Society, Series B, 36, 111–146. DOI: https://doi.org/10.1111/j.2517-6161.1974.tb00994.x.
- Street, W. N., Wolberg, W. H., and Mangasarian, O. L. (1993), “Nuclear Feature Extraction for Breast Tumor Diagnosis,” in Biomedical Image Processing and Biomedical Visualization, Vol. 1905, eds. R. S. Acharya and D. B. Goldgof, pp. 861–870. San Jose, CA: International Society for Optics and Photonics.
- Székely, G. J., and Rizzo, M. L. (2013), “Energy Statistics: A Class of Statistics Based on Distances,” Journal of Statistical Planning and Inference, 143, 1249–1272. DOI: https://doi.org/10.1016/j.jspi.2013.03.018.
- Thodberg, H. H. (1993), “Ace of Bayes: Application of Neural Networks With Pruning,” Technical report, Roskilde, Denmark.
- Tibshirani, R. (1996), “Regression Shrinkage and Selection Via the Lasso,” Journal of the Royal Statistical Society, Series B, 58, 267–288. DOI: https://doi.org/10.1111/j.2517-6161.1996.tb02080.x.
- Wang, H., Yang, M., and Stufken, J. (2019), “Information-based Optimal Subdata Selection for Big Data Linear Regression,” Journal of the American Statistical Association, 114, 393–405. DOI: https://doi.org/10.1080/01621459.2017.1408468.
- Wu, C. F. J., and Hamada, M. S. (2011), Experiments: Planning, Analysis, and Optimization, (2nd ed.) Hoboken, NJ: Wiley.
- Xu, Y., and Goodacre, R. (2018), “On Splitting Training and Validation Set: A Comparative Study of Cross-Validation, Bootstrap and Systematic Sampling for Estimating the Generalization Performance of Supervised Learning,” Journal of Analysis and Testing, 2, 249–262. DOI: https://doi.org/10.1007/s41664-018-0068-2.
- Yeh, I.-C. (1998), “Modeling of Strength of High-performance Concrete Using Artificial Neural Networks,” Cement and Concrete Research, 28, 1797–1808. DOI: https://doi.org/10.1016/S0008-8846(98)00165-3.
- Zador, P. (1982), “Asymptotic Quantization Error of Continuous Signals and the Quantization Dimension,” IEEE Transactions on Information Theory, 28, 139–149. DOI: https://doi.org/10.1109/TIT.1982.1056490.