Teacher’s Corner

From Black Box to Shining Spotlight: Using Random Forest Prediction Intervals to Illuminate the Impact of Assumptions in Linear Regression

Pages 414–429 | Received 19 Feb 2022, Accepted 15 Jul 2022, Published online: 07 Oct 2022

