Teacher’s Corner

From Black Box to Shining Spotlight: Using Random Forest Prediction Intervals to Illuminate the Impact of Assumptions in Linear Regression

Pages 414–429 | Received 19 Feb 2022, Accepted 15 Jul 2022, Published online: 07 Oct 2022

