Foundations of Statistics for Data Scientists: With R and Python

by Alan Agresti and Maria Kateri, Boca Raton, FL: Chapman and Hall/CRC, 2022, xv + 453 pp., $136.46 (hcb).

Roger Sauter, Boeing Retiree. Correspondence: [email protected]

A solid, foundational statistics book that prepares undergraduates for data science roles is a valuable resource. How well does this book accomplish this objective? Will it serve as a good reference for a practicing data scientist? This review evaluates the book against these two questions.

The topics covered are not unusual for an undergraduate mathematical statistics class. Chapters 1–3 cover the basics needed for reliable statistical training of this type. Chapter 1 presents fundamental definitions of statistical concepts, variable types, and descriptive statistics. Chapter 2 covers probability distributions, expectation of random variables, and correlation. Chapter 3 finishes the introduction of these basics with sampling distributions. It can be hard to see the applicability of these statistical concepts for data scientists in industry, but the embedded code can augment the reader’s understanding. In addition, the expanded help with R and Python in the appendices greatly enhances this book as a resource for those using data in their work. The section of Appendix A connected to Chapter 1 recommends useful R packages, gives tips on handling missing values, and shows how to create heat maps. As statistical methods, software, and data science tools continue to expand rapidly, some of this information may become dated relatively soon, but it takes a basic mathematical statistics book down a useful path. The direction is set early in Chapter 1 with this advice, “…careful thought is needed to decide which statistical methods are appropriate for any particular situation, as they all make certain assumptions, and some methods work poorly when the assumptions are violated.” Statistical packages can always run the numbers, but understanding the assumptions discussed in this book is necessary for proper use.

Chapters 4–6 take the reader into the heart of traditional frequentist statistical inference, along with the corresponding Bayesian approach. The classical approach has supplied the building blocks for the majority of statistical analyses, but many more tools are now at the disposal of talented and resourceful data scientists. I like that in Chapter 4 the authors introduce bootstrapping. Not only does bootstrapping help when the underlying sampling distribution is unknown, but it also has multiple applications for the data scientist. The introduction of the Bayesian approach is welcome, although readers may need additional resources for practical implementation. Chapter 5 covers the obligatory treatment of statistical tests. I like that the authors include Bayesian statistical tests alongside the frequentist versions in this chapter and that they address the importance of practical significance versus statistical significance and how statistical tests can be misleading. Chapter 6 includes the basic setup for linear models and least squares. This is expanded to include multiple regression, summaries of variability in regression models, and statistical inference using both classical and Bayesian methods; the chapter concludes with some matrix notation. Some of the practical advice presented in this chapter is “Reality is more complex and never perfectly described by a model, but a model is a tool for making our perception of reality clearer.” Or, as George Box is often quoted as saying, “All models are wrong, but some are useful” (Box 1979). I also like that the authors whet the reader’s appetite by pointing to Chapter 7 and describing when regularization is more appropriate than the more commonly used multiple regression techniques.
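
As a rough illustration of the bootstrap idea discussed in Chapter 4, the sketch below computes a percentile bootstrap confidence interval for a median using only the Python standard library. It is not taken from the book: the function name, the sample values (which include one deliberate outlier), the number of resamples, and the seed are all assumptions chosen for illustration.

```python
import random
import statistics


def bootstrap_percentile_ci(data, stat=statistics.median,
                            n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for stat(data).

    Hypothetical helper for illustration only; not code from the book.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(data)
    # Resample the data with replacement and recompute the statistic each time.
    estimates = sorted(
        stat(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    alpha = 1.0 - level
    # Take the empirical alpha/2 and 1 - alpha/2 quantiles of the estimates.
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return estimates[lo_idx], estimates[hi_idx]


# Made-up sample with one outlier (48.0); these values are assumptions.
sample = [12.1, 9.8, 11.4, 10.7, 13.2, 10.1, 11.9, 48.0, 10.4, 12.6]
low, high = bootstrap_percentile_ci(sample)
print(f"sample median: {statistics.median(sample):.2f}")
print(f"95% percentile bootstrap CI for the median: ({low:.2f}, {high:.2f})")
```

The percentile method is only one of several bootstrap interval constructions; it is used here because it is the simplest to state, and no assumption about the form of the sampling distribution of the median is required.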

The final chapters, 7–9, briefly expose the reader to generalized linear models, clustering, classification, and a historical overview of statistics. In Chapter 7, the practical techniques of linear regression are expanded to include useful tools like logistic and Poisson regression. This greatly enhances a data scientist’s toolbox for working with the wide variety of situations that occur when trying to model real-world relationships. Chapter 8 introduces the reader to classification and clustering. Valuable practical information is provided in the authors’ discussion of when to use logistic regression versus linear discriminant analysis and classification trees. The rest of the chapter gives a brief discussion of k-nearest neighbors and neural networks for classification, and of cluster analysis. The topics introduced here serve only as a starting point; the reader will need to learn more in these areas before applying these types of analysis in practice. Chapter 9 discusses some of the history of data science. What I find most interesting for a data scientist in this chapter are the authors’ final words addressing what they call the pillars of wisdom for practicing data science. These are as follows: plan the study well; data quality is paramount; be aware of potential sources of bias; expect variability and deal with it properly; check assumptions and use appropriate statistical methods; aim for parsimony in methods, presentation, and interpretation; and make analyses reproducible and encourage replication. These are all true and valuable, but situations arise where not all assumptions can be verified. When this happens and the process needs to move forward, it is necessary to give our best estimate, qualified with all concerns (transparency). Sometimes we have to act like lawyers because of all the fine print needed to qualify what can be concluded from the available data and what we may need to reevaluate as more and/or better data is collected.

The appendices for using R and Python are definitely one of the strengths of this book, both for training future data scientists and as a reference for those transitioning into a data science role. The appendices provide more complicated, realistic examples with outliers for bootstrap confidence intervals, simulation, nonparametric methods, survival analysis, regularization, and clustering. They also show the reader how to do the same analysis in R and Python, which can help the reader understand how to use these two different programming languages.

Every data scientist benefits from having a solid statistical foundation. If this knowledge can be gained as an undergraduate student, that is ideal, but some without statistical training transition into a data science role after they leave college. It is important for these data scientists to have access to the same knowledge. Internet searches for statistical techniques are a valuable tool, but when researching how to use a data science tool it is easy to overlook the assumptions needed to properly interpret the results. The statistical foundation that this book provides about the proper use of statistical methods is key to applying data science tools in practice. That makes this book a valuable asset to those working on a degree in data science and to those who are getting their education on the job.

Roger Sauter
Boeing Retiree
[email protected]

Reference

Box, G. E. P. (1979), “Robustness in the Strategy of Scientific Model Building,” in Robustness in Statistics, eds. R. L. Launer and G. N. Wilkinson, pp. 201–236, New York: Academic Press. DOI: 10.1016/B978-0-12-438150-6.50018-2.