Book Review

Statistical Computing With R

(2nd ed.), by Maria L. Rizzo. Boca Raton, FL: Chapman & Hall/CRC Press, 2019, xiv+474 pp., $99.95 (H), ISBN: 978-1-4665-5332-3.

This book is a textbook introducing statistical computing techniques, with examples implemented in R. It strikes a balance between theory and practice, with more focus on the latter: it not only elaborates the theory of statistical computing but also provides plenty of R code to help readers practice what they have learned. As stated in the preface, the targeted readers are graduate students and advanced undergraduates with preparation in the relevant mathematical foundations, and the content of the book fits this anticipated audience well.

Chapter 1 provides a quick introduction to the R language. Starting from a step-by-step installation guide for both R and RStudio, it briefly covers the basic syntax, distributions and hypothesis testing, how to define your own functions, and the common data objects in R. The last piece is plotting. This is by no means the most comprehensive R guide, but it covers the R basics needed for the examples throughout the book. The section on the help system is particularly useful: when students have questions about certain functions or libraries, they know where to look them up rather than having to search for the information elsewhere. Chapter 2 covers the basics of probability and statistics. Assuming most students are familiar with Statistics 101, this chapter refreshes the reader's memory in preparation for the subsequent chapters, focusing on theory rather than coding.
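The flavor of these R basics can be conveyed with a short session of my own (an illustrative sketch, not code taken from the book): defining a function, drawing from a distribution, running a hypothesis test, and knowing where to look things up.

```r
# A user-defined function: the sample coefficient of variation
cv <- function(x) sd(x) / mean(x)

set.seed(1)
x <- rnorm(100, mean = 10, sd = 2)   # draw from N(10, 2^2)
round(cv(x), 3)

# A one-sample t-test, as in the hypothesis-testing section
t.test(x, mu = 10)$p.value

# Where to look things up without leaving R:
# ?t.test                         # help page for a specific function
# help.search("kernel density")   # search across installed packages
```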

Starting from Chapter 3, the book officially steps into statistical computing. Every chapter is structured as a theoretical section blended with examples and R code. Chapter 3 is a natural application of the knowledge covered in Chapter 2: how to generate a random variable from a given distribution. Chapter 4 extends this from generating a random variable to generating a random process. However, the related section appears a bit too brief for a student to fully understand the transition from counting processes to Brownian motion. Admittedly, it is a challenging topic that would have required a very long section to cover in more detail, and the book does provide readers with plenty of useful references for further reading. Chapter 5 is more application-oriented, covering visualization in data analysis once multivariate distributions have been generated. This is helpful not only for students but also for statistics practitioners in industry. For example, correlation plots and heat maps are commonly used to illustrate the pairwise correlations among a group of features. The section on principal component analysis (PCA) is particularly useful: when representing multidimensional data and the distances between points, using the top principal components to construct the coordinates is a computationally efficient approach that loses little information. R is very powerful in visualization (especially ggplot2), with capabilities whose details go well beyond the scope of this book.
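A representative technique from the generation chapters is the inverse transform method: if U is Uniform(0, 1), then F⁻¹(U) has CDF F. A minimal sketch of my own for the Exponential(rate) distribution, whose inverse CDF is −log(1 − u)/rate:

```r
# Inverse transform sampling for Exponential(rate).
# An illustrative sketch of the method, not code from the book.
rexp_inverse <- function(n, rate = 1) {
  u <- runif(n)          # U ~ Uniform(0, 1)
  -log(1 - u) / rate     # F^{-1}(U) ~ Exponential(rate)
}

set.seed(42)
x <- rexp_inverse(10000, rate = 2)
mean(x)   # should be close to the theoretical mean 1/rate = 0.5
```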

Chapters 6–10 are, in my opinion, the highlights of this book. They not only cover the details of commonly used methods, including Monte Carlo (MC), the bootstrap, and resampling, but also provide use cases for these methods. These sections align perfectly with my experience in industry as a statistical practitioner working on A/B testing and experiment design. For example, when handling massive real-world data, the assumptions of standard tests are often violated to some degree (independence, finite sample size, or even asymptotic convergence), and the bootstrap is usually the most practical method to use if you are not constrained by latency. There is a thorough treatment of bootstrap confidence intervals covering the higher-order methods and their properties. Another example is the permutation test. As a classical nonparametric testing procedure, it is frequently used in A/B testing in industry as well. Chapter 10 covers the basic idea of the permutation test and also provides practical recommendations for the number of permutations in approximate permutation tests. If I have to be really critical, the application in Chapter 10 concerning cross-validation (CV) seems a bit removed from the bootstrap: CV is mostly used for model selection when facing overfitting, and k-fold CV relies on random partitioning rather than bootstrap resampling.
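The A/B-testing use of the permutation test can be sketched in a few lines. This is my own minimal example of an approximate (Monte Carlo) permutation test for a difference in means, with simulated control and treatment groups, not an excerpt from the book:

```r
# Approximate permutation test for a difference in two group means.
set.seed(123)
a <- rnorm(30, mean = 0)      # control group (simulated)
b <- rnorm(30, mean = 0.8)    # treatment group with a real effect

observed <- mean(b) - mean(a)
pooled <- c(a, b)
n_a <- length(a)

R <- 1999                     # number of random permutations
perm_stats <- replicate(R, {
  idx <- sample(length(pooled), n_a)     # random relabeling of the pooled data
  mean(pooled[-idx]) - mean(pooled[idx])
})

# Two-sided p-value, counting the observed statistic among the replicates
p_value <- (1 + sum(abs(perm_stats) >= abs(observed))) / (R + 1)
p_value
```

Counting the observed statistic in both numerator and denominator keeps the p-value strictly positive, a standard convention for approximate permutation tests.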

Markov chain Monte Carlo (MCMC) is a deep topic, and Chapter 11 serves as an introduction to it. It covers the main ideas of MCMC and algorithms for constructing the Markov chains, such as the Metropolis–Hastings algorithm and the Gibbs sampler. Methods for checking convergence are briefly discussed. Chapter 12 focuses on nonparametric density estimation from observed data. This is again powerful in real data analysis: we face many problems that require a nonparametric approach, and density estimation provides a flexible and powerful tool for the visualization, exploration, and analysis of data.
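The core Metropolis–Hastings idea fits in a short loop. Here is my own sketch (not the book's code) of a random-walk Metropolis sampler targeting a standard normal density, with the acceptance step done on the log scale for numerical stability:

```r
# Random-walk Metropolis sampler targeting N(0, 1).
set.seed(1)
n_iter <- 10000
chain <- numeric(n_iter)
chain[1] <- 0
for (t in 2:n_iter) {
  proposal <- chain[t - 1] + rnorm(1, sd = 1)   # symmetric proposal
  # log acceptance ratio: the proposal density cancels by symmetry
  log_ratio <- dnorm(proposal, log = TRUE) - dnorm(chain[t - 1], log = TRUE)
  if (log(runif(1)) < log_ratio) {
    chain[t] <- proposal                        # accept
  } else {
    chain[t] <- chain[t - 1]                    # reject: stay at current state
  }
}
mean(chain[-(1:1000)])   # after burn-in, close to the target mean 0
```

In practice one would also check convergence (e.g., trace plots or the Gelman–Rubin diagnostic discussed in the chapter) before trusting the draws.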

The rest of the book treats R as a general-purpose computing, or even programming, language. For example, it can solve optimization problems, which is particularly useful for computing maximum likelihood estimates. I really appreciate the section on “finding source code” in Chapter 15. Many R libraries are written in C or Fortran, and occasionally we need to dig into that code and make changes to suit our needs. Being able to find the source code and compile the changes is very helpful in daily research. For example, I once opened the C source code of a Classification and Regression Trees implementation and wrote my own loss functions to change how nodes are split.
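The optimization-for-MLE use case looks roughly like the following sketch of mine, which fits a Gamma(shape, rate) distribution to simulated data by minimizing the negative log-likelihood with `optim()` (an illustration of the idea, not the book's example):

```r
# Maximum likelihood estimation of Gamma(shape, rate) parameters via optim().
set.seed(7)
x <- rgamma(200, shape = 3, rate = 2)   # simulated data with known truth

negloglik <- function(par) {
  -sum(dgamma(x, shape = par[1], rate = par[2], log = TRUE))
}

# L-BFGS-B allows box constraints, keeping both parameters positive
fit <- optim(par = c(1, 1), fn = negloglik, method = "L-BFGS-B",
             lower = c(1e-6, 1e-6))
fit$par   # estimates should be near the true values (3, 2)
```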

To sum up, I think this book covers the most common topics in statistical computing. For readers with basic knowledge of statistics and linear algebra, it is a good starter guide, and using R to illustrate the many concepts, algorithms, and visualizations is a good choice. For more advanced and practically minded readers, R may face challenges in computational efficiency when handling very large datasets in certain industrial use cases. Knowing how to leverage parallel computing is essential for many topics in this book, such as the bootstrap and MCMC. I have observed people hesitate to use Bayesian probabilistic inference because of its limited scalability; they ended up turning to deep neural networks instead, which have much better support in terms of libraries, frameworks, and computational resources. In mid-2019, Google released Edward2, a probabilistic inference library built on TensorFlow; I would encourage the author to mention it in the book, even as an extended reference. Similarly, for data manipulation, SparkR provides a distributed way to handle large amounts of data while keeping syntax similar to R. This would benefit readers who plan to apply the techniques in this book at scale.

Finally, I would like to give the author credit for making the code available on GitHub. This makes it convenient for readers to try the code themselves without a lot of typing, and it also allows the author to easily make updated code available to readers.

Ling Leng
Amazon.com
