Book Reviews

Introduction to Data Science: Data Analysis and Prediction Algorithms With R

by Rafael A. Irizarry. Boca Raton, FL: Chapman and Hall/CRC, Taylor & Francis Group, 2020, xxx + 713 pp., $99.95, ISBN: 978-0-367-35798-6.

The textbook belongs to the Data Science Series and presents a modern approach to statistical analysis via the powerful capabilities of the R language. The monograph is organized in six parts and thirty-eight chapters, each with multiple subsections, exercises, code examples, and discussions of outcomes. The Introduction describes the book's structure, and Chapter 1, "Getting Started With R and RStudio," describes the R console that executes typed commands, code saved as scripts, and RStudio, a user-friendly integrated development environment (IDE) that provides many useful tools. Multiple screenshots demonstrate running commands while editing scripts, changing global options, and installing R packages and libraries.
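For instance, a minimal setup along the lines the chapter walks through (the dslabs package accompanying the book is assumed here only to provide example data for later chapters) would be:

    # install packages from CRAN (needed once) and attach them for the session
    install.packages(c("tidyverse", "dslabs"))
    library(tidyverse)
    library(dslabs)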

Part I, "R," starts with Chapter 2, "R Basics," which describes the main building blocks of R: functions and prebuilt objects, variable names, data types and data frames, vectors, matrices, and lists, data subsetting and coercion, sorting and ranking, vector arithmetic and logical operators, saving the workspace, and some basic plots, with a case study on US gun murders in comparison with other countries. Chapter 3, "Programming Basics," explains how to use conditional expressions and define functions, and describes namespaces, for-loops, vectorization, and functionals such as apply, sapply, and replicate. Chapter 4, "The tidyverse," introduces a specific data format referred to as tidy (one observation per row, one variable per column, all data available) for more efficient operations on data frames with the collection of packages called the tidyverse, which is loaded with library(tidyverse). This collection includes such popular packages as dplyr for manipulating data frames, purrr for working with functions, ggplot2 for graphing, and many others. New ways of working with data frames via dplyr are described, including the pipe operator %>% for applying one function after another, summarizing within groups, nested sorting, the tibble as a special modern kind of data frame obtained, for instance, when stratifying data, and the dot and do operators. Chapter 5, "Importing Data," describes paths and the working directory, and the readr and readxl packages that supply the main tidyverse data-importing functions.
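A minimal sketch of this dplyr style, assuming the murders data frame from the dslabs package (with the population, total, and region columns used in the gun murders case study), might look like:

    library(dslabs)
    library(dplyr)
    data(murders)
    # murder rate per 100,000, summarized by region using the pipe operator
    murders %>%
      mutate(rate = total / population * 10^5) %>%
      group_by(region) %>%
      summarize(median_rate = median(rate)) %>%
      arrange(desc(median_rate))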

Part II, "Data Visualization," opens with Chapter 6, "Introduction to Data Visualization," which illustrates several datasets graphically, citing the father of exploratory data analysis (EDA), J. W. Tukey: "The greatest value of a picture is when it forces us to notice what we never expected to see." Chapter 7, "ggplot2," shows how to create various kinds of plots with different features using the package loaded with library(ggplot2). Chapter 8, "Visualizing Data Distributions," uses data on student heights to demonstrate their distributions by gender and region with histograms and smoothed densities, bar and box plots, percentiles and quantiles, QQ-plots, and cumulative distribution graphs. Chapter 9, "Data Visualization in Practice," uses data on developing countries to present various scatterplots with multiple panels of faceting variables, time series plots, multimodal distributions with box and ridge plots, weighted densities, and data transformations. Chapter 10, "Data Visualization Principles," considers how humans detect patterns and make comparisons: how many quantities can be viewed at once, distributions by values or categories, and special features related to position, length, angles, area, ordering, colors, and brightness. Various kinds of plots are presented, and plots for two variables are discussed, including slope charts, the Bland–Altman plot, and plots with an encoded third variable. A case study on infectious diseases and vaccines in graphical presentation is given as well. Chapter 11, "Robust Summaries," deals with finding outliers and using the median, interquartile range (IQR), and median absolute deviation in graphs.
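For illustration, a ggplot2 call in the spirit of Chapter 8 (assuming the heights data frame from the dslabs package, with its sex and height columns) could be:

    library(dslabs)
    library(ggplot2)
    data(heights)
    # smoothed density of student heights, separated by sex
    ggplot(heights, aes(x = height, fill = sex)) +
      geom_density(alpha = 0.3) +
      labs(x = "Height (inches)", y = "Density")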

Part III, "Statistics With R," starts with Chapter 12, "Introduction to Statistics With R," which notes that the following chapters describe statistical concepts and explain them by implementing R code on case studies. Chapter 13, "Probability," defines and calculates, on various datasets, relative frequencies and discrete and continuous probability distributions; presents Monte Carlo simulations for categorical data, sampling with and without replacement, conditional probabilities, the addition and multiplication rules, and combinations and permutations; and discusses the Monty Hall and birthday problems. Chapter 14, "Random Variables," deals with data affected by chance because the data come from a random sample, contain measurement error, or have a source that is random by nature. The expected value and standard error, the central limit theorem (CLT) and the law of large numbers, population versus sample, and properties of averages are considered on the example of the financial crisis of 2007–2008, which occurred because the risks of mortgage-backed securities (MBS) and collateralized debt obligations (CDO) were grossly underestimated. Chapter 15, "Statistical Inference," describes polls and the properties of estimates, confidence intervals, power and p-values, chi-square and odds ratio tests, and the problem of small p-values for large samples. Chapter 16, "Statistical Models," continues with poll aggregators, which combine data from different polls to improve predictions. Data-driven and hierarchical models, and Bayesian simulation and statistics, are used for election forecasting and for predicting the electoral college. Chapter 17, "Regression," focuses on the bivariate regression model. Chapter 18, "Linear Models," is devoted to multiple regression modeling, one of the main tools in data science. On the example of baseball data, least squares estimates (LSE) are applied for building regressions using various R tools, particularly in the tidyverse and broom packages for stratified models. Chapter 19, "Association Is Not Causation" (aka correlation is not causation), discusses spurious correlation, reversing cause and effect, confounders, and Simpson's paradox, on the example of the UC Berkeley admissions data.
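A small sketch of the Monte Carlo style used in Chapters 13 and 14 (the numbers here are illustrative, not taken from the book) checks the CLT for the mean of fair-coin tosses:

    set.seed(1)
    B <- 10000   # number of simulated samples
    n <- 100     # sample size
    # sample means of n draws with replacement from {0, 1}
    x_bar <- replicate(B, mean(sample(c(0, 1), n, replace = TRUE)))
    mean(x_bar)  # close to the expected value 0.5
    sd(x_bar)    # close to the CLT standard error sqrt(0.25 / n) = 0.05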

Part IV, "Data Wrangling," opens with Chapter 20, "Introduction to Data Wrangling," which points out that original data can arrive in many different forms requiring string processing, HTML parsing, handling of dates and times, and text mining, so several preliminary steps are needed to present the whole dataset in a data frame in tidy format. Chapter 21, "Reshaping Data," describes the tidyr package, which includes several functions for tidying data. Chapter 22, "Joining Tables," characterizes several functions for joining, binding, and intersecting datasets. Chapter 23, "Web Scraping" (or web harvesting), shows how to extract data from a website, for instance, from a Wikipedia page. The information used by a browser to render webpages comes as a text file from a server, and the text is coded in the hypertext markup language (HTML). Cascading style sheets (CSS) are widely used to make webpages look nice, and the rvest package helps to import a webpage into R. A format widely adopted on the internet is the JavaScript Object Notation (JSON), and the jsonlite package can be used to read it into a data frame. Chapter 24, "String Processing," describes how to extract numerical data and names contained in strings and to perform many other operations on them with the help of the stringr package, which is illustrated in several case studies. Chapter 25, "Parsing Dates and Times," describes the tidyverse functionality for working with dates through the lubridate package. Chapter 26, "Text Mining," is devoted to working with free-form text, which is needed in such applications as spam filtering, cyber-crime prevention, counter-terrorism, and sentiment analysis. The tidytext package converts free text into a tidy table. A case study of the Twitter account of D. J. Trump during the 2016 election and other cases are presented.
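As a sketch of the web scraping workflow described in Chapter 23 (the URL below is only illustrative, not necessarily the page used in the book):

    library(rvest)
    # download a webpage and extract its HTML tables into a list of data frames
    url <- "https://en.wikipedia.org/wiki/Gun_violence_in_the_United_States"
    page <- read_html(url)
    tables <- html_table(html_nodes(page, "table"))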

Part V, "Machine Learning," opens with Chapter 27, "Introduction to Machine Learning," which defines this topic as one of the most popular modern data science methodologies, widely applied, for instance, in the handwritten zip code readers implemented in the postal service, speech recognition technologies, movie recommendation systems, spam and malware detectors, housing price predictors, and driverless cars. Another term often used for this approach is artificial intelligence (AI), although AI refers more to algorithms like those developed for chess-playing machines by programming rules, while machine learning builds its algorithms and decisions from data. Machine learning uses available data to build a model and then applies it for prediction of a continuous output or classification of a categorical output. The quality of a model or machine learning algorithm, evaluated on training and test subsets, is considered, together with the confusion matrix, sensitivity and specificity, receiver operating characteristic (ROC) and precision-recall curves, balanced accuracy and the F1-score, and conditional probabilities and expectations for minimizing the squared loss function. Chapter 28, "Smoothing," considers curve fitting, aka low-pass filtering, which is extremely useful in machine learning because conditional probabilities reveal trends or shapes that must be estimated in the presence of uncertainty in the data. Bin smoothing and kernels, local weighted regression (loess), and fitting parabolas are described. Chapter 29, "Cross Validation," discusses how to implement cross validation with the caret package, questions of k-nearest neighbors (kNN), over-training and over-smoothing, picking the k in kNN, K-fold cross validation, and the bootstrap. Chapter 30, "The caret Package," describes this package, which currently contains 237 different methods, in more detail, including its train function and cross validation, and shows fitting with loess on examples. The manual on the package's techniques is available at https://topepo.github.io/caret/available-models.html; see also https://topepo.github.io/caret/train-models-by-tag.html. Chapter 31, "Examples of Algorithms," includes methods of supervised learning, such as linear regression and the predict function, logistic regression and generalized linear models, kNN and generative models, naïve Bayes and controlling prevalence, linear and quadratic discriminant analyses, classification and regression trees (CART), random forests, and the true conditional probability. Chapter 32, "Machine Learning in Practice," demonstrates applications of kNN and random forests, with variable importance, visual assessments, and ensembles. Chapter 33, "Large Datasets," deals with computational techniques and statistical concepts specifically oriented to the analysis of big data. Various approaches are described, including matrix algebra with vectorization and filtering based on summaries, indexing with matrices and data binarization, distances in higher dimensions and the preservation of distances, dimension reduction and orthogonal transformations, principal component analysis (PCA) and singular value decomposition (SVD), regularization and penalized least squares, and matrix factorization and factor analysis, with examples of movie and user effects. Chapter 34, "Clustering," focuses on algorithms of unsupervised learning, including hierarchical clustering and k-means, with the use of heatmaps and filtering of features.
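For example, a caret call along the lines described in Chapters 29–31 (using the built-in iris data only as a stand-in, since any classification dataset would do) is:

    library(caret)
    set.seed(1)
    # kNN with 10-fold cross validation to choose the number of neighbors k
    fit <- train(Species ~ ., data = iris, method = "knn",
                 tuneGrid = data.frame(k = seq(3, 15, 2)),
                 trControl = trainControl(method = "cv", number = 10))
    fit$bestTune   # the value of k picked by cross validation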

Part VI, "Productivity Tools," opens with Chapter 35, "Introduction to Productivity Tools," which describes the advantages of using the Unix shell as a tool for managing files and directories. The version control system Git is introduced, together with the GitHub service, which makes it possible to keep track of changes in scripts and reports and to host and share code; see http://github.com. Chapter 36, "Organizing With Unix," explains in more detail Linux, the shell, and the command line, and navigating the filesystem by changing directories, with examples of many Unix commands. Chapter 37, "Git and GitHub," provides illustrations with multiple screenshots on this topic, describing repositories, clones, and the use of these tools in RStudio. Chapter 38, "Reproducible Projects With RStudio and R Markdown," concludes with creating reports on data analysis projects with the help of the knitr package, which minimizes the work. A detailed and useful nineteen-page index closes the book.
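For instance, such a report can be compiled from the R console with a single call (rmarkdown::render drives knitr internally; the file name is only a placeholder):

    library(rmarkdown)
    # knit the R Markdown source and produce an HTML report
    render("report.Rmd", output_format = "html_document")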

The monograph presents a great introduction to data science and modern R programming, with a wealth of examples of R's capabilities throughout the whole volume. The book provides multiple links to websites related to the topics under consideration, which makes it an incredibly useful source on contemporary data science and programming, helping students and researchers in their projects.

Stan Lipovetsky
Minneapolis, MN
