Data Science and Predictive Analytics, 2nd ed.

Data science is a new and fast-growing field that has received a lot of attention lately. In the broadest sense, data science is the study of the generalizable extraction of knowledge from data (Dhar 2013). To this end, it draws upon various methods in statistics, informatics, computer science, and applied mathematics. Notably, a well-known internet meme, often attributed to Josh Wills, the former Director of Data Engineering at Slack, humorously defines a data scientist as someone who is “better at statistics than any software engineer and better at software engineering than any statistician.” While this meme is partly playful, it underscores the fact that, compared to traditional statisticians, data scientists must possess a more interdisciplinary skill set and hands-on computational abilities to navigate large, noisy, and sometimes unstructured datasets.

Classical textbooks on applied statistics and machine learning, such as Hastie, Tibshirani, and Friedman (2009) and Zhou (2021), place significant emphasis on rigorous statistical and machine learning theories and methods for data modeling and analysis. While these textbooks are undoubtedly valuable resources for graduate students specializing in statistics or machine learning, they often lack comprehensive coverage of data wrangling, processing, and practical guidance for addressing common challenges encountered with real-world datasets. Drawing on my 16 years of experience as a professor teaching graduate students majoring in Biostatistics and Computational Biology, I had previously believed that practical skills could only be acquired through participation in supervised collaborative research projects, such as NIH-funded biomedical research projects, or summer internships at pharmaceutical companies. However, real-world data analysis projects typically require only a narrowly defined subset of practical skills, which makes them insufficient for a systematic study of data science techniques.

Given that context, I am delighted to find that “Data Science and Predictive Analytics” (referred to as DSPA hereafter), authored by Dr. Dinov, is an ideal book for individuals aspiring to pursue a career in data science with a focus on practicality and versatility. Notably, DSPA presents all analytical methods through readily available R scripts, accompanied by appealing visualizations and interpretations of the results. Unlike many theory-driven textbooks in this field, DSPA dedicates sections to practical skills such as basic R programming (Section 1.2), data manipulation and exploratory analysis (Section 2.1), handling specialized data types (Chapter 10), as well as text mining and natural language processing (Chapter 7). Moreover, I find the inclusion of a subsection on responsible data science and ethical predictive analytics (Section 1.1.8) highly informative. Additionally, Section 2.1.13 provides comprehensive guidance on addressing missing values and outliers in the data, which is also handy for the target readers.

DSPA offers extensive coverage of analytic methods, encompassing fundamental linear regression models (Chapter 3), dimensionality reduction (Chapter 4), supervised classification (Chapter 5), unsupervised clustering (Chapter 8), feature selection (Chapter 11), longitudinal and time-course data analysis (Chapter 12), and even deep learning (part of Chapter 6 and Chapter 14). Many of these chapters contain detailed case studies, which further enhance the book's practical value.

However, DSPA is not merely a compilation of “practical tricks.” The book includes sections that delve into crucial foundational knowledge in data science, such as linear algebra and regression (Section 3) and general theory and methods for optimization (Chapter 13). Most chapters start with a brief introduction to the analytical problem under discussion, accompanied by mathematical formulations of relevant methods. The motivations, theoretical assumptions, advantages, and disadvantages of commonly employed techniques are typically discussed prior to the associated R code. This ensures that a curious reader can gain a deeper understanding beyond the mere execution of specific tasks using R functions.

Like any book, DSPA also has areas that could benefit from further improvement. First, there are instances in DSPA where additional descriptions and explanations between blocks of R code would enhance understanding. For instance, Section 2.1.15 introduces the important concept of cohort rebalancing, but the explanation provided is brief, relying heavily on the extensive R code. Similarly, in Section 4.5, independent component analysis (ICA) is defined and discussed, but DSPA could provide a better understanding of the motivation behind ICA by mentioning that (a) ICA is designed to search for independent signals that are not normally distributed (Hyvärinen and Oja 2000), and (b) the excess kurtosis of a Gaussian (normal) random variable is zero.
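To make point (b) concrete, the following is a minimal Python sketch, not taken from DSPA, that estimates the excess kurtosis of simulated samples. The helper `excess_kurtosis`, the sample size, the seed, and the choice of comparison distributions (uniform and Laplace) are all illustrative assumptions. The Gaussian estimate should be close to zero, while the two non-Gaussian samples should be clearly negative and clearly positive, respectively.

```python
import numpy as np


def excess_kurtosis(x):
    """Sample excess kurtosis: fourth standardized moment minus 3."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    variance = np.mean(centered ** 2)
    return np.mean(centered ** 4) / variance ** 2 - 3.0


# Seed and sample size are arbitrary illustrative choices.
rng = np.random.default_rng(0)
n = 200_000

samples = {
    "gaussian": rng.normal(size=n),              # theoretical excess kurtosis: 0
    "uniform": rng.uniform(-1.0, 1.0, size=n),   # theoretical excess kurtosis: -1.2
    "laplace": rng.laplace(size=n),              # theoretical excess kurtosis: 3
}

kurt = {name: excess_kurtosis(x) for name, x in samples.items()}
for name, value in kurt.items():
    print(f"{name:>8}: {value:+.3f}")
```

Because the Gaussian sits at zero on this scale, a nonzero excess kurtosis is one simple indicator of non-Gaussianity, which is the kind of signal ICA looks for.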

Second, I have observed some typos in DSPA. For example, on p. 186, the second bullet point states: “When $n \gg k$, …” which should likely be “When $n \ll k$,” referring to the “large p, small n” phenomenon. Another typo is found on p. 246, where the definition of $p_{j | i}$ should have the lower index in the summation as “$j \neq i$” instead of “$k \neq i$”.

In summary, DSPA is a well-written book that covers a broad range of essential topics in data science. I have personally learned several useful techniques in R programming and data manipulation from it. DSPA can serve as a self-sufficient textbook for an introductory data science course or as a valuable reference for real-world data analysis. Advanced readers may find it beneficial to use DSPA as a gateway to various domains in data science, subsequently exploring specific books and research papers to gain a deeper understanding of the theories and motivations underlying the methods described in DSPA.

Xing Qiu
Department of Biostatistics and Computational Biology
University of Rochester
Rochester, NY
[email protected]

References

Dhar, V. (2013), “Data Science and Prediction,” Communications of the ACM, 56, 64–73. DOI: 10.1145/2500499.
Hastie, T., Tibshirani, R., and Friedman, J. H. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.), New York: Springer.
Hyvärinen, A., and Oja, E. (2000), “Independent Component Analysis: Algorithms and Applications,” Neural Networks, 13, 411–430. DOI: 10.1016/s0893-6080(00)00026-5.
Zhou, Z.-H. (2021), Machine Learning, Springer Nature.

Data Science and Predictive Analytics, 2nd ed.

Ivo D. Dinov, Cham, Switzerland: Springer, 2023, xxxiv + 918 pp., $119.99(H), ISBN 978-3-031-17482-7.
