Book Review

Statistics and machine learning methods for EHR data – from data extraction to data analytics

by Hulin Wu et al., 2021. Boca Raton, FL:CRC Press, ISBN 978-0-367-44239-2, 327 pages, 130.00 USD, Hardback

Electronic health records (EHR) are a rich source of data for real-world evidence (RWE) generation. As more healthcare providers adopt EHR systems in practice and the data become increasingly accessible, EHR data can play a substantial role in healthcare research. The authors’ effort to provide a structured description of “EMR/EHR data processing and analysis methods” in this book is therefore very timely. The authors state that the book is based on their experience with the Cerner Health Facts database (Cerner, 2020), and that experience clearly benefits the book. They deserve credit for bringing together the data management and statistical tools relevant to EHR research and producing a nearly perfect book in its first edition. The book’s appeal extends beyond EHR research: researchers involved in observational data analysis and comparative effectiveness research (CER) at large may also benefit from it.

The book consists of 11 chapters. Chapter 1 introduces EHR data. The next nine chapters (Chapters 2–10) can be broadly grouped into two parts. Part I (Chapters 2–5) details data extraction, data cleaning and data preparation for EHR research. Part II (Chapters 6–10) describes relevant statistical tools for data extracted from an EHR database. The last chapter (Chapter 11) offers some future directions for EHR data research, including resources available to researchers, and suggests some “post research practices”.

The introductory Chapter 1 provides an excellent description of EHR data originating from hospital and clinic patient care systems and carefully differentiates it from electronic medical record (EMR) data. It goes on to describe the strengths of EHR data, why it should play an important role in healthcare research vis-à-vis gold-standard randomized clinical trials, and the challenges in dealing with EHR data (e.g., missing data, high-dimensional data and potential inaccuracies in the data). This chapter also gives an overview of the Cerner database, which is used for illustration in the remaining chapters. Further, it outlines the steps in the life cycle of EHR data, which are described in detail in Chapters 2–10.

Chapters 2–5 are very detailed and rich in examples, and readers at all experience levels in data management should benefit from them. Chapter 2 describes, with examples, project management for research based on an EHR database. This chapter covers naming conventions, coding conventions, version control, programming rules, people management, and task management. Chapter 3 discusses the design and management of the EHR database system, and the tools available for querying and extracting data from the EHR database for research purposes. An important feature of Chapter 3 is its list of 22 EHR databases, which many researchers may find very informative. Chapter 4 describes the tools available for data cleaning, the challenges in data cleaning and some recommendations. In Section 4.2, the authors discuss tools for general data wranglers. An upfront discussion of the purpose and definition of data wrangling, currently missing, would be helpful. Chapter 5 is devoted to data (pre-)processing and data preparation. These two concepts are clearly defined. Under data (pre-)processing methods, techniques for deriving a new variable, addressing data issues, and reducing data dimension are presented. The discussion on data preparation includes project-specific data cleaning, and the determination of exposure and response status for both primary and sensitivity analyses.

Missing data is the major issue with EHR data, and Chapter 6 is devoted to missing data imputation. This chapter discusses several imputation methods, including both commonly used methods (such as multiple imputation and last observation carried forward, LOCF) and less commonly used methods based on propensity scores and tree-based approaches, and applies these methods to the Cerner database. The authors barely touch upon deep learning methods for missing data imputation and do not illustrate them. Surprisingly, the authors did not discuss the pattern mixture model and the selection model, nor did they state their reasons for excluding these methods. In the illustration, imputation methods are compared using the AUC value, and results based on the missForest imputation method are also presented; an upfront explanation of the AUC value and missForest imputation would have been helpful.
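For readers unfamiliar with the AUC value, the following is a minimal Python sketch of one standard interpretation: the area under the ROC curve equals the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The function name `auc` and the toy labels and scores are illustrative assumptions; this is not the book's code, and it does not reproduce the book's comparison of imputation methods.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 outcomes (1 = positive case).
    scores: iterable of predicted scores, same length as labels.
    Hypothetical helper for illustration only.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy data (invented): two negative and two positive cases.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc(toy_labels, toy_scores)
```

An AUC of 0.5 corresponds to a model that ranks cases no better than chance, and 1.0 to perfect ranking. In a comparison like the one the chapter reports, a higher AUC for a model fitted after a given imputation method would suggest that the method preserved more predictive information.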

Chapter 7 presents several causal inference methods, including propensity score-based methods, mediation analysis and the instrumental variable framework. The section on propensity score-based methods reads more like a literature review, with very little explanation of the basic concept. The example on the propensity score methods is very helpful; however, it is not clear how the standardized mean difference (SMD) is obtained. The discussion on mediation analysis is concise and very clearly presented. Unlike the sections on the propensity score-based approach and mediation analysis, the section on the instrumental variable approach is unnecessarily burdened with mathematical formulae. A new method is presented in Section 7.5; however, the inclusion of a method that has not been peer-reviewed may be debatable. This chapter also briefly discusses targeted maximum likelihood estimation and the doubly robust estimator.
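As a point of reference for the SMD, the sketch below shows one widely used definition for a continuous covariate: the difference in group means divided by the square root of the average of the two group variances. This is an assumption about what is meant, not a reconstruction of the calculation used in the book; the function name and the toy ages are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(treated, control):
    """SMD for a continuous covariate under one common definition.

    SMD = (mean_treated - mean_control) / sqrt((var_treated + var_control) / 2),
    using sample variances. Hypothetical helper for illustration only.
    """
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    if pooled_sd == 0:
        raise ValueError("SMD is undefined when both groups have zero variance")
    return (mean(treated) - mean(control)) / pooled_sd


# Toy data (invented): ages in a treated and a control group.
treated_age = [60, 65, 70]
control_age = [50, 55, 60]
age_smd = standardized_mean_difference(treated_age, control_age)
```

Because the SMD is scaled by the standard deviation rather than the standard error, it does not depend on sample size, which is why it is often preferred to a p-value for checking covariate balance before and after propensity score matching or weighting. An absolute SMD below about 0.1 is a commonly cited rule of thumb for adequate balance.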

Chapter 8 describes several exploratory data analysis (EDA) methods with which most readers are expected to be well-versed already. The EDA methods presented in this chapter include contingency tables, the chi-square test, the Fisher exact test, generalized linear models, survival models, mixed-effects models and variable selection methods (e.g., LASSO). Two lesser-known methods, namely the multivariate lagged drug model and the Gaussian process, are also described, but without illustration; examples of these two methods would be helpful.

Chapter 9 focuses on machine learning methods, including neural networks and deep learning approaches for EHR data analysis. Chapter 10 describes several supervised learning methods for prediction, including tree ensembles and support vector machines. All the methods presented in Chapters 9 and 10 are supported with examples and the necessary R code.

Overall, this book largely serves its purpose and lives up to the expectation of giving an excellent overview of EHR data, the preparation of the analysis dataset from the database and the relevant statistical tools for analyzing EHR data. Chapters 2–5 are very detailed: the authors present several aspects of data management with examples, point out the common problems one may encounter and offer recommendations based on their first-hand experience. My favourite chapters are Chapters 9 and 10, where statistical concepts are explained very well in relatively simple words, with examples and code for implementation. Chapters 5, 6 and 7 may be improved by adding examples, providing code for implementation, focusing more on the concepts and removing unnecessary literature review and mathematical derivations. The book contains some minor typos (e.g., both G and Z are used to denote the instrumental variable in Chapter 7) and formatting lapses, which an alert reader will discover and correct along the way. This book should make it to the bookshelf of anyone involved in data preparation and statistical analysis for EHR research.
