Book Reviews

Practical Text Analytics: Maximizing the Value of Text Data

by Murugan Anandarajan, Chelsey Hill, and Thomas Nolan, New York, NY: Springer Publishing Company, Incorporated, 2018, vii + 285 pp., $76.64, ISBN: 978-3-319-95663-3.

The authors preface this book with the goal of making text analytics accessible to researchers by introducing the essential elements of a text mining pipeline and illustrating practical examples in text analytics using software packages in R, Python, RapidMiner, and SAS. The book is designed as an introductory tutorial for beginning researchers and professionals in text data-related fields who have minimal mathematical or statistical background or programming experience.

The book contains five main sections beyond the introductory Chapter 1. The first section on “Planning the Text Analytics Project” (Chapters 2 and 3) introduces fundamentals of content analysis and a planning framework for text analytics applications. The second section on “Text Preparation” (Chapters 4 and 5) outlines steps and techniques for text preprocessing and data reduction. The next section titled “Text Analysis Techniques” (Chapters 6–10) is the nucleus of the book, covering essential supervised and unsupervised learning methods for retrieving information from prepared text data. The fourth section on “Communicating the Results” (Chapters 11 and 12) focuses on the interpretation of text analysis results, which comprises storytelling and data visualization. In the final section (“Examples of Text Analytics Software,” Chapters 13–16), each chapter illustrates real-world data examples analyzed and coded using a programming language or software package.

Chapter 1 discusses the definition, origins, current development, and applications of text analytics. The practical use of text analytics in business and industry is illustrated through applications that investigate the most frequent terms and phrases in the abstracts and titles of text analytics-related articles and the most prevalent jobs and skills required for text analytics. The rest of the chapter provides an overview of the next four sections, whose topics cover the main steps in the process of text analytics. Each chapter ends with a summary note highlighting its keywords and important concepts.

Chapter 2 presents the fundamentals of content analysis as theoretical foundations for text analytics, including deductive and inductive approaches, data units, sampling strategies, and coding processes. Chapter 3 then details elements to be considered in planning a text analytics project. The titles and layout in Chapter 3 are a bit confusing. For instance, Section 3.3 is titled “Planning Process” but consists of a single figure, titled “Text Analytics Planning Tasks,” which actually refers to the steps that follow the initial planning of the tasks under consideration. These steps are then explained in detail in Sections 3.4–3.6. It would have been better to have each section clearly correspond to one of the main steps in the text analytics planning process, without confusion and redundancy.

Chapters 4 and 5 essentially deal with preparing text data for analysis. Chapter 4 focuses on tokenizing, standardizing and cleaning, removing stop words, stemming, and lemmatizing. Both basic and advanced techniques involved in these preprocessing steps are provided and illustrated with an example of dog owners describing their dogs. Chapter 5 details the process of converting cleaned text data into an analysis-ready term-document matrix (TDM) or document-term matrix (DTM), starting from the inverted index, then proceeding to the term-document matrix and three primary frequency-weighting options. Chapter 5 ends with a discussion of the choice among different frequency-weighting methods, and readers are given references to examples in other chapters. It would have been more insightful to provide summaries and comments on how the weighting choices in those examples were made.
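
To make the matrix construction concrete, here is a minimal Python sketch, not taken from the book, that builds a TDM from three invented documents and applies one common weighting, tf-idf. The function names, the toy documents, the whitespace tokenizer, and the unsmoothed natural-log idf are all assumptions made for illustration; the book's three weighting options may differ.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (sorted vocabulary, matrix) with rows = terms, columns = documents."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    tdm = [[c[term] for c in counts] for term in vocab]
    return vocab, tdm


def tfidf(tdm):
    """Weight raw counts by idf = log(N / df), one of several tf-idf variants."""
    n_docs = len(tdm[0])
    weighted = []
    for row in tdm:
        df = sum(1 for count in row if count > 0)  # documents containing the term
        idf = math.log(n_docs / df)
        weighted.append([tf * idf for tf in row])
    return weighted


# Hypothetical documents, loosely in the spirit of a dog-description example.
docs = ["fluffy brown dog", "brown dog barks", "playful fluffy puppy"]
vocab, tdm = term_document_matrix(docs)
weights = tfidf(tdm)
```

Under this variant, a term that occurs in every document receives weight zero, while a term confined to a single document is weighted most heavily.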

Chapter 6 explains latent semantic analysis (LSA), which applies singular value decomposition (SVD) to the TDM to reduce dimensionality, and two primary analysis techniques performed on the resulting semantic space: cosine similarity and queries, which aim to measure associations among terms and to uncover the hidden meaning and concepts present in the text documents being analyzed. As the essential method used in LSA, SVD provides a rank-k approximation of the original data matrix, and the last section of this chapter provides some discussion on the choice of a proper k. It is disappointing that the chapter does not at least sketch out more recent and advanced topics related to LSA, such as probabilistic LSA.
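
The rank-k step described above can be sketched in a few lines of Python. This is an illustrative sketch, not the book's code: it assumes NumPy is available, the helper names `lsa` and `cosine` are hypothetical, and the toy matrix is invented so that two documents share one set of terms and two share another.

```python
import numpy as np

# Invented term-document matrix: rows = terms, columns = documents.
# Documents 0 and 1 use the first two terms, documents 2 and 3 the last two.
tdm = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
])


def lsa(matrix, k):
    """Truncated SVD: return the rank-k approximation and document coordinates."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    approx = (U_k * s_k) @ Vt_k            # rank-k approximation of the matrix
    doc_coords = (s_k[:, None] * Vt_k).T   # one row per document in the k-dim space
    return approx, doc_coords


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


approx, doc_coords = lsa(tdm, k=2)
same_topic = cosine(doc_coords[0], doc_coords[1])   # documents sharing terms
cross_topic = cosine(doc_coords[0], doc_coords[2])  # documents with no shared terms
```

By the Eckart–Young theorem, the Frobenius-norm error of the approximation equals the square root of the sum of the squared discarded singular values, which is one quantity that can inform the choice of k.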

Chapter 7 illustrates hierarchical cluster analysis (HCA) and k-means clustering (kMC) to find meaningful and interpretable groupings in text data. The distance or similarity between terms or documents can be calculated from the DTM or TDM and used as the input to HCA and kMC. The “Dog Description” example of Chapter 4 is used to illustrate the two clustering methods discussed. This chapter also provides subjective and graphical methods to determine the number of clusters. In our view, this chapter is rather rudimentary and would have benefited from describing methods such as k-mean-directions (Maitra and Ramler 2010) that use the specific structure of text data while finding groups using the cosine similarity.

Chapter 8 introduces several probabilistic topic models for finding latent subjects or themes that are present in text, including latent Dirichlet allocation, dynamic topic models, correlated topic models, supervised latent Dirichlet allocation, and structural topic models. The topic models introduced here involve some advanced statistical concepts and methods, such as probabilistic graphical models, with which readers may be unfamiliar, so some preliminary introduction and background on this topic would have benefited some readers.

Chapter 9 describes text classification, which categorizes text documents when the true group labels are known. The classification models introduced include naive Bayes, k-nearest neighbors, support vector machines, decision trees, random forests, and neural networks, and they are demonstrated with a dog-breed example. We feel that Section 9.3, which describes evaluation of model fit, should also have included discussion of and strategies for guarding against overfitting, and how this can be better used in a predictive context.

Chapter 10 introduces sentiment analysis, that is, methods that retrieve opinions and feelings from text data. Two types of sentiment analysis are explained. The first is a lexicon-based approach that uses preset dictionaries, such as General Inquirer, AFINN, OpinionFinder, and SentiWordNet, to identify sentiment of text. The other approach is learning-based and employs machine learning (classification) techniques such as naive Bayes, support vector machines, and logistic regression to identify the sentiment of text data.
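
As a rough illustration of the lexicon-based approach, the following Python sketch scores a text by summing per-word polarity values. The dictionary here is a made-up toy lexicon with integer scores in the general style of AFINN, not the actual contents of any of the dictionaries named above, and the function names are hypothetical.

```python
import re

# Toy lexicon with invented polarity scores (NOT the real AFINN values).
TOY_LEXICON = {
    "good": 2, "great": 3, "love": 3,
    "bad": -2, "terrible": -3, "boring": -2,
}


def sentiment_score(text, lexicon=TOY_LEXICON):
    """Sum the polarity of every token found in the lexicon; unknown words count 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)


def sentiment_label(score):
    """Map a numeric score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


review_score = sentiment_score("A great film, I love it")
review_label = sentiment_label(review_score)
```

A sketch this simple ignores negation ("not good") and intensity modifiers, which is one reason learning-based classifiers are often considered alongside dictionaries.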

Chapters 11 and 12 explore ways of interpreting the results of text analyses. For text data, storytelling plays an important role in communicating essential and meaningful information from the data to an audience and in calling for action. Chapter 11 demonstrates the concepts and a framework of storytelling from data with real-world examples from United Parcel Service and Zillow. Chapter 12 then introduces visualizing text data and analysis results using heat maps, word clouds, top term plots, cluster visualizations, topics over time, and network graphs. These are worthwhile chapters that illustrate the utility of text analysis and its use in a real-world setting.

Chapter 13 (Sentiment Analysis of Movie Reviews Using R), Chapter 14 (Latent Semantic Analysis (LSA) in Python), Chapter 15 (Learning-Based Sentiment Analysis Using RapidMiner), and Chapter 16 (SAS Visual Text Analytics) introduce the four software programs used in text analytics with real-world examples. For each program, a brief introduction and installation instructions are given, and step-by-step commands for implementing a specific analysis are presented and explained. Since there are various packages and functions for doing the same analysis in R, Python, RapidMiner, and SAS, it would have been more helpful to the practitioner to have a reference section at the end of each chapter discussing other options available in the software.

We decided to review this book to get a good understanding of the basic principles of text analytics and to use it as the textbook for an introductory-level seminar class for graduate students majoring in a wide range of disciplines. We found this book helpful for learning the text preparation process through practical examples such as word clouds, one of the text analysis visualization tools. However, although the book is aimed at readers with varying levels of mathematical and statistical sophistication, we feel that some major omissions, together with concepts that are introduced without much preliminary background or addressed only perfunctorily, will make the book difficult for all but the most advanced readers, who in turn will probably not have much use for it anyway.

Fan Dai and Ranjan Maitra
Iowa State University

References

  • Maitra, R., and Ramler, I. P. (2010), “A k-Mean-Directions Algorithm for Efficient Clustering of Data on a Sphere,” Journal of Computational and Graphical Statistics, 19, 377–396. DOI: 10.1198/jcgs.2009.08155.
