Book Reviews

Pages 878-882 | Published online: 13 Jul 2017

Handbook of Cluster Analysis. Christian Hennig, Marina Meila, Fionn Murtagh, and Roberto Rocci (eds.). Boca Raton, FL: Chapman & Hall/CRC Press, 2015, xx + 753 pp., $119.95 (H), ISBN: 978-1-4665-5188-6.

The Handbook of Cluster Analysis provides a readable and fairly thorough overview of the highly interdisciplinary and growing field of cluster analysis. The editors rose to the challenge of the Handbook of Modern Statistical Methods series to balance well-developed methods with state-of-the-art research. The book is a collection of papers about how to find groups within data, each written by prominent researchers from computer science, statistics, data science, and elsewhere. Some chapters are application-driven while others focus solely on theory. The editors bookend the text with a solid overview and history of the literature at the beginning, to help newcomers navigate the rest of the handbook, and practical strategies at the end, to help a practitioner choose among the competing methods.

Organized into six sections, the handbook devotes four of these to presenting numerous clustering methods and algorithms, categorized into commonly accepted frameworks: optimization methods, dissimilarity-based methods, methods based on probability models, and methods based on density estimation. The last two sections cover extensions of the standard methods to specific cluster and data formats, as well as approaches to cluster validation and other general issues.

Section 1 discusses optimization methods, which generally revolve around the K-means algorithm as it is still one of the most popular clustering methods. The first chapter of the section is quite interesting and sheds new light on the old by discussing the traditional square-error criterion within three different theoretical frameworks. The next chapter outlines the K-medoids algorithm as a robust alternative to the standard K-means. The last chapter covers the theory of center-based clustering in detail with theorems and proofs and introduces the probability-based mixture model, which is one of the models covered in Section 3.
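
For readers who want to see the square-error criterion in action, a minimal sketch using scikit-learn's KMeans is given below; this is an illustration added for this review (the toy data, library, and parameter choices are assumptions), not code from the handbook.

```python
# Minimal K-means sketch (illustrative only; not from the handbook).
# The "inertia" reported by scikit-learn is exactly the square-error
# criterion: the sum of squared distances from each point to its center.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs as toy data.
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(5, 1, size=(100, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.inertia_)          # value of the square-error criterion
print(km.cluster_centers_)  # the two estimated centers
```

Replacing the centers by medoids, as in the K-medoids chapter, changes only the choice of representative points, not the overall structure of the algorithm.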

Section 2 provides a clear overview of dissimilarities, metrics, and method specifications in the context of hierarchical clustering with a useful table that directly compares hierarchical methods. The section ends with the theory and application of spectral clustering as a more flexible method for clustering data.
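
To make the contrast concrete, the following sketch runs an average-linkage hierarchical clustering and a spectral clustering on the same toy data; SciPy and scikit-learn are used here as convenient stand-ins chosen for this review, not as the chapters' own software.

```python
# Hierarchical clustering: dissimilarity matrix + linkage rule -> dendrogram,
# which is then cut into k groups. Spectral clustering instead works from an
# affinity (similarity) matrix, which is what makes it more flexible.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(6, 1, size=(50, 2))])

Z = linkage(pdist(X, metric="euclidean"), method="average")
hier_labels = fcluster(Z, t=2, criterion="maxclust")

spec_labels = SpectralClustering(n_clusters=2, affinity="rbf",
                                 random_state=0).fit_predict(X)
```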

Section 3 focuses on probability models and is likely the section most accessible to statisticians and biostatisticians. After the basic latent variable model (the finite mixture model) is covered from both frequentist and Bayesian perspectives, the remaining chapters discuss variations of this model for different types of data: categorical data; data collected over time, over space, or more generally as a function over time; and network data. This is particularly helpful for practitioners. Additionally, one chapter of this section focuses solely on significance testing when evaluating clustering results.
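
As a small illustration of the two perspectives, the sketch below fits a finite Gaussian mixture by maximum likelihood (EM) and a variational Bayesian mixture that lets redundant components vanish; scikit-learn is used here purely as an illustration chosen for this review, not as the chapters' own tools.

```python
# Frequentist and (variational) Bayesian fits of a finite Gaussian mixture.
import numpy as np
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture

rng = np.random.default_rng(2)
x = np.vstack([rng.normal(-2, 0.5, size=(150, 1)),
               rng.normal(3, 1.0, size=(150, 1))])

# Maximum likelihood via EM; BIC can guide the choice of the number of groups.
gm = GaussianMixture(n_components=2, random_state=0).fit(x)
print(gm.means_.ravel(), gm.bic(x))

# Bayesian counterpart: components with negligible posterior weight are
# effectively switched off, so the number of clusters is inferred.
bgm = BayesianGaussianMixture(
    n_components=5,
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(x)
print(bgm.weights_.round(3))
```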

Section 4 explores the approach of defining clusters as “regions of high density separated from other such regions by regions of low density.” The first chapter presents a way to build clusters based on the peaks of the density estimate. The next chapter thoroughly discusses a mean-shift algorithm based on kernel density estimation, directly comparing different approaches. Lastly, the section connects algorithmic clustering with nature-inspired approaches such as self-organizing maps, simulated annealing, swarm optimization, and ant colony dynamics.
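
A minimal mean-shift sketch, using scikit-learn as a stand-in (the library and settings are this review's assumptions, not the chapter's):

```python
# Mean shift: each point climbs the kernel density estimate until it reaches
# a mode; points arriving at the same mode form one cluster.
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.7, size=(120, 2)),
               rng.normal(5, 0.7, size=(120, 2))])

# The bandwidth plays the role of the kernel width in the density estimate.
bw = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bw).fit(X)
print(ms.cluster_centers_)  # estimated density peaks
```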

Section 5 includes an assortment of other clustering methods that do not fit into the standard categories, including semisupervised, consensus, fuzzy, and rough set clustering as well as two-mode partitioning. Semisupervised clustering incorporates side information (e.g., pairwise constraints, triplet constraints, class labels) into the grouping of units to improve the final clustering. Consensus clustering is an ensemble method that combines multiple partitions of a dataset. Fuzzy clustering allows units to belong to more than one cluster at a time, in contrast to hard partitioning. Two-mode partitioning clusters both the rows and columns of a rectangular data matrix. The last chapter reviews methods for an interesting type of data, symbolic data, which “aims to represent variability intrinsic to the entities under analysis.” In this case, the entities to be grouped represent multiple individual units; the example presented concerns clustering car models, where each car model (e.g., Toyota Camry) represents many individual cars that all vary slightly in their physical characteristics.
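
As one concrete example from this assortment, the sketch below implements a simple co-association form of consensus clustering; the particular construction and libraries are this review's choice rather than the chapter's own formulation. Several base partitions are combined by recording how often each pair of points is grouped together, and that similarity is then clustered once more.

```python
# Consensus clustering via a co-association matrix (illustrative sketch).
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, size=(60, 2)),
               rng.normal(5, 1, size=(60, 2))])
n, n_runs = X.shape[0], 20

# Base partitions: K-means with different random starts.
coassoc = np.zeros((n, n))
for seed in range(n_runs):
    labels = KMeans(n_clusters=2, n_init=1, random_state=seed).fit_predict(X)
    coassoc += labels[:, None] == labels[None, :]
coassoc /= n_runs

# Turn co-association into a dissimilarity and cut a dendrogram at k = 2.
dist = 1.0 - coassoc
np.fill_diagonal(dist, 0.0)
consensus = fcluster(linkage(squareform(dist), method="average"),
                     t=2, criterion="maxclust")
```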

Section 6 presents numerous criteria to quantify the validity of a clustering result. The authors first discuss internal criteria, which are used when there is no known true grouping, and then external criteria, which compare the clustering results to a known gold standard. They also discuss evaluating the final clustering based on its stability and robustness under small perturbations or outliers. The handbook wraps up with useful strategies for choosing a method as well as methods for graphically exploring groups within data. These final remarks repeat and emphasize concepts and issues from earlier chapters.
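
For concreteness, the sketch below computes one internal index (silhouette width) and one external index (the adjusted Rand index) with scikit-learn; these specific indices and the library are chosen by this review and merely stand in for the broader collections of criteria the section surveys.

```python
# Internal vs. external cluster validation indices (illustrative sketch).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, size=(80, 2)),
               rng.normal(4, 1, size=(80, 2))])
true_labels = np.repeat([0, 1], 80)

pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Internal criterion: uses only the data and the partition.
print("silhouette:", silhouette_score(X, pred))
# External criterion: compares the partition to a known gold standard.
print("ARI:", adjusted_rand_score(true_labels, pred))
```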

Overall, the handbook is a thorough reference for past and present work. It gives the reader a general overview of the field, which is of great value since the work crosses many disciplinary boundaries. The numerous clustering methods are organized to help researchers find the relevant chapters and references therein. Depending on the reader’s mathematical or computational background, some chapters may be inaccessible on a first read-through and would have been strengthened by more real-data and code examples. Although the variation in tone and technical level between chapters reflects the diversity of research perspectives in cluster analysis, it does make the text difficult to read cover to cover. That said, the book could certainly serve as a reference to dip into as needed.

Brianna C. Heggeseth

Williams College

Mixture Model-Based Classification. Paul D. McNicholas. Boca Raton, FL: Chapman & Hall/CRC Press, 2016, xxiv + 212 pp., $89.95 (H), ISBN: 978-1-4822-2566-2.

This monograph is an extensive introduction to mixture models with applications in classification and clustering. Model-based approaches to classification and clustering have become an important research topic in statistics and related fields such as computer science and artificial intelligence. This monograph is among the first books to provide a systematic introduction to the topic.

Even though only classification is mentioned in the title, the book in fact covers classification, clustering, and semisupervised classification. After a brief introduction in Chapter 1, Chapter 2 introduces the classical Gaussian mixture model. The expectation-maximization (EM) algorithm is needed in the unsupervised and semisupervised settings and, hence, is carefully explained in this chapter. The chapter also introduces several Gaussian parsimonious clustering models with different assumptions on the covariance matrices. In Chapter 3, Gaussian mixtures are generalized to mixtures of factor analyzers and their extensions, such as mixtures of common factor analyzers.
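
To give a flavor of the algorithm the chapter explains, here is a bare-bones EM iteration for a two-component univariate Gaussian mixture; it is a sketch written for this review (the data and initial values are arbitrary) and not the book's own code.

```python
# Minimal EM for a two-component univariate Gaussian mixture (illustration).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

# Initial guesses: mixing proportions, means, standard deviations.
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

for _ in range(100):
    # E-step: posterior probability (responsibility) of each component.
    dens = pi * norm.pdf(x[:, None], loc=mu, scale=sigma)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: weighted maximum-likelihood updates.
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(pi.round(2), mu.round(2), sigma.round(2))
```

The parsimonious families mentioned above modify essentially only the M-step, imposing constraints on the component covariance matrices.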

Chapter 4 focuses on high-dimensional data and is my favorite part of the book. High-dimensional analysis has become a main focus in statistics in recent years. The chapter classifies dimension reduction methods into two categories: mapping the data to a lower-dimensional space, or selecting a subset of the candidate variables. In addition to the mixtures of factor analyzers introduced in the previous chapter, the author presents two other methods in the first category, Gaussian mixture modeling and dimension reduction (GMMDR) and Gaussian mixture modeling for high-dimensional data (HD-GMM). For the second category, the author introduces a number of methods including the LASSO-penalized BIC, variable selection for clustering and classification (VSCC), the clustvarsel package for R, and the selvarclust software.
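
The first strategy amounts to reducing the dimension first and clustering afterward. The sketch below chains a principal component projection with a Gaussian mixture; it is a generic stand-in assumed for this review, not an implementation of GMMDR or HD-GMM.

```python
# Dimension reduction followed by model-based clustering (generic sketch).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(7)
# 50-dimensional data in which only the first three variables separate groups.
X1 = rng.normal(0, 1, size=(100, 50))
X2 = rng.normal(0, 1, size=(100, 50))
X2[:, :3] += 3.0
X = np.vstack([X1, X2])

pipe = make_pipeline(PCA(n_components=5),
                     GaussianMixture(n_components=2, random_state=0))
labels = pipe.fit(X).predict(X)
```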

Chapter 5 considers mixtures of distributions with varying tail weight, such as mixtures of multivariate t-distributions and mixtures of power exponential distributions. Chapters 6 and 7 give an account of mixtures of skewed distributions as well as mixtures of distributions that parameterize both skewness and concentration. Chapters 8 and 9 cover miscellaneous topics, including longitudinal data, robust clustering, and clustering methods for categorical data.

The author has done a good job of organizing the material in a natural way and of presenting methods and algorithms in great detail. Moreover, many case studies help the reader understand and appreciate the methodologies presented.

On the other hand, the book would have been strengthened by some additional material. First, I would have expected an introductory book like this to include theoretical analysis of clustering and classification, in particular theoretical properties of the EM algorithm for mixture models, such as those studied by Xu and Jordan (1996). Second, the book puts great emphasis on distributions with varying concentration and skewness (Chapters 5, 6, and 7), which may not be practical for high-dimensional data. Finally, brief discussions of models for clustering on networks, such as stochastic blockmodels (Holland, Laskey, and Leinhardt 1983) and latent position cluster models (Handcock, Raftery, and Tantrum 2007), would attract more advanced readers.

Yunpeng Zhao

George Mason University
