Abstract
In model-based clustering of complex data, a probability model, typically a finite mixture probability model, forms the basis of the distance measure between any pair of clusters. The idea of model-based clustering was popularized by the framework and accompanying software of Fraley and Raftery (2002). In particular, model-based agglomerative hierarchical clustering is now a frequently used approach for probabilistic grouping of data, owing to its speed and simplicity of implementation. This article investigates deficiencies in the clusterings produced by this popular approach, and reviews small refinements and extensions to the procedure, which offer differing performance gains at differing computational costs. The improvements are illustrated through application to simulated and real data examples, including the clustering of gene expression time profiles. Some of the proposed improvements to agglomerative clustering are, like the procedure itself in its usual form, deterministic; perhaps surprisingly, though, the best overall results here are obtained via a stochasticized version of the entire procedure. While the focus of this article is probability model-based clustering, many of the schemes presented are equally applicable to agglomerative clustering under any distance measure.
The simulated data from this article, along with the C++ code used to implement the algorithms for all of the examples, are available online in the Supplemental Material.