Abstract
Many clustering methods, including k-means, require the user to specify the number of clusters as an input parameter. A variety of methods have been devised to choose the number of clusters automatically, but they often rely on strong modeling assumptions. This article proposes a data-driven approach to estimate the number of clusters based on a novel form of cross-validation. The proposed method differs from ordinary cross-validation because clustering is fundamentally an unsupervised learning problem. Simulations and real-data analyses show that the proposed method outperforms existing methods, especially in high-dimensional settings with heterogeneous or heavy-tailed noise. In a yeast cell cycle dataset, the proposed method finds a parsimonious clustering with interpretable gene groupings. Supplementary materials for this article are available online.
Acknowledgments
We thank Rob Tibshirani for getting us started on this problem and for providing code for some initial simulations. We thank Art Owen for providing us with a summary of the relevant theory on k-means clustering and for giving us feedback on our theoretical results. We also thank Cliff Hurvich, Josh Reed, and Jeff Simonoff for providing comments on an early draft of this article and for suggesting further avenues of inquiry.