Original Articles

Estimating the Number of Clusters Using Cross-Validation

Pages 162-173 | Received 09 Feb 2017, Accepted 17 Jul 2019, Published online: 30 Sep 2019

References

  • Ben-Hur, A., Elisseeff, A., and Guyon, I. (2001), “A Stability Based Method for Discovering Structure in Clustered Data,” in Pacific Symposium on Biocomputing (Vol. 7), pp. 6–17.
  • Caliński, T., and Harabasz, J. (1974), “A Dendrite Method for Cluster Analysis,” Communications in Statistics—Theory and Methods, 3, 1–27. DOI: 10.1080/03610927408827101.
  • Charrad, M., Ghazzali, N., Boiteau, V., and Niknafs, A. (2014), “NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set,” Journal of Statistical Software, 61, 1–36. DOI: 10.18637/jss.v061.i06.
  • Chiang, M. M.-T., and Mirkin, B. (2010), “Intelligent Choice of the Number of Clusters in k-Means Clustering: An Experimental Study With Different Cluster Spreads,” Journal of Classification, 27, 3–40. DOI: 10.1007/s00357-010-9049-5.
  • Cho, R. J., Campbell, M. J., Winzeler, E. A., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T. G., Gabrielian, A. E., Landsman, D., Lockhart, D. J., and Davis, R. W. (1998), “A Genome-Wide Transcriptional Analysis of the Mitotic Cell Cycle,” Molecular Cell, 2, 65–73. DOI: 10.1016/S1097-2765(00)80114-8.
  • Dortet-Bernadet, J.-L., and Wicker, N. (2008), “Model-Based Clustering on the Unit Sphere With an Illustration Using Gene Expression Profiles,” Biostatistics, 9, 66–80. DOI: 10.1093/biostatistics/kxm012.
  • Fang, Y., and Wang, J. (2012), “Selection of the Number of Clusters via the Bootstrap Method,” Computational Statistics & Data Analysis, 56, 468–477. DOI: 10.1016/j.csda.2011.09.003.
  • Fraley, C., and Raftery, A. E. (2002), “Model-Based Clustering, Discriminant Analysis, and Density Estimation,” Journal of the American Statistical Association, 97, 611–631. DOI: 10.1198/016214502760047131.
  • Fujita, A., Takahashi, D. Y., and Patriota, A. G. (2014), “A Non-Parametric Method to Estimate the Number of Clusters,” Computational Statistics & Data Analysis, 73, 27–39. DOI: 10.1016/j.csda.2013.11.012.
  • Gabriel, K. R. (2002), “Le Biplot–Outil d’Exploration de Données Multidimensionnelles,” Journal de la Société Française de Statistique, 143, 5–55.
  • Hartigan, J. A. (1975), Clustering Algorithms, New York: Wiley.
  • Hartigan, J. A., and Wong, M. A. (1979), “Algorithm AS 136: A k-Means Clustering Algorithm,” Journal of the Royal Statistical Society, Series C, 28, 100–108. DOI: 10.2307/2346830.
  • Haslbeck, J. M. B., and Wulff, D. U. (2018), “cstab: Selection of Number of Clusters via Normalized Clustering Instability,” R Package Version 0.2-2.
  • Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics (2nd ed.), New York: Springer.
  • Hennig, C. (2015), “fpc: Flexible Procedures for Clustering,” R Package Version 2.1-10.
  • Jain, A. K. (2010), “Data Clustering: 50 Years Beyond k-Means,” Pattern Recognition Letters, 31, 651–666. DOI: 10.1016/j.patrec.2009.09.011.
  • Jain, A. K., Murty, M. N., and Flynn, P. J. (1999), “Data Clustering: A Review,” ACM Computing Surveys, 31, 264–323. DOI: 10.1145/331499.331504.
  • Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., and Hornik, K. (2018), “cluster: Cluster Analysis Basics and Extensions,” R Package Version 2.0.7-1.
  • Mangasarian, O. L., Setiono, R., and Wolberg, W. (1990), “Pattern Recognition via Linear Programming: Theory and Application to Medical Diagnosis,” in Large-Scale Numerical Optimization, Philadelphia, PA: SIAM, pp. 22–31.
  • Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., and Leisch, F. (2017), “e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien,” R Package Version 1.6-8.
  • Owen, A. B., and Perry, P. O. (2009), “Bi-Cross-Validation of the SVD and the Nonnegative Matrix Factorization,” The Annals of Applied Statistics, 3, 564–594. DOI: 10.1214/08-AOAS227.
  • Pollard, D. (1981), “Strong Consistency of k-Means Clustering,” The Annals of Statistics, 9, 135–140. DOI: 10.1214/aos/1176345339.
  • Pomeroy, S. L., Tamayo, P., Gaasenbeek, M., Sturla, L. M., Angelo, M., McLaughlin, M. E., Kim, J. Y., Goumnerova, L. C., Black, P. M., Lau, C., and Allen, J. C. (2002), “Prediction of Central Nervous System Embryonal Tumour Outcome Based on Gene Expression,” Nature, 415, 436–442. DOI: 10.1038/415436a.
  • R Core Team (2018), R: A Language and Environment for Statistical Computing, Vienna, Austria: R Foundation for Statistical Computing.
  • Schlimmer, J. C. (1987), “Concept Acquisition Through Representational Adjustment,” PhD thesis, Department of Information and Computer Science, University of California, Irvine.
  • Scrucca, L., Fop, M., Murphy, T. B., and Raftery, A. E. (2016), “mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models,” The R Journal, 8, 205–233. DOI: 10.32614/RJ-2016-021.
  • Sugar, C. A., and James, G. M. (2003), “Finding the Number of Clusters in a Dataset,” Journal of the American Statistical Association, 98, 750–763. DOI: 10.1198/016214503000000666.
  • Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J., and Church, G. M. (1999), “Systematic Determination of Genetic Network Architecture,” Nature Genetics, 22, 281–285. DOI: 10.1038/10343.
  • Tibshirani, R., and Walther, G. (2005), “Cluster Validation by Prediction Strength,” Journal of Computational and Graphical Statistics, 14, 511–528. DOI: 10.1198/106186005X59243.
  • Tibshirani, R., Walther, G., and Hastie, T. (2001), “Estimating the Number of Clusters in a Data Set via the Gap Statistic,” Journal of the Royal Statistical Society, Series B, 63, 411–423. DOI: 10.1111/1467-9868.00293.
  • Venables, W. N., and Ripley, B. D. (2002), Modern Applied Statistics With S (4th ed.), New York: Springer, ISBN 0-387-95457-0.
  • Wang, J. (2010), “Consistent Selection of the Number of Clusters via Crossvalidation,” Biometrika, 97, 893–904. DOI: 10.1093/biomet/asq061.
  • Wickham, H. (2016), ggplot2: Elegant Graphics for Data Analysis, New York: Springer-Verlag.
  • Wilson, E. B. (1927), “Probable Inference, the Law of Succession, and Statistical Inference,” Journal of the American Statistical Association, 22, 209–212. DOI: 10.1080/01621459.1927.10502953.
  • Wold, S. (1978), “Cross-Validatory Estimation of the Number of Components in Factor and Principal Components Models,” Technometrics, 20, 397–405. DOI: 10.1080/00401706.1978.10489693.
