Abstract
Finding the number of clusters in a data set is considered one of the fundamental problems in cluster analysis. This paper integrates maximum clustering similarity (MCS), a method for finding the optimal number of clusters, into the R statistical software through the package MCSim. The similarity between two clustering methods is calculated at the same number of clusters, using the Rand [Objective criteria for the evaluation of clustering methods. J Am Stat Assoc. 1971;66:846–850.] and Jaccard [The distribution of the flora of the alpine zone. New Phytologist. 1912;11:37–50.] indices, corrected for chance agreement. The number of clusters at which the index most frequently attains its maximum is a candidate for the optimal number of clusters. Unlike other criteria, MCS can be used with circular data. Seven clustering algorithms available in R are implemented in MCSim. The package produces a graph of the number of clusters versus cluster similarity based on the corrected similarity indices, along with the values of the similarity indices and a clustering tree (dendrogram). Several examples, including simulated, real, and circular data sets, are presented to show how MCSim works successfully in practice.
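The selection rule described in the abstract can be illustrated with a minimal sketch. It is written in Python rather than R and does not use or reproduce the MCSim package. It assumes the Hubert-Arabie adjusted Rand index as the chance-corrected similarity, which may differ from the correction the package applies, and the function names adjusted_rand and mcs_candidate are hypothetical. For every pair of clustering methods, the sketch finds the number of clusters at which the two partitions agree most, then returns the value chosen most often.

```python
from collections import Counter
from itertools import combinations
from math import comb


def adjusted_rand(a, b):
    """Chance-corrected Rand index (Hubert-Arabie form) for two labelings.

    Assumption: this correction stands in for the one used by the paper.
    """
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("labelings must have equal length of at least 2")
    # Count object pairs placed together in both, in a, and in b.
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:
        # Degenerate partitions (e.g. a single cluster in both): by
        # convention, treat them as perfect agreement.
        return 1.0
    return (pairs_joint - expected) / (max_index - expected)


def mcs_candidate(partitions):
    """Return the number of clusters that most often maximizes similarity.

    partitions: {method_name: {k: labels}} (hypothetical input layout).
    """
    votes = Counter()
    for m1, m2 in combinations(sorted(partitions), 2):
        shared_k = sorted(set(partitions[m1]) & set(partitions[m2]))
        best_k = max(
            shared_k,
            key=lambda k: adjusted_rand(partitions[m1][k], partitions[m2][k]),
        )
        votes[best_k] += 1
    return votes.most_common(1)[0][0]
```

In this sketch, ties are broken in favor of the smaller number of clusters, because candidate values are scanned in increasing order. That is a design choice of the illustration, not a documented behavior of MCSim.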
Disclosure statement
No potential conflict of interest was reported by the authors.
ORCID
Ahmed N. Albatineh http://orcid.org/0000-0001-5646-4945
Additional information
Notes on contributors
Ahmed N. Albatineh
Ahmed N. Albatineh is currently an associate professor of Biostatistics in the Department of Community Medicine and Behavioral Sciences within the Faculty of Medicine at Kuwait University. He received a Bachelor of Science in Mathematics from Yarmouk University in Jordan, and a Master of Science in Operations Research, a Master of Science in Applied Statistics, and a PhD in Statistics, all from Western Michigan University in Kalamazoo, Michigan, USA. He taught at Nova Southeastern University and Florida International University. His research interests are in cluster analysis, statistical computation, and the application of statistics in the health sciences.
Meredith L. Wilcox
Meredith L. Wilcox is the Director of Project and Quality Management at Midwest Biomedical Research. In this role, she manages clinical trials from the start-up phase to study completion. She also oversees the conduct and quality of nutrition and pharmaceutical trials at the site level at MB Clinical Research. Meredith is currently transitioning to a statistician role at Midwest Biomedical Research. Meredith holds a Bachelor of Science in Statistics and a Master of Public Health (MPH) with a specialization in Biostatistics.
Bashar Zogheib
Bashar Zogheib received his PhD in Mathematics from the University of Windsor, Ontario, Canada in 2006, after receiving two Master's degrees, in Statistics and in Mathematics, from the same university. He also received a third Master's degree, in Mathematics Education, from Wayne State University, Michigan, USA. His research and numerous peer-reviewed publications focus primarily on numerical solutions for partial differential equations, computational fluid dynamics, applied statistics, and mathematics education. He previously taught at the University of Windsor in Canada, Millersville University of Pennsylvania, and Nova Southeastern University in Florida. Currently, he is the Associate Dean for Administration for the College of Arts and Sciences and a Professor of Mathematics at the American University of Kuwait.
Magdalena Niewiadomska-Bugaj
Magdalena Niewiadomska-Bugaj is professor and chair of the Department of Statistics at Western Michigan University in Kalamazoo, Michigan, USA. Her research interests include classification, categorical data, methodology for zero-inflated data, and association modeling.