Original Articles

A New Variable Weighting and Selection Procedure for K-means Cluster Analysis

Pages 77-108 | Published online: 19 Mar 2008
 

Abstract

A variance-to-range ratio variable weighting procedure is proposed. We show how this weighting method is theoretically grounded in the inherent variability found in data exhibiting cluster structure. In addition, a variable selection procedure is proposed to operate in conjunction with the variable weighting technique. The performances of these procedures are demonstrated in a simulation study, showing favorable results when compared with existing standardization methods. A detailed demonstration of the weighting and selection procedure is provided for the well-known Fisher Iris data and several synthetic data sets.

Notes

Note that scaling the variables using RC j precludes comparison of RC j values across data sets (each data set will always have an RC j = 1). However, if so desired, variables could be compared across data sets using the 1 M j values. If employing this strategy, we recommend observing the standard cautions for comparing variables that were measured on different entities.

* = Best method.

2 The proof by Steinley and Henson (2005) is straightforward. Assuming the variables are independent, the marginal probability of overlap is defined as a specific integral on the variable of interest. The independence assumption allows the overlap of the joint distribution to be calculated as the product of the overlaps of the marginal distributions. Thus, higher marginal overlap leads to higher joint overlap. For example, three variables each having a marginal probability of overlap of .10 would have a joint probability of overlap of (.10)^3 = .001, whereas the same three variables each having a marginal probability of overlap of .25 would have a joint probability of overlap of (.25)^3 ≈ .016, about 16 times that of the previous condition.
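The arithmetic in this note can be checked with a minimal sketch. It assumes only what the note states, namely that under independence the joint overlap is the product of the marginal overlaps; the function name `joint_overlap` is a hypothetical helper, not something defined in the article.

```python
from math import prod


def joint_overlap(marginal_overlaps):
    """Joint probability of overlap for independent variables.

    Assumption (from the note): independence lets the joint overlap be
    computed as the product of the marginal overlaps.
    """
    return prod(marginal_overlaps)


# Three variables with marginal overlap .10 versus three with .25.
low = joint_overlap([0.10, 0.10, 0.10])
high = joint_overlap([0.25, 0.25, 0.25])
ratio = high / low

print(f"low={low:.6f} high={high:.6f} ratio={ratio:.3f}")
```

The unrounded product for the second condition is 0.015625, so the ratio of the two joint overlaps is a little under 16.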

* p ≤ .0001, two-tailed.

* = < .01.

3 Before employing this standardization technique, the user should carefully consider the nature of each variable, as it can influence the variable's RC value. As can be seen from Equation 10, a binary variable with an equal number of 0's and 1's would have the greatest RC value possible; however, as the numbers of 0's and 1's become increasingly different, it becomes possible for the RC value of a binary variable to be lower than that of a continuous variable. Nonetheless, the user should be aware of the potential overweighting of discrete variables (discrete variables with the fewest categories having the most potential for overweighting) when analyzing data sets with mixed variable types. To protect against this problem, the variable selection procedure should be implemented (see Brusco, 2004, for a similar implementation that worked very effectively in the presence of binary variables).
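The behavior described in this note can be illustrated with a rough sketch. Equation 10 is not reproduced here, so the code uses a plain variance-to-squared-range ratio as a stand-in; that choice, the helper name `variance_to_range_ratio`, and the three example variables are all assumptions made for illustration and may differ from the article's exact index. The stand-in is bounded above by .25 (Popoviciu's inequality on variances), and a binary variable split evenly between 0 and 1 attains that bound.

```python
from statistics import pvariance


def variance_to_range_ratio(values):
    """Population variance divided by the squared range.

    Assumption: an illustrative stand-in for a variance-to-range index,
    not necessarily the article's Equation 10.
    """
    spread = max(values) - min(values)
    if spread == 0:
        raise ValueError("variable is constant; ratio is undefined")
    return pvariance(values) / spread ** 2


# Hypothetical example variables, each with 100 or more observations.
balanced_binary = [0] * 50 + [1] * 50          # equal numbers of 0's and 1's
skewed_binary = [0] * 95 + [1] * 5             # very unequal split
continuous = [i / 1000 for i in range(1001)]   # evenly spread over [0, 1]

for name, var in [
    ("balanced binary", balanced_binary),
    ("skewed binary", skewed_binary),
    ("continuous", continuous),
]:
    print(f"{name}: {variance_to_range_ratio(var):.4f}")
```

Under this stand-in, the balanced binary variable reaches the maximum ratio, while the heavily skewed binary variable falls below the evenly spread continuous one, which matches the pattern the note warns about.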

a The best solution for each subset size.
