Abstract
A variance-to-range ratio variable weighting procedure is proposed. We show how this weighting method is theoretically grounded in the inherent variability found in data exhibiting cluster structure. In addition, a variable selection procedure is proposed to operate in conjunction with the variable weighting technique. The performance of these procedures is demonstrated in a simulation study, which shows favorable results compared with existing standardization methods. A detailed demonstration of the weighting and selection procedure is provided for the well-known Fisher Iris data and several synthetic data sets.
Notes
1 Note that scaling the variables by $RC_j$ precludes comparison of $RC_j$ values across data sets (each data set will always have an $RC_j = 1$). However, if so desired, variables could be compared across data sets using the $M_j$ values. If employing this strategy, we recommend the standard cautions for comparing variables that were measured on different entities.
* = Best method.
2 The proof by Steinley and Henson (2005) is straightforward. Assuming the variables are independent, the marginal probability of overlap is defined as a specific integral on the variable of interest. The independence assumption allows the overlap of the joint distribution to be calculated as the product of the marginal overlaps. Thus, higher marginal overlap leads to higher joint overlap. For example, three variables each having a marginal probability of overlap of .10 would have a joint probability of overlap of $(.10)^3 = .001$, whereas the same three variables each having a marginal probability of overlap of .25 would have a joint probability of overlap of $(.25)^3 \approx .016$, about 16 times that of the previous condition.
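The arithmetic in this note can be checked with a few lines of Python. This is only a minimal sketch of the product rule under the independence assumption; the function name `joint_overlap` is hypothetical and is not taken from Steinley and Henson (2005).

```python
import math


def joint_overlap(marginal_overlaps):
    """Joint probability of overlap, assuming independent variables.

    Under independence the joint overlap is the product of the
    marginal overlaps, one factor per variable.
    """
    return math.prod(marginal_overlaps)


# Three variables, each with marginal overlap .10 or .25 (values from the note).
low = joint_overlap([0.10, 0.10, 0.10])
high = joint_overlap([0.25, 0.25, 0.25])
ratio = high / low

print(f"joint overlap at .10: {low:.6f}")
print(f"joint overlap at .25: {high:.6f}")
print(f"ratio: {ratio:.1f}")
```

Because each additional variable contributes another factor below 1, a modest increase in marginal overlap compounds quickly in the joint overlap.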
* p ≤ .0001, two-tailed.
* p < .01.
3 Before employing this standardization technique, the user should carefully consider the nature of each variable, as it can influence the RC values. As can be seen from Equation 10, a binary variable with an equal number of 0's and 1's would have the greatest RC value possible; however, as the numbers of 0's and 1's become increasingly different, it becomes possible for the RC value of a binary variable to be lower than that of a continuous variable. Nonetheless, the user should be aware of the potential overweighting of discrete variables (those with the fewest categories having the most potential for overweighting) when analyzing data sets with mixed data types. To protect against this problem, the variable selection procedure should be implemented (see Brusco, 2004, for a similar implementation that worked very effectively in the presence of binary variables).
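The behavior of binary variables described in this note can be illustrated numerically. Equation 10 is not reproduced in these notes, so the sketch below rests on assumptions: it takes the variance-to-range ratio to be 12 × variance / range² (using the population variance) and takes RC to be each variable's ratio divided by the smallest ratio in the data set, which is consistent with every data set having an RC of 1. The function names and the example data are hypothetical.

```python
from statistics import pvariance


def variance_to_range(values):
    """Assumed form of the variance-to-range ratio: 12 * var / range**2."""
    value_range = max(values) - min(values)
    if value_range == 0:
        raise ValueError("constant variable: range is zero")
    return 12.0 * pvariance(values) / value_range ** 2


def relative_clusterability(columns):
    """Assumed RC scaling: each ratio divided by the smallest ratio."""
    ratios = [variance_to_range(col) for col in columns]
    smallest = min(ratios)
    return [m / smallest for m in ratios]


# Hypothetical example variables.
balanced_binary = [0] * 50 + [1] * 50    # equal numbers of 0's and 1's
unbalanced_binary = [0] * 95 + [1] * 5   # very unequal split
continuous = [i / 100 for i in range(101)]  # evenly spread on [0, 1]

columns = [balanced_binary, unbalanced_binary, continuous]
names = ["balanced binary", "unbalanced binary", "continuous"]

for name, col in zip(names, columns):
    print(name, round(variance_to_range(col), 3))

print([round(rc, 3) for rc in relative_clusterability(columns)])
```

Under these assumptions the balanced binary variable attains the largest ratio, while the heavily unbalanced binary variable falls below the evenly spread continuous one, matching the pattern the note describes.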
a The best solution for each subset size.