
MANOVA, LDA, and FA criteria in clusters parameter estimation

Article: 1071013 | Received 23 Apr 2015, Accepted 04 Jul 2015, Published online: 10 Aug 2015

Abstract

Multivariate analysis of variance (MANOVA) and linear discriminant analysis (LDA) apply such well-known criteria as Wilks’ lambda, the Lawley–Hotelling trace, and Pillai’s trace for checking the quality of solutions. The current paper suggests using these criteria for building objectives for estimating cluster parameters, because optimizing such objectives corresponds to the best distinguishing between the clusters. The relation to Joreskog’s classification of factor analysis (FA) techniques is also considered. The problem can be reduced to the multinomial parameterization, and a solution can be found by a nonlinear optimization procedure which yields estimates for the cluster centers and sizes. This approach to clustering works with data compressed into the covariance matrix, so it can be especially useful for big data.

Public Interest Statement

The paper describes several main multivariate statistical techniques, such as multivariate analysis of variance, linear discriminant analysis, and factor analysis, in relation to cluster analysis. It shows that the criteria of quality of solutions known in those techniques can be used for data clustering as well. These criteria are employed to find cluster centers and sizes, because optimizing such objectives corresponds to the best distinguishing between clusters. The problem is expressed via the multinomial parameterization of the cluster characteristics, and a solution can be found by an optimization procedure yielding estimates for the clusters. This approach uses only the sample covariance matrix and not the observations themselves, so it can be especially valuable in difficult clustering tasks on big data from databases, data warehouses, and data-mining problems.

1. Introduction

Multivariate analysis of variance (MANOVA) is a well-known generalization of the analysis of variance (ANOVA) extended from one to many dependent variables, and the multivariate analysis of numerical covariates (MANCOVA) is a similar extension of ANCOVA. Linear discriminant analysis (LDA) and MANOVA can be considered as dual techniques: in LDA the independent variables are the predictors (or attributes) and the dependent variables are the groups, while in MANOVA it is vice versa, with the independent variables being the groups (dummy variables identifying cluster membership) and the dependent variables being the attributes. Both these techniques can be presented via canonical correlation analysis (CCA) between two sets of variables, with an additional step of prediction of one set by the other. When the canonical aggregates of the variables are obtained, all the tests for multivariate variables can be reduced to the known tests for one variable. LDA consists of testing the significance of the discriminant functions of the attributes (with additional classification), and MANOVA of testing the significance of differences between the groups’ vectors of means (with additional identification of which of the attributes have different means across the groups). There are various criteria known for testing the quality of LDA and MANOVA solutions (see, for instance, Dillon & Goldstein, 1984; Härdle & Simar, 2012; Izenman, 2008; Timm, 1975).

The current paper considers possibilities to apply these criteria for clustering purposes. Indeed, if such criteria permit testing the quality of LDA and MANOVA solutions, they can also be optimized to obtain the best distinguishing between the clustering groups. As is well known in multivariate statistical analysis, the total (T) variance–covariance matrix can be presented as the sum T = B + W of the so-called “Between” (B) and “Within” (W) matrices, defined by the variance–covariance between and within the groups, respectively. The loadings for aggregation of the attributes in LDA or MANOVA are commonly found by maximizing a criterion of the quotient of the between-to-within quadratic forms, which can be presented via the generalized eigenproblem with these matrices. Using its eigenvalues, it is possible to test the quality of LDA and MANOVA solutions. For this aim, the so-called Wilks’ lambda criterion Λ (a multivariate generalization of the F-statistic) compares the determinants (generalized variances) calculated from the within and total variances. Wilks’ Λ varies from 0 to 1: Λ = 0 indicates that the groups’ mean vectors differ, while Λ = 1 shows that the groups’ means are the same. Thus, minimizing an objective of the Wilks’ criterion by the parameters of the cluster centers permits finding them, as well as the group base sizes, which define the best distinguishing between groups. Wilks’ lambda has the so-called U-distribution, which is tabulated, for instance, in Timm (1975). It can also be numerically approximated and presented as a regular F-test.

There are other overall tests of significance known in MANOVA, for instance, Hotelling’s T2 (a multivariate generalization of the t-test for comparing vectors of means for two groups), and its generalization to the Lawley–Hotelling trace criterion. The maximization of this criterion can be used for the clustering problem. A modification of this criterion is given by another criterion widely used in MANOVA, the so-called Pillai’s trace. It can also be used for clustering aims via the corresponding maximized objective. Such criteria are very convenient for the clustering problem because they do not require calculation of each eigenvalue but only their total, which coincides with the trace, or sum of the diagonal elements, of the matrix used in maximization. Practical application of all these tests in LDA and MANOVA is demonstrated, for instance, throughout the monograph by Timm (1975). There are other tests, like Roy’s largest eigenvalue λmax, and more specific ones like Levene’s test for whether the variances between groups are equal, the partial eta-square (similar to the partial F-statistic) for the variance explained by an individual independent variable, and other extensions in multivariate statistics. However, for clustering aims, we are interested in the criteria operating with the totals of the eigenvalues, which can be reduced to some functions of the main variance–covariance matrices T, B, and W.

Contemporary cluster analysis includes a large spectrum of methods developed in the areas of segmentation, pattern recognition, machine learning, data mining, and others (for instance, see Bishop, 2006; Eldén, 2007; Frey & Dueck, 2007; Gan, Ma, & Wu, 2007; Hastie, Friedman, & Tibshirani, 2001; Lipovetsky, 2012; Lipovetsky, Tishler, & Conklin, 2002; Liu & Motoda, 2008; Nowakowska, Koronacki, & Lipovetsky, 2014; Ripley, 1996). Various works suggest different ways to divide data into groups of observations more closely related within each group than between the groups, by variance–covariance matrices (Brusco & Steinley, 2007; DeSarbo, Carroll, Clark, & Green, 1984; Friedman & Meulman, 2004; Heiser & Groenen, 1997; Szekely & Rizzo, 2005). The current paper suggests a new approach based on special objectives corresponding to the MANOVA and LDA criteria, whose optimization guarantees the best quality of distinguishing among the groups as estimated by the same criteria. This problem can be reduced to an optimizing procedure for the nonlinear approximation of the covariance matrix by the total of the outer products of the distances from the cluster centers to the total center. The means and sizes of the clusters can be found using parameterization via multinomial shares, a technique developed and successfully applied for solving various problems in regression, principal components, singular value decomposition, and clustering (Lipovetsky, 2009a, 2009b, 2013a). The relation to factor analysis by the least squares (LS) and generalized least squares (GLS) objectives (Joreskog, 1977; Jöreskog, 1967; Jöreskog & Goldberger, 1972; Lawley & Maxwell, 1971; Maxwell, 1983) is discussed as well. This approach can be especially useful for large data-sets with thousands or millions of observations because it operates not with the original multiple observations but with the data compressed into the covariance matrix.

2. Criteria for finding cluster centers and sizes

Let X denote the N by n matrix with elements $x_{ij}$ (rows i = 1, 2, …, N of observations by the variables in n columns $x_1, x_2, \ldots, x_n$). The elements of the total matrix $S_{\text{tot}}$ of second moments are defined as:

(1) $(S_{\text{tot}})_{jk}=\sum_{i=1}^{N}(x_{ij}-M_j)(x_{ik}-M_k)$

where $M_j$ denotes the mean value of each $x_j$. Let the observations be divided into K subsets, with these clusters numbered as q = 1, 2, …, K, and let each q-th cluster have $N_q$ observations, so their total equals the sample base:

(2) $N_1+N_2+\cdots+N_K=N$

Consider the decomposition of the cross-product (1) into the items related to the data subsets with sizes (2). Such a transformation is known in ANOVA (Ladd, 1966; Lipovetsky & Conklin, 2005) and can be presented as:

(3) $\sum_{i=1}^{N}(x_{ij}-M_j)(x_{ik}-M_k)=\sum_{q=1}^{K}\sum_{i=1}^{N_q}(x_{ijq}-m_{jq})(x_{ikq}-m_{kq})+\sum_{q=1}^{K}N_q(m_{jq}-M_j)(m_{kq}-M_k)$

where $x_{ijq}$ indicates that the i-th observation by the j-th variable belongs to the q-th cluster, and $m_{jq}$ is the mean value of the j-th variable within the q-th cluster. The relation between the subset and total means for each j-th variable is as follows:

(4) $N_1 m_{j1}+N_2 m_{j2}+\cdots+N_K m_{jK}=N M_j,\quad j=1,2,\ldots,n$

where both sides simply express the total for each $x_j$.

The double sum in (3) equals the pooled second moment within each cluster:

(5) $(S_{\text{within}})_{jk}=\sum_{q=1}^{K}(S_{\text{within}}^{q})_{jk}=\sum_{q=1}^{K}\sum_{i=1}^{N_q}(x_{ijq}-m_{jq})(x_{ikq}-m_{kq})$

The last sum in (3) corresponds to the second moment between the cluster means centered by the total means, weighted by the group sizes:

(6) $(S_{\text{between}})_{jk}=\sum_{q=1}^{K}N_q(m_{jq}-M_j)(m_{kq}-M_k)$

So (3) can be presented as the matrix sum:

(7) $S_{\text{tot}}=S_{\text{within}}+S_{\text{between}}=S_{\text{within}}+\sum_{q=1}^{K}N_q(m_q-M)(m_q-M)'$

with the outer products of the vectors of distances from the centers $m_q$ of each q-th cluster to the total center M, where each vector $m_q$ consists of the means $m_{jq}$ over all the variables, and the vector M contains the total means $M_j$. For a given matrix $S_{\text{tot}}$ (7), data clustering corresponds to maximizing the distances between the groups and minimizing them within the groups. If the observations within each q-th cluster collapse to the one point of its center, the elements of the matrix $S_{\text{within}}$ (5) reach zero. Thus, to find clusters, we can minimize the total of the squared elements of the matrix $S_{\text{within}}$, or, in other words, the total of squared differences between the elements of the known matrix $S_{\text{tot}}$ and the unknown matrix $S_{\text{between}}$ in (7):

(8) $F=\left\|S_{\text{tot}}-S_{\text{between}}\right\|^2=\left\|S_{\text{tot}}-\sum_{q=1}^{K}N_q(m_q-M)(m_q-M)'\right\|^2\to\min$

The objective (8) presents the squared Frobenius norm of a matrix (also known as the Hilbert–Schmidt, or Schur, norm). This formulation corresponds to the LS objective for a nonlinear regression model of fitting the values in $S_{\text{tot}}$ by the known vector M and the sets of unknown constants $N_q$ and unknown vectors $m_q$. Estimation of these parameters in the approach of multinomial parameterization is considered in Lipovetsky (2013a, 2013b), where the problem is reduced to nonlinear regression modeling. A brief description of this technique is given in the Appendix.
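For concreteness, the following is a minimal Python sketch verifying the decomposition (7) and evaluating the objective (8) on randomly generated data; the cluster labels and sizes here are illustrative only, not a fitted solution.

```python
# A minimal numerical check of the decomposition (7) and the LS objective (8)
# on randomly generated data; the cluster labels here are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
N, n, K = 150, 4, 3
X = rng.normal(size=(N, n))                    # N observations by n variables
labels = rng.integers(0, K, size=N)            # hypothetical cluster membership

M = X.mean(axis=0)                             # total means M_j
S_tot = (X - M).T @ (X - M)                    # total second-moment matrix (1)

S_within = np.zeros((n, n))
S_between = np.zeros((n, n))
for q in range(K):
    Xq = X[labels == q]
    mq = Xq.mean(axis=0)                       # cluster means m_q
    S_within += (Xq - mq).T @ (Xq - mq)        # pooled within matrix (5)
    d = mq - M
    S_between += len(Xq) * np.outer(d, d)      # weighted between matrix (6)

assert np.allclose(S_tot, S_within + S_between)     # relation (7)
F = np.sum((S_tot - S_between) ** 2)                # objective (8)
print(F, np.linalg.norm(S_within, 'fro') ** 2)      # the two totals coincide
```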

The basic relation (7) is also the fundamental equation for MANOVA and LDA where it is usually presented in one of the following known notations:

(9) $S_{\text{tot}}=S_{\text{within}}+S_{\text{between}}=W+B=E+H$

where $S_{\text{within}} = W = E$ denotes the Within (W), or Error (E), matrix, and $S_{\text{between}} = B = H$ denotes the Between (B), or Hypothesis (H), matrix of second moments. Finding the vectors of loadings, or discriminant functions, $\alpha$ in LDA or MANOVA is commonly performed by maximizing the criterion of the Rayleigh quotient of the between-to-within quadratic forms:

(10) $F=\alpha' H\alpha/\alpha' E\alpha$

which can be presented as the generalized eigenproblem:

(11) $H\alpha=\lambda E\alpha$

with the eigenvalues $\lambda$ of the matrix $E^{-1}H$.

The so-called Wilks’ lambda criterion (a multivariate generalization of the F-statistic) compares the generalized variances within the groups and in the whole data-set:

(12) $\Lambda=\frac{|S_{\text{within}}|}{|S_{\text{tot}}|}=\frac{|E|}{|E+H|}=\frac{|E|}{|E|\cdot|I+E^{-1}H|}=\frac{1}{|I+E^{-1}H|}=\prod_{j=1}^{n}\frac{1}{1+\lambda_j}$

This criterion (12) can be used for finding the groups’ centers. For this aim, it can be rewritten via the means in $S_{\text{between}}$ (6) as the unknown parameters of interest and minimized:

(13) $\Lambda=\frac{|S_{\text{tot}}-S_{\text{between}}|}{|S_{\text{tot}}|}=|S_{\text{tot}}^{-1}|\cdot|S_{\text{tot}}-S_{\text{between}}|=|I-S_{\text{tot}}^{-1}S_{\text{between}}|$

where the numerator involves a minimization similar to that used in (8), and the denominator is the constant determinant of the total variance–covariance matrix. The difference between the LS (8) and Wilks’ (13) criteria consists in using the squared Euclidean norm or the generalized variance given by the determinants, respectively.
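As a sketch, Wilks’ lambda can be computed both ways in (12), assuming within (E) and between (H) matrices such as S_within and S_between from the sketch above:

```python
# Wilks' lambda via the eigenvalue product and via determinants, as in (12);
# E is the within (error) matrix, H the between (hypothesis) matrix.
import numpy as np

def wilks_lambda(E, H):
    lam = np.linalg.eigvals(np.linalg.solve(E, H)).real  # eigenvalues of E^{-1}H
    via_eigen = np.prod(1.0 / (1.0 + lam))               # product form in (12)
    via_det = np.linalg.det(E) / np.linalg.det(E + H)    # determinant form in (12)
    return via_eigen, via_det                            # both agree numerically
```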

Another overall test in MANOVA is the Hotelling $T^2$-statistic, a multivariate generalization of the t-test for comparing vectors of means for two groups, and its generalization to the Lawley–Hotelling trace (Tr, the total of the diagonal elements) criterion:

(14) $T^2=\operatorname{Tr}(E^{-1}H)=\sum_{j=1}^{n}\lambda_j$

The maximized criterion (14) can be used for cluster parameter estimation via the trace of the matrix:

(15) $T^2=\operatorname{Tr}(E^{-1}H)=\operatorname{Tr}\left[(S_{\text{tot}}-S_{\text{between}})^{-1}S_{\text{between}}\right]=\operatorname{Tr}\left[(S_{\text{tot}}-S_{\text{between}})^{-1}(S_{\text{between}}-S_{\text{tot}}+S_{\text{tot}})\right]=\operatorname{Tr}\left[(I-S_{\text{tot}}^{-1}S_{\text{between}})^{-1}-I\right]$

with the same $S_{\text{between}}$ (6).

A modification of (14) widely used in MANOVA is the so-called Pillai’s trace:

(16) $V=\operatorname{Tr}\left[(E+H)^{-1}H\right]=\operatorname{Tr}\left[(I+E^{-1}H)^{-1}E^{-1}H\right]=\sum_{j=1}^{n}\frac{\lambda_j}{1+\lambda_j}$

For estimation of cluster centers, this test corresponds to the objective for maximization:

(17) $V=\operatorname{Tr}\left[(E+H)^{-1}H\right]=\operatorname{Tr}\left(S_{\text{tot}}^{-1}S_{\text{between}}\right)$

Both criteria (15) and (17) are optimized via the same matrix $S_{\text{tot}}^{-1}S_{\text{between}}$ of fitting that is used in (13) as well. All these criteria are convenient for cluster parameter estimation because they do not require calculation of each eigenvalue but work with the total matrices. The meaning of all these objectives, including the LS (8), is similar: to identify the parameters of the cluster centers by the closeness of $S_{\text{between}}$ to $S_{\text{tot}}$ in (7).
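A hedged numerical check that the trace criteria match their rewritten forms, the Lawley–Hotelling trace (14) versus (15) and Pillai’s trace (16) versus (17), can be sketched as follows:

```python
# Verifies numerically that (14) equals (15) and (16) equals (17),
# given a total matrix S_tot and a between matrix S_between as in (7).
import numpy as np

def trace_criteria(S_tot, S_between):
    E = S_tot - S_between                                 # within (error) matrix
    H = S_between                                         # between (hypothesis) matrix
    n = S_tot.shape[0]
    lam = np.linalg.eigvals(np.linalg.solve(E, H)).real   # eigenvalues of E^{-1}H
    A = np.linalg.solve(S_tot, S_between)                 # S_tot^{-1} S_between
    T2 = lam.sum()                                        # Lawley-Hotelling trace (14)
    T2_alt = np.trace(np.linalg.inv(np.eye(n) - A) - np.eye(n))   # form (15)
    V = (lam / (1.0 + lam)).sum()                         # Pillai's trace (16)
    V_alt = np.trace(A)                                   # form (17)
    return T2, T2_alt, V, V_alt
```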

The LS (8) and MANOVA (12)–(17) objectives for cluster parameter estimation correspond to the criteria in Joreskog’s classification of methods of factor analysis (Joreskog, 1977; Jöreskog, 1967; Jöreskog & Goldberger, 1972; Lawley & Maxwell, 1971; Maxwell, 1983), based on a given covariance matrix S approximated by a matrix Σ of lower rank (built as the product of a matrix of loadings and its transpose). Joreskog (1977) distinguishes the unweighted least squares (ULS)

(18) $\mathrm{ULS}=\tfrac{1}{2}\operatorname{Tr}\left[(S-\Sigma)^2\right]$

the generalized least squares (GLS)

(19) $\mathrm{GLS}=\tfrac{1}{2}\operatorname{Tr}\left[(I_n-S^{-1}\Sigma)^2\right]$

and the maximum likelihood (ML)

(20) $\mathrm{ML}=\operatorname{Tr}(\Sigma^{-1}S)-\ln|\Sigma^{-1}S|-n$

Our notation for the matrices $S_{\text{tot}}$ and $S_{\text{between}}$ in (8) corresponds to the matrices S and Σ in Joreskog’s notation. The total of all squared elements, or the squared Frobenius norm in (8), can equally be presented as the trace of a matrix multiplied by its transpose. Thus, we see that up to the constant ½, the expression (8) equals ULS (18).

The objective (8) can be transformed as follows:

(21) $F=\left\|S_{\text{tot}}-S_{\text{between}}\right\|^2=\left\|S_{\text{tot}}(I_n-S_{\text{tot}}^{-1}S_{\text{between}})\right\|^2\le\left\|S_{\text{tot}}\right\|^2\cdot\left\|I_n-S_{\text{tot}}^{-1}S_{\text{between}}\right\|^2$

Besides (8), by skipping the constant factor $\|S_{\text{tot}}\|^2$ in (21), it is possible to use the objective of the total residual sum of squares, minimizing the following deviations:

(22) $\tilde{F}=\left\|I_n-S_{\text{tot}}^{-1}S_{\text{between}}\right\|^2=\left\|I_n-S_{\text{tot}}^{-1}\sum_{q=1}^{K}N_q(m_q-M)(m_q-M)'\right\|^2\to\min$

This objective coincides with GLS (19), up to the factor ½. The MANOVA objectives (13), (15), and (17) also use the matrix $S_{\text{tot}}^{-1}S_{\text{between}}$, which corresponds to $S^{-1}\Sigma$ in (19) and, with the roles of the matrices exchanged, to $\Sigma^{-1}S$ in (20).
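Stated as functions, a brief sketch of the three Joreskog-style objectives (18)–(20), assuming S is positive definite and both matrices are symmetric and invertible:

```python
# The objectives (18)-(20) as Python functions of a sample covariance matrix S
# and its lower-rank (plus uniquenesses) approximant Sigma; a sketch only.
import numpy as np

def uls(S, Sigma):
    D = S - Sigma
    return 0.5 * np.trace(D @ D)       # (18); half the squared Frobenius norm

def gls(S, Sigma):
    D = np.eye(S.shape[0]) - np.linalg.solve(S, Sigma)    # I_n - S^{-1} Sigma
    return 0.5 * np.trace(D @ D)       # (19)

def ml(S, Sigma):
    B = np.linalg.solve(Sigma, S)      # Sigma^{-1} S
    sign, logdet = np.linalg.slogdet(B)
    return np.trace(B) - logdet - S.shape[0]              # (20)
```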

The quality of data fit for the objective (8) can be estimated via a pseudo-$R^2$, similar to the coefficient of multiple determination in nonlinear regression, defined as:

(23) $R^2=1-F_{\min}/\operatorname{Tr}(S_{\text{tot}}^2)$

where $F_{\min}$ is the residual sum of squares at the minimum of the objective (8), and the denominator is the total sum of squares of all the elements of the fitted matrix, expressed via the trace of the squared matrix $S_{\text{tot}}$. This measure favors the ULS results obtained by the objective (18). Similarly, for the objective (22), the pseudo-$R^2$ can be defined as one minus $\tilde{F}_{\min}$ divided by the original sum of squares, which equals n for the identity matrix $I_n$. This measure corresponds to the quality of fit for the GLS results (19).
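A small sketch of both measures, assuming the attained minima of (8) and (22) are available from the optimization:

```python
# Pseudo-R^2 for the ULS fit (23) and its GLS analogue; F_min and F_tilde_min
# denote the attained minima of the objectives (8) and (22), respectively.
import numpy as np

def pseudo_r2_uls(F_min, S_tot):
    return 1.0 - F_min / np.trace(S_tot @ S_tot)   # (23): total SS = Tr(S_tot^2)

def pseudo_r2_gls(F_tilde_min, n):
    return 1.0 - F_tilde_min / n                   # sum of squares of I_n equals n
```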

The objectives (8) and (22), and their presentations in (18) and (19), are theoretically meaningful and correspond to the MANOVA relations (12)–(17). In the implementation for numerical estimation, the totals of the squared deviations between the elements of the sample covariance matrix and their parametric counterparts are used; there the first objective (8), or ULS (18), is preferable because it does not include the inversion of the covariance matrix, which can be prone to multicollinearity in the data.

3. Numerical examples

Consider the iris data (Fisher, 1936) on the measured sepal and petal length and width of fifty iris specimens for each of three species: Iris setosa (SE), Iris versicolor (VE), and Iris virginica (VI). These data can be found in the Iris file available in the software package (S-PLUS’2000, 1999) or in the R datasets. The variables are highly correlated: except for the sepal width, the correlations range from 0.81 to 0.96.

In Table 1, the first three numerical columns show the means of the variables for each kind of iris, the next three columns show the group centers and sizes estimated by the ULS (18), the next three columns show the results by the GLS (19), and the last three columns present the regular K-means clustering for comparison. The last row presents the pseudo-$R^2$ (23), which is, of course, favorable to the criterion (8); what is important is that the quality of the ULS fit is the same as that of K-means. The vectors of cluster centers for the ULS outperform those of the GLS: they are noticeably closer to the original centers of the iris species. The ULS results are very similar to those of K-means as well, but in contrast to K-means, the ULS cluster centers and base sizes are obtained using only the covariance matrix, without the data-set itself.

Table 1. Cluster centers and sizes estimation

Estimation by the objective (8), or ULS (18), does not require inversion of the covariance matrix; thus, the clustering results are more robust and less prone to multicollinearity within the data. This is the reason why ULS regularly outperforms the GLS technique (19) or (22), which employs the inverted covariance matrix, with possibly inflated values of its elements, and leads to worse clustering results, as was observed on various data-sets.
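For readers wishing to reproduce the comparison setup, a rough sketch of the species means and the plain K-means columns of Table 1, assuming scikit-learn and its bundled copy of Fisher’s iris data (the covariance-only method itself is sketched in the Appendix):

```python
# Species centers and K-means centers/sizes on the iris data, roughly
# matching the first and last column groups of Table 1.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target                      # 150 observations, 4 variables

for q in range(3):                                 # original species centers
    print(iris.target_names[q], X[y == q].mean(axis=0).round(2))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_.round(2))                # K-means cluster centers
print(np.bincount(km.labels_))                     # K-means cluster sizes
```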

4. Summary

This work considers the problem of finding cluster centers and sizes by fitting the covariance matrix with the between-cluster matrix of lower rank, constructed from the outer products of the parameters of cluster centers weighted by cluster sizes. The relation of this approach to the criteria from multivariate analysis of variance (MANOVA) and linear discriminant analysis (LDA) in the objectives for optimization in cluster analysis is discussed. Using criteria such as Wilks’ lambda, the Lawley–Hotelling trace, and Pillai’s trace to build objectives for estimating cluster parameters produces the best distinguishing between the clusters. Solutions can be found by a nonlinear optimization procedure with the multinomial parameterization, which yields estimates for the cluster centers and sizes.

This approach can be especially useful for big data-sets. Indeed, for big data with possibly hundreds of thousands or millions of observations, any regular clustering algorithm presents a hard computational burden, while the suggested method operates on the data already compressed into the covariance matrix. When the cluster centers and sizes are estimated, the actual clustering, or assignment of each observation to one or another cluster, can be performed by allocating each observation to the cluster with the closest center. The described approach employs only the sample covariance matrix and not the observations themselves, so it can be valuable in difficult clustering tasks on huge data-sets from databases, data warehouses, and data-mining problems. It can also be useful for finding cluster structure when the data itself is unavailable for any reason and only the covariance matrix can be used.

Acknowledgments

Stan Lipovetsky would like to thank two reviewers for the comments which improved the paper.

Additional information

Funding

The author received no direct funding for this research.

Notes on contributors

Stan Lipovetsky

Stan Lipovetsky, PhD, senior research director, GfK, Marketing Sciences. Stan has numerous publications in multivariate statistics, multiple criteria decision-making, econometrics, microeconomics, and marketing research.

The methods of multivariate statistical analysis are widely applied in marketing research. The clustering technique described in the current paper can be especially useful for big data, with possibly hundreds of thousands or millions of observations, when regular clustering algorithms present a hard computational burden. The suggested method operates on the data already compressed into the covariance matrix. When the cluster centers and sizes are estimated, the actual clustering, or assignment of each observation to one or another cluster, can be performed by allocating it to the closest cluster.

References

  • Bishop, C. M. (2006). Pattern recognition and machine learning. New York, NY: Springer.
  • Brusco, M. J., & Steinley, D. (2007). A comparison of heuristic procedures for minimum within-cluster sums of squares partitioning. Psychometrika, 73, 125–144.
  • DeSarbo, W. S., Carroll, J. D., Clark, L. A., & Green, P. E. (1984). Synthesized clustering: A method for amalgamating alternative clustering bases with differential weighting of variables. Psychometrika, 49, 57–78. doi:10.1007/BF02294206
  • Dillon, W. R., & Goldstein, M. (1984). Multivariate analysis: Methods and applications. New York, NY: Wiley.
  • Eldén, L. (2007). Matrix methods in data mining and pattern recognition. Philadelphia, PA: SIAM. doi:10.1137/1.9780898718867
  • Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188. doi:10.1111/j.1469-1809.1936.tb02137.x
  • Frey, B. J., & Dueck, D. (2007). Clustering by passing messages between data points. Science, 315, 972–976. doi:10.1126/science.1136800
  • Friedman, J. H., & Meulman, J. J. (2004). Clustering objects on subsets of attributes (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66, 815–849. doi:10.1111/rssb.2004.66.issue-4
  • Gan, G., Ma, C., & Wu, J. (2007). Data clustering: Theory, algorithms, and applications. Philadelphia, PA: SIAM. doi:10.1137/1.9780898718348
  • Härdle, W. K., & Simar, L. (2012). Applied multivariate statistical analysis. New York, NY: Springer. doi:10.1007/978-3-642-17229-8
  • Hastie, T., Friedman, J., & Tibshirani, R. (2001). The elements of statistical learning: Data mining, inference, and prediction. New York, NY: Springer. doi:10.1007/978-0-387-21606-5
  • Heiser, W. J., & Groenen, P. J. F. (1997). Cluster differences scaling with a within-clusters loss component and a fuzzy successive approximation strategy to avoid local minima. Psychometrika, 62, 63–83. doi:10.1007/BF02294781
  • Izenman, A. J. (2008). Modern multivariate statistical techniques. New York, NY: Springer. doi:10.1007/978-0-387-78189-1
  • Joreskog, K. G. (1977). Factor analysis by least-squares and maximum-likelihood methods. In K. Enslein, A. Ralston, & H. S. Wilf (Eds.), Statistical methods for digital computers (pp. 125–153). New York, NY: John Wiley & Sons.
  • Jöreskog, K. G. (1967). Some contributions to maximum likelihood factor analysis. Psychometrika, 32, 443–482. doi:10.1007/BF02289658
  • Jöreskog, K. G., & Goldberger, A. S. (1972). Factor analysis by generalized least squares. Psychometrika, 37, 243–260. doi:10.1007/BF02306782
  • Ladd, G. W. (1966). Linear probability functions and discriminant functions. Econometrica, 34, 873–885. doi:10.2307/1910106
  • Lawley, D. N., & Maxwell, A. E. (1971). Factor analysis as a statistical method. New York, NY: American Elsevier.
  • Lipovetsky, S. (2009a). Linear regression with special coefficient features attained via parameterization in exponential, logistic, and multinomial–logit forms. Mathematical and Computer Modelling, 49, 1427–1435. doi:10.1016/j.mcm.2008.11.013
  • Lipovetsky, S. (2009b). PCA and SVD with nonnegative loadings. Pattern Recognition, 42, 68–76. doi:10.1016/j.patcog.2008.06.025
  • Lipovetsky, S. (2012). Total odds and other objectives for clustering via multinomial-logit model. Advances in Adaptive Data Analysis, 4. doi:10.1142/S1793536912500197
  • Lipovetsky, S. (2013a). Additive and multiplicative mixed normal distributions and finding cluster centers. International Journal of Machine Learning and Cybernetics, 4(1), 1–11. doi:10.1007/s13042-012-0070-3
  • Lipovetsky, S. (2013b). Finding cluster centers and sizes via multinomial parameterization. Applied Mathematics and Computation, 221, 571–580. doi:10.1016/j.amc.2013.06.098
  • Lipovetsky, S., & Conklin, M. (2005). Regression by data segments via discriminant analysis. Journal of Modern Applied Statistical Methods, 4, 63–74.
  • Lipovetsky, S., Tishler, A., & Conklin, W. M. (2002). Multivariate least squares and its relation to other multivariate techniques. Applied Stochastic Models in Business and Industry, 18, 347–356. doi:10.1002/(ISSN)1526-4025
  • Liu, H., & Motoda, H. (Eds.). (2008). Computational methods of feature selection. Boca Raton, FL: Chapman & Hall/CRC.
  • Maxwell, A. E. (1983). Factor analysis. In S. Kotz & N. L. Johnson (Eds.), Encyclopedia of statistical sciences (Vol. 3, pp. 2–8). New York, NY: John Wiley & Sons.
  • Nowakowska, E., Koronacki, J., & Lipovetsky, S. (2014). Clusterability assessment for Gaussian mixture models. Applied Mathematics and Computation, 256, 591–601. doi:10.1016/j.amc.2014.12.038
  • Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press. doi:10.1017/CBO9780511812651
  • S-PLUS’2000. (1999). Seattle, WA: MathSoft.
  • Szekely, G. J., & Rizzo, M. L. (2005). Hierarchical clustering via joint between–within distances: Extending Ward’s minimum variance method. Journal of Classification, 22, 151–183. doi:10.1007/s00357-005-0012-9
  • Timm, N. H. (1975). Multivariate analysis with applications in education and psychology. Monterey, CA: Brooks/Cole.

Appendix

Estimation via multinomial parameterization

Dividing relation (7) by the total number of observations N and denoting the sample variance–covariance matrix as C = Stot/N, let us present (8) in a more convenient form:

(A1) $F=\left\|C-\sum_{q=1}^{K}\frac{N_q}{N}(m_q-M)(m_q-M)'\right\|^2\to\min$

where $N_q/N$ are the q-th cluster’s size shares in the total base. For data with n variables $x_j$, and for a chosen number of clusters K, with the restrictions (2) and (4) there are K − 1 free parameters $N_q$ and K − 1 free vectors $m_q$ with (K − 1)n parameters of means $m_{jq}$, so the total number of parameters is (K − 1)(n + 1). Taking (4) into account, we represent the total of the outer products of distances in (A1) as follows:

(A2) $\sum_{q=1}^{K}\frac{N_q}{N}(m_q-M)(m_q-M)'=\sum_{q=1}^{K}\frac{N_q}{N}\,m_q m_q'-MM'=\sum_{q=1}^{K}\frac{N}{N_q}\left[\operatorname{diag}\!\left(\frac{m_q N_q}{MN}\right)M\right]\left[\operatorname{diag}\!\left(\frac{m_q N_q}{MN}\right)M\right]'-MM'$

where the quotient $m_q/M$ inside the diagonal matrices is taken element-wise.

In (A2) we have two sets of share parameters: the shares $N_q/N$, whose total equals one due to (2), and the shares $m_{jq}N_q/(M_j N)$, whose total over the clusters equals one for each variable due to (4). The multinomial parameterizations for these shares can be defined via new sets of unknown parameters as follows. Instead of the shares $N_q/N$, let us use the multinomial parameterization

(A3) $\gamma_q=\frac{\exp(\alpha_q)}{1+\sum_{p=2}^{K}\exp(\alpha_p)},\quad \alpha_1=0,$

with the first parameter set to zero, so that only the needed K − 1 parameters $\alpha_q$ remain. Similarly, in place of the shares $m_{jq}N_q/(M_j N)$ in (A2), for each variable $x_j$ we define a new multinomial parameterization:

(A4) $g_{jq}=\frac{\exp(\beta_{jq})}{1+\sum_{p=2}^{K}\exp(\beta_{jp})},\quad \beta_{j1}=0,\quad j=1,2,\ldots,n,$

with the needed (K − 1)n parameters $\beta_{jq}$. Using the parameterizations (A3) and (A4) in place of the shares in the diagonal matrices of (A2), and substituting the expression (A2) into the objective (A1), yields:

(A5) $F=\left\|C-\sum_{q=1}^{K}(g_q\circ M)\,\frac{1}{\gamma_q}\,(g_q\circ M)'+MM'\right\|^2\to\min$

where $g_q$ denotes the vector of order n of the multinomial shares (A4) for the q-th cluster, and $g_q\circ M$ denotes the element-wise product of the two vectors.

The objective (A5) corresponds to a nonlinear regression of the dependent variable, presented by the values of the elements of the matrix C, on the values of the total means within the complex structure of the unknown parameters. Minimization of (A5) by the parameters $\alpha_q$ and $\beta_{jq}$ of the multinomial shares (A3) and (A4) can be performed by any software for nonlinear optimization, or attained directly by the Newton–Raphson procedure described in more detail in Lipovetsky (2013b). When the parameters $\alpha_q$ and $\beta_{jq}$ are found, they are used for calculating the share values (A3) and (A4). With the $\gamma_q$ and $g_{jq}$ estimates, the quotients $N_q/N$ are known, so the cluster sizes are

(A6) $N_q=\gamma_q N,$

and the cluster centers equal

(A7) $m_{jq}=N M_j g_{jq}/N_q=M_j g_{jq}/\gamma_q.$

Having the cluster centers and sizes, the actual clustering can then be performed.
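To make the procedure concrete, the following is a minimal Python sketch of the Appendix pipeline: it minimizes the objective (A5) over the multinomial parameters (A3)–(A4) with a general-purpose quasi-Newton optimizer and recovers the sizes (A6) and centers (A7). The optimizer choice (scipy’s BFGS), the starting values, and the function name are illustrative assumptions, not the dedicated Newton–Raphson procedure of Lipovetsky (2013b).

```python
# A sketch of the Appendix procedure: minimize (A5) over the multinomial
# parameters (A3)-(A4), then recover sizes (A6) and centers (A7).
# Positive-valued data are assumed, so the shares m_jq*N_q/(M_j*N) are in (0, 1).
import numpy as np
from scipy.optimize import minimize

def fit_clusters(C, M, N, K, seed=0):
    n = len(M)

    def shares(free):                              # (A3)/(A4): first parameter fixed at 0
        z = np.concatenate(([0.0], free))
        e = np.exp(z - z.max())                    # stabilized; the share ratios are unchanged
        return e / e.sum()

    def unpack(theta):
        gamma = shares(theta[:K - 1])                          # gamma_q = N_q / N
        betas = theta[K - 1:].reshape(n, K - 1)
        g = np.vstack([shares(b) for b in betas])              # n x K matrix of g_jq
        return gamma, g

    def objective(theta):                          # objective (A5)
        gamma, g = unpack(theta)
        S = -np.outer(M, M)                        # accumulates sum_q (g_q o M)(g_q o M)'/gamma_q - MM'
        for q in range(K):
            v = g[:, q] * M                        # element-wise product g_q o M
            S += np.outer(v, v) / gamma[q]
        return np.sum((C - S) ** 2)

    theta0 = np.random.default_rng(seed).normal(scale=0.1, size=(K - 1) * (n + 1))
    res = minimize(objective, theta0, method='BFGS')
    gamma, g = unpack(res.x)
    return (g / gamma).T * M, gamma * N, res.fun   # centers (A7), sizes (A6), F_min
```

For the iris data, for instance, one would pass C = np.cov(X.T, bias=True) (the second-moment matrix divided by N), M = X.mean(axis=0), N = 150, and K = 3; each observation can then be assigned to the cluster with the nearest estimated center. As for any nonlinear objective, restarting from several random starting points helps to avoid poor local minima.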