
A comparison of two estimation methods for common principal components

Abstract

Common principal components (CPCs) are often estimated using maximum likelihood estimation through an algorithm called the Flury–Gautschi (FG) Algorithm. Krzanowski proposed a simpler estimation method via a principal component analysis of a weighted sum of the sample covariance matrices. These methods are compared for real-world datasets and in a Monte Carlo simulation. The real-world data is used to compare the selection of a common eigenvector model and the estimated coefficients. The simulation study investigates how the accuracy of the methods is affected by autocorrelation, the number of covariance matrices, dimensions, and sample sizes for multivariate normal and chi-square distributed data. The findings in this article support the use of Krzanowski’s method in situations where the CPC assumption is appropriate.

1 Introduction

Multivariate analysis deals with making inferences about many, usually intercorrelated, variables simultaneously. Applications include fields such as mechanics, biology, engineering, and economics, to mention a few. The straightforward and intuitive understanding of univariate problems is partly lost when several variables are involved in the analysis. This difficulty in comprehending the meaning and role of multidimensional quantities becomes even greater when groups of such multivariate measurements are involved. To make sense of such quantities, one may utilize relations between variables to impose certain simplifying restrictions on the models. One such relationship is the extension of principal component analysis (PCA) to multiple groups, called common principal components analysis (CPCA). Introduced and developed by Flury (1984, 1988), CPCA assumes that some rotation can diagonalize the covariance matrices simultaneously; that is, they have a common eigenvector structure but may have group-specific eigenvalues. This relationship between covariance matrices is commonly known as the common principal component (CPC) model. In one-sample PCA, the eigenvectors are ordered by the rank order of their corresponding eigenvalues. For CPCA, no natural order of the set of eigenvectors exists because the rank order of the eigenvalues may vary across groups. When the CPC model is suitable, it provides more accurate estimates of the covariance matrices than estimating each covariance matrix individually.

CPCs are commonly estimated using maximum likelihood estimation (MLE). An algorithm is required to solve the maximum likelihood (ML) equations. The Flury–Gautschi (FG) algorithm was proposed by Flury and Gautschi (1986); it is an iterative numerical algorithm that requires considerable computational power. It loops through all pairs of columns of the eigenvector matrix to complete a single iteration, which makes it inefficient and sometimes infeasible in higher-dimensional settings (Browne and McNicholas 2014). The asymptotic distribution of the maximum likelihood estimators was derived by Flury (1986).

Krzanowski (1984) proposed a simpler approximation of the CPCs which requires only standard computer packages for PCA. Under the CPC setting, any weighted sum of the marginal covariance matrices has the same eigenvectors as the marginal covariance matrices. Thus, the CPC coefficients can be estimated by performing a PCA of any weighted sum (or product) of the sample covariance matrices. The asymptotic behavior of PCA was derived by Girshick (1939), Lawley (1953, 1956), and Anderson (1963). A summary of the applications and properties of PCA is given by Jolliffe (2002).

Flury (1988) introduced a hierarchy of similarities among several covariance matrices. This hierarchy consists of five levels, ranging from equality of the covariance matrices (level 1) to completely unrelated covariance matrices (level 5). Level 2 is proportionality of the covariance matrices, studied by Pillai, Al-Ani, and Jouris (1969) and Rao (1973); level 3 is the aforementioned CPC model; and the fourth level is the partial CPC model, which assumes that only q (< p) out of p eigenvectors are common to all groups, denoted CPC(q) (it is possible to break this level down further into models CPC(p – 2) to CPC(1)). To identify an appropriate common eigenvector model for the data, two goodness-of-fit measures were proposed by Flury (1988): a log-likelihood ratio (LR) statistic and the Akaike information criterion (AIC). Pepler, Uys, and Nel (2016) compared eight methods for selecting suitable common eigenvector models, including the LR statistic and the AIC. This article considers three model identification methods, the AIC, the Schwarz criterion [also known as the Bayesian information criterion (BIC)], and the LR statistic, which are used to identify an appropriate model for a given dataset.

For a detailed overview of CPCA of covariance matrices, see Flury (1988).

This article compares the ML and Krzanowski's estimators on two real-world datasets and in a Monte Carlo simulation study. The two datasets are (i) annual Swedish municipal-level data related to innovations, collected over 11 years, and (ii) the Iris flower data. Based on these datasets, the estimated CPC coefficients and the three model identification statistics are calculated and compared for the two estimation methods. The results obtained by Flury (1984) and Krzanowski (1984) on the Iris data are confirmed. The simulation study examines how the accuracy of the estimation methods is affected by the number of groups, dimensions, sample sizes, and autocorrelation between the covariance matrices, for multivariate normal data and for chi-square distributed data (with 2 and 10 degrees of freedom).

2 Preliminaries

Bold capital letters denote matrices (e.g., $\mathbf{A} \in \mathbb{R}^{N \times p}$), bold lower-case letters denote vectors (e.g., $\mathbf{a} \in \mathbb{R}^p$), and normal lower-case letters denote constants (e.g., $c \in \mathbb{R}$). The dimension p corresponds to the number of "variables," N to the number of "observations," and G to the number of groups.

Notation 1. Let $|\mathbf{A}|$ denote the determinant of a square matrix $\mathbf{A}$.

Notation 2. Let $\bar{\mathbf{A}}_G = \frac{1}{G}(\mathbf{A}_1 + \mathbf{A}_2 + \cdots + \mathbf{A}_G)$.

Notation 3. Let $\mathbf{A}'$ denote the transpose of $\mathbf{A}$.

Notation 4. The continuous uniform distribution, the matrix normal distribution, and the Wishart distribution are denoted $U \sim U(a, b)$, $\mathbf{X} \sim N_{N \times p}(\mathbf{M}, \boldsymbol{\Psi}, \boldsymbol{\Sigma})$, and $\mathbf{W} \sim W_p(N, \boldsymbol{\Sigma})$, respectively. Throughout this article $\boldsymbol{\Psi} = \mathbf{I}_N$, and the notation $\mathbf{X} \sim N_{N \times p}(\mathbf{M}, \boldsymbol{\Sigma})$ is used instead.

Definition 1 (Orthogonal group). The set of orthogonal p × p matrices is denoted $\mathcal{O}_p$ and defined as $\mathcal{O}_p = \{\mathbf{O} \in \mathbb{R}^{p \times p} : \mathbf{O}'\mathbf{O} = \mathbf{O}\mathbf{O}' = \mathbf{I}_p\}$. $\mathcal{O}_p$ forms a group, with matrix multiplication as its group operation.

Definition 2. For any symmetric matrix $\mathbf{A} \in \mathbb{R}^{p \times p}$, we can find $\lambda \in \mathbb{R}$ and $\boldsymbol{\pi} \in \mathbb{R}^p$ such that
$$\mathbf{A}\boldsymbol{\pi} = \lambda\boldsymbol{\pi}. \tag{2.1}$$

We say that $\lambda \in \mathbb{R}$ and $\boldsymbol{\pi} \in \mathbb{R}^p$ are a (right) eigenvalue and eigenvector pair. The eigenvectors are orthogonal and usually normalized to unit length, i.e., $\boldsymbol{\pi}_i'\boldsymbol{\pi}_j = 1$ for $i = j = 1, \ldots, p$ and zero otherwise. $\boldsymbol{\Pi}$ collects the eigenvectors of $\mathbf{A}$ as columns, $\boldsymbol{\Pi} = (\boldsymbol{\pi}_1, \ldots, \boldsymbol{\pi}_p) \in \mathbb{R}^{p \times p}$. Since the columns of $\boldsymbol{\Pi}$ are orthonormal vectors, the matrix is orthogonal. The diagonal matrix $\boldsymbol{\Lambda}$ collects the corresponding eigenvalues in decreasing order, i.e., $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$ where $\lambda_1 \geq \cdots \geq \lambda_p$.

Definition 3 (Spectral decomposition). Any symmetric matrix $\mathbf{A}$ can be factorized as
$$\mathbf{A} = \boldsymbol{\Pi}\boldsymbol{\Lambda}\boldsymbol{\Pi}' = \sum_{i=1}^{p} \lambda_i \boldsymbol{\pi}_i\boldsymbol{\pi}_i'. \tag{2.2}$$
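As a quick numerical illustration (not part of the original article), the following Python snippet checks Definitions 2 and 3 with NumPy: the eigenvectors of a symmetric matrix are orthonormal and reproduce the matrix via (2.2).

```python
import numpy as np

# Minimal numerical check of the spectral decomposition (2.2).
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2                          # symmetric test matrix
eigvals, Pi = np.linalg.eigh(A)            # eigenvalues ascending, columns orthonormal
order = np.argsort(-eigvals)               # reorder to decreasing, as in Definition 2
Lam, Pi = np.diag(eigvals[order]), Pi[:, order]
assert np.allclose(A, Pi @ Lam @ Pi.T)     # A = Pi Lam Pi'
assert np.allclose(Pi.T @ Pi, np.eye(5))   # Pi is orthogonal
```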

2.1 Common principal components analysis

CPCA is the generalization of PCA to multiple groups. Let $\mathbf{X}_1, \ldots, \mathbf{X}_G$ be distributed as $N_{N_g \times p}(\mathbf{M}_g, \boldsymbol{\Sigma}_g)$ for $g = 1, \ldots, G$, where the same set of variables is measured for all groups. In the CPC model, the associated covariance matrices $\boldsymbol{\Sigma}_g$ can be simultaneously diagonalized by the same orthogonal matrix $\boldsymbol{\Pi}$, i.e.,
$$\boldsymbol{\Sigma}_g = \boldsymbol{\Pi}\boldsymbol{\Lambda}_g\boldsymbol{\Pi}', \quad g = 1, \ldots, G, \tag{2.3}$$
where $\boldsymbol{\Pi}$ collects the eigenvectors of $\boldsymbol{\Sigma}_g$ as columns and $\boldsymbol{\Lambda}_g$ is a diagonal matrix whose jth element is the eigenvalue corresponding to the eigenvector in the jth column of $\boldsymbol{\Pi}$. Consequently, all covariance matrices share the same set of eigenvectors but have group-specific eigenvalues.
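The CPC structure (2.3) is easy to illustrate numerically. The following Python sketch (illustrative only, not the authors' code) builds G covariance matrices that share a common eigenvector matrix but have group-specific eigenvalues, and checks that the common rotation diagonalizes all of them simultaneously.

```python
import numpy as np

rng = np.random.default_rng(1)
p, G = 4, 3
Pi, _ = np.linalg.qr(rng.standard_normal((p, p)))     # common orthogonal eigenvectors
Lam = (0.5 + rng.uniform(size=(G, p))) ** 2           # group-specific eigenvalues
Sigma = np.array([Pi @ np.diag(Lam[g]) @ Pi.T for g in range(G)])

for g in range(G):
    D = Pi.T @ Sigma[g] @ Pi                          # should equal diag(Lam[g])
    assert np.allclose(D, np.diag(Lam[g]))
```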

3 The hierarchy levels and estimation

Definition 4. The five levels of Flury's (1988) hierarchy of similarities among covariance matrices are defined below. Consider the positive definite covariance matrices $\boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_G$ of dimension p × p:

Level 1: Equality of all $\boldsymbol{\Sigma}_g$, i.e., $\boldsymbol{\Sigma}_1 = \cdots = \boldsymbol{\Sigma}_G$.

The principal components can be obtained by treating the groups as a single one, pooling the sample covariance matrices, and performing a PCA. The number of estimated parameters is thus $\frac{1}{2}p(p+1)$.

Level 2: Proportionality of all $\boldsymbol{\Sigma}_g$, i.e.,

$$\boldsymbol{\Sigma}_g = \rho_g\boldsymbol{\Sigma}_1, \quad g = 2, \ldots, G, \quad \rho_2, \ldots, \rho_G > 0. \tag{3.1}$$

The number of estimated parameters is $\frac{1}{2}p(p+1) + G - 1$.

Level 3: CPC model

$$\boldsymbol{\Sigma}_g = \boldsymbol{\Pi}\boldsymbol{\Lambda}_g\boldsymbol{\Pi}', \quad g = 1, \ldots, G. \tag{3.2}$$

The number of estimated parameters for the CPC model is $\frac{1}{2}p(p-1)$ for the orthogonal matrix plus $Gp$ for the eigenvalues.

Level 4: Partial common principal component model (CPC(q))

$$\boldsymbol{\Sigma}_g = \boldsymbol{\Pi}^{(g)}\boldsymbol{\Lambda}_g\boldsymbol{\Pi}^{(g)\prime}, \quad g = 1, \ldots, G, \tag{3.3}$$

where $\boldsymbol{\Pi}^{(g)} = (\boldsymbol{\Pi}_c : \boldsymbol{\Pi}_s^{(g)})$, $\boldsymbol{\Pi}_c$ is the $p \times q$ part common across all groups, and $\boldsymbol{\Pi}_s^{(g)}$ is the $p \times (p-q)$ part specific to group g. The number of estimated parameters for the CPC(q) model is: $Gp$ parameters for the eigenvalues, $p(p-1)/2 - (p-q)(p-q-1)/2$ parameters for the common eigenvectors $\boldsymbol{\Pi}_c$, and $G(p-q)(p-q-1)/2$ for the specific eigenvectors $\boldsymbol{\Pi}_s^{(g)}$. The total number of estimated parameters is therefore $\frac{1}{2}p(p-1) + Gp + \frac{1}{2}(G-1)(p-q)(p-q-1)$.

Level 5: Unrelated $\boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_G$

The number of estimated parameters is $\frac{1}{2}Gp(p+1)$ and the covariance matrices are analyzed separately.

3.1 Maximum likelihood estimation

Assume G independent matrices $\mathbf{X}_1, \ldots, \mathbf{X}_G$ to be normally distributed as $N_{N_g \times p}(\mathbf{M}_g, \boldsymbol{\Sigma}_g)$, where the same p variables have been measured across all groups, with sample sizes $N_g = n_g + 1$, $\mathbf{M}_g = \mathbf{1}_{N_g}\boldsymbol{\mu}_g'$, $\boldsymbol{\mu}_g = (\mu_{g1}, \ldots, \mu_{gp})'$, and $\mathbf{1}_{N_g}$ an $N_g \times 1$ vector of ones. Let $\mathbf{S}_g$ denote the usual unbiased covariance estimator
$$\mathbf{S}_g = \frac{1}{n_g}\mathbf{X}_g'\mathbf{Q}_g\mathbf{X}_g = \frac{1}{n_g}(\mathbf{X}_g - \mathbf{1}_{N_g}\bar{\mathbf{x}}_g')'(\mathbf{X}_g - \mathbf{1}_{N_g}\bar{\mathbf{x}}_g'), \tag{3.4}$$
where $\mathbf{Q}_g = \mathbf{I}_{N_g} - \frac{1}{N_g}\mathbf{1}_{N_g}\mathbf{1}_{N_g}'$ and $\bar{\mathbf{x}}_g' = \frac{1}{N_g}\mathbf{1}_{N_g}'\mathbf{X}_g$; further, $n_g\mathbf{S}_g \sim W_p(n_g, \boldsymbol{\Sigma}_g)$. The likelihood function of $\boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_G$ given $\mathbf{S}_1, \ldots, \mathbf{S}_G$ is
$$L = L(\boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_G \mid \mathbf{S}_1, \ldots, \mathbf{S}_G) = C \times \prod_{g=1}^{G} \exp\left[\mathrm{tr}\left(-\frac{n_g}{2}\boldsymbol{\Sigma}_g^{-1}\mathbf{S}_g\right)\right] |\boldsymbol{\Sigma}_g|^{-n_g/2}, \tag{3.5}$$
where C is a constant that does not depend on any $\boldsymbol{\Sigma}_g$. The likelihood function attains its maximum at the same point as the log-likelihood function; thus, for convenience, the function $-2\log L$ is minimized.
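For later use in model comparison, the kernel of $-2\log L$ implied by (3.5) can be computed directly from fitted and sample covariance matrices. The sketch below is a minimal illustration under the setup above; the helper name neg2_loglik is hypothetical and the constant C is dropped.

```python
import numpy as np

def neg2_loglik(Sigma_hat, S, n):
    """Sketch of the kernel of -2 log L from (3.5), dropping the constant C:
    sum_g n_g [ log|Sigma_hat_g| + tr(Sigma_hat_g^{-1} S_g) ].

    Sigma_hat, S : (G, p, p) arrays of fitted and sample covariance matrices.
    n            : (G,) array of degrees of freedom n_g.
    """
    _, logdet = np.linalg.slogdet(Sigma_hat)                        # log|Sigma_hat_g|
    trace_term = np.trace(np.linalg.solve(Sigma_hat, S), axis1=1, axis2=2)
    return float(np.sum(np.asarray(n) * (logdet + trace_term)))
```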

Level 2:

If the proportionality assumption holds, by imposing the orthogonality restriction on the eigenvectors and letting $\rho_1 = 1$, the likelihood equations are (Flury 1988; Pepler 2014)
$$\rho_g = \frac{1}{p}\sum_{j=1}^{p} \frac{\boldsymbol{\pi}_j'\mathbf{S}_g\boldsymbol{\pi}_j}{\lambda_j}, \quad g = 2, \ldots, G, \tag{3.6}$$
$$\lambda_j = \frac{1}{n}\sum_{g=1}^{G} \frac{n_g}{\rho_g}\boldsymbol{\pi}_j'\mathbf{S}_g\boldsymbol{\pi}_j, \quad j = 1, \ldots, p, \tag{3.7}$$
where $n = n_1 + \cdots + n_G$, and
$$\left(\frac{1}{\lambda_i} - \frac{1}{\lambda_j}\right)\boldsymbol{\pi}_j'\left(\sum_{g=1}^{G} \frac{n_g}{\rho_g}\mathbf{S}_g\right)\boldsymbol{\pi}_i = 0, \quad i \neq j. \tag{3.8}$$

The likelihood equations can be solved by the iterative algorithm proposed by Flury (1988), outlined in Appendix A.1.

Level 3:

By assuming that the CPC assumption holds and imposing the orthogonality restriction on the eigenvectors, the minimization problem can be simplified to one that involves solving the following system of equations (Flury 1984)
$$\boldsymbol{\pi}_i'\left(\sum_{g=1}^{G} n_g\,\frac{\lambda_{gi} - \lambda_{gj}}{\lambda_{gi}\lambda_{gj}}\,\mathbf{S}_g\right)\boldsymbol{\pi}_j = 0, \quad i, j = 1, \ldots, p; \; i \neq j, \tag{3.9}$$
where $\lambda_{gj} = \boldsymbol{\pi}_j'\mathbf{S}_g\boldsymbol{\pi}_j$.

Flury and Gautschi (1986) developed an algorithm known as the FG algorithm to solve this system of equations. The ML solution is denoted $\hat{\boldsymbol{\Pi}} = (\hat{\boldsymbol{\pi}}_1, \ldots, \hat{\boldsymbol{\pi}}_p)$ and $\hat{\lambda}_{gj}$, and let $\hat{\boldsymbol{\Lambda}}_g = \mathrm{diag}(\hat{\boldsymbol{\Pi}}'\mathbf{S}_g\hat{\boldsymbol{\Pi}}) = \mathrm{diag}(\hat{\lambda}_{g1}, \ldots, \hat{\lambda}_{gp})$ and $\hat{\boldsymbol{\Sigma}}_g = \hat{\boldsymbol{\Pi}}\hat{\boldsymbol{\Lambda}}_g\hat{\boldsymbol{\Pi}}'$, for $g = 1, \ldots, G$ and $j = 1, \ldots, p$.

In PCA (the special case of G = 1), the columns of $\boldsymbol{\Pi}$ are ordered according to the rank order of their eigenvalues. For the CPC model, no self-evident order exists. To be able to discuss the first, second, or other principal components, some arrangement is needed. For the ML estimation, Flury's (1986) suggestion of using the rank order of the first group is adopted, i.e.,
$$\boldsymbol{\pi}_1'\boldsymbol{\Sigma}_1\boldsymbol{\pi}_1 > \cdots > \boldsymbol{\pi}_p'\boldsymbol{\Sigma}_1\boldsymbol{\pi}_p. \tag{3.10}$$
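The FG algorithm is not reproduced in this article, so the following Python sketch should be read as a simplified, illustrative pairwise-rotation scheme aimed at the ML equations (3.9), not as the exact algorithm of Flury and Gautschi (1986); the inner iteration per column pair and the original convergence criteria are omitted, and the helper name fg_cpc is hypothetical.

```python
import numpy as np

def fg_cpc(S, n, sweeps=30):
    """Simplified FG-style pairwise-rotation sketch for the ML equations (3.9).

    S : (G, p, p) sample covariance matrices; n : (G,) degrees of freedom.
    Returns estimated common eigenvectors (ordered by the first group's
    eigenvalues, cf. (3.10)) and a G x p matrix of eigenvalue estimates.
    """
    S, n = np.asarray(S, float), np.asarray(n, float)
    G, p, _ = S.shape
    B = np.eye(p)                                   # initial orthogonal matrix
    for _ in range(sweeps):
        for i in range(p - 1):
            for j in range(i + 1, p):               # sweep over all column pairs
                H = B[:, [i, j]]                    # p x 2
                lam = np.array([np.diag(H.T @ S[g] @ H) for g in range(G)])
                w = n * (lam[:, 0] - lam[:, 1]) / (lam[:, 0] * lam[:, 1])
                T = sum(w[g] * (H.T @ S[g] @ H) for g in range(G))
                _, J = np.linalg.eigh((T + T.T) / 2)   # 2x2 rotation diagonalizing T
                B[:, [i, j]] = H @ J
    Lam = np.array([np.diag(B.T @ S[g] @ B) for g in range(G)])
    order = np.argsort(-Lam[0])                     # order by the first group, (3.10)
    return B[:, order], Lam[:, order]
```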

Level 4:

The CPC model risks being rejected due to non-common components that are discarded in the subsequent analysis, even if the components of interest are common across all groups. Thus, the partial CPC model could be selected if some of the components with low variance can be abandoned. Obtaining its exact ML estimates is substantially more complicated than in the CPC model, and no simple method is available. However, given the ML estimates of the CPC model, approximate estimates can easily be obtained by the following procedure (Flury 1988), sketched in code after the steps:

  • Step 1: Compute $\hat{\boldsymbol{\Pi}} = (\hat{\boldsymbol{\Pi}}_1 : \hat{\boldsymbol{\Pi}}_2) \in \mathcal{O}_p$ through the CPC model, where $\hat{\boldsymbol{\Pi}}_1$ contains the q columns common across the groups. Some reordering of the columns of $\hat{\boldsymbol{\Pi}}$ might be required to obtain the common components as the first q columns.

  • Step 2: Diagonalize the matrix $\hat{\boldsymbol{\Pi}}_2'\mathbf{S}_g\hat{\boldsymbol{\Pi}}_2$, i.e., find $\mathbf{Q}_g \in \mathcal{O}_{p-q}$ such that $\mathbf{Q}_g'\hat{\boldsymbol{\Pi}}_2'\mathbf{S}_g\hat{\boldsymbol{\Pi}}_2\mathbf{Q}_g$ is diagonal, and let $\hat{\boldsymbol{\Pi}}_2^{(g)} = \hat{\boldsymbol{\Pi}}_2\mathbf{Q}_g$ (for all $g = 1, \ldots, G$).

  • Step 3: The approximate ML estimated partial CPCs are then

$$\hat{\hat{\boldsymbol{\Pi}}}^{(g)} = (\hat{\boldsymbol{\Pi}}_1 : \hat{\boldsymbol{\Pi}}_2^{(g)}), \tag{3.11}$$
$$\hat{\hat{\boldsymbol{\Lambda}}}_g = \mathrm{diag}(\hat{\hat{\boldsymbol{\Pi}}}^{(g)\prime}\mathbf{S}_g\hat{\hat{\boldsymbol{\Pi}}}^{(g)}) = \mathrm{diag}(\hat{\hat{\lambda}}_{g1}, \ldots, \hat{\hat{\lambda}}_{gp}), \tag{3.12}$$
$$\hat{\hat{\boldsymbol{\Sigma}}}_g = \hat{\hat{\boldsymbol{\Pi}}}^{(g)}\hat{\hat{\boldsymbol{\Lambda}}}_g\hat{\hat{\boldsymbol{\Pi}}}^{(g)\prime}, \quad g = 1, \ldots, G. \tag{3.13}$$
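A minimal Python sketch of the three-step approximation above is given below. It assumes the first q columns of the supplied eigenvector matrix are the common ones (the Step 1 reordering is left to the caller), and the helper name partial_cpc is hypothetical.

```python
import numpy as np

def partial_cpc(Pi_hat, S, q):
    """Approximate partial CPC(q) estimates from a full CPC solution,
    following the three steps above.

    Pi_hat : (p, p) orthogonal matrix of estimated common eigenvectors.
    S      : (G, p, p) sample covariance matrices.
    q      : number of common components to retain.
    """
    G, p, _ = S.shape
    Pi1, Pi2 = Pi_hat[:, :q], Pi_hat[:, q:]      # Step 1: split into common / specific parts
    Pi_g, Lam_g, Sigma_g = [], [], []
    for g in range(G):
        _, Q = np.linalg.eigh(Pi2.T @ S[g] @ Pi2)  # Step 2: diagonalize the specific part
        Pi2_g = Pi2 @ Q
        P = np.hstack([Pi1, Pi2_g])                # Step 3: group-specific eigenvector matrix
        lam = np.diag(P.T @ S[g] @ P)
        Pi_g.append(P)
        Lam_g.append(lam)
        Sigma_g.append(P @ np.diag(lam) @ P.T)
    return Pi_g, Lam_g, Sigma_g
```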

3.2 Krzanowski’s estimation of CPC

Krzanowski (1984) proposed a simple method for obtaining estimates of $\boldsymbol{\Pi}$ and $\boldsymbol{\Lambda}_g$ which can be computed with any standard computer package for PCA. Under the CPC assumption,
$$\boldsymbol{\Pi}'\bar{\boldsymbol{\Sigma}}\boldsymbol{\Pi} = G^{-1}(\boldsymbol{\Pi}'\boldsymbol{\Sigma}_1\boldsymbol{\Pi} + \cdots + \boldsymbol{\Pi}'\boldsymbol{\Sigma}_G\boldsymbol{\Pi}) = G^{-1}(\boldsymbol{\Lambda}_1 + \cdots + \boldsymbol{\Lambda}_G) = \bar{\boldsymbol{\Lambda}}, \tag{3.14}$$
where $\bar{\boldsymbol{\Lambda}}$ and $\boldsymbol{\Pi}$ are the collections of eigenvalues and eigenvectors of $\bar{\boldsymbol{\Sigma}}$. Estimates of $\boldsymbol{\Pi}$ can thus be obtained from a PCA of the sample counterpart $\bar{\mathbf{S}}$, i.e., $\bar{\mathbf{S}} = \mathbf{P}\bar{\mathbf{D}}\mathbf{P}'$, and estimates of $\boldsymbol{\Lambda}_g$ are established as $\mathbf{D}_g = \mathrm{diag}(\mathbf{P}'\mathbf{S}_g\mathbf{P})$. Note that any weighted sum (or product) of the marginal covariance matrices will have the same eigenvectors. For this estimator, it is natural to rank order the eigenvectors according to the eigenvalues of the arithmetic mean of the sample covariance matrices, i.e.,
$$\mathbf{p}_1'\bar{\mathbf{S}}\mathbf{p}_1 > \cdots > \mathbf{p}_p'\bar{\mathbf{S}}\mathbf{p}_p. \tag{3.15}$$

To obtain partial CPC estimates using this method, substitute $\mathbf{P}$ for the ML-estimated $\hat{\boldsymbol{\Pi}}$ and follow the procedure given above.

Clearly, this is not an exact solution to the ML maximization problem; thus, it will always produce a lower value of the likelihood function than the ML estimates.
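The computational simplicity of the method is evident from the following Python sketch (the helper name krzanowski_cpc is hypothetical): a single eigendecomposition of the arithmetic mean of the sample covariance matrices, with the eigenvectors ordered as in (3.15) and the group-specific eigenvalues read off as described above.

```python
import numpy as np

def krzanowski_cpc(S):
    """Krzanowski-style CPC estimation via a PCA of the mean covariance matrix.

    S : (G, p, p) array of sample covariance matrices.
    Returns P (estimated common eigenvectors, ordered as in (3.15)) and
    D (G x p matrix of group-specific eigenvalue estimates).
    """
    S_bar = S.mean(axis=0)                      # arithmetic mean of the S_g
    eigvals, P = np.linalg.eigh(S_bar)          # PCA of S_bar
    order = np.argsort(-eigvals)                # decreasing eigenvalues of S_bar
    P = P[:, order]
    D = np.array([np.diag(P.T @ S[g] @ P) for g in range(S.shape[0])])
    return P, D
```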

4 Identification of common eigenvector models

Comparing the ML functions under various models is one way of assessing which model better fits the data. The LR test statistic for comparing model A (the null hypothesis) to a hierarchically lower model B is
$$\chi^2(A \mid B) = \sum_{g=1}^{G} n_g \log\frac{|\hat{\boldsymbol{\Sigma}}_g(A)|}{|\hat{\boldsymbol{\Sigma}}_g(B)|}, \tag{4.1}$$
where the estimates $\hat{\boldsymbol{\Sigma}}_g(A)$ refer to model A and $\hat{\boldsymbol{\Sigma}}_g(B)$ to model B. $\chi^2$ is chi-square distributed with degrees of freedom equal to the difference in dimensionality between model A and model B. The covariance matrices are assumed to be unrelated if an overall test of equality is rejected. The hypotheses for such a test are
$$H_0: \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \cdots = \boldsymbol{\Sigma}_G, \qquad H_1: \text{unrelated covariance matrices}, \tag{4.2}$$
and the LR statistic is (Flury 1988)
$$\chi^2_{\mathrm{total}} = \chi^2(\text{equality} \mid \text{unrelated}) = \sum_{g=1}^{G} n_g \log\frac{|\mathbf{S}_p|}{|\mathbf{S}_g|}, \tag{4.3}$$
where $\mathbf{S}_p = \frac{1}{n}\sum_{g=1}^{G} n_g\mathbf{S}_g$ is the pooled covariance matrix. Flury (1988) showed that (4.3) can be decomposed into partial LR statistics, following the hierarchy levels, as
$$\chi^2_{\mathrm{total}} = \chi^2(\text{equality} \mid \text{proportionality}) + \chi^2(\text{proportionality} \mid \text{CPC}) + \chi^2(\text{CPC} \mid \text{CPC}(q)) + \chi^2(\text{CPC}(q) \mid \text{unrelated}). \tag{4.4}$$

The decomposition is summarized in Table 1.

Table 1 Decomposition of the log-likelihood ratio statistic into partial log-likelihood ratio statistics.

Flury further mentions that if one lets the data decide which model to use, hypothesis testing might not be appropriate. Instead, he suggested using the partial LR statistics descriptively and comparing their relative sizes after dividing them by the associated degrees of freedom. The value closest to one indicates the most appropriate model. Note that the statistic for unrelated covariance matrices cannot be calculated explicitly, which is a major shortcoming.
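For concreteness, the partial LR statistic (4.1) can be computed directly from the fitted covariance matrices of two nested models. The sketch below is illustrative only, with a hypothetical helper name lr_statistic; the commented example shows the overall statistic (4.3).

```python
import numpy as np

def lr_statistic(Sigma_A, Sigma_B, n):
    """Partial LR statistic (4.1) comparing model A (null) to a hierarchically
    lower model B.

    Sigma_A, Sigma_B : (G, p, p) arrays of fitted covariance matrices.
    n                : (G,) array of degrees of freedom n_g.
    """
    _, logdet_A = np.linalg.slogdet(Sigma_A)    # log|Sigma_g(A)| for each g
    _, logdet_B = np.linalg.slogdet(Sigma_B)
    return float(np.sum(np.asarray(n) * (logdet_A - logdet_B)))

# Example: chi^2(equality | unrelated) as in (4.3), with S a (G, p, p) array of
# sample covariance matrices and n a NumPy array of degrees of freedom per group:
#   S_pooled = (n[:, None, None] * S).sum(axis=0) / n.sum()
#   chi2_total = lr_statistic(np.broadcast_to(S_pooled, S.shape), S, n)
```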

To be able to compare all the possible models and to avoid the multiple testing problem, two alternative methods are also considered, namely the AIC and the BIC.

Proposed by Akaike (1973), the AIC is a criterion for selecting among models for a given set of data, essentially measuring the quality of each model for the given dataset. The AIC is defined as follows:
$$\mathrm{AIC}^* = -2\log(\text{maximum of the likelihood function}) + 2(\text{number of estimated parameters}). \tag{4.5}$$

The measure is closely related to ML estimation and penalizes models based on the number of estimated parameters. Unlike hypothesis testing, the AIC is not intended to find a "true" model; it compares the fit of competing models, and the model that attains the lowest value is considered to best fit the data.

Flury (1988) proposed a modified version of the AIC. Assume r competing models, let $k_1 < k_2 < \cdots < k_r$ denote the numbers of estimated parameters, and $L_1 \leq L_2 \leq \cdots \leq L_r$ the maxima of the likelihood functions for the r models. The modified AIC for model i is then
$$\mathrm{AIC}(i) = -2\log\frac{L_i}{L_r} + 2(k_i - k_1). \tag{4.6}$$

The model with the lowest AIC is selected.

Based on Bayesian arguments, Schwarz (1978) proposed an alternative to the AIC, known as the Schwarz criterion or the BIC, which is defined as
$$\mathrm{BIC}^* = -2\log L + k\log N, \tag{4.7}$$
where N is the number of observations and k is the number of estimated parameters. The model with the lowest BIC is selected. Notice that the BIC penalizes higher-dimensional models more than the AIC does (when $N \geq 8$).

Modifying the BIC in the same way as Flury did for the AIC, the BIC calculation becomes
$$\mathrm{BIC}(i) = -2\log\frac{L_i}{L_r} + (k_i - k_1)\log N. \tag{4.8}$$
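The modified criteria (4.6) and (4.8) are straightforward to compute once the maximized log-likelihoods and parameter counts of the competing models are available, as in the following Python sketch (the helper name modified_aic_bic is hypothetical).

```python
import numpy as np

def modified_aic_bic(loglik, k, N):
    """Flury-style modified AIC (4.6) and the analogous BIC (4.8).

    loglik : (r,) array of maximized log-likelihoods, ordered so that the last
             entry belongs to the richest model (largest likelihood).
    k      : (r,) array of the corresponding numbers of estimated parameters.
    N      : total number of observations.
    Returns (AIC, BIC) arrays; the model with the lowest value is selected.
    """
    loglik, k = np.asarray(loglik, float), np.asarray(k, float)
    dev = -2.0 * (loglik - loglik[-1])          # -2 log(L_i / L_r)
    aic = dev + 2.0 * (k - k[0])
    bic = dev + (k - k[0]) * np.log(N)
    return aic, bic
```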

5 Simulation study

A Monte Carlo simulation study was conducted to compare the performance of the two estimation methods. Groups of G = 2, 4, 10, and 50 are considered, with sample sizes of N = 50, 100, 500, 1,000, and 10,000 collected for p = 5, 10, 20, and 50 variables simulated from multivariate normal and chi-square distributions (with 2 and 10 degrees of freedom). Furthermore, autocorrelation between groups is introduced through an AR(1) process with the autocorrelation parameter set to $\phi = -0.9$, 0, and 0.9. Because the AR(1) process is a summation of variables, it is generally at least approximately normally distributed. To obtain autocorrelated data from a chi-square distribution, let
$$\mathbf{Y}_t = \phi\mathbf{Y}_{t-1} + \boldsymbol{\delta}_t, \tag{5.1}$$
where $\mathbf{Y}_t = \{Y_{tij}\}_{i=1,\ldots,N;\,j=1,\ldots,p}$, $\boldsymbol{\delta}_t \overset{\mathrm{iid}}{\sim} N(\mathbf{0}, \mathbf{I}_p)$, and $|\phi| < 1$. Further, let $\tilde{\mathbf{Y}}_t = (\tilde{\mathbf{y}}_{t1}, \ldots, \tilde{\mathbf{y}}_{tp}) = \mathbf{Y}_t/\sqrt{\sigma_Y^2}$, where $\sigma_Y^2 = \mathrm{Var}(Y_{tij}) = 1/(1-\phi^2)$. Thus, $\tilde{Y}_{tij}^2 \sim \chi^2_{(1)}$ independently of $\phi$, with autocorrelation $\rho_{\tilde{Y}_{tij}^2}(k) = \phi^{2k}$, and the sum of r independent $\chi^2_{(1)}$ variables is $\chi^2_{(r)}$-distributed (Holgersson 2005). To obtain normally distributed data, $\tilde{Y}_{tij}$ is not squared. To initialize the process, 40 burn-in iterations are executed.

For group g, let $\mathbf{X}_g = (\mathbf{x}_{g1}, \ldots, \mathbf{x}_{gp})$ denote the data matrix obtained from the AR(1) process above with t = g. When the independent observation vectors $\mathbf{x}_{gj}$ are $\chi^2_{(r)}$-distributed, they are normalized to have unit variance, i.e.,
$$\mathbf{z}_{gj} = \frac{\mathbf{x}_{gj}}{\sqrt{2r}}, \quad g = 1, \ldots, G; \; j = 1, \ldots, p. \tag{5.2}$$

Then, let $\mathbf{W}_g = \mathbf{Z}_g\boldsymbol{\Sigma}_g^{1/2}$, where $\mathbf{Z}_g = (\mathbf{z}_{g1}, \ldots, \mathbf{z}_{gp})$, $\boldsymbol{\Sigma}_g^{1/2} = \boldsymbol{\Pi}\boldsymbol{\Lambda}_g^{1/2}\boldsymbol{\Pi}'$, and $\boldsymbol{\Lambda}_g^{1/2} = \mathrm{diag}(\lambda_{g1}^{1/2}, \ldots, \lambda_{gp}^{1/2})$. For the normally distributed data, the normalization step (5.2) is skipped and $\mathbf{W}_g = \mathbf{X}_g\boldsymbol{\Sigma}_g^{1/2}$. For a sufficiently large sample size, the covariance matrix of $\mathbf{W}_g$ is approximately $\boldsymbol{\Sigma}_g$ (Pepler, Uys, and Nel 2016).
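The data-generating scheme above can be summarized in code. The following Python sketch is illustrative only and makes two explicit assumptions not spelled out in the text: the AR(1) chains are standardized to unit variance before use, and chi-square data with r degrees of freedom are formed by summing r independent squared chains. The helper name simulate_cpc_data is hypothetical, and the common eigenvectors Pi and group eigenvalues Lam (whose simulation is described in the next paragraph) are taken as inputs.

```python
import numpy as np

def simulate_cpc_data(Pi, Lam, N, phi=0.0, df=None, burn=40, seed=None):
    """Generate G groups of N x p data with covariance approx. Pi diag(Lam[g]) Pi'.

    Pi  : (p, p) common eigenvector matrix;  Lam : (G, p) group eigenvalues.
    phi : AR(1) parameter (|phi| < 1), acting between groups (t = g).
    df  : None for normal data, or r for chi-square data with r degrees of freedom.
    """
    rng = np.random.default_rng(seed)
    G, p = Lam.shape
    sd = np.sqrt(1.0 / (1.0 - phi ** 2))            # stationary sd of the AR(1) process

    def ar1_step(prev):
        return phi * prev + rng.standard_normal((N, p))

    reps = 1 if df is None else df                  # r independent chains for chi-square(r)
    chains = [np.zeros((N, p)) for _ in range(reps)]
    for _ in range(burn):                           # burn-in iterations
        chains = [ar1_step(Y) for Y in chains]

    W = []
    for g in range(G):
        chains = [ar1_step(Y) for Y in chains]      # one AR(1) step per group
        if df is None:
            Z = chains[0] / sd                      # standardized normal margins (assumption)
        else:
            X = sum((Y / sd) ** 2 for Y in chains)  # chi-square(df) margins
            Z = X / np.sqrt(2 * df)                 # normalization (5.2)
        Sigma_half = Pi @ np.diag(np.sqrt(Lam[g])) @ Pi.T
        W.append(Z @ Sigma_half)
    return W
```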

The underlying common eigenvectors ($\boldsymbol{\Pi}$) are obtained from $\boldsymbol{\Omega} = \mathbf{A}'\mathbf{A}/8$, where $\mathbf{A}: 8 \times p$ is simulated from a multivariate normal distribution with zero mean and identity covariance matrix. The group-specific eigenvalues ($\boldsymbol{\Lambda}_g$) are sorted in decreasing magnitude and simulated from $(0.5 + U)^2$, where $U \sim U(0, 1)$. Pepler (2014) showed that increased separation of the eigenvalues per group translates into improved performance in identifying a suitable common eigenvector structure. It also likely increases the accuracy of the two estimation methods in this simulation study because it simplifies the ordering of the eigenvectors. The performance of the estimation methods is measured using the p-normalized Frobenius norm
$$\|\hat{\boldsymbol{\Theta}} - \boldsymbol{\Theta}\|_F^p = \sqrt{\frac{1}{p}\,\mathrm{tr}\left[(\hat{\boldsymbol{\Theta}} - \boldsymbol{\Theta})'(\hat{\boldsymbol{\Theta}} - \boldsymbol{\Theta})\right]} \tag{5.3}$$
for matrices and the Euclidean vector norm
$$\|\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}\| = \sqrt{(\hat{\theta}_1 - \theta_1)^2 + \cdots + (\hat{\theta}_p - \theta_p)^2} \tag{5.4}$$
for vectors. The average Euclidean vector norm and the average p-normalized Frobenius norm across groups are denoted by $\|\cdot\|^{\mathrm{AVG}}$ and $\|\cdot\|_F^{p,\mathrm{AVG}}$, respectively.
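The two accuracy measures translate directly into code. The sketch below follows the reconstruction of (5.3) above, with the 1/p factor placed inside the square root; that placement is an assumption about the exact form of the normalization.

```python
import numpy as np

def p_frobenius(theta_hat, theta):
    """p-normalized Frobenius norm (5.3) for a p x p (or N x p) matrix difference."""
    diff = np.asarray(theta_hat, float) - np.asarray(theta, float)
    p = diff.shape[1]
    return float(np.sqrt(np.trace(diff.T @ diff) / p))

def euclidean(theta_hat, theta):
    """Euclidean vector norm (5.4)."""
    return float(np.linalg.norm(np.asarray(theta_hat, float) - np.asarray(theta, float)))
```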

For the results presented below Flury’s suggestion of ordering the eigenvectors according to the eigenvalues of the first group is adopted.

Each result is based on 5,000 simulations, and the benchmark parameter values are G = 4, N = 100, p = 10, and $\phi = 0$. The results for normally distributed data are summarized in Table 2, and the results for the chi-square distributions are summarized in Tables 3 and 4.

Table 2 Simulation results for normally distributed data.

Table 3 Simulation results for chi-square distributed data with 10 degrees of freedom.

Table 4 Simulation results for chi-square distributed data with 2 degrees of freedom.

First, the estimation of the eigenvector matrix and the group-specific eigenvalues is considered. Generally, both estimation methods perform best with multivariate normal data and worst with the chi-square data with 2 degrees of freedom. However, the performance differences between normal data and chi-square data with 10 degrees of freedom are mostly very small, which is not surprising since the chi-square distribution becomes more symmetric as the degrees of freedom increase. Krzanowski's estimation clearly outperforms the ML estimation for all distributions and for all simulated parameter value combinations. Moreover, the accuracy of both methods improves with increasing sample sizes and numbers of groups, and deteriorates with increasing dimensions. However, the results suggest that the overall precision of the estimation of $\boldsymbol{\Pi}$ increases when p increases. This result is counterintuitive and goes against general results found in multivariate applications. It could possibly be a consequence of the normalization constant for the Frobenius norm being selected as 1/p. The autocorrelation between groups has a negative effect on the accuracy of Krzanowski's estimation; the ML estimator of $\boldsymbol{\Pi}$ is not notably affected by autocorrelation, but the precision of its estimated eigenvalues is negatively affected. Krzanowski's estimator still performs better than the ML estimator for all parameter combinations.

The simulation study also investigates how well the covariance matrices are simultaneously diagonalized. No surprising results are found: the diagonalization using both methods works better with increasing sample sizes and worse with increasing numbers of groups and dimensions. In the presence of autocorrelation, the diagonalization works better, arguably because the covariance matrices are then more similar to each other. There is no large difference between the two estimation methods, but ML consistently diagonalizes the covariance matrices slightly better.

The ordering of the estimated CPCs can be expected to influence the simulation results; it is therefore interesting and important to investigate how the estimators perform when they are ordered according to the same rule. The results obtained when the ML estimator too is ordered by the eigenvalues of $\bar{\mathbf{S}}$ are presented in the Appendix (Tables A.10–A.12). This reordering of the ML estimator increased its accuracy, and it seems preferable to order the eigenvectors based on the eigenvalues of the arithmetic mean sample covariance matrix rather than the eigenvalues of the first group. The conclusions of this simulation study are the same as above, except that the autocorrelation now has a clear negative effect on the ML estimation of the eigenvalues and eigenvectors. One important observation is that for large sample sizes (roughly N > 500) the ML estimate of $\boldsymbol{\Pi}$ performs better than Krzanowski's estimate, most likely due to a slow convergence rate of the ML method. However, the difference is small and both methods work well.

6 Empirical analysis

CPCA is conducted for two datasets. The first dataset contains Swedish municipality-level innovation data collected for 18 variables over 11 years, and the other is one of the most widely used datasets in multivariate statistics, namely the Iris flower dataset, consisting of four variables collected for three different species of Iris flowers. The Iris data is not the most dynamic but has become almost classic for demonstrating multivariate methods and is readily available to any researcher (through standard software or the literature). It is included here for comparison with, and as a supplement to, Flury (1984) and Krzanowski (1984), who used this dataset as an example of a practical application of CPCA.

Each year in the innovation data is treated as a group, and the CPCA is conducted across years. In the Iris data, each species constitutes a group; thus, the CPCA is conducted between the different species of Iris flowers.

The two datasets serve the purpose of comparing the (dis)similarity of the two estimation methods. Although this could be addressed with simulations, there is always a pedagogical dimension in using real data, since the data structure cannot be adapted in any sense.

6.1 Data and variables

The innovation dataset consists of annual Swedish municipality-level data related to innovations, collected from 2000 to 2010. The dataset was used by Holgersson and Kekezi (2017) to develop a multidimensional innovation index able to reflect innovation enablers and outputs. PCA and CPCA are methods based on correlation or covariance matrices, which in turn are sensitive to outliers. The municipality of Stockholm was removed from the analysis since the city is of substantially larger size than other Swedish cities. All other municipalities are included. For the years 2000–2002 a total of 289 municipalities existed in Sweden; since 2003, there have been 290 municipalities. Normality was assessed via Q-Q plots and histograms. The variables are presented in Table 5. All variables, except average age and related variety, were log-transformed to ensure at least approximately normal marginal distributions.

Table 5 List of the innovation related variables included in the analysis.

Since the dataset consists of yearly data, some time dependence is present. The matter of time dependence is highly interesting, although no theoretical results are (to my knowledge) available. The nature of the CPC model lies in the simultaneous diagonalization, and hence any time dependence should occur in the eigenvalues, leaving the eigenvector space unaffected. Incorporating eigenvalue time dependence into the CPC model is no doubt a challenging problem and is not pursued in this empirical example. However, time dependence is considered in the simulation study in the previous section.

The Iris flower dataset was collected by Edgar Anderson and first published in Anderson (1935). Fisher (1936) used it in his article "The Use of Multiple Measurements in Taxonomic Problems" as an example of linear discriminant analysis, and it has since been used by several other authors.

The dataset consists of four variables: (1) sepal length, (2) sepal width, (3) petal length, and (4) petal width, and includes 50 samples from each of three Iris species (Iris Setosa, Iris Virginica, and Iris Versicolor).

6.2 Common principal component analysis

For the analysis of the innovation data, each year of municipal data constitutes a group, and the analysis is conducted across 11 years (2000–2010). The model selection procedure is based on the AIC, BIC, and LR statistics and is presented in Table 6, where the values marked with a star (*) indicate the best fit.

Table 6 Model selection for the innovation data (G=11,p=18).

The CPC(q) models evaluated for this dataset are based on the correlation matrix of the CPCs (described in Sec. A.2 in the Appendix) for each group. For the CPC assumption to be valid, the correlation matrices must be close to the identity matrix. The different CPC(q) models are constructed by stepwise deeming one of the CPCs with the highest correlations to be non-common, until all components with correlations exceeding 0.3 in absolute value are considered non-common throughout all groups. Of the two correlated components, the one that accounts for the least amount of variation is deemed non-common. The absolute correlation value of 0.3 is arbitrarily chosen but is argued not to be high from a practical point of view; e.g., Flury (1988, 99) stated the following regarding a highest CPC correlation of –0.23: "This correlation is almost negligible from a practical point of view, indicating that the CPC model fits the data well." In the CPC(14) model, all components with correlations exceeding 0.3 in absolute value have been deemed non-common; the remaining components thus all have negligible correlations, indicating that the model indeed fits the data well, and the more complicated CPC(q) models (i.e., q < 14) are therefore not included.

The model selected by the goodness-of-fit measures is the same regardless of which estimation technique is used. The LR statistic selects the CPC model, the AIC selects the partial CPC model with 14 common components (CPC(14)), and the BIC selects the proportional model as the best fit.

Tables A.5 and A.6 in the Appendix list the ML-estimated CPC coefficients (ordered according to (3.10)) as well as the differences between the two estimation techniques. The differences in estimated eigenvalues are almost nonexistent and are presented in Table A.7. To present the differences more clearly, the eigenvalue differences are multiplied by 100, and only integers are shown. The two estimation techniques produce very similar results.

The Iris dataset analysis is conducted across the three different Iris flower species. The model selection procedure is summarized in Table 7. For both estimation methods, the AIC selects the unrelated model and the BIC selects the CPC(2) model as the best fit. The LR statistic selects the CPC(2) model as the best fit for the ML estimation and the proportionality model for Krzanowski's estimation method. However, for Krzanowski's method, the LR statistics for the proportionality model (value 4.29) and the CPC(2) model (value 4.70) are close to each other. Note that the unrelated model is not testable with the LR statistic, which possibly overlooks the best-fitting model indicated by the AIC.

Table 7 Model selection for the Iris data (G=3,p=4).

The ML-estimated coefficients are presented in Table A.8 in the Appendix, as well as the differences (×100) in coefficient estimation. The differences in estimated eigenvalues are presented in Table A.9 in the Appendix. Some non-trivial differences in the CPC estimation can be seen, probably because the Iris data is not well suited to the CPC model, as can be seen from the correlation coefficients of the CPCs presented in Tables A.3 and A.4 in the Appendix.

The effectiveness of the simultaneous diagonalization for the two datasets is measured using the Frobenius norm and displayed in Table 8. The methods diagonalize the municipality data equally well, but the ML estimation diagonalizes the Iris data better than Krzanowski's method. The correlation coefficients among the CPCs for the Iris data are larger for Krzanowski's estimates, possibly indicating that the ML estimation performs better for this particular dataset.

Table 8 The effectiveness of the simultaneous diagonalization for the two datasets measured using the Frobenius norm.

7 Summary

The aim of this article was to compare two different estimation methods for the CPC model, namely the ML estimation (Flury 1984; Flury and Gautschi 1986; Flury 1988) and Krzanowski's estimation method (Krzanowski 1984). The methods were compared on two real-world datasets and in a Monte Carlo simulation study. The two real-world datasets were (i) annual Swedish municipality-level data related to innovations, collected from 2000 to 2010, and (ii) the Iris flower dataset. The AIC, BIC, and LR statistics were used to facilitate the model selection procedure for both methods. The Monte Carlo simulation study investigated the performance of the methods under different combinations of the number of groups, sample sizes, and variables from multivariate normal and chi-square distributions, with and without the presence of autocorrelation. The performance was evaluated by a p-normalized Frobenius norm and the Euclidean norm.

In the Monte Carlo study, both estimation methods performed best with normally distributed data, and they deteriorated as the data became more asymmetric. As expected, the accuracy of both methods improved with increasing sample sizes and numbers of groups, and worsened with increasing dimensions. Autocorrelation between the groups had a negative effect on the estimators. The ordering of the eigenvectors affected the simulation results to a large extent. When the ML-estimated eigenvectors too were ordered by the magnitude of the eigenvalues of the arithmetic mean sample covariance matrix rather than according to the eigenvalues of the first group, the accuracy increased substantially. This result indicates that it is preferable to order the eigenvectors according to the eigenvalues of the arithmetic mean sample covariance matrix rather than those of the first group (the ordering suggested by Flury). Krzanowski's estimator of the eigenvectors and eigenvalues outperformed the ML estimator for all parameter combinations when the sample size was smaller than roughly 500 (with p = 10). For larger sample sizes (keeping p = 10) the ML estimator performed better, indicating a slow convergence of the ML estimator. However, the differences between the two methods for large sample sizes were small and both methods performed well. The ML estimator did a slightly better job of simultaneously diagonalizing the covariance matrices, although the difference was very minor.

Both methods produced very similar results for the Swedish innovation data; the estimated coefficients, eigenvalues, and selection criteria only exhibited trivial differences. For the Iris data, AIC and BIC indicated the same model as the best fit regardless of the estimation method, while the LR statistic indicated different models depending on the estimation procedure. The estimated coefficients displayed nontrivial differences, possibly due to the data not having a clear CPC structure across groups.

To conclude, Krzanowski's estimation method is simple, readily available, and requires significantly less computational power, and it outperformed the ML estimation method in all simulations with small to moderate sample sizes. Even though the ML estimator performed better for large sample sizes, the difference was small. The simplicity and intuition of Krzanowski's estimation method, the computational power required for ML, and the findings in this article support and promote the use of Krzanowski's estimation method for CPC models over ML in situations where the CPC assumption is appropriate.

References

  • Akaike, H. 1973. “Information Theory and an Extension of the Maximum Likelihood Principle.” In 2nd International Symposium on Information Theory, 267–81.
  • Anderson, E. 1935. “The Irises of the Gaspe Peninsula.” Bulletin of American Iris Society 59:2–5.
  • Anderson, T. W. 1963. “Asymptotic Theory for Principal Component Analysis.” The Annals of Mathematical Statistics 34 (1):122–48. doi: 10.1214/aoms/1177704248.
  • Browne, R. P., and P. D. McNicholas. 2014. “Orthogonal Stiefel Manifold Optimization for Eigen-Decomposed Covariance Parameter Estimation in Mixture Models.” Statistics and Computing 24 (2):203–10. doi: 10.1007/s11222-012-9364-2.
  • Essletzbichler, J. 2007. “Diversity, Stability and Regional Growth in the United States, 1975–2002.” In Applied Evolutionary Economics and Economic Geography, edited by Koen Frenken, 203. Cheltenham: Edward Elgar.
  • Fisher, R. A. 1936. “The Use of Multiple Measurements in Taxonomic Problems.” Annals of Human Genetics 7 (2):179–88. doi: 10.1111/j.1469-1809.1936.tb02137.x.
  • Flury, B. N. 1984. “Common Principal Components in k Groups.” Journal of the American Statistical Association 79 (388):892–8. doi: 10.2307/2288721.
  • Flury, B. N. 1986. “Asymptotic Theory for Common Principal Component Analysis.” The Annals of Statistics 14 (2):418–30. doi: 10.1214/aos/1176349930.
  • Flury, B. N. 1988. Common Principal Components & Related Multivariate Models. New York: Wiley.
  • Flury, B. N., and W. Gautschi. 1986. “An Algorithm for Simultaneous Orthogonal Transformation of Several Positive Definite Symmetric Matrices to Nearly Diagonal Form.” SIAM Journal on Scientific and Statistical Computing 7 (1):169–84. doi: 10.1137/0907013.
  • Frenken, K., F. Van Oort, and T. Verburg. 2007. “Related Variety, Unrelated Variety and Regional Economic Growth.” Regional Studies 41 (5):685–97. doi: 10.1080/00343400601120296.
  • Girshick, M. 1939. “On the Sampling Theory of Roots of Determinantal Equations.” The Annals of Mathematical Statistics 10 (3):203–24. doi: 10.1214/aoms/1177732180.
  • Holgersson, T. 2005. “Simulation of Non-Normal Autocorrelated Variables.” Journal of Modern Applied Statistical Methods 5 (2):15. doi: 10.22237/jmasm/1162354440.
  • Holgersson, T., and O. Kekezi. 2017. “Towards a Multivariate Innovation Index.” Economics of Innovation and New Technology 27 (3):254–72. doi: 10.1080/10438599.2017.1331788.
  • Jolliffe, I. 2002. Principal Component Analysis. Hoboken, NJ: Wiley Online Library.
  • Krzanowski, W. 1984. “Principal Component Analysis in the Presence of Group Structure.” Journal of the Royal Statistical Society: Series C (Applied Statistics) 33 (2):164–8. doi: 10.2307/2347442.
  • Lawley, D. 1953. “A Modified Method of Estimation in Factor Analysis and Some Large Sample Results.” In Uppsala Symposium on Psychological Factor Analysis, Vol. 17, 35–42. Taylor & Francis.
  • Lawley, D. 1956. “Tests of Significance for the Latent Roots of Covariance and Correlation Matrices.” Biometrika 43 (1/2):128–36. doi: 10.2307/2333586.
  • Pepler, P. T. 2014. The Identification and Application of Common Principal Components. PhD thesis, Stellenbosch University, Stellenbosch.
  • Pepler, P. T., D. Uys, and D. Nel. 2016. “A Comparison of Some Methods for the Selection of a Common Eigenvector Model for the Covariance Matrices of Two Groups.” Communications in Statistics—Simulation and Computation 45 (8):2917–36. doi: 10.1080/03610918.2014.932801.
  • Pillai, K., S. Al-Ani, and G. Jouris. 1969. “On the Distributions of the Ratios of the Roots of a Covariance Matrix and Wilks’ Criterion for Tests of Three Hypotheses.” The Annals of Mathematical Statistics 40 (6):2033–40. doi: 10.1214/aoms/1177697283.
  • Rao, C. R. 1973. Linear Statistical Inference and Its Applications, Vol. 2. New York: Wiley.
  • Schwarz, G. 1978. “Estimating the Dimension of a Model.” The Annals of Statistics 6 (2):461–4. doi: 10.1214/aos/1176344136.

A Appendix

A.1 The algorithm for proportional covariance matrices

The following iterative algorithm was proposed by Flury (1988) to solve the likelihood equations for proportional covariance matrices.

  • Step 1: Define $\boldsymbol{\rho}_0 = (1, \rho_2, \ldots, \rho_G)'$ as an initial approximation and let $r_g = n_g/n$, where $n = n_1 + \cdots + n_G$; in this article, $\rho_g = 1$ for $g = 1, \ldots, G$ is used.

  • Step 2: Put
$$\mathbf{S} \leftarrow \sum_{g=1}^{G} \frac{r_g\mathbf{S}_g}{\rho_g}, \tag{A.1}$$
$$\boldsymbol{\pi}_1, \ldots, \boldsymbol{\pi}_p \leftarrow \text{normalized eigenvectors of } \mathbf{S}, \tag{A.2}$$
$$a_{gj} \leftarrow \boldsymbol{\pi}_j'\mathbf{S}_g\boldsymbol{\pi}_j, \quad g = 1, \ldots, G, \; j = 1, \ldots, p. \tag{A.3}$$

  • Step 3: Put
$$\lambda_j \leftarrow \sum_{g=1}^{G} \frac{r_g a_{gj}}{\rho_g}, \quad j = 1, \ldots, p. \tag{A.4}$$

  • Step 4: Put
$$\rho_g^* \leftarrow \frac{1}{p}\sum_{j=1}^{p} \frac{a_{gj}}{\lambda_j}, \quad g = 1, \ldots, G, \tag{A.5}$$
and
$$\rho_g \leftarrow \frac{\rho_g^*}{\rho_1^*}, \quad g = 1, \ldots, G. \tag{A.6}$$

  • Step 5: Put
$$\boldsymbol{\rho}_t \leftarrow (1, \rho_2, \ldots, \rho_G)'. \tag{A.7}$$

Steps 2–5 are repeated until $\|\boldsymbol{\rho}_t - \boldsymbol{\rho}_{t-1}\| < \epsilon$ for some predetermined vector norm, where $\boldsymbol{\rho}_t$ is the value of $\boldsymbol{\rho}$ obtained in the tth iteration. When convergence is reached, the current values of $\rho_g$, $\lambda_j$, and $\boldsymbol{\pi}_j$ are the ML estimates. Flury (1988) used the absolute value of the largest element as the vector norm, with $\epsilon = 10^{-4}$. In this study, the same vector norm is used and the value of $\epsilon$ is set to $10^{-6}$.
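A compact Python sketch of this iterative algorithm is given below (the helper name proportional_mle is hypothetical); it follows Steps 2–5 with the largest absolute change in $\rho$ as the convergence criterion.

```python
import numpy as np

def proportional_mle(S, n, rho_init=None, eps=1e-6, max_iter=200):
    """Iterative ML estimation under proportionality, Sigma_g = rho_g * Sigma_1.

    S : (G, p, p) sample covariance matrices; n : (G,) degrees of freedom.
    Returns rho (with rho_1 = 1), the eigenvalues lambda_j, and the eigenvectors.
    """
    S, n = np.asarray(S, float), np.asarray(n, float)
    G, p, _ = S.shape
    r = n / n.sum()
    rho = np.ones(G) if rho_init is None else np.asarray(rho_init, float)
    for _ in range(max_iter):
        rho_old = rho.copy()
        # Step 2: rho-weighted pooled matrix and its eigenvectors
        S_pool = sum(r[g] / rho[g] * S[g] for g in range(G))
        _, Pi = np.linalg.eigh(S_pool)
        a = np.array([np.diag(Pi.T @ S[g] @ Pi) for g in range(G)])   # a_{gj}
        # Step 3: eigenvalue update
        lam = (r[:, None] * a / rho[:, None]).sum(axis=0)
        # Step 4: proportionality constants, rescaled so that rho_1 = 1
        rho_star = (a / lam).mean(axis=1)
        rho = rho_star / rho_star[0]
        # Step 5: convergence check (largest absolute change)
        if np.max(np.abs(rho - rho_old)) < eps:
            break
    return rho, lam, Pi
```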

A.2 Sample covariances and correlations of sample CPCs

The sample common principal components are defined as
$$\mathbf{Y}_g = \mathbf{X}_g\hat{\boldsymbol{\Pi}}, \quad g = 1, \ldots, G, \tag{A.8}$$
with the associated sample covariance matrix $\mathbf{F}_g$ defined as
$$\mathbf{F}_g = \hat{\boldsymbol{\Pi}}'\mathbf{S}_g\hat{\boldsymbol{\Pi}}, \quad g = 1, \ldots, G. \tag{A.9}$$

At times it is more informative to inspect the correlation matrices,
$$\mathbf{R}_g = \hat{\boldsymbol{\Lambda}}_g^{-1/2}\mathbf{F}_g\hat{\boldsymbol{\Lambda}}_g^{-1/2}, \quad g = 1, \ldots, G, \tag{A.10}$$
where $\hat{\boldsymbol{\Lambda}}_g = \mathrm{diag}(\mathbf{F}_g)$. For the CPC assumption to hold, $\mathbf{R}_g$ must be close to $\mathbf{I}_p$ (Flury 1984).
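The quantities (A.9) and (A.10) are simple to compute from an estimated eigenvector matrix; the following Python sketch (hypothetical helper cpc_correlations) returns the CPC correlation matrices for all groups.

```python
import numpy as np

def cpc_correlations(S, Pi_hat):
    """Correlation matrices of the sample CPCs for each group; R_g close to the
    identity matrix supports the CPC assumption.

    S      : (G, p, p) sample covariance matrices.
    Pi_hat : (p, p) estimated common eigenvector matrix (ML or Krzanowski).
    """
    R = []
    for Sg in S:
        F = Pi_hat.T @ Sg @ Pi_hat                 # (A.9)
        d = 1.0 / np.sqrt(np.diag(F))
        R.append(d[:, None] * F * d[None, :])      # (A.10)
    return np.array(R)
```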

For the innovation data, the correlations with absolute values higher than 0.3 are presented for the two estimation methods in Tables A.1 and A.2, respectively. The correlation matrices for the Iris flower data are presented in their entirety in Tables A.3 and A.4.

Table A.1 Correlations with absolute values higher than 0.3 for the innovation data using the ML estimated CPC model.

Table A.2 Correlations with absolute values higher than 0.3 for the innovation data using the Krzanowski’s estimated CPC model.

Table A.3 Correlation matrix for the Iris flower data using the ML estimated CPC model.

Table A.4 Correlation matrix for the Iris flower data using the Krzanowski’s estimation CPC model.

Table A.5 The first nine ML-estimated CPC coefficients for the innovation data, consisting of 18 variables and 11 groups (years 2006–2010).

Table A.6 The last six ML-estimated CPC coefficients for the innovation data, consisting of 18 variables and 11 groups (years 2000–2010).

Table A.7 Estimated eigenvalue difference for the innovation data using the CPC model.

Table A.8 ML estimated CPC for the Iris data.

Table A.9 Estimated eigenvalue difference for the Iris flower data using the CPC model.

Table A.10 Simulation results for normally distributed data.

Table A.11 Simulation results for chi-square distributed data with 10 degrees of freedom.

Table A.12 Simulation results for chi-square distributed data with 2 degrees of freedom.