Research Article

Regularized estimation of Kronecker structured covariance matrix using modified Cholesky decomposition

Received 15 Dec 2022, Accepted 28 Nov 2023, Published online: 11 Dec 2023

Abstract

In this paper, we study a Kronecker structured model for covariance matrices when data are matrix-valued. Using the modified Cholesky decomposition for the Kronecker structured covariance matrix, we propose a regularized covariance estimator by imposing shrinkage and smoothing penalties on the Cholesky factors. A regularized flip-flop (RFF) algorithm is developed to produce a statistically efficient estimator for a large covariance matrix of matrix-valued data. Asymptotic properties are investigated, and the performance of the estimator is evaluated by simulations. The results are also applied to a real data example.

1. Introduction

Matrix-valued data are commonly encountered in many practical problems, such as environmental science [Citation1], image analysis [Citation2] and electrical engineering [Citation3], to mention a few, where the observations are collected along two different dimensions, for example spatial-temporal data with measurements at different locations observed repeatedly over time. The Kronecker product covariance matrix, which can be represented as a Kronecker product of two covariance matrices, plays an important role when separability of the covariance structure is assumed for the matrix-valued data. The term 'separability' comes from the assumption that the temporal correlation is the same for all locations and, similarly, the spatial correlation is the same for all time points (a Kronecker product covariance matrix is also called a separable covariance matrix). That is, for spatial-temporal data, the covariance matrix can be represented as a Kronecker product of a covariance matrix $\Sigma$ modelling the dependence over the locations and another covariance matrix $\Psi$ modelling the dependence over the time points.

Using a Kronecker product covariance matrix whenever possible is advantageous since, compared with a totally unstructured covariance matrix, it involves fewer unknown parameters while the separability property still captures the important dependence present in the data. More recently, Bucci et al. [Citation4] proposed a method of forecasting covariance matrices through a parametrization of the matrices. Yang and Kang [Citation5] introduced a matrix decomposition for the large banded covariance matrix. Jurek and Katzfuss [Citation6] investigated an algorithm based on a hierarchical Vecchia approximation to estimate both the covariance matrix and the precision matrix. It should be mentioned that, in some circumstances, it is reasonable to assume further structures on both $\Sigma$ and $\Psi$, which further reduces the number of unknown parameters; see Filipiak and Klein [Citation7], Hao et al. [Citation8], Szczepańska-Álvarez et al. [Citation9] for instance. However, the statistical inference of the Kronecker product covariance matrix has faced some challenges, especially regarding the estimation procedure. Among others, this type of structure has a non-identifiability issue, which means the estimators are not unique without imposing a restriction on the parameters. Srivastava et al. [Citation10] imposed a restriction on the matrix $\Psi$, i.e. $\psi_{11}=1$, and proved the uniqueness of the maximum likelihood estimators (MLEs) of $\Sigma$ and $\Psi$ obtained via the flip-flop algorithm.

Nowadays datasets are becoming increasingly large and may be high-dimensional, that is, the number of variables exceeds the sample size. In a high-dimensional situation, obtaining a stable and efficient estimator for the covariance matrix is difficult, which is one major obstacle in modelling covariance matrices. It was already pointed out in the 1950s by Stein [Citation11] that the MLE of the covariance matrix, the sample covariance matrix, is not a good estimator when the matrix dimension is large relative to the sample size. The other major obstacle is to preserve positive definiteness of covariance matrices. Many good methods have been developed over the years under the assumption of sparsity of the high-dimensional covariance matrix, for example the generalized thresholding method [Citation12] and the blocked thresholding method [Citation13]. However, these regularized methods do not guarantee that the covariance estimator is positive definite.

To overcome these obstacles, much attention in high-dimensional covariance estimation has been paid to the framework of the modified Cholesky decomposition (MCD). MCD is a completely unconstrained and interpretable reparameterization of a covariance matrix [Citation14,Citation15]. Bickel and Levina [Citation16] showed that banding the Cholesky factor produces a consistent estimator in the operator norm under weak assumptions on the covariance matrix. Huang et al. [Citation17] showed that shrinkage is very useful in providing a more stable estimate of a covariance matrix, especially when the dimension is high. Dai et al. [Citation18] proposed a new type of Mahalanobis distance based on a regularized estimator of the precision matrix.

Imposing only shrinkage regularization on the MCD ignores the dependence among neighbouring elements of the covariance matrix, which is a natural property for longitudinal data. In this paper, we adopt the idea of Tai [Citation19] and employ not only a shrinkage but also a smoothing penalty. The regularization scheme proposed by Dai et al. [Citation18] was motivated by two observations about the Cholesky factor. Firstly, the Cholesky factor is likely to have many off-diagonal elements that are zero or close to zero. Secondly, continuity is a natural property among neighbouring elements of the Cholesky factor for the covariance matrices of longitudinal data. Taking smoothness into account helps to provide more efficient covariance matrix estimates.

This paper aims to obtain regularized MLEs for the $p\times p$ matrix $\Sigma$ and the $q\times q$ matrix $\Psi$ that are computationally feasible when $p$ and $q$ are large, as well as to develop exploratory tools to investigate further structures of the column/row covariance matrices of the matrix-valued data. The novelty of this paper is that the separability assumption is taken into account and regularization techniques are incorporated when estimating the covariance of matrix-valued data using the MCD. A so-called regularized flip-flop (RFF) algorithm will be proposed in a later section.

The rest of the paper is organized as follows. In Section 2, we introduce the modified Cholesky decomposition and the extension when it is applied to the Kronecker product covariance structures as well as review the MLE procedure of this type of covariance structure. In Section 3, we propose regularized estimators of Σ and Ψ based on combining shrinkage and smoothing. The main theoretical results are presented in Section 4. A simulation study is conducted in Section 5 to evaluate the performance of the proposed estimators. In Section 6 the obtained results are illustrated with a real data example. Some discussions are given in Section 7.

Throughout the paper, we use the following notation. For a matrix $A$, $A^\top$ denotes the transpose, $\mathrm{tr}(A)$ the trace, $A^{-1}$ the inverse, $\det(A)$ the determinant, $\mathrm{diag}(A)$ the diagonalizing operator and $\mathrm{vec}(A)$ the vectorizing operator of $A$, respectively. Moreover, $\|A\|_F^2=\mathrm{tr}(A^\top A)$, $\|A\|=\lambda_{\max}^{1/2}(A^\top A)$, and $A\otimes B$ denotes the Kronecker product of $A$ and $B$; accordingly, $A^{\otimes 2}$ denotes the Kronecker product of $A$ with itself. $A_0$ denotes the true parameter of the underlying model. The operators $O(\cdot)$ and $o(\cdot)$ denote quantities of the stated order and of smaller order, respectively, and $O_P(\cdot)$ and $o_P(\cdot)$ denote the corresponding orders in probability.

2. Modified Cholesky decomposition of a Kronecker product covariance structure

In this section, we will first give a short review of MCD. Then we present the MLE of the Kronecker product covariance structure. The MLE is provided by using an iterative algorithm which is referred to as the ‘flip-flop algorithm’ [Citation20]. At the end of this section, the MCD of the Kronecker product covariance structure is given.

2.1. Modified Cholesky decomposition

Suppose that $Y$ is a random matrix consisting of $n$ independent $p$-dimensional random vectors with mean $0$ and covariance matrix $\Lambda$, i.e. $Y\sim N_{p,n}(0,\Lambda,I_n)$. With a covariance matrix $\Lambda$ of order $p$, the modified Cholesky decomposition [Citation14] of $\Lambda$ is specified by
(1) $T\Lambda T^\top=D$,
and thus $\Lambda=T^{-1}D(T^{-1})^\top$, where $T$ is a unique unit lower triangular matrix with ones on the main diagonal and $D$ is a unique diagonal matrix, i.e.
$$T=\begin{pmatrix}1&0&0&\cdots&0\\-\phi_{21}&1&0&\cdots&0\\-\phi_{31}&-\phi_{32}&1&\cdots&0\\\vdots&\vdots&&\ddots&\vdots\\-\phi_{p1}&-\phi_{p2}&\cdots&-\phi_{p(p-1)}&1\end{pmatrix}\quad\text{and}\quad D=\begin{pmatrix}\sigma_1^2&0&\cdots&0\\0&\sigma_2^2&\cdots&0\\\vdots&&\ddots&\vdots\\0&0&\cdots&\sigma_p^2\end{pmatrix}.$$
Based on Equation (1) and the property that normality is preserved under linear transformations, we have
(2) $TY=E,\qquad E\sim N_{p,n}(0,D,I_n)$,
so that Equation (2) can be represented as the following linear model:
(3) $y_t=\sum_{k=1}^{t-1}\phi_{tk}y_k+e_t,\qquad e_t\sim N(0,\sigma_t^2I_n),\quad t=2,\dots,p,$
where the $\phi_{tk}$ and $\sigma_t^2$ are the generalized autoregressive parameters and innovation variances, respectively. Equation (2) thus splits into $p$ independent simple linear regression models, and the MLEs of the unknown parameters in (3) are ordinary least squares (OLS) estimators. It can be seen that the MCD provides an alternative MLE procedure for $\Lambda$ through a sequential regression scheme, which reduces the challenge of estimating a covariance matrix.
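As an illustration of the sequential regression scheme in (3), the following sketch (added here for concreteness and not part of the paper; the function name mcd_ols and all numerical values are our own choices) estimates T and D by regressing each variable on its predecessors via OLS:

```python
import numpy as np

def mcd_ols(Y):
    """Modified Cholesky decomposition estimated by sequential OLS.

    Y : (n, p) data matrix (n observations of a p-dimensional mean-zero vector).
    Returns the unit lower-triangular T and diagonal D with T @ S @ T.T diagonal,
    where S is the sample covariance matrix.
    """
    n, p = Y.shape
    T = np.eye(p)
    d = np.empty(p)
    d[0] = Y[:, 0] @ Y[:, 0] / n                    # first innovation variance
    for t in range(1, p):
        X, y = Y[:, :t], Y[:, t]
        phi, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS regression coefficients
        resid = y - X @ phi
        T[t, :t] = -phi                              # T carries minus the coefficients
        d[t] = resid @ resid / n                     # innovation variance (MLE)
    return T, np.diag(d)

# quick check: T S T' is (numerically) diagonal
rng = np.random.default_rng(0)
L = np.linalg.cholesky(np.array([[1, .5, .25, .125],
                                 [.5, 1, .5, .25],
                                 [.25, .5, 1, .5],
                                 [.125, .25, .5, 1]]))
Y = rng.standard_normal((500, 4)) @ L.T
T, D = mcd_ols(Y)
S = Y.T @ Y / Y.shape[0]
print(np.round(T @ S @ T.T, 6))
```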

2.2. Maximum likelihood estimation

Let $X_i$ be a normally distributed random matrix and write
(4) $X_i\sim N_{p,q}(\mu,\Sigma,\Psi),\qquad i=1,\dots,n,$
where $\mu$ is a $p\times q$ unstructured mean, $\Sigma$ is a $p\times p$ unstructured covariance matrix between the rows of $X_i$ at any given column and $\Psi$ is a $q\times q$ unstructured covariance matrix between the columns of $X_i$ at any given row. Both $\Sigma$ and $\Psi$ are unknown, positive definite matrices. Without loss of generality, we consider the case $\mu=0$. Equivalently, the vectorized version of model (4) can be written as
(5) $\mathrm{vec}\,X_i\sim N_{pq}(0,\Psi\otimes\Sigma).$
How to estimate the covariance structure $\Psi\otimes\Sigma$ in (5) has been discussed intensively in the statistical literature, especially in the MLE framework, see, e.g. Galecki [Citation21], Lu and Zimmerman [Citation20], Manceur and Dutilleul [Citation22], Naik and Rao [Citation23], Roy and Khattree [Citation24]. The log-likelihood function of (5) satisfies
(6) $-2\ln L=c+n\ln[\det(\Psi\otimes\Sigma)]+\sum_{i=1}^n\mathrm{vec}(X_i)^\top(\Psi^{-1}\otimes\Sigma^{-1})\mathrm{vec}(X_i),$
where $c=npq\ln(2\pi)$. There is no explicit expression for the MLEs of $\Sigma$ and $\Psi$, because there is no analytical solution to the system of two matrix equations
(7) $nq\hat\Sigma=\sum_{i=1}^nX_i^c\hat\Psi^{-1}X_i^{c\top},\qquad np\hat\Psi=\sum_{i=1}^nX_i^{c\top}\hat\Sigma^{-1}X_i^c,$
where $X_i^c=X_i-\hat\mu$ and $\hat\mu=\frac1n\sum_{i=1}^nX_i$. An iterative algorithm, the flip-flop algorithm, is therefore required: it proceeds by iteratively updating $\Sigma$ and $\Psi$, and the iterations continue until some convergence criterion is fulfilled, e.g. the Euclidean distance between successive Kronecker product estimates falls below some given threshold. Regarding conditions on the sample size $n$ for the existence of the MLE of a Kronecker product covariance matrix, see Table 1 of Dutilleul [Citation25] for a summary. Nevertheless, for positive definiteness of the estimators of $\Sigma$ and $\Psi$, the sample size $n$ must be greater than both $p$ and $q$, giving $n>\max(p,q)$, as suggested by Srivastava et al. [Citation10].
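For concreteness, here is a minimal sketch of the flip-flop iteration solving (7), assuming centred observations stacked in an (n, p, q) array; the function name, tolerance and the way the identifiability restriction is re-imposed each iteration are our own choices:

```python
import numpy as np

def flip_flop(X, tol=1e-8, max_iter=100):
    """Unpenalized flip-flop MLE of (Sigma, Psi) for X[i] ~ N_{p,q}(0, Sigma, Psi).

    X : array of shape (n, p, q), already centred.
    """
    n, p, q = X.shape
    Sigma, Psi = np.eye(p), np.eye(q)
    prev = np.kron(Psi, Sigma)
    for _ in range(max_iter):
        Psi_inv = np.linalg.inv(Psi)
        Sigma = sum(Xi @ Psi_inv @ Xi.T for Xi in X) / (n * q)
        Sigma_inv = np.linalg.inv(Sigma)
        Psi = sum(Xi.T @ Sigma_inv @ Xi for Xi in X) / (n * p)
        c = Psi[0, 0]
        Psi = Psi / c          # identifiability restriction: psi_11 = 1
        Sigma = Sigma * c      # rescale so that Psi (x) Sigma is unchanged
        current = np.kron(Psi, Sigma)
        if np.linalg.norm(current - prev) < tol:   # distance between Kronecker estimates
            break
        prev = current
    return Sigma, Psi
```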

It is worth mentioning that there is an identifiability issue in $\Psi\otimes\Sigma$, since $(c\Psi)\otimes(c^{-1}\Sigma)=\Psi\otimes\Sigma$ for any constant $c$, see, e.g. Galecki [Citation21], Naik and Rao [Citation23]. Thus, for the purpose of identifiability, we fix the first element of $\Psi$ as $\psi_{11}=1$; in what follows, $\Psi$ is understood to satisfy this restriction.

2.3. MCD of a Kronecker product covariance structure

The estimating equations in (Equation7) will look different depending on the structures of Σ and Ψ. Szczepańska-Álvarez et al. [Citation9] studied four Kronecker product covariance structures by imposing different restrictions on the pair of matrices Σ, Ψ. The authors have obtained likelihood estimation equations for these covariance structures which give the possibility to utilize flip-flop type algorithms with clear stopping rules.

In this subsection, we consider MCD of the covariance matrix Λ=ΨΣ. The following theorem provides us with an important result concerning the parameter space of ΨΣ when the MCD is performed.

Theorem 2.1

Let $\Lambda=\Psi\otimes\Sigma$. The unique MCD of $\Lambda$, $T\Lambda T^\top=D$, has the following pattern:
$$T=T_2\otimes T_1,\qquad D=D_2\otimes D_1,$$
where $T_1=(\phi_{tj}):p\times p$ and $T_2=(\eta_{sk}):q\times q$ are two different unique unit lower triangular matrices, $1\le t\le p$, $1\le j\le t-1$, $1\le s\le q$, $1\le k\le s-1$. The matrices $D_1=\mathrm{diag}(\sigma_1^2,\dots,\sigma_p^2)$ and $D_2=\mathrm{diag}(1,\gamma_2^2,\dots,\gamma_q^2)$ are two different diagonal matrices.

Proof.

Since $T$ is a unique unit lower triangular matrix, it can be reformulated as $T=T_2\otimes T_1$, where $T_1$ and $T_2$ are also unit lower triangular matrices. Then we have
$$T\Lambda T^\top=(T_2\otimes T_1)(\Psi\otimes\Sigma)(T_2\otimes T_1)^\top=T_2\Psi T_2^\top\otimes T_1\Sigma T_1^\top=D_2\otimes D_1.$$
Under the identifiability restriction $\psi_{11}=1$, the $(1,1)$ entry of $D_2=T_2\Psi T_2^\top$ equals $\psi_{11}$, which implies $\gamma_1^2=1$. On the other hand, if $\gamma_1^2=1$, then, based on the fact that $\Psi=T_2^{-1}D_2(T_2^{-1})^\top$, where $T_2^{-1}$ is also a unit lower triangular matrix [Citation26], we have $\psi_{11}=1$, i.e. $\Psi$ satisfies the restriction.

The results in Theorem 2.1 reveal that the MCD of a Kronecker product covariance structure also has a Kronecker product pattern and that it is a one-to-one transformation of the original parameter space. Moreover, the identifiability restriction $\psi_{11}=1$ only affects the diagonal matrix $D_2$ and does not affect the matrices $T_1$ and $T_2$, which is important when we introduce the penalty terms in the next section.
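Theorem 2.1 can also be checked numerically; the following sketch (ours, not from the paper) computes the MCD of the Kronecker product and of the two factors separately and verifies that T = T2 ⊗ T1 and D = D2 ⊗ D1, with the first diagonal element of D2 equal to one under the restriction psi_11 = 1:

```python
import numpy as np

def mcd(Lam):
    """Modified Cholesky decomposition: returns (T, D) with T @ Lam @ T.T = D."""
    L = np.linalg.cholesky(Lam)           # Lam = L L', L lower triangular
    C = L * (1.0 / np.diag(L))            # unit lower-triangular Cholesky factor
    return np.linalg.inv(C), np.diag(np.diag(L) ** 2)

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3)); Sigma = A @ A.T + 3 * np.eye(3)
B = rng.standard_normal((2, 2)); Psi = B @ B.T + 2 * np.eye(2)
Psi = Psi / Psi[0, 0]                     # identifiability restriction: psi_11 = 1

T1, D1 = mcd(Sigma)
T2, D2 = mcd(Psi)
T,  D  = mcd(np.kron(Psi, Sigma))

print(np.allclose(T, np.kron(T2, T1)))    # True: T = T2 kron T1
print(np.allclose(D, np.kron(D2, D1)))    # True: D = D2 kron D1
print(np.isclose(D2[0, 0], 1.0))          # True: gamma_1^2 = 1
```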

3. Regularized estimation of Kronecker product covariance structure

Using Theorem 2.1 and the property that matrix-variate normality is preserved under linear transformations, model (4) gives $T_1X_iT_2^\top\sim N_{p,q}(0,D_1,D_2)$. With the facts $\Sigma=T_1^{-1}D_1(T_1^{-1})^\top$, $\Psi=T_2^{-1}D_2(T_2^{-1})^\top$, $\Psi^{-1}\otimes\Sigma^{-1}=(T_2\otimes T_1)^\top(D_2^{-1}\otimes D_1^{-1})(T_2\otimes T_1)$ and $(T_2\otimes T_1)\mathrm{vec}\,X_i=\mathrm{vec}(T_1X_iT_2^\top)$, and noting that the determinant of a unit triangular matrix equals one, the log-likelihood function given in (6) can be expressed as
(8) $-2\ln L(D_1,T_1,D_2,T_2)=c+n\ln[\det(D_2\otimes D_1)]+\sum_{i=1}^n\mathrm{vec}(X_i)^\top(T_2\otimes T_1)^\top(D_2^{-1}\otimes D_1^{-1})(T_2\otimes T_1)\mathrm{vec}(X_i)$
$=c+n\ln[\det(D_2\otimes D_1)]+\sum_{i=1}^n\mathrm{tr}\big(X_i^\top T_1^\top D_1^{-1}T_1X_iT_2^\top D_2^{-1}T_2\big)$
$=c+np\sum_{s=1}^q\ln\gamma_s^2+nq\sum_{t=1}^p\ln\sigma_t^2+\sum_{i=1}^n\mathrm{tr}\big[(T_1X_iT_2^\top)D_2^{-1}(T_1X_iT_2^\top)^\top D_1^{-1}\big].$
We will now discuss how to estimate $\Lambda=\Psi\otimes\Sigma$ when both $\Sigma$ and $\Psi$ have patterned structures (certain parsimonies) such as first-order autoregressive (AR(1)) and compound symmetry (CS). Parsimony in a covariance matrix leads to parsimony in the Cholesky factor, which corresponds to many zero or small regression coefficients. For example, for an AR(1) covariance matrix, there can be a considerable number of zeros in the subdiagonals of the $T$ matrix. In particular, the first subdiagonal of $T$ has the same entry in all places, and the remaining subdiagonals are either zero or contain values close to zero. For a CS covariance matrix, the elements within each row of the Cholesky factor are the same, and one can expect many small elements in the last few rows of the lower triangular part of $T$. One can observe that, with a specific pattern for $\Sigma$ and $\Psi$, the corresponding Cholesky factor components in Theorem 2.1, the matrices $T_m$, $m=1,2$, are potentially sparse or nearly sparse.
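The parsimony of the Cholesky factor under AR(1) and CS can be seen directly; in the sketch below (ours, with an arbitrary dimension and correlation), the T factor of an AR(1) matrix has a constant first subdiagonal and zeros elsewhere, while the T factor of a CS matrix has constant entries within each row below the diagonal:

```python
import numpy as np

def mcd_T(Lam):
    """Unit lower-triangular T with T @ Lam @ T.T diagonal (modified Cholesky)."""
    L = np.linalg.cholesky(Lam)
    return np.linalg.inv(L * (1.0 / np.diag(L)))

p, rho = 5, 0.5
ar1 = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))   # AR(1) correlation
cs  = np.full((p, p), rho) + (1 - rho) * np.eye(p)                   # compound symmetry

print(np.round(mcd_T(ar1), 3))  # only the first subdiagonal is nonzero (equal to -rho)
print(np.round(mcd_T(cs), 3))   # constant entries within each row below the diagonal
```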

In order to take into account both sparsity and smoothness, we apply the $L_1$ penalty, which shrinks the estimates of the $\phi_{tj}$'s and $\eta_{sk}$'s toward zero, and at the same time introduce smoothing penalties in the framework of penalized likelihood. The shrinkage penalties are given as follows:
(9) $\lambda_{sh,1}\sum_{t=2}^{p}\sum_{j=1}^{t-1}|\phi_{tj}|,\qquad \lambda_{sh,2}\sum_{s=2}^{q}\sum_{k=1}^{s-1}|\eta_{sk}|,$
where $\lambda_{sh,1}$ and $\lambda_{sh,2}$ are the shrinkage tuning parameters. The smoothing penalties have the forms
(10) $\lambda_{sm,1}\sum_{t=2}^{p-2}\sum_{j=1}^{t-1}\big(\Delta^2_{\mathrm{diag}}\phi_{t+2,t+2-j}\big)^2,\qquad \lambda_{sm,2}\sum_{s=2}^{q-2}\sum_{k=1}^{s-1}\big(\Delta^2_{\mathrm{diag}}\eta_{s+2,s+2-k}\big)^2,$
where $\lambda_{sm,1}$ and $\lambda_{sm,2}$ are the smoothing tuning parameters, and
$$\Delta^2_{\mathrm{diag}}\phi_{t+2,t+2-j}=(\phi_{t+2,t+2-j}-\phi_{t+1,t+1-j})-(\phi_{t+1,t+1-j}-\phi_{t,t-j}),$$
$$\Delta^2_{\mathrm{diag}}\eta_{s+2,s+2-k}=(\eta_{s+2,s+2-k}-\eta_{s+1,s+1-k})-(\eta_{s+1,s+1-k}-\eta_{s,s-k}),$$
are the second order differences of $T_1$ and $T_2$, respectively. Combining the penalties given in (9) and (10), the penalized likelihood function can be presented as
(11) $-2\log L+\lambda_{sh,1}\sum_{t=2}^{p}\sum_{j=1}^{t-1}|\phi_{tj}|+\lambda_{sm,1}\sum_{t=2}^{p-2}\sum_{j=1}^{t-1}\big(\Delta^2_{\mathrm{diag}}\phi_{t+2,t+2-j}\big)^2+\lambda_{sh,2}\sum_{s=2}^{q}\sum_{k=1}^{s-1}|\eta_{sk}|+\lambda_{sm,2}\sum_{s=2}^{q-2}\sum_{k=1}^{s-1}\big(\Delta^2_{\mathrm{diag}}\eta_{s+2,s+2-k}\big)^2,$
where $-2\log L$ is given in (8).
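To make the objective (11) concrete, the following sketch (our illustration; the constant c is omitted, the factors D1 and D2 are passed as vectors of variances, and all function names are ours) evaluates the penalized negative log-likelihood for given factors, with the smoothing penalty computed along the subdiagonals as in (10); since both penalties are sign-invariant, they can be evaluated directly on the entries of T, which equal minus the generalized autoregressive parameters:

```python
import numpy as np

def second_diff_subdiag(T):
    """Sum of squared second differences along each subdiagonal of T."""
    p = T.shape[0]
    total = 0.0
    for j in range(1, p):                       # j-th subdiagonal
        sub = np.diagonal(T, offset=-j)
        if sub.size >= 3:
            total += np.sum(np.diff(sub, n=2) ** 2)
    return total

def penalized_neg2loglik(X, T1, D1, T2, D2, lam_sh, lam_sm):
    """Objective (11) up to the constant c.

    X: (n, p, q) centred matrix observations; D1, D2: 1-D arrays of variances;
    lam_sh, lam_sm: pairs (lambda_{sh,1}, lambda_{sh,2}) and (lambda_{sm,1}, lambda_{sm,2}).
    """
    n, p, q = X.shape
    fit = 0.0
    for Xi in X:
        Z = T1 @ Xi @ T2.T                      # transformed observation
        fit += np.sum(Z ** 2 / np.outer(D1, D2))
    neg2ll = n * p * np.sum(np.log(D2)) + n * q * np.sum(np.log(D1)) + fit
    shrink = (lam_sh[0] * np.abs(np.tril(T1, -1)).sum()
              + lam_sh[1] * np.abs(np.tril(T2, -1)).sum())
    smooth = lam_sm[0] * second_diff_subdiag(T1) + lam_sm[1] * second_diff_subdiag(T2)
    return neg2ll + shrink + smooth
```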

The minimizer of (11) does not have a closed-form expression. Hence, we extend the flip-flop algorithm [Citation20,Citation27] to incorporate the $L_1$ penalties in (9) and the smoothing penalties in (10). We estimate the matrices $T_1,T_2,D_1,D_2$ based on the following steps.

Step 1: With fixed $T_2$ and $D_2$, update $T_1$ and $D_1$ by minimizing
$$\sum_{t=1}^{p}\Big(nq\log\sigma_t^2+\frac{\big\|u_t-\sum_{k=1}^{t-1}\phi_{tk}u_k\big\|^2}{\sigma_t^2}\Big)+\lambda_{sh,1}\sum_{t=2}^{p}\sum_{k=1}^{t-1}|\phi_{tk}|+\lambda_{sm,1}\sum_{t=2}^{p-2}\sum_{k=1}^{t-1}\big(\Delta^2_{\mathrm{diag}}\phi_{t+2,t+2-k}\big)^2.$$
Step 2: With fixed $T_1$ and $D_1$, update $T_2$ and $D_2$ by minimizing
$$\sum_{s=1}^{q}\Big(np\log\gamma_s^2+\frac{\big\|v_s-\sum_{k=1}^{s-1}\eta_{sk}v_k\big\|^2}{\gamma_s^2}\Big)+\lambda_{sh,2}\sum_{s=2}^{q}\sum_{k=1}^{s-1}|\eta_{sk}|+\lambda_{sm,2}\sum_{s=2}^{q-2}\sum_{k=1}^{s-1}\big(\Delta^2_{\mathrm{diag}}\eta_{s+2,s+2-k}\big)^2,$$
where $u_t$ is the $t$th row of the mode-1 unfolding of $U=[X_1:\cdots:X_n]\,T_2^\top D_2^{-1/2}$ and $v_s$ is the $s$th column of the mode-2 unfolding of $V=D_1^{-1/2}T_1\,[X_1:\cdots:X_n]$. The pseudo-code of the algorithm, the regularized flip-flop (RFF) algorithm, is described in Algorithm 1 in Appendix 3.
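The reduction used in Step 1, namely that with T2 and D2 fixed the fit term in (8) splits into p weighted regressions on the transformed data, can be verified numerically; the sketch below (ours, with arbitrary factor values) checks that the two forms of the fit term coincide:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 4, 3, 2
X = rng.standard_normal((n, p, q))

# arbitrary current values of the factors (for illustration only)
T1 = np.eye(p); T1[1, 0], T1[2, 0], T1[2, 1] = -0.4, 0.1, -0.3
T2 = np.eye(q); T2[1, 0] = -0.6
D1 = np.array([1.0, 0.8, 1.2]); D2 = np.array([1.0, 0.5])

# transformed data: each slice multiplied by T2' D2^{-1/2}, stacked side by side
U = np.hstack([Xi @ T2.T @ np.diag(D2 ** -0.5) for Xi in X])   # p x (n*q)

# fit term as in Step 1: p weighted regressions on preceding rows (phi = -T1 entries)
step1 = sum(np.sum((U[t] - (-T1[t, :t]) @ U[:t]) ** 2) / D1[t] for t in range(p))

# fit term as it appears in (8)
direct = sum(np.sum((T1 @ Xi @ T2.T) ** 2 / np.outer(D1, D2)) for Xi in X)

print(np.isclose(step1, direct))   # True: the two forms of the fit term agree
```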

4. Main results: theoretical convergence

This section shows the asymptotic properties of the proposed estimator. We start with some assumptions that are needed for the theoretical analysis of the penalized likelihood function.

Let s1 and s2 be the number of nonzero elements in the lower-triangular part of T1 and T2, respectively. Suppose that

(C1)

There exists a constant $d$ such that $0<1/d<\lambda_{\min}(\Sigma_0)\le\lambda_{\max}(\Sigma_0)<d<\infty$ and $0<1/d<\lambda_{\min}(\Psi_0)\le\lambda_{\max}(\Psi_0)<d<\infty$, where $\lambda_{\max}(\cdot)$ and $\lambda_{\min}(\cdot)$ are the maximum and minimum eigenvalues of the enclosed matrix, respectively.

(C2)

$s_1s_2\log(pq)/n=o(1)$ as $n\to\infty$.

(C3)

$\max\{s_1,s_2\}\,\max\{p,q\}\log(pq)/n=o(1)$ as $n\to\infty$.

Under the above assumptions, we can establish the following theorem.

Theorem 4.1

Suppose that Assumptions (C1) to (C3) hold. Let the shrinkage tuning parameters satisfy $\lambda_{sh,1}=O\big(\sqrt{s_2q\log(pq)/n}\big)$ and $\lambda_{sh,2}=O\big(\sqrt{s_1p\log(pq)/n}\big)$. Let the smoothing tuning parameters satisfy
$$\lambda_{sm,1}\sum_{t,j}\big|A_{tj,0}^{(1)}-B_{tj,0}^{(1)}\big|^2=O\big(qs_1s_2\log(pq)/n\big),\qquad \lambda_{sm,2}\sum_{s,k}\big|A_{sk,0}^{(2)}-B_{sk,0}^{(2)}\big|^2=O\big(ps_1s_2\log(pq)/n\big),$$
where $A_{tj,0}^{(1)}=\phi_{t+2,t+2-j,0}-\phi_{t+1,t+1-j,0}$, $B_{tj,0}^{(1)}=\phi_{t+1,t+1-j,0}-\phi_{t,t-j,0}$, $A_{sk,0}^{(2)}=\eta_{s+2,s+2-k,0}-\eta_{s+1,s+1-k,0}$, and $B_{sk,0}^{(2)}=\eta_{s+1,s+1-k,0}-\eta_{s,s-k,0}$. Then, we have
$$\|\hat T_1-T_{1,0}\|_F^2=O_P\big(s_1s_2\log(pq)/n\big),\qquad \|\hat T_2-T_{2,0}\|_F^2=O_P\big(s_1s_2\log(pq)/n\big),$$
$$\|\hat D_1-D_{1,0}\|_F^2=O_P\big(p\log(pq)/n\big),\qquad \|\hat D_2-D_{2,0}\|_F^2=O_P\big(q\log(pq)/n\big).$$

The proof of this theorem is relegated to Appendix 2. Theorem 4.1 gives convergence rates for the decomposed components of the covariance matrices $\Sigma$ and $\Psi$. This result suggests a narrow range for choosing the candidate values of $\lambda_{sh}$ and $\lambda_{sm}$. On the other hand, adding smoothing terms together with shrinkage avoids unnecessary disturbances that may be caused by a few irregular roughness values in the covariance matrices. This problem is not obvious in the estimation of the covariance matrix itself, but it can lead to difficulties when the covariance matrix needs to be inverted. Jiang [Citation28] considered the MCD of the covariance matrix $\Lambda$ without the Kronecker product of $\Psi$ and $\Sigma$, which gives the following convergence rates:
$$\|T-T_0\|_F^2=O_P\big\{[(p+s_1)(q+s_2)-pq]\log(pq)/n\big\},\qquad \|D-D_0\|_F^2=O_P\big(pq\log(pq)/n\big).$$
Theorem 4.1 implies that, when taking the Kronecker structure into account, the convergence rates are faster than the rates obtained by [Citation28].

5. Simulations

In this section, we conduct a simulation study in order to assess the finite-sample performance of the estimators proposed in Section 3.

We generate $X_i\sim N_{p,q}(\mu,\Sigma,\Psi)$, $i=1,\dots,n$, as defined in (4), from a matrix-normal distribution with $\mu=0$ and covariance matrices $\Sigma$ and $\Psi$. We assume that $\Sigma$ has an AR(1) structure with correlation coefficient $\rho\in\{0.3,0.5,0.7\}$ and that $\Psi$ has a CS structure with a correlation coefficient of 0.5. Additionally, we allow both covariance matrices to have different variances along the main diagonals, i.e. $\Sigma$ has a heterogeneous AR(1) structure and $\Psi$ a heterogeneous CS structure, denoted ARH(1) and CSH, respectively. The correlation matrix of ARH(1) is AR(1) and, correspondingly, the correlation matrix of CSH is CS.

The combinations of dimensions (n, p, q) are chosen to be (50,10,10), (10,30,30), (30,30,30), and (10,50,10). The first combination corresponds to the scenario where $n>\max\{p,q\}$, and the second combination to the scenario where $n<\min\{p,q\}$. The third and fourth scenarios correspond to $n=\min\{p,q\}$. The fourth scenario also allows us to test the performance of the estimator in an extreme case with $n=q\ll p$. For each combination, the number of replications is 1000. The estimator of interest is the $L_1$ penalized estimator with smoothing. As comparisons, we also include the unpenalized MLE, the $L_1$ penalized estimator without smoothing, and the estimator with only smoothing penalties. To select the tuning parameters, the candidate values of $\lambda_{sh}$ and $\lambda_{sm}$ are chosen from the interval $(0.1,10^6)$, increasing by a factor of 10 at each step. Five-fold cross-validation is used to find the optimal values of $\lambda_{sh}$ and $\lambda_{sm}$.
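A sketch of the data-generating step of this design (our own helper names; the ARH(1) and CSH standard deviations are arbitrary choices for illustration) is given below:

```python
import numpy as np

def arh1(sd, rho):
    """Heterogeneous AR(1): correlation rho^|i-j|, scaled by standard deviations sd."""
    p = len(sd)
    corr = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    return np.outer(sd, sd) * corr

def csh(sd, rho):
    """Heterogeneous compound symmetry: equal correlation rho off the diagonal."""
    q = len(sd)
    corr = np.full((q, q), rho) + (1 - rho) * np.eye(q)
    return np.outer(sd, sd) * corr

def rmatnorm(n, Sigma, Psi, rng):
    """Draw n matrices X_i ~ N_{p,q}(0, Sigma, Psi)."""
    Ls, Lp = np.linalg.cholesky(Sigma), np.linalg.cholesky(Psi)
    Z = rng.standard_normal((n, Sigma.shape[0], Psi.shape[0]))
    return Ls @ Z @ Lp.T                        # broadcasts over the sample axis

rng = np.random.default_rng(3)
Sigma = arh1(sd=np.linspace(1.0, 2.0, 10), rho=0.5)
Psi = csh(sd=np.linspace(1.0, 1.5, 10), rho=0.5)
X = rmatnorm(n=50, Sigma=Sigma, Psi=Psi, rng=rng)   # the (n, p, q) = (50, 10, 10) setting
```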

Let $\hat\Lambda$ be the estimated covariance matrix of $\Psi\otimes\Sigma$. In order to compare different estimators we use the following entropy and quadratic losses to assess the estimation error:
$$f_E(\Lambda,\hat\Lambda)=\mathrm{tr}(\Lambda^{-1}\hat\Lambda)-\ln\big[\det(\Lambda^{-1}\hat\Lambda)\big]-pq,\qquad f_Q(\Lambda,\hat\Lambda)=\mathrm{tr}\big[(\Lambda^{-1}\hat\Lambda-I)^2\big].$$
For each estimated covariance matrix, we compute the risk function, which is the mean of the loss across the 1000 replications. The simulation results are presented in Tables 1–3. The cases with the smallest risk are in bold face.
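The two loss functions can be computed as follows (a small sketch we add; function names are ours):

```python
import numpy as np

def entropy_loss(Lam, Lam_hat):
    """f_E = tr(Lam^{-1} Lam_hat) - log det(Lam^{-1} Lam_hat) - pq."""
    M = np.linalg.solve(Lam, Lam_hat)
    _, logdet = np.linalg.slogdet(M)
    return np.trace(M) - logdet - Lam.shape[0]

def quadratic_loss(Lam, Lam_hat):
    """f_Q = tr[(Lam^{-1} Lam_hat - I)^2]."""
    M = np.linalg.solve(Lam, Lam_hat) - np.eye(Lam.shape[0])
    return np.trace(M @ M)
```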

Table 1. Risk functions based on the entropy and quadratic loss for the corresponding estimated covariance matrices Λ^ with ρ=0.3.

Table 2. Risk functions based on the entropy and quadratic loss for the corresponding estimated covariance matrices Λ^ with ρ=0.5.

Table 3. Risk functions based on the entropy and quadratic loss for the corresponding estimated covariance matrices Λ^ with ρ=0.7.

Based on the estimated risks shown in the tables above, it is evident that shrinkage combined with smoothing along the subdiagonals outperforms the unpenalized MLE, shrinkage only, and smoothing only, for every combination of $(\rho,n,p,q)$ under both loss functions. The case $n>\max\{p,q\}$ is a classical set of dimensions, and the differences between the MLE and the penalized methods are relatively small since the MLE works satisfactorily when $n$ is large. It is interesting to see that combining the shrinkage $L_1$ penalty with smoothing performs best when $p$ is much larger than $n$ and $q$, and there the improvement is substantial. This setting is included with the consideration that both ratios $q/n=1$ and $n/p=0.2$ can be troublesome from the estimation point of view.

Using only shrinkage with the $L_1$ penalty works better than using only smoothing along the subdiagonals because there are many zero entries in the $T$ matrix, so shrinkage is preferable. For the Kronecker structured covariance with AR(1) and CS components, smoothing contributes less to the proposed combined method than shrinkage does. Smoothing the subdiagonals performs somewhat better than the MLE because the subdiagonals of $T$ are themselves smooth, and this approach takes the smoothness of $T$ into account, which the shrinkage method entirely ignores.

In general, combining smoothing and shrinkage gives satisfactory results for all settings of dimensions, while using shrinkage alone works considerably well in several cases. The reduction in bias is clear when we combine smoothing and shrinkage, or use only shrinkage, on both $\hat\Sigma$ and $\hat\Psi$, but the difference between these two methods is minor, especially when $n=p=q$. In that case, the shrinkage-only method is recommended due to its simplicity of implementation and computational efficiency. On the other hand, combining smoothing and shrinkage offers a handy tool when dealing with extreme situations such as $n=q\ll p$.

6. A real data example

In this section, we apply the RFF algorithm with the MCD to fit the correlation matrix between channels of electroencephalography (EEG) data which has been analysed in [Citation29]. The data set comes from a large study examining EEG correlates of genetic predisposition to alcoholism. EEG voltages are recorded at 256 Hz for 1 second following the presentation of a stimulus (see Figure 1). There are two groups of subjects: alcoholic and control. Each subject is exposed to either a single stimulus (S1) or two stimuli (S1 and S2), which are pictures of objects. The data contain 26 parietal and occipital channels with 5 trials (replications) from 10 alcoholic and 10 control subjects.

Figure 1. Presentation of recording on EEG voltages.

We calculate the means of the voltage series. The proposed method is applied to analyse correlations of the mean voltages among channels. Thus, there are two datasets, one per group, each of size p = 26, q = 5 and n = 10. In this application, we use a five-fold cross-validated log-likelihood criterion to choose the optimal values of the tuning parameters. The estimated covariance and correlation matrices are plotted in Figures 2–5, where the elements of the matrices are represented by the intensity of two colours. The bluer the colour, the closer the value of the element is to 1; the redder the colour, the closer the value is to −1.

Figure 2. Estimated correlation matrices for the alcoholic group using MLE.

Each figure consists of two parts: the left-hand side shows the elements of $\hat\Psi$ and the right-hand side shows $\hat\Sigma$. The MLEs of the covariance matrices are given in Figures 2 and 3 as a benchmark. In Figure 2, some non-zero elements remain beyond a band of near-zero elements in the off-diagonal areas, for example in the far upper-right and lower-left areas of $\hat\Sigma$. This reflects the sparsity issue for longitudinal data: intuitively, a longer separation implies a weaker connection between two time points. For practical purposes, one would like to remove the effect of noise with long-term memory in the repeated measurements in order to reduce the bias of the estimates. The same behaviour can be observed in the MLE of $\hat\Psi$ as well. The noise in $\hat\Sigma$ for the control group becomes more obvious as we move from the diagonal towards the upper-right and lower-left corners. The MCD-based estimates are given in Figures 4 and 5. The noise in both $\hat\Sigma$ and $\hat\Psi$ is shrunk and smoothed out in the corners further from the diagonal, while the information close to the diagonal is retained, as shown in those figures.

Figure 3. Estimated correlation matrices for the control group using MLE.

Figure 4. Estimated correlation matrices for the alcoholic group using the proposed method.

Figure 5. Estimated correlation matrices for the control group using the proposed method.

7. Conclusion

We propose a novel method to estimate the Kronecker structured covariance matrix of matrix-valued data. The matrix is decomposed by the MCD, and each component of the decomposition is treated separately. Shrinkage and smoothing techniques are employed to take the typical features of the Cholesky factors into account and to reduce the bias caused by the sparsity of large longitudinal data. The performance of the RFF algorithm is illustrated by both simulations and an empirical study, which show that the combination is useful. The theoretical results give the estimators' convergence rates, and upper bounds for the Kronecker product components obtained via the MCD are derived.

The proposed approach is developed using the MCD, which requires prior knowledge of the variable ordering. A natural ordering is typical for longitudinal data. The situation where the variables do not have a natural ordering among themselves is of great interest and will be studied separately.

In this article we only study the Kronecker estimator for matrix-valued data. One possible extension is to consider the Kronecker covariance structure for array data, i.e. Kronecker products of three or more component matrices, in which case model (5) becomes the array-normal distribution [Citation30,Citation31]. It is worthwhile to develop regularized methods further under this framework.

Acknowledgments

The authors thank the editor, the associate editor and two anonymous reviewers for their insightful comments which have resulted in significant improvement of this manuscript.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • Srivastava MS, von Rosen T, von Rosen D. Estimation and testing in general multivariate linear models with Kronecker product covariance structure. Sankhyā: Indian J Stat Ser A. 2009;71(2):137–163.
  • Dryden IL, Kume A, Le H, et al. A multi-dimensional scaling approach to shape analysis. Biometrika. 2008;95:779–798. doi: 10.1093/biomet/asn050
  • Werner K, Jansson M, Stoica P. On estimation of covariance matrices with Kronecker product structure. IEEE Trans Signal Process. 2008;56:478–491. doi: 10.1109/TSP.2007.907834
  • Bucci A, Ippoliti L, Valentini P. Comparing unconstrained parametrization methods for return covariance matrix prediction. Stat Comput. 2022;32:90. doi: 10.1007/s11222-022-10157-4
  • Yang W, Kang X. An improved banded estimation for large covariance matrix. Commun Stat – Theory Methods. 2023;52:141–155. doi: 10.1080/03610926.2021.1910839
  • Jurek M, Katzfuss M. Hierarchical sparse Cholesky decomposition with applications to high-dimensional spatio-temporal filtering. Stat Comput. 2022;32:15. doi: 10.1007/s11222-021-10077-9
  • Filipiak K, Klein D. Approximation with a Kronecker product structure with one component as compound symmetry or autoregression. Linear Algebra Appl. 2018;559:11–33. doi: 10.1016/j.laa.2018.08.031
  • Hao C, Liang Y, Mathew T. Testing variance parameters in models with a Kronecker product covariance structure. Stat Probab Lett. 2016;118:182–189. doi: 10.1016/j.spl.2016.06.027
  • Szczepańska-Álvarez A, Hao C, Liang Y, et al. Estimation equations for multivariate linear models with Kronecker structured covariance matrices. Commun Stat – Theory Methods. 2017;46:7902–7915. doi: 10.1080/03610926.2016.1165852
  • Srivastava MS, von Rosen T, Von Rosen D. Models with a Kronecker product covariance structure: estimation and testing. Math Methods Stat. 2008;17:357–370. doi: 10.3103/S1066530708040066
  • Stein C. Some problems in multivariate analysis. Stanford University, Department of Statistics; 1956. (Part I, Technical Report 6).
  • Rothman AJ, Levina E, Zhu J. Generalized thresholding of large covariance matrices. J Am Stat Assoc. 2009;104:177–186. doi: 10.1198/jasa.2009.0101
  • Cai TT, Yuan M. Adaptive covariance matrix estimation through block thresholding. Ann Statist. 2012;40:2014–2042.
  • Pourahmadi M. Joint mean-covariance models with applications to longitudinal data: unconstrained parameterisation. Biometrika. 1999;86:677–690. doi: 10.1093/biomet/86.3.677
  • Pourahmadi M. Maximum likelihood estimation of generalised linear models for multivariate normal covariance matrix. Biometrika. 2000;87:425–435. doi: 10.1093/biomet/87.2.425
  • Bickel PJ, Levina E. Covariance regularization by thresholding. Ann Statist. 2008;36:2577–2604.
  • Huang JZ, Liu N, Pourahmadi M, et al. Covariance matrix selection and estimation via penalised normal likelihood. Biometrika. 2006;93:85–98. doi: 10.1093/biomet/93.1.85
  • Dai D, Pan J, Liang Y. Regularized estimation of the Mahalanobis distance based on modified Cholesky decomposition. Commun Stat Case Stud Data Anal Appl. 2022;8(4):1–15.
  • Tai W. Regularized estimation of covariance matrices for longitudinal data through smoothing and shrinkage [PhD thesis]. Columbia University; 2009.
  • Lu N, Zimmerman DL. The likelihood ratio test for a separable covariance matrix. Stat Probab Lett. 2005;73:449–457. doi: 10.1016/j.spl.2005.04.020
  • Galecki AT. General class of covariance structures for two or more repeated factors in longitudinal data analysis. Commun Stat – Theory Methods. 1994;23:3105–3119. doi: 10.1080/03610929408831436
  • Manceur AM, Dutilleul P. Maximum likelihood estimation for the tensor normal distribution: algorithm, minimum sample size, and empirical bias and dispersion. J Comput Appl Math. 2013;239:37–49. doi: 10.1016/j.cam.2012.09.017
  • Naik DN, Rao SS. Analysis of multivariate repeated measures data with a Kronecker product structured covariance matrix. J Appl Stat. 2001;28:91–105. doi: 10.1080/02664760120011626
  • Roy A, Khattree R. On implementation of a test for Kronecker product covariance structure for multivariate repeated measures data. Stat Methodol. 2005;2:297–306. doi: 10.1016/j.stamet.2005.07.003
  • Dutilleul P. A note on necessary and sufficient conditions of existence and uniqueness for the maximum likelihood estimator of a Kronecker-product variance–covariance matrix. J Korean Stat Soc. 2021;50:607–614. doi: 10.1007/s42952-020-00066-5
  • Harville DA. Matrix algebra from a statistician's perspective; New York, NY: Springer; 1998.
  • Dutilleul P. The MLE algorithm for the matrix normal distribution. J Stat Comput Simul. 1999;64:105–123. doi: 10.1080/00949659908811970
  • Jiang X. Joint estimation of covariance matrix via Cholesky decomposition [PhD thesis]. National University of Singapore; 2012.
  • Sykacek P, Roberts SJ. Adaptive classification by variational Kalman filtering. Adv Neural Inf Process Syst. 2002;15:753–760.
  • Hoff PD. Separable covariance arrays via the tucker product, with applications to multivariate relational data. Bayesian Anal. 2011;6:179–196.
  • Leng C, Pan G. Covariance estimation via sparse Kronecker structures. Bernoulli. 2018;24:3833–3863. doi: 10.3150/17-BEJ980
  • Lam C, Fan J. Sparsistency and rates of convergence in large covariance matrix estimation. Ann Stat. 2009;37:4254. doi: 10.1214/09-AOS720

Appendices

Appendix 1.

Some lemmas

Lemma A.1

Suppose that the eigenvalues of $T_{1,0}$, $T_{2,0}$, $D_{1,0}$ and $D_{2,0}$ are bounded between $d^{-1}$ and $d$, and that $pq/n\in[0,1)$. Consider $\|T_1-T_{1,0}\|_F^2=O(s_1s_2\log(pq)/n)$ and $\|T_2-T_{2,0}\|_F^2=O(s_1s_2\log(pq)/n)$, such that Assumption (C2) holds. Then, for any $\epsilon>0$, there exist $C_{1,\epsilon}$ and $C_{2,\epsilon}$ such that
$$P\Big(\tfrac{1}{\sqrt{\log(pq)/n}}\max_{i,j}\big|\{(T_2\otimes T_1)(S-(\Psi_0\otimes\Sigma_0))(T_2\otimes T_1)^\top\}_{ij}\big|\le C_{1,\epsilon}\Big)>1-\epsilon,$$
$$P\Big(\tfrac{1}{\sqrt{\log(pq)/n}}\max_{i,j}\big|\{[S-(\Psi_0\otimes\Sigma_0)](T_{2,0}\otimes T_{1,0})(D_{2,0}^{-1}\otimes D_{1,0}^{-1})\}_{ij}\big|<C_{2,\epsilon}\Big)>1-\epsilon,$$
for all $n\ge 1$.

Proof of Lemma A.1.

Since the eigenvalues of $T_{1,0}$ and $T_{2,0}$ are bounded, $\|T_{2,0}\otimes T_{1,0}\|$ is also bounded. Then,
$$\|T_2\otimes T_1-T_{2,0}\otimes T_{1,0}\|=\|T_{2,0}\otimes\Delta_T^{(1)}+\Delta_T^{(2)}\otimes T_{1,0}+\Delta_T^{(2)}\otimes\Delta_T^{(1)}\|\le\|T_{2,0}\otimes\Delta_T^{(1)}\|+\|\Delta_T^{(2)}\otimes T_{1,0}\|+\|\Delta_T^{(2)}\otimes\Delta_T^{(1)}\|$$
$$=\|\Delta_T^{(1)}\|\,\|T_{2,0}\|+\|T_{1,0}\|\,\|\Delta_T^{(2)}\|+\|\Delta_T^{(1)}\|\,\|\Delta_T^{(2)}\|\le\|\Delta_T^{(1)}\|_F\,\|T_{2,0}\|+\|T_{1,0}\|\,\|\Delta_T^{(2)}\|_F+\|\Delta_T^{(1)}\|\,\|\Delta_T^{(2)}\|.$$
By Assumption (C2), $\|T_2\otimes T_1-T_{2,0}\otimes T_{1,0}\|=o_P(1)$. Hence, by Lemma 3 in [Citation32], we have
$$\max_{i,j}\big|\{(T_2\otimes T_1)(S-(\Psi_0\otimes\Sigma_0))(T_2\otimes T_1)^\top\}_{ij}\big|=O_P\big(\sqrt{\log(pq)/n}\big),$$
which establishes the first claim in this lemma.

Note that
$$\max_{i,j}\big|\{[S-(\Psi_0\otimes\Sigma_0)](T_{2,0}\otimes T_{1,0})(D_{2,0}^{-1}\otimes D_{1,0}^{-1})\}_{ij}\big|=\max_{i,j}\big|\{(D_{2,0}^{-1}\otimes D_{1,0}^{-1})(D_{2,0}\otimes D_{1,0})[S-(\Psi_0\otimes\Sigma_0)](T_{2,0}\otimes T_{1,0})(D_{2,0}^{-1}\otimes D_{1,0}^{-1})\}_{ij}\big|$$
$$=\max_{i,j}\left|\frac{\{(D_{2,0}\otimes D_{1,0})[S-(\Psi_0\otimes\Sigma_0)](T_{2,0}\otimes T_{1,0})\}_{ij}}{(D_{2,0}\otimes D_{1,0})_{ii}(D_{2,0}\otimes D_{1,0})_{jj}}\right|\le d^4\max_{i,j}\big|\{(D_{2,0}\otimes D_{1,0})[S-(\Psi_0\otimes\Sigma_0)](T_{2,0}\otimes T_{1,0})\}_{ij}\big|.$$
Hence, by Lemma 3 in [Citation32], we have
$$\max_{i,j}\big|\{(D_{2,0}\otimes D_{1,0})[S-(\Psi_0\otimes\Sigma_0)](T_{2,0}\otimes T_{1,0})\}_{ij}\big|=O_P\big(\sqrt{\log(pq)/n}\big).$$
Consequently,
$$\max_{i,j}\big|\{[S-(\Psi_0\otimes\Sigma_0)](T_{2,0}\otimes T_{1,0})(D_{2,0}^{-1}\otimes D_{1,0}^{-1})\}_{ij}\big|\le d^4\,O_P\big(\sqrt{\log(pq)/n}\big),$$
which establishes the second claim in this lemma.

Lemma A.2

For the lower triangular matrices $T_{2,0}$ and $T_{1,0}$ in the MCD,
$$\big\|(T_{2,0}+\Delta_T^{(2)})\otimes(T_{1,0}+\Delta_T^{(1)})-T_{2,0}\otimes T_{1,0}\big\|_F^2\ge q\|\Delta_T^{(1)}\|_F^2+p\|\Delta_T^{(2)}\|_F^2.$$

Proof of Lemma A.2.

Note that the diagonal entries of $T_2$ and $T_{2,0}$ are all 1's. Then, for $i=1,\dots,q$, the $(i,i)$th block of $(T_{2,0}+\Delta_T^{(2)})\otimes(T_{1,0}+\Delta_T^{(1)})-T_{2,0}\otimes T_{1,0}$ is simply $\Delta_T^{(1)}$. When $i\ne j$, the $(i,j)$th block matrix is $T_{2,ij}(T_{1,0}+\Delta_T^{(1)})-T_{2,0,ij}T_{1,0}$. Its diagonal entries are all $T_{2,ij}-T_{2,0,ij}$, yielding $\|T_{2,ij}(T_{1,0}+\Delta_T^{(1)})-T_{2,0,ij}T_{1,0}\|_F^2\ge p(T_{2,ij}-T_{2,0,ij})^2$. Hence,
$$\big\|(T_{2,0}+\Delta_T^{(2)})\otimes(T_{1,0}+\Delta_T^{(1)})-T_{2,0}\otimes T_{1,0}\big\|_F^2\ge\sum_{i=1}^q\|\Delta_T^{(1)}\|_F^2+\sum_{i\ne j}\|T_{2,ij}(T_{1,0}+\Delta_T^{(1)})-T_{2,0,ij}T_{1,0}\|_F^2\ge q\|\Delta_T^{(1)}\|_F^2+p\|\Delta_T^{(2)}\|_F^2,$$
which establishes the lemma.

Appendix 2.

Proof of Theorem 4.1

In this section, we will prove Theorem 4.1. Our proof by and large follows the steps in [Citation28], but takes the Kronecker product structure of the MCD into account.

Under Assumption (C1), Lemma 2 of Jiang [Citation28] implies that
$$0<1/d<\psi_{\min}(T_{i,0})\le\psi_{\max}(T_{i,0})<d<\infty,\qquad 0<1/d<\lambda_{\min}(D_{i,0})\le\lambda_{\max}(D_{i,0})<d<\infty,$$
for $i=1,2$, where $\psi(\cdot)$ denotes a singular value of $T$. Set $E_i=D_i^{-1}$, $E_{i,0}=D_{i,0}^{-1}$, $i=1,2$. Since the eigenvalues of $D_{i,0}$ are bounded, we also have $0<1/d<\lambda_{\min}(E_{i,0})\le\lambda_{\max}(E_{i,0})<d<\infty$. Let $\log L(D_1,T_1,D_2,T_2)$ be the log-likelihood, and let $G(\Delta_D^{(1)},\Delta_T^{(1)},\Delta_D^{(2)},\Delta_T^{(2)})$ be 2 times the difference between the penalized likelihood function evaluated at the estimated covariance matrix and at the true covariance matrix, where $D_1=D_{1,0}+\Delta_D^{(1)}$, $T_1=T_{1,0}+\Delta_T^{(1)}$, $D_2=D_{2,0}+\Delta_D^{(2)}$, and $T_2=T_{2,0}+\Delta_T^{(2)}$. Consider
$$A_1=\{\Delta_T^{(1)}:\|\Delta_T^{(1)}\|_F^2\le U_1^2\,s_1s_2\log(pq)/n\},\qquad A_2=\{\Delta_T^{(2)}:\|\Delta_T^{(2)}\|_F^2\le U_2^2\,s_1s_2\log(pq)/n\},$$
$$B_1=\{\Delta_D^{(1)}:\|\Delta_D^{(1)}\|_F^2\le V_1^2\,p\log(pq)/n\},\qquad B_2=\{\Delta_D^{(2)}:\|\Delta_D^{(2)}\|_F^2\le V_2^2\,q\log(pq)/n\}.$$
We will show that, for every $\Delta_T^{(1)}$, $\Delta_T^{(2)}$, $\Delta_D^{(1)}$, and $\Delta_D^{(2)}$ on the boundaries of the respective sets $A_1$, $A_2$, $B_1$, and $B_2$, the probability $P\big(G(\Delta_D^{(1)},\Delta_T^{(1)},\Delta_D^{(2)},\Delta_T^{(2)})>0\big)$ tends to 1 for sufficiently large $U_1,U_2$ and $V_1,V_2$.

Define ΔE(1,2)=E2E1E2,0E1,0, ΔT(1,2)=T2T1T2,0T1,0, and ΔD(1,2)=D2D1D2,0D1,0. Let S be the sample covariance matrix of size pq×pq. Then, 2 times the difference between the unpenalized likelihood function of the estimated covariance matrix and the true covariance matrix can be expressed as 2(logL(D1,T1,D2,T2)logL(D1,0,T1,0,D2,0,T2,0))=log|D2D1|+tr{S(T2T1)(D21D11)(T2T1)}log|D2,0D1,0|tr{S(T2,0T1,0)(D2,01D1,01)(T2,0T1,0)}=tr{[E2,0E1,0]1ΔE(1,2)}+01(1v)vec(ΔE(1,2))[E2,0E1,0+vΔE(1,2)]2vec(ΔE(1,2))dv+tr{S(T2T1)(D21D11)(T2T1)}tr{S(T2,0T1,0)(D2,01D1,01)(T2,0T1,0)}=01(1v)vec(ΔE(1,2))[E2,0E1,0+vΔE(1,2)]2vec(ΔE(1,2))dvtr{[D2,0D1,0]ΔE(1,2)}+tr{S(T2T1)(D21D11)(T2T1)}tr{S(T2,0T1,0)(D2,01D1,01)(T2,0T1,0)},where we have used Taylor expansion with the integral form of the remainder in the second equality.

We can partition G(ΔD(1),ΔT(1),ΔD(2),ΔT(2)) as G(ΔD(1),ΔT(1),ΔD(2),ΔT(2))=M1+M2+M3+S1+S2+S3,where M1=01(1v)vec(ΔE(1,2))[E2,0E1,0+vΔE(1,2)]2vec(ΔE(1,2))dv,M2=tr{ΔE(1,2)(T2T1)(SΨ0Σ0)(T2T1)},M3=tr{(D2,01D1,01)[(T2T1)S(T2T1)(T2,0T1,0)S(T2,0T1,0)]}+tr{ΔE(1,2)(T2T1)(Ψ0Σ0)(T2T1)}tr{(D2,0D1,0)ΔE(1,2)},S1=λsh,1t=2pk=1t1|T1,tk|+λsh,2s=2qk=1s1|T2,sk|λsh,1t=2pk=1t1|T1,tk,0|λsh,2s=2qk=1s1|T2,sk,0|,S2=λsm,1t=2p2j=1t1[(ϕt+2,t+2jϕt+1,t+1j)(ϕt+1,t+1jϕt,tj)]2λsm,1t=2p2j=1t1[(ϕt+2,t+2j,0ϕt+1,t+1j,0)(ϕt+1,t+1jϕt,tj,0)]2,S3=λsm,2s=2q2k=1s1[(ηs+2,s+2kηs+1,s+1k)(ηs+1,s+1kηs,sk)]2λsm,2s=2q2k=1s1[(ηs+2,s+2k,0ηs+1,s+1k,0)(ηs+1,s+1k,0ηs,sk,0)]2.We consider these terms separately.

A2.1. The term M1

Using triangular inequality, we have E2,0E1,0+vΔE(1,2)E2,0E1,0+vΔE(1,2), where ΔE(1,2)F=o(1) by Assumption (C3). Since E2,0E1,0 is bounded from above and below, for a sufficiently large n, vΔE(1,2) is dominated by E2,0E1,0 for any v[0,1]. That is saying, (A1) E2,0E1,0+vΔE(1,2)E2,0E1,0+vΔE(1,2)2E2,0E1,02d2.(A1) Therefore, we have M101(1v)λmin{[E2,0E1,0+vΔE(1,2)]2}vec(ΔE(1,2))vec(ΔE(1,2))dv01(1v)λmin{[E2,0E1,0+vΔE(1,2)]2}vec(ΔE(1,2))22dv=01(1v)E2,0E1,0+vΔE(1,2)2ΔE(1,2)F2dv18d4ΔE(1,2)F2,where the last inequality holds because of (EquationA1). It is straightforward to see that (A2) ΔE(1,2)F2=i,j(1σi2γj21σi,02γj,02)2=i,j(σi,02γj,02σi2γj2σi2γj2σi,02γj,02)2d4i,j(σi,02γj,02σi2γj2)2=d4ΔD(1,2)F2,(A2) where σi2 is the ith diagonal element of D1, γj2 is the jth diagonal element of D2, and the subscript ‘0’ indicates the true value of the element. Hence M1ΔD(1,2)F2/(8d8).

A2.2. The term M2

For M2, |M2|=|tr{ΔE(1,2)(T2T1)(SΨ0Σ0)(T2T1)}|=|i=1pqΔE,i(1,2)[(T2T1)(SΨ0Σ0)(T2T1)]ii|(maxi,j|[(T2T1)(SΨ0Σ0)(T2T1)]ij|)(i|ΔE,i(1,2)|).By Lemma A.1, for a sufficiently large n, there exists a C1,ϵ such that 1ϵ<P(1log(pq)/n|M2|i|ΔE,i(1,2)|C1,ϵ)P(|M2|d4ΔD(1,2)Fpqlog(pq)/nC1,ϵ),where the second inequality holds since i|ΔEi(1,2)|pqΔE(1,2)Fd2pqΔD(1,2)F.

A2.3. The term M3

M3 can be partitioned into L1+L2+L3, where L1=tr{(D2,01D1,01)(T2,0T1,0)[SΨ0Σ0](ΔT(1,2))}+tr{(D2,01D1,01)ΔT(1,2)[SΨ0Σ0](T2,0T1,0)},L2=tr{(D2,01D1,01)ΔT(1,2)[SΨ0Σ0](ΔT(1,2))},L3=tr{(D21D11)ΔT(1,2)(Ψ0Σ0)(ΔT(1,2))}.We will consider L1, L2, and L3 separately. For L1, |L1||tr(D2,01D1,01)(T2,0T1,0)[SΨ0Σ0](ΔT(1,2))|+|tr{(D2,01D1,01)ΔT(1,2)[SΨ0Σ0](T2,0T1,0)}|=2|tr{(D2,01D1,01)ΔT(1,2)[SΨ0Σ0](T2,0T1,0)}|=2|i,jΔT,ij(1,2)[(SΨ0Σ0)(T2,0T1,0)(D2,01D1,01)]ji|2(i,j|ΔT,ij(1,2)|)maxi,j|[(SΨ0Σ0)(T2,0T1,0)(D2,01D1,01)]ji|.Hence, by Lemma A.1, for a sufficiently large n, there exists a C2,ϵ such that with probability greater than 1ϵ we have |L1|2(i,j|ΔT,ij(1,2)|)log(pq)/nC2,ϵ.Let Zi={j,k} be the set of indices such that Ti,jk,0 is nonzero. Then, with a probability of greater than 1ϵ, |L1|2(ijZ1CorskZ2C|T1,ij||T2,sk|)log(pq)/nC2,ϵ+2s1s2ΔT(1,2)Flog(pq)/nC2,ϵ.For L2, |L2|=|vec[ΔT(1,2)]{[SΨ0Σ0]D2,01D1,01}vec[ΔT(1,2)]|[SΨ0Σ0](D2,01D1,01)ΔT(1,2)F2,where the inequality holds from the result that the maximum value of xAx with a symmetric matrix A and a unit vector x is λmax(A). Hence, |L2|SΨ0Σ0D2,01D1,01ΔT(1,2)F2=oP(1)ΔT(1,2)F2,where the equality holds since the proof of Lemma 3 in [Citation32] shows that S(Ψ0Σ0)=oP(1). For L3, L3=vec(ΔT(1,2))[Ψ0Σ0D21D11]vec(ΔT(1,2))λmin(Ψ0Σ0)λmin(D21D11)ΔT(1,2)F2=1d4ΔT(1,2)F2,where the inequality holds from the result that the minimum value of xAx with a symmetric matrix A and a unit vector x is λmin(A). Hence, L2 is dominated by L3, such that L3|L2|>0 for a sufficiently large n.

A2.3.1. The shrinkage penalty term

The shrinkage penalty term is S1=Q1+Q2+Q3, where Q1=λsh,1t,kZ1c|T1,tk|+λsh,2s,kZ2c|T2,sk|,Q2=λsh,1t,kZ1|T1,tk|λsh,1t,kZ1|T1,tk,0| and Q3=λsh,2s,kZ2|T2,sk|λsh,2s,kZ1|T2,sk,0|. It is easy to see that Q1>0. For Q2, |Q2|λsh,1t,kZ1T1,tk||T1,tk,0λsh,1t,kZ1|T1,tkT1,tk,0|λsh,1s1ΔT(1)F. Likewise, |Q3|λsh,2s2ΔT(2)F.

A2.3.2. The smoothing penalty term

Let Atj=ϕt+2,t+2jϕt+1,t+1j be the difference between two neighbouring elements on the off-diagonal of T. Likewise, let Btj=ϕt+1,t+1jϕt,tj, Atj,0=ϕt+2,t+2j,0ϕt+1,t+1j,0 and Btj,0=ϕt+1,t+1j,0ϕt,tj,0. Then, (A3) |S2|=λsmt,j|(AtjBtjAtj,0+Btj,0)(AtjBtj+Atj,0Btj,0)|λsmt,j|AtjBtjAtj,0+Btj,0|2t,j|AtjBtj+Atj,0Btj,0|2λsm2[t,j|AtjBtjAtj,0+Btj,0|2+t,j|AtjBtj+Atj,0Btj,0|2]λsm2t,j[3|AtjAtj,0(BtjBtj,0)|2+8|Atj,0Btj,0|2],(A3) where the last inequality holds since |AtjBtj+Atj,0Btj,0|2=|AtjAtj,0(BtjBtj,0)+2(Atj,0Btj,0)|22|AtjAtj,0(BtjBtj,0)|2+8(Atj,0Btj,0)2.Note, t,j|AtjAtj,0|2t,j|ϕt+2,t+2jϕt+2,t+2j,0|2+2t,j|ϕt+2,t+2jϕt+2,t+2j,0||ϕt+1,t+1jϕt+1,t+1j,0|+t,j|ϕt+1,t+1jϕt+1,t+1j,0|22ΔTF2+2t,j|ϕt+2,t+2jϕt+2,t+2j,0||ϕt+1,t+1jϕt+1,t+1j,0|4ΔTF2.Likewise, t,j|BtjBtj,0|24ΔTF2. By Cauchy-Schwarz inequality, t,j|AtjAtj,0||BtjBtj,0|t,j|AtjAtj,0|2t,j|BtjBtj,0|24ΔTF2.Then we obtain |S2|24λsm,1ΔT(1)F2+4λsm,1t,j|Atj,0(1)Btj,0(1)|2. Similarly, we achieve the same conclusion that |S3|24λsm,2ΔT(2)F2+4λsm,2t,j|Ask,0(2)Bsk,0(2)|2.

A2.4. Combine all terms

G(ΔD(1),ΔT(1),ΔD(2),ΔT(2))M1|M2||L1|+L3|L2|+Q1|Q2||Q3||S2||S3|18d8ΔD(1,2)F2d2ΔD(1,2)Fpqlog(pq)/nC1,ϵ+1d4ΔT(1,2)F22ΔT(1,2)Fs1s2log(pq)/nC2,ϵλsh,1s1ΔT(1)Fλsh,2s2ΔT(2)F24λsm,1ΔT(1)F24λsm,1t,j|Atj,0(1)Btj,0(1)|224λsm,2ΔT(2)F24λsm,2s,k|Ask,0(2)Bsk,0(2)|2+[λsh,1(2skZ2|ΔT,sk(2)|+skZ2C|T2,sk|)log(pq)/nC2,ϵ]i,jZ1c|T1,ij|+[λsh,2(2ijZ1|ΔT,ij(1)|+ijZ1C|T1,ij|)log(pq)/nC2,ϵ]s,kZ2c|T2,sk|.Note that T2,0F2 is the sum of squared eigenvalues of T2,0. Then, T2,0F2qd2. As a result, for a sufficiently large n, 2skZ2|T2,sk|+skZ2C|ΔT,sk(2)|2skZ2|T2,sk,0|+2skZ2|ΔT,sk(2)|+skZ2C|ΔT,sk(2)|2sk|T2,sk,0|+2sk|ΔT,sk(2)|2s2T2,0F+2q(q1)2sk|ΔT,sk(2)|22s2T2,0F+2qΔT(2)F2s2qd+2qU22s1s2log(pq)/n.By Assumption (C3), we have (2skZ2C|ΔT,sk(2)|+skZ2|T2,sk|)log(pq)/nC2,ϵ(2s1qlog(pq)/nU2+2d)s2qlog(pq)/nC2,ϵ3ds2qlog(pq)/nC2,ϵ,for a sufficiently large n. Then, we can let λsh,1=3ds2qlog(pq)/nC2,ϵ, and consequently, [λsh,1(2skZ2|ΔT,sk(2)|+skZ2C|T2,sk|)log(pq)/nC2,ϵ]i,jZ1c|T1,ij|>0.Likewise, we can let λsh,2=3ds1plog(pq)/nC2,ϵ and [λsh,2(2ijZ1|ΔT,ij(1)|+ijZ1C|T1,ij|)log(pq)/nC2,ϵ]s,kZ2c|T2,sk|>0.Again, using triangular inequality (A4) ΔT(1,2)ΔT(1)FT2,0+T1,0ΔT(2)F+ΔT(1)ΔT(2)dΔT(1)F+dΔT(2)F+ΔT(1)ΔT(2).(A4) Then, by Lemma A.2, (EquationA4), and the rates of λsh,1 and λsh,2, we obtain 12d4ΔT(1,2)F22ΔT(1,2)Fs1s2log(pq)/nC2,ϵλsh,1s1ΔT(1)Fλsh,2s2ΔT(2)F[12d4qU13dC2,ϵq2dC2,ϵU2C2,ϵs1s2log(pq)/n]U1s1s2log(pq)/n+[12d4pU23dC2,ϵp2dC2,ϵU1C2,ϵs1s2log(pq)/n]U2s1s2log(pq)/n,which is positive for sufficiently large n by Assumption (C2). Following the same way, we have 12d4ΔT(1,2)F224λsm,1ΔT(1)F24λsm,1t,j|Atj,0(1)Btj,0(1)|224λsm,2ΔT(2)F24λsm,2s,k|Ask,0(2)Bsk,0(2)|2(q2d424λsm,1)s1s2log(pq)/nU124λsm,1t,j|Atj,0(1)Btj,0(1)|2+(p2d424λsm,2)s1s2log(pq)/nU224λsm,2s,k|Ask,0(2)Bsk,0(2)|2.If λsm,1=o(1), then 21d4q>24λsm,1 for a sufficiently large n. Hence, we can choose a sufficiently small λsm,1 such that U1216d4qs1s2log(pq)/n>λsm,1t,j|Atj,0(1)Btj,0(1)|2.Likewise, we can choose a sufficiently small λsm,2 such that U2216d4ps1s2log(pq)/n>λsm,2s,k|Ask,0(2)Bsk,0(2)|2.At last, we only need to show 18d8ΔD(1,2)F[ΔD(1,2)F8d12pqlog(pq)/nC1,ϵ]>0.It suffices to show the part inside the square brackets as follows, ΔD(1,2)F264d24pqlog(pq)/nC1,ϵ2>0.Expanding ΔD(1,2)F2 yields ΔD(1,2)F2=[(id1,i,02)V22q+(jd2,j,02)V12p+V12V22pqlog(pq)/n]log(pq)/n+2(id1,i,0Δ1,i)V22qlog(pq)/n+2(jd2,j,0Δ2,j)V12plog(pq)/n+2id1,i,0Δ1,ijd2,j,0Δ2,j.By Lagrange multiplier, the minimum of ΔD(1,2)F2 subject to the boundaries ΔD(1):ΔD(1)F2=V12plog(pq)/n, and ΔD(2):ΔD(2)F2=V22qlog(pq)/n is attained when (id1,i,0Δ1,i)2=(id1,i,02)V12plog(pq)/n,(jd2,j,0Δ2,j)2=(jd2,j,02)V22qlog(pq)/n.Evaluated at such stationary points, |(id1,i,0Δ1,i)V22qlog(pq)/n|=(id1,i,02)plog(pq)/nV1V22qlog(pq)/n,|(jd2,j,0Δ2,j)V12plog(pq)/n|=(jd2,j,02)qlog(pq)/nV12V2plog(pq)/n.They are dominated by |id1,i,0Δ1,ijd2,j,0Δ2,j|=(id1,i,02)(jd2,j,02)pqV1V2log(pq)/nsince |(id1,i,0Δ1,i)V22qlog(pq)/n||id1,i,0Δ1,ijd2,j,0Δ2,j|=qlog(pq)/nV2jd2,j,02=o(1),|(jd2,j,0Δ2,j)V12plog(pq)/n||id1,i,0Δ1,ijd2,j,0Δ2,j|=plog(pq)/nV1id1,i,02=o(1),by Assumption (C3). Hence, ΔD(1,2)F2[(id1,i,02)V22q+(jd2,j,02)V12p+V12V22pqlog(pq)/n]log(pq)/n2(id1,i,02)(jd2,j,02)pqV1V2log(pq)/n(d2V12+d2V222d2V1V2)pqlog(pq)/n,which will be larger than 64d24C1,ϵ2 for sufficiently large V1 and V2 such that d2V12+d2V222d2V1V2 is sufficiently large.

Above all, with probability larger than 1ϵ, G(ΔD(1),ΔT(1),ΔD(2),ΔT(2))>0 which completes the proof.

Appendix 3.

The regularized flip-flop (RFF) algorithm

Let $D_t$ and $E_s$ be the $(p-2-t)\times(p-t)$ and $(q-2-s)\times(q-s)$ matrix representations of the difference operator $\Delta^2_{\mathrm{diag}}$,
$$\begin{pmatrix}1&-2&1&0&\cdots&0\\ &\ddots&\ddots&\ddots& &\vdots\\0&\cdots&0&1&-2&1\end{pmatrix},$$
such that $\sum_{j=1}^{t-1}\big(\Delta^2_{\mathrm{diag}}\phi_{t+2,t+2-j}\big)^2=\phi_t^\top D_t^\top D_t\phi_t$ and $\sum_{k=1}^{s-1}\big(\Delta^2_{\mathrm{diag}}\eta_{s+2,s+2-k}\big)^2=\eta_s^\top E_s^\top E_s\eta_s$, respectively. We summarize in Algorithm 1 the regularized estimation of the Kronecker structured covariance matrix obtained by combining shrinkage and smoothing along the subdiagonals of the $T_1$ and $T_2$ matrices.
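A small sketch (ours) of the second-difference operator and of the quadratic-form identity it represents:

```python
import numpy as np

def second_diff_matrix(m):
    """(m-2) x m second-difference operator: each row is (..., 1, -2, 1, ...)."""
    D = np.zeros((m - 2, m))
    for i in range(m - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return D

# the smoothing penalty on a vector x of subdiagonal entries is x' D' D x
x = np.array([0.9, 0.7, 0.4, 0.3, 0.1])
D = second_diff_matrix(len(x))
print(np.isclose(x @ D.T @ D @ x, np.sum(np.diff(x, n=2) ** 2)))   # True
```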