Theory and Methods

Ridge Regression Under Dense Factor Augmented Models

Pages 1566-1578 | Received 23 Sep 2021, Accepted 15 Apr 2023, Published online: 05 Jun 2023

Abstract

This article establishes a comprehensive theory of the optimality, robustness, and cross-validation selection consistency of ridge regression under factor-augmented models with possibly dense idiosyncratic information. Using spectral analysis for random matrices, we show that ridge regression is asymptotically efficient in capturing both factor and idiosyncratic information by minimizing the limiting predictive loss among the entire class of spectral regularized estimators under large-dimensional factor models and a mixed-effects hypothesis. We derive an asymptotically optimal ridge penalty in closed form and prove that a bias-corrected k-fold cross-validation procedure can adaptively select the best ridge penalty in large samples. We extend the theory to autoregressive models with many exogenous variables and establish a consistent cross-validation procedure using what we call the double ridge regression method. Our results allow for nonparametric distributions of possibly heavy-tailed martingale difference errors and idiosyncratic random coefficients, and adapt to the cross-sectional and temporal dependence structures of the large-dimensional predictors. We demonstrate the performance of our ridge estimators in simulated examples as well as an economic dataset. All the proofs are available in the supplementary materials, which also include more technical discussions and remarks, extra simulation results, and useful lemmas that may be of independent interest.

1 Introduction

Improving the bias-variance tradeoff is crucial in many high-dimensional forecasting applications. The classical least-squares regression can suffer from substantial predictive variance when the number of predictors is large compared with the sample size. One useful variance reduction strategy is summarizing massive raw features into a few leading principal components and then predicting the response with these low-dimensional predictors. When the large-dimensional covariates obey a factor structure, the principal component estimator can provide consistent forecasts of the regression function if the factors suffice to predict the target variable. The principal component regression (PCR) often shows better performance than the LASSO regression and ensemble algorithms such as bagging for macroeconomic forecasting; see, for example, Stock and Watson (Citation2002), De Mol, Giannone, and Reichlin (Citation2008), Stock and Watson (Citation2012), Castle, Clements, and Hendry (Citation2013), and Carrasco and Rossi (Citation2016). The PCR is also widely used when studying DNA microarrays (Alter, Brown, and Botstein Citation2000), medical shapes (Cootes et al. Citation1995), climate (Preisendorfer Citation1998), robust synthetic control (Agarwal et al. Citation2021) and many other fields (see Chapter 6 of James et al. Citation2021). While the PCR model is dense at the variable level (see, e.g., the discussions in Ng Citation2013), it relies on only low-dimensional principal components in forecasting. However, when many idiosyncratic components are useful, even though each is negligible, the PCR is not optimal and leaves room for improvement in many empirical applications.

Consider a toy example with $p=100$ covariates $x_i=(x_{i,1},\ldots,x_{i,p})'\in\mathbb{R}^p$, where we denote the transpose of any matrix or vector $A$ by $A'$, and $n=100$ observations from a factor model given by
$$x_i=\Lambda_1 f_{i,1}+\Lambda_2 f_{i,2}+e_i,\qquad i=1,\ldots,n. \tag{1.1}$$

The latent factors $f_{i,1}\in\mathbb{R}$ and $f_{i,2}\in\mathbb{R}$, the entries of the idiosyncratic variables $e_i=(e_{i,1},\ldots,e_{i,p})'$, and the entries of the loading coefficients $\Lambda_1=(\Lambda_{1,1},\ldots,\Lambda_{p,1})'$ and $\Lambda_2=(\Lambda_{1,2},\ldots,\Lambda_{p,2})'$ are all generated independently from the standard normal distribution. One may think of the data vector $x_i$ as a set of economic variables at a given time. We then generate the target variables $y_i$ from a factor-augmented predictive regression model given by
$$y_i=0.6f_{i,1}+0.2f_{i,2}+e_i'\beta+\varepsilon_i=:\mu_i+\varepsilon_i, \tag{1.2}$$
where the regression errors $\varepsilon_i$ are generated from the standard normal distribution independently of the regression means $\mu_i$. We generate the direction of the coefficient vector $\beta=(\beta_1,\ldots,\beta_p)'$ uniformly over the unit sphere in $\mathbb{R}^p$, so each entry $\beta_i$ is stochastically negligible for large $p$. We then scale the total idiosyncratic signal $\|\beta\|^2=\sum_{i=1}^p\beta_i^2$ over the grid $\{0,0.1,\ldots,1\}$. The PCR model is the special case with $\|\beta\|^2=0$. The idiosyncratic information, at the aggregate level, becomes more important as $\|\beta\|^2$ increases.
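As a point of reference, a minimal NumPy sketch of this toy data-generating process (1.1)–(1.2) might look as follows; the seed and the value of the signal length are illustrative choices, not taken from the article.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 100, 100

    # Factor model (1.1): x_i = Lambda_1 f_{i,1} + Lambda_2 f_{i,2} + e_i
    F = rng.standard_normal((n, 2))            # latent factors (f_{i,1}, f_{i,2})
    Lam = rng.standard_normal((p, 2))          # loadings Lambda_1, Lambda_2
    E = rng.standard_normal((n, p))            # idiosyncratic components e_i
    X = F @ Lam.T + E

    # Regression model (1.2): y_i = 0.6 f_{i,1} + 0.2 f_{i,2} + e_i' beta + eps_i
    signal = 0.5                               # ||beta||^2, varied over {0, 0.1, ..., 1}
    b = rng.standard_normal(p)
    beta = np.sqrt(signal) * b / np.linalg.norm(b)   # uniform direction, fixed length
    eps = rng.standard_normal(n)
    mu = 0.6 * F[:, 0] + 0.2 * F[:, 1] + E @ beta    # regression means mu_i
    y = mu + eps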

We plot the median, over 5000 replications, of the training predictive loss $n^{-1}\sum_{i=1}^n(\mu_i-\hat{\mu}_i)^2$ as a function of the idiosyncratic signal length $\|\beta\|^2$ on the left-hand side of Figure 1, where $\mu_i$ and $\hat{\mu}_i$ denote the population and estimated means of the target variable $y_i$, respectively. We consider both the oracle PCR estimator using the true number of factors $r=2$ and the feasible PCR estimator using a number of PCs selected by 5-fold cross-validation. We compare these PCR estimators with the ridge regression estimator, hereinafter abbreviated as the RR estimator, which uses all the raw variables $x_i$ and does not require selecting the number of factors. We select the ridge penalty using 5-fold cross-validation. The median predictive loss of the PCR estimators grows (almost) linearly in the idiosyncratic signal length $\|\beta\|^2$, whereas that of the RR estimator grows much more slowly. The cross-validated RR estimator is only slightly worse than the oracle PCR estimator and comparable to the cross-validated PCR estimator for small $\|\beta\|^2$, whereas the advantage of the RR estimator emerges quickly as $\|\beta\|^2$ grows.

Figure 1: Median predictive loss (left) and change in median sample variance of estimated means (right) as functions of the idiosyncratic signal $\|\beta\|^2$.

To illustrate the sensitivity of RR to the idiosyncratic information, we plot the median, over the replications, of the sample variance of the estimated means (which equals the sum of squared estimated coefficients on all standardized components) on the right-hand side of Figure 1. To emphasize the changes, we plot only the differences from the values at $\|\beta\|^2=0$. The variance of the RR estimates grows quickly with $\|\beta\|^2$, while that of the oracle PCR estimates hardly changes. Cross-validation improves the sensitivity of PCR to $\|\beta\|^2$ only slightly because a few extra components are asymptotically negligible in our dense models.

Figure 2 compares the training and testing predictive losses for ridge regression, averaged over 1000 replications, as functions of the effective degrees of freedom (see Hastie, Tibshirani, and Friedman Citation2009, p. 64) implied by the ridge penalty. We define the testing predictive loss as the expected squared forecast error of the out-of-sample regression mean $\mu_{\mathrm{oos}}=f_{\mathrm{oos}}'\theta+e_{\mathrm{oos}}'\beta$, with new variables $f_{\mathrm{oos}}$ and $e_{\mathrm{oos}}$, using the estimated regression function and the new covariates $x_{\mathrm{oos}}$, where $\theta=(0.6,0.2)'$ and $\beta$ are the population coefficients in (1.2). Again, the optimal ridge estimator (dashed line) for the testing predictive loss matches the performance of PCR when $\|\beta\|^2=0$ but outperforms PCR when $\|\beta\|^2>0$. More importantly, these plots illustrate an important relation between the training and testing predictive losses for performance evaluation: they share (almost) the same optimal ridge penalty. We formally establish this universal property in Section 2 using a general notion of predictive loss, and generalize the results to time series data in Section 3.

Figure 2: Average training and testing predictive losses for RR and Oracle PCR.

Our toy example illustrates that omitting idiosyncratic information can lead to a nontrivial loss in predictive efficiency, and that ridge regression is more robust than PCR by capturing both factor and idiosyncratic information. The contribution of this article is a comprehensive theory of the optimality, robustness, and cross-validation consistency of ridge regression under a general factor-augmented regression model given by
$$X=F\Lambda'+E\Gamma', \tag{1.3}$$
$$Y=F\theta+E\beta+\epsilon, \tag{1.4}$$
where the left-hand sides of the equations represent a large sample of observations in high dimensions: $X=[x_1-\bar{x},\ldots,x_n-\bar{x}]'\in\mathbb{R}^{n\times p}$ is the demeaned design matrix of predictors $x_i\in\mathbb{R}^p$ with large $n$ and $p$, and $Y=(y_1-\bar{y},\ldots,y_n-\bar{y})'$ is the demeaned vector of univariate targets $y_i$. Throughout, $\bar{a}=\frac{1}{n}\sum_{i=1}^n a_i$ denotes the sample mean of any variable $a$. The low-rank matrix $F=[f_1-\bar{f},\ldots,f_n-\bar{f}]'$ collects low-dimensional latent factors $f_i\in\mathbb{R}^r$, with $r$ finite, and $E=[e_1-\bar{e},\ldots,e_n-\bar{e}]'$ is a large-dimensional random matrix of idiosyncratic components $e_i\in\mathbb{R}^p$. The matrix $\Gamma$ is a rotation matrix, possibly unknown. The fixed-effects $\theta=(\theta_1,\ldots,\theta_r)'$ on the factors and the random-effects $\beta=(\beta_1,\ldots,\beta_p)'$ on the idiosyncratic components are both unknown. The noise component $\epsilon=(\varepsilon_1-\bar{\varepsilon},\ldots,\varepsilon_n-\bar{\varepsilon})'$ is a demeaned vector of uncorrelated (but possibly dependent) regression errors $\varepsilon_i$. One may also interpret (1.3) and (1.4) as an errors-in-variables regression model (6.1), but we postpone the discussion to Section 6.

Many datasets support the low-rank perturbation form (1.3); see the survey by Johnstone and Paul (Citation2018) for many examples in electrocardiogram tracing, microarrays, satellite images, medical shapes, climate, signal detection, and finance. We assume that the low-rank $p\times r$ coefficient matrix $\Lambda=[\Lambda_1,\ldots,\Lambda_r]$ has a divergent strength $a_p=\mathrm{tr}(\Lambda'\Lambda)\to\infty$ at an arbitrary rate, so as to separate the spectrum of the factor component $F\Lambda'$ from the stochastically bounded spectrum of the idiosyncratic component $E\Gamma'$ in (1.3). Therefore, we can analyze the spectral regularization using the standard eigenbasis without suffering from PCA inconsistency issues (Paul Citation2007; Johnstone and Lu Citation2009; Onatski Citation2012). Following Cai, Han, and Pan (Citation2020), we allow the unobservable idiosyncratic components to enter the factor model (1.3) subject to a possibly unknown rotation $\Gamma\in\mathbb{R}^{p\times p}$ for generality. We discuss the possibility of extending our results to $a_p=O(1)$, especially when PCA becomes inconsistent, as future work in Appendix A.9 in the supplementary materials.

Our regression model (1.4) is the factor-augmented model proposed by Kneip and Sarda (Citation2011). Substituting $E\Gamma'=X-F\Lambda'$ under model (1.3) and using the identity $\Gamma'\Gamma=I_p$, we can reparameterize the regression model (1.4) as follows:
$$Y=F\bar{\theta}+X\bar{\beta}+\epsilon,\qquad \bar{\beta}=\Gamma\beta,\qquad \bar{\theta}=\theta-\Lambda'\bar{\beta}, \tag{1.5}$$
where the entries of $\bar{\theta}$ and $\bar{\beta}$ are called common and specific effects, respectively. One might also interpret the last equation as a misspecified model where we only observe the design matrix $X$ but not the latent factor matrix $F$. Throughout the article, we refer to these two equivalent representations (1.4) and (1.5) of the factor-augmented regression model interchangeably. Note that our model augments the purely variable-based model (with $\bar{\theta}=\theta-\Lambda'\Gamma\beta=(0,\ldots,0)'$) in, for example, Castle, Clements, and Hendry (Citation2013) and Fan, Ke, and Wang (Citation2020) by adding extra factor information.

Unlike the aforementioned papers, we study the cases where the coefficient vector $\bar{\beta}$ is possibly dense. Even if the true regression vector $\beta$ on the idiosyncratic components is sparse, the regression vector $\bar{\beta}$ on the variables may contain few (or even no) zero entries due to a nontrivial rotation $\Gamma$. In general, we are interested in models that are possibly dense at both the variable and principal component levels. We consider every idiosyncratic component to be asymptotically negligible but allow them to have nontrivial aggregate explanatory power; see Jolliffe (Citation1982), Dobriban and Wager (Citation2018), Giannone, Lenza, and Primiceri (Citation2021), Hastie et al. (Citation2022), and many references therein for examples of applications.

Based on spectral analysis of large-dimensional random matrices, this article shows that the ridge regression is optimal in capturing factor and idiosyncratic information simultaneously among the entire class of spectral regularized estimators under the mixed-effects hypothesis. Ridge regression has a long tradition in statistics tracing back at least to Tikhonov (Citation1963a, Citation1963b) and Hoerl and Kennard (Citation1970); see Liu and Dobriban (Citation2020) for an excellent survey. To our knowledge, we are the first to study the ridge regression model with factor augmentation. Our analysis of large-dimensional covariance matrices under factor models requires a different approach from that in Dicker (Citation2016) and Dobriban and Wager (Citation2018). In particular, our results extend to more general data generating processes of covariates by analyzing limiting predictive loss instead of predictive risk (i.e., the expected predictive loss); see Appendix A.1 in the supplementary materials for comparisons with Dobriban and Wager (Citation2018).

We prove that the k-fold cross-validation selects the asymptotically optimal ridge penalty, as illustrated in Figure 2, subject to a straightforward bias correction. The bias-corrected selection is uniformly consistent over all positive ridge penalties bounded away from zero, including those diverging to infinity. We extend the scope of the k-fold cross-validation to non-iid data and provide theoretical guarantees for dealing with martingale difference regression errors. Furthermore, we propose a new method called "double" ridge regression and an asymptotically correct k-fold cross-validation method in the presence of nuisance variables, in particular autoregressors, under mild conditions.

The rest of the article is organized as follows. We develop the asymptotic theory for independent errors in Section 2 and then generalize the results to martingale difference errors and autoregressive models in Section 3. Simulated and real-life examples are discussed in Sections 4 and 5, respectively. We conclude the article in Section 6. All the proofs of the theorems are postponed to the supplementary materials, which also provide more technical discussions and remarks about the conditions and auxiliary results, together with additional simulation results and some important lemmas that may be of independent interest.

2 Main Results

First, consider the factor model (1.3) in the Introduction. We are interested in the asymptotic regime where the number of predictors is comparable to the sample size:

Assumption 1

(Large-dimensional Asymptotics).

The number of covariates $p=p(n)$ and the sample size $n$ are comparable in such a way that $p/n\to c\in(0,\infty)$ as $n\to\infty$.

The concentration ratio $p/n$, possibly larger than 1, plays a central role in our asymptotic theory. Unless specified otherwise, all asymptotic results hold as $n,p\to\infty$ simultaneously in the probability space, enlarged if necessary, with the largest sigma-algebra.

Without loss of generality, we treat both the in-sample factor scores matrix F and the loading matrix Λ in (1.3) as fixed parameters. When factors are random, our approach is equivalent to conditioning on a particular realization of F. The number of factors r is finite but not necessarily known.

Assumption 2

(Factor Model).

The following conditions hold:

  (a) The total factor loading $a_p:=\mathrm{tr}(\Lambda'\Lambda)\to\infty$, and $a_p=O(p)$.

  (b) $\Sigma_\Lambda:=\Lambda'\Lambda/a_p$ converges to some $r\times r$ diagonal matrix with diagonal elements $\sigma_{ii}\ge\sigma_{jj}>0$ for $i<j$. Without loss of generality, $F'F/n=I_r$ and $\Gamma'\Gamma=I_p$, where $I_d$ denotes the $d\times d$ identity matrix.

  (c) $\lambda_{\max}(\Omega_E)=O_P(1)$, where $\Omega_E=\frac{1}{n}E'E$ is the (unobservable) sample covariance matrix of the idiosyncratic components.

Conditions (a)–(c) correspond to the weaker factor models in Bai and Ng (Citation2021), allowing an arbitrarily slow divergence rate of the total factor strength $a_p$. When all entries, say, of the loading matrix $\Lambda=\{\Lambda(i,j)\}$ are bounded away from infinity, we readily have $a_p=\sum_{i=1}^p\sum_{j=1}^r\Lambda^2(i,j)=O(pr)=O(p)$ in condition (a). We allow the loading matrix $\Lambda$ to have many vanishing entries, and one may use $a_p\asymp p^\alpha$ for any $\alpha\in(0,1]$; see, for example, Bai (Citation2003), Hallin and Liska (Citation2007), and Agarwal et al. (Citation2021) for the linear case $\alpha=1$ and Onatski (Citation2009) for the case $\alpha>2/3$. Via the factor decomposition (1.3) of $X=F\Lambda'+E\Gamma'$, these conditions ensure that the singular values of the factor part $F\Lambda'$ are much larger than (hence, separated from) those of the idiosyncratic part $E\Gamma'$; see, for example, Cai, Han, and Pan (Citation2020) for discussions and comparisons with the generalized spiked model in Johnstone (Citation2001).

Remarkably, our condition (b) relaxes some identification conditions in previous papers. We allow ties in the proportions of total loading $\{\sigma_{ii}: i=1,\ldots,r\}$ and require only the subspace consistency of PCA in the spirit of Jung and Marron (Citation2009). When there are ties, all bases of the span of the corresponding latent factors satisfy our conditions, and therefore we are not able to identify the true ones in the data generating process. Nevertheless, we can estimate the subspace as a whole consistently and unify the subsequent factor analysis for cross-validation and autoregressive models, since train-test splits and residual-based estimation can make latent factors indistinguishable in subsamples or after projections. One limitation of condition (b) is that all eigenvalues of $\Lambda'\Lambda$ must diverge at the same rate $a_p$, but we can easily relax this requirement and allow different rates if necessary.

Condition (c) holds for a separable random matrix $E=B^{1/2}\varsigma A^{1/2}$ in (1.3), where $A\in\mathbb{R}^{p\times p}$ and $B\in\mathbb{R}^{n\times n}$ are positive-definite matrices with bounded spectral norms and $\varsigma=\{\varsigma_{ij}\}\in\mathbb{R}^{n\times p}$ is a random matrix of iid standardized entries with a bounded fourth moment; see, for example, Paul and Silverstein (Citation2009) and Chapter 5 of Bai and Silverstein (Citation2010). Condition (c) also holds for high-dimensional linear time series satisfying the conditions in Remark 8 of Onatski and Wang (Citation2021). See Appendix A.2 of the supplementary materials for more detailed discussions.

For out-of-sample analysis, we assume that the data obeying (1.3) and (1.4) are drawn from the following data generating process (DGP):
$$x_i=\Lambda f_i+\Gamma e_i, \tag{2.1}$$
$$y_i=\phi_0+f_i'\theta+e_i'\beta+\varepsilon_i, \tag{2.2}$$
where $\phi_0=\phi_0(n)$ is an unspecified intercept, possibly indexed by $n$, and the regression errors $\varepsilon_i$ satisfy the following condition:

Assumption 3

(Independent Errors).

The regression errors $\{\varepsilon_i\}$ are mutually independent with zero mean $\mathrm{E}\varepsilon_i=0$ and a common variance $\sigma^2=\mathrm{E}\varepsilon_i^2$ bounded away from zero and infinity, and they are independent of $\{f_i,e_i\}$. Their squares are uniformly integrable in that $\sup_{i\in\mathbb{Z}}\mathrm{E}|\varepsilon_i|^{2+2\iota}$ is bounded away from infinity for some $\iota>0$.

This assumption is common in the statistical learning literature. We will relax it to handle uncorrelated but dependent errors in Section 3. See also Remark 2 for a relaxation of the homoscedasticity condition. We consider strongly exogenous predictors $\{x_i\}$ to allow for standard factor estimation using the entire sample of raw predictors as in, for example, Stock and Watson (Citation2012).

Throughout the article, we consider a mixed-effects model where the factor coefficients $\theta$ are arbitrary fixed-effects and the entries of the idiosyncratic coefficient vector $\beta$ are stochastic. In particular, we make the following assumption throughout the article for $\beta=(\beta_1,\ldots,\beta_p)'$:

Assumption 4

(Random Effects).

The idiosyncratic coefficient vector $\beta=\tau p^{-1/2}b$, where $b=(b_1,\ldots,b_p)'$ is independent of the predictors $\{x_i\}$ and the regression errors $\{\varepsilon_i\}$, and has uncorrelated entries such that $\mathrm{E}[b_i\mid b_j, j\neq i]=0$, $\mathrm{E}[b_i^2\mid b_j, j\neq i]=1$, and $\max_{1\le i\le p}\mathrm{E}|b_i|^{2+2\iota}$ is bounded away from infinity for some $\iota>0$. The total idiosyncratic signal length $\tau^2=\tau_n^2$ is unknown, possibly dependent on $\{x_i\}$, and $\tau^2=O_P(1)$. When $\tau^2=0$, the augmented model (2.2) reduces to the principal component regression model.

This condition is based on the random-effects hypothesis in, for example, Dobriban and Wager (Citation2018), but here we relax the dependence structure between the entries $\beta_i$ and allow them to be non-identically distributed. See Appendix A.3 in the supplementary materials for more examples and comparisons. We emphasize that this assumption is only needed for modeling the idiosyncratic coefficients $\beta$; we allow for arbitrary fixed-effects $\theta$ on the latent factors in our factor-augmented model (2.2).

We split our main results into two sections: the first one shows the efficiency of the ridge regression among the general class of spectral regularized estimators; the second one establishes the uniform consistency of the bias-corrected k-fold cross-validation for selecting the best ridge penalty.

2.1 On the Optimality of Ridge Regression

In this section, we show that the ridge regression is asymptotically optimal among the spectral regularized estimators in capturing idiosyncratic information and it is more robust than principal component regression against factor augmentation.

Consider the singular value decomposition $n^{-1/2}X=\sum_{i=1}^K\lambda_i^{1/2}u_iv_i'$, where $\lambda_i$ denotes the $i$th largest eigenvalue (counting multiplicities) and $K\le\min\{n-1,p\}$ denotes the rank of $X$. Note that the $\{u_i\}$ are all orthogonal to the $n$-dimensional unit vector proportional to the all-ones vector, denoted by $\iota_n:=n^{-1/2}(1,\ldots,1)'\in\mathbb{R}^n$, due to data centering. To reduce the estimation variance of the regression means $\mu=F\theta+E\beta$ in (1.4), we shrink the projections of the (uncentered) target vector $y:=(y_1,\ldots,y_n)'\in\mathbb{R}^n$ onto the eigenbasis $\{\iota_n\}\cup\{u_i\in\mathbb{R}^n:1\le i\le K\}$ via a shrinkage function $\delta:(0,\infty)\to[0,1]$:
$$\hat{\mu}(\delta)=\iota_n\iota_n'y+\sum_{i=1}^K\delta(\lambda_i)u_iu_i'y=(\hat{m}_{\tilde{\delta}}(x_1),\ldots,\hat{m}_{\tilde{\delta}}(x_n))', \tag{2.3}$$
where $\hat{m}_{\tilde{\delta}}$ is the estimated regression function given by
$$\hat{m}_{\tilde{\delta}}(x)=\bar{y}+(x-\bar{x})'\hat{\beta}(\tilde{\delta}), \tag{2.4}$$
and $\hat{\beta}(\tilde{\delta})$ is a penalized least-squares estimator (Tikhonov Citation1963a, Citation1963b) given by
$$\hat{\beta}(\tilde{\delta})=(X'X+nW(\tilde{\delta}))^{+}X'Y, \tag{2.5}$$

with $A^{+}$ denoting the Moore-Penrose inverse of any matrix $A$, and the weight matrix
$$W(\tilde{\delta})=\sum_{i=1}^K\tilde{\delta}(\lambda_i)v_iv_i',\qquad \tilde{\delta}(x)=x\,\frac{1-\delta(x)}{\delta(x)}. \tag{2.6}$$

In particular, choosing $\tilde{\delta}(x)\equiv\gamma$ and $\delta(x)=\frac{x}{x+\gamma}$ for some constant $\gamma>0$ in (2.5) yields the ridge (Hoerl and Kennard Citation1970) estimator $\hat{\beta}(\gamma)=(X'X+n\gamma I_p)^{-1}X'Y$. We allow for infinite penalties for dimension-reduction purposes: one should interpret $\tilde{\delta}(x)=\infty$ if and only if $\delta(x)=0$, in which case we set $\hat{\beta}(\tilde{\delta})$ orthogonal to $v_i$ in the estimation. For more discussion we refer to Hastie, Tibshirani, and Friedman (Citation2009, chap. 3). Henceforth we call (2.3) a spectral regularized estimator of the regression means.
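To make the notation concrete, here is a minimal NumPy sketch (not from the article) of the spectral regularized fit (2.3)–(2.6), with ridge regression recovered by the shrinkage function $\delta(x)=x/(x+\gamma)$; the function and variable names are illustrative. A PCR fit would instead pass the indicator shrinkage $\delta(x)=\mathbf{1}[x>\lambda_{r+1}]$.

    import numpy as np

    def spectral_fit(X, y, delta):
        """Spectral regularized estimator: shrink the projections of y onto the
        left singular vectors of the demeaned design matrix by delta(lambda_i)."""
        n = X.shape[0]
        Xc, yc = X - X.mean(0), y - y.mean()
        U, s, Vt = np.linalg.svd(Xc / np.sqrt(n), full_matrices=False)
        lam = s**2                                  # eigenvalues lambda_i of X'X/n
        keep = lam > 1e-12                          # drop the null space when p > n
        U, Vt, lam = U[:, keep], Vt[keep], lam[keep]
        d = delta(lam)
        mu_hat = y.mean() + U @ (d * (U.T @ yc))    # fitted means, as in (2.3)
        beta_hat = Vt.T @ (d / np.sqrt(n * lam) * (U.T @ yc))   # coefficients (2.5)
        return mu_hat, beta_hat

    # Ridge regression corresponds to delta(x) = x / (x + gamma):
    gamma = 0.5
    ridge_delta = lambda lam: lam / (lam + gamma)
    # mu_hat, beta_hat = spectral_fit(X, y, ridge_delta)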

We define the predictive loss as the reducible mean squared forecast error (see James et al. Citation2021 for the decomposition into reducible and irreducible errors), conditional on a realization of the training set, given by
$$\tilde{L}(\delta)=\mathrm{E}\big[(\mu_{\mathrm{oos}}-\hat{m}_{\tilde{\delta}}(x_{\mathrm{oos}}))^2\mid\mathcal{S}_n\big],$$
where $x_{\mathrm{oos}}=\Lambda f_{\mathrm{oos}}+\Gamma e_{\mathrm{oos}}$ and $\mu_{\mathrm{oos}}=\phi_0+f_{\mathrm{oos}}'\theta+e_{\mathrm{oos}}'\beta$ are the out-of-sample (OOS) random covariates and regression mean from the DGPs (2.1) and (2.2) with the same factor model parameters and regression coefficients. The sigma-algebra $\mathcal{S}_n$ is generated by the in-sample variables $\{f_i,e_i,\varepsilon_i: i\le n\}$ and the random coefficients $\beta$.

We use the following paradigm to evaluate the spectral regularized estimators.

Assumption 5

(Evaluation Paradigm).

The new variables $f_{\mathrm{oos}}$ and $e_{\mathrm{oos}}$ are independent of both the in-sample regression errors $\{\varepsilon_i: i\le n\}$ and the random coefficients $\beta$, given the in-sample random variables $\{f_i,e_i: i\le n\}$. The conditional covariance matrices of $a_{\mathrm{oos}}$ centered around the sample mean, given by $\tilde{\Omega}_a=\mathrm{E}[(a_{\mathrm{oos}}-\bar{a})(a_{\mathrm{oos}}-\bar{a})'\mid\mathcal{S}_n]$, have spectral norms bounded in $n$ for both $a\in\{f,e\}$.

This paradigm is flexible enough to include both the training and testing predictive losses used in Figure 2:

  • If one draws the pair of new variables $f_{\mathrm{oos}}$ and $e_{\mathrm{oos}}$ uniformly over the sample $\{f_i,e_i: 1\le i\le n\}$, then we obtain the training predictive loss given by
$$\tilde{L}(\delta)=\frac{1}{n}\sum_{i=1}^n(\mu_i-\hat{m}_{\tilde{\delta}}(x_i))^2=\frac{1}{n}\|\mu-\hat{\mu}(\delta)\|^2, \tag{2.7}$$
with the sample covariance matrices $\tilde{\Omega}_a=\frac{1}{n}\sum_{i=1}^n(a_i-\bar{a})(a_i-\bar{a})'$ for $a\in\{f,e\}$.

  • A more typical implementation is to generate $f_{\mathrm{oos}}$ and $e_{\mathrm{oos}}$, independent of $\mathcal{S}_n$, from some population distribution or as a random draw from a test set, in order to assess the testing predictive loss. Then $\tilde{\Omega}_a=\Sigma_a+(\mu_a-\bar{a})(\mu_a-\bar{a})'$, where $\mu_a$ and $\Sigma_a$ denote the mean and covariance matrix of $a_{\mathrm{oos}}$ for $a\in\{f,e\}$.

Our estimated regression function (2.4) uses the oracle estimators $\Lambda'\hat{\beta}(\tilde{\delta})$ and $\Gamma'\hat{\beta}(\tilde{\delta})$ of $\theta$ and $\beta$, respectively, via the decomposition
$$\hat{m}_{\tilde{\delta}}(x_{\mathrm{oos}})=\bar{y}+(f_{\mathrm{oos}}-\bar{f})'\Lambda'\hat{\beta}(\tilde{\delta})+(e_{\mathrm{oos}}-\bar{e})'\Gamma'\hat{\beta}(\tilde{\delta}).$$

The following theorem relates L˜(δ) to the estimation error of linear coefficients and gives further stochastic approximations based on spectral analysis.

Theorem 1.

Under Assumptions 1–5, for any sequence of candidate shrinkage functions $\delta=\delta_n:(0,\infty)\to[0,1]$ that may or may not depend on the covariates in the training set,
$$\tilde{L}(\delta)-\|\theta-\Lambda'\hat{\beta}(\tilde{\delta})\|^2_{\tilde{\Omega}_f}-\|\beta-\Gamma'\hat{\beta}(\tilde{\delta})\|^2_{\tilde{\Omega}_e}\overset{P}{\to}0, \tag{2.8}$$

where $\|v\|^2_{\Omega}=v'\Omega v$ denotes the weighted norm of any vector $v$. Furthermore,
$$\|\theta-\Lambda'\hat{\beta}(\tilde{\delta})\|^2_{\tilde{\Omega}_f}-\theta'D_\delta\tilde{\Omega}_f D_\delta\theta\overset{P}{\to}0, \tag{2.9}$$
$$\|\beta-\Gamma'\hat{\beta}(\tilde{\delta})\|^2_{\tilde{\Omega}_e}-\int_0^{\lambda_{r+1}}\Big\{\tau^2(1-\delta(x))^2+\sigma^2\frac{p}{n}\frac{\delta^2(x)}{x}\Big\}\,d\tilde{F}_p(x)-\tau^2\tilde{F}_p(0)\overset{P}{\to}0, \tag{2.10}$$
where $D_\delta=\mathrm{diag}(1-\delta(\lambda_1),\ldots,1-\delta(\lambda_r))$ and $\tilde{F}_p(x)=\frac{1}{p}\sum_{i=1}^p v_i'\Gamma'\tilde{\Omega}_e\Gamma v_i\,\mathbf{1}(\lambda_i\le x)$, provided there are no ties among $\{\sigma_{ii}: i=1,\ldots,r\}$ and the penalty function $\tilde{\delta}$ is bounded away from 0. The probability mass $\tilde{F}_p(0)>0$ when the data matrix $X$ is column-rank deficient with $p>n$. More generally, the result remains true if $\delta(\lambda_i)=\delta(\lambda_j)+o_P(1)$ for every pair of tied values $\sigma_{ii}=\sigma_{jj}$ with $1\le i<j\le r$.

The estimation errors of the fixed and random effects are asymptotically orthogonal in (2.8) because their interaction vanishes through aggregation of the uncorrelated mean-zero random coefficients and regression errors. Via the approximation (2.8) one can interpret different predictive loss functions using different weight matrices: (a) the training errors weight the coefficient estimation errors by the sample covariance matrices; (b) the testing errors weight the coefficient estimation errors by the population covariance matrices (centered around sample means); (c) the $L_2$ norm of the coefficient estimation error $\|\theta-\Lambda'\hat{\beta}(\tilde{\delta})\|^2+\|\beta-\Gamma'\hat{\beta}(\tilde{\delta})\|^2$ uses identity weight matrices for $a\in\{f,e\}$. The asymptotic limit in (2.9) shows the bias of the shrinkage estimator of the fixed-effects $\theta$, and that in (2.10) shows the bias-variance tradeoff for random-effects estimation.

Remark 1

(PCR).

Plugging in the shrinkage function $\delta(x)=\mathbf{1}[x>\lambda_{r+1}]$ for the oracle PCR estimator yields a limit increasing linearly in $\|\beta\|^2$, as depicted in Figure 1:
$$\tilde{L}(\delta)=0\cdot\|\theta\|^2+\int_0^{\lambda_{r+1}}\Big\{\tau^2(1-0)^2+\sigma^2\frac{p}{n}\frac{0^2}{x}\Big\}\,d\tilde{F}_p(x)+\tau^2\tilde{F}_p(0)+o_P(1)=\tau^2\tilde{F}_p(\lambda_{r+1})+o_P(1)=\|\beta\|^2\,\frac{1}{p}\mathrm{tr}(\tilde{\Omega}_e)+o_P(1),$$
where the last equality follows from the facts that $\|\beta\|^2=\tau^2+o_P(1)$ and the random slope $\tilde{F}_p(\lambda_{r+1})=p^{-1}\mathrm{tr}(\tilde{\Omega}_e)+o_P(1)$. In fact, this linear pattern generalizes to any PCR estimator using a finite number $\hat{r}\ge r$ of principal components, as adding a few idiosyncratic components does not change its asymptotic behavior in dense models.

Note that the candidate shrinkage function $\delta$ is arbitrary on the positive line as long as it is "honest" in the sense that it only depends on the covariates but not the target variable before selection. Our limit theorem shows that shrinking the projections of the target vector onto the first $r$ principal components can only increase the predictive loss asymptotically via the squared bias (2.9). We should keep the leading principal components unshrunk asymptotically by choosing a shrinkage function such that $1-\delta(\lambda_i)\overset{P}{\to}0$ for all $i=1,\ldots,r$, so that the fixed-effects estimator is consistent in the sense that $\|\theta-\Lambda'\hat{\beta}(\tilde{\delta})\|^2\overset{P}{\to}0$.

We will discuss how to choose the optimal (ridge) estimator satisfying this condition later on. Therefore, now we focus on this subset of shrinkage functions and study the limiting predictive loss in the following corollary:

Corollary 1.

Suppose that $1-\delta(\lambda_i)\overset{P}{\to}0$ for all $i=1,\ldots,r$. Then Theorem 1 implies that
$$\tilde{L}(\delta)-\int_0^{\infty}\Big\{\tau^2(1-\delta(x))^2+\sigma^2\frac{p}{n}\frac{\delta^2(x)}{x}\Big\}\,d\tilde{F}_p(x)-\tau^2\tilde{F}_p(0)\overset{P}{\to}0, \tag{2.11}$$
where we integrate the limiting error over the entire domain of $\tilde{F}_p$ for notational convenience, although it is asymptotically equivalent to integrate over only the sub-interval $[0,\lambda_{r+1}]$.

If, in addition, with probability one $\lambda_{\max}(\tilde{\Omega}_e)$ is bounded by some absolute constant and $\tilde{F}_p$ converges vaguely to a nonrandom limit $\tilde{F}$, then Corollary 1 remains true with $\tilde{F}_p$ replaced by the nonrandom limit $\tilde{F}$ for every continuous shrinkage function $\delta$. In particular, we have the following special cases.

Corollary 2.

Let the signal length $\tau^2$ and error variance $\sigma^2$ be constants, and generate the new variables $e_{\mathrm{oos}}$ independently of the training data. Suppose that all the entries of $e_i$ and $e_{\mathrm{oos}}$ are from an iid array with mean zero and variance $\sigma_e^2$. Then Corollary 1 implies that, for every continuous shrinkage function $\delta$ such that $\lim_{x\to\infty}\delta(x)=1$,
$$\tilde{L}(\delta)\overset{P}{\to}\int_0^{\infty}\Big\{\tau^2(1-\delta(x))^2+\sigma^2 c\,\frac{\delta^2(x)}{x}\Big\}\,dF(x)+\tau^2F(0), \tag{2.12}$$

where $F$ is the Marčenko and Pastur (Citation1967) law with a point mass $F(0)=\max\{0,1-1/c\}$ at the origin and density $F'(x)=\frac{1}{2\pi x c\sigma_e^2}\sqrt{(b-x)(x-a)}$ on $[a,b]$ (and zero density elsewhere), where $a=\sigma_e^2(1-\sqrt{c})^2$ and $b=\sigma_e^2(1+\sqrt{c})^2$. More examples satisfying (2.12) with various limits $F$ are available in Appendix A.4 in the supplementary materials.

For generality, we use the stochastic approximation (2.11) according to the random nature of the predictive loss. We shall show that the optimal (ridge) estimator is asymptotically invariant to the random measure $\tilde{F}_p$ (or its limit $F$, if any), and therefore the optimality of ridge regression does not rely on the spectral convergence of large-dimensional covariance matrices.

First, consider the degenerate scenario in which the PCR model is correct with $\tau^2=0$, or more generally $\tau^2\overset{P}{\to}0$. One may still use a ridge estimator to minimize the limiting error (2.11) with the shrinkage function $\delta(x)=\frac{x}{x+\gamma}$, where the ridge penalty $\gamma=\gamma_n$ diverges at a slower rate than the smallest spiked eigenvalue $\lambda_r\overset{P}{\to}\infty$. To our knowledge, this finding first appeared formally in De Mol, Giannone, and Reichlin (Citation2008), who studied the PCR model with $a_p$ diverging at a linear rate in $p$. Such a ridge estimator is a smoothed approximation of the PCR estimator, performing an asymptotic selection of principal components in the sense that $\delta(\lambda_i)=\frac{\lambda_i}{\lambda_i+\gamma}\ge\frac{\lambda_r}{\lambda_r+\gamma}\overset{P}{\to}1$ as $\gamma/\lambda_r=O_P(\gamma/a_p)\overset{P}{\to}0$ for $i=1,\ldots,r$, while $\delta(\lambda_i)=\frac{\lambda_i}{\lambda_i+\gamma}\le\frac{\lambda_{r+1}}{\lambda_{r+1}+\gamma}\overset{P}{\to}0$ for all the non-spiked eigenvalues $\{\lambda_i: i\ge r+1\}$, which are stochastically bounded, as $\gamma\overset{P}{\to}\infty$. In practice, one may select such a ridge penalty coefficient by cross-validation, and we postpone the details to the next section. The optimal predictive loss vanishes in probability because we can estimate the fixed-effects consistently and the random-effects are negligible.

A more interesting case is the augmented model with an idiosyncratic signal length $\tau^2>0$ bounded away from zero (stochastically). In this case, minimizing the asymptotic limit in (2.11) is equivalent to minimizing the integrand $\tau^2(1-\delta(x))^2+\sigma^2\frac{p}{n}\frac{\delta^2(x)}{x}$ for each $x\in(0,\infty)$.

That is, for each $x\in(0,\infty)$, we choose $\delta(x)$ as the minimizer of the convex quadratic function $\Delta_x(\delta):=\tau^2(1-\delta)^2+\sigma^2\frac{p}{n}\frac{\delta^2}{x}$ over $\delta\in[0,1]$.

By solving the first-order condition $\Delta_x'(\delta)=0$, it is elementary to show that the optimal shrinkage function is given by
$$\delta^*(x)=\frac{x}{x+\gamma^*},\qquad \gamma^*=\frac{p}{n}\frac{\sigma^2}{\tau^2},\qquad \tilde{\delta}^*(x)\equiv\gamma^*, \tag{2.13}$$
which corresponds to a ridge estimator with the penalty coefficient $\gamma^*$, that is, with $W(\tilde{\delta}^*)=\gamma^*I_p$ in (2.5). One can interpret $\gamma^*$ as the noise-to-signal ratio $\sigma^2/\tau^2$ scaled by the aspect ratio $p/n$. Note that the optimal predictive loss does not vanish in this case, as the high-dimensional random-effects $\beta$ cannot be estimated consistently. Since the ridge coefficient $\gamma^*$ is now (stochastically) bounded, the optimal shrinkage function (2.13) naturally satisfies the condition that $1-\delta(\lambda_i)\overset{P}{\to}0$ as the spiked eigenvalues $\lambda_i\overset{P}{\to}\infty$ for $1\le i\le r$, leading to consistent estimation of the fixed-effects $\theta$.
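For completeness, writing out the first-order-condition step makes the closed form (2.13) explicit:
$$\Delta_x'(\delta)=-2\tau^2(1-\delta)+2\sigma^2\frac{p}{n}\frac{\delta}{x}=0\;\Longrightarrow\;\delta^*(x)=\frac{\tau^2x}{\tau^2x+\sigma^2p/n}=\frac{x}{x+\gamma^*},\qquad\gamma^*=\frac{p}{n}\frac{\sigma^2}{\tau^2}.$$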

Remarkably, the optimal shrinkage function (2.13) is the same for all our predictive loss functions sharing the approximate form (2.11). We summarize the optimality of ridge regression into the following corollary:

Corollary 3

(Optimality of ridge regression).

Consider any predictive loss $\tilde{L}$ admitting the asymptotic approximation in Theorem 1. For an arbitrarily small $\zeta>0$ and any candidate shrinkage function $\delta=\delta_n$ from Theorem 1,
$$P\big(\tilde{L}(\delta)>\tilde{L}(\delta^*)-\zeta\big)\to1,$$
where $\delta^*(x)=\frac{x}{x+\gamma^*}$ denotes the shrinkage function of the optimal ridge estimator with the penalty coefficient $\gamma^*=\frac{p}{n}\frac{\sigma^2}{\tau^2}$ when $\tau^2$ is bounded away from 0 with probability approaching 1, or otherwise an arbitrary diverging penalty $\gamma^*\overset{P}{\to}\infty$ with $\gamma^*=o_P(a_p)$ if $\tau^2=o_P(1)$. In the latter case, $\tilde{L}(\delta^*)\overset{P}{\to}0$ and we may take $\gamma^*=\exp\big(\sum_{i=1}^{r_{\max}+1}w_i\log\lambda_i\big)$, where $w_i\in(0,1)$ are arbitrary fixed weights such that $\sum_{i=1}^{r_{\max}+1}w_i=1$, for any pre-specified bound $r_{\max}\ge r$. However, in general, only the fixed-effects are consistently estimated, in the sense that $\|\theta-\Lambda'\hat{\beta}(\tilde{\delta}^*)\|^2\overset{P}{\to}0$, and the optimal ridge estimator trades off the estimation bias and variance of the random-effects.

Remark 2.

Corollary 3 generalizes to heteroscedastic errors with distinct variances $\sigma_i^2=\mathrm{var}(\varepsilon_i)$ under strengthened conditions on the covariates $\{x_i\}$, where one can relax expression (2.13) by replacing $\sigma^2$ with the average variance $\bar{\sigma}^2=\frac{1}{n}\sum_{i=1}^n\sigma_i^2$. Due to space limits, we postpone the details to Appendix A.8 in the supplementary materials.

2.2 Uniform Consistency of Bias-Corrected k-Fold Cross-Validation

This section establishes the uniform consistency of the k-fold cross-validation (CV) algorithm for selecting the best ridge estimator, subject to a straightforward bias correction. Our asymptotic limit of the cross-validated mean squared error is uniform over all positive ridge penalties bounded away from zero, including those diverging to infinity. Throughout we use a fixed number of folds $k\ge2$. Generally speaking, there are both computational and statistical advantages to using k-fold CV rather than leave-one-out CV; see, for example, Section 5.1.3 of James et al. (Citation2021). On the other hand, using a small number of folds may introduce a bias in the selection if the optimal parameter depends on the sample size of the training set.

To fix ideas, the procedure starts by partitioning the entire index set $\{1,\ldots,n\}$ into $k$ mutually exclusive subsets $I_1,\ldots,I_k$. For generality, we treat the partition $\{I_j: j=1,\ldots,k\}$ as fixed for any given $n$. When the folds are randomly created, our setup is equivalent to conditioning on a particular realization. One simple choice is to divide $\{1,\ldots,n\}$ into blocks, using consecutive indexes in each fold. In general, it is unnecessary to use blocks, and we may assign indexes randomly to the folds. Following common practice, we generate folds of approximately equal size such that $n_j/n\to1/k$, where $n_j:=|I_j|$ denotes the cardinality of the index subset $I_j$.

Let $\tilde{y}_i=y_i-\bar{y}$ and $\tilde{x}_i=x_i-\bar{x}$ be the demeaned observations, where $\bar{y}$ and $\bar{x}$ are the sample means using all data. For any fold $I_j$ treated as a validation set, we fit a ridge estimator $\hat{\beta}_{-j}(\gamma)$ on the remaining $k-1$ folds by minimizing the $L_2$-penalized error given by
$$\hat{\beta}_{-j}(\gamma)=\arg\min_{\beta}\Big\{\frac{1}{n_{-j}}\sum_{i\notin I_j}(\tilde{y}_i-\tilde{x}_i'\beta)^2+\gamma\|\beta\|^2\Big\}, \tag{2.14}$$
where $n_{-j}=n-n_j$ denotes the number of training observations and $\gamma>0$ is a candidate ridge penalty coefficient. Recall the demeaned design matrix $X=[\tilde{x}_1,\ldots,\tilde{x}_n]'$ and demeaned target vector $Y=(\tilde{y}_1,\ldots,\tilde{y}_n)'$. Let $D_{-j}=\mathrm{diag}(\mathbf{1}[1\notin I_j],\ldots,\mathbf{1}[n\notin I_j])$ be the $n\times n$ diagonal matrix that selects the observations in the training set $I_{-j}:=I_j^c$. Solving the quadratic optimization problem (2.14) yields the closed-form solution
$$\hat{\beta}_{-j}(\gamma)=(X'D_{-j}X+n_{-j}\gamma I_p)^{-1}X'D_{-j}Y. \tag{2.15}$$

The mean squared error on the observations in the held-out fold $I_j$ is given by $\mathrm{MSE}_j(\gamma)=\frac{1}{n_j}\sum_{i\in I_j}(\tilde{y}_i-\tilde{x}_i'\hat{\beta}_{-j}(\gamma))^2$.

We repeat the procedure for $j=1,\ldots,k$ and average these errors to compute the k-fold cross-validation MSE given by $\mathrm{CV}^{(k)}(\gamma)=\sum_{j=1}^k\frac{n_j}{n}\mathrm{MSE}_j(\gamma)=\frac{1}{n}\sum_{j=1}^k\sum_{i\in I_j}(\tilde{y}_i-\tilde{x}_i'\hat{\beta}_{-j}(\gamma))^2$.

We then choose the value of $\gamma$, over a sufficiently fine grid, that minimizes $\mathrm{CV}^{(k)}(\gamma)$. Note that with ridge regression we need neither to estimate the factor scores nor to determine the number of factors.
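A minimal NumPy sketch of this selection step, assuming contiguous folds of roughly equal size and a user-supplied penalty grid (all names and the grid are illustrative):

    import numpy as np

    def kfold_cv_ridge(X, y, gammas, k=5):
        """k-fold CV curve for the ridge penalty, using the closed form (2.15)."""
        n, p = X.shape
        Xc, yc = X - X.mean(0), y - y.mean()        # demean with full-sample means
        folds = np.array_split(np.arange(n), k)     # blocks I_1, ..., I_k
        cv = np.zeros(len(gammas))
        for idx in folds:
            train = np.setdiff1d(np.arange(n), idx)
            G, g = Xc[train].T @ Xc[train], Xc[train].T @ yc[train]
            for a, gam in enumerate(gammas):
                beta = np.linalg.solve(G + len(train) * gam * np.eye(p), g)   # (2.15)
                cv[a] += np.sum((yc[idx] - Xc[idx] @ beta) ** 2) / n          # CV^(k)
        return gammas[np.argmin(cv)], cv

    # Example: gamma_cv, curve = kfold_cv_ridge(X, y, np.geomspace(1e-3, 1e3, 50))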

To apply factor analysis to the training subsamples rather than the full sample, we assume one additional regularity condition:

Assumption 6.

Let $\Omega_{f,-j}:=n_{-j}^{-1}\sum_{i\notin I_j}(f_i-\bar{f})(f_i-\bar{f})'$ denote the training covariance matrix of the latent factor scores. For every $j=1,\ldots,k$, the eigenvalues of the $r\times r$ matrix $\Omega_{f,-j}^{1/2}\Sigma_\Lambda\Omega_{f,-j}^{1/2}$ converge to some limits $\sigma_{11,j}\ge\cdots\ge\sigma_{rr,j}>0$.

This condition allows us to reparameterize the factor scores and loadings on the training set to maintain orthogonality structures similar to those in Assumption 2. More specifically, consider the spectral decomposition $\Omega_{f,-j}^{1/2}\Sigma_\Lambda\Omega_{f,-j}^{1/2}=Q_j\Sigma_jQ_j'$, where $\Sigma_j$ is a diagonal matrix and $Q_j$ is an orthogonal matrix. Recall the demeaned matrix of the idiosyncratic information $E:=[e_1-\bar{e},\ldots,e_n-\bar{e}]'$. The training design matrix obeys the approximate factor model
$$D_{-j}X=D_{-j}F\Lambda'+D_{-j}E\Gamma'=F_{-j}\Lambda_{-j}'+D_{-j}E\Gamma',$$
where $F_{-j}:=D_{-j}F\Omega_{f,-j}^{-1/2}Q_j$ and $\Lambda_{-j}:=\Lambda\Omega_{f,-j}^{1/2}Q_j$ satisfy the orthogonality conditions
$$F_{-j}'F_{-j}/n_{-j}=I_r\qquad\text{and}\qquad \Lambda_{-j}'\Lambda_{-j}/a_p=\Sigma_j\to\mathrm{diag}(\sigma_{11,j},\ldots,\sigma_{rr,j}).$$

Assumption 6 holds under the stronger condition that $n_{-j}^{-1}\sum_{i\notin I_j}(f_i-\bar{f})(f_i-\bar{f})'\to I_r$ for all $j=1,\ldots,k$, in which case the limiting eigenvalues are $\sigma_{ii,j}=\sigma_{ii}$. If the $f_i$ are iid random vectors with an identity covariance matrix, the condition holds almost surely by the strong law of large numbers. When $\{f_i\}$ is a (multivariate) time series and the folds are blocks, that is, each time index subset $I_j$ corresponds to a contiguous time period, the condition follows from ergodicity:
$$\frac{1}{n_{-j}}\sum_{i\notin I_j}(f_i-\bar{f})(f_i-\bar{f})'=\frac{n}{n_{-j}}\cdot\frac{1}{n}\sum_{i=1}^n(f_i-\bar{f})(f_i-\bar{f})'-\frac{n_j}{n_{-j}}\cdot\frac{1}{n_j}\sum_{i\in I_j}(f_i-\bar{f})(f_i-\bar{f})'=\frac{n}{n_{-j}}I_r-\frac{n_j}{n_{-j}}I_r+o_P(1)=I_r+o_P(1)$$
for any stationary ergodic process $\{f_i: i\in\mathbb{Z}\}$ with zero mean and identity covariance matrix, where $\mathbb{Z}$ denotes the set of integers; see Blum and Eisenberg (Citation1975) for more discussion. Assumption 6 then holds with probability approaching 1.

Furthermore, when the folds are randomly generated, the results remain true if the projection process $\{w'f_i: i\in\mathbb{Z}\}$ is stationary and strong mixing for every unit vector $w\in\mathbb{R}^r$, by combining the theorem in Blum and Hanson (Citation1960) and the Cramér–Wold device. See Bradley (Citation1986) and Bradley (Citation2005) for various types of sufficient conditions for strong mixing.

Theorem 2.

Suppose that Assumptions 1–4 and 6 hold. Uniformly over all positive ridge penalties $\gamma$ bounded away from 0,
$$\mathrm{CV}^{(k)}(\gamma)-\sigma^2-\Big(\frac{\gamma}{a_p+\gamma}\Big)^2\theta'W_\gamma\theta-\int_0^{\infty}\Big\{\tau^2\frac{\gamma^2}{(x+\gamma)^2}+\sigma^2\frac{p}{n_{\mathrm{eff}}}\frac{x}{(x+\gamma)^2}\Big\}\,dF_p^{(k)}(x)-\tau^2F_p^{(k)}(0)\overset{P}{\to}0,$$

where $F_p^{(k)}$ is some (improper) empirical distribution depending only on the covariates $\{x_i\}$, the effective training sample size is given by $n_{\mathrm{eff}}=\frac{1}{k}\sum_{j=1}^k n_{-j}$, provided that $n_j/n$ converges to the same limit for all $j$, and $W_\gamma\in\mathbb{R}^{r\times r}$ is a random weight matrix that is stochastically bounded and positive-definite with probability tending to 1 when $r>0$. The closed-form expressions of $F_p^{(k)}$ and $W_\gamma$ are available in Section E.3 of the supplementary materials.

The first term $\sigma^2$ is irreducible but irrelevant to the choice of $\gamma$. The second term $(\frac{\gamma}{a_p+\gamma})^2\theta'W_\gamma\theta$ represents the cross-validated estimation error of the fixed-effects, of the same (stochastic) order as $(\frac{\gamma}{a_p+\gamma})^2\|\theta\|^2$. The remaining terms represent the cross-validated error of the random-effects, where $n_{\mathrm{eff}}\approx\frac{k-1}{k}n$ due to the train-test split.

Let $\hat{\gamma}_{\mathrm{cv}}$ denote the optimal ridge penalty selected by the regular k-fold CV. When $\tau^2\overset{P}{\to}0$, the k-fold CV chooses some divergent penalty $\hat{\gamma}_{\mathrm{cv}}\overset{P}{\to}\infty$ to shrink the random-effects estimator toward zero as much as possible, but with $\hat{\gamma}_{\mathrm{cv}}=o_P(a_p)$ (unless $\|\theta\|^2=0$) so that the fixed-effects estimation errors vanish as well. Note that any divergent penalty sequence proportional to $\hat{\gamma}_{\mathrm{cv}}$ is also asymptotically optimal. When $\tau^2$ is bounded away from 0, by the same arguments preceding Corollary 3, one should again minimize the integrand $\tau^2\frac{\gamma^2}{(x+\gamma)^2}+\sigma^2\frac{p}{n_{\mathrm{eff}}}\frac{x}{(x+\gamma)^2}$ for all $x>0$ by using the optimal penalty $\gamma^*_{\mathrm{cv}}=\frac{p}{n_{\mathrm{eff}}}\frac{\sigma^2}{\tau^2}=\frac{n}{n_{\mathrm{eff}}}\gamma^*\approx\frac{k}{k-1}\gamma^*$, where $\gamma^*=\frac{p}{n}\frac{\sigma^2}{\tau^2}$ denotes the optimal ridge penalty in Corollary 3. To conclude, the bias-corrected estimator of the optimal ridge penalty, given by
$$\hat{\gamma}:=\frac{k-1}{k}\hat{\gamma}_{\mathrm{cv}},$$
satisfies the asymptotic optimality criterion in Corollary 3.
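Equivalently, since the CV criterion targets $\gamma^*_{\mathrm{cv}}\approx\frac{k}{k-1}\gamma^*$ rather than $\gamma^*$ itself, the rescaling undoes the train-test-split bias:
$$\hat{\gamma}=\frac{k-1}{k}\hat{\gamma}_{\mathrm{cv}}\approx\frac{k-1}{k}\cdot\frac{k}{k-1}\gamma^*=\gamma^*.$$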

3 Extension of Main Results

This section discusses two extensions of the main results for handling dependent data. Since our applications are often (although not necessarily) based on time series data, we shall index the observations by $t=1,\ldots,n$ instead of $i$ to emphasize the concept of time.

Our first extension is to relax the independence Assumption 3 on regression errors and only require them to be martingale differences.

Assumption 7

(MD Errors).

The regression errors $\varepsilon_t=\sigma\eta_t$, where $\{\eta_t,\mathcal{F}_{nt}: t\in\mathbb{Z}\}$ is a martingale difference sequence for each $n$ such that $\mathrm{E}_{t-1}\eta_t=0$ and $\mathrm{E}_0\eta_t^2=1$, and $\mathrm{E}_j[\cdot]=\mathrm{E}[\cdot\mid\mathcal{F}_{nj}]$ denotes the conditional expectation given the sigma-algebra $\mathcal{F}_{nj}$ generated by $\{\eta_s: s\le j\}$, $\{f_t\}$, and $\{e_t\}$. The variance $\sigma^2=\sigma_n^2=O_P(1)$ is bounded away from zero almost surely, and may or may not depend on the covariates $\{x_t\}$. For some $\iota>0$, $\sum_{t=1}^n\mathrm{E}|\eta_t|^{2+2\iota}=O(n)$.

This MD assumption is required for the theory of k-fold cross-validation with uncorrelated (but dependent) errors using the usual random train-test splits. Note that the conditional variance $\mathrm{E}_{t-1}\eta_t^2$ can be stochastic, meaning that we allow for conditionally heteroscedastic errors with stochastic volatilities such as GARCH-type error processes (see, e.g., Davis and Mikosch Citation2009 for an excellent survey). The last part of the assumption, allowing for heavy-tailed distributions, is needed for the uniform integrability of $\{\eta_t^2\}$. In general, the $\{\eta_t\}$ need not even be identically distributed, but if they are, the last part simplifies to $\mathrm{E}|\eta_t|^{2+2\iota}=O(1)$.

Theorem 3.

All theorems and corollaries in Section 2 remain true with the independence Assumption 3 replaced by the MD Assumption 7, provided that the temporal dependence conditions on $\{\eta_t\}$ in Appendix A.5 of the supplementary materials hold.

The most striking finding is that, at least for ridge regression, we can in fact split the time series arbitrarily without worrying about the so-called "time leakage" issues in machine learning. In other words, the time order between the training and validation data is unimportant, and the usual random train-test splits are allowed.

Our second extension is the factor-augmented autoregressive model given by
$$x_t=\Lambda f_t+\Gamma e_t, \tag{3.1}$$
$$y_t=\phi_0+\sum_{l=1}^q\phi_l y_{t-l}+f_t'\theta+e_t'\beta+\varepsilon_t, \tag{3.2}$$
where $\varepsilon_t$ are martingale differences as in Assumption 7. The first equation is the same as (2.1), while the second adds lagged target variables to (2.2). In forecasting applications, $x_t$ are often lagged variables that are observable before time $t$. Let $z_t=(y_{t-1},\ldots,y_{t-q})'$ collect the lagged target values, and let $\phi=(\phi_1,\ldots,\phi_q)'$ be their coefficient vector. We can rewrite the model in the general form
$$y_t=\phi_0+z_t'\phi+f_t'\theta+e_t'\beta+\varepsilon_t. \tag{3.3}$$

For simplicity, we assume that the order of autoregression q is finite and known. Unless specified otherwise, we assume that the autoregressive model satisfies the standard stability condition (see, e.g., Chapter 6 of Hayashi Citation2000, p. 373):

Assumption 8.

All the roots of the $q$th-degree polynomial equation $1-\phi_1z-\phi_2z^2-\cdots-\phi_qz^q=0$ are greater than 1 in absolute value when $q>0$. The target variable has a finite second moment; that is, $\mathrm{E}[y_t^2]$ is bounded uniformly over $t$.

We extend the ridge regression to allow for the nuisance variables $z_t$ and minimize the penalized mean squared error in the parameters $(\alpha,\phi,\beta)$ given by
$$\frac{1}{n}\sum_{t=1}^n(y_t-\alpha-z_t'\phi-x_t'\beta)^2+\gamma\|\beta\|^2,$$
where $\gamma>0$ is some candidate ridge penalty that may depend on $\{x_t\}$. When $q=0$, this reduces to the standard ridge problem. Let $Z=[z_1-\bar{z},\ldots,z_n-\bar{z}]'$ be the demeaned design matrix of autoregressors with $\bar{z}=n^{-1}\sum_{t=1}^n z_t$, and define the projection matrix onto its column space by $P_Z=Z(Z'Z)^{-1}Z'$. Solving the quadratic optimization problem yields the three-step estimator
$$\begin{cases}\hat{\beta}(\gamma)=(X'(I-P_Z)X+n\gamma I_p)^{-1}X'(I-P_Z)Y,\\ \hat{\phi}(\gamma)=(Z'Z)^{-1}Z'(Y-X\hat{\beta}(\gamma)),\\ \hat{\alpha}(\gamma)=\bar{y}-\bar{z}'\hat{\phi}(\gamma)-\bar{x}'\hat{\beta}(\gamma).\end{cases} \tag{3.4}$$
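A minimal NumPy sketch of the closed form (3.4) for a given penalty (the function name is illustrative):

    import numpy as np

    def ridge_with_nuisance(X, Z, y, gamma):
        """Three-step estimator (3.4): ridge on the covariates after projecting
        out the demeaned autoregressors, then OLS for phi and the intercept."""
        n, p = X.shape
        Xc, Zc, yc = X - X.mean(0), Z - Z.mean(0), y - y.mean()
        PZ = Zc @ np.linalg.solve(Zc.T @ Zc, Zc.T)       # projection onto span(Z)
        M = np.eye(n) - PZ                               # annihilator I - P_Z
        beta = np.linalg.solve(Xc.T @ M @ Xc + n * gamma * np.eye(p), Xc.T @ M @ yc)
        phi = np.linalg.solve(Zc.T @ Zc, Zc.T @ (yc - Xc @ beta))
        alpha = y.mean() - Z.mean(0) @ phi - X.mean(0) @ beta
        return alpha, phi, beta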

We are interested in minimizing the predictive loss given by
$$L(\gamma)=\mathrm{E}\big[(\mu_{n+h}-\hat{m}_\gamma(z_{n+h},x_{n+h}))^2\mid\mathcal{S}_n\big]\quad\text{for some given forecast horizon }h\ge1. \tag{3.5}$$

The sigma-algebra $\mathcal{S}_n$ is generated by the past variables $\{f_t,e_t,\varepsilon_t: t\le n\}$ and the random coefficients $\beta$. An inevitable challenge here is that the target variables in the training and test samples must be dependent due to the autoregression (3.2). Nevertheless, we can generate new covariates in the future according to a paradigm similar to that in Section 2:

Assumption 9

(Testing Paradigm for AR Models).

The testing paradigm in Assumption 5 holds with $f_{\mathrm{oos}}=f_{n+h}$ and $e_{\mathrm{oos}}=e_{n+h}$ for all horizons $h\in\{1,2,\ldots\}$. For all $1\le l\le h-1$, $\mathrm{E}[(e_{n+h-l}-\bar{e})'(e_{n+h}-\bar{e})\mid\mathcal{S}_n]=o_P(p)$.

Theorem 4.

Under the conditions of Theorem 3, Assumptions 8–9, and the conditions in Appendix A.6 in the supplementary materials, the optimal ridge penalty remains the same as in Corollary 3 for the testing predictive loss (3.5) at all forecast horizons $h\ge1$.

Nevertheless, the classic results for the standard k-fold cross-validation do not extend to autoregressive models with large-dimensional covariates. We note that there are some recent findings in Bergmeir, Hyndman, and Koo (Citation2018) for simple autoregressive models with no exogenous variables, but their results do not apply in the presence of large-dimensional covariates either. Hence, we suggest removing the autoregressive components before cross-validation using the following algorithm:

  • Step 1. Let $\hat{\phi}$ be some preliminary estimator of the autoregressive coefficients $\phi$, and construct the autoregressive residuals $r_t=(y_t-\bar{y})-(z_t-\bar{z})'\hat{\phi}$. Apply the k-fold cross-validation to select the optimal ridge penalty by regressing the residuals $r_t$ on the demeaned covariates $x_t-\bar{x}$ using the standard ridge estimator.

  • Step 2. Calculate the bias-corrected ridge penalty $\hat{\gamma}=\hat{\gamma}_{\mathrm{cv}}(1-\frac{1}{k})$, where $\hat{\gamma}_{\mathrm{cv}}$ is the optimal ridge penalty selected in Step 1.

  • Step 3. Calculate the estimators $\hat{\beta}$, $\hat{\phi}$, and $\hat{\alpha}$ in (3.4) using the penalty $\hat{\gamma}$ from Step 2.

Theorem 5.

Under the conditions of Theorem 4, Theorem 2 remains true provided that the preliminary estimator $\hat{\phi}$ is consistent for $\phi$.

In the rest of this section, we discuss how to obtain a consistent estimator $\hat{\phi}$ under our mixed-effects models. We show in Appendix A.7 of the supplementary materials that the naive least-squares estimator using the autoregressors alone (i.e., neglecting the covariates) is consistent if the latent factors are asymptotically uncorrelated over time. To avoid this restrictive condition, we suggest using the estimator $\hat{\phi}$ from a preliminary ridge regression (3.4) instead, according to the following theorem. We call this approach "double" ridge regression because we apply ridge regression twice, once in Step 1 and once more in Step 3, with different penalties.

Theorem 6

(Double ridge regression).

Suppose that the conditions of Theorem 4 hold. The ridge estimator $\hat{\phi}(\gamma)$ defined in (3.4) is consistent for $\phi$ if $\gamma=o_P(a_p)$ and $\gamma$ is bounded away from 0 with probability approaching 1, where $\gamma=\gamma_n$ may depend on the covariates $\{x_t\}$. When $\sum_{i=1}^r\theta_i^2=0$, the result remains true even if $\gamma=\gamma_n$ diverges at the same rate as $a_p$ or faster. In particular, the following choices both satisfy the consistency condition:

  1. Any constant ridge penalty γ>0 not depending on n;

  2. Any diverging ridge penalty of the form $\gamma=\exp\big(\sum_{i=1}^{r_{\max}+1}w_i\log\lambda_i\big)$, where $r_{\max}$ is some prespecified upper bound on the true number of factors $r$ and $w_i\ge0$ are arbitrary fixed constant weights such that $\sum_{i=1}^{r_{\max}+1}w_i=1$ and $w_{r_{\max}+1}>0$.

An interesting implication of this theorem is that, even with a possibly inconsistent estimator $\hat{\phi}$ in Step 1 (e.g., the least-squares estimator omitting the covariates), the final estimator $\hat{\phi}$ in Step 3 can recover consistency as long as the ridge penalty from Step 2 is not too large in the sense of Theorem 6. Unless specified otherwise, we use the ridge penalty $\gamma=\lambda_{r_{\max}+1}$ for the preliminary ridge regression in our numerical analysis.
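Putting the pieces together, here is a sketch of the full double ridge procedure under the same illustrative assumptions, reusing the ridge_with_nuisance helper sketched after (3.4); the penalty grid, fold construction, and default r_max are arbitrary choices, not prescriptions from the article.

    import numpy as np

    def double_ridge(X, Z, y, k=5, gammas=np.geomspace(1e-3, 1e3, 40), r_max=10):
        """Steps 1-3 with a preliminary ridge estimate of phi (Theorem 6)."""
        n, p = X.shape
        Xc, yc = X - X.mean(0), y - y.mean()
        # Preliminary ridge regression with penalty lambda_{r_max + 1}
        lam = np.sort(np.linalg.eigvalsh(Xc.T @ Xc / n))[::-1]
        _, phi0, _ = ridge_with_nuisance(X, Z, y, lam[r_max])
        # Step 1: k-fold CV of the ridge penalty on the autoregressive residuals
        r = yc - (Z - Z.mean(0)) @ phi0
        folds, cv = np.array_split(np.arange(n), k), np.zeros(len(gammas))
        for idx in folds:
            tr = np.setdiff1d(np.arange(n), idx)
            G, g = Xc[tr].T @ Xc[tr], Xc[tr].T @ r[tr]
            for a, gam in enumerate(gammas):
                b = np.linalg.solve(G + len(tr) * gam * np.eye(p), g)
                cv[a] += np.sum((r[idx] - Xc[idx] @ b) ** 2) / n
        # Step 2: bias correction;  Step 3: final estimates via (3.4)
        gamma_hat = gammas[np.argmin(cv)] * (1 - 1 / k)
        return ridge_with_nuisance(X, Z, y, gamma_hat)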

4 Monte Carlo Simulations

In this section, we compare the predictive performance of the ridge estimators and their competitors under factor-augmented models using Monte Carlo experiments. We fix the number of predictors at $p=120$ but vary the sample size $n\in\{60,120,240\}$, yielding concentration ratios $p/n\in\{2,1,0.5\}$. These dimensions are calibrated from our empirical analysis, where $n\approx p\approx120$. The results for a fixed sample size $n=120$ but a much larger number of predictors $p\in\{480,840,1200\}$ are available in Appendix C.3 in the supplementary materials.

We calibrate the data generating process of the latent factors $\{f_t\}$ from a vector autoregressive (VAR) model with four lags, and the factor loading matrix $\Lambda$ from a multivariate Gaussian distribution, using the estimated scores and loadings of $r=7$ factors from the FRED-MD database in Section 5. The details of the calibration procedure are available in Appendix B.1 of the supplementary materials. We then generate exogenous predictors from the factor model given by
$$x_t=\Lambda f_t+\Gamma e_t=\Lambda f_t+\Gamma\Sigma^{1/2}\varsigma_t,$$
where the entries of $\varsigma_t=(\varsigma_{t,1},\ldots,\varsigma_{t,p})'$ are independently generated from the standard normal distribution, $\Sigma$ is a diagonal covariance matrix from Bai and Silverstein (Citation1998) (with 20% of its eigenvalues equal to $\lambda$, 40% equal to $3\lambda$, and 40% equal to $10\lambda$) such that $\mathrm{tr}(\Sigma)=\mathrm{tr}(\Lambda'\Lambda)$, and the rotation matrix $\Gamma$ is generated uniformly from the class of orthogonal matrices. The factor scores $f_t$ are unobservable to the statistician, who always approximates them by $\sqrt{n}$ times the leading eigenvectors of $XX'$ when needed. We then generate the target variable from the factor-augmented autoregressive model of order $q=3$ given by
$$y_t=0.29y_{t-1}+0.07y_{t-2}+0.11y_{t-3}+f_t'\theta+e_t'\beta+\varepsilon_t,$$
where we set a zero intercept $\phi_0=0$ without loss of generality, and the autoregressive coefficients $\phi=(0.29,0.07,0.11)'$ are calibrated from the monthly growth of the U.S. industrial production index. In the supplementary materials we repeat the analysis for the case with no autoregressor (i.e., $q=0$), and the conclusions remain qualitatively the same. The entries of the factor coefficients $\theta=(\theta_1,\ldots,\theta_7)'$ are generated from the uniform distribution on $(0,1)$. All these coefficients are unknown to the statistician, who always demeans the predictors in each sample and estimates all the coefficients. We consider two different data generating processes for the regression errors:

  1. independent errors $\varepsilon_t$ from the standardized Student $t_5$ distribution;

  2. GARCH(1,1) errors $\varepsilon_t=\sqrt{h_t}\,v_t$ with $h_t=1-a-b+a\varepsilon_{t-1}^2+bh_{t-1}$, where $a=0.1$ and $b=0.8$.

Without loss of generality, we maintain a long-run variance $\sigma^2=1$ in both cases. The errors are independent of the covariates. The results are similar for the two types of errors. To save space, we report only the results for GARCH errors in the main document and postpone the results for independent $t_5$ errors to Appendix C in the supplementary materials.

To compare sparse and dense models in terms of idiosyncratic information, we generate the entries of the idiosyncratic regression coefficients $\beta=(\beta_1,\ldots,\beta_p)'$ independently from a "spike-and-slab" distribution (Mitchell and Beauchamp Citation1988) such that
$$\beta_i\sim\begin{cases}N(0,\,5/p)&\text{with probability }\rho,\\ 0&\text{with probability }1-\rho,\end{cases}$$
and the signal strength $\tau^2=p\,\mathrm{E}\beta_i^2=5\rho\in\{0,0.2,0.5,1,2,5\}$ controls the denseness level; the larger $\tau^2$, the fewer zero entries in $\beta$ in probability.
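For instance, the coefficient draw can be sketched as follows (the value of rho is an illustrative choice):

    import numpy as np

    rng = np.random.default_rng(0)
    p, rho = 120, 0.2                                   # tau^2 = 5 * rho = 1
    slab = rng.standard_normal(p) * np.sqrt(5 / p)      # N(0, 5/p) draws
    beta = np.where(rng.random(p) < rho, slab, 0.0)     # spike-and-slab coefficients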

Table 1 reports the median, over 5000 replications, of the testing errors $(\mu_{n+h}-\hat{\mu}_{n+h})^2$ at the horizon $h=1$ (outside brackets) and the training errors $\frac{1}{n}\sum_{t=1}^n(\mu_t-\hat{\mu}_t)^2$ (inside brackets) for the following seven estimators:

Table 1: The testing errors (and training errors) over 5000 replications for GARCH errors.

  1. PCR: Principal component regression using true number of factors r=7.

  2. KS2011: Factor-adjusted LASSO estimator proposed by Kneip and Sarda (Citation2011) with L1 penalty parameter selected by k-fold cross-validation;

  3. PCR-CV: PCR using the number of factors selected by k-fold cross-validation.

  4. SRR: Single ridge regression with the ridge penalty parameter selected by the bias-corrected k-fold cross-validation procedure proposed in the last section, using the least-squares estimator of the autoregressive coefficients in Step 1.

  5. FaSRR: Factor-adjusted ridge regression that shrinks the eigenvalues as SRR except for the largest r eigenvalues.

  6. DRR: Double ridge regression with the ridge penalty parameter selected by the bias-corrected k-fold cross-validation procedure, using the preliminary ridge regression of the autoregressive coefficients in Step 1 given in Theorem 6.

  7. FaDRR: Factor-adjusted ridge regression that shrinks the eigenvalues as DRR except for the largest r eigenvalues.

The estimators SRR and DRR are what we propose in the last section, except they use different estimators of the autoregressive coefficients. The SRR estimator is slightly worse than DRR overall due to a larger estimation error of the autoregressive residuals in the cross-validation algorithm. The other estimators PCR, KS2011, FaSRR, and FaDRR are oracle in the sense that they use the true number of factors r=7. They can be feasible in practice when the number of factors can be consistently selected; see, for example, Bai and Ng (Citation2002). In contrast, the SRR and DRR estimators do not require estimating the number of factors but use all raw covariates. More implementation details of these algorithms are available in Appendix B.2 of the supplementary materials.

When the idiosyncratic signal $\tau^2=0$, the PCR and factor-adjusted ridge estimators usually perform the best. Although the SRR and DRR estimators suffer from more estimation bias of the fixed-effects due to shrinkage, they still show good performance and remain robust to the absence of idiosyncratic signals. The LASSO estimator KS2011 is the worst, suffering from variable selection errors in dense models, especially when $p>n$.

On the other hand, when $\tau^2>0$, our ridge estimators SRR and DRR consistently outperform all the sparse learning methods PCR, KS2011, and PCR-CV in terms of both testing and training errors, in accordance with our asymptotic theory. The advantage of the ridge estimators becomes more significant as the signal length $\tau^2$ grows.

It is noteworthy that direct factor adjustments cannot improve the ridge regression when $\tau^2>0$. FaSRR and FaDRR violate the full-rank condition (Appendix A.6) on the residual factor matrix by removing all the in-sample factor information asymptotically. As a result, the residual ridge regression fails to shrink the divergent loading matrix $\Lambda$ (via Lemma A7 in the supplementary materials), leading to an inflation of the fixed-effects estimation errors.

To conclude, the simulation results confirm that our double ridge regression with bias-corrected cross-validation is a robust forecasting method for factor-augmented models with potential idiosyncratic information and nuisance variables. It consistently delivers the best performance, with nontrivial improvements in dense models with large $\tau^2$, while remaining robust to sparse idiosyncratic information with small $\tau^2$.

5 A Real-life Example

One empirical application of dense learning is to forecast the monthly growth rate of the U.S. industrial production index using a large set of macroeconomic variables from the FRED-MD database publicly available at: https://research.stlouisfed.org/econ/mccracken/fred-databases/.

This monthly database has predictive content similar to that in Stock and Watson (Citation2002), and it is regularly updated through the Federal Reserve Economic Data. The industrial production index is an important indicator of macroeconomic activity. We calculate its monthly growth rate in percentage terms as
$$y_t=\log(\mathrm{IP}_t/\mathrm{IP}_{t-1})\times100,$$
where $\mathrm{IP}_t$ denotes the U.S. industrial production index for month $t$. Our sample spans from January 1959 to August 2021, and all the data are available at a monthly frequency. We transform the nonstationary economic variables into stationary forms and remove data outliers using the MATLAB codes provided on the website mentioned above; see also McCracken and Ng (Citation2016) for details.
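For example, with a NumPy array of monthly index levels, the transformation is a one-liner (the numbers below are made up purely for illustration):

    import numpy as np

    ip = np.array([101.2, 101.8, 101.5, 102.3])   # illustrative monthly IP levels
    y = 100 * np.diff(np.log(ip))                 # y_t = log(IP_t / IP_{t-1}) * 100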

In every month, we forecast the next month's growth using $q=0,1,\ldots,6$ autoregressor(s) and (the principal components of) all the other covariates in the database. We estimate the factors and regression coefficients in rolling windows of sample size $n=120$, which equals a time span of ten years. In each window we use the lagged variables with no missing values for training and forecasting. The covariates are normalized within each window, and we transform the latest observation accordingly.

Table 2 compares the out-of-sample mean squared forecast errors of the rolling-window predictions using the following six methods: purely autoregressive (AR), principal component regression (PCR), factor-adjusted LASSO regression (KS2011), principal component regression with the number of components selected by cross-validation (PCR-CV), the "single" ridge regression (SRR), and the "double" ridge regression (DRR) of Theorem 6. We report only the results using the 5-fold cross-validation after bias correction, which are slightly better than those before correction. The implementation details are available in Appendix B.3 of the supplementary materials. The mean squared errors are normalized by those of the moving-average forecasts (i.e., the autoregressive estimator with $q=0$) for comparison.

Table 2: Out-of-sample relative mean squared forecast errors for the growth rate of U.S. industrial production index.

Our double ridge estimator, namely DRR, systematically improves on the PCR estimators over the different choices of autoregressive order $q\ge1$, while the factor-adjusted LASSO estimator KS2011 consistently underperforms PCR. The cross-validation procedure hardly improves the PCR predictions. These findings suggest that the underlying model tends to be dense rather than sparse in terms of idiosyncratic information. The DRR estimator also consistently improves the prediction errors relative to SRR, which is consistent with our theory that DRR is more robust to complex covariate structures; see Theorem 6 and the discussion above it.

6 Concluding Remarks

This article studies high-dimensional ridge regression methods under factor models. Our theory of the efficiency and robustness of the ridge regression in capturing both factor and idiosyncratic information simultaneously helps explain its excellent performance in many complicated applications. Our results are applicable to large-dimensional datasets that simultaneously exhibit spiked eigenstructure, serial dependence, conditional heteroscedasticity, and heavy tails. In particular, we extend the theory of k-fold cross-validation beyond the iid setting and provide theoretical support for its application to more complex data.

One can compare the factor-augmented models (1.3)–(1.4) with the following synthetic control models studied by Agarwal et al. (2021) (see also Abadie, Diamond, and Hainmueller 2010; Amjad, Shah, and Shen 2018; Arkhangelsky et al. 2021):
\[
X = A + H, \qquad Y = A\theta^* + \varepsilon + \phi, \tag{6.1}
\]
where $X$ is the corrupted covariate matrix, the latent factor matrix $A = F\Lambda'$ is the true covariate matrix of low rank $r$ with a strong signal $\|A\|_F \gg \|H\|_F$, $\varepsilon$ contains the response noises, and $\phi$ contains the model misspecification errors. Rather than treating the misspecification error $\phi$ as deterministic and “inevitable” (see the interpretation below their Theorem 3.1 on p. 1735), here we model $\phi = H\Gamma\beta$ as a projection of the noise matrix $H$ through the random-effects coefficients $\beta$ and an arbitrary rotation $\Gamma$. Therefore, the RR can outperform the PCR by optimizing the bias-variance tradeoff on the misspecification errors $\phi$. An exciting future research topic is to allow for weak factors in (6.1) such that $\|A\|_F \asymp \|H\|_F$.
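To make the structure of (6.1) concrete, here is a minimal simulation sketch of the model with the dense misspecification error $\phi = H\Gamma\beta$. The dimensions, factor rank, and the scaling of the coefficients are illustrative choices of ours, not values from the article or from Agarwal et al. (2021).

```python
# Simulate model (6.1): X = A + H, Y = A theta* + eps + phi, with the dense
# misspecification phi = H Gamma beta. All numeric choices are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 200, 100, 3

F = rng.standard_normal((n, r))                       # latent factors
Lam = rng.standard_normal((p, r))                     # factor loadings
A = F @ Lam.T                                         # low-rank true covariate matrix
H = rng.standard_normal((n, p))                       # corruption / idiosyncratic noise
X = A + H                                             # observed covariate matrix

theta = rng.standard_normal(p) / np.sqrt(p)           # fixed-effects coefficients theta*
Gamma = np.linalg.qr(rng.standard_normal((p, p)))[0]  # arbitrary rotation
beta = rng.standard_normal(p) / np.sqrt(p)            # dense random-effects coefficients
phi = H @ Gamma @ beta                                # dense misspecification error
eps = rng.standard_normal(n)                          # response noise

Y = A @ theta + eps + phi                             # response in model (6.1)
```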

One may wonder whether our findings generalize beyond the random matrix regime. In higher dimensions with $\kappa_n := \max\{p/n, 1\} = p/n \to \infty$, our results remain true after generalizing Assumption 2 as follows: $a_p/\kappa_n \to \infty$ (which is trivial if $a_p$ is linear in $p$) and $\lambda_{\max}(\Omega_E) = O_P(\kappa_n)$. To see this, first apply our results to the rescaled covariates $x_i \mapsto x_i/\sqrt{\kappa_n}$ with the corresponding coefficients $\beta \mapsto \sqrt{\kappa_n}\,\beta$, and then transform back to the original scales. For lower dimensions with $p/n \to 0$ (but $p \to \infty$ sufficiently fast) and $\tau^2 > 0$, the optimal penalty $\gamma^* \to 0$ in (2.13), meaning that the optimal ridge estimator should approximate the least-squares estimator and outperform the PCR. Otherwise, when $\tau^2 = 0$, the optimal RR asymptotically reduces to the PCR; see Appendix A.9 for an illustration. A nonasymptotic comparison between the PCR and the RR is possible using concentration inequalities from martingale limit theory, but we leave it for future work.
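For concreteness, the rescaling step can be written out as follows. This is our rendering of the argument under the generalized assumptions above, with $\Omega_{\tilde E}$ and $\tilde a_p$ denoting the rescaled counterparts of $\Omega_E$ and $a_p$; it is a sketch, not a statement from the article.

```latex
% A sketch of the rescaling argument (our notation): rescale the covariates
% and coefficients so that the original assumptions hold, then map back.
\[
\tilde{x}_i := \frac{x_i}{\sqrt{\kappa_n}}, \qquad
\tilde{\beta} := \sqrt{\kappa_n}\,\beta
\;\;\Longrightarrow\;\;
\tilde{x}_i'\tilde{\beta} = x_i'\beta, \qquad
\lambda_{\max}\bigl(\Omega_{\tilde{E}}\bigr)
   = \frac{\lambda_{\max}(\Omega_E)}{\kappa_n} = O_P(1), \qquad
\tilde{a}_p = \frac{a_p}{\kappa_n} \to \infty.
\]
```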

Finally, our analysis assumes that the factor strength $a_p \to \infty$ so that the standard PCA is (subspace) consistent. Is the RR still favorable otherwise? This is an interesting future research topic. Using the simulation example in the Introduction, we can illustrate that the ridge regression still consistently outperforms the PCR when the standard PCA becomes inconsistent. Due to space limits, we leave the detailed illustrations to Appendix A.9 in the supplementary materials, where we also compare the regular and bias-corrected cross-validation procedures and discuss a possible extension to “sparse + dense” idiosyncratic information.

Supplementary Materials

All the proofs of the theorems are available in the supplementary document, which also includes more technical discussions and remarks, extra simulation results, and useful lemmas that may be of independent interest.

Acknowledgments

The author thanks the Editor, Professor Marina Vannucci, an Associate Editor, and two reviewers for their valuable comments that led to this improved version of the article.

Disclosure Statement

The author reports there are no competing interests to declare.

References

  • Abadie, A., Diamond, A., and Hainmueller, J. (2010). “Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program,” Journal of the American Statistical Association, 105, 493–505. DOI: 10.1198/jasa.2009.ap08746.
  • Agarwal, A., Shah, D., Shen, D., and Song, D. (2021), “On Robustness of Principal Component Regression,” Journal of the American Statistical Association, 116, 1731–1745. DOI: 10.1080/01621459.2021.1928513.
  • Alter, O., Brown, P. O., and Botstein, D. (2000). “Singular Value Decomposition for Genome-wide Expression Data Processing and Modeling,” Proceedings of the National Academy of Sciences, 97, 10101–10106. DOI: 10.1073/pnas.97.18.10101.
  • Amjad, M., Shah, D., and Shen, D. (2018). “Robust Synthetic Control,” Journal of Machine Learning Research, 19, 802–852.
  • Arkhangelsky, D., Athey, S., Hirshberg, D. A., Imbens, G. W., and Wager, S. (2021). “Synthetic Difference-In-Differences,” American Economic Review, 111, 4088–4118. DOI: 10.1257/aer.20190159.
  • Bai, J. (2003), “Inferential Theory for Factor Models of Large Dimensions,” Econometrica, 71, 135–171. DOI: 10.1111/1468-0262.00392.
  • Bai, J., and Ng, S. (2002), “Determining the Number of Factors in Approximate Factor Models,” Econometrica, 70, 191–221. DOI: 10.1111/1468-0262.00273.
  • Bai, J., and Ng, S. (2021), “Approximate Factor Models with Weaker Loadings,” ArXiv Working Paper arXiv:2109.03773.
  • Bai, Z. D., and Silverstein, J. W. (1998). “No Eigenvalues Outside the Support of the Limiting Spectral Distribution of Large-Dimensional Sample Covariance Matrices,” The Annals of Probability, 26, 316–345. DOI: 10.1214/aop/1022855421.
  • Bai, Z. D., and Silverstein, J. W. (2010), Spectral Analysis of Large Dimensional Random Matrices, New York: Springer.
  • Bergmeir, C., Hyndman, R. J., and Koo, B. (2018), “A Note on the Validity of Cross-Validation for Evaluating Autoregressive Time Series Prediction,” Computational Statistics & Data Analysis, 120, 70–83. DOI: 10.1016/j.csda.2017.11.003.
  • Blum, J., and Eisenberg, B. (1975), “The Law of Large Numbers for Subsequences of a Stationary Process,” The Annals of Probability, 3, 281–288. DOI: 10.1214/aop/1176996398.
  • Blum, J., and Hanson, D.(1960), “On the Mean Ergodic Theorem for Subsequences,” Bulletin of the American Mathematical Society, 66, 308–311. DOI: 10.1090/S0002-9904-1960-10481-8.
  • Bradley, R. C. (1986), “Basic Properties of Strong Mixing Conditions,” in Dependence in Probability and Statistics: A Survey of Recent Results, eds. E. Eberlein and M. S. Taqqu, pp. 165–192, Boston: Birkhäuser.
  • Bradley, R. C. (2005), “Basic Properties of Strong Mixing Conditions. A Survey and Some Open Questions,” Probability Surveys, 2, 107–144. DOI: 10.1214/154957805100000104.
  • Cai, T. T., Han, X., and Pan, G. (2020), “Limiting Laws for Divergent Spiked Eigenvalues and Largest Nonspiked Eigenvalue of Sample Covariance Matrices,” The Annals of Statistics, 48, 1255–1280. DOI: 10.1214/18-AOS1798.
  • Carrasco, M., and Rossi, B. (2016), “In-Sample Inference and Forecasting in Misspecified Factor Models,” Journal of Business & Economic Statistics, 34, 313–338. DOI: 10.1080/07350015.2016.1186029.
  • Castle, J. L., Clements, M. P., and Hendry, D. F. (2013), “Forecasting by Factors, by Variables, by Both or Neither?,” Journal of Econometrics, 177, 305–319. DOI: 10.1016/j.jeconom.2013.04.015.
  • Cootes, T. F., Taylor, C. J., Cooper, D. H., and Graham, J. (1995). “Active Shape Models-Their Training and Application,” Computer Vision and Image Understanding, 61, 38–59. DOI: 10.1006/cviu.1995.1004.
  • Davis, R. A., and Mikosch, T. (2009). “Probabilistic Properties of Stochastic Volatility Models,” in Handbook of Financial Time Series, eds. T. G. Andersen, R. A. Davis, J. P. Kreiß, and T. V. Mikosch, pp. 255–267, Berlin: Springer.
  • De Mol, C., Giannone, D., and Reichlin, L. (2008), “Forecasting Using a Large Number of Predictors: Is Bayesian Shrinkage a Valid Alternative to Principal Components?,” Journal of Econometrics, 146, 318–328. DOI: 10.1016/j.jeconom.2008.08.011.
  • Dicker, L. H. (2016). “Ridge Regression and Asymptotic Minimax Estimation Over Spheres of Growing Dimension,” Bernoulli, 22, 1–37. DOI: 10.3150/14-BEJ609.
  • Dobriban, E., and Wager, S. (2018), “High-Dimensional Asymptotics of Prediction: Ridge Regression and Classification,” The Annals of Statistics, 46, 247–279. DOI: 10.1214/17-AOS1549.
  • Fan, J., Ke, Y., and Wang, K. (2020), “Factor-Adjusted Regularized Model Selection,” Journal of Econometrics, 216, 71–85. DOI: 10.1016/j.jeconom.2020.01.006.
  • Giannone, D., Lenza, M., and Primiceri, G. E. (2021), “Economic Predictions with Big Data: The Illusion of Sparsity,” Econometrica, 89, 2409–2437. DOI: 10.3982/ECTA17842.
  • Hastie, T., Tibshirani, R., and Friedman, J. H. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, New York: Springer.
  • Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. (2022), “Surprises in High-Dimensional Ridgeless Least Squares Interpolation,” The Annals of Statistics, 50, 949–986. DOI: 10.1214/21-aos2133.
  • Hayashi, F. (2000), Econometrics, Princeton: Princeton University Press.
  • Hallin, M., and Liska, R. (2007). “The Generalized Dynamic Factor Model: Determining the Number of Factors,” Journal of the American Statistical Association, 102, 603–617. DOI: 10.1198/016214506000001275.
  • Hoerl, A. E., and Kennard, R. W. (1970), “Ridge Regression: Biased Estimation for Nonorthogonal Problems,” Technometrics, 12, 55–67. DOI: 10.1080/00401706.1970.10488634.
  • James, G., Witten, D., Hastie, T., and Tibshirani, R. (2021), An Introduction to Statistical Learning: with Applications in R (2nd ed), New York: Springer.
  • Johnstone, I. M. (2001). “On the Distribution of the Largest Eigenvalue in Principal Components Analysis,” The Annals of Statistics, 29, 295–327. DOI: 10.1214/aos/1009210544.
  • Johnstone, I. M., and Lu, A. Y. (2009). “On Consistency and Sparsity for Principal Components Analysis in High Dimensions,” Journal of the American Statistical Association, 104, 682–693. DOI: 10.1198/jasa.2009.0121.
  • Johnstone, I. M., and Paul, D. (2018). “PCA in High Dimensions: An Orientation,” Proceedings of the IEEE, 106, 1277–1292. DOI: 10.1109/JPROC.2018.2846730.
  • Jolliffe, I. T. (1982). “A Note on the Use of Principal Components in Regression,” Journal of the Royal Statistical Society, Series C, 31, 300–303. DOI: 10.2307/2348005.
  • Jung, S., and Marron, J. S. (2009). “PCA Consistency in High Dimension, Low Sample Size Context,” The Annals of Statistics, 37, 4104–4130. DOI: 10.1214/09-AOS709.
  • Kneip, A., and Sarda, P. (2011), “Factor Models and Variable Selection in High-Dimensional Regression Analysis,” The Annals of Statistics, 39, 2410–2447. DOI: 10.1214/11-AOS905.
  • Liu, S., and Dobriban, E. (2020), “Ridge Regression: Structure, Cross-Validation, and Sketching,” in International Conference on Learning Representations.
  • Marčenko, V. A., and Pastur, L. A. (1967). “Distribution of Eigenvalues for Some Sets of Random Matrices,” Mathematics of the USSR-Sbornik, 1, 457–483. DOI: 10.1070/SM1967v001n04ABEH001994.
  • McCracken, M. W., and Ng, S. (2016), “FRED-MD: A Monthly Database for Macroeconomic Research,” Journal of Business & Economic Statistics, 34, 574–589. DOI: 10.1080/07350015.2015.1086655.
  • Mitchell, T. J., and Beauchamp, J. J. (1988), “Bayesian Variable Selection in Linear Regression,” Journal of the American Statistical Association, 83, 1023–1032. DOI: 10.1080/01621459.1988.10478694.
  • Ng, S. (2013), “Variable Selection in Predictive Regressions,” In Handbook of Economic Forecasting (Vol. 2), eds. G. Elliott and A. Timmermann, pp. 752–789, Amsterdam: Elsevier.
  • Onatski, A. (2009). “Testing Hypotheses About the Number of Factors in Large Factor Models,” Econometrica, 77, 1447–1479.
  • Onatski, A. (2012). “Asymptotics of the Principal Components Estimator of Large Factor Models With Weakly Influential Factors,” Journal of Econometrics, 168, 244–258.
  • Onatski, A., and Wang, C. (2021), “Spurious Factor Analysis,” Econometrica, 89, 591–614. DOI: 10.3982/ECTA16703.
  • Paul, D. (2007). “Asymptotics of Sample Eigenstructure for a Large Dimensional Spiked Covariance Model,” Statistica Sinica, 17, 1617–1642.
  • Paul, D., and Silverstein, J. W. (2009). “No Eigenvalues Outside the Support of the Limiting Empirical Spectral Distribution of a Separable Covariance Matrix,” Journal of Multivariate Analysis, 100, 37–57. DOI: 10.1016/j.jmva.2008.03.010.
  • Preisendorfer, R. W. (1988). Principal Component Analysis in Meteorology and Oceanography, Amsterdam: Elsevier.
  • Stock, J. H., and Watson, M. W. (2002), “Forecasting Using Principal Components from a Large Number of Predictors,” Journal of the American Statistical Association, 97, 1167–1179. DOI: 10.1198/016214502388618960.
  • Stock, J. H., and Watson, M. W. (2012), “Generalized Shrinkage Methods for Forecasting Using Many Predictors,” Journal of Business & Economic Statistics, 30, 481–493.
  • Tikhonov, A. N. (1963a), “Regularization of Incorrectly Posed Problems,” Soviet Mathematics Doklady, 4, 1624–1627 (English translation).
  • Tikhonov, A. N. (1963b), “Solution of Incorrectly Formulated Problems and the Regularization Method,” Soviet Mathematics Doklady, 4, 1035–1038 (English translation).