Abstract
This article proposes sparse and easy-to-interpret proximate factors to approximate statistical latent factors. Latent factors in a large-dimensional factor model can be estimated by principal component analysis (PCA), but they are usually hard to interpret. We obtain proximate factors that are easier to interpret by shrinking the PCA factor weights, setting all but the largest ones in absolute value to zero. We show that proximate factors constructed with only 5%–10% of the data are usually sufficient to almost perfectly replicate the population and PCA factors, without actually assuming a sparse structure in the weights or loadings. Using extreme value theory, we explain why sparse proximate factors can be substitutes for non-sparse PCA factors. We derive analytical asymptotic bounds for the correlation of appropriately rotated proximate factors with the population factors. These bounds provide guidance on how to construct the proximate factors. In simulations and empirical analyses of financial portfolio and macroeconomic data, we illustrate that sparse proximate factors are close substitutes for PCA factors, with average correlations of around 97.5%, while being interpretable.
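As a rough illustration of this construction (estimate PCA factor weights, keep only the m largest entries in absolute value, and set the rest to zero), the following Python sketch may be helpful. It is illustrative only, not the authors' implementation; the function and variable names are our own, and the article's second-stage refinements are omitted.

```python
import numpy as np

def proximate_factors(X, K, m):
    """Sketch: estimate K PCA factors from a T x N panel X, then sparsify
    each factor weight vector by keeping only its m largest absolute
    entries (hypothetical helper, not the authors' code)."""
    T, N = X.shape
    # PCA factor weights: eigenvectors of the N x N second moment matrix
    S = X.T @ X / T
    _, eigvec = np.linalg.eigh(S)
    W = eigvec[:, ::-1][:, :K]           # weights for the K largest eigenvalues
    # Keep the m largest-absolute weights per factor, zero out the rest
    W_sparse = np.zeros_like(W)
    for k in range(K):
        keep = np.argsort(np.abs(W[:, k]))[-m:]
        W_sparse[keep, k] = W[keep, k]
    F_hat = X @ W_sparse                 # sparse proximate factors
    return F_hat, W_sparse
```

In simulated data with strong factors, the proximate factors from `W_sparse` remain highly correlated with the full PCA factors even when m is a small fraction of N, which mirrors the abstract's 5%–10% claim.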
Acknowledgments
We thank Bryan Kelly, Kay Giesecke, Serena Ng, José Luis Montiel Olea, Dacheng Xiu and seminar and conference participants at Stanford University, Columbia University, NBER-NSF Time-Series Conference, SoFiE Meeting, California Econometrics Conference, SoFiE Summer School on Machine Learning and Empirical Asset Pricing and INFORMS for helpful comments.
Supplementary Materials
The supplementary material contains all proofs and additional simulation and empirical results.
Notes
1 The application of latent factor models goes beyond the applications listed above, e.g. inferring missing values in matrices (Xiong and Pelger Citation2020; Candès and Tao Citation2010; Candès et al. Citation2011).
2 A common method to interpret low-dimensional factor models is to find a rotation of the common factors with a meaningful interpretation. This approach uses the insight that factor models are only identified up to an invertible transformation and represent the same model after an appropriate rotation. The criterion proposed by Kaiser (Citation1958) is a popular way to select factors whose factor weights have groups of large and negligible coefficients. However, in large-dimensional factor models with non-sparse factor weight structure, finding a “good” rotation becomes considerably more challenging. It is generally easier with our sparse proximate factors to find a rotation that has a meaningful interpretation.
3 Generalized correlation (also called canonical correlation) measures how close two vector spaces are. It has been studied by Anderson (Citation1962) and applied in large-dimensional factor models (Bai and Ng Citation2006; Andreou et al. Citation2019; Pelger Citation2019; Pelger and Xiong Citation2021).
4 In simulations, we verify that the lower bound has good finite sample properties.
5 The degree of sparsity can also be chosen to obtain a sufficiently high generalized correlation with the estimated PCA factors or by cross-validation arguments to explain a sufficient amount of variation in the data. Our theoretical bound provides an alternative to select the tuning parameter based on arguments that should also hold out-of-sample.
6 We impose assumptions similar to Bai and Ng (Citation2002).
7 Choi, Oehlert, and Zou (Citation2010), Lan et al. (Citation2014), and Kawano et al. (Citation2015) estimated the sparse loadings by minimizing the sum of the negative log-likelihood of the data with a soft thresholding term.
8 Another method to increase the understandability and interpretability of factors is to associate latent factors or factor loadings with observed variables. Some latent factors can be approximated well by observed economic factors, such as Fama–French factors for equity data (Fama and French Citation1992) or level, slope, and curvature factors for bond data (Diebold and Li Citation2006). Fan, Ke, and Liao (Citation2016) proposed robust factor models to exploit the explanatory power of observed proxies on latent factors. Another approach is to model how the factor loadings relate to observable variables. Connor and Linton (Citation2007), Connor, Hagmann, and Linton (Citation2012), and Fan, Liao, and Wang (Citation2016) at least partially employ subject-specific covariates to explain factor loadings, such as market capitalization, price–earning ratios, and other firm characteristics. However, in order to explain latent factors by observed variables, it is necessary to include all the relevant variables, some of which might not be known. Our sparse proximate factors can provide discipline on which assets and covariates to focus on.
9 We assume that we have consistently estimated the number of factors K.
10 We refer to the second moment matrix as a sample covariance matrix because in our applications we first demean the data. However, our theoretical and empirical results are based only on the spectral decomposition of a second moment matrix and hold for general nonzero means. Lettau and Pelger (Citation2020a) provide guidance on when to demean the data.
11 Formally, denote $W_j \in \{0,1\}^N$ as a mask vector indicating which factor weights are set to zero. $W_j$ has m 1s and N – m 0s. The sparse factor weights can be written as
(5) $\tilde{\Lambda}_j = W_j \circ \hat{\Lambda}_j,$
where $\hat{\Lambda}_j$ is the jth estimated loading. The vector $W_j$ has the element 1 at the positions of the m largest absolute loadings of $\hat{\Lambda}_j$ and zero otherwise. $\circ$ denotes the Hadamard product for element-by-element multiplication of matrices.
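The mask-and-Hadamard-product construction in this footnote can be written compactly in code. The following Python sketch is illustrative only; the function name `sparse_weights` is our own.

```python
import numpy as np

def sparse_weights(lambda_hat_j, m):
    """Footnote 11 as code (a sketch): build the 0/1 mask with ones at the
    positions of the m largest absolute entries of the jth estimated
    loading, then take the Hadamard (element-wise) product."""
    mask = np.zeros_like(lambda_hat_j)
    mask[np.argsort(np.abs(lambda_hat_j))[-m:]] = 1.0
    return mask * lambda_hat_j  # element-wise product: mask o lambda_hat_j
```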
12 The second equation follows from .
13 Assumption 1 about the population factors is the same as in Bai and Ng (Citation2002). Assumption 2 allows loadings to be random. Since the loadings are independent of factors and errors, all results in Bai and Ng (Citation2002) hold. Assumption 3.1 imposes moment conditions for errors, which is the same as Assumption C.1 in Bai and Ng (Citation2002). This assumption implies that is bounded. Assumption 3.2 is close to Assumption 3.2 (i) and (ii) in Fan, Liao, and Mincheva (Citation2013). This assumption restricts the cross-sectional dependence of errors and is standard in the literature on approximate factor models. Since is symmetric, is equivalent to . implies that . Together with being the same for all t from the stationarity of et, this assumption implies Assumption C.3 in Bai and Ng (Citation2002). Assumption 3.3 allows for weak time-series dependence of errors, which is slightly stronger than Assumption C.2 in Bai and Ng (Citation2002). Assumption 3.6 is the time-average counterpart of Assumption C.5 in Bai and Ng (Citation2002). Assumption 4 implies Assumption D in Bai and Ng (Citation2002). The fourth-moment conditions in Assumptions 3.6 and 4, together with Boole’s inequality (the union bound), are used to show the uniform convergence of the loadings without assuming the boundedness of the loadings. Since Assumptions 1–4 imply Assumptions A–D in Bai and Ng (Citation2002), all results in Bai and Ng (Citation2002) hold.
14 A specific estimator like PCA selects a specific rotation matrix H. Bai and Ng (Citation2002) and Bai (Citation2003) showed that the PCA loadings are consistent estimators of the rotated loadings . The rotation matrix H is defined as based on the spectral decomposition , that is, are the eigenvalues of in decreasing order and are the corresponding eigenvectors. If the population factor model has the same rotation as the PCA estimates, that is, is diagonal and then H simplifies to an identity matrix. In this case, the generalized correlation simplifies to the sum of squared correlations between the vectors of estimated and population loadings.
15 The generalized correlation is also known as canonical correlation. Our measure is based on the Euclidean inner product without demeaning the loadings, which is appropriate for measuring the span of the loading matrix. Our generalized correlation measure is the sum of the squared individual generalized correlations. The first individual generalized correlation is the highest correlation that can be achieved through a linear combination of the proximate loadings and the population loadings. For the second generalized correlation, we first project out the subspace that spans the linear combination for the first generalized correlation and then determine the highest possible correlation that can be achieved through linear combinations of the remaining K – 1 dimensional subspaces. This procedure continues until we have calculated the K individual generalized correlations. Mathematically, the individual generalized correlations are the square root of the eigenvalues of the matrix .
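The measure in footnote 15 can be computed directly. In the illustrative Python sketch below, we assume the matrix whose eigenvalues are taken is $(\Lambda_1^{\top}\Lambda_1)^{-1}\Lambda_1^{\top}\Lambda_2(\Lambda_2^{\top}\Lambda_2)^{-1}\Lambda_2^{\top}\Lambda_1$; this form is reconstructed from the verbal description, and the function name is our own.

```python
import numpy as np

def generalized_correlation(L1, L2):
    """Sketch: individual generalized (canonical) correlations between the
    column spans of two N x K loading matrices, and their sum of squares
    (the trace of the assumed matrix product)."""
    A = np.linalg.solve(L1.T @ L1, L1.T @ L2)   # (L1'L1)^{-1} L1'L2
    B = np.linalg.solve(L2.T @ L2, L2.T @ L1)   # (L2'L2)^{-1} L2'L1
    M = A @ B
    # individual generalized correlations: square roots of eigenvalues of M
    rho = np.sqrt(np.clip(np.linalg.eigvals(M).real, 0.0, 1.0))
    return rho, np.trace(M).real                # total measure = sum of rho^2
```

If `L2` spans the same space as `L1` (e.g., `L2 = L1 @ R` for invertible `R`), all individual generalized correlations equal one and the total measure equals K.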
16 Suitable assumptions include either independent entries with uniformly bounded variance and fourth moment (Vershynin Citation2010, theor. 5.37), Sub-Gaussian rows (Vershynin Citation2010, theor. 5.39), or Sub-Gaussian columns (Vershynin Citation2010, theor. 5.58).
17 Instead of applying PCA, latent factors can also be estimated by an iterative procedure: for a set of candidate factors, a first-stage set of loadings is estimated with time-series regressions, which is then used in a second stage to obtain factors from cross-sectional regressions. This procedure is iterated until convergence. For example, Bai and Ng (Citation2019) used a variation of this approach. Our result shows that in an approximate factor model one step is sufficient to obtain a consistent estimator of the loadings.
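The iteration described in this note can be sketched as alternating least squares. The following Python is an illustrative sketch under our own naming, not Bai and Ng's exact algorithm.

```python
import numpy as np

def als_factors(X, K, n_iter=50):
    """Sketch of the iterative procedure: starting from candidate factors,
    alternate a time-series regression for the loadings and a
    cross-sectional regression for the factors."""
    T, N = X.shape
    rng = np.random.default_rng(0)
    F = rng.standard_normal((T, K))                 # candidate starting factors
    for _ in range(n_iter):
        # loadings from time-series regressions of each series on F
        L = np.linalg.lstsq(F, X, rcond=None)[0].T      # N x K
        # factors from cross-sectional regressions of each period on L
        F = np.linalg.lstsq(L, X.T, rcond=None)[0].T    # T x K
    return F, L
```

In low-noise simulated factor models, the fitted common component `F @ L.T` closes in on the best rank-K approximation of `X` within a few iterations.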
18 Lemma A.1 is adapted from Hsing (Citation1988), theor. 3.3. The underlying sequence is indexed by the cross-sectional units. We assume that its elements are exchangeable and can be properly reshuffled to satisfy the strong mixing condition.
19 The individual generalized correlations are the square root of the eigenvalues of the matrix .
20 Note that the weighted PCA loadings need to be multiplied by to be consistent estimates for the population loadings up to the usual rotation matrix.
21 The efficiency argument requires that the time and cross-sectional dependency of the residuals can be separated. This would be satisfied if the residuals are generated as $e = A \epsilon B$, where the N × T matrix $\epsilon$ has iid entries, the weak cross-sectional correlation is captured by the N × N matrix $A$, and the T × T matrix $B$ captures the time-series correlation.
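Under this separable structure, residuals can be simulated as a product of three matrices. In the hypothetical sketch below we call them `A` (N × N, cross-sectional correlation), `E` (N × T, iid entries), and `B` (T × T, time-series correlation); these names are our own.

```python
import numpy as np

def separable_residuals(A, B, rng):
    """Sketch of the separable residual structure: e = A @ E @ B, where E
    is N x T with iid standard normal entries, A induces the weak
    cross-sectional correlation, and B the time-series correlation."""
    N, T = A.shape[0], B.shape[0]
    E = rng.standard_normal((N, T))
    return A @ E @ B
```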
22 There are various shrinkage estimators for estimating a large-dimensional residual covariance matrix, for example, the hard-thresholding estimator proposed in Fan, Liao, and Mincheva (Citation2013).
23 A generalization is to add an $\ell_2$ penalty term, which leads to an Elastic Net penalty (Zou and Hastie Citation2005).
24 The number of nonzero elements in all loadings for the Lasso estimator is monotonically decreasing in the penalty parameter α. Hence, under certain conditions, there is a one-to-one mapping between the level of sparsity of the full loading matrix and the penalty weight. However, except for special cases, α cannot control the sparsity of a specific loading vector. Using different penalties αj for different loading vectors allows for more control over the sparsity level of a specific loading vector, but it is not straightforward, and not always possible, to select any desired sparsity level.
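The monotonicity of sparsity in the penalty can be illustrated with the Lasso's soft-thresholding operator. This stylized Python example uses our own names and is not the article's estimator: as the penalty level grows, the number of nonzero entries in a thresholded loading matrix can only stay the same or fall.

```python
import numpy as np

def soft_threshold(L, alpha):
    """Soft-thresholding at penalty level alpha (the Lasso shrinkage
    step): entries with |L_ij| <= alpha are set to zero, the rest are
    shrunk toward zero by alpha."""
    return np.sign(L) * np.maximum(np.abs(L) - alpha, 0.0)
```

Sweeping `alpha` over an increasing grid and counting nonzeros in `soft_threshold(L, alpha)` produces a non-increasing sparsity path, mirroring the footnote's observation that α controls the overall, but not the per-vector, level of sparsity.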
25 We also simulate data with time-series dependence or with cross-sectional dependence in the errors. The results are presented in Section IA.D in the appendix (supplementary material).
26 The loadings for SPCA on the test dataset are the same as on the training dataset. However, the loadings of PPCA and PPCA (wt) are different on the training and test data as they are estimated in a second stage regression.
27 We thank Kozak, Nagel, and Santosh (Citation2020) for allowing us to use their data.
28 Kozak, Nagel, and Santosh (Citation2020) used a set of 50 anomaly characteristics. We use 37 of those characteristics with the longest available cross section. The same data are studied in Lettau and Pelger (Citation2020b).
29 As many investors trade on factors, trading on proximate factors can also significantly reduce trading costs.
30 We use the residuals from a 5-factor model to estimate the standard deviation of the errors. The results are robust to this choice.
31 We only report the out-of-sample results in the main text. The Internet Appendix collects the in-sample results with very similar findings. An alternative approach to implement sparse PCA is to directly impose the cardinality constraint $\|\Lambda_j\|_0 \le m$ rather than using the penalty term in the optimization problem (18), where $\|\Lambda_j\|_0$ equals the number of nonzero elements in $\Lambda_j$. Sigg and Buhmann (Citation2008) developed an expectation-maximization (EM) algorithm based on a probabilistic expression of PCA to solve this optimization problem with the constraint $\|\Lambda_j\|_0 \le m$. This approach will in general return sparse loadings with m nonzero elements in each sparse loading vector when m is reasonably small. The appendix (supplemental material) collects the results for this alternative implementation, with findings very similar to those in the main text.
32 Note that the notion of weak and strong factors here refers to the variation in the variance explained in the data, and this notion differs from that in Chudik, Pesaran, and Tosetti (Citation2011).
33 If we are concerned that the factor loadings of returns vary over time, it is possible to allow the loadings to be functions of observable time-varying state processes. An example of this approach would be a factor model with regime changes, for example, loadings of the form $\Lambda(z_t) = \Lambda_1\, 1\{z_t \le \theta\} + \Lambda_2\, 1\{z_t > \theta\}$, where $z_t$ is an observed variable, θ is the unknown threshold value, and $1\{\cdot\}$ is an indicator function (Massacci Citation2017, Citation2020). Another approach is to model the time-varying loadings as nonparametric functions of an observed time-varying state variable (Pelger and Xiong Citation2021). We could combine our method with Massacci (Citation2017, Citation2020) or Pelger and Xiong (Citation2021) to estimate sparse proximate factors with regime changes. However, the theoretical guarantees are beyond the scope of this article and left for future work.
34 This dataset is updated in a timely manner and is available at https://research.stlouisfed.org/econ/mccracken/fred-databases/
35 Bai (Citation2003) showed in their Proposition 2 that . Fan, Liao, and Mincheva (Citation2013) show in their Theorem 4 that under the additional assumption of bounded loadings and . Our results relax their assumption of bounded loadings which comes at the cost of a slower rate than in Fan, Liao, and Mincheva (Citation2013).