
Variational Inference for Large Bayesian Vector Autoregressions


Abstract

We propose a novel variational Bayes approach to estimate high-dimensional Vector Autoregressive (VAR) models with hierarchical shrinkage priors. Our approach does not rely on a conventional structural representation of the parameter space for posterior inference. Instead, we elicit hierarchical shrinkage priors directly on the matrix of regression coefficients so that (a) the prior structure maps into posterior inference on the reduced-form transition matrix and (b) posterior estimates are more robust to variable permutations. An extensive simulation study provides evidence that our approach compares favorably against existing linear and nonlinear Markov chain Monte Carlo and variational Bayes methods. We investigate the statistical and economic value of the forecasts from our variational inference approach for a mean-variance investor allocating her wealth to different industry portfolios. The results show that more accurate estimates translate into substantial out-of-sample gains across hierarchical shrinkage priors and model dimensions.

1 Introduction

Hierarchical shrinkage priors have been shown to represent an effective regularization technique when estimating large Vector Autoregressive (VAR) models. The use of these priors often relies on a Cholesky decomposition of the residual covariance matrix, so that a large system of equations is reduced to a sequence of univariate regressions. This allows for more efficient computations, as priors can be elicited on the structural VAR representation implied by the Cholesky factorization and posterior inference can be carried out equation by equation.

Such a conventional approach has two important implications for posterior inference. First, priors are not order-invariant, meaning that posterior inference is sensitive to permutations of the endogenous variables for a given prior specification. This is particularly relevant in high dimensions, where a logical ordering of the variables of interest might be unclear, or a full search among all possible ordering combinations might be unfeasible (see, e.g., Chan, Koop, and Yu 2023). Second, imposing a shrinkage prior on the structural VAR formulation might not help to pin down the reduced-form VAR parameters. That is, priors are not translation-invariant. This is especially relevant in forecasting applications, where the main objective is to accurately identify predictive relationships across variables rather than to identify structural shocks.

In this article, we take a different approach toward posterior inference with hierarchical shrinkage priors in large VAR models. Specifically, we propose a novel variational Bayes estimation approach that allows for fast and accurate estimation of the reduced-form VAR without leveraging a conventional structural VAR representation. This allows us to elicit shrinkage priors directly on the matrix of regression coefficients so that (a) the prior structure directly maps into the posterior inference on the reduced-form transition matrix and (b) posterior estimates are less sensitive to variable permutation. We also account for the effect of "exogenous" predictors and stochastic volatility in the residuals.

The key idea is that by abstracting from the linearity constraints implied by a structural VAR formulation, one can provide a direct identification of the reduced-form VAR parameters. This could have important implications for forecasting, especially in large-scale models where the set of regression coefficients may be sparse (see, e.g., Bernardi, Bianchi, and Bianco 2023). Our approach is computationally more efficient than comparable Markov chain Monte Carlo (MCMC) methods while maintaining high accuracy in posterior estimates.

We investigate the estimation accuracy using an extensive simulation study for different model dimensions and variable permutations. As benchmarks, we consider a variety of established estimation approaches developed for large Bayesian VAR models, such as the linearized MCMC proposed by Chan and Eisenstat (2018) and Cross, Hou, and Poon (2020) and its variational Bayes counterpart proposed by Chan and Yu (2022) and Gefang, Koop, and Poon (2023). Both approaches are built upon a structural VAR formulation. In addition, we compare our variational inference method against the MCMC algorithm developed by Gruber and Kastner (2022), which, similar to our approach, is not constrained by a Cholesky factorization for parameter identification. We test each estimation method under different hierarchical priors, such as the adaptive-Lasso of Leng, Tran, and Nott (2014), an adaptive version of the Normal-Gamma of Griffin and Brown (2010), and the Horseshoe of Carvalho, Polson, and Scott (2010).

Overall, the simulation results show that our variational Bayes algorithm represents the best tradeoff between estimation accuracy and computational efficiency. Specifically, posterior inference from our variational Bayes method is as accurate as nonlinear MCMC methods (see, e.g., Gruber and Kastner 2022) but is considerably more efficient. At the same time, our approach is as efficient as conventional MCMC and variational Bayes methods based on a structural VAR formulation but is significantly more accurate and less sensitive to variable permutation.

Our approach toward posterior inference in large VARs is guided by the principle that more accurate identification of the reduced-form transition matrix should ultimately lead to better out-of-sample forecasts and financial decision-making. To test this assumption, we investigate the statistical and economic value of the forecasts from our model in the context of a mean-variance investor who allocates her wealth between an industry portfolio and a risk-free asset based on lagged cross-industry returns and macroeconomic predictors.

Although the model is general and can be applied to any type of financial returns, as long as the data are stationary, our focus on different industry portfolios is motivated by keen interest from researchers (see, e.g., Fama and French 1997; Hou and Robinson 2006) and practitioners alike. Indeed, the implications of industry returns predictability are arguably far from trivial. If all industries are unpredictable, the market return, a weighted average of the industry portfolios, should also be unpredictable. As a result, the abundant evidence of aggregate market return predictability (see, e.g., Rapach and Zhou 2013) implies that at least some industry portfolio returns should be predictable.

The main results show that our variational inference approach fares better than competing methods in terms of out-of-sample point and density forecasts. More accurate forecasts translate into larger economic gains, as measured by certainty equivalent return spreads vis-à-vis a naive investor who makes investment decisions based on the sample mean and variance of the returns. This holds across different hierarchical prior specifications. Overall, the empirical results support our view that by accurately identifying weak correlations between predictors and portfolio returns, one can significantly improve—statistically and economically—the out-of-sample performance of large-scale VAR models.

Our article connects to a growing literature exploring the use of Bayesian methods to estimate high-dimensional VAR models, such as Chan and Eisenstat (2018), Carriero, Clark, and Marcellino (2019), Huber and Feldkircher (2019), Chan and Yu (2022), Cross, Hou, and Poon (2020), Kastner and Huber (2020), Chan, Koop, and Yu (2023), Chan (2021), Gruber and Kastner (2022), and Gefang, Koop, and Poon (2023), among others. We contribute to this literature by providing a fast and accurate variational inference method that generalizes posterior inference with hierarchical shrinkage priors by abstracting from a conventional structural VAR representation.

A second strand of literature we contribute to is related to the predictability of stock returns. Specifically, we contribute to the ongoing struggle to capture the dynamics of risk premiums by looking at industry-based portfolios. Early exceptions are Ferson and Harvey (1991), Ferson and Korajczyk (1995), Ferson and Harvey (1999), and Avramov (2004). As highlighted by Lewellen, Nagel, and Shanken (2010), the sample variation of industry portfolios is particularly elusive to model, since conventional risk factors do not seem to capture significant comovements. We contribute to this literature by investigating the out-of-sample predictability of industry portfolios through the lens of a novel estimation method for large Bayesian VAR models.

2 The Choice of Model Parameterization

Let $\mathbf{y}_t=(y_{1,t},\ldots,y_{d,t})^\top\in\mathbb{R}^d$ be a multivariate normal random variable and denote by $\mathbf{x}_t=(1,x_{1,t},\ldots,x_{p,t})^\top\in\mathbb{R}^{p+1}$ a vector of covariates at time $t$. A vector autoregressive model with exogenous covariates and stochastic volatility is defined in compact form as
$$\mathbf{y}_t=\Theta\mathbf{z}_{t-1}+\mathbf{u}_t,\qquad \mathbf{u}_t\sim\mathcal{N}_d(\mathbf{0}_d,\Omega_t^{-1}),\qquad t=1,\ldots,T, \tag{1}$$
with $\mathbf{z}_{t-1}=(\mathbf{y}_{t-1}^\top,\mathbf{x}_{t-1}^\top)^\top$ and $\Theta=(\Phi,\Gamma)$ consistently partitioned, where $\Phi\in\mathbb{R}^{d\times d}$ is the transition matrix containing the autoregressive coefficients and $\Gamma\in\mathbb{R}^{d\times(p+1)}$ is the matrix of regression parameters for the exogenous predictors. Here, $\mathbf{u}_t\in\mathbb{R}^d$ is a sequence of uncorrelated innovation terms, $\mathbf{u}_{t_k}\perp\mathbf{u}_{t_j}$ for all $t_k\neq t_j$, and $\Omega_t\in\mathcal{S}_{++}^d$ is a symmetric, positive-definite, time-varying precision matrix. A modified Cholesky factorization of $\Omega_t$ can be conveniently exploited to rewrite the model with orthogonal innovations (see, e.g., Rothman, Levina, and Zhu 2010).

Let $\Omega_t=L^\top V_t L$, where $L\in\mathbb{R}^{d\times d}$ is unit-lower-triangular and $V_t\in\mathcal{S}_{++}^d$ is diagonal with time-varying elements $V_t=\mathrm{Diag}(\nu_{1,t},\ldots,\nu_{d,t})$. By multiplying both sides of (1) by $L=I_d-B$, one obtains two alternative re-parameterizations of the same model:
$$\mathbf{y}_t=B(\mathbf{y}_t-\Theta\mathbf{z}_{t-1})+\Theta\mathbf{z}_{t-1}+\boldsymbol{\varepsilon}_t,\qquad \boldsymbol{\varepsilon}_t\sim\mathcal{N}_d(\mathbf{0}_d,V_t^{-1}), \tag{2a}$$
$$\mathbf{y}_t=B\mathbf{y}_t+A\mathbf{z}_{t-1}+\boldsymbol{\varepsilon}_t,\qquad \boldsymbol{\varepsilon}_t\sim\mathcal{N}_d(\mathbf{0}_d,V_t^{-1}), \tag{2b}$$
where $A=L\Theta$, and $B=I_d-L$ has a strict lower-triangular structure with elements $\beta_{j,k}=-l_{j,k}$ for $j=2,\ldots,d$ and $k=1,\ldots,j-1$. The key difference is that (2a) is nonlinear in the parameters, while (2b) is linear. The latter is the structural VAR representation widely used in existing MCMC and variational Bayes estimation methods for high-dimensional VAR models (see, e.g., Chan and Eisenstat 2018; Chan and Yu 2022; Gefang, Koop, and Poon 2023). Instead, (2a) is the reduced-form parameterization at the core of our variational inference approach. It has been used in the context of MCMC algorithms for smaller dimensions (see, e.g., Huber and Feldkircher 2019; Gruber and Kastner 2022).
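To make the mapping between (1), (2a), and (2b) concrete, the following minimal R sketch verifies numerically that both re-parameterizations reproduce the reduced form. All quantities (the dimensions, $L$, $\Theta$, and the shocks) are randomly generated toy objects for illustration, not part of the paper's replication code.

```r
set.seed(1)
d <- 4; p <- 2

# Unit-lower-triangular L and the implied B = I_d - L (strict lower-triangular)
L <- diag(d)
L[lower.tri(L)] <- rnorm(d * (d - 1) / 2, sd = 0.3)
B <- diag(d) - L

Theta <- matrix(rnorm(d * (d + p + 1), sd = 0.2), d, d + p + 1)
A     <- L %*% Theta                 # structural coefficients in (2b)

z   <- rnorm(d + p + 1)              # z_{t-1} = (y_{t-1}', x_{t-1}')'
eps <- rnorm(d)                      # orthogonal shocks eps_t

# Reduced form (1): y_t = Theta z_{t-1} + u_t, with u_t = L^{-1} eps_t
y <- drop(Theta %*% z + solve(L, eps))

# (2a) is nonlinear in (B, Theta); (2b) is linear in (B, A)
y2a <- drop(B %*% (y - Theta %*% z) + Theta %*% z + eps)
y2b <- drop(B %*% y + A %*% z + eps)
all.equal(y, y2a)   # TRUE
all.equal(y, y2b)   # TRUE
```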

From (2a)-(2b) one can obtain two alternative equation-by-equation representations in which the $j$th component of $\mathbf{y}_t$ is defined as
$$y_{j,t}=\boldsymbol{\beta}_j\mathbf{r}_{j,t}+\boldsymbol{\vartheta}_j\mathbf{z}_{t-1}+\varepsilon_{j,t},\qquad \varepsilon_{j,t}\sim\mathcal{N}(0,\nu_{j,t}^{-1}), \tag{3a}$$
$$y_{j,t}=\boldsymbol{\beta}_j\mathbf{y}_t^{j}+\mathbf{a}_j\mathbf{z}_{t-1}+\varepsilon_{j,t},\qquad \varepsilon_{j,t}\sim\mathcal{N}(0,\nu_{j,t}^{-1}), \tag{3b}$$
for all $j=1,\ldots,d$ and $t=1,\ldots,T$, where $\boldsymbol{\beta}_j\in\mathbb{R}^{j-1}$ is a row vector containing the non-null elements in the $j$th row of $B$, and $\boldsymbol{\vartheta}_j,\mathbf{a}_j$ denote the $j$th row of $\Theta$ and $A$, respectively. For any $j=1,\ldots,d$, let $\mathbf{r}_{j,t}=\mathbf{y}_t^{j}-\Theta^{j}\mathbf{z}_{t-1}$ denote the vector of residuals up to the $(j-1)$th regression, with $\mathbf{y}_t^{j}=(y_{1,t},\ldots,y_{j-1,t})^\top\in\mathbb{R}^{j-1}$ and $\Theta^{j}\in\mathbb{R}^{(j-1)\times(d+p+1)}$ the sub-matrix containing the first $j-1$ rows of $\Theta$. We follow Gefang, Koop, and Poon (2023) and Chan and Yu (2022) and model the time variation in $\nu_{j,t}^{-1}=\exp(h_{j,t})$ assuming a log-volatility process $h_{j,t}=h_{j,t-1}+e_{j,t}$ with $e_{j,t}\sim\mathcal{N}(0,\psi_j)$, where the initial state $h_{0,j}\sim\mathcal{N}(0,k_0\,\psi_j)$, $k_0\gg 0$, is unknown.

A discussion on variable permutation

Existing Bayesian approaches for large VAR models often rely on the structural representation in (2b), and therefore consider the elements in $A$ as the parameters of interest. This simplifies the implementation of MCMC (see, e.g., Chan and Eisenstat 2018) and variational Bayes algorithms (see, e.g., Gefang, Koop, and Poon 2023). Under the re-parameterization $A=L\Theta$, each element $\vartheta_{i,j}$, the $(i,j)$ entry of $\Theta$, is a linear combination $\vartheta_{i,j}=a_{i,j}+\sum_{k=1}^{i-1}c_{i,k}\,a_{k,j}$, where $a_{i,j}$ and $c_{i,j}$ denote the $(i,j)$ entries of $A$ and $L^{-1}$, respectively.

This raises two main issues. First, $a_{i,j}=0$ does not imply $\vartheta_{i,j}=0$; that is, a shrinkage prior on $A$ does not preserve the structure of $\Theta$. Second, the estimate $\hat{\Theta}=\hat{L}^{-1}\hat{A}$ for a given prior is sensitive to variable permutations due to its dependence on the Cholesky factorization (see, e.g., Chan, Koop, and Yu 2023). Figure 1 provides a visual representation of this issue by comparing the estimates obtained from a horseshoe prior on (2a) versus (2b), for two different permutations of $\mathbf{y}_t$.

Fig. 1 Comparison between the posterior inference obtained from $A=L\Theta$ (first row) and the original parameterization $\Theta$ (second row), for two different permutations of $\mathbf{y}_t$.


This simple example suggests that the estimates $\hat{\Theta}=\hat{L}^{-1}\hat{A}$ diverge from the true $\Theta$ and are sensitive to variable permutation. Instead, inference based on (2a) provides a more accurate identification of $\Theta$ that is less sensitive to variable permutation.
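The mechanism behind this sensitivity is easy to reproduce in a few lines of R. The toy sketch below (a hypothetical 5-variable system with arbitrary numbers) shows how a sparse $A$ maps into a non-sparse $\Theta=L^{-1}A$ once the rows of $A$ are mixed through the inverse Cholesky factor.

```r
set.seed(2)
d <- 5
L <- diag(d); L[lower.tri(L)] <- rnorm(d * (d - 1) / 2)

A <- matrix(0, d, d)
A[1, 1] <- 1; A[3, 2] <- -1     # only two nonzero structural coefficients

Theta <- solve(L) %*% A         # reduced-form coefficients Theta = L^{-1} A
sum(A != 0)                     # 2 nonzero entries in A
sum(abs(Theta) > 1e-10)         # 8 nonzero entries in Theta: sparsity is lost
```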

3 Variational Bayes Inference

A variational approach to Bayesian inference requires minimizing the Kullback-Leibler (KL) divergence between an approximating density $q(\xi)$ and the true posterior density $p(\xi\mid\mathbf{y})$, where $\xi$ denotes the set of parameters of interest. Ormerod and Wand (2010) show that minimizing the KL divergence can be equivalently stated as the maximization of the "evidence lower bound" (ELBO), denoted by $\underline{p}(\mathbf{y};q)$:
$$q^{*}(\xi)=\operatorname*{arg\,max}_{q(\xi)\in\mathcal{Q}}\;\log\underline{p}(\mathbf{y};q),\qquad \underline{p}(\mathbf{y};q)=\int q(\xi)\log\left\{\frac{p(\mathbf{y},\xi)}{q(\xi)}\right\}d\xi, \tag{4}$$
where $q^{*}(\xi)\in\mathcal{Q}$ represents the optimal variational density and $\mathcal{Q}$ is a space of density functions. Depending on the assumptions on $\mathcal{Q}$, one falls into different variational paradigms. For instance, given a partition of the parameter vector $\xi=\{\xi_1,\ldots,\xi_p\}$, a mean-field variational Bayes (MFVB) approach assumes a factorization of the form $q(\xi)=\prod_{j=1}^{p}q_j(\xi_j)$. A closed-form expression for each optimal variational density $q^{*}(\xi_j)$ is
$$q^{*}(\xi_j)\propto\exp\{\mathbb{E}_{q(\xi_{-j})}[\log p(\mathbf{y},\xi)]\},\qquad q(\xi_{-j})=\prod_{i=1,\,i\neq j}^{p}q_i(\xi_i), \tag{5}$$
where the expectation is taken with respect to the joint approximating density with the $j$th element of the partition removed, $q(\xi_{-j})$. This allows the implementation of an efficient iterative algorithm to estimate the optimal density $q^{*}(\xi)$. However, some components $q^{*}(\xi_j)$ may remain too complex to handle, and further restrictions are needed. If we assume that $q^{*}(\xi_j)$ belongs to a pre-specified parametric family of distributions, the MFVB outlined above is sometimes labeled as semiparametric (see Rohde and Wand 2016).
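To fix ideas on how the coordinate-ascent updates implied by (5) operate in practice, the following self-contained R sketch runs MFVB on the textbook Normal model with unknown mean and precision, rather than on the full VAR. The model, the factorization $q(\mu,\tau)=q(\mu)q(\tau)$, and the hyper-parameter values are standard textbook choices (our own illustrative assumptions, not quantities from the paper).

```r
# Data: x_i ~ N(mu, 1/tau); priors: mu | tau ~ N(mu0, 1/(lam0 * tau)),
# tau ~ Ga(a0, b0); mean-field factorization q(mu, tau) = q(mu) q(tau).
set.seed(3)
x <- rnorm(200, mean = 1.5, sd = 2)
n <- length(x); xbar <- mean(x)

mu0 <- 0; lam0 <- 0.01; a0 <- 0.01; b0 <- 0.01   # diffuse hyper-parameters
E_tau <- 1                                       # initialize E_q[tau]

for (iter in 1:50) {
  # Update q(mu) = N(mu_n, 1/lam_n): expectation of log p(x, mu, tau) over q(tau)
  mu_n  <- (lam0 * mu0 + n * xbar) / (lam0 + n)
  lam_n <- (lam0 + n) * E_tau
  # Update q(tau) = Ga(a_n, b_n): expectation over q(mu)
  a_n <- a0 + (n + 1) / 2
  b_n <- b0 + 0.5 * (lam0 * ((mu_n - mu0)^2 + 1 / lam_n) +
                       sum((x - mu_n)^2) + n / lam_n)
  E_tau <- a_n / b_n
}
c(E_mu = mu_n, E_sigma2 = b_n / (a_n - 1))   # close to the true (1.5, 4)
```

The full algorithm in the paper cycles over the blocks $q(\vartheta)$, $q(h)$, $q(\psi)$, and $q(\beta)$ in exactly the same fashion, monitoring the ELBO in (4) for convergence.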

3.1 Optimal Variational Densities

We present a factorization of the variational density $q(\xi)$ for the model in (2a). As a baseline, we consider a non-informative Normal prior for each entry of $\Theta$; that is, $\vartheta_{j,k}\sim\mathcal{N}(0,\upsilon)$ for $j=1,\ldots,d$ and $k=1,\ldots,d+p+1$. In addition, let $\psi_j\sim\mathrm{InvGa}(a_\psi,b_\psi)$ for $j=1,\ldots,d$, and $\beta_{j,k}\sim\mathcal{N}(0,\tau)$ for $j=2,\ldots,d$ and $k=1,\ldots,j-1$. Here, $\mathrm{InvGa}(\cdot,\cdot)$ denotes the Inverse-Gamma distribution, and $a_\psi>0$, $b_\psi>0$, $\tau\gg 0$, and $\upsilon\gg 0$ are the related hyper-parameters.

Let $\xi=(\vartheta,h,\psi,\beta)$ be the set of parameters of interest. The corresponding variational density can be factorized as $q(\xi)=q(\vartheta)\,q(h)\,q(\psi)\,q(\beta)$, where
$$q(\vartheta)=\prod_{j=1}^{d}q(\vartheta_j),\quad q(h)=\prod_{j=1}^{d}q(h_j),\quad q(\psi)=\prod_{j=1}^{d}q(\psi_j),\quad q(\beta)=\prod_{j=2}^{d}q(\beta_j). \tag{6}$$

For ease of exposition, we summarize in the main text the optimal variational density for the main parameters of interest $\Theta$ under the baseline non-informative prior and three alternative hierarchical shrinkage priors. The full derivations of the optimal variational densities $q^{*}(h_j)\sim\mathcal{N}_{T+1}(\mu_{q(h_j)},\Sigma_{q(h_j)})$, $q^{*}(\psi_j)\sim\mathrm{InvGa}(a_{q(\psi_j)},b_{q(\psi_j)})$, and $q^{*}(\beta_j)\sim\mathcal{N}_{j-1}(\mu_{q(\beta_j)},\Sigma_{q(\beta_j)})$ are reported in Appendix B as Propositions B.1.1, B.1.7, and B.1.4, respectively. We leave to Proposition B.1.3 in Appendix B the derivations for the constant volatility case with $\nu_{j,t}=\nu_j$ and $\nu_j\sim\mathrm{Ga}(a_\nu,b_\nu)$ for $j=1,\ldots,d$, where $\mathrm{Ga}(\cdot,\cdot)$ denotes the Gamma distribution and $a_\nu>0$, $b_\nu>0$. Appendix B also provides the analytical form of the lower bound for each set of parameters.

Proposition 3.1 provides the optimal variational density for the $j$th row of $\Theta$ under the Normal prior specification $\vartheta_{j,k}\sim\mathcal{N}(0,\upsilon)$. The proof and analytical derivations are reported in Appendix B.1.

Proposition 3.1.

The optimal variational density for $\vartheta_j$ is $q^{*}(\vartheta_j)\sim\mathcal{N}_{d+p+1}(\mu_{q(\vartheta_j)},\Sigma_{q(\vartheta_j)})$ with hyper-parameters:
$$\Sigma_{q(\vartheta_j)}=\Big(\sum_{t=1}^{T}\mu_{q(\omega_{j,j,t})}\,\mathbf{z}_{t-1}\mathbf{z}_{t-1}^\top+\tfrac{1}{\upsilon}I_{d+p+1}\Big)^{-1},$$
$$\mu_{q(\vartheta_j)}=\Sigma_{q(\vartheta_j)}\Big(\sum_{t=1}^{T}\big(\mu_{q(\omega_{j,t})}\otimes\mathbf{z}_{t-1}\big)\mathbf{y}_t-\sum_{t=1}^{T}\big(\mu_{q(\omega_{j,-j,t})}\otimes\mathbf{z}_{t-1}\mathbf{z}_{t-1}^\top\big)\mu_{q(\vartheta_{-j})}\Big), \tag{7}$$
where $\vartheta=(\vartheta_j,\vartheta_{-j})$ and $\omega_{j,t}$ denotes the $j$th row of $\Omega_t=\begin{pmatrix}\omega_{j,j,t}&\omega_{j,-j,t}\\ \omega_{-j,j,t}&\Omega_{-j,-j,t}\end{pmatrix}$.

Note that although the multivariate model is reduced to a sequence of univariate regressions, the analytical form of the variational mean $\mu_{q(\vartheta_j)}$ in Proposition 3.1 depends on all the other rows through $\mu_{q(\vartheta_{-j})}$. As a result, the estimate of $\vartheta_j$ explicitly depends on all of the other rows $\vartheta_{-j}$. This addresses the issue in the MCMC algorithm of Carriero, Clark, and Marcellino (2019), which was highlighted by Bognanni (2022) and corrected by Carriero et al. (2022).
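A direct, unoptimized R translation of the update in Proposition 3.1 might look as follows. The function name and input layout are our own illustrative choices: Z is the T x (d+p+1) matrix with rows z_{t-1}', Y is the T x d matrix of observations, E_Omega is a list of the variational means E_q[Omega_t], E_Theta is the current variational mean of Theta, and upsilon is the prior variance.

```r
update_theta_j <- function(j, Y, Z, E_Omega, E_Theta, upsilon) {
  q <- ncol(Z); Tn <- nrow(Z)
  P <- diag(q) / upsilon                    # prior precision (1/upsilon) I
  b <- numeric(q)
  for (t in 1:Tn) {
    om_j <- E_Omega[[t]][j, ]               # j-th row of E_q[Omega_t]
    P <- P + om_j[j] * tcrossprod(Z[t, ])   # add E[omega_jj,t] z z'
    # (E[omega_j,t] (x) z) y_t minus the cross-terms with the other rows
    r <- sum(om_j * Y[t, ]) - sum(om_j[-j] * (E_Theta[-j, ] %*% Z[t, ]))
    b <- b + r * Z[t, ]
  }
  Sigma_q <- solve(P)                       # Sigma_q(theta_j) in (7)
  list(mu = drop(Sigma_q %*% b), Sigma = Sigma_q)
}
```

In the full scheme, this update is cycled over $j=1,\ldots,d$ within each coordinate-ascent sweep, so that each row borrows the current variational means of the others.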

Bayesian adaptive-Lasso

The Bayesian adaptive-Lasso of Leng, Tran, and Nott (2014) extends the original work of Park and Casella (2008) by assuming a different shrinkage for each regression parameter based on a Laplace distribution with an individual scaling parameter, $\vartheta_{j,k}\mid\lambda_{j,k}\sim\mathrm{Lap}(\lambda_{j,k})$, for $j=1,\ldots,d$ and $k=1,\ldots,d+p+1$. The latter can be represented as a scale mixture of Normal distributions with an exponential mixing density, $\vartheta_{j,k}\mid\upsilon_{j,k}\sim\mathcal{N}(0,\upsilon_{j,k})$, $\upsilon_{j,k}\mid\lambda_{j,k}^2\sim\mathrm{Exp}(\lambda_{j,k}^2/2)$. The scaling parameters $\lambda_{j,k}^2$ are not fixed but are inferred from the data by assuming a common hyper-prior distribution $\lambda_{j,k}^2\sim\mathrm{Ga}(h_1,h_2)$, where $h_1,h_2>0$.

Let $\xi_L=(\xi,\upsilon,\lambda^2)$ be the vector $\xi$ augmented with the adaptive-Lasso prior parameters. The distribution $q(\xi_L)$ can be factorized as
$$q(\xi_L)=q(\xi)\,q(\upsilon,\lambda^2),\qquad q(\upsilon,\lambda^2)=\prod_{j=1}^{d}\prod_{k=1}^{d+p+1}q(\upsilon_{j,k})\,q(\lambda_{j,k}^2). \tag{8}$$

Proposition 3.2 provides the optimal variational density for the $j$th row of $\Theta$ under the Bayesian adaptive-Lasso prior specification $\vartheta_{j,k}\mid\upsilon_{j,k}\sim\mathcal{N}(0,\upsilon_{j,k})$, $\upsilon_{j,k}\mid\lambda_{j,k}^2\sim\mathrm{Exp}(\lambda_{j,k}^2/2)$, and $\lambda_{j,k}^2\sim\mathrm{Ga}(h_1,h_2)$. The proof and the analytical derivations are reported in Appendix B.2.

Proposition 3.2.

The optimal variational density for $\vartheta_j$ is $q^{*}(\vartheta_j)\sim\mathcal{N}_{d+p+1}(\mu_{q(\vartheta_j)},\Sigma_{q(\vartheta_j)})$ with $\Sigma_{q(\vartheta_j)}=\big(\sum_{t=1}^{T}\mu_{q(\omega_{j,j,t})}\mathbf{z}_{t-1}\mathbf{z}_{t-1}^\top+\mathrm{Diag}(\mu_{q(1/\upsilon_j)})\big)^{-1}$, where $\mathrm{Diag}(\mu_{q(1/\upsilon_j)})$ is a diagonal matrix with elements $\mu_{q(1/\upsilon_j)}=(\mu_{q(1/\upsilon_{j,1})},\mu_{q(1/\upsilon_{j,2})},\ldots,\mu_{q(1/\upsilon_{j,d+p+1})})$. The parameters $\mu_{q(\vartheta_j)}$ and $\mu_{q(\omega_{j,j,t})}$ are as in Proposition 3.1. The optimal variational densities of the scaling parameters are $q^{*}(\lambda_{j,k}^2)\sim\mathrm{Ga}(a_{q(\lambda_{j,k}^2)},b_{q(\lambda_{j,k}^2)})$ with $a_{q(\lambda_{j,k}^2)},b_{q(\lambda_{j,k}^2)}$ defined in (B.20), and $q^{*}(1/\upsilon_{j,k})\sim\mathrm{IG}(a_{q(1/\upsilon_{j,k})},b_{q(1/\upsilon_{j,k})})$ with $a_{q(1/\upsilon_{j,k})},b_{q(1/\upsilon_{j,k})}$ defined in (B.19).

Adaptive Normal-Gamma

We expand the original Normal-Gamma prior of Griffin and Brown (2010) by assuming that each regression coefficient has a different shrinkage parameter. The hierarchical specification requires that $\vartheta_{j,k}\mid\upsilon_{j,k}\sim\mathcal{N}(0,\upsilon_{j,k})$ and $\upsilon_{j,k}\mid\eta_j,\lambda_{j,k}\sim\mathrm{Ga}(\eta_j,\eta_j\lambda_{j,k}/2)$ for $j=1,\ldots,d$ and $k=1,\ldots,d+p+1$. Note that restricting $\eta_j=1$ recovers the adaptive-Lasso prior. Marginalizing over the variance $\upsilon_{j,k}$ leads to $p(\vartheta_{j,k}\mid\eta_j,\lambda_{j,k})$, which corresponds to a Variance-Gamma distribution. The hyper-parameters $\eta_j$ and $\lambda_{j,k}$ are not fixed but are inferred from the data by assuming two common hyper-priors, $\lambda_{j,k}\sim\mathrm{Ga}(h_1,h_2)$ and $\eta_j\sim\mathrm{Exp}(h_3)$, where $h_l>0$ for $l=1,2,3$.

Let $\xi_{NG}=(\xi,\upsilon,\lambda,\eta)$ be the vector $\xi$ augmented with the parameters of the adaptive Normal-Gamma prior. The joint distribution $q(\xi_{NG})$ can be factorized as
$$q(\xi_{NG})=q(\xi)\,q(\upsilon,\lambda,\eta),\qquad q(\upsilon,\lambda,\eta)=\prod_{j=1}^{d}q(\eta_j)\prod_{k=1}^{d+p+1}q(\upsilon_{j,k})\,q(\lambda_{j,k}). \tag{9}$$

Proposition 3.3 provides the optimal variational density for the $j$th row of $\Theta$ under an adaptive Normal-Gamma specification $\upsilon_{j,k}\mid\eta_j,\lambda_{j,k}\sim\mathrm{Ga}(\eta_j,\eta_j\lambda_{j,k}/2)$, $\lambda_{j,k}\sim\mathrm{Ga}(h_1,h_2)$, and $\eta_j\sim\mathrm{Exp}(h_3)$. The proof and analytical derivations are reported in Appendix B.3.

Proposition 3.3.

The optimal variational density for $\vartheta_j$ is $q^{*}(\vartheta_j)\sim\mathcal{N}_{d+p+1}(\mu_{q(\vartheta_j)},\Sigma_{q(\vartheta_j)})$ with $\Sigma_{q(\vartheta_j)}=\big(\sum_{t=1}^{T}\mu_{q(\omega_{j,j,t})}\mathbf{z}_{t-1}\mathbf{z}_{t-1}^\top+\mathrm{Diag}(\mu_{q(1/\upsilon_j)})\big)^{-1}$, where $\mathrm{Diag}(\mu_{q(1/\upsilon_j)})$ is a diagonal matrix with elements $\mu_{q(1/\upsilon_j)}=(\mu_{q(1/\upsilon_{j,1})},\mu_{q(1/\upsilon_{j,2})},\ldots,\mu_{q(1/\upsilon_{j,d+p+1})})$. The parameters $\mu_{q(\vartheta_j)}$ and $\mu_{q(\omega_{j,j,t})}$ are as in Proposition 3.1. The optimal variational densities of the scaling parameters are $q^{*}(\lambda_{j,k})\sim\mathrm{Ga}(a_{q(\lambda_{j,k})},b_{q(\lambda_{j,k})})$ with $a_{q(\lambda_{j,k})},b_{q(\lambda_{j,k})}$ defined in (B.24), and $q^{*}(\upsilon_{j,k})\sim\mathrm{GIG}(\zeta_{q(\upsilon_{j,k})},a_{q(\upsilon_{j,k})},b_{q(\upsilon_{j,k})})$, a generalized inverse Gaussian distribution with $\zeta_{q(\upsilon_{j,k})},a_{q(\upsilon_{j,k})},b_{q(\upsilon_{j,k})}$ defined in (B.23).

The optimal density for the parameter ηj is not a known distribution function. Proposition B.3.3 in Appendix B.3 provides an analytical approximation of its moments so that the optimal density can be calculated via numerical integration.

Horseshoe prior

We consider the Horseshoe prior as proposed by Carvalho, Polson, and Scott (2009, 2010). This is based on the hierarchical specification $\vartheta_{j,k}\mid\upsilon_{j,k}^2,\gamma^2\sim\mathcal{N}(0,\gamma^2\upsilon_{j,k}^2)$, $\gamma\sim\mathcal{C}^{+}(0,1)$, $\upsilon_{j,k}\sim\mathcal{C}^{+}(0,1)$, where $\mathcal{C}^{+}(0,1)$ denotes the standard half-Cauchy distribution with probability density function $f(x)=2/\{\pi(1+x^2)\}\,\mathbb{1}_{(0,\infty)}(x)$. The Horseshoe is a global-local prior that implies an aggressive shrinkage of weak signals without affecting the strong ones (see, e.g., Polson and Scott 2011). We follow Wand et al. (2011) and leverage a scale mixture representation of the half-Cauchy distribution,
$$\vartheta_{j,k}\mid\upsilon_{j,k}^2,\gamma^2\sim\mathcal{N}(0,\gamma^2\upsilon_{j,k}^2),\qquad \gamma^2\mid\eta\sim\mathrm{InvGa}(1/2,1/\eta),\qquad \upsilon_{j,k}^2\mid\lambda_{j,k}\sim\mathrm{InvGa}(1/2,1/\lambda_{j,k}),$$
$$\eta\sim\mathrm{InvGa}(1/2,1),\qquad \lambda_{j,k}\sim\mathrm{InvGa}(1/2,1), \tag{10}$$
where $\upsilon_{j,k}^2$ and $\gamma^2$ are the local and global shrinkage parameters, respectively.
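As a quick sanity check on the mixture representation in (10) (illustrative only, not part of the estimation algorithm), the following R snippet simulates $\gamma$ from the two inverse-gamma layers and compares its quantiles with those of a standard half-Cauchy.

```r
set.seed(4)
n <- 1e5
eta    <- 1 / rgamma(n, shape = 1/2, rate = 1)        # eta ~ InvGa(1/2, 1)
gamma2 <- 1 / rgamma(n, shape = 1/2, rate = 1 / eta)  # gamma^2 | eta ~ InvGa(1/2, 1/eta)
p <- c(0.25, 0.50, 0.75)
rbind(mixture     = quantile(sqrt(gamma2), p),        # implied draws of gamma
      half_cauchy = tan(pi * p / 2))                  # C+(0,1) quantiles
```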

Let $\xi_{HS}=(\xi,\upsilon^2,\gamma^2,\lambda,\eta)$ be the vector $\xi$ augmented with the parameters of the Horseshoe prior. The joint distribution $q(\xi_{HS})$ can be factorized as
$$q(\xi_{HS})=q(\xi)\,q(\upsilon^2,\gamma^2,\lambda,\eta),\qquad q(\upsilon^2,\gamma^2,\lambda,\eta)=q(\gamma^2)\,q(\eta)\prod_{j=1}^{d}\prod_{k=1}^{d+p+1}q(\upsilon_{j,k}^2)\,q(\lambda_{j,k}). \tag{11}$$

Proposition 3.4 provides the optimal variational density for the jth row of Θ under the Horseshoe prior outlined in (10). The proof and analytical derivations are reported in Appendix B.4.

Proposition 3.4.

The optimal variational density for $\vartheta_j$ is $q^{*}(\vartheta_j)\sim\mathcal{N}_{d+p+1}(\mu_{q(\vartheta_j)},\Sigma_{q(\vartheta_j)})$ with $\Sigma_{q(\vartheta_j)}=\big(\sum_{t=1}^{T}\mu_{q(\omega_{j,j,t})}\mathbf{z}_{t-1}\mathbf{z}_{t-1}^\top+\mu_{q(1/\gamma^2)}\mathrm{Diag}(\mu_{q(1/\upsilon_j^2)})\big)^{-1}$, where $\mathrm{Diag}(\mu_{q(1/\upsilon_j^2)})$ is a diagonal matrix with elements $\mu_{q(1/\upsilon_j^2)}=(\mu_{q(1/\upsilon_{j,1}^2)},\mu_{q(1/\upsilon_{j,2}^2)},\ldots,\mu_{q(1/\upsilon_{j,d+p+1}^2)})$. The parameters $\mu_{q(\vartheta_j)}$ and $\mu_{q(\omega_{j,j,t})}$ are as in Proposition 3.1. The optimal variational density for the global shrinkage parameter is $q^{*}(\gamma^2)\sim\mathrm{InvGa}\big(\tfrac{1}{2}\{d(d+p+1)+1\},b_{q(\gamma^2)}\big)$ with $b_{q(\gamma^2)}$ defined in (B.33), and $q^{*}(\eta)\sim\mathrm{InvGa}(1,b_{q(\eta)})$ with $b_{q(\eta)}$ defined in (B.35). The optimal variational densities for the local shrinkage parameters are $q^{*}(\upsilon_{j,k}^2)\sim\mathrm{InvGa}(1,b_{q(\upsilon_{j,k}^2)})$ and $q^{*}(\lambda_{j,k})\sim\mathrm{InvGa}(1,b_{q(\lambda_{j,k})})$, with $b_{q(\upsilon_{j,k}^2)}$ and $b_{q(\lambda_{j,k})}$ defined in (B.32) and (B.34), respectively.

3.2 From Shrinkage to Sparsity

In addition to computational tractability, shrinking rather than selecting is a defining feature of the hierarchical priors outlined in Section 3.1. Yet, posterior estimates of $\Theta$ are non-sparse and thus cannot provide an exact differentiation between significant and nonsignificant predictors. The latter is particularly relevant since we ultimately want to assess the accuracy of our variational inference approach—versus existing MCMC and variational Bayes algorithms—in identifying the exact structure of $\Theta$.

To address this issue, we build upon Ray and Bhattacharya (2018) and implement the Signal Adaptive Variable Selector (SAVS) algorithm to induce sparsity in $\hat{\Theta}$, conditional on a given prior. The SAVS is a post-processing algorithm that separates signals from nulls on the basis of the point estimates of the regression coefficients (see, e.g., Hauzenberger, Huber, and Onorante 2021). Specifically, let $\hat{\vartheta}_j$ be the posterior estimate of $\vartheta_j$ and $\mathbf{z}_j$ the associated vector of covariates. If $|\hat{\vartheta}_j|\,\|\mathbf{z}_j\|^2\leq|\hat{\vartheta}_j|^{-2}$, we set $\hat{\vartheta}_j=0$, where $\|\cdot\|$ denotes the Euclidean norm.

The reason why we rely on the SAVS post-processing to induce sparsity in the posterior estimates is threefold. First, as highlighted by Ray and Bhattacharya (2018), the SAVS represents an automatic procedure in which the sparsity-inducing property directly depends on the effectiveness of the shrinkage performed on $\hat{\vartheta}_j$. This refers to the precision of the posterior mean estimates; that is, the more accurate $\hat{\vartheta}_j$ is, the more precise the identification of the nonzero elements in $\Theta$. Second, the SAVS is "agnostic" with respect to the shrinkage prior or estimation approach adopted, so it represents a natural tool to compare different estimation methods. Third, it is decision-theoretically motivated, as it is grounded in the idea of minimizing the posterior expected loss (see, e.g., Huber, Koop, and Onorante 2021).
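A minimal R implementation of the SAVS rule described above could read as follows. Note that the full SAVS of Ray and Bhattacharya (2018) also soft-thresholds the surviving coefficients; this sketch, following the text, simply zeroes out the nulls and keeps the posterior means otherwise.

```r
savs <- function(Theta_hat, Z) {
  # Theta_hat: d x (d+p+1) posterior mean of Theta; Z: T x (d+p+1) covariates
  z_norm2 <- colSums(Z^2)                  # ||z_k||^2, one per covariate
  # keep entry (j,k) only if |theta| * ||z_k||^2 > |theta|^(-2)
  keep <- sweep(abs(Theta_hat), 2, z_norm2, `*`) > 1 / abs(Theta_hat)^2
  Theta_hat * keep                         # zero out the "nulls"
}
```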

In addition to SAVS, we expand on Hahn and Carvalho (2015) and provide a multivariate extension of their least-angle regression approach, originally built for univariate regressions. Appendix D.2 provides the full derivation as well as a complete discussion of its drawbacks compared to SAVS. In addition, Appendix D.2 reports the results of a direct comparison between the SAVS and our multivariate extension of Hahn and Carvalho (2015) based on simulated data.

3.3 Variational Predictive Density

Consider the posterior distribution $p(\xi\mid\mathbf{z}_{1:t})$ given the information set $\mathbf{z}_{1:t}=\{\mathbf{y}_{1:t},\mathbf{x}_{1:t}\}$ and the conditional likelihood $p(\mathbf{y}_{t+1}\mid\mathbf{z}_t,\xi)$. A standard predictive density takes the form
$$p(\mathbf{y}_{t+1}\mid\mathbf{z}_{1:t})=\int p(\mathbf{y}_{t+1}\mid\mathbf{z}_t,\xi)\,p(\xi\mid\mathbf{z}_{1:t})\,d\xi. \tag{12}$$

Given an optimal variational density $q^{*}(\xi)$ that approximates $p(\xi\mid\mathbf{z}_{1:t})$, we follow Gunawan, Kohn, and Nott (2020) and obtain the variational predictive distribution
$$q(\mathbf{y}_{t+1}\mid\mathbf{z}_{1:t})=\int p(\mathbf{y}_{t+1}\mid\mathbf{z}_t,\xi)\,q^{*}(\xi)\,d\xi=\iint p(\mathbf{y}_{t+1}\mid\mathbf{z}_t,\vartheta,\Omega_t)\,q^{*}(\vartheta)\,q^{*}(\Omega_t)\,d\vartheta\,d\Omega_t. \tag{13}$$

Although an analytical expression for (13) is not available, a simulation-based estimator for $q(\mathbf{y}_{t+1}\mid\mathbf{z}_{1:t})$ can be obtained through Monte Carlo integration by averaging $p(\mathbf{y}_{t+1}\mid\mathbf{z}_t,\xi^{(i)})$ over the draws $\xi^{(i)}\sim q^{*}(\xi)$, such that $\hat{q}(\mathbf{y}_{t+1}\mid\mathbf{z}_{1:t})=N^{-1}\sum_{i=1}^{N}p(\mathbf{y}_{t+1}\mid\mathbf{z}_t,\xi^{(i)})$. Note that a complete characterization of the optimal variational predictive density entails $q^{*}(\Omega_t)$ with $\Omega_t=L^\top V_t L$. Proposition 3.5 shows that, conditional on $L$ and $V_t$, the optimal distribution of $\Omega_t$ can be approximated by a $d$-dimensional Wishart distribution $\mathrm{Wishart}_d(\delta_t,H_t)$, where $\delta_t$ and $H_t$ are the degrees-of-freedom parameter and the scaling matrix, respectively.

Proposition 3.5.

The approximate distribution $\tilde{q}$ of $\Omega_t$ is $\mathrm{Wishart}_d(\hat{\delta}_t,\hat{H}_t)$, where the scaling matrix is given by $\hat{H}_t=\hat{\delta}_t^{-1}\mathbb{E}_q[\Omega_t]$ and $\hat{\delta}_t$ can be obtained numerically as the solution of a convex optimization problem.

The complete proof is available in Appendix C.1 and is based on the Expectation Propagation (EP) approach proposed by Minka (2001). In order to implement this approach, there is no need to know $q^{*}(\Omega_t)$; it is sufficient to be able to compute $\mathbb{E}_q[\Omega_t]$. The latter can be reconstructed from the optimal variational densities of the Cholesky factor, $q^{*}(\beta)$ (and therefore of $L$), and of $q^{*}(V_t)$. The simulation results in Appendix C.1 show that the proposed Wishart distribution provides an accurate approximation of $q^{*}(\Omega_t)$ for both small and large dimensional models.

Based on Proposition 3.5, we can further simplify (13) by integrating out $\Omega_t$, such that
$$q(\mathbf{y}_{t+1}\mid\mathbf{z}_{1:t})=\int h(\mathbf{y}_{t+1}\mid\mathbf{z}_t,\vartheta)\,q^{*}(\vartheta)\,d\vartheta, \tag{14}$$
where $h(\mathbf{y}_{t+1}\mid\mathbf{z}_t,\vartheta)$ denotes the probability density function of a multivariate Student-t distribution $t_v(\mathbf{m},S)$ with mean $\mathbf{m}=\Theta\mathbf{z}_t$, scaling matrix $S=(v\hat{H})^{-1}$, and degrees-of-freedom parameter $v=\hat{\delta}-d+1$. As a result, the predictive distribution can be more efficiently approximated by averaging the density of the multivariate Student-t, $h(\mathbf{y}_{t+1}\mid\mathbf{z}_t,\vartheta^{(i)})$, over the draws $\vartheta^{(i)}\sim q^{*}(\vartheta)$ for $i=1,\ldots,N$, such that $\hat{q}(\mathbf{y}_{t+1}\mid\mathbf{z}_{1:t})=N^{-1}\sum_{i=1}^{N}h(\mathbf{y}_{t+1}\mid\mathbf{z}_t,\vartheta^{(i)})$.

Note that the main advantage of the approximation obtained from Proposition 3.5 is that it allows for a considerably faster computation of the variational predictive density, compared to using $q^{*}(L)$ and $q^{*}(V_t)$ as stationary distributions from which to sample $\Omega_t$, as in an MCMC. This is because the scaling matrix of the Wishart distribution is available in closed form, and the computation of the degrees of freedom requires only a one-dimensional optimization. In Appendix C.2, we discuss a further simplification that minimizes the KL divergence between the multivariate Student-t and a multivariate Normal distribution.
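A sketch of the resulting predictive evaluation is given below, assuming draws from $q^{*}(\vartheta)$ and the Wishart parameters $(\hat{\delta},\hat{H})$ of Proposition 3.5 are available; the function name and input layout are illustrative, and the Student-t density is evaluated with the mvtnorm package.

```r
library(mvtnorm)

predictive_logscore <- function(y_next, z_t, theta_draws, delta_hat, H_hat) {
  # theta_draws: list of N draws of Theta from q*(theta)
  d <- length(y_next)
  v <- delta_hat - d + 1                  # degrees of freedom v = delta - d + 1
  S <- solve(v * H_hat)                   # scale matrix S = (v H)^{-1}
  dens <- sapply(theta_draws, function(Theta_i)
    dmvt(y_next, delta = drop(Theta_i %*% z_t), sigma = S, df = v, log = FALSE))
  log(mean(dens))                         # log of q_hat(y_{t+1} | z_{1:t})
}
```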

4 Simulation Study

In this section, we report the results of an extensive simulation study designed to compare the properties of our estimation approach against both MCMC and variational Bayes methods for large VAR models. To begin, we compare our VB algorithm against the MCMC approach of Chan and Eisenstat (2018) and Cross, Hou, and Poon (2020) and the variational inference algorithm proposed by Chan and Yu (2022) and Gefang, Koop, and Poon (2023). Both approaches are built upon the structural VAR representation in (2b). In addition, we also compare our VB method against the MCMC approach developed by Huber and Feldkircher (2019) and Gruber and Kastner (2022), which is based upon the nonlinear parameterization in (2a).

For comparability with Gruber and Kastner (2022) and Gefang, Koop, and Poon (2023), which do not consider the presence of exogenous predictors, we consider a standard VAR(1) as the data generating process. Consistent with the empirical implementation, we set T = 360 and d = 30, 49. The choice of d reflects the two alternative industry classifications explored in the main empirical analysis. We assume either a moderate (50% of zeros) or a high (90% of zeros) level of sparsity in the true matrix $\Theta$. The latter is generated as follows: we fix to zero $s\cdot d^2$ entries at random, with s = 0.5, 0.9 and d = 30, 49, and the remaining nonzero coefficients are sampled from a mixture of two normal distributions with means equal to ±0.08 and standard deviation 0.1. Appendix D provides additional details on the data-generating process and additional simulation results for d = 15.

4.1 Estimation Accuracy

As a measure of point estimation accuracy, we first look at the Frobenius norm $\|\Theta-\hat{\Theta}\|_F$, which measures the difference between the true $\Theta$ observed at each simulation and its estimate $\hat{\Theta}$. In addition, we compare the ability of each estimation method to identify the nonzero elements in the true $\Theta$ based on the F1 score. This is expressed as a function of the counts of true positives (tp), false positives (fp), and false negatives (fn), $F_1=\frac{2\,tp}{2\,tp+fp+fn}$.

The F1 score takes value one if identification is perfect, that is, no false positives and no false negatives, and zero if there are no true positives. We compute both measures of estimation accuracy on N = 100 replications and compare each estimation method and prior specification. The estimates from the MCMC specifications are based on 5000 posterior simulations after discarding the first 5000 as a burn-in sample.
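Both accuracy measures amount to a few lines of R; minimal helpers are sketched below (the tolerance used to declare an estimated entry nonzero is our own choice).

```r
frobenius <- function(Theta, Theta_hat) sqrt(sum((Theta - Theta_hat)^2))

f1_score <- function(Theta, Theta_hat, tol = 1e-10) {
  true_nz <- abs(Theta) > tol; est_nz <- abs(Theta_hat) > tol
  tp <- sum(true_nz & est_nz)              # true positives
  fp <- sum(!true_nz & est_nz)             # false positives
  fn <- sum(true_nz & !est_nz)             # false negatives
  2 * tp / (2 * tp + fp + fn)
}
```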

Point estimates

Figure 2 shows the box charts summarizing the Frobenius norm $\|\Theta-\hat{\Theta}\|_F$ across N = 100 replications. We label the linearized MCMC and variational methods LMCMC and LVB, respectively; MCMC denotes the nonlinear method of Gruber and Kastner (2022) and VB our variational inference method. To increase readability, we separate the results by prior and color-code the four estimation methods. For a given subplot, we report the results for the Normal, adaptive-Lasso, adaptive Normal-Gamma, and Horseshoe priors from the left to the right panel. Within each panel, the simulation results for the LMCMC, LVB, MCMC, and VB estimates are reported in red, yellow, light blue, and green, respectively.

Fig. 2 Frobenius norm of $\Theta-\hat{\Theta}$ across N = 100 replications, for different shrinkage priors and different inference methods.


Beginning with the moderate sparsity case (top panels), the simulation results show that the LMCMC and LVB approaches tend to perform on par across different shrinkage priors, with the only exception of the Normal-Gamma prior, where LMCMC slightly outperforms LVB. However, the discrepancy between the two structural VAR representation methods increases when sparsity becomes more pervasive (see bottom panels).

Overall, the simulation results support our view that by eliciting shrinkage priors directly on $\Theta$—as per the parameterization in (2a)—the accuracy of the posterior estimates improves. The mean squared errors obtained from MCMC and VB are lower compared to both LMCMC and LVB. This holds for all priors and model dimensions. With d = 30, the accuracy of MCMC and VB is virtually the same. Yet, with d = 49, our VB is slightly more accurate than MCMC for the adaptive-Lasso and the Horseshoe priors.

Sparsity identification

Figure 3 shows the box charts of F1 scores across N = 100 simulations. The labeling is the same as in Figure 2. Both LMCMC and LVB produce a rather dismal identification of the nonzero elements in $\Theta$ across priors and model dimensions. This is due to the fact that $\hat{\Theta}=\hat{L}^{-1}\hat{A}$ in (2b), so that a sparse $\hat{A}$ does not translate into a sparse $\hat{\Theta}$, which makes these methods less accurate in identifying the nonzero coefficients in the true $\Theta$. When the level of sparsity increases, so does the divergence between $A$ and $\Theta$.

Fig. 3 F1 score computed across N = 100 replications by looking at the true non-null parameters in Θ and the non-null parameters estimated based on Θ̂.


Consistent with our argument in favor of the parameterization in (2a), both the MCMC and VB approaches produce a more accurate identification of the nonzero coefficients in $\Theta$, as shown by the F1 score. The gap between LMCMC and LVB, on the one hand, and MCMC and VB, on the other, becomes larger for higher levels of sparsity. This result holds across different hierarchical shrinkage priors and for different model dimensions. Yet, our VB approach turns out to be more accurate than MCMC under the adaptive-Lasso and Horseshoe priors for higher sparsity levels.

Note that, in the simulation results, sparsity in the posterior estimates $\hat{\Theta}$ under the different hierarchical shrinkage priors is induced by the SAVS algorithm of Ray and Bhattacharya (2018). Appendix D.2 provides additional simulation results obtained by implementing a multivariate version of the post-processing method of Hahn and Carvalho (2015) as an alternative to the SAVS. The F1 scores are largely the same across methods; in fact, when using the extended Hahn and Carvalho (2015) approach, the evidence is even more in favor of our VB compared to its MCMC counterpart: our VB is more accurate than MCMC with a Normal-Gamma prior.

Computational efficiency

Figure 4 reports the computational time, expressed on a log-minute scale, required by each competing estimation approach under different shrinkage priors. To highlight the performance for a given prior, we separate the results by estimation method and color-code the four different shrinkage priors. For a given subplot, we report the results for the LMCMC, LVB, MCMC, and VB estimates from the left to the right panel. Within each panel, the Normal, adaptive-Lasso, adaptive Normal-Gamma, and Horseshoe priors are colored in shades of gray from light (left) to dark (right), respectively. To guarantee reliable comparability, we re-coded all competing methods in Rcpp and used the same 2.5 GHz Intel Xeon W-2175 with 32GB of RAM for all implementations. This allows us to compare all methods on an equal footing.

Fig. 4 Computational time required by each estimation approach for different hierarchical shrinkage priors. The time is expressed on a log-minute scale.


The results highlight that our VB approach has a clear computational advantage over both linear and nonlinear MCMC methods. For instance, for d = 30, our VB is more than 100 times faster than the MCMC of Gruber and Kastner (2022) and more than 10 times faster than the LMCMC of Cross, Hou, and Poon (2020). The gap in favor of our VB method increases in larger dimensions; for d = 49, the MCMC approach takes on average almost 60 min to generate posterior estimates that are comparably accurate to our VB, which instead takes on average between 30 and 40 sec. Such an efficiency gap has profound implications for practical forecasting implementations, especially within the context of recursive predictions with higher-frequency data such as stock returns (see Section 5.2). Perhaps not surprisingly, the LVB approach of Chan and Yu (2022) and Gefang, Koop, and Poon (2023) is highly competitive in terms of computational efficiency. However, being built on a structural VAR formulation, Figures 2 and 3 show that its computational efficiency comes at the cost of substantially lower estimation accuracy.

Appendix E.1 provides a further qualitative discussion on the computational costs of some of the existing MCMC approaches. We review some of the results reported in the original papers and show that these largely align with our own findings. In addition, we also discuss some of the limitations of the nonlinear MCMC for the recursive forecasting implementation.

Robustness to variable permutation

At the paper's outset, we argued that a conventional structural VAR formulation potentially generates posterior estimates that are sensitive to variable permutation. That is, posterior estimates of $\Theta$ depend on the ordering imposed on the target variables $\mathbf{y}_t$, conditional on a given prior. To highlight this issue, we report in Appendix D additional simulation results for all estimation methods and shrinkage priors under variable permutation. The results show that the accuracy of the posterior estimates from both LMCMC and LVB changes once the variable ordering is reversed. This is especially evident for the Normal-Gamma and Horseshoe priors and when zero coefficients in $\Theta$ are more pervasive. On the other hand, the estimation accuracy of both the MCMC approach of Gruber and Kastner (2022) and our VB method does not substantially deteriorate when arbitrarily changing the ordering of the target variables.

Note that although our approach provides estimates and point predictions that are robust with respect to variable ordering, density forecasts might differ. To address this issue, Arias, Rubio-Ramirez, and Shin (2023) propose a time-varying correlation matrix model based on the parameterization of Archakov and Hansen (2021). Still, the latter is computationally intensive and may not be suitable for large datasets. As shown by Chan, Koop, and Yu (2023), a decomposition $\Omega_t=L^\top V_t L$, where $L$ is an unrestricted square matrix rather than lower-triangular, represents a valid alternative toward permutation invariance in high-dimensional settings.

5 An Empirical Study of Industry Returns Predictability

We investigate both the statistical and economic value of our variational Bayes approach within the context of US industry returns predictability. To expand the scope of the empirical exercise, we consider two alternative industry aggregations: d = 30 industry portfolios from July 1926 to May 2020 and a larger cross-section of d = 49 industry portfolios from July 1969 to May 2020. The size of the cross-section changes due to the different industry classifications. At the end of June in year t, each NYSE, AMEX, and NASDAQ stock is assigned to an industry portfolio based on its four-digit SIC code. The returns on a given value-weighted portfolio are then computed from July of year t to June of year t + 1. The sample periods cover major events, from the Great Depression to the Covid-19 outbreak.

In addition to cross-industry portfolio returns, we consider a variety of predictors, such as the returns on the market portfolio (mkt) and the returns on four alternative long-short investment strategies based on market capitalization (smb), book-to-market ratios (hml), operating profitability (rmw), and firm investments (cma) (see Fama and French 2015). We also consider a set of additional macroeconomic predictors from Goyal and Welch (2008), such as the log price-dividend ratio (pd), the difference between the long-term yield on government bonds and the T-bill (term), the BAA-AAA bond yield difference (credit), the monthly log change in the CPI (infl), the aggregate market book-to-market ratio (bm), the net equity issuing activity (ntis), and the corporate bond returns (corpr).

5.1 In-Sample Parameter Estimates

In order to highlight some of the main properties of each method, we first report the in-sample estimates of $\Theta$ for the d = 49 industry case for all priors. Figure 5 compares $\hat{\Theta}$ based on the full sample obtained from the LMCMC and the LVB with constant volatility, and our VB with and without stochastic volatility. Appendix E.3 reports the additional in-sample estimates for the d = 30 industry portfolios.

Fig. 5 Variational Bayes estimates of the regression coefficients Θ for different estimation methods. We report the estimates for the d = 49 industry case obtained for all priors. We report the results for VB with and without stochastic volatility.


The in-sample estimates highlight three key results. First, there are visible differences across shrinkage priors. For instance, the Horseshoe tends to shrink parameters more aggressively toward zero, so that $\hat{\Theta}$ is more sparse than under, for example, the adaptive Normal-Gamma. Second, consistent with Gefang, Koop, and Poon (2023), the estimates from the LMCMC and LVB tend to be closely related. Yet, these in-sample estimates are substantially different from those of our VB approach. This is due to the parameterization $\hat{\Theta}=\hat{L}^{-1}\hat{A}$ in (2b). Third, with the exception of the adaptive-Lasso prior, the estimates $\hat{\Theta}$ from VB are remarkably stable between the constant and stochastic volatility specifications.

5.2 Out-of-Sample Forecasting Accuracy

Intuitively, different estimates of $\Theta$ should be reflected in different conditional forecasts. To test this intuition, we now compare the LMCMC, LVB, and VB estimation approaches with and without stochastic volatility. For completeness, we also consider a series of univariate model specifications (U henceforth), which is akin to assuming conditional independence across industry portfolios. We consider a 360-month rolling window for each model estimation; for instance, for the 30-industry classification, the out-of-sample period runs from July 1957 to May 2020.

Given the recursive nature of the empirical implementation and the extensive out-of-sample period, we do not consider the MCMC approach of Gruber and Kastner (2022). This is because the computational cost would make such an implementation prohibitive in practice, as discussed in the simulation study based on Figure 4. For instance, on a 2.5 GHz Intel Xeon W-2175 with 32GB of RAM and 14 cores, it would take 20 min × 767 forecasts × 4 priors = 61,360 min, or roughly 42 days, to recursively implement the MCMC approach for the 30 industry portfolios with constant volatility. The computational cost would be even larger when adding stochastic volatility or for the 49 industry portfolios. Appendix E.1 provides a complete discussion of the computational costs of some of the existing MCMC approaches and their relevance for a higher-frequency forecasting implementation, such as ours.

Point forecasts

We begin by inspecting the accuracy of point forecasts for each industry based on the out-of-sample predictive $R^2$ (see, e.g., Goyal and Welch 2008),
$$R^2_{j,\mathrm{oos}}(\mathcal{M}_s)=1-\frac{\sum_{t=t_0}^{T}\big(y_{jt}-\hat{y}_{jt}(\mathcal{M}_s)\big)^2}{\sum_{t=t_0}^{T}\big(y_{jt}-\bar{y}_{jt}\big)^2},$$
where $t_0$ is the date of the first prediction, $\bar{y}_{jt}$ is the naive forecast from the recursive mean (using the same rolling window of observations), and $\hat{y}_{jt}(\mathcal{M}_s)$ is the conditional mean return for industry $j=1,\ldots,d$ for a given estimation method $\mathcal{M}_s$.
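For concreteness, a one-line R helper implementing this measure (with hypothetical input vectors spanning the forecast period) is:

```r
# y: realized returns; y_hat: model forecasts; y_bar: naive rolling-mean forecasts
r2_oos <- function(y, y_hat, y_bar) 1 - sum((y - y_hat)^2) / sum((y - y_bar)^2)
```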

The left panels of Figure 6 show the box charts with the distribution of the $R^2_{j,\mathrm{oos}}$ over the $j=1,\ldots,d$ industries. For a given subplot, the results for the Normal, Bayesian Lasso, Normal-Gamma, and Horseshoe priors are reported from left to right. Within each panel of a subplot, the forecasting results for the U, LMCMC, LVB, and VB estimates are color-coded in orange, red, yellow, and green (from left to right), respectively. The vertical dashed line within each panel separates the constant and stochastic volatility specifications. Based on the same separation across methods and priors, the right panels of Figure 6 report a breakdown of the industries for which $R^2_{j,\mathrm{oos}}(\mathcal{M}_s)>0$.

Fig. 6 Left panels report the $R^2_{j,\mathrm{oos}}(\mathcal{M}_s)$ (in %) across industry portfolios. Right panels report the industries for which a given model can generate $R^2_{j,\mathrm{oos}}(\mathcal{M}_s)>0$. The top (bottom) panels report the results for 30 (49) industry portfolios.


The out-of-sample $R^2_{j,\mathrm{oos}}(\mathcal{M}_s)$ tends to be mostly negative across estimation methods and shrinkage priors. This is consistent with the existing evidence on stock returns predictability: a simple naive forecast based on a rolling sample mean represents a challenging benchmark to beat (see, e.g., Campbell and Thompson 2007). However, our variational inference approach substantially improves upon the univariate regressions and the LMCMC and LVB methods, which are both based on a structural VAR representation. For instance, our VB with stochastic volatility generates a positive $R^2_{j,\mathrm{oos}}(\mathcal{M}_s)$ for more than half of the 30 industry portfolios under the adaptive Normal-Gamma and the Horseshoe priors. This compares with 4 (adaptive Normal-Gamma) and 3 (Horseshoe) positive $R^2_{j,\mathrm{oos}}(\mathcal{M}_s)$ obtained from the LMCMC with stochastic volatility. The gap further increases for the 49-industry classification; our VB method is virtually the only approach that can systematically generate positive $R^2_{j,\mathrm{oos}}(\mathcal{M}_s)$ across industries. Although more concentrated on the Horseshoe prior, the outperformance of our method relative to both the LMCMC and LVB holds across different priors.

Density forecasts

We follow Fisher et al. (2020) and assess the accuracy of the density forecasts across priors and estimation methods based on the average log-score (ALS) differential with respect to a "no-predictability" benchmark,
$$\mathrm{ALS}_j(\mathcal{M}_s)=\frac{1}{T-t_0}\sum_{t=t_0}^{T}\big(\ln S_{jt}(\mathcal{M}_s)-\ln\bar{S}_{jt}\big), \tag{15}$$
where $\ln S_{jt}(\mathcal{M}_s)$ denotes the log-score at time $t$ for industry $j$, obtained by evaluating a Normal density with the conditional mean and variance forecasts from model $\mathcal{M}_s$. Consistent with the rationale of the no-predictability benchmark in $R^2_{j,\mathrm{oos}}(\mathcal{M}_s)$, the log-score $\ln\bar{S}_{jt}$ is constructed by evaluating a Normal density based on the recursive mean and variance.
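Assuming Gaussian predictive densities with model-based moments (m, s2) and benchmark moments (m0, s02) at each forecast date, (15) reduces to a short R helper such as the following (input names are illustrative):

```r
als <- function(y, m, s2, m0, s02) {
  ls_model <- dnorm(y, mean = m,  sd = sqrt(s2),  log = TRUE)   # ln S_jt(M_s)
  ls_bench <- dnorm(y, mean = m0, sd = sqrt(s02), log = TRUE)   # ln S_bar_jt
  mean(ls_model - ls_bench)
}
```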

Figure 7 reports the results. The labeling is the same as in Figure 6. The results show that adding stochastic volatility substantially improves the accuracy of density forecasts across priors and estimation methods. For instance, our VB method with stochastic volatility generates positive log-score differentials for almost all of the portfolios in the 30-industry classification and for more than half of the 49 industry portfolios. Interestingly, when it comes to density forecasts rather than modeling expected returns, the Gefang, Koop, and Poon (2023) variational method built on a structural VAR representation performs on par with our VB method. This is likely due to stochastic volatility alone, since our VB still stands out among the constant volatility specifications. Overall, our VB approach outperforms the competing estimation methods under all prior specifications.

Fig. 7 Left panels report the log-score differential across industry portfolios. Right panels report the industries for which a given model can generate a positive log-score differential. The top (bottom) panels report the results for 30 (49) industry portfolios.


Returns predictability over the business cycle

Existing literature suggests that expected returns are counter-cyclical and that returns predictability is more pronounced during periods of economic contraction than expansion (see, e.g., Rapach, Strauss, and Zhou 2010). Thus, in the following, we investigate whether the forecasting performance of our estimation framework changes over the business cycle. More precisely, we split the data into recession and expansion periods using the NBER dates of peaks and troughs. This information is considered ex post and is not used at any time in the estimation and/or forecasting process. We compute the corresponding $R^2_{j,\mathrm{oos}}(\mathcal{M}_s)$ for the recession periods only.

Figure 8 reports the industries for which $R^2_{j,\mathrm{oos}}(\mathcal{M}_s)>0$ for both the 30 (left panel) and the 49 (right panel) industry classifications. The corresponding cross-sectional distribution of the $R^2_{j,\mathrm{oos}}(\mathcal{M}_s)$ and the relative log-scores are reported in Appendix E.2. The labeling of Figure 8 is the same as in Figure 6. Compared with the full sample, the results suggest that the accuracy of the predictions substantially improves across methods and priors. Nevertheless, our VB method outperforms the naive forecast from the rolling mean for a larger fraction of industry portfolios than the other methods, in particular when stochastic volatility is added to the model. The difference between the recession and the full-sample performance persists when considering the 49-industry classification, especially for the adaptive Normal-Gamma and the Horseshoe priors.

Fig. 8 The figure reports the industries for which $R^2_{j,\mathrm{oos}}(\mathcal{M}_s)>0$. The left (right) panel reports the results for 30 (49) industry portfolios.


5.3 Economic Evaluation

A positive predictive performance does not necessarily translate into economic value. In practice, however, an investor is keenly interested in the economic value of returns predictability, perhaps even more than in its statistical significance. Hence, it is of paramount importance to evaluate the extent to which apparent gains in predictive accuracy translate into better investment performance.

Following the existing literature (see, e.g., Goyal and Welch 2008; Rapach, Strauss, and Zhou 2010), we consider a representative investor with a single-period horizon and mean-variance preferences who allocates her wealth between an industry portfolio and a risk-free asset. The investor's optimal allocation to stocks for period t + 1 based on information at time t is given by $w_{jt}=\frac{1}{\gamma}\,\hat{y}_{jt}\,\hat{\nu}_{jt}$, where $\hat{y}_{jt}$ represents the expected return for industry $j=1,\ldots,d$ and $\hat{\nu}_{jt}^{-1}$ the corresponding volatility forecast at time $t$. We also constrain the weights for each industry to $-0.5\leq w_{jt}\leq 1.5$ to prevent extreme short sales and leveraged positions. We assume a risk aversion coefficient of $\gamma=5$.

Figure 9 reports the average utility gain, in monthly percentage terms, obtained by using a given forecast $\hat{y}_{jt}$ instead of the recursive sample mean $\bar{y}_{jt}$. The average utility for a given model is calculated as $\hat{u}_j=\bar{r}_j-0.5\,\gamma\,\bar{\sigma}_j^2$, where $\bar{r}_j$ and $\bar{\sigma}_j^2$ represent the sample mean and variance, respectively, of the portfolio return $r_{j,t+1}=w_{jt}\,y_{j,t+1}$ realized over the forecasting period for industry $j=1,\ldots,d$ under a given prior specification and estimation method. The utility gain is calculated by subtracting from $\hat{u}_j$ the average utility obtained by using the naive forecast from the recursive mean and variance to compute $w_{jt}$. A positive utility gain indicates the fee a risk-averse investor would be willing to pay to access the investment strategy implied by $\mathcal{M}_s$.
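A compact R sketch of this exercise for a single industry, using the truncation bounds and $\gamma=5$ from the text (input names and layout are our own illustrative choices), is:

```r
utility_gain <- function(y, y_hat, var_hat, y_bar, var_bar, gamma = 5) {
  # Truncated mean-variance weights for the model and the naive benchmark
  w  <- pmin(pmax(y_hat / (gamma * var_hat), -0.5), 1.5)
  w0 <- pmin(pmax(y_bar / (gamma * var_bar), -0.5), 1.5)
  r  <- w  * y                            # realized portfolio returns r_{j,t+1}
  r0 <- w0 * y
  u  <- mean(r)  - 0.5 * gamma * var(r)   # average utility u_hat_j
  u0 <- mean(r0) - 0.5 * gamma * var(r0)
  u - u0                                  # monthly fee the investor would pay
}
```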

Fig. 9 The left panel reports the cross-sectional distribution of the average utility gain across industry portfolios. The right panel reports the industries for which the utility gain is positive. The top (bottom) panels report the results for the 30-industry (49-industry) classification.


The out-of-sample economic significance largely confirms the statistical performance across methods. From a purely economic standpoint, the forecast from a recursive mean is quite challenging to beat: the average utility gain is mostly negative, with the only exception of those generated by VB with a Horseshoe prior specification. The results show that a representative investor with mean-variance utility is willing to pay, on average, a monthly fee of almost 15 basis points to access the strategy based on our variational estimation of a large VAR with stochastic volatility. In addition, the right panels of Figure 9 show that the positive economic value obtained from our VB is more broadly spread across industries than that of the alternative methods. This holds especially true for the 30-industry classification but also applies to the more granular 49-industry classification.

6 Concluding Remarks

We propose a novel variational Bayes inference method for large-scale VARs with exogenous predictors and stochastic volatility. Different from most existing estimation methods for high-dimensional VAR models, our approach does not rely on a structural-form representation. This allows fast and accurate estimation of the model parameters without leveraging a standard Cholesky-based transformation of the parameter space. We show, both in simulation and empirically, that across different prior specifications our estimation approach outperforms, both statistically and economically, forecasts from existing benchmark estimation strategies, such as equivalent nonlinear MCMC algorithms (see, e.g., Gruber and Kastner 2022), linearized MCMC (see, e.g., Cross, Hou, and Poon 2020), and linearized variational inference methods (see, e.g., Gefang, Koop, and Poon 2023).

Supplementary Materials

The supplementary material contains the proofs and derivations of all propositions and theoretical results in the paper, as well as additional simulation and empirical results, which are briefly discussed in the main text. The R code for the variational inference scheme developed in the paper can be found at: https://github.com/whitenoise8/Variational-inference-for-large-Bayesian-vector-autoregressions.


Acknowledgments

We are thankful to Andrea Carriero and seminar participants at the 2021 Virtual NBER-NSF SBIES, the 2021 European Summer Meeting of the Econometric Society, the 2nd Workshop on Dimensionality Reduction and Inference in High-Dimensional Time Series at Maastricht University, the 2023 Summer Forum Workshop on Macroeconomics and Policy Evaluation at the Barcelona School of Economics, and the 3rd International Conference on Econometrics and Business Analytics in Tashkent, for their helpful comments and suggestions.

Disclosure Statement

The authors report there are no competing interests to declare.


Funding

This research has been partially funded by the BERN_BIRD2222_01 - BIRD 2022 grant from the University of Padua.

References

  • Archakov, I., and Hansen, P. R. (2021), “A New Parametrization of Correlation Matrices,” Econometrica, 89, 1699–1715. DOI: 10.3982/ECTA16910.
  • Arias, J. E., Rubio-Ramirez, J. F., and Shin, M. (2023), “Macroeconomic Forecasting and Variable Ordering in Multivariate Stochastic Volatility Models,” Journal of Econometrics, 235, 1054–1086. DOI: 10.1016/j.jeconom.2022.04.013.
  • Avramov, D. (2004), “Stock Return Predictability and Asset Pricing Models,” Review of Financial Studies, 17, 699–738. DOI: 10.1093/rfs/hhg059.
  • Bernardi, M., Bianchi, D., and Bianco, N. (2023), “Dynamic Variable Selection in High-Dimensional Predictive Regressions,” working paper.
  • Bognanni, M. (2022), “Comment on ‘Large Bayesian Vector Autoregressions with Stochastic Volatility and Non-conjugate Priors’,” Journal of Econometrics, 227, 498–505. DOI: 10.1016/j.jeconom.2021.10.008.
  • Campbell, J. Y., and Thompson, S. B. (2007), “Predicting Excess Stock Returns Out of Sample: Can Anything Beat the Historical Average?” The Review of Financial Studies, 21, 1509–1531. DOI: 10.1093/rfs/hhm055.
  • Carriero, A., Clark, T. E., and Marcellino, M. (2019), “Large Bayesian Vector Autoregressions with Stochastic Volatility and Non-conjugate Priors,” Journal of Econometrics, 212, 137–154. DOI: 10.1016/j.jeconom.2019.04.024.
  • Carriero, A., Chan, J., Clark, T. E., and Marcellino, M. (2022), “Corrigendum to ‘Large Bayesian Vector Autoregressions with Stochastic Volatility and Non-conjugate Priors’ [J. Econometrics 212 (1) (2019) 137–154],” Journal of Econometrics, 227, 506–512.
  • Carvalho, C. M., Polson, N. G., and Scott, J. G. (2009), “Handling Sparsity via the Horseshoe,” in Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, April 16–18, 2009 (Vol. 5), pp. 73–80.
  • Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010), “The Horseshoe Estimator for Sparse Signals,” Biometrika, 97, 465–480. DOI: 10.1093/biomet/asq017.
  • Chan, J. C. (2021), “Minnesota-Type Adaptive Hierarchical Priors for Large Bayesian VARs,” International Journal of Forecasting, 37, 1212–1226. DOI: 10.1016/j.ijforecast.2021.01.002.
  • Chan, J. C., and Eisenstat, E. (2018), “Bayesian Model Comparison for Time-Varying Parameter VARs with Stochastic Volatility,” Journal of Applied Econometrics, 33, 509–532. DOI: 10.1002/jae.2617.
  • Chan, J. C., and Yu, X. (2022), “Fast and Accurate Variational Inference for Large Bayesian VARs with Stochastic Volatility,” Journal of Economic Dynamics and Control, 143, 104505. DOI: 10.1016/j.jedc.2022.104505.
  • Chan, J. C., Koop, G., and Yu, X. (2023), “Large Order-Invariant Bayesian VARs with Stochastic Volatility,” Journal of Business & Economic Statistics, 1–13. DOI: 10.1080/07350015.2023.2252039.
  • Cross, J. L., Hou, C., and Poon, A. (2020), “Macroeconomic Forecasting with Large Bayesian VARs: Global-Local Priors and the Illusion of Sparsity,” International Journal of Forecasting, 36, 899–915. DOI: 10.1016/j.ijforecast.2019.10.002.
  • Fama, E. F., and French, K. R. (1997), “Industry Costs of Equity,” Journal of Financial Economics, 43, 153–193. DOI: 10.1016/S0304-405X(96)00896-3.
  • ———(2015), “A Five-Factor Asset Pricing Model,” Journal of Financial Economics, 116, 1–22.
  • Ferson, W. E., and Harvey, C. R. (1991), “The Variation of Economic Risk Premiums,” Journal of Political Economy, 99, 385–415. DOI: 10.1086/261755.
  • ———(1999), “Conditioning Variables and the Cross Section of Stock Returns,” The Journal of Finance, 54, 1325–1360.
  • Ferson, W. E., and Korajczyk, R. A. (1995), “Do Arbitrage Pricing Models Explain the Predictability of Stock Returns?” Journal of Business, 68, 309–349. DOI: 10.1086/296667.
  • Fisher, J. D., Pettenuzzo, D., and Carvalho, C. M. (2020), “Optimal Asset Allocation with Multivariate Bayesian Dynamic Linear Models,” Annals of Applied Statistics, 14, 299–338.
  • Gefang, D., Koop, G., and Poon, A. (2023), “Forecasting Using Variational Bayesian Inference in Large Vector Autoregressions with Hierarchical Shrinkage,” International Journal of Forecasting, 39, 346–363. DOI: 10.1016/j.ijforecast.2021.11.012.
  • Goyal, A., and Welch, I. (2008), “A Comprehensive Look at the Empirical Performance of Equity Premium Prediction,” The Review of Financial Studies, 21, 1455–1508. DOI: 10.1093/rfs/hhm014.
  • Griffin, J. E., and Brown, P. J. (2010), “Inference with Normal-Gamma Prior Distributions in Regression Problems,” Bayesian Analysis, 5, 171–188. DOI: 10.1214/10-BA507.
  • Gruber, L., and Kastner, G. (2022), “Forecasting Macroeconomic Data with Bayesian VARs: Sparse or Dense? It Depends!,” arXiv preprint arXiv:2206.04902.
  • Gunawan, D., Kohn, R., and Nott, D. (2020), “Variational Approximation of Factor Stochastic Volatility Models,” arXiv preprint arXiv:2010.06738.
  • Hahn, P. R., and Carvalho, C. M. (2015), “Decoupling Shrinkage and Selection in Bayesian Linear Models: A Posterior Summary Perspective,” Journal of the American Statistical Association, 110, 435–448. DOI: 10.1080/01621459.2014.993077.
  • Hauzenberger, N., Huber, F., and Onorante, L. (2021), “Combining Shrinkage and Sparsity in Conjugate Vector Autoregressive Models,” Journal of Applied Econometrics, 36, 304–327. DOI: 10.1002/jae.2807.
  • Hou, K., and Robinson, D. T. (2006), “Industry Concentration and Average Stock Returns,” The Journal of Finance, 61, 1927–1956. DOI: 10.1111/j.1540-6261.2006.00893.x.
  • Huber, F., and Feldkircher, M. (2019), “Adaptive Shrinkage in Bayesian Vector Autoregressive Models,” Journal of Business & Economic Statistics, 37, 27–39. DOI: 10.1080/07350015.2016.1256217.
  • Huber, F., Koop, G., and Onorante, L. (2021), “Inducing Sparsity and Shrinkage in Time-Varying Parameter Models,” Journal of Business & Economic Statistics, 39, 669–683. DOI: 10.1080/07350015.2020.1713796.
  • Kastner, G., and Huber, F. (2020), “Sparse Bayesian Vector Autoregressions in Huge Dimensions,” Journal of Forecasting, 39, 1142–1165. DOI: 10.1002/for.2680.
  • Leng, C., Tran, M. N., and Nott, D. (2014), “Bayesian Adaptive Lasso,” Annals of the Institute of Statistical Mathematics, 66, 221–244. DOI: 10.1007/s10463-013-0429-6.
  • Lewellen, J., Nagel, S., and Shanken, J. (2010), “A Skeptical Appraisal of Asset Pricing Tests,” Journal of Financial Economics, 96, 175–194. DOI: 10.1016/j.jfineco.2009.09.001.
  • Minka, T. P. (2001), “Expectation Propagation for Approximate Bayesian Inference,” in Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pp. 362–369.
  • Ormerod, J. T., and Wand, M. P. (2010), “Explaining Variational Approximations,” The American Statistician, 64, 140–153. DOI: 10.1198/tast.2010.09058.
  • Park, T., and Casella, G. (2008), “The Bayesian Lasso,” Journal of the American Statistical Association, 103, 681–686. DOI: 10.1198/016214508000000337.
  • Polson, N. G., and Scott, J. G. (2011), “Shrink Globally, Act Locally: Sparse Bayesian Regularization and Prediction,” in Bayesian Statistics, 9, 501–538.
  • Rapach, D., and Zhou, G. (2013), “Forecasting Stock Returns,” in Handbook of Economic Forecasting (Vol. 2), eds. G. Elliott and A. Timmermann, pp. 328–383, Amsterdam: Elsevier.
  • Rapach, D. E., Strauss, J. K., and Zhou, G. (2010), “Out-of-Sample Equity Premium Prediction: Combination Forecasts and Links to the Real Economy,” The Review of Financial Studies, 23, 821–862. DOI: 10.1093/rfs/hhp063.
  • Ray, P., and Bhattacharya, A. (2018), “Signal Adaptive Variable Selector for the Horseshoe Prior,” arXiv preprint.
  • Rohde, D., and Wand, M. P. (2016), “Semiparametric Mean Field Variational Bayes: General Principles and Numerical Issues,” The Journal of Machine Learning Research, 17, 5975–6021.
  • Rothman, A. J., Levina, E., and Zhu, J. (2010), “A New Approach to Cholesky-based Covariance Regularization in High Dimensions,” Biometrika, 97, 539–550. DOI: 10.1093/biomet/asq022.
  • Wand, M. P., Ormerod, J. T., Padoan, S. A., and Frühwirth, R. (2011), “Mean Field Variational Bayes for Elaborate Distributions,” Bayesian Analysis, 6, 847–900. DOI: 10.1214/11-BA631.