
On the non-local priors for sparsity selection in high-dimensional Gaussian DAG models

Pages 332-345 | Received 27 May 2020, Accepted 05 Jul 2021, Published online: 05 Sep 2021

Abstract

We consider sparsity selection for the Cholesky factor L of the inverse covariance matrix in high-dimensional Gaussian DAG models. The sparsity is induced over the space of L via non-local priors, namely the product moment (pMOM) prior [Johnson, V., & Rossell, D. (2012). Bayesian model selection in high-dimensional settings. Journal of the American Statistical Association, 107(498), 649–660. https://doi.org/10.1080/01621459.2012.682536] and the hierarchical hyper-pMOM prior [Cao, X., Khare, K., & Ghosh, M. (2020). High-dimensional posterior consistency for hierarchical non-local priors in regression. Bayesian Analysis, 15(1), 241–262. https://doi.org/10.1214/19-BA1154]. We establish model selection consistency for the Cholesky factor under conditions that are more relaxed than those in the literature, and implement an efficient MCMC algorithm that selects the sparsity pattern of each column of L in parallel. We demonstrate the validity of our theoretical results via numerical simulations, and use further simulations to show that our sparsity selection approach is competitive with existing methods.

1. Introduction

Covariance estimation and selection is a fundamental problem in multivariate statistical inference. In recent years, high-throughput data from various applications have been generated rapidly, and several promising methods have been proposed to interpret the complex multivariate relationships in these high-dimensional datasets. In particular, methods inducing sparsity in the Cholesky factor of the inverse covariance matrix have proven to be very effective in applications. These models are also referred to as Gaussian DAG models. Specifically, consider i.i.d. observations $Y_1, Y_2, \dots, Y_n$ obeying a multivariate normal distribution with mean vector $0_p$ and covariance matrix $\Sigma$. Let $\Omega = L D^{-1} L^T$ be the unique modified Cholesky decomposition of the inverse covariance matrix $\Omega = \Sigma^{-1}$, where $L$ is a lower triangular matrix with unit diagonals, and $D$ is a diagonal matrix with all diagonal entries positive. A given sparsity pattern on $L$ corresponds to certain conditional independence relationships, which can be encoded in terms of a directed acyclic graph $\mathscr{D}$ on the set of $p$ variables as follows: if the $i$th and $j$th variables do not share an edge in $\mathscr{D}$, then $L_{ij} = 0$ (see Section 2 for more details). In this paper, we focus on imposing sparsity on the Cholesky factor of the inverse covariance matrix through a class of non-local priors.

Non-local priors were first introduced by Johnson and Rossell (2010) as densities that are identically zero whenever a model parameter is equal to its null value, in the context of hypothesis testing. Compared with local priors, non-local priors discard spurious covariates faster as the sample size $n$ increases, while preserving exponential learning rates for detecting nontrivial coefficients. These non-local priors, including the product moment (pMOM) non-local prior, were extended to Bayesian model selection problems in Johnson and Rossell (2012) and Shin et al. (2018) by imposing them on the regression coefficients. Wu (2016) and Cao et al. (2020) consider a fully Bayesian approach with the pMOM non-local prior and an appropriate Inverse-Gamma prior on the hyperparameter (the so-called hyper-pMOM prior), discuss the potential advantages of using hyper-pMOM priors, and establish model selection consistency in the regression setting.

In the context of Gaussian DAG models, Altamore et al. (2013) deal with structural learning for Gaussian DAG models from an objective Bayesian perspective by assigning a prior distribution on the space of DAGs, together with an improper product moment prior on the Cholesky factor corresponding to each DAG. However, objective priors are often improper and cannot be used to compute Bayes factors directly, even when the marginal likelihoods are strictly positive and finite. The authors therefore utilize the fractional Bayes factor (FBF) approach and implement an efficient stochastic search algorithm to deal with data sets whose sample size is smaller than the number of variables. Cao et al. (2019) further establish consistency results under these objective priors, but under rather restrictive conditions.

To the best of our knowledge, a rigorous investigation of high-dimensional posterior consistency properties under either the pMOM prior or the hyper-pMOM prior has not been undertaken for either undirected graphical models or DAG models. Hence, our first goal was to investigate whether high-dimensional consistency results could be established under these two more diverse and algebraically complex classes of non-local priors in the Gaussian DAG model setting. Our second goal was to investigate whether these consistency results can be obtained under relaxed or comparable conditions. Our third goal was to develop efficient algorithms for exploring the massive candidate space containing $2^{p(p-1)/2}$ models. These were challenging goals, of course, as the posterior distributions are not available in closed form under either the pMOM prior or the hyper-pMOM prior.

As the main contributions of this paper, we establish high-dimensional posterior ratio consistency for Gaussian DAG models with both the pMOM prior and the hyper-pMOM prior on the Cholesky factor $L$, under a uniform-like prior on the sparsity pattern in $L$ (Theorems 4.2–7.3). Following the nomenclature in Lee et al. (2019) and Niu et al. (2019), this notion of consistency, also referred to as consistency of posterior odds, implies that the maximal ratio between the marginal posterior probability assigned to a 'non-true' model and the posterior probability assigned to the 'true' model converges to zero. This in turn implies that the true model will be the posterior mode with probability tending to 1. As indicated in Shin et al. (2018), since the pMOM priors already induce a strong penalty on the model size, it is no longer necessary to penalize larger models through priors on the graph space such as the Erdos–Renyi prior (Niu et al., 2019), the beta-mixture prior (Carvalho & Scott, 2009), or the multiplicative prior (Tan et al., 2017). Also, through simulation studies in which we implement an efficient parallel MCMC algorithm for exploring the sparsity pattern of each column of $L$, we demonstrate that the models studied in this paper can outperform existing state-of-the-art methods, including both penalized likelihood and Bayesian approaches, in different settings.

The rest of the paper is organized as follows. Section 2 provides background material on Gaussian DAG models and introduces the pMOM Cholesky distribution. In Section 3, we present our hierarchical Bayesian model and the parameter class for the inverse covariance matrices. Model selection consistency results for the pMOM Cholesky prior and the hyper-pMOM Cholesky prior are stated in Sections 4 and 7, respectively, with proofs provided in the supplement. Section 5 describes our computational strategy. In Section 6 we use simulation experiments to illustrate the model selection consistency, and demonstrate the benefits of our Bayesian approach and computational procedures for Cholesky factor selection vis-a-vis existing Bayesian and penalized likelihood approaches. We end the paper with a discussion in Section 8.

2. Preliminaries

In this section, we provide the necessary background material on graph theory and Gaussian DAG models, and introduce our pMOM Cholesky prior.

2.1. Gaussian DAG models

We consider the multivariate Gaussian distribution
$$Y \sim N_p\!\left(0, \Omega^{-1}\right), \qquad (1)$$
where $\Omega$ is a $p \times p$ inverse covariance matrix. Any positive definite matrix $\Omega$ can be uniquely decomposed as $\Omega = L D^{-1} L^T$, where $L$ is a lower triangular matrix with unit diagonal entries, and $D$ is a diagonal matrix with positive diagonal entries. This decomposition is known as the modified Cholesky decomposition of $\Omega$ (Pourahmadi, 2007). Based on this decomposition, one can place an appropriate prior over the diagonal entries of $D$ to construct a hierarchical model. In addition, the unit diagonals resulting from the modified Cholesky decomposition simplify the posterior calculation and the proof of consistency.
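To make the decomposition concrete, the short R snippet below (an illustrative sketch, not part of the paper's code) computes the modified Cholesky factors of a positive definite matrix and verifies the identity $\Omega = L D^{-1} L^T$.

```r
set.seed(1)
p <- 5

## Build an arbitrary positive definite matrix Omega for illustration.
A <- matrix(rnorm(p * p), p, p)
Omega <- crossprod(A) + diag(p)

## Modified Cholesky decomposition: Omega = L D^{-1} L^T, with L unit lower
## triangular and D diagonal with positive entries.
U <- chol(Omega)          # Omega = t(U) %*% U, U upper triangular
delta <- diag(U)          # positive diagonal of U
L <- t(U / delta)         # divide each row of U by its diagonal entry, then transpose
D <- diag(1 / delta^2)    # so that D^{-1} = diag(delta^2)

## Reconstruction error should be numerically zero.
max(abs(Omega - L %*% solve(D) %*% t(L)))
```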

A directed acyclic graph (DAG) $\mathscr{D} = (V, E)$ consists of the vertex set $V = \{1, \dots, p\}$ and an edge set $E$ such that there is no directed path starting and ending at the same vertex. As in Ben-David et al. (2016) and Lee et al. (2019), we assume without loss of generality a parent ordering, under which all the edges are directed from larger vertices to smaller vertices. Applications with a natural ordering of the variables include estimation of causal relationships from temporal observations, settings where additional experimental data can determine the ordering of the variables, and estimation of transcriptional regulatory networks from gene expression data (Huang et al., 2006; Khare et al., 2017; Shojaie & Michailidis, 2010; Yu & Bien, 2017). The set of parents of $i$, denoted by $pa_i(\mathscr{D})$, is the collection of all vertices which are larger than $i$ and share an edge with $i$.

A Gaussian DAG model over a given DAG $\mathscr{D}$, denoted by $\mathscr{N}_{\mathscr{D}}$, consists of all multivariate Gaussian distributions which obey the directed Markov property with respect to $\mathscr{D}$. In particular, if $Y = (Y_1, \dots, Y_p)^T \sim N_p(0, \Sigma)$ and $N_p(0, \Sigma = \Omega^{-1}) \in \mathscr{N}_{\mathscr{D}}$, then $Y_i \perp Y_{\{i+1, \dots, p\} \setminus pa_i(\mathscr{D})} \mid Y_{pa_i(\mathscr{D})}$ for each $1 \le i < p$. Regarding the connection between the Cholesky factor $L$ and the underlying DAG $\mathscr{D}$: if $\Omega = L D^{-1} L^T$ is the modified Cholesky decomposition of $\Omega$, then $N_p(0, \Omega^{-1})$ is a Gaussian DAG model over $\mathscr{D}$ if and only if $L_{ij} = 0$ whenever $i \notin pa_j(\mathscr{D})$.

2.2. Notations

Consider the modified Cholesky decomposition $\Omega = L D^{-1} L^T$, where $L$ is a lower triangular matrix with unit diagonal entries and $D = \mathrm{diag}\{d_1, d_2, \dots, d_p\}$, where each $d_i$ ($1 \le i \le p$) is positive and $d_i$ denotes the $i$th diagonal entry of $D$. We introduce latent binary variables $Z = \{Z_{kj} : 1 \le j < k \le p\} = \{Z_{21}, Z_{31}, \dots, Z_{p1}, Z_{32}, \dots, Z_{p,p-1}\}$ to indicate whether $L_{kj}$ is active, i.e., $Z_{kj} = 1$ if $L_{kj} \ne 0$ and $Z_{kj} = 0$ otherwise.

In this way, we view the binary variable $Z$ as the indicator of the sparsity pattern in $L$. For each $1 \le j \le p-1$, let $Z_j = \{k > j : Z_{kj} = 1\}$, a subset of $\{j+1, j+2, \dots, p\}$, be the index set of all non-zero components in $\{Z_{j+1,j}, \dots, Z_{p,j}\}$. The sets $Z_j$ explicitly give the support of the Cholesky factor and the sparsity pattern of the underlying DAG. Denote by $|Z_j| = \sum_{k=j+1}^{p} Z_{kj}$ the cardinality of the set $Z_j$ for $1 \le j \le p-1$.
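As a small self-contained illustration (a sketch, not code from the paper), the indicator $Z$, the index sets $Z_j$ and their cardinalities can be read off directly from the support of a given Cholesky factor $L$:

```r
## Toy unit lower triangular Cholesky factor with a given sparsity pattern.
L <- diag(4)
L[2, 1] <- 0.8
L[4, 1] <- -0.5
L[4, 3] <- 1.2

## Indicator matrix Z: Z[k, j] = 1 if and only if k > j and L[k, j] != 0.
Z <- (L != 0) * 1
Z[upper.tri(Z, diag = TRUE)] <- 0

## Index sets Z_j (supports of the columns) and their cardinalities |Z_j|.
Zj    <- lapply(1:(ncol(L) - 1), function(j) which(Z[, j] == 1))
sizes <- sapply(Zj, length)

Zj      # Z_1 = {2, 4}, Z_2 = {} (empty), Z_3 = {4}
sizes   # |Z_1| = 2, |Z_2| = 0, |Z_3| = 1
```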

For any $p \times p$ matrix $A$, denote by $A_{S_1, S_2}$ the submatrix of $A$ consisting of the rows in $S_1$ and the columns in $S_2$. Following the definition of $Z$, for any $p \times p$ matrix $A$, denote the column vectors $A_{Z_j, j} = (A_{kj})_{k \in Z_j}$ and $A_{j\cup Z_j,\, j} = (A_{jj}, A_{Z_j,j}^T)^T$. Also, let $A_{Z_j, Z_j} = (A_{ki})_{k, i \in Z_j}$ and
$$A_{j\cup Z_j,\, j\cup Z_j} = \begin{pmatrix} A_{jj} & A_{Z_j,j}^T \\ A_{Z_j,j} & A_{Z_j,Z_j} \end{pmatrix}.$$
In particular, $A_{p\cup Z_p,\, p} = A_{p\cup Z_p,\, p\cup Z_p} = A_{pp}$.

Next, we introduce some additional required notation. For $x \in \mathbb{R}^p$, let $\|x\|_r = \left(\sum_{j=1}^{p} |x_j|^r\right)^{1/r}$ and $\|x\|_\infty = \max_j |x_j|$ denote the standard $l_r$ and $l_\infty$ norms. For a $p \times p$ matrix $A$, let $\mathrm{eig}_1(A) \le \mathrm{eig}_2(A) \le \cdots \le \mathrm{eig}_p(A)$ be the ordered eigenvalues of $A$, and denote $\|A\|_{\max} = \max_{1 \le i, j \le p} |A_{ij}|$ and $\|A\|_{(r,s)} = \sup\{\|Ax\|_s : \|x\|_r = 1\}$ for $1 \le r, s < \infty$. In particular, $\|A\|_{(1,1)} = \max_j \sum_i |A_{ij}|$, $\|A\|_{(\infty,\infty)} = \max_i \sum_j |A_{ij}|$ and $\|A\|_{(2,2)} = \left\{\mathrm{eig}_p(A^T A)\right\}^{1/2}$.
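For reference, these quantities are easy to evaluate numerically; the sketch below (illustrative only) computes $\|A\|_{\max}$, $\|A\|_{(1,1)}$, $\|A\|_{(\infty,\infty)}$ and $\|A\|_{(2,2)}$ for a generic matrix.

```r
set.seed(2)
A <- matrix(rnorm(16), 4, 4)

norm_max <- max(abs(A))                  # ||A||_max: largest absolute entry
norm_11  <- max(colSums(abs(A)))         # ||A||_(1,1): maximum absolute column sum
norm_inf <- max(rowSums(abs(A)))         # ||A||_(inf,inf): maximum absolute row sum
norm_22  <- sqrt(max(eigen(t(A) %*% A, only.values = TRUE)$values))  # spectral norm

c(norm_max, norm_11, norm_inf, norm_22)
```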

2.3. pMOM Cholesky prior

Johnson and Rossell (2012) introduce the product moment (pMOM) non-local prior for the regression coefficients, with density given by
$$m_p\, (2\pi)^{-\frac{p}{2}} (\tau\sigma^2)^{-rp - \frac{p}{2}} |A_p|^{\frac{1}{2}} \exp\!\left(-\frac{\beta_p^T A_p \beta_p}{2\tau\sigma^2}\right) \prod_{i=1}^{p} \beta_i^{2r}. \qquad (2)$$
Here $A_p$ is a $p \times p$ nonsingular matrix, $r$ is a positive integer referred to as the order of the density, and $m_p$ is the normalizing constant independent of $\tau$ and $\sigma^2$, where $\tau$ is some positive constant. Variations of the density in (2), called the piMOM and peMOM densities, have also been developed in Johnson and Rossell (2012), Rossell et al. (2013) and Shin et al. (2018). Adapted to our framework, we place the following non-local prior on the Cholesky factor $L$ corresponding to the pMOM prior for a given sparsity pattern $Z$:
$$\pi(L_{Z_j,j} \mid d_j, Z_j) = m_{|Z_j|}\, (2\pi)^{-\frac{|Z_j|}{2}} (\tau d_j)^{-r|Z_j| - \frac{|Z_j|}{2}} |A_{Z_j,Z_j}|^{\frac{1}{2}} \exp\!\left(-\frac{L_{Z_j,j}^T A_{Z_j,Z_j} L_{Z_j,j}}{2\tau d_j}\right) \prod_{i \in Z_j} L_{ij}^{2r}, \qquad (3)$$
for $j = 1, 2, \dots, p-1$, where similarly $A$ is a $p \times p$ positive definite matrix, $r$ is a positive integer, $\tau > 0$, and $m_{|Z_j|}$ is the normalizing constant independent of $\tau$ and $d_j$, but dependent on $|Z_j|$. Although $m_{|Z_j|}$ cannot be written explicitly in closed form, it can be bounded below and above by a function of $|Z_j|$. We refer to (3) as the pMOM Cholesky prior. To introduce a hierarchical model on the Cholesky parameter $(L, D)$, we will impose an Inverse-Gamma prior on the diagonal entries of $D$. Note that to obtain our desired asymptotic consistency results, appropriate conditions on all the aforementioned hyperparameters will be introduced in Section 4.1.
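To fix ideas, the following R function (an illustrative sketch, not the implementation released with the paper) evaluates the logarithm of the pMOM Cholesky density (3) up to the additive constant $\log m_{|Z_j|}$, which does not depend on $L_{Z_j,j}$.

```r
## Log pMOM Cholesky density (3) for the non-zero entries of column j,
## up to the additive constant log m_{|Z_j|}.
##   l_j : vector L_{Z_j, j} of non-zero entries in column j
##   A_j : |Z_j| x |Z_j| positive definite matrix A_{Z_j, Z_j}
##   d_j : diagonal (conditional variance) parameter
##   tau : scale hyperparameter, r : order of the prior
log_pmom_cholesky <- function(l_j, A_j, d_j, tau, r) {
  k <- length(l_j)
  -0.5 * k * log(2 * pi) -
    (r * k + 0.5 * k) * log(tau * d_j) +
    0.5 * as.numeric(determinant(A_j, logarithm = TRUE)$modulus) -
    drop(t(l_j) %*% A_j %*% l_j) / (2 * tau * d_j) +
    2 * r * sum(log(abs(l_j)))
}

## Example call with arbitrary illustrative values.
log_pmom_cholesky(l_j = c(0.8, -0.5), A_j = diag(2), d_j = 1, tau = 1, r = 2)
```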

3. Model specification

Let $Y_1, Y_2, \dots, Y_n \in \mathbb{R}^p$ be the observed data and $S = \frac{1}{n}\sum_{i=1}^{n} Y_i Y_i^T$ the sample covariance matrix. The class of pMOM Cholesky distributions (3) can be used for Bayesian sparsity selection of the Cholesky factor through the following hierarchical model:
$$Y \mid D, L \sim N_p\!\left(0, (L D^{-1} L^T)^{-1}\right), \qquad (4)$$
$$L_{Z_j,j} \mid d_j, Z_j \overset{\text{ind}}{\sim} \text{pMOM Cholesky}, \quad 1 \le j < p, \qquad (5)$$
$$d_j \overset{\text{ind}}{\sim} \text{Inverse-Gamma}(\alpha_1, \alpha_2), \quad 1 \le j \le p. \qquad (6)$$
The proposed hierarchical model has five hyperparameters: the scale parameter $\tau > 0$, the order $r$ and the positive definite matrix $A$ in (5) for the pMOM Cholesky prior, and the shape parameter $\alpha_1$ and scale parameter $\alpha_2$ in (6) for the Inverse-Gamma prior on $d_j$. Further restrictions on these hyperparameters needed to ensure the desired consistency will be specified in Section 4.1.

Remark 3.1

Note that in the hierarchical model presented so far, we have not assigned any specific form to the prior over the sparsity patterns of $L$ (essentially the space of $Z$). Some standard regularity assumptions on this prior will be provided in Section 4.1; in fact, we will essentially impose a uniform-like prior on $Z$. Because of the strong penalty induced on the model size by the pMOM prior, it is no longer necessary to penalize larger models through priors on the graph space such as the Erdos–Renyi prior (Niu et al., 2019), the complexity prior (Lee et al., 2019), or the multiplicative prior (Tan et al., 2017).

Note that under the hierarchical model (4)–(6), we can conduct posterior inference for the sparsity pattern of each column of $L$ independently, which benefits the computation significantly in the sense that it allows for a parallel search. In order to establish posterior ratio consistency based on $\pi(Z_j \mid Y)$, we need the following lemma, which gives the marginal posterior probability.

Lemma 3.1

Under the hierarchical model (4)–(6), the resulting (marginal) posterior probability for $Z_j$ ($1 \le j < p$) is given by
$$
\pi(Z_j \mid Y) \propto \pi(Z_j)\, m_{|Z_j|}\, |A_{Z_j,Z_j}|^{\frac{1}{2}}\, \tau^{-r|Z_j| - \frac{|Z_j|}{2}}\, \frac{1}{|n\tilde S_{Z_j,Z_j}|^{\frac{1}{2}}} \int_0^{\infty} d_j^{-\left(\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_j| + \alpha_1 + 1\right)} \exp\!\left(-\frac{n\tilde S_{j|Z_j} + 2\alpha_2}{2 d_j}\right) E_{|Z_j|}\!\left[\prod_{i \in Z_j} L_{ij}^{2r}\right] \mathrm{d}d_j, \qquad (7)
$$
where $m_{|Z_j|}$ is the normalizing constant independent of $d_j$, $\tilde S = S + \frac{A}{n\tau}$, $\tilde S_{j|Z_j} = \tilde S_{jj} - \tilde S_{Z_j,j}^T \tilde S_{Z_j,Z_j}^{-1} \tilde S_{Z_j,j}$, and $E_{|Z_j|}(\cdot)$ denotes the expectation with respect to a multivariate normal distribution with mean $\tilde S_{Z_j,Z_j}^{-1} \tilde S_{Z_j,j}$ and covariance matrix $d_j (n\tilde S_{Z_j,Z_j})^{-1}$.

Here we provide the proof of Lemma 3.1.

Proof of Lemma 3.1

By (4)–(6) and Bayes' rule, under the pMOM Cholesky prior, the resulting posterior probability for $Z_j$ is given by
$$
\begin{aligned}
\pi(Z_j \mid Y) &\propto \pi(Z_j) \int_0^{\infty}\!\!\int \pi(Y \mid D, L)\, \pi(L_{Z_j,j} \mid d_j, Z_j)\, \pi(d_j)\, \mathrm{d}L_{Z_j,j}\, \mathrm{d}d_j \\
&\propto \pi(Z_j) \int_0^{\infty}\!\!\int \exp\!\left(-\frac{n\, L_{j\cup Z_j,j}^T S_{j\cup Z_j,\, j\cup Z_j} L_{j\cup Z_j,j}}{2 d_j}\right) d_j^{-\left(\frac{n}{2} + \alpha_1 + 1\right)} e^{-\alpha_2 / d_j}\, m_{|Z_j|}\, (2\pi)^{-\frac{|Z_j|}{2}} (\tau d_j)^{-r|Z_j| - \frac{|Z_j|}{2}} |A_{Z_j,Z_j}|^{\frac{1}{2}} \\
&\qquad\times \exp\!\left(-\frac{L_{Z_j,j}^T A_{Z_j,Z_j} L_{Z_j,j}}{2\tau d_j}\right) \prod_{i \in Z_j} L_{ij}^{2r}\, \mathrm{d}L_{Z_j,j}\, \mathrm{d}d_j.
\end{aligned}
\qquad (8)
$$
Note that
$$
L_{j\cup Z_j,j}^T S_{j\cup Z_j,\, j\cup Z_j} L_{j\cup Z_j,j} = \begin{pmatrix} 1, & L_{Z_j,j}^T \end{pmatrix} \begin{pmatrix} S_{jj} & S_{Z_j,j}^T \\ S_{Z_j,j} & S_{Z_j,Z_j} \end{pmatrix} \begin{pmatrix} 1 \\ L_{Z_j,j} \end{pmatrix}.
$$
Therefore, completing the square in $L_{Z_j,j}$, it follows from (8) that
$$
\begin{aligned}
&\int \exp\!\left(-\frac{n\, L_{j\cup Z_j,j}^T S_{j\cup Z_j,\, j\cup Z_j} L_{j\cup Z_j,j}}{2 d_j}\right) \exp\!\left(-\frac{L_{Z_j,j}^T A_{Z_j,Z_j} L_{Z_j,j}}{2\tau d_j}\right) \prod_{i \in Z_j} L_{ij}^{2r}\, \mathrm{d}L_{Z_j,j} \\
&\quad= \int \prod_{i \in Z_j} L_{ij}^{2r} \exp\!\left(-\frac{\left(L_{Z_j,j} + \tilde S_{Z_j,Z_j}^{-1} S_{Z_j,j}\right)^{T} \tilde S_{Z_j,Z_j} \left(L_{Z_j,j} + \tilde S_{Z_j,Z_j}^{-1} S_{Z_j,j}\right)}{2 d_j / n}\right) \exp\!\left(-\frac{S_{jj} - S_{Z_j,j}^T \tilde S_{Z_j,Z_j}^{-1} S_{Z_j,j}}{2 d_j / n}\right) \mathrm{d}L_{Z_j,j},
\end{aligned}
$$
where $\tilde S_{Z_j,Z_j} = S_{Z_j,Z_j} + \frac{A_{Z_j,Z_j}}{n\tau}$. Hence, by (8), we have
$$
\begin{aligned}
\pi(Z_j \mid Y) &\propto \pi(Z_j) \int_0^{\infty}\!\!\int \pi(Y \mid D, L)\, \pi(L_{Z_j,j} \mid d_j, Z_j)\, \pi(d_j)\, \mathrm{d}L_{Z_j,j}\, \mathrm{d}d_j \\
&\propto \pi(Z_j)\, m_{|Z_j|}\, |A_{Z_j,Z_j}|^{\frac{1}{2}}\, \tau^{-r|Z_j| - \frac{|Z_j|}{2}}\, \frac{1}{|n\tilde S_{Z_j,Z_j}|^{\frac{1}{2}}} \int_0^{\infty} d_j^{-\left(\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_j| + \alpha_1 + 1\right)} \exp\!\left(-\frac{n\tilde S_{j|Z_j} + 2\alpha_2}{2 d_j}\right) E_{|Z_j|}\!\left[\prod_{i \in Z_j} L_{ij}^{2r}\right] \mathrm{d}d_j.
\end{aligned}
\qquad (9)
$$

In particular, these posterior probabilities can be used to select a model by computing the posterior mode defined by
$$\hat{Z}_j = \underset{Z_j}{\arg\max}\ \pi(Z_j \mid Y). \qquad (10)$$

4. Main results

In this section we investigate the high-dimensional asymptotic properties of the model proposed in Section 3. For this purpose, we work in a setting where the data dimension $p = p_n$ and the hyperparameters vary with the sample size $n$, and $p_n \ge n$. Assume that the data are generated from a true model specified as follows. Let $Y_1^n, Y_2^n, \dots, Y_n^n$ be independent and identically distributed multivariate Gaussian vectors with mean $0_{p_n}$ and true covariance matrix $\Sigma_0^n = (\Omega_0^n)^{-1}$, where $\Omega_0^n = L_0^n (D_0^n)^{-1} (L_0^n)^T$ is the modified Cholesky decomposition of $\Omega_0^n$. The sparsity pattern of the true Cholesky factor $L_0^n$ is uniquely encoded in the true binary variable set, denoted by $Z_0^n$.

In order to establish our asymptotic consistency results, we need the following mild assumptions, together with corresponding discussion and interpretation. Denote by $d_n = \max_{1 \le j \le p_n - 1} |Z_{0j}^n|$ the maximum number of non-zero entries in any column of $L_0^n$. Let $s_n = \min_{1 \le j < i \le p_n,\, i \in Z_{0j}^n} |(L_0^n)_{ij}|$ be the smallest (in absolute value) non-zero off-diagonal entry of $L_0^n$, which can be interpreted as the 'signal size'. For sequences $a_n$ and $b_n$, $a_n \gtrsim b_n$ means $a_n / b_n \ge c$ for some constant $c > 0$, and $a_n = o(b_n)$ means $a_n / b_n \to 0$ as $n \to \infty$.

4.1. Assumptions

Assumption 1

There exists $\epsilon_0 \le 1$ such that for every $n \ge 1$, $0 < \epsilon_0 \le \mathrm{eig}_1(\Omega_0^n) \le \mathrm{eig}_{p_n}(\Omega_0^n) \le \epsilon_0^{-1}$.

Assumption 2

$d_n \sqrt{\log p_n / n} \to 0$ as $n \to \infty$.

Assumption 3

$d_n \sqrt{\log p_n / (s_n^2\, n)} \to 0$ as $n \to \infty$.

Assumption 4

For each $Z_j$ ($1 \le j < p$), a uniform prior is placed over all models of size less than or equal to $q_n$, i.e., $\pi(Z_j) \propto \mathbb{1}(|Z_j| \le q_n)$, where $q_n = o\!\left(\sqrt{n / \log p_n}\right)$.

Assumption 5a

The hyperparameters $A_{p_n}$, $\tau$, $\alpha_1$, $\alpha_2$ in (5) and (6) satisfy $0 < a_1 < \mathrm{eig}_1(A_{p_n}) \le \mathrm{eig}_2(A_{p_n}) \le \cdots \le \mathrm{eig}_{p_n}(A_{p_n}) < a_2 < \infty$ and $0 < \alpha_1, \alpha_2, \tau < a_2$. Here $a_1, a_2$ are constants not depending on $n$.

Assumption 1 has been commonly used for establishing high-dimensional asymptotic properties of covariance estimators (Banerjee & Ghosal, 2014, 2015; Bickel & Levina, 2008; El Karoui, 2008; Xiang et al., 2015). Assumption 2 essentially allows the number of variables $p_n$ to grow slower than $e^{n/d_n^2}$, compared with the rate $e^{n/d_n^4}$ in the previous literature (Banerjee & Ghosal, 2014, 2015; Xiang et al., 2015). Assumption 2 also states that the maximum number of parents over all nodes in the true model (i.e., $d_n$) must be of smaller order than $\sqrt{n / \log p_n}$.

Assumption 3, also known as the 'beta-min' condition, provides a lower bound on the minimum non-zero entries of $L_0^n$ that is needed for establishing consistency. This type of condition has been used for exact support recovery in high-dimensional linear regression models as well as in Gaussian DAG models (Khare et al., 2017; Lee et al., 2019; Yang et al., 2016). Assumption 4 essentially states that the uniform-like prior on the space of the $2^{p_n(p_n-1)/2}$ possible models places zero mass on unrealistically large models. Since Assumption 2 already restricts $d_n$ to be $o\!\left(\sqrt{n / \log p_n}\right)$, Assumption 4 does not affect the probability assigned to the true model. See similar assumptions in Johnson and Rossell (2012) and Shin et al. (2018) in the context of regression.

Assumption 5a is standard; it requires the eigenvalues of the scale matrix in the pMOM Cholesky prior to be uniformly bounded in $n$. Note that for the default choice $A_{p_n} = I_{p_n}$, Assumption 5a is immediately satisfied. See similar assumptions in Shin et al. (2018) and Johnson and Rossell (2012). This assumption also requires the hyperparameter $\tau$ in the pMOM Cholesky prior and $\alpha_1, \alpha_2$ in the Inverse-Gamma prior to be bounded by a constant.

For the rest of this paper, $p_n$, $\Omega_0^n$, $\Sigma_0^n$, $L_0^n$, $D_0^n$, $Z_0^n$, $Z^n$, $d_n$, $s_n$ will be denoted by $p$, $\Omega_0$, $\Sigma_0$, $L_0$, $D_0$, $Z_0$, $Z$, $d$, $s$, leaving out the index $n$ for notational convenience. Let $P_{\Omega_0}$ and $E_{\Omega_0}$ denote the probability measure and the expected value corresponding to the 'true' model specified at the beginning of Section 4, respectively.

4.2. Posterior ratio consistency

Since the posterior probabilities in (7) are not available in closed form, we need the following lemma, which gives an upper bound on the Bayes factor between any 'non-true' model $Z_j$ and the true model $Z_{0j}$. The proof of this lemma is provided in the supplement.

Lemma 4.1

Under Assumptions 1–5a, for each $1 \le j < p$, the Bayes factor between any 'non-true' model $Z_j$ and the true model $Z_{0j}$ under the pMOM Cholesky prior is bounded above by
$$
\begin{aligned}
\frac{\pi(Y \mid Z_j)}{\pi(Y \mid Z_{0j})} \le{} & \left\{(M\tau)^{r+1/2} n^{1/2}\right\}^{-(|Z_j| - |Z_{0j}|)} \frac{|\tilde S_{j\cup Z_{0j},\, j\cup Z_{0j}}| / \tilde S_{j|Z_{0j}}}{|\tilde S_{j\cup Z_j,\, j\cup Z_j}| / \tilde S_{j|Z_j}} \left(\frac{V}{|Z_j|}\right)^{r|Z_j|} \left(\frac{s}{2}\right)^{-2r|Z_{0j}|} \frac{\Gamma\!\left(\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_j| + \alpha_1\right)}{\Gamma\!\left(\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_{0j}| + \alpha_1\right)} \frac{\left(n\tilde S_{j|Z_{0j}}/2 + \alpha_2\right)^{\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_{0j}| + \alpha_1}}{\left(n\tilde S_{j|Z_j}/2 + \alpha_2\right)^{\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_j| + \alpha_1}} \\
& + \left\{(M\tau)^{r+1/2} n^{1/2}\right\}^{-(|Z_j| - |Z_{0j}|)} \frac{|\tilde S_{j\cup Z_{0j},\, j\cup Z_{0j}}| / \tilde S_{j|Z_{0j}}}{|\tilde S_{j\cup Z_j,\, j\cup Z_j}| / \tilde S_{j|Z_j}}\; n^{-r|Z_j|} \left(\frac{s}{2}\right)^{-2r|Z_{0j}|} \frac{\Gamma\!\left(\frac{n - |Z_j|}{2} + \alpha_1\right)}{\Gamma\!\left(\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_{0j}| + \alpha_1\right)} \frac{\left(n\tilde S_{j|Z_{0j}}/2 + \alpha_2\right)^{\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_{0j}| + \alpha_1}}{\left(n\tilde S_{j|Z_j}/2 + \alpha_2\right)^{\frac{n - |Z_j|}{2} + \alpha_1}},
\end{aligned}
\qquad (11)
$$
for some positive constant $M$, where $V = S_{Z_j,j}^T \tilde S_{Z_j,Z_j}^{-2} S_{Z_j,j}$.

The upper bound for the Bayes factor in Lemma 4.1 can be used to prove posterior ratio consistency. This notion of consistency implies that the true model will be the posterior mode with probability tending to 1.

Theorem 4.2

Under Assumptions 1–5a, the following holds for all $1 \le j < p$:
$$\max_{Z_j \ne Z_{0j}} \frac{\pi(Z_j \mid Y)}{\pi(Z_{0j} \mid Y)} \overset{P_{\Omega_0}}{\longrightarrow} 0, \quad \text{as } n \to \infty.$$

The proof of this result is provided in the supplement. If one is interested in a point estimate of $Z_j$, the most apparent choice is the posterior mode defined as
$$\hat{Z}_j = \underset{Z_j}{\arg\max}\ \pi(Z_j \mid Y). \qquad (12)$$
By noting that
$$\max_{Z_j \ne Z_{0j}} \frac{\pi(Z_j \mid Y)}{\pi(Z_{0j} \mid Y)} < 1 \;\Longrightarrow\; \hat{Z}_j = Z_{0j},$$
we have the following corollary.

Corollary 4.1

Under Assumptions 1–5a, the posterior mode $\hat{Z}_j$ is equal to the true model $Z_{0j}$ with probability tending to 1, i.e., for all $1 \le j < p$,
$$P_{\Omega_0}\left(\hat{Z}_j = Z_{0j}\right) \longrightarrow 1, \quad \text{as } n \to \infty.$$

4.3. Strong model selection consistency

Next we establish a stronger notion of consistency (compared to Theorem 4.2) that is referred to as strong selection consistency, which implies that the posterior mass assigned to the true model $Z_{0j}$ converges to 1 in probability (Lee et al., 2019; Narisetty & He, 2014). To achieve strong selection consistency, we need the following assumption on $\tau$ instead of Assumption 5a. The proof of the corresponding theorem is provided in the supplement.

Assumption 5b

The hyperparameters $A_p$, $\tau$, $\alpha_1$, $\alpha_2$ in (5) and (6) satisfy $0 < a_1 < \mathrm{eig}_1(A_p) \le \mathrm{eig}_2(A_p) \le \cdots \le \mathrm{eig}_p(A_p) < a_2 < \infty$, $0 < \alpha_1, \alpha_2 < a_2$ and $\tau \gtrsim p^{2\kappa/(r+1/2)}$ for some $\kappa > 1$. Here $a_1, a_2$ are constants not depending on $n$.

Theorem 4.3

Under Assumptions 1–5b, the following holds for all $1 \le j < p$:
$$\pi(Z_{0j} \mid Y) \overset{P_{\Omega_0}}{\longrightarrow} 1, \quad \text{as } n \to \infty.$$

Remark 4.1

Note that neither Theorem 4.2 nor Corollary 4.1 requires the scale parameter $\tau$ in the pMOM Cholesky prior to grow at any particular rate; this requirement is only needed for Theorem 4.3. As noted in Johnson and Rossell (2012), the scale parameter $\tau$ is of particular importance, as it reflects the dispersion of the non-local prior density around zero, and implicitly determines the size of the regression coefficients that will be shrunk to zero. Shin et al. (2018) treat $\tau$ as given, and consider a setting where $p$ and $\tau$ vary with the sample size $n$. In the context of linear regression, they show that high-dimensional model selection consistency is achieved under the peMOM prior under the assumption that $\tau$ grows faster than $\log p$.

4.4. Comparison with existing methods

We now compare our results and assumptions with those of existing methods in both the Bayesian and frequentist literature. Assumption 2 is weaker than the corresponding assumptions for high-dimensional covariance asymptotics in other Bayesian approaches, including Xiang et al. (2015) and Banerjee and Ghosal (2014, 2015). However, compared with methods based on penalized likelihood, Assumption 2 is stronger than the condition $d \log p / n < c_0$ for some constant $c_0$ in Yu and Bien (2017) and van de Geer and Bühlmann (2013), which study the estimation of the Cholesky factor for Gaussian DAG models with and without the known ordering condition, respectively. In terms of undirected graphical models, Assumption 2 is also more restrictive compared with the complexity assumptions $\log p = o(n)$ in Cai et al. (2011) and $n > c_1 d \log p$ for some constant $c_1$ in Zhang and Zou (2014).

Interested readers may also find Assumption 4 to be stronger than Condition (P) in Lee et al. (2019), where $q_n \lesssim n (\log p_n)^{-1} \{(\log n)^{-1} \wedge c_2\}$ for some constant $c_2$. This is a consequence of the uniform-like prior imposed on $Z_j$. If we replace the uniform prior with the Erdos–Renyi prior or the complexity prior, this restriction can be relaxed to encompass a larger class of models. However, the simulation results would then be compromised by always favouring the sparsest model, since a penalty on larger models has already been induced through the pMOM prior itself.

In the context of estimating DAGs using non-local priors, Altamore et al. (2013) deal with structural learning for Gaussian DAG models from an objective Bayesian perspective by assigning a prior distribution on the space of DAGs, together with an improper pMOM prior on the Cholesky factor corresponding to each DAG. The authors propose the FBF approach, but do not examine its theoretical consistency. The major contributions of this paper are to fill this gap by establishing high-dimensional asymptotic properties for pMOM and hyper-pMOM priors in Gaussian graphical models, and to develop efficient algorithms for exploring the massive candidate space containing $2^{p(p-1)/2}$ models, as discussed in the next section.

5. Computation

In this section, we describe the computational strategy for the proposed model. The integral formulation in (7) is quite complicated, and the posterior probabilities cannot be obtained in closed form. Hence, we use a Laplace approximation to compute $\pi(Z_j \mid Y)$; detailed formulas are provided in the supplement. A similar approach to computing posterior probabilities based on Laplace approximations has been used in Johnson and Rossell (2012) and Shin et al. (2018). In practice, when the computational burden of the Laplace approximation becomes heavy as $p$ increases, we also suggest using the upper bound of the posterior ((A.5) in the supplement) as an approximation, since our proofs are based on these upper bounds and the consistency results are therefore already guaranteed. Based on these approximations, we consider the following MCMC algorithm for exploring the model space:

  1. Set the initial value $Z^{\mathrm{curr}}$.

  2. For each $j = 1, \dots, p-1$:

    (a) Given the current $Z_j^{\mathrm{curr}}$, propose $Z_j^{\mathrm{cand}}$ by either

      (i) changing a non-zero entry in $Z_j^{\mathrm{curr}}$ to zero with probability $1 - \alpha_Z$, or

      (ii) changing a zero entry in $Z_j^{\mathrm{curr}}$ to one with probability $\alpha_Z$.

    (b) Compute the acceptance probability $p_a = \min\left\{1, \dfrac{\pi(Z_j^{\mathrm{cand}} \mid Y)\, q(Z_j^{\mathrm{curr}} \mid Z_j^{\mathrm{cand}})}{\pi(Z_j^{\mathrm{curr}} \mid Y)\, q(Z_j^{\mathrm{cand}} \mid Z_j^{\mathrm{curr}})}\right\}$.

    (c) Draw $u \sim U(0,1)$. If $p_a > u$, set $Z_j^{\mathrm{curr}} = Z_j^{\mathrm{cand}}$.

    (d) Repeat (a)–(c) until a sufficiently long chain is acquired.

Note that for the inference on $Z$, steps 2(a)–2(d) in the above algorithm can be parallelized across the columns of $Z$. For more details about the parallel MCMC algorithm, we refer interested readers to Lee et al. (2019), Bhadra and Mallick (2013) and Johnson and Rossell (2012). The above algorithm is coded in R and publicly available at https://github.com/xuan-cao/Non-local-Cholesky.
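For illustration, the sketch below implements one add/delete Metropolis–Hastings update for a single column $Z_j$ in R. It is a simplified stand-in rather than the released code: `log_post_Zj` is a hypothetical placeholder for the (Laplace-approximated) logarithm of $\pi(Z_j \mid Y)$ in (7), and the proposal corresponds to step 2(a) above.

```r
## One Metropolis-Hastings update for column j (illustrative sketch).
##   z_curr      : binary vector of length p; z_curr[k] = Z_kj (only k > j can be 1)
##   log_post_Zj : function(z) returning log pi(Z_j = z | Y) up to a constant,
##                 e.g. via the Laplace approximation of (7) (placeholder here)
##   alpha_Z     : probability of proposing an "add" move
mh_update_column <- function(z_curr, j, p, log_post_Zj, alpha_Z = 0.5) {
  zeros <- setdiff(which(z_curr == 0), 1:j)   # entries that could be switched on
  ones  <- which(z_curr == 1)                 # entries that could be switched off

  ## Propose: add an edge with probability alpha_Z, otherwise delete one.
  add <- (runif(1) < alpha_Z && length(zeros) > 0) || length(ones) == 0
  z_cand <- z_curr
  if (add) {
    k <- if (length(zeros) == 1) zeros else sample(zeros, 1)
    z_cand[k] <- 1
    q_fwd <- alpha_Z / length(zeros)              # q(cand | curr)
    q_bwd <- (1 - alpha_Z) / sum(z_cand == 1)     # q(curr | cand)
  } else {
    k <- if (length(ones) == 1) ones else sample(ones, 1)
    z_cand[k] <- 0
    q_fwd <- (1 - alpha_Z) / length(ones)
    q_bwd <- alpha_Z / length(setdiff(which(z_cand == 0), 1:j))
  }

  ## Metropolis-Hastings acceptance step.
  log_ratio <- log_post_Zj(z_cand) + log(q_bwd) - log_post_Zj(z_curr) - log(q_fwd)
  if (log(runif(1)) < min(0, log_ratio)) z_cand else z_curr
}
```

In the full sampler this update is repeated to form a long chain and applied to every column $j = 1, \dots, p-1$; since the posterior factorizes over columns, the columns can be processed on separate cores, which is the parallelization referred to above.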

6. Simulation studies

In this section, we demonstrate our main results through simulation studies. To this end, we consider several different combinations of $(n, p)$, including both low-dimensional and high-dimensional cases. For each fixed $p$, a $p \times p$ lower triangular matrix with unit diagonals is constructed: we randomly choose 4% or 8% of the lower triangular entries of the Cholesky factor and set them to non-zero values according to the following three scenarios, with the remaining entries set to zero. We refer to this matrix as $L_0$. The matrix $L_0$ also determines the true underlying DAG structure encoded in $Z_0$.

  1. Scenario 1: All the non-zero off-diagonal entries in L0 are set to be 1.

  2. Scenario 2: All the non-zero off-diagonal entries in L0 are generated from N(0,1).

  3. Scenario 3: Each non-zero off-diagonal entry is set to be 0.25, 0.5 or 0.75 with equal probability.

Next, we generate $n$ i.i.d. observations from the $N_p\!\left(0_p, (L_0^{-1})^T L_0^{-1}\right)$ distribution, and set the hyperparameters as $r = 2$, $A_p = I_p$, $\alpha_1 = \alpha_2 = 0.01$. This process ensures that all the assumptions are satisfied. Since posterior ratio consistency in Theorem 4.2 and strong model selection consistency in Theorem 4.3 require different constraints on the scale parameter $\tau$, we consider three values of $\tau$: (a) $\tau = 1$; (b) $\tau = 2$; (c) $\tau_p = p^{2.01}$. We then perform model selection on the Cholesky factor using four procedures, described below after a short data-generation sketch.
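The data-generating step can be reproduced in a few lines of R; the sketch below (illustrative only, using a small $p$ and Scenario 2 entries) builds $L_0$, samples $Y_1, \dots, Y_n$, and forms the sample covariance matrix $S$.

```r
set.seed(3)
n <- 100; p <- 50; sparsity <- 0.04

## True Cholesky factor L0 (Scenario 2: non-zero entries drawn from N(0,1)).
L0 <- diag(p)
lower_idx <- which(lower.tri(L0))
active <- sample(lower_idx, size = round(sparsity * length(lower_idx)))
L0[active] <- rnorm(length(active))

## True precision and covariance matrices (here D = I_p).
Omega0 <- L0 %*% t(L0)
Sigma0 <- chol2inv(chol(Omega0))

## Sample Y_1, ..., Y_n ~ N_p(0, Sigma0) and form S = (1/n) sum_i Y_i Y_i^T.
Y <- matrix(rnorm(n * p), n, p) %*% chol(Sigma0)
S <- crossprod(Y) / n

## True sparsity indicator Z0 (strictly lower triangular support of L0).
Z0 <- (L0 != 0) * 1
Z0[upper.tri(Z0, diag = TRUE)] <- 0
```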

  1. Lasso-DAG with quantile-based tuning: We implement the Lasso-DAG approach of Shojaie and Michailidis (2010), choosing penalty parameters (separate for each variable $i$) given by $\lambda_i = 2 n^{-1/2}\, \Phi^{-1}\!\left(\frac{0.1}{2p(i-1)}\right)$, where $\Phi(\cdot)$ denotes the cumulative distribution function of the standard normal distribution. This choice is justified in Shojaie and Michailidis (2010) based on asymptotic considerations.

  2. ESC Metropolis–Hastings algorithm: We implement the Rao–Blackwellized Metropolis–Hastings algorithm for the empirical sparse Cholesky (ESC) prior introduced in Lee et al. (2019) for exploring the space of Cholesky factors. The hyperparameters and the initial states are taken as suggested in Lee et al. (2019). Each MCMC chain for each row of the Cholesky factor runs for 5000 iterations with a burn-in period of 2000. All active components in $L$ with inclusion probability larger than 0.5 are selected.

  3. FBF (fractional Bayes factor) approach: We implement the stochastic search algorithm based on fractional Bayes factors for non-local moment priors suggested in Altamore et al. (2013). The stochastic search algorithm is similar to that proposed by Scott and Carvalho (2008), and includes re-sampling moves, local moves and global moves. The rationale can be summarized by saying that edge moves which have already improved some models are likely to improve other models as well. The final model is constructed by collecting the entries with inclusion probabilities greater than 0.5.

  4. pMOM Cholesky MCMC algorithm: We run the MCMC algorithm outlined in Section 5 with $\alpha_Z = 0.5$ for each combination and data set to conduct the posterior inference for each column of $Z$. The initial value for $Z$ is set by thresholding the modified Cholesky factor of $(S + 0.3 I)^{-1}$ ($S$ is the sample covariance matrix), setting the entries with absolute values larger than 0.1 to 1 and the rest to 0. Each MCMC chain runs for 10,000 iterations with a burn-in period of 5000, which gives us 5000 posterior samples. In our simulation settings, we use four separate cores for parallel computing. We construct the final model by collecting the entries with inclusion probabilities greater than 0.5.

The model selection performance of these four methods is then compared using several measures of structure recovery, namely the false discovery rate, the true positive rate and the Matthews correlation coefficient (averaged over 100 independent repetitions). The false discovery rate (FDR) is the proportion of entries detected by a given procedure that are actually zero in the true model, while the true positive rate (TPR) is the proportion of true non-zero entries that are detected by the procedure. FDR and TPR are defined as
$$\mathrm{FDR} = \frac{\mathrm{FP}}{\mathrm{TP} + \mathrm{FP}}, \qquad \mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}.$$
The Matthews correlation coefficient (MCC) is commonly used to assess the overall performance of binary classification methods and is defined as
$$\mathrm{MCC} = \frac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}},$$
where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively. The value of MCC ranges from −1 to 1, with larger values corresponding to better fits (−1 and 1 represent the worst and best fits, respectively). One would like the FDR values to be as close to 0 and the TPR values to be as close to 1 as possible. The results are provided in Tables 1–6, corresponding to the different simulation settings.
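The three metrics are straightforward to compute from an estimated sparsity pattern; the small R helper below (illustrative only) compares an estimate $\hat Z$ with the truth $Z_0$, both encoded as strictly lower triangular binary matrices.

```r
## FDR, TPR and MCC for an estimated sparsity pattern Z_hat versus the truth Z0.
## Both arguments are p x p binary matrices; only the strictly lower triangle counts.
selection_metrics <- function(Z_hat, Z0) {
  idx  <- lower.tri(Z0)
  est  <- Z_hat[idx]
  true <- Z0[idx]

  ## as.numeric() avoids integer overflow in the MCC numerator for large p.
  TP <- as.numeric(sum(est == 1 & true == 1))
  TN <- as.numeric(sum(est == 0 & true == 0))
  FP <- as.numeric(sum(est == 1 & true == 0))
  FN <- as.numeric(sum(est == 0 & true == 1))

  FDR <- if (TP + FP > 0) FP / (TP + FP) else 0
  TPR <- TP / (TP + FN)
  MCC <- (TP * TN - FP * FN) /
    sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

  c(FDR = FDR, TPR = TPR, MCC = MCC)
}
```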

Table 1. Model selection performance table under Scenario 1 with 4% non-zero entries.

Table 2. Model selection performance table under Scenario 1 with 8% non-zero entries.

Table 3. Model selection performance table under Scenario 2 with 4% non-zero entries.

Table 4. Model selection performance table under Scenario 2 with 8% non-zero entries.

Table 5. Model selection performance table under Scenario 3 with 4% non-zero entries.

Table 6. Model selection performance table under Scenario 3 with 8% non-zero entries.

Note that the cutoff value of 0.5 used to obtain the posterior estimator in our MCMC procedure is a natural default choice and could be changed in different contexts. However, compared with the other methods, our results are quite robust with respect to the thresholding value, as shown by the ROC curves under the setting with 4% non-zero entries given in Figure 1. In particular, we observe that the pMOM Cholesky model with fixed $\tau = 2$ overall outperforms the other three methods, including the pMOM model with growing $\tau_p$, especially when $n$ and $p$ increase.

Figure 1. ROC curves for sparsity selection. Top: n = 100; bottom: n = 200. (a) p = 100, (b) p = 200, (c) p = 500, (d) p = 200, (e) p = 400 and (f) p = 1000.

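An ROC curve of this kind can be traced out by sweeping the inclusion-probability threshold. The minimal R sketch below assumes a matrix `incl_prob` of posterior inclusion probabilities and the true indicator `Z0` (both hypothetical inputs here, not objects defined in the paper).

```r
## Trace an ROC curve by varying the threshold on posterior inclusion probabilities.
##   incl_prob : p x p matrix of estimated inclusion probabilities (lower triangle used)
##   Z0        : p x p true binary sparsity indicator
roc_points <- function(incl_prob, Z0, thresholds = seq(0, 1, by = 0.01)) {
  idx  <- lower.tri(Z0)
  prob <- incl_prob[idx]
  true <- Z0[idx]
  t(sapply(thresholds, function(thr) {
    est <- as.numeric(prob > thr)
    c(FPR = sum(est == 1 & true == 0) / max(sum(true == 0), 1),
      TPR = sum(est == 1 & true == 1) / max(sum(true == 1), 1))
  }))
}

## Example usage: roc <- roc_points(incl_prob, Z0)
##                plot(roc[, "FPR"], roc[, "TPR"], type = "l")
```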

It is clear that our hierarchical Bayesian approach with the pMOM Cholesky prior, under two different values of $\tau$, outperforms the Lasso-DAG, ESC and FBF approaches on almost all measures. The FDR values for our Bayesian pMOM Cholesky approaches are mostly below 0.3 except when $p = 1000$, while those for the other methods are around or beyond 0.5. The TPR values for the proposed approaches are above 0.6 in most cases, while those for the penalized likelihood approach and the other two Bayesian approaches are below 0.55 in most scenarios. For the most comprehensive measure, MCC, our proposed Bayesian approach outperforms the other three methods under all combinations of $(n, p)$ and both sparsity settings. It is also worthwhile to compare the simulation performance across the three values of $\tau$ under the pMOM Cholesky prior. Although the larger order of $\tau_p$ guarantees strong model selection consistency (Theorem 4.3), the selection performance suffers slightly, compared with the constant cases $\tau = 1$ and $\tau = 2$, from the strong penalty induced by both the pMOM prior itself and the larger value of $\tau_p$. The performance under $\tau = 1$ and $\tau = 2$ is very similar, with a slightly better performance given by $\tau = 1$. Hence, from a practical standpoint, one would prefer treating $\tau$ as a smaller constant (not growing with $p$) for better estimation accuracy.

It is also meaningful to compare the computational runtime of the different methods. In Figure 2, we plot the run time comparison among our pMOM Cholesky approach, ESC and FBF. The run time for pMOM is significantly shorter than for ESC and FBF, even though each ESC-based chain was run for 5000 iterations while pMOM was run for 10,000 iterations. The computational cost of ESC is also extremely high in the sense that it requires not only additional run time, but also more memory (more than 30 GB when $p > 900$).

Figure 2. Run time comparison.


Overall, this experiment illustrates that the proposed hierarchical Bayesian approach with our pMOM Cholesky prior enables a broad yet computationally feasible model search, and at the same time leads to a significant improvement in model selection performance for estimating the sparsity pattern of the Cholesky factor and the underlying DAG.

7. Results for hyper-pMOM Cholesky prior

In the generalized linear regression setting, Wu (2016) proposes a fully Bayesian approach with the hyper-pMOM prior, in which an appropriate Inverse-Gamma prior, Inverse-Gamma$(\lambda_1, \lambda_2)$, is placed on the parameter $\tau$ in the pMOM prior. Following the nomenclature in Wu (2016), we refer to the following mixture of priors as the hyper-pMOM Cholesky prior:
$$L_{Z_j,j} \mid d_j, Z_j \overset{\text{ind}}{\sim} \text{pMOM Cholesky}, \quad 1 \le j < p, \qquad (13)$$
$$d_j \overset{\text{ind}}{\sim} \text{Inverse-Gamma}(\alpha_1, \alpha_2), \quad 1 \le j \le p, \qquad (14)$$
$$\tau \sim \text{Inverse-Gamma}(\lambda_1, \lambda_2), \qquad (15)$$
where $\lambda_1$ and $\lambda_2$ are positive constants.
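As a small illustration (a sketch under the stated hierarchy, not the authors' code), the joint log prior of $(L_{Z_j,j}, d_j, \tau)$ under (13)–(15) can be evaluated up to the constant $\log m_{|Z_j|}$ by combining the pMOM Cholesky density with the two Inverse-Gamma layers; `log_pmom_cholesky` refers to the hypothetical helper sketched in Section 2.3.

```r
## Log density of an Inverse-Gamma(shape, scale) distribution at x > 0.
log_dinvgamma <- function(x, shape, scale) {
  shape * log(scale) - lgamma(shape) - (shape + 1) * log(x) - scale / x
}

## Joint log prior of (L_{Z_j,j}, d_j, tau) under the hyper-pMOM Cholesky prior
## (13)-(15), up to the additive constant log m_{|Z_j|}.
## Assumes log_pmom_cholesky() from the earlier sketch is available.
log_hyper_pmom <- function(l_j, A_j, d_j, tau, r,
                           alpha1, alpha2, lambda1, lambda2) {
  log_pmom_cholesky(l_j, A_j, d_j, tau, r) +   # (13): pMOM Cholesky given d_j and tau
    log_dinvgamma(d_j, alpha1, alpha2) +       # (14): Inverse-Gamma prior on d_j
    log_dinvgamma(tau, lambda1, lambda2)       # (15): Inverse-Gamma prior on tau
}
```

The extra layer simply makes $\tau$ an additional unknown in the posterior computation, which is what increases the computational cost discussed at the end of this section.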

Note that, as indicated in Wu (2016) and Cao et al. (2020), compared with the pMOM density in (2) with given $\tau$, the marginal hyper-pMOM prior possesses thicker tails and induces prior dependence. In addition, this type of prior mixture can achieve better model selection performance, especially for small samples (Liang et al., 2008).

By (13)–(15), under the hyper-pMOM Cholesky prior, the resulting posterior probability for $Z_j$ is given by
$$
\pi(Z_j \mid Y) \propto \pi(Z_j)\, m_{|Z_j|}\, |A_{Z_j,Z_j}|^{\frac{1}{2}} \int_0^{\infty}\!\!\int_0^{\infty} \tau^{-r|Z_j| - \frac{|Z_j|}{2} - (\lambda_1 + 1)}\, e^{-\lambda_2 / \tau}\, \frac{1}{|n\tilde S_{Z_j,Z_j}|^{\frac{1}{2}}}\, d_j^{-\left(\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_j| + \alpha_1 + 1\right)} \exp\!\left(-\frac{n\tilde S_{j|Z_j} + 2\alpha_2}{2 d_j}\right) E_{|Z_j|}\!\left[\prod_{i \in Z_j} L_{ij}^{2r}\right] \mathrm{d}d_j\, \mathrm{d}\tau, \qquad (16)
$$
where $\tilde S_{Z_j,Z_j} = S_{Z_j,Z_j} + \frac{A_{Z_j,Z_j}}{n\tau}$, $\tilde S_{j|Z_j} = \tilde S_{jj} - \tilde S_{Z_j,j}^T \tilde S_{Z_j,Z_j}^{-1} \tilde S_{Z_j,j}$, and $E_{|Z_j|}(\cdot)$ denotes the expectation with respect to a multivariate normal distribution with mean $\tilde S_{Z_j,Z_j}^{-1} \tilde S_{Z_j,j}$ and covariance matrix $d_j (n\tilde S_{Z_j,Z_j})^{-1}$. Since these posterior probabilities are still not available in closed form, the following lemma provides an upper bound on the corresponding Bayes factor, under the assumption below.

Assumption 5c

The hyperparameters $A_p$, $\alpha_1$, $\alpha_2$, $\lambda_1$, $\lambda_2$ in (13)–(15) satisfy $0 < a_1 < \mathrm{eig}_1(A_p) \le \mathrm{eig}_2(A_p) \le \cdots \le \mathrm{eig}_p(A_p) < a_2 < \infty$ and $0 < \alpha_1, \alpha_2, \lambda_1, \lambda_2 < a_2$. Here $a_1, a_2$ are constants not depending on $n$.

Lemma 7.1

Under Assumptions 1–5c, for each $1 \le j < p$, the Bayes factor between any 'non-true' model $Z_j$ and the true model $Z_{0j}$ under the hyper-pMOM Cholesky prior is bounded above by
$$
\begin{aligned}
\frac{\pi(Y \mid Z_j)}{\pi(Y \mid Z_{0j})} \le{} & \left(M n^{1/2}\right)^{-(|Z_j| - |Z_{0j}|)} \left(\frac{V}{|Z_j|}\right)^{r|Z_j|} \left(\frac{s}{2}\right)^{-2r|Z_{0j}|} \frac{\Gamma\!\left(\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_j| + \alpha_1\right)}{\Gamma\!\left(\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_{0j}| + \alpha_1\right)} \frac{\Gamma\!\left(r|Z_j| + \frac{|Z_j|}{2} + \lambda_1\right)}{\Gamma\!\left(r|Z_{0j}| + \frac{|Z_{0j}|}{2} + \lambda_1\right)} \\
& \times \frac{\left(n\tilde S_{j|Z_{0j}}/2 + \alpha_2\right)^{\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_{0j}| + \alpha_1}}{\left(n\tilde S_{j|Z_j}/2 + \alpha_2\right)^{\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_j| + \alpha_1}} \frac{\left(\lambda_2 + c_3 |Z_{0j}|/n + c_4\right)^{r|Z_{0j}| + \frac{|Z_{0j}|}{2} + \lambda_1}}{\left(\lambda_2 - c_2 |Z_j|/(2n)\right)^{r|Z_j| + \frac{|Z_j|}{2} + \lambda_1}} \\
& + \left(M n^{1/2}\right)^{-(|Z_j| - |Z_{0j}|)} n^{-r|Z_j|} \left(\frac{s}{2}\right)^{-2r|Z_{0j}|} \frac{\Gamma\!\left(\frac{n - |Z_j|}{2} + \alpha_1\right)}{\Gamma\!\left(\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_{0j}| + \alpha_1\right)} \frac{\Gamma\!\left(r|Z_j| + \frac{|Z_j|}{2} + \lambda_1\right)}{\Gamma\!\left(r|Z_{0j}| + \frac{|Z_{0j}|}{2} + \lambda_1\right)} \\
& \times \frac{\left(n\tilde S_{j|Z_{0j}}/2 + \alpha_2\right)^{\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_{0j}| + \alpha_1}}{\left(n\tilde S_{j|Z_j}/2 + \alpha_2\right)^{\frac{n - |Z_j|}{2} + \alpha_1}} \frac{\left(\lambda_2 + c_3 |Z_{0j}|/n + c_4\right)^{r|Z_{0j}| + \frac{|Z_{0j}|}{2} + \lambda_1}}{\left(\lambda_2 - c_2 |Z_j|/(2n)\right)^{r|Z_j| + \frac{|Z_j|}{2} + \lambda_1}},
\end{aligned}
\qquad (17)
$$
for some positive constants $M$, $c_2$, $c_3$, $c_4$.

The upper bound in (17) can be used to show the posterior ratio consistency stated in the following theorem.

Theorem 7.2

Under Assumptions 1–5c, if $\lambda_1$ and $\lambda_2$ are fixed positive constants, the following holds under the hyper-pMOM Cholesky prior for all $1 \le j < p$:
$$\max_{Z_j \ne Z_{0j}} \frac{\pi(Z_j \mid Y)}{\pi(Z_{0j} \mid Y)} \overset{P_{\Omega_0}}{\longrightarrow} 0 \quad \text{and} \quad P_{\Omega_0}\left(\hat{Z}_j = Z_{0j}\right) \longrightarrow 1, \quad \text{as } n \to \infty.$$

In order to achieve strong model selection consistency, we need the following assumption on the hyperparameter λ2 instead of Assumption 5c.

Assumption 5d

The hyperparameters $A_p$, $\alpha_1$, $\alpha_2$, $\lambda_1$, $\lambda_2$ in (13)–(15) satisfy $0 < a_1 < \mathrm{eig}_1(A_p) \le \mathrm{eig}_2(A_p) \le \cdots \le \mathrm{eig}_p(A_p) < a_2 < \infty$, $0 < \alpha_1, \alpha_2, \lambda_1 < a_2$ and $\lambda_2 \gtrsim p^{2\kappa/(r+1/2)}$ for some $\kappa > 1$. Here $a_1, a_2$ are constants not depending on $n$.

The next theorem establishes the strong selection consistency under the hyper-pMOM Cholesky prior. See proofs for Theorems 7.2 and 7.3 in the supplement.

Theorem 7.3

Under Assumptions 1–5d, for the hyper-pMOM Cholesky prior, the following holds for all $1 \le j < p$:
$$\pi(Z_{0j} \mid Y) \overset{P_{\Omega_0}}{\longrightarrow} 1, \quad \text{as } n \to \infty.$$

Note that for the hyper-pMOM Cholesky prior, with the extra layer of prior on $\tau$, the Newton-type algorithm used for optimizing the likelihood can be quite time-consuming, and the estimation accuracy may be compromised, especially when the size of the model and the dimension $p$ are large. Therefore, from a practical standpoint, we would still prefer the pMOM Cholesky prior for carrying out the model selection.

8. Discussion

In this paper, we investigate the theoretical consistency properties of high-dimensional sparse Gaussian DAG models based on proper non-local priors, namely the pMOM Cholesky and hyper-pMOM Cholesky priors. We establish both posterior ratio consistency and strong model selection consistency under comparably more general conditions than those in the existing literature. In addition, by placing a uniform-like prior over the space of sparsity patterns of the Cholesky factor, we avoid the potential issue of the model getting stuck in an overly sparse region of the model space, which can be caused by priors over the graph space that penalize larger models, such as the Erdos–Renyi prior, the beta-mixture prior or the multiplicative prior. Also, through simulation studies in which we implement an efficient parallel MCMC algorithm for exploring the sparsity pattern of each column of $L$, we demonstrate that the models studied in this paper can outperform existing state-of-the-art methods, including both penalized likelihood and Bayesian approaches, in different settings.

Acknowledgments

We would like to thank the Editor, the Associate Editor and the reviewer for their insightful comments which have led to improvements of an earlier version of this paper.

Additional information

Funding

This work was supported by Simons Foundation's collaboration grant (No. 635213).

References
