
On the non-local priors for sparsity selection in high-dimensional Gaussian DAG models

Pages 332-345 | Received 27 May 2020, Accepted 05 Jul 2021, Published online: 05 Sep 2021

Abstract

We consider sparsity selection for the Cholesky factor L of the inverse covariance matrix in high-dimensional Gaussian DAG models. The sparsity is induced over the space of L via non-local priors, namely the product moment (pMOM) prior [Johnson, V., & Rossell, D. (2012). Bayesian model selection in high-dimensional settings. Journal of the American Statistical Association, 107(498), 649–660. https://doi.org/10.1080/01621459.2012.682536] and the hierarchical hyper-pMOM prior [Cao, X., Khare, K., & Ghosh, M. (2020). High-dimensional posterior consistency for hierarchical non-local priors in regression. Bayesian Analysis, 15(1), 241–262. https://doi.org/10.1214/19-BA1154]. We establish model selection consistency for the Cholesky factor under conditions that are more relaxed than those in the literature, and implement an efficient MCMC algorithm that selects the sparsity pattern of each column of L in parallel. We demonstrate the validity of our theoretical results via numerical simulations, and use further simulations to show that our sparsity selection approach is competitive with existing methods.

1. Introduction

Covariance estimation and selection is a fundamental problem in multivariate statistical inference. In recent years, high-throughput data from various applications have been generated rapidly, and several promising methods have been proposed to interpret the complex multivariate relationships in these high-dimensional datasets. In particular, methods inducing sparsity in the Cholesky factor of the inverse covariance matrix have proven to be very effective in applications. These models are also referred to as Gaussian DAG models. Specifically, consider i.i.d. observations $Y_1, Y_2, \dots, Y_n$ obeying a multivariate normal distribution with mean vector $0_p$ and covariance matrix $\Sigma$. Let $\Omega = L D^{-1} L^T$ be the unique modified Cholesky decomposition of the inverse covariance matrix $\Omega = \Sigma^{-1}$, where $L$ is a lower triangular matrix with unit diagonals, and $D$ is a diagonal matrix with all diagonal entries positive. A given sparsity pattern on $L$ corresponds to certain conditional independence relationships, which can be encoded in terms of a directed acyclic graph $\mathscr{D}$ on the set of $p$ variables as follows: if the $i$th and $j$th variables do not share an edge in $\mathscr{D}$, then $L_{ij} = 0$ (see Section 2 for more details). In this paper, we focus on imposing sparsity on the Cholesky factor of the inverse covariance matrix through a class of non-local priors.

Non-local priors were first introduced by Johnson and Rossell (2010) as densities that are identically zero whenever a model parameter is equal to its null value, in the context of hypothesis testing. Compared with local priors, non-local priors discard spurious covariates faster as the sample size $n$ increases, while preserving exponential learning rates for detecting nontrivial coefficients. These non-local priors, including the product moment (pMOM) non-local prior, were extended to Bayesian model selection problems in Johnson and Rossell (2012) and Shin et al. (2018) by imposing them on the regression coefficients. Wu (2016) and Cao et al. (2020) consider a fully Bayesian approach with the pMOM non-local prior and an appropriate Inverse-Gamma prior on the hyperparameter (the so-called hyper-pMOM prior), discuss the potential advantages of using hyper-pMOM priors, and establish model selection consistency in the regression setting.

In the context of Gaussian DAG models, Altamore et al. (2013) deal with structural learning for Gaussian DAG models from an objective Bayesian perspective by assigning a prior distribution on the space of DAGs, together with an improper product moment prior on the Cholesky factor corresponding to each DAG. However, objective priors are often improper and cannot be used to compute Bayes factors directly, even when the marginal likelihoods are strictly positive and finite. The authors therefore utilize the fractional Bayes factor (FBF) approach and implement an efficient stochastic search algorithm to deal with data sets whose sample size is smaller than the number of variables. Cao et al. (2019) further establish consistency results under these objective priors, but under rather restrictive conditions.

To the best of our knowledge, a rigorous investigation of high-dimensional posterior consistency properties under either the pMOM prior or the hyper-pMOM prior has not been undertaken for either undirected graphical models or DAG models. Hence, our first goal was to investigate whether high-dimensional consistency results could be established under these two more diverse and algebraically complex classes of non-local priors in the Gaussian DAG model setting. Our second goal was to investigate whether these consistency results can be obtained under relaxed or comparable conditions. Our third goal was to develop efficient algorithms for exploring the massive candidate space containing $2^{p(p-1)/2}$ models. These were challenging goals, of course, as the posterior distributions are not available in closed form under either the pMOM prior or the hyper-pMOM prior.

As the main contributions of this paper, we establish high-dimensional posterior ratio consistency for Gaussian DAG models with both the pMOM prior and the hyper-pMOM prior on the Cholesky factor $L$, under a uniform-like prior on the sparsity pattern in $L$ (Theorems 4.2–7.3). Following the nomenclature in Lee et al. (2019) and Niu et al. (2019), this notion of consistency, also referred to as consistency of posterior odds, implies that the maximal ratio between the marginal posterior probability assigned to a 'non-true' model and the posterior probability assigned to the 'true' model converges to zero. This in turn implies that the true model will be the posterior mode with probability tending to 1. As indicated in Shin et al. (2018), since the pMOM priors already induce a strong penalty on the model size, it is no longer necessary to penalize larger models through priors on the graph space such as the Erdos–Renyi prior (Niu et al., 2019), the beta-mixture prior (Carvalho & Scott, 2009), or the multiplicative prior (Tan et al., 2017). Also, through simulation studies in which we implement an efficient parallel MCMC algorithm for exploring the sparsity pattern of each column of $L$, we demonstrate that the models studied in this paper can outperform existing state-of-the-art methods, including both penalized likelihood and Bayesian approaches, in different settings.

The rest of the paper is organized as follows. Section 2 provides background material on Gaussian DAG models and introduces the pMOM Cholesky distribution. In Section 3, we present our hierarchical Bayesian model and the parameter class for the inverse covariance matrices. Model selection consistency results for the pMOM Cholesky prior and the hyper-pMOM Cholesky prior are stated in Sections 4 and 7, respectively, with proofs provided in the supplement. Section 5 describes our computational strategy. In Section 6 we use simulation experiments to illustrate the model selection consistency, and demonstrate the benefits of our Bayesian approach and computational procedures for Cholesky factor selection vis-a-vis existing Bayesian and penalized likelihood approaches. We end the paper with a discussion in Section 8.

2. Preliminaries

In this section, we provide the necessary background material on graph theory and Gaussian DAG models, and introduce our pMOM Cholesky prior.

2.1. Gaussian DAG models

We consider the multivariate Gaussian distribution
$$Y \sim N_p\!\left(0, \Omega^{-1}\right), \qquad (1)$$
where $\Omega$ is a $p \times p$ inverse covariance matrix. Any positive definite matrix $\Omega$ can be uniquely decomposed as $\Omega = L D^{-1} L^T$, where $L$ is a lower triangular matrix with unit diagonal entries, and $D$ is a diagonal matrix with positive diagonal entries. This decomposition is known as the modified Cholesky decomposition of $\Omega$ (Pourahmadi, 2007). Based on this decomposition, one can place an appropriate prior over the diagonal entries of $D$ to construct a hierarchical model. In addition, the unit diagonals resulting from the modified Cholesky decomposition simplify the posterior calculation and the proof of consistency.
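To make the decomposition concrete, the short R snippet below (an illustrative sketch, not part of the paper's code) computes the modified Cholesky factors of a positive definite matrix and verifies the identity $\Omega = L D^{-1} L^T$.

```r
set.seed(1)
p <- 5

## Build an arbitrary positive definite matrix Omega for illustration.
A <- matrix(rnorm(p * p), p, p)
Omega <- crossprod(A) + diag(p)

## Modified Cholesky decomposition: Omega = L D^{-1} L^T, with L unit lower
## triangular and D diagonal with positive entries.
U <- chol(Omega)          # Omega = t(U) %*% U, U upper triangular
delta <- diag(U)          # positive diagonal of U
L <- t(U / delta)         # divide each row of U by its diagonal entry, then transpose
D <- diag(1 / delta^2)    # so that D^{-1} = diag(delta^2)

## Reconstruction error should be numerically zero.
max(abs(Omega - L %*% solve(D) %*% t(L)))
```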

A directed acyclic graph (DAG) $\mathscr{D} = (V, E)$ consists of the vertex set $V = \{1, \dots, p\}$ and an edge set $E$ such that there is no directed path starting and ending at the same vertex. As in Ben-David et al. (2016) and Lee et al. (2019), we assume without loss of generality a parent ordering, under which all the edges are directed from larger vertices to smaller vertices. Applications with a natural ordering of the variables include estimation of causal relationships from temporal observations, settings where additional experimental data can determine the ordering of the variables, and estimation of transcriptional regulatory networks from gene expression data (Huang et al., 2006; Khare et al., 2017; Shojaie & Michailidis, 2010; Yu & Bien, 2017). The set of parents of $i$, denoted by $pa_i(\mathscr{D})$, is the collection of all vertices which are larger than $i$ and share an edge with $i$.

A Gaussian DAG model over a given DAG $\mathscr{D}$, denoted by $\mathscr{N}_{\mathscr{D}}$, consists of all multivariate Gaussian distributions which obey the directed Markov property with respect to $\mathscr{D}$. In particular, if $Y = (Y_1, \dots, Y_p)^T \sim N_p(0, \Sigma)$ and $N_p(0, \Sigma = \Omega^{-1}) \in \mathscr{N}_{\mathscr{D}}$, then $Y_i \perp Y_{\{i+1, \dots, p\} \setminus pa_i(\mathscr{D})} \mid Y_{pa_i(\mathscr{D})}$ for each $1 \le i < p$. Regarding the connection between the Cholesky factor $L$ and the underlying DAG $\mathscr{D}$: if $\Omega = L D^{-1} L^T$ is the modified Cholesky decomposition of $\Omega$, then $N_p(0, \Omega^{-1})$ is a Gaussian DAG model over $\mathscr{D}$ if and only if $L_{ij} = 0$ whenever $i \notin pa_j(\mathscr{D})$.

2.2. Notations

Consider the modified Cholesky decomposition $\Omega = L D^{-1} L^T$, where $L$ is a lower triangular matrix with unit diagonal entries and $D = \mathrm{diag}\{d_1, d_2, \dots, d_p\}$, where each $d_i$ ($1 \le i \le p$) is positive and $d_i$ denotes the $i$th diagonal entry of $D$. We introduce latent binary variables $Z = \{Z_{kj} : 1 \le j < k \le p\} = \{Z_{21}, Z_{31}, \dots, Z_{p1}, Z_{32}, \dots, Z_{p,p-1}\}$ to indicate whether $L_{kj}$ is active, i.e., $Z_{kj} = 1$ if $L_{kj} \ne 0$ and $Z_{kj} = 0$ otherwise.

In this way, we view the binary variable $Z$ as the indicator of the sparsity pattern in $L$. For each $1 \le j \le p-1$, let $Z_j = \{k > j : Z_{kj} = 1\}$, a subset of $\{j+1, j+2, \dots, p\}$, be the index set of all non-zero components in $\{Z_{j+1,j}, \dots, Z_{p,j}\}$. The sets $Z_j$ explicitly give the support of the Cholesky factor and the sparsity pattern of the underlying DAG. Denote by $|Z_j| = \sum_{k=j+1}^{p} Z_{kj}$ the cardinality of the set $Z_j$ for $1 \le j \le p-1$.
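As a small self-contained illustration (a sketch, not code from the paper), the indicator $Z$, the index sets $Z_j$ and their cardinalities can be read off directly from the support of a given Cholesky factor $L$:

```r
## Toy unit lower triangular Cholesky factor with a given sparsity pattern.
L <- diag(4)
L[2, 1] <- 0.8
L[4, 1] <- -0.5
L[4, 3] <- 1.2

## Indicator matrix Z: Z[k, j] = 1 if and only if k > j and L[k, j] != 0.
Z <- (L != 0) * 1
Z[upper.tri(Z, diag = TRUE)] <- 0

## Index sets Z_j (supports of the columns) and their cardinalities |Z_j|.
Zj    <- lapply(1:(ncol(L) - 1), function(j) which(Z[, j] == 1))
sizes <- sapply(Zj, length)

Zj      # Z_1 = {2, 4}, Z_2 = {} (empty), Z_3 = {4}
sizes   # |Z_1| = 2, |Z_2| = 0, |Z_3| = 1
```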

For any $p \times p$ matrix $A$, denote by $A_{S_1, S_2}$ the submatrix of $A$ consisting of the rows in $S_1$ and the columns in $S_2$. Following the definition of $Z$, for any $p \times p$ matrix $A$, denote the column vectors $A_{Z_j, j} = (A_{kj})_{k \in Z_j}$ and $A_{j\cup Z_j,\, j} = (A_{jj}, A_{Z_j,j}^T)^T$. Also, let $A_{Z_j, Z_j} = (A_{ki})_{k, i \in Z_j}$ and
$$A_{j\cup Z_j,\, j\cup Z_j} = \begin{pmatrix} A_{jj} & A_{Z_j,j}^T \\ A_{Z_j,j} & A_{Z_j,Z_j} \end{pmatrix}.$$
In particular, $A_{p\cup Z_p,\, p} = A_{p\cup Z_p,\, p\cup Z_p} = A_{pp}$.

Next, we introduce some additional required notation. For $x \in \mathbb{R}^p$, let $\|x\|_r = \left(\sum_{j=1}^{p} |x_j|^r\right)^{1/r}$ and $\|x\|_\infty = \max_j |x_j|$ denote the standard $l_r$ and $l_\infty$ norms. For a $p \times p$ matrix $A$, let $\mathrm{eig}_1(A) \le \mathrm{eig}_2(A) \le \cdots \le \mathrm{eig}_p(A)$ be the ordered eigenvalues of $A$, and denote $\|A\|_{\max} = \max_{1 \le i, j \le p} |A_{ij}|$ and $\|A\|_{(r,s)} = \sup\{\|Ax\|_s : \|x\|_r = 1\}$ for $1 \le r, s < \infty$. In particular, $\|A\|_{(1,1)} = \max_j \sum_i |A_{ij}|$, $\|A\|_{(\infty,\infty)} = \max_i \sum_j |A_{ij}|$ and $\|A\|_{(2,2)} = \left\{\mathrm{eig}_p(A^T A)\right\}^{1/2}$.
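For reference, these quantities are easy to evaluate numerically; the sketch below (illustrative only) computes $\|A\|_{\max}$, $\|A\|_{(1,1)}$, $\|A\|_{(\infty,\infty)}$ and $\|A\|_{(2,2)}$ for a generic matrix.

```r
set.seed(2)
A <- matrix(rnorm(16), 4, 4)

norm_max <- max(abs(A))                  # ||A||_max: largest absolute entry
norm_11  <- max(colSums(abs(A)))         # ||A||_(1,1): maximum absolute column sum
norm_inf <- max(rowSums(abs(A)))         # ||A||_(inf,inf): maximum absolute row sum
norm_22  <- sqrt(max(eigen(t(A) %*% A, only.values = TRUE)$values))  # spectral norm

c(norm_max, norm_11, norm_inf, norm_22)
```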

2.3. pMOM Cholesky prior

Johnson and Rossell (2012) introduce the product moment (pMOM) non-local prior for the regression coefficients, with density given by
$$m_p\, (2\pi)^{-\frac{p}{2}} (\tau\sigma^2)^{-rp - \frac{p}{2}} |A_p|^{\frac{1}{2}} \exp\!\left(-\frac{\beta_p^T A_p \beta_p}{2\tau\sigma^2}\right) \prod_{i=1}^{p} \beta_i^{2r}. \qquad (2)$$
Here $A_p$ is a $p \times p$ nonsingular matrix, $r$ is a positive integer referred to as the order of the density, and $m_p$ is the normalizing constant independent of $\tau$ and $\sigma^2$, where $\tau$ is some positive constant. Variations of the density in (2), called the piMOM and peMOM densities, have also been developed in Johnson and Rossell (2012), Rossell et al. (2013) and Shin et al. (2018). Adapted to our framework, we place the following non-local prior on the Cholesky factor $L$ corresponding to the pMOM prior for a given sparsity pattern $Z$:
$$\pi(L_{Z_j,j} \mid d_j, Z_j) = m_{|Z_j|}\, (2\pi)^{-\frac{|Z_j|}{2}} (\tau d_j)^{-r|Z_j| - \frac{|Z_j|}{2}} |A_{Z_j,Z_j}|^{\frac{1}{2}} \exp\!\left(-\frac{L_{Z_j,j}^T A_{Z_j,Z_j} L_{Z_j,j}}{2\tau d_j}\right) \prod_{i \in Z_j} L_{ij}^{2r}, \qquad (3)$$
for $j = 1, 2, \dots, p-1$, where similarly $A$ is a $p \times p$ positive definite matrix, $r$ is a positive integer, $\tau > 0$, and $m_{|Z_j|}$ is the normalizing constant independent of $\tau$ and $d_j$, but dependent on $|Z_j|$. Although $m_{|Z_j|}$ cannot be written explicitly in closed form, it can be bounded below and above by a function of $|Z_j|$. We refer to (3) as the pMOM Cholesky prior. To introduce a hierarchical model on the Cholesky parameter $(L, D)$, we will impose an Inverse-Gamma prior on the diagonal entries of $D$. Note that to obtain our desired asymptotic consistency results, appropriate conditions on all the aforementioned hyperparameters will be introduced in Section 4.1.
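To fix ideas, the following R function (an illustrative sketch, not the implementation released with the paper) evaluates the logarithm of the pMOM Cholesky density (3) up to the additive constant $\log m_{|Z_j|}$, which does not depend on $L_{Z_j,j}$.

```r
## Log pMOM Cholesky density (3) for the non-zero entries of column j,
## up to the additive constant log m_{|Z_j|}.
##   l_j : vector L_{Z_j, j} of non-zero entries in column j
##   A_j : |Z_j| x |Z_j| positive definite matrix A_{Z_j, Z_j}
##   d_j : diagonal (conditional variance) parameter
##   tau : scale hyperparameter, r : order of the prior
log_pmom_cholesky <- function(l_j, A_j, d_j, tau, r) {
  k <- length(l_j)
  -0.5 * k * log(2 * pi) -
    (r * k + 0.5 * k) * log(tau * d_j) +
    0.5 * as.numeric(determinant(A_j, logarithm = TRUE)$modulus) -
    drop(t(l_j) %*% A_j %*% l_j) / (2 * tau * d_j) +
    2 * r * sum(log(abs(l_j)))
}

## Example call with arbitrary illustrative values.
log_pmom_cholesky(l_j = c(0.8, -0.5), A_j = diag(2), d_j = 1, tau = 1, r = 2)
```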

3. Model specification

Let $Y_1, Y_2, \dots, Y_n \in \mathbb{R}^p$ be the observed data and $S = \frac{1}{n}\sum_{i=1}^{n} Y_i Y_i^T$ the sample covariance matrix. The class of pMOM Cholesky distributions (3) can be used for Bayesian sparsity selection of the Cholesky factor through the following hierarchical model:
$$Y \mid D, L \sim N_p\!\left(0, (L D^{-1} L^T)^{-1}\right), \qquad (4)$$
$$L_{Z_j,j} \mid d_j, Z_j \overset{\text{ind}}{\sim} \text{pMOM Cholesky}, \quad 1 \le j < p, \qquad (5)$$
$$d_j \overset{\text{ind}}{\sim} \text{Inverse-Gamma}(\alpha_1, \alpha_2), \quad 1 \le j \le p. \qquad (6)$$
The proposed hierarchical model has five hyperparameters: the scale parameter $\tau > 0$, the order $r$ and the positive definite matrix $A$ in (5) for the pMOM Cholesky prior, and the shape parameter $\alpha_1$ and scale parameter $\alpha_2$ in (6) for the Inverse-Gamma prior on $d_j$. Further restrictions on these hyperparameters needed to ensure the desired consistency will be specified in Section 4.1.

Remark 3.1

Note that in the hierarchical model presented so far, we have not assigned any specific form to the prior over the sparsity patterns of $L$ (essentially the space of $Z$). Some standard regularity assumptions on this prior will be provided in Section 4.1; in fact, we will essentially impose a uniform-like prior on $Z$. Because of the strong penalty induced on the model size by the pMOM prior, it is no longer necessary to penalize larger models through priors on the graph space such as the Erdos–Renyi prior (Niu et al., 2019), the complexity prior (Lee et al., 2019), or the multiplicative prior (Tan et al., 2017).

Note that under the hierarchical model (4)–(6), we can conduct posterior inference for the sparsity pattern of each column of $L$ independently, which benefits the computation significantly in the sense that it allows for a parallel search. In order to establish posterior ratio consistency based on $\pi(Z_j \mid Y)$, we need the following lemma, which gives the marginal posterior probability.

Lemma 3.1

Under the hierarchical model (4)–(6), the resulting (marginal) posterior probability for $Z_j$ ($1 \le j < p$) is given by
$$
\pi(Z_j \mid Y) \propto \pi(Z_j)\, m_{|Z_j|}\, |A_{Z_j,Z_j}|^{\frac{1}{2}}\, \tau^{-r|Z_j| - \frac{|Z_j|}{2}}\, \frac{1}{|n\tilde S_{Z_j,Z_j}|^{\frac{1}{2}}} \int_0^{\infty} d_j^{-\left(\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_j| + \alpha_1 + 1\right)} \exp\!\left(-\frac{n\tilde S_{j|Z_j} + 2\alpha_2}{2 d_j}\right) E_{|Z_j|}\!\left[\prod_{i \in Z_j} L_{ij}^{2r}\right] \mathrm{d}d_j, \qquad (7)
$$
where $m_{|Z_j|}$ is the normalizing constant independent of $d_j$, $\tilde S = S + \frac{A}{n\tau}$, $\tilde S_{j|Z_j} = \tilde S_{jj} - \tilde S_{Z_j,j}^T \tilde S_{Z_j,Z_j}^{-1} \tilde S_{Z_j,j}$, and $E_{|Z_j|}(\cdot)$ denotes the expectation with respect to a multivariate normal distribution with mean $\tilde S_{Z_j,Z_j}^{-1} \tilde S_{Z_j,j}$ and covariance matrix $d_j (n\tilde S_{Z_j,Z_j})^{-1}$.

Here we provide the proof of Lemma 3.1.

Proof of Lemma 3.1

By (4)–(6) and Bayes' rule, under the pMOM Cholesky prior, the resulting posterior probability for $Z_j$ is given by
$$
\begin{aligned}
\pi(Z_j \mid Y) &\propto \pi(Z_j) \int_0^{\infty}\!\!\int \pi(Y \mid D, L)\, \pi(L_{Z_j,j} \mid d_j, Z_j)\, \pi(d_j)\, \mathrm{d}L_{Z_j,j}\, \mathrm{d}d_j \\
&\propto \pi(Z_j) \int_0^{\infty}\!\!\int \exp\!\left(-\frac{n\, L_{j\cup Z_j,j}^T S_{j\cup Z_j,\, j\cup Z_j} L_{j\cup Z_j,j}}{2 d_j}\right) d_j^{-\left(\frac{n}{2} + \alpha_1 + 1\right)} e^{-\alpha_2 / d_j}\, m_{|Z_j|}\, (2\pi)^{-\frac{|Z_j|}{2}} (\tau d_j)^{-r|Z_j| - \frac{|Z_j|}{2}} |A_{Z_j,Z_j}|^{\frac{1}{2}} \\
&\qquad\times \exp\!\left(-\frac{L_{Z_j,j}^T A_{Z_j,Z_j} L_{Z_j,j}}{2\tau d_j}\right) \prod_{i \in Z_j} L_{ij}^{2r}\, \mathrm{d}L_{Z_j,j}\, \mathrm{d}d_j.
\end{aligned}
\qquad (8)
$$
Note that
$$
L_{j\cup Z_j,j}^T S_{j\cup Z_j,\, j\cup Z_j} L_{j\cup Z_j,j} = \begin{pmatrix} 1, & L_{Z_j,j}^T \end{pmatrix} \begin{pmatrix} S_{jj} & S_{Z_j,j}^T \\ S_{Z_j,j} & S_{Z_j,Z_j} \end{pmatrix} \begin{pmatrix} 1 \\ L_{Z_j,j} \end{pmatrix}.
$$
Therefore, completing the square in $L_{Z_j,j}$, it follows from (8) that
$$
\begin{aligned}
&\int \exp\!\left(-\frac{n\, L_{j\cup Z_j,j}^T S_{j\cup Z_j,\, j\cup Z_j} L_{j\cup Z_j,j}}{2 d_j}\right) \exp\!\left(-\frac{L_{Z_j,j}^T A_{Z_j,Z_j} L_{Z_j,j}}{2\tau d_j}\right) \prod_{i \in Z_j} L_{ij}^{2r}\, \mathrm{d}L_{Z_j,j} \\
&\quad= \int \prod_{i \in Z_j} L_{ij}^{2r} \exp\!\left(-\frac{\left(L_{Z_j,j} + \tilde S_{Z_j,Z_j}^{-1} S_{Z_j,j}\right)^{T} \tilde S_{Z_j,Z_j} \left(L_{Z_j,j} + \tilde S_{Z_j,Z_j}^{-1} S_{Z_j,j}\right)}{2 d_j / n}\right) \exp\!\left(-\frac{S_{jj} - S_{Z_j,j}^T \tilde S_{Z_j,Z_j}^{-1} S_{Z_j,j}}{2 d_j / n}\right) \mathrm{d}L_{Z_j,j},
\end{aligned}
$$
where $\tilde S_{Z_j,Z_j} = S_{Z_j,Z_j} + \frac{A_{Z_j,Z_j}}{n\tau}$. Hence, by (8), we have
$$
\begin{aligned}
\pi(Z_j \mid Y) &\propto \pi(Z_j) \int_0^{\infty}\!\!\int \pi(Y \mid D, L)\, \pi(L_{Z_j,j} \mid d_j, Z_j)\, \pi(d_j)\, \mathrm{d}L_{Z_j,j}\, \mathrm{d}d_j \\
&\propto \pi(Z_j)\, m_{|Z_j|}\, |A_{Z_j,Z_j}|^{\frac{1}{2}}\, \tau^{-r|Z_j| - \frac{|Z_j|}{2}}\, \frac{1}{|n\tilde S_{Z_j,Z_j}|^{\frac{1}{2}}} \int_0^{\infty} d_j^{-\left(\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_j| + \alpha_1 + 1\right)} \exp\!\left(-\frac{n\tilde S_{j|Z_j} + 2\alpha_2}{2 d_j}\right) E_{|Z_j|}\!\left[\prod_{i \in Z_j} L_{ij}^{2r}\right] \mathrm{d}d_j.
\end{aligned}
\qquad (9)
$$

In particular, these posterior probabilities can be used to select a model by computing the posterior mode defined by
$$\hat{Z}_j = \underset{Z_j}{\arg\max}\ \pi(Z_j \mid Y). \qquad (10)$$

4. Main results

In this section we investigate the high-dimensional asymptotic properties of the model proposed in Section 3. For this purpose, we work in a setting where the data dimension $p = p_n$ and the hyperparameters vary with the sample size $n$, and $p_n \ge n$. Assume that the data are generated from a true model specified as follows. Let $Y_1^n, Y_2^n, \dots, Y_n^n$ be independent and identically distributed multivariate Gaussian vectors with mean $0_{p_n}$ and true covariance matrix $\Sigma_0^n = (\Omega_0^n)^{-1}$, where $\Omega_0^n = L_0^n (D_0^n)^{-1} (L_0^n)^T$ is the modified Cholesky decomposition of $\Omega_0^n$. The sparsity pattern of the true Cholesky factor $L_0^n$ is uniquely encoded in the true binary variable set, denoted by $Z_0^n$.

In order to establish our asymptotic consistency results, we need the following mild assumptions, together with corresponding discussion and interpretation. Denote by $d_n = \max_{1 \le j \le p_n - 1} |Z_{0j}^n|$ the maximum number of non-zero entries in any column of $L_0^n$. Let $s_n = \min_{1 \le j < i \le p_n,\, i \in Z_{0j}^n} |(L_0^n)_{ij}|$ be the smallest (in absolute value) non-zero off-diagonal entry of $L_0^n$, which can be interpreted as the 'signal size'. For sequences $a_n$ and $b_n$, $a_n \gtrsim b_n$ means $a_n / b_n \ge c$ for some constant $c > 0$, and $a_n = o(b_n)$ means $a_n / b_n \to 0$ as $n \to \infty$.

4.1. Assumptions

Assumption 1

There exists $\epsilon_0 \le 1$ such that for every $n \ge 1$, $0 < \epsilon_0 \le \mathrm{eig}_1(\Omega_0^n) \le \mathrm{eig}_{p_n}(\Omega_0^n) \le \epsilon_0^{-1}$.

Assumption 2

$d_n \sqrt{\log p_n / n} \to 0$ as $n \to \infty$.

Assumption 3

$d_n \sqrt{\log p_n / (s_n^2\, n)} \to 0$ as $n \to \infty$.

Assumption 4

For each $Z_j$ ($1 \le j < p$), a uniform prior is placed over all models of size less than or equal to $q_n$, i.e., $\pi(Z_j) \propto \mathbb{1}(|Z_j| \le q_n)$, where $q_n = o\!\left(\sqrt{n / \log p_n}\right)$.

Assumption 5a

The hyperparameters $A_{p_n}$, $\tau$, $\alpha_1$, $\alpha_2$ in (5) and (6) satisfy $0 < a_1 < \mathrm{eig}_1(A_{p_n}) \le \mathrm{eig}_2(A_{p_n}) \le \cdots \le \mathrm{eig}_{p_n}(A_{p_n}) < a_2 < \infty$ and $0 < \alpha_1, \alpha_2, \tau < a_2$. Here $a_1, a_2$ are constants not depending on $n$.

Assumption 1 has been commonly used for establishing high-dimensional asymptotic properties of covariance estimators (Banerjee & Ghosal, 2014, 2015; Bickel & Levina, 2008; El Karoui, 2008; Xiang et al., 2015). Assumption 2 essentially allows the number of variables $p_n$ to grow slower than $e^{n/d_n^2}$, compared with the rate $e^{n/d_n^4}$ in the previous literature (Banerjee & Ghosal, 2014, 2015; Xiang et al., 2015). Assumption 2 also states that the maximum number of parents over all nodes in the true model (i.e., $d_n$) must be of smaller order than $\sqrt{n / \log p_n}$.

Assumption 3, also known as the 'beta-min' condition, provides a lower bound on the minimum non-zero entries of $L_0^n$ that is needed for establishing consistency. This type of condition has been used for exact support recovery in high-dimensional linear regression models as well as in Gaussian DAG models (Khare et al., 2017; Lee et al., 2019; Yang et al., 2016). Assumption 4 essentially states that the uniform-like prior on the space of the $2^{p_n(p_n-1)/2}$ possible models places zero mass on unrealistically large models. Since Assumption 2 already restricts $d_n$ to be $o\!\left(\sqrt{n / \log p_n}\right)$, Assumption 4 does not affect the probability assigned to the true model. See similar assumptions in Johnson and Rossell (2012) and Shin et al. (2018) in the context of regression.

Assumption 5a is standard; it requires the eigenvalues of the scale matrix in the pMOM Cholesky prior to be uniformly bounded in $n$. Note that for the default choice $A_{p_n} = I_{p_n}$, Assumption 5a is immediately satisfied. See similar assumptions in Shin et al. (2018) and Johnson and Rossell (2012). This assumption also requires the hyperparameter $\tau$ in the pMOM Cholesky prior and $\alpha_1, \alpha_2$ in the Inverse-Gamma prior to be bounded by a constant.

For the rest of this paper, $p_n$, $\Omega_0^n$, $\Sigma_0^n$, $L_0^n$, $D_0^n$, $Z_0^n$, $Z^n$, $d_n$, $s_n$ will be denoted by $p$, $\Omega_0$, $\Sigma_0$, $L_0$, $D_0$, $Z_0$, $Z$, $d$, $s$, leaving out the index $n$ for notational convenience. Let $P_{\Omega_0}$ and $E_{\Omega_0}$ denote the probability measure and the expected value corresponding to the 'true' model specified at the beginning of Section 4, respectively.

4.2. Posterior ratio consistency

Since the posterior probabilities in (7) are not available in closed form, we need the following lemma, which gives an upper bound on the Bayes factor between any 'non-true' model $Z_j$ and the true model $Z_{0j}$. The proof of this lemma is provided in the supplement.

Lemma 4.1

Under Assumptions 1–5a, for each $1 \le j < p$, the Bayes factor between any 'non-true' model $Z_j$ and the true model $Z_{0j}$ under the pMOM Cholesky prior is bounded above by
$$
\begin{aligned}
\frac{\pi(Y \mid Z_j)}{\pi(Y \mid Z_{0j})} \le{} & \left\{(M\tau)^{r+1/2} n^{1/2}\right\}^{-(|Z_j| - |Z_{0j}|)} \frac{|\tilde S_{j\cup Z_{0j},\, j\cup Z_{0j}}| / \tilde S_{j|Z_{0j}}}{|\tilde S_{j\cup Z_j,\, j\cup Z_j}| / \tilde S_{j|Z_j}} \left(\frac{V}{|Z_j|}\right)^{r|Z_j|} \left(\frac{s}{2}\right)^{-2r|Z_{0j}|} \frac{\Gamma\!\left(\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_j| + \alpha_1\right)}{\Gamma\!\left(\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_{0j}| + \alpha_1\right)} \frac{\left(n\tilde S_{j|Z_{0j}}/2 + \alpha_2\right)^{\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_{0j}| + \alpha_1}}{\left(n\tilde S_{j|Z_j}/2 + \alpha_2\right)^{\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_j| + \alpha_1}} \\
& + \left\{(M\tau)^{r+1/2} n^{1/2}\right\}^{-(|Z_j| - |Z_{0j}|)} \frac{|\tilde S_{j\cup Z_{0j},\, j\cup Z_{0j}}| / \tilde S_{j|Z_{0j}}}{|\tilde S_{j\cup Z_j,\, j\cup Z_j}| / \tilde S_{j|Z_j}}\; n^{-r|Z_j|} \left(\frac{s}{2}\right)^{-2r|Z_{0j}|} \frac{\Gamma\!\left(\frac{n - |Z_j|}{2} + \alpha_1\right)}{\Gamma\!\left(\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_{0j}| + \alpha_1\right)} \frac{\left(n\tilde S_{j|Z_{0j}}/2 + \alpha_2\right)^{\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_{0j}| + \alpha_1}}{\left(n\tilde S_{j|Z_j}/2 + \alpha_2\right)^{\frac{n - |Z_j|}{2} + \alpha_1}},
\end{aligned}
\qquad (11)
$$
for some positive constant $M$, where $V = S_{Z_j,j}^T \tilde S_{Z_j,Z_j}^{-2} S_{Z_j,j}$.

The upper bound for the Bayes factor in Lemma 4.1 can be used to prove posterior ratio consistency. This notion of consistency implies that the true model will be the posterior mode with probability tending to 1.

Theorem 4.2

Under Assumptions 1–5a, the following holds for all $1 \le j < p$:
$$\max_{Z_j \ne Z_{0j}} \frac{\pi(Z_j \mid Y)}{\pi(Z_{0j} \mid Y)} \overset{P_{\Omega_0}}{\longrightarrow} 0, \quad \text{as } n \to \infty.$$

The proof of this result is provided in the supplement. If one is interested in a point estimate of $Z_j$, the most apparent choice is the posterior mode defined as
$$\hat{Z}_j = \underset{Z_j}{\arg\max}\ \pi(Z_j \mid Y). \qquad (12)$$
By noting that
$$\max_{Z_j \ne Z_{0j}} \frac{\pi(Z_j \mid Y)}{\pi(Z_{0j} \mid Y)} < 1 \;\Longrightarrow\; \hat{Z}_j = Z_{0j},$$
we have the following corollary.

Corollary 4.1

Under Assumptions 1–5a, the posterior mode $\hat{Z}_j$ is equal to the true model $Z_{0j}$ with probability tending to 1, i.e., for all $1 \le j < p$,
$$P_{\Omega_0}\left(\hat{Z}_j = Z_{0j}\right) \longrightarrow 1, \quad \text{as } n \to \infty.$$

4.3. Strong model selection consistency

Next we establish a stronger notion of consistency (compared to Theorem 4.2) that is referred to as strong selection consistency, which implies that the posterior mass assigned to the true model $Z_{0j}$ converges to 1 in probability (Lee et al., 2019; Narisetty & He, 2014). To achieve strong selection consistency, we need the following assumption on $\tau$ instead of Assumption 5a. The proof of the corresponding theorem is provided in the supplement.

Assumption 5b

The hyperparameters $A_p$, $\tau$, $\alpha_1$, $\alpha_2$ in (5) and (6) satisfy $0 < a_1 < \mathrm{eig}_1(A_p) \le \mathrm{eig}_2(A_p) \le \cdots \le \mathrm{eig}_p(A_p) < a_2 < \infty$, $0 < \alpha_1, \alpha_2 < a_2$ and $\tau \gtrsim p^{2\kappa/(r+1/2)}$ for some $\kappa > 1$. Here $a_1, a_2$ are constants not depending on $n$.

Theorem 4.3

Under Assumptions 1–5b, the following holds for all $1 \le j < p$:
$$\pi(Z_{0j} \mid Y) \overset{P_{\Omega_0}}{\longrightarrow} 1, \quad \text{as } n \to \infty.$$

Remark 4.1

Note that neither Theorem 4.2 nor Corollary 4.1 requires the scale parameter $\tau$ in the pMOM Cholesky prior to grow at any particular rate; this requirement is only needed for Theorem 4.3. As noted in Johnson and Rossell (2012), the scale parameter $\tau$ is of particular importance, as it reflects the dispersion of the non-local prior density around zero, and implicitly determines the size of the regression coefficients that will be shrunk to zero. Shin et al. (2018) treat $\tau$ as given, and consider a setting where $p$ and $\tau$ vary with the sample size $n$. In the context of linear regression, they show that high-dimensional model selection consistency is achieved under the peMOM prior under the assumption that $\tau$ grows faster than $\log p$.

4.4. Comparison with existing methods

We now compare our results and assumptions with those of existing methods in both the Bayesian and frequentist literature. Assumption 2 is weaker than the corresponding assumptions for high-dimensional covariance asymptotics in other Bayesian approaches, including Xiang et al. (2015) and Banerjee and Ghosal (2014, 2015). However, compared with methods based on penalized likelihood, Assumption 2 is stronger than the condition $d \log p / n < c_0$ for some constant $c_0$ in Yu and Bien (2017) and van de Geer and Bühlmann (2013), which study the estimation of the Cholesky factor for Gaussian DAG models with and without the known ordering condition, respectively. In terms of undirected graphical models, Assumption 2 is also more restrictive compared with the complexity assumptions $\log p = o(n)$ in Cai et al. (2011) and $n > c_1 d \log p$ for some constant $c_1$ in Zhang and Zou (2014).

Interested readers may also find Assumption 4 to be stronger than Condition (P) in Lee et al. (2019), where $q_n \lesssim n (\log p_n)^{-1} \{(\log n)^{-1} \wedge c_2\}$ for some constant $c_2$. This is a consequence of the uniform-like prior imposed on $Z_j$. If we replace the uniform prior with the Erdos–Renyi prior or the complexity prior, this restriction can be relaxed to encompass a larger class of models. However, the simulation results would then be compromised by always favouring the sparsest model, since a penalty on larger models has already been induced through the pMOM prior itself.

In the context of estimating DAGs using non-local priors, Altamore et al. (2013) deal with structural learning for Gaussian DAG models from an objective Bayesian perspective by assigning a prior distribution on the space of DAGs, together with an improper pMOM prior on the Cholesky factor corresponding to each DAG. The authors propose the FBF approach, but do not examine its theoretical consistency. The major contributions of this paper are to fill this gap by establishing high-dimensional asymptotic properties for pMOM and hyper-pMOM priors in Gaussian graphical models, and to develop efficient algorithms for exploring the massive candidate space containing $2^{p(p-1)/2}$ models, as discussed in the next section.

5. Computation

In this section, we describe the computational strategy for the proposed model. The integral formulation in (7) is quite complicated, and the posterior probabilities cannot be obtained in closed form. Hence, we use a Laplace approximation to compute $\pi(Z_j \mid Y)$; detailed formulas are provided in the supplement. A similar approach to computing posterior probabilities based on Laplace approximations has been used in Johnson and Rossell (2012) and Shin et al. (2018). In practice, when the computational burden of the Laplace approximation becomes heavy as $p$ increases, we also suggest using the upper bound of the posterior ((A.5) in the supplement) as an approximation, since our proofs are based on these upper bounds and the consistency results are therefore already guaranteed. Based on these approximations, we consider the following MCMC algorithm for exploring the model space:

  1. Set the initial value $Z^{\mathrm{curr}}$.

  2. For each $j = 1, \dots, p-1$:

    (a) Given the current $Z_j^{\mathrm{curr}}$, propose $Z_j^{\mathrm{cand}}$ by either

      (i) changing a non-zero entry in $Z_j^{\mathrm{curr}}$ to zero with probability $1 - \alpha_Z$, or

      (ii) changing a zero entry in $Z_j^{\mathrm{curr}}$ to one with probability $\alpha_Z$.

    (b) Compute the acceptance probability $p_a = \min\left\{1, \dfrac{\pi(Z_j^{\mathrm{cand}} \mid Y)\, q(Z_j^{\mathrm{curr}} \mid Z_j^{\mathrm{cand}})}{\pi(Z_j^{\mathrm{curr}} \mid Y)\, q(Z_j^{\mathrm{cand}} \mid Z_j^{\mathrm{curr}})}\right\}$.

    (c) Draw $u \sim U(0,1)$. If $p_a > u$, set $Z_j^{\mathrm{curr}} = Z_j^{\mathrm{cand}}$.

    (d) Repeat (a)–(c) until a sufficiently long chain is acquired.

Note that for the inference on $Z$, steps 2(a)–2(d) in the above algorithm can be parallelized across the columns of $Z$. For more details about the parallel MCMC algorithm, we refer interested readers to Lee et al. (2019), Bhadra and Mallick (2013) and Johnson and Rossell (2012). The above algorithm is coded in R and publicly available at https://github.com/xuan-cao/Non-local-Cholesky.
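For illustration, the sketch below implements one add/delete Metropolis–Hastings update for a single column $Z_j$ in R. It is a simplified stand-in rather than the released code: `log_post_Zj` is a hypothetical placeholder for the (Laplace-approximated) logarithm of $\pi(Z_j \mid Y)$ in (7), and the proposal corresponds to step 2(a) above.

```r
## One Metropolis-Hastings update for column j (illustrative sketch).
##   z_curr      : binary vector of length p; z_curr[k] = Z_kj (only k > j can be 1)
##   log_post_Zj : function(z) returning log pi(Z_j = z | Y) up to a constant,
##                 e.g. via the Laplace approximation of (7) (placeholder here)
##   alpha_Z     : probability of proposing an "add" move
mh_update_column <- function(z_curr, j, p, log_post_Zj, alpha_Z = 0.5) {
  zeros <- setdiff(which(z_curr == 0), 1:j)   # entries that could be switched on
  ones  <- which(z_curr == 1)                 # entries that could be switched off

  ## Propose: add an edge with probability alpha_Z, otherwise delete one.
  add <- (runif(1) < alpha_Z && length(zeros) > 0) || length(ones) == 0
  z_cand <- z_curr
  if (add) {
    k <- if (length(zeros) == 1) zeros else sample(zeros, 1)
    z_cand[k] <- 1
    q_fwd <- alpha_Z / length(zeros)              # q(cand | curr)
    q_bwd <- (1 - alpha_Z) / sum(z_cand == 1)     # q(curr | cand)
  } else {
    k <- if (length(ones) == 1) ones else sample(ones, 1)
    z_cand[k] <- 0
    q_fwd <- (1 - alpha_Z) / length(ones)
    q_bwd <- alpha_Z / length(setdiff(which(z_cand == 0), 1:j))
  }

  ## Metropolis-Hastings acceptance step.
  log_ratio <- log_post_Zj(z_cand) + log(q_bwd) - log_post_Zj(z_curr) - log(q_fwd)
  if (log(runif(1)) < min(0, log_ratio)) z_cand else z_curr
}
```

In the full sampler this update is repeated to form a long chain and applied to every column $j = 1, \dots, p-1$; since the posterior factorizes over columns, the columns can be processed on separate cores, which is the parallelization referred to above.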

6. Simulation studies

In this section, we demonstrate our main results through simulation studies. To this end, we consider several different combinations of $(n, p)$, including both low-dimensional and high-dimensional cases. For each fixed $p$, a $p \times p$ lower triangular matrix with unit diagonals is constructed: we randomly choose 4% or 8% of the lower triangular entries of the Cholesky factor and set them to non-zero values according to the following three scenarios, with the remaining entries set to zero. We refer to this matrix as $L_0$. The matrix $L_0$ also determines the true underlying DAG structure encoded in $Z_0$.

  1. Scenario 1: All the non-zero off-diagonal entries in L0 are set to be 1.

  2. Scenario 2: All the non-zero off-diagonal entries in L0 are generated from N(0,1).

  3. Scenario 3: Each non-zero off-diagonal entry is set to be 0.25, 0.5 or 0.75 with equal probability.

Next, we generate $n$ i.i.d. observations from the $N_p\!\left(0_p, (L_0^{-1})^T L_0^{-1}\right)$ distribution, and set the hyperparameters as $r = 2$, $A_p = I_p$, $\alpha_1 = \alpha_2 = 0.01$. This process ensures that all the assumptions are satisfied. Since posterior ratio consistency in Theorem 4.2 and strong model selection consistency in Theorem 4.3 require different constraints on the scale parameter $\tau$, we consider three values of $\tau$: (a) $\tau = 1$; (b) $\tau = 2$; (c) $\tau_p = p^{2.01}$. We then perform model selection on the Cholesky factor using four procedures, described below after a short data-generation sketch.
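The data-generating step can be reproduced in a few lines of R; the sketch below (illustrative only, using a small $p$ and Scenario 2 entries) builds $L_0$, samples $Y_1, \dots, Y_n$, and forms the sample covariance matrix $S$.

```r
set.seed(3)
n <- 100; p <- 50; sparsity <- 0.04

## True Cholesky factor L0 (Scenario 2: non-zero entries drawn from N(0,1)).
L0 <- diag(p)
lower_idx <- which(lower.tri(L0))
active <- sample(lower_idx, size = round(sparsity * length(lower_idx)))
L0[active] <- rnorm(length(active))

## True precision and covariance matrices (here D = I_p).
Omega0 <- L0 %*% t(L0)
Sigma0 <- chol2inv(chol(Omega0))

## Sample Y_1, ..., Y_n ~ N_p(0, Sigma0) and form S = (1/n) sum_i Y_i Y_i^T.
Y <- matrix(rnorm(n * p), n, p) %*% chol(Sigma0)
S <- crossprod(Y) / n

## True sparsity indicator Z0 (strictly lower triangular support of L0).
Z0 <- (L0 != 0) * 1
Z0[upper.tri(Z0, diag = TRUE)] <- 0
```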

  1. Lasso-DAG with quantile-based tuning: We implement the Lasso-DAG approach of Shojaie and Michailidis (2010), choosing penalty parameters (separate for each variable $i$) given by $\lambda_i = 2 n^{-1/2}\, \Phi^{-1}\!\left(\frac{0.1}{2p(i-1)}\right)$, where $\Phi(\cdot)$ denotes the cumulative distribution function of the standard normal distribution. This choice is justified in Shojaie and Michailidis (2010) based on asymptotic considerations.

  2. ESC Metropolis–Hastings algorithm: We implement the Rao–Blackwellized Metropolis–Hastings algorithm for the empirical sparse Cholesky (ESC) prior introduced in Lee et al. (2019) for exploring the space of Cholesky factors. The hyperparameters and the initial states are taken as suggested in Lee et al. (2019). Each MCMC chain for each row of the Cholesky factor runs for 5000 iterations with a burn-in period of 2000. All active components in $L$ with inclusion probability larger than 0.5 are selected.

  3. FBF (fractional Bayes factor) approach: We implement the stochastic search algorithm based on fractional Bayes factors for non-local moment priors suggested in Altamore et al. (2013). The stochastic search algorithm is similar to that proposed by Scott and Carvalho (2008), and includes re-sampling moves, local moves and global moves. The rationale can be summarized by saying that edge moves which have already improved some models are likely to improve other models as well. The final model is constructed by collecting the entries with inclusion probabilities greater than 0.5.

  4. pMOM Cholesky MCMC algorithm: We run the MCMC algorithm outlined in Section 5 with $\alpha_Z = 0.5$ for each combination and data set to conduct the posterior inference for each column of $Z$. The initial value for $Z$ is set by thresholding the modified Cholesky factor of $(S + 0.3 I)^{-1}$ ($S$ is the sample covariance matrix), setting the entries with absolute values larger than 0.1 to 1 and the rest to 0. Each MCMC chain runs for 10,000 iterations with a burn-in period of 5000, which gives us 5000 posterior samples. In our simulation settings, we use four separate cores for parallel computing. We construct the final model by collecting the entries with inclusion probabilities greater than 0.5.

The model selection performance of these four methods is then compared using several measures of structure recovery, namely the false discovery rate, the true positive rate and the Matthews correlation coefficient (averaged over 100 independent repetitions). The false discovery rate (FDR) is the proportion of entries detected by a given procedure that are actually zero in the true model, while the true positive rate (TPR) is the proportion of true non-zero entries that are detected by the procedure. FDR and TPR are defined as
$$\mathrm{FDR} = \frac{\mathrm{FP}}{\mathrm{TP} + \mathrm{FP}}, \qquad \mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}.$$
The Matthews correlation coefficient (MCC) is commonly used to assess the overall performance of binary classification methods and is defined as
$$\mathrm{MCC} = \frac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}},$$
where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively. The value of MCC ranges from −1 to 1, with larger values corresponding to better fits (−1 and 1 represent the worst and best fits, respectively). One would like the FDR values to be as close to 0 and the TPR values to be as close to 1 as possible. The results are provided in Tables 1–6, corresponding to the different simulation settings.
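The three metrics are straightforward to compute from an estimated sparsity pattern; the small R helper below (illustrative only) compares an estimate $\hat Z$ with the truth $Z_0$, both encoded as strictly lower triangular binary matrices.

```r
## FDR, TPR and MCC for an estimated sparsity pattern Z_hat versus the truth Z0.
## Both arguments are p x p binary matrices; only the strictly lower triangle counts.
selection_metrics <- function(Z_hat, Z0) {
  idx  <- lower.tri(Z0)
  est  <- Z_hat[idx]
  true <- Z0[idx]

  ## as.numeric() avoids integer overflow in the MCC numerator for large p.
  TP <- as.numeric(sum(est == 1 & true == 1))
  TN <- as.numeric(sum(est == 0 & true == 0))
  FP <- as.numeric(sum(est == 1 & true == 0))
  FN <- as.numeric(sum(est == 0 & true == 1))

  FDR <- if (TP + FP > 0) FP / (TP + FP) else 0
  TPR <- TP / (TP + FN)
  MCC <- (TP * TN - FP * FN) /
    sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

  c(FDR = FDR, TPR = TPR, MCC = MCC)
}
```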

Table 1. Model selection performance table under Scenario 1 with 4% non-zero entries.

Table 2. Model selection performance table under Scenario 1 with 8% non-zero entries.

Table 3. Model selection performance table under Scenario 2 with 4% non-zero entries.

Table 4. Model selection performance table under Scenario 2 with 8% non-zero entries.

Table 5. Model selection performance table under Scenario 3 with 4% non-zero entries.

Table 6. Model selection performance table under Scenario 3 with 8% non-zero entries.

Note that the cutoff value of 0.5 used to obtain the posterior estimator in our MCMC procedure is a natural default choice and could be changed in different contexts. However, compared with the other methods, our results are quite robust with respect to the thresholding value, as shown by the ROC curves under the setting with 4% non-zero entries given in Figure 1. In particular, we observe that the pMOM Cholesky model with fixed $\tau = 2$ overall outperforms the other three methods, including the pMOM model with growing $\tau_p$, especially when $n$ and $p$ increase.

Figure 1. ROC curves for sparsity selection. Top: n = 100; bottom: n = 200. (a) p = 100, (b) p = 200, (c) p = 500, (d) p = 200, (e) p = 400 and (f) p = 1000.

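An ROC curve of this kind can be traced out by sweeping the inclusion-probability threshold. The minimal R sketch below assumes a matrix `incl_prob` of posterior inclusion probabilities and the true indicator `Z0` (both hypothetical inputs here, not objects defined in the paper).

```r
## Trace an ROC curve by varying the threshold on posterior inclusion probabilities.
##   incl_prob : p x p matrix of estimated inclusion probabilities (lower triangle used)
##   Z0        : p x p true binary sparsity indicator
roc_points <- function(incl_prob, Z0, thresholds = seq(0, 1, by = 0.01)) {
  idx  <- lower.tri(Z0)
  prob <- incl_prob[idx]
  true <- Z0[idx]
  t(sapply(thresholds, function(thr) {
    est <- as.numeric(prob > thr)
    c(FPR = sum(est == 1 & true == 0) / max(sum(true == 0), 1),
      TPR = sum(est == 1 & true == 1) / max(sum(true == 1), 1))
  }))
}

## Example usage: roc <- roc_points(incl_prob, Z0)
##                plot(roc[, "FPR"], roc[, "TPR"], type = "l")
```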

It is clear that our hierarchical Bayesian approach with the pMOM Cholesky prior, under two different values of $\tau$, outperforms the Lasso-DAG, ESC and FBF approaches on almost all measures. The FDR values for our Bayesian pMOM Cholesky approaches are mostly below 0.3 except when $p = 1000$, while those for the other methods are around or beyond 0.5. The TPR values for the proposed approaches are above 0.6 in most cases, while those for the penalized likelihood approach and the other two Bayesian approaches are below 0.55 in most scenarios. For the most comprehensive measure, MCC, our proposed Bayesian approach outperforms the other three methods under all combinations of $(n, p)$ and both sparsity settings. It is also worthwhile to compare the simulation performance across the three values of $\tau$ under the pMOM Cholesky prior. Although the larger order of $\tau_p$ guarantees strong model selection consistency (Theorem 4.3), the selection performance suffers slightly, compared with the constant cases $\tau = 1$ and $\tau = 2$, from the strong penalty induced by both the pMOM prior itself and the larger value of $\tau_p$. The performance under $\tau = 1$ and $\tau = 2$ is very similar, with a slightly better performance given by $\tau = 1$. Hence, from a practical standpoint, one would prefer treating $\tau$ as a smaller constant (not growing with $p$) for better estimation accuracy.

It is also meaningful to compare the computational runtime of the different methods. In Figure 2, we plot the run time comparison among our pMOM Cholesky approach, ESC and FBF. The run time for pMOM is significantly shorter than for ESC and FBF, even though each ESC-based chain was run for 5000 iterations while pMOM was run for 10,000 iterations. The computational cost of ESC is also extremely high in the sense that it requires not only additional run time, but also more memory (more than 30 GB when $p > 900$).

Figure 2. Run time comparison.


Overall, this experiment illustrates that the proposed hierarchical Bayesian approach with our pMOM Cholesky prior enables a broad yet computationally feasible model search, and at the same time leads to a significant improvement in model selection performance for estimating the sparsity pattern of the Cholesky factor and the underlying DAG.

7. Results for hyper-pMOM Cholesky prior

In the generalized linear regression setting, Wu (2016) proposes a fully Bayesian approach with the hyper-pMOM prior, in which an appropriate Inverse-Gamma prior, Inverse-Gamma$(\lambda_1, \lambda_2)$, is placed on the parameter $\tau$ in the pMOM prior. Following the nomenclature in Wu (2016), we refer to the following mixture of priors as the hyper-pMOM Cholesky prior:
$$L_{Z_j,j} \mid d_j, Z_j \overset{\text{ind}}{\sim} \text{pMOM Cholesky}, \quad 1 \le j < p, \qquad (13)$$
$$d_j \overset{\text{ind}}{\sim} \text{Inverse-Gamma}(\alpha_1, \alpha_2), \quad 1 \le j \le p, \qquad (14)$$
$$\tau \sim \text{Inverse-Gamma}(\lambda_1, \lambda_2), \qquad (15)$$
where $\lambda_1$ and $\lambda_2$ are positive constants.
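As a small illustration (a sketch under the stated hierarchy, not the authors' code), the joint log prior of $(L_{Z_j,j}, d_j, \tau)$ under (13)–(15) can be evaluated up to the constant $\log m_{|Z_j|}$ by combining the pMOM Cholesky density with the two Inverse-Gamma layers; `log_pmom_cholesky` refers to the hypothetical helper sketched in Section 2.3.

```r
## Log density of an Inverse-Gamma(shape, scale) distribution at x > 0.
log_dinvgamma <- function(x, shape, scale) {
  shape * log(scale) - lgamma(shape) - (shape + 1) * log(x) - scale / x
}

## Joint log prior of (L_{Z_j,j}, d_j, tau) under the hyper-pMOM Cholesky prior
## (13)-(15), up to the additive constant log m_{|Z_j|}.
## Assumes log_pmom_cholesky() from the earlier sketch is available.
log_hyper_pmom <- function(l_j, A_j, d_j, tau, r,
                           alpha1, alpha2, lambda1, lambda2) {
  log_pmom_cholesky(l_j, A_j, d_j, tau, r) +   # (13): pMOM Cholesky given d_j and tau
    log_dinvgamma(d_j, alpha1, alpha2) +       # (14): Inverse-Gamma prior on d_j
    log_dinvgamma(tau, lambda1, lambda2)       # (15): Inverse-Gamma prior on tau
}
```

The extra layer simply makes $\tau$ an additional unknown in the posterior computation, which is what increases the computational cost discussed at the end of this section.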

Note that, as indicated in Wu (2016) and Cao et al. (2020), compared with the pMOM density in (2) with given $\tau$, the marginal hyper-pMOM prior possesses thicker tails and induces prior dependence. In addition, this type of prior mixture can achieve better model selection performance, especially for small samples (Liang et al., 2008).

By (13)–(15), under the hyper-pMOM Cholesky prior, the resulting posterior probability for $Z_j$ is given by
$$
\pi(Z_j \mid Y) \propto \pi(Z_j)\, m_{|Z_j|}\, |A_{Z_j,Z_j}|^{\frac{1}{2}} \int_0^{\infty}\!\!\int_0^{\infty} \tau^{-r|Z_j| - \frac{|Z_j|}{2} - (\lambda_1 + 1)}\, e^{-\lambda_2 / \tau}\, \frac{1}{|n\tilde S_{Z_j,Z_j}|^{\frac{1}{2}}}\, d_j^{-\left(\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_j| + \alpha_1 + 1\right)} \exp\!\left(-\frac{n\tilde S_{j|Z_j} + 2\alpha_2}{2 d_j}\right) E_{|Z_j|}\!\left[\prod_{i \in Z_j} L_{ij}^{2r}\right] \mathrm{d}d_j\, \mathrm{d}\tau, \qquad (16)
$$
where $\tilde S_{Z_j,Z_j} = S_{Z_j,Z_j} + \frac{A_{Z_j,Z_j}}{n\tau}$, $\tilde S_{j|Z_j} = \tilde S_{jj} - \tilde S_{Z_j,j}^T \tilde S_{Z_j,Z_j}^{-1} \tilde S_{Z_j,j}$, and $E_{|Z_j|}(\cdot)$ denotes the expectation with respect to a multivariate normal distribution with mean $\tilde S_{Z_j,Z_j}^{-1} \tilde S_{Z_j,j}$ and covariance matrix $d_j (n\tilde S_{Z_j,Z_j})^{-1}$. Since these posterior probabilities are still not available in closed form, the following lemma provides an upper bound on the corresponding Bayes factor, under the assumption below.

Assumption 5c

The hyperparameters $A_p$, $\alpha_1$, $\alpha_2$, $\lambda_1$, $\lambda_2$ in (13)–(15) satisfy $0 < a_1 < \mathrm{eig}_1(A_p) \le \mathrm{eig}_2(A_p) \le \cdots \le \mathrm{eig}_p(A_p) < a_2 < \infty$ and $0 < \alpha_1, \alpha_2, \lambda_1, \lambda_2 < a_2$. Here $a_1, a_2$ are constants not depending on $n$.

Lemma 7.1

Under Assumptions 1–5c, for each $1 \le j < p$, the Bayes factor between any 'non-true' model $Z_j$ and the true model $Z_{0j}$ under the hyper-pMOM Cholesky prior is bounded above by
$$
\begin{aligned}
\frac{\pi(Y \mid Z_j)}{\pi(Y \mid Z_{0j})} \le{} & \left(M n^{1/2}\right)^{-(|Z_j| - |Z_{0j}|)} \left(\frac{V}{|Z_j|}\right)^{r|Z_j|} \left(\frac{s}{2}\right)^{-2r|Z_{0j}|} \frac{\Gamma\!\left(\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_j| + \alpha_1\right)}{\Gamma\!\left(\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_{0j}| + \alpha_1\right)} \frac{\Gamma\!\left(r|Z_j| + \frac{|Z_j|}{2} + \lambda_1\right)}{\Gamma\!\left(r|Z_{0j}| + \frac{|Z_{0j}|}{2} + \lambda_1\right)} \\
& \times \frac{\left(n\tilde S_{j|Z_{0j}}/2 + \alpha_2\right)^{\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_{0j}| + \alpha_1}}{\left(n\tilde S_{j|Z_j}/2 + \alpha_2\right)^{\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_j| + \alpha_1}} \frac{\left(\lambda_2 + c_3 |Z_{0j}|/n + c_4\right)^{r|Z_{0j}| + \frac{|Z_{0j}|}{2} + \lambda_1}}{\left(\lambda_2 - c_2 |Z_j|/(2n)\right)^{r|Z_j| + \frac{|Z_j|}{2} + \lambda_1}} \\
& + \left(M n^{1/2}\right)^{-(|Z_j| - |Z_{0j}|)} n^{-r|Z_j|} \left(\frac{s}{2}\right)^{-2r|Z_{0j}|} \frac{\Gamma\!\left(\frac{n - |Z_j|}{2} + \alpha_1\right)}{\Gamma\!\left(\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_{0j}| + \alpha_1\right)} \frac{\Gamma\!\left(r|Z_j| + \frac{|Z_j|}{2} + \lambda_1\right)}{\Gamma\!\left(r|Z_{0j}| + \frac{|Z_{0j}|}{2} + \lambda_1\right)} \\
& \times \frac{\left(n\tilde S_{j|Z_{0j}}/2 + \alpha_2\right)^{\frac{n}{2} + \left(r - \frac{1}{2}\right)|Z_{0j}| + \alpha_1}}{\left(n\tilde S_{j|Z_j}/2 + \alpha_2\right)^{\frac{n - |Z_j|}{2} + \alpha_1}} \frac{\left(\lambda_2 + c_3 |Z_{0j}|/n + c_4\right)^{r|Z_{0j}| + \frac{|Z_{0j}|}{2} + \lambda_1}}{\left(\lambda_2 - c_2 |Z_j|/(2n)\right)^{r|Z_j| + \frac{|Z_j|}{2} + \lambda_1}},
\end{aligned}
\qquad (17)
$$
for some positive constants $M$, $c_2$, $c_3$, $c_4$.

The upper bound in (17) can be used to show the posterior ratio consistency stated in the following theorem.

Theorem 7.2

Under Assumptions 1–5c, if $\lambda_1$ and $\lambda_2$ are fixed positive constants, the following holds under the hyper-pMOM Cholesky prior for all $1 \le j < p$:
$$\max_{Z_j \ne Z_{0j}} \frac{\pi(Z_j \mid Y)}{\pi(Z_{0j} \mid Y)} \overset{P_{\Omega_0}}{\longrightarrow} 0 \quad \text{and} \quad P_{\Omega_0}\left(\hat{Z}_j = Z_{0j}\right) \longrightarrow 1, \quad \text{as } n \to \infty.$$

In order to achieve strong model selection consistency, we need the following assumption on the hyperparameter λ2 instead of Assumption 5c.

Assumption 5d

The hyperparameters $A_p$, $\alpha_1$, $\alpha_2$, $\lambda_1$, $\lambda_2$ in (13)–(15) satisfy $0 < a_1 < \mathrm{eig}_1(A_p) \le \mathrm{eig}_2(A_p) \le \cdots \le \mathrm{eig}_p(A_p) < a_2 < \infty$, $0 < \alpha_1, \alpha_2, \lambda_1 < a_2$ and $\lambda_2 \gtrsim p^{2\kappa/(r+1/2)}$ for some $\kappa > 1$. Here $a_1, a_2$ are constants not depending on $n$.

The next theorem establishes the strong selection consistency under the hyper-pMOM Cholesky prior. See proofs for Theorems 7.2 and 7.3 in the supplement.

Theorem 7.3

Under Assumptions 1–5d, for the hyper-pMOM Cholesky prior, the following holds for all $1 \le j < p$:
$$\pi(Z_{0j} \mid Y) \overset{P_{\Omega_0}}{\longrightarrow} 1, \quad \text{as } n \to \infty.$$

Note that for the hyper-pMOM Cholesky prior, with the extra layer of prior on $\tau$, the Newton-type algorithm used for optimizing the likelihood can be quite time-consuming, and the estimation accuracy may be compromised, especially when the size of the model and the dimension $p$ are large. Therefore, from a practical standpoint, we would still prefer the pMOM Cholesky prior for carrying out the model selection.

8. Discussion

In this paper, we investigate the theoretical consistency properties of high-dimensional sparse Gaussian DAG models based on proper non-local priors, namely the pMOM Cholesky and hyper-pMOM Cholesky priors. We establish both posterior ratio consistency and strong model selection consistency under comparably more general conditions than those in the existing literature. In addition, by placing a uniform-like prior over the space of sparsity patterns of the Cholesky factor, we avoid the potential issue of the model getting stuck in an overly sparse region of the model space, which can be caused by priors over the graph space that penalize larger models, such as the Erdos–Renyi prior, the beta-mixture prior or the multiplicative prior. Also, through simulation studies in which we implement an efficient parallel MCMC algorithm for exploring the sparsity pattern of each column of $L$, we demonstrate that the models studied in this paper can outperform existing state-of-the-art methods, including both penalized likelihood and Bayesian approaches, in different settings.

Acknowledgments

We would like to thank the Editor, the Associate Editor and the reviewer for their insightful comments which have led to improvements of an earlier version of this paper.

Additional information

Funding

This work was supported by Simons Foundation's collaboration grant (No. 635213).

References
