Posterior propriety of an objective prior for generalized hierarchical normal linear models

Pages 309-326 | Received 23 Jan 2021, Accepted 31 Aug 2021, Published online: 30 Jul 2022

ABSTRACT

Bayesian hierarchical models have been widely used in modern statistical applications. To deal with data having complex structures, we propose a generalized hierarchical normal linear (GHNL) model which accommodates arbitrarily many levels, usual design matrices and 'vanilla' covariance matrices. Objective hyperpriors can be employed for the GHNL model to express ignorance or to match frequentist properties, yet the common objective Bayesian approaches are infeasible or fraught with danger in hierarchical modelling. To tackle this issue, [Berger, J., Sun, D., & Song, C. (2020b). An objective prior for hyperparameters in normal hierarchical models. Journal of Multivariate Analysis, 178, 104606. https://doi.org/10.1016/j.jmva.2020.104606] proposed a particular objective prior and investigated its properties comprehensively. Posterior propriety is important for the choice of priors to guarantee the convergence of MCMC samplers. James Berger conjectured that the resulting posterior is proper for a hierarchical normal model with arbitrarily many levels, but a rigorous proof was not given. In this paper, we complete this story and provide a user-friendly guide. The main contributions of this paper are a new technique for deriving an elaborate upper bound on the integrated likelihood and a unified approach to checking posterior propriety for linear models. An efficient Gibbs sampling method is also introduced and outperforms other sampling approaches considerably.

1. Introduction

Bayesian hierarchical models (or multilevel models) have been extensively used in modern applications, including education (Raudenbush & Bryk, Citation1986), psychology (Lindenberger & Pötter, Citation1998), clinical trials (Xia et al., Citation2011), economics (Shimotsu, Citation2010) and many other applied statistical fields. The fundamental idea of hierarchical modelling is to think of the lowest-level units (smallest and most numerous) as organized into a hierarchy of successively higher-level units. For example, students are in classes, classes are in schools, schools are in districts, and districts are in states. Accordingly, hierarchical models are naturally applicable to survey, observational or experimental data with complicated nesting. However, the most commonly used and fully discussed hierarchical models have merely two levels. Goldstein (Citation2011) and Berger et al. (Citation2020b) defined 3-level hierarchical models and carried out statistical analysis for them. Hierarchical models with more levels have usually been avoided by researchers because of analytical difficulty and intractable computation. To the best of the authors' knowledge, a general hierarchical linear model with arbitrarily many levels seems never to have been defined or studied. In this paper, we introduce the definition of a generalized hierarchical normal linear (GHNL) model and carry out an in-depth theoretical investigation of Bayesian inference for GHNL models.

In order to implement a fully Bayesian analysis, priors are supposed to be specified on the hyperparameters (parameters at higher levels of the hierarchical model). Improper (objective) priors are often used to express ignorance or to match frequentist properties (see the review article, Consonni et al., Citation2018). When using improper priors, an important issue arises: whether the resulting posterior distributions are proper. As Hobert and Casella (Citation1996) stated, without proper precaution, misuse of improper priors, sometimes unknowingly, will result in practical difficulties, such as the non-convergence of the Gibbs sampler. The enormous practical importance of posterior propriety motivates us to explore it in the framework of GHNL modelling. There is also a vast modern literature investigating the posterior propriety of improper priors applied to a large variety of models, such as Sun et al. (Citation2001), Speckman and Sun (Citation2003), Berger et al. (Citation2005) and Michalak and Morris (Citation2016).

A great deal of effort has been devoted to the development of objective hyperpriors in hierarchical modelling, such as Daniels and Kass (Citation1999), Everson and Morris (Citation2000), Gelman (Citation2006), Gustafson et al. (Citation2006), Berger et al. (Citation2005) and Berger et al. (Citation2020b). Formal objective Bayesian approaches, like the Jeffreys-rule prior or the reference prior, are only feasible for simple hierarchical settings. For instance, the exact Jeffreys-rule prior for covariance matrices at higher levels depends on the parameters from lower levels of the model, leading to considerable difficulties in formulation and computation. Therefore, a common way is to use less formal approaches, such as applying formal objective priors from non-hierarchical models to hierarchical modelling. Unfortunately, the non-hierarchical Jeffreys-rule prior and reference prior typically yield improper posteriors in hierarchical settings (cf. Berger et al., Citation2005). Those who recognize this problem often use constant priors instead for higher-level variance components. However, the constant prior is so diffuse that it requires twice as many observations as logically needed to achieve posterior propriety (cf. Berger et al., Citation2005 and Berger et al., Citation2020b). In other words, the extra observations required are wasted on correcting the over-diffuse tail of the constant prior. The most powerful tool known for detecting over-diffuse hyperpriors is to examine the frequentist notion of admissibility of the resulting estimators (see Berger et al., Citation2005 for discussions and references). Sensible choices of objective hyperpriors are on the boundary of admissibility, being as diffuse as possible without leading to inadmissible estimators.

Berger et al. (Citation2005) studied the propriety and admissibility of a number of hyperpriors, but no overall conclusion was reached as to a specific prior to recommend. The reasons are as follows: (a) the admissibility of the leading candidate prior could not be proved; (b) the proposed computation methods were only efficient for relatively low-dimensional covariance matrices and remained quite challenging for the candidate priors; (c) the hierarchical model discussed had merely two levels, and the results do not adapt to a general hierarchical model with many levels. To address this issue, Berger et al. (Citation2020b) recommended a particular objective prior for use in all normal hierarchical models. Consider the following canonical form of the 2-level hierarchical normal model. Suppose that, independently, $y_i \sim N_k(\theta_i, I_k)$ and $\theta_i \sim N_k(\beta, V)$ for $i=1,\ldots,m$, where $N_k(\cdot,\cdot)$ denotes the $k$-dimensional normal distribution, the $y_i$ are $k\times 1$ observation vectors, the $\theta_i$ are the $k\times 1$ unobserved mean vectors, $\beta$ is a $k\times 1$ 'hypermean' vector, and $V\in\mathbb{R}^{k\times k}$ is an unknown 'hypercovariance' matrix. Berger et al. (Citation2020b) proposed a particular combination of independent priors on the hyperparameters $\beta$ and $V$ as
$$\pi(\beta)\propto\frac{1}{(1+\|\beta\|^2)^{(k-1)/2}},\qquad \pi(V)\propto\frac{1}{|V|^{1-1/(2k)}\prod_{1\le s<t\le k}(v_s-v_t)},\tag{1}$$
where $v_1>v_2>\cdots>v_k>0$ are the ordered eigenvalues of $V$. The recommendation (Equation1) for hyperpriors was justified by Berger et al. (Citation2020b) from the aspects of admissibility, ease of computation and performance. Most importantly, prior (Equation1) is adapted to being used at any level in hierarchical modelling, which is not true for the other proposed objective priors mentioned previously.

Since it is hazardous to skip demonstrating propriety at the risk of making inference from an improper posterior, Berger et al. (Citation2020b) showed the posterior propriety of a 3-level hierarchical model using prior (Equation1), while assuming square design matrices for a technical reason. Berger et al. (Citation2020b) also conjectured that the posterior is proper when the recommended prior is utilized at all levels of a hierarchical normal model with arbitrarily many levels; however, a rigorous proof could not be provided. In this paper, we complete this story and prove the posterior propriety for GHNL models in general situations. Besides, as pointed out in Michalak and Morris (Citation2016), researchers have found it daunting and time-consuming to inspect posterior propriety when using improper priors, except in the simplest models. For this reason, we supply user-friendly guidance for checking posterior propriety to practitioners in different practical situations.

In Section 2, we give an explicit definition of the GHNL model, which accommodates arbitrarily many levels and usual design matrices. It is important to note that we consider the 'vanilla' covariance matrix problem herein: we do not assume any special structure or sparsity for the hypercovariance matrices. The association between the GHNL model and a linear mixed-effect model is also discussed. In Section 3, we demonstrate that the recommended prior yields a proper posterior in the framework of GHNL modelling. In addition, we provide guidance for checking posterior propriety. An efficient MCMC algorithm for sampling from the posterior is introduced in Section 4. Section 5 provides some concluding remarks and further generalizations.

2. Generalized hierarchical normal linear model

In this section, we introduce the definition of a GHNL model with $(r+1)$ levels, where $r\ge1$. The association between the GHNL model and a linear mixed-effect model is also demonstrated, which brings insight into the GHNL model. Finally, the recommended prior on the hyperparameters of the GHNL model is presented and discussed. First, we introduce some notation to be used in the main body of this paper.

Notations Let $[k]=\{1,2,\ldots,k\}$ for a positive integer $k$; $1\{\cdot\}$ stands for the indicator function; $N_k(\mu,\Sigma)$ represents the $k$-dimensional normal distribution with mean $\mu$ and covariance $\Sigma$, and also denotes a $k$-dimensional normal random variable with mean $\mu$ and covariance $\Sigma$; for a symmetric matrix $A$, $A>0$ ($A<0$) means that $A$ is a positive (negative) definite matrix, and $A\ge0$ ($A\le0$) denotes that $A$ is a non-negative (non-positive) definite matrix.

2.1. Model structure

Berger et al. (Citation2020b) proposed a 3-level hierarchical model of the form
$$\begin{cases}\text{Level 1: } y_i=\theta_i+N_k(0,I_k), & i\in[m];\\ \text{Level 2: } \theta_i=Z_i\beta+N_k(0,V), & \beta=(\beta_1^\top,\ldots,\beta_s^\top)^\top;\\ \text{Level 3: } \beta_j=\eta+N_p(0,W), & j\in[s],\end{cases}\tag{2}$$
where the $y_i$ are $k\times1$ observation vectors, the $\theta_i$ are the $k\times1$ unobserved mean vectors, $\eta$ is a $p\times1$ 'hypermean' vector, $V\in\mathbb{R}^{k\times k}$ and $W\in\mathbb{R}^{p\times p}$ are unknown 'hypercovariance' matrices, and the $Z_i$ are $k\times sp$ known matrices. All the normal random variables in model (Equation2) are mutually independent. Based on the 3-level hierarchical normal model, a more general hierarchical model with $(r+1)$ levels ($r\ge1$) can be constructed as
$$\begin{cases}\text{Level 1: } y_{i_0}=Z_{0i_0}\theta_1+N_{k_0}(0,I_{k_0}), & i_0\in[m_0],\ \theta_1=(\theta_{11}^\top,\ldots,\theta_{1m_1}^\top)^\top;\\ \text{Level 2: } \theta_{1i_1}=Z_{1i_1}\theta_2+N_{k_1}(0,V_1), & i_1\in[m_1],\ \theta_2=(\theta_{21}^\top,\ldots,\theta_{2m_2}^\top)^\top;\\ \qquad\vdots\\ \text{Level } r\text{: } \theta_{r-1,i_{r-1}}=Z_{r-1,i_{r-1}}\theta_r+N_{k_{r-1}}(0,V_{r-1}), & i_{r-1}\in[m_{r-1}],\ \theta_r=(\theta_{r1}^\top,\ldots,\theta_{rm_r}^\top)^\top;\\ \text{Level } r+1\text{: } \theta_{ri_r}=Z_{ri_r}\eta+N_{k_r}(0,V_r), & i_r\in[m_r].\end{cases}\tag{3}$$
First, all the normal random variables in the above model are mutually independent. Within model (Equation3), the output of level $(j+1)$ consists of $m_j$ units whose values are $k_j\times1$ vectors, $j=0,1,\ldots,r$. By stacking the output units of level $(j+1)$ on top of one another, we obtain the outcome vector of level $(j+1)$, namely $\theta_j$ for $j\in[r]$ and $y=(y_1^\top,\ldots,y_{m_0}^\top)^\top$ for level 1. Then the $\theta_j$ are $(m_jk_j)\times1$ vectors and $y$ is an $(m_0k_0)\times1$ vector. In fact, only the outcome of the lowest level can be observed; the outcomes of higher levels are inaccessible, latent variables. Hence, the outcome variables of interest are always situated at the lowest level of the hierarchy. Different units in the same level share common input effects (an intercept can be included), which are exactly the outcome vectors from the upper level, except that the input effect of level $(r+1)$ is $\eta$, a $d\times1$ vector of fixed effects. In addition, units from the same level have the same variance component. The variance component within level $(j+1)$ is denoted by $V_j\in\mathbb{R}^{k_j\times k_j}$ for $j\in[r]$ and accounts for the magnitude of random variation within the corresponding level. The covariance matrices $V_j$ are unobserved for $j\in[r]$. The matrices $Z_{ji_j}$ are $k_j\times(m_{j+1}k_{j+1})$ matrices of observed covariates for unit $i_j$ in level $j+1$, where $j=0,1,\ldots,r$ and $i_j\in[m_j]$. It is natural to assume that there exist at least two units in each level and that the dimensions of all units and of $\eta$ are no less than 1; mathematically, $m_j\ge2$ and $k_j\ge1$ for $j=0,1,\ldots,r$, and $d\ge1$. Table 1 summarizes several important notations that mainly affect the results for posterior propriety in Section 3.

Table 1. Summary of certain important notations within model (Equation3) and j=0,1,,r.

The extensions from Berger et al. (Citation2020b)'s model (Equation2) to model (Equation3) are two-fold: model (Equation3) accommodates arbitrarily many levels and usual design matrices. Further define $Z_j=(Z_{j1}^\top,\ldots,Z_{jm_j}^\top)^\top$ for $j=0,1,\ldots,r$. Then the $Z_j$ are $(m_jk_j)\times(m_{j+1}k_{j+1})$ matrices for $j=0,1,\ldots,(r-1)$, $Z_r$ is an $(m_rk_r)\times d$ matrix, and an alternative representation of the $(r+1)$-level hierarchical model (Equation3) is thereby given by
$$\begin{cases}\text{Level 1: } (y\mid\theta_1)\sim N_{m_0k_0}(Z_0\theta_1,\ I_{m_0k_0});\\ \text{Level 2: } (\theta_1\mid\theta_2,V_1)\sim N_{m_1k_1}(Z_1\theta_2,\ I_{m_1}\otimes V_1);\\ \qquad\vdots\\ \text{Level } r\text{: } (\theta_{r-1}\mid\theta_r,V_{r-1})\sim N_{m_{r-1}k_{r-1}}(Z_{r-1}\theta_r,\ I_{m_{r-1}}\otimes V_{r-1});\\ \text{Level } r+1\text{: } (\theta_r\mid\eta,V_r)\sim N_{m_rk_r}(Z_r\eta,\ I_{m_r}\otimes V_r).\end{cases}\tag{4}$$
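To make the recursive structure of model (Equation4) concrete, the following sketch simulates one observation vector from a small GHNL model. It is only an illustration of the definitions above; the toy dimensions, the design matrices and the name `simulate_ghnl` are our own choices, not part of the model.

```python
import numpy as np

def simulate_ghnl(Z, V, eta, rng):
    """Draw one observation vector y from the (r+1)-level GHNL model (4).

    Z   : list [Z_0, ..., Z_r] of stacked design matrices; Z_j has shape
          (m_j*k_j, m_{j+1}*k_{j+1}) and Z_r has shape (m_r*k_r, d).
    V   : list [V_1, ..., V_r]; V_j is the k_j x k_j within-level covariance.
    eta : fixed-effect vector of length d.
    """
    r = len(V)
    theta = eta
    # Work from level r+1 down to level 2, drawing theta_r, ..., theta_1.
    for j in range(r, 0, -1):
        mean = Z[j] @ theta
        m_j = Z[j].shape[0] // V[j - 1].shape[0]       # number of units at level j+1
        cov = np.kron(np.eye(m_j), V[j - 1])           # I_{m_j} (x) V_j
        theta = rng.multivariate_normal(mean, cov)
    # Level 1: y = Z_0 theta_1 + N(0, I_{m_0 k_0}).
    return Z[0] @ theta + rng.standard_normal(Z[0].shape[0])

rng = np.random.default_rng(0)
# Toy configuration: r = 2, k_0 = k_1 = k_2 = 1, m_0 = 8, m_1 = 4, m_2 = 2, d = 1.
Z0 = np.kron(np.eye(4), np.ones((2, 1)))   # 8 x 4
Z1 = np.kron(np.eye(2), np.ones((2, 1)))   # 4 x 2
Z2 = np.ones((2, 1))                       # 2 x 1
y = simulate_ghnl([Z0, Z1, Z2], [np.eye(1) * 0.5, np.eye(1) * 2.0], np.array([1.0]), rng)
```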

Remark 2.1

If we assume that the covariance matrix for the units of level 1 in model (Equation3) is a known positive definite matrix $\Sigma_0$ instead of the identity matrix, the two assumptions are actually equivalent via the reparameterization $y_{i_0}\mapsto\Sigma_0^{-1/2}y_{i_0}$ and $Z_{0i_0}\mapsto\Sigma_0^{-1/2}Z_{0i_0}$. Furthermore, for a technical reason, $\Sigma_0$ must be assumed known throughout this paper; this reason is explained in Section 5.

2.2. Connection with the linear mixed-effect model ( LMM )

The two-level hierarchical normal models are often referred to as LMMs in many places. As for the GHNL model (Equation4), let $\Theta=\{\theta_1,\ldots,\theta_r\}$ denote the set of unobserved outcome vectors and $V=\{V_1,\ldots,V_r\}$ represent the set of unknown covariance matrices. If we take the $\theta_j$ as intermediate variables, then marginalizing out over $\Theta$ yields
$$(y\mid\eta,V)\ \sim\ N_{m_0k_0}(X_r\eta,\ \Delta),\tag{5}$$
where
$$\Delta=I_{m_0k_0}+\sum_{t=1}^{r}X_{t-1}(I_{m_t}\otimes V_t)X_{t-1}^\top,\quad\text{and}\quad X_j=\prod_{s=0}^{j}Z_s,\ j=0,1,\ldots,r.\tag{6}$$
$\Delta$ is an $(m_0k_0)\times(m_0k_0)$ matrix and the $X_j$ are $(m_0k_0)\times(m_{j+1}k_{j+1})$ matrices for $j=0,1,\ldots,r$. Suppose that the $Z_j$ are of full column rank. Then by Sylvester's rank inequality the $X_j$ are also of full column rank, $j=0,1,\ldots,r$. In the rest of the paper, the $Z_j$ are assumed to be of full column rank for $j=0,1,\ldots,r$.

Consider a particular LMM of the form
$$y=X_r\eta+X_0\theta_1+\cdots+X_{r-1}\theta_r+\epsilon,\tag{7}$$
where $\eta$ is the fixed effect, the $\theta_j$ are random effects independently distributed as $N_{m_jk_j}(0,I_{m_j}\otimes V_j)$ for $j\in[r]$, and $\epsilon$ denotes the vector of random errors and is distributed as $N_{m_0k_0}(0,I_{m_0k_0})$. By integrating out the random effects, the marginal distribution of $y$ conditioning on $(\eta,V)$ is identical to the distribution (Equation5). In short, the GHNL model is equivalent to an LMM in the sense of the marginal distribution of the observations after integrating out the intermediate outcome vectors or random effects. The equivalence between GHNL models and LMMs can also be illustrated by an example of a mixed-effect ANOVA model.
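The marginal covariance in (Equation5)-(Equation6) can be assembled directly from the $Z_j$ and the $V_j$. The sketch below (helper names are ours) builds $X_j=Z_0\cdots Z_j$ and $\Delta$; under (Equation5), and equally under the LMM (Equation7), $y\sim N_{m_0k_0}(X_r\eta,\Delta)$.

```python
import numpy as np

def marginal_covariance(Z, V):
    """Return (Delta, X) for model (4): Delta as in (6), X = [X_0, ..., X_r]."""
    n = Z[0].shape[0]                        # n = m_0 * k_0
    X = [Z[0]]
    for Zj in Z[1:]:
        X.append(X[-1] @ Zj)                 # X_j = X_{j-1} Z_j = Z_0 Z_1 ... Z_j
    Delta = np.eye(n)
    for t, Vt in enumerate(V, start=1):      # t = 1, ..., r
        m_t = X[t - 1].shape[1] // Vt.shape[0]
        Delta += X[t - 1] @ np.kron(np.eye(m_t), Vt) @ X[t - 1].T
    return Delta, X

# Same toy configuration as in the simulation sketch above.
Z0 = np.kron(np.eye(4), np.ones((2, 1)))
Z1 = np.kron(np.eye(2), np.ones((2, 1)))
Z2 = np.ones((2, 1))
Delta, X = marginal_covariance([Z0, Z1, Z2], [np.eye(1) * 0.5, np.eye(1) * 2.0])
# y ~ N(X[-1] @ eta, Delta), which is also the marginal law of the LMM (7).
```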

Example 2.1

Mixed-effect ANOVA model

Suppose we observe the scores of $p$ courses for student $(ijk)$ as $y_{ijk}$, for $i=1,\ldots,s_1$, $j=1,\ldots,s_2$ and $k=1,\ldots,s_3$. The observed data lie within a hierarchy of three levels: student $(ijk)$ is nested within class $(ij)$, and class $(ij)$ is nested within school $i$. Thus, we have $s_1$ schools in total, each school has $s_2$ classes and each class has $s_3$ students. Consider a mixed-effect ANOVA model
$$y_{ijk}=\eta+\alpha_i^*+\beta_{ij}^*+\epsilon_{ijk},\tag{8}$$
where $y_{ijk}$, $\eta$, $\alpha_i^*$, $\beta_{ij}^*$ and $\epsilon_{ijk}$ are all $p\times1$ vectors for $i=1,\ldots,s_1$, $j=1,\ldots,s_2$ and $k=1,\ldots,s_3$; $\eta$ denotes the overall mean and is a fixed effect; $\alpha_i^*\sim N_p(0,V_\alpha)$ is the effect of school; $\beta_{ij}^*$ is distributed as $N_p(0,V_\beta)$ and represents the effect of class; the student-level independent random error is denoted by $\epsilon_{ijk}$ and has distribution $N_p(0,\Sigma_0)$, where $\Sigma_0$ is a known matrix. The $\alpha_i^*$, $\beta_{ij}^*$ and $\epsilon_{ijk}$ are independently distributed. Consequently, $V_\alpha$, $V_\beta$ and $\Sigma_0$ are the variance components describing the school-level, class-level and student-level variations, respectively. Due to the hierarchical structure of the observations, we can naturally build a hierarchical model as
$$y_{ijk}\sim N_p(\beta_{ij},\Sigma_0),\quad \beta_{ij}\sim N_p(\alpha_i,V_\beta)\quad\text{and}\quad \alpha_i\sim N_p(\eta,V_\alpha),\tag{9}$$
independently, for $i=1,\ldots,s_1$, $j=1,\ldots,s_2$ and $k=1,\ldots,s_3$, where $\alpha_i=\eta+\alpha_i^*$ and $\beta_{ij}=\alpha_i+\beta_{ij}^*$. Denote
$$Y_i=\begin{pmatrix}y_{i11}&\cdots&y_{is_21}\\ \vdots&&\vdots\\ y_{i1s_3}&\cdots&y_{is_2s_3}\end{pmatrix},\quad E_i=\begin{pmatrix}\epsilon_{i11}&\cdots&\epsilon_{is_21}\\ \vdots&&\vdots\\ \epsilon_{i1s_3}&\cdots&\epsilon_{is_2s_3}\end{pmatrix},\quad \beta_i^*=\begin{pmatrix}\beta_{i1}^*\\ \vdots\\ \beta_{is_2}^*\end{pmatrix}\quad\text{and}\quad \beta_i=\begin{pmatrix}\beta_{i1}\\ \vdots\\ \beta_{is_2}\end{pmatrix},$$
where $Y_i$ and $E_i$ are both $(s_3p)\times s_2$ matrices, and $\beta_i^*$ and $\beta_i$ are both $(s_2p)\times1$ vectors. Let $y_i=\mathrm{vec}(Y_i)$ and $\epsilon_i=\mathrm{vec}(E_i)$, where $\mathrm{vec}(A)$ denotes the column vector obtained by stacking the columns of the matrix $A$ on top of one another. Define
$$y=\begin{pmatrix}y_1\\ \vdots\\ y_{s_1}\end{pmatrix},\quad \epsilon=\begin{pmatrix}\epsilon_1\\ \vdots\\ \epsilon_{s_1}\end{pmatrix},\quad \beta^*=\begin{pmatrix}\beta_1^*\\ \vdots\\ \beta_{s_1}^*\end{pmatrix},\quad \beta=\begin{pmatrix}\beta_1\\ \vdots\\ \beta_{s_1}\end{pmatrix},\quad \alpha^*=\begin{pmatrix}\alpha_1^*\\ \vdots\\ \alpha_{s_1}^*\end{pmatrix},\quad \alpha=\begin{pmatrix}\alpha_1\\ \vdots\\ \alpha_{s_1}\end{pmatrix}.$$
Thus, $y$ and $\epsilon$ are $(m_0p)\times1$ vectors, $\beta^*$ and $\beta$ are $(m_1p)\times1$ vectors, and $\alpha^*$ and $\alpha$ are $(m_2p)\times1$ vectors, where $m_0=s_1s_2s_3$, $m_1=s_1s_2$ and $m_2=s_1$. Then the hierarchical normal model (Equation9) can be expressed as a GHNL model of the form
$$\begin{cases}\text{Level 1: } (y\mid\beta,\Sigma_0)\sim N_{m_0p}(Z_0\beta,\ I_{m_0}\otimes\Sigma_0);\\ \text{Level 2: } (\beta\mid\alpha,V_\beta)\sim N_{m_1p}(Z_1\alpha,\ I_{m_1}\otimes V_\beta);\\ \text{Level 3: } (\alpha\mid\eta,V_\alpha)\sim N_{m_2p}(Z_2\eta,\ I_{m_2}\otimes V_\alpha),\end{cases}\tag{10}$$
where
$$Z_0=\mathrm{diag}\{\underbrace{1_{s_3}\otimes I_p,\ldots,1_{s_3}\otimes I_p}_{s_1s_2}\},\quad Z_1=\mathrm{diag}\{\underbrace{1_{s_2}\otimes I_p,\ldots,1_{s_2}\otimes I_p}_{s_1}\},\quad Z_2=1_{s_1}\otimes I_p,$$
$1_q$ denotes the $q\times1$ vector with all elements equal to one, and $Z_0$, $Z_1$, $Z_2$ are $(m_0p)\times(m_1p)$, $(m_1p)\times(m_2p)$, $(m_2p)\times p$ matrices, respectively. Denote
$$X_0\equiv Z_0,\quad X_1\equiv\mathrm{diag}\{\underbrace{1_{s_2s_3}\otimes I_p,\ldots,1_{s_2s_3}\otimes I_p}_{s_1}\}=Z_0Z_1,\quad X_2\equiv 1_{s_1s_2s_3}\otimes I_p=Z_0Z_1Z_2,$$
and $X_0$, $X_1$, $X_2$ are $(m_0p)\times(m_1p)$, $(m_0p)\times(m_2p)$, $(m_0p)\times p$ matrices, respectively. Thus, model (Equation8) can be summarized as
$$y=X_2\eta+X_1\alpha^*+X_0\beta^*+\epsilon,\tag{11}$$
where $\alpha^*\sim N_{m_2p}(0,I_{m_2}\otimes V_\alpha)$, $\beta^*\sim N_{m_1p}(0,I_{m_1}\otimes V_\beta)$ and $\epsilon\sim N_{m_0p}(0,I_{m_0}\otimes\Sigma_0)$, independently. By integrating out $(\alpha^*,\beta^*)$ and $(\alpha,\beta)$, respectively, the marginal distributions of $y$ for model (Equation11) and model (Equation10) are identical and of the form
$$(y\mid\eta,V_\alpha,V_\beta,\Sigma_0)\ \sim\ N_{m_0p}(X_2\eta,\ \Omega),\qquad \Omega=I_{m_0}\otimes\Sigma_0+X_0(I_{m_1}\otimes V_\beta)X_0^\top+X_1(I_{m_2}\otimes V_\alpha)X_1^\top.$$
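Since the design matrices in (Equation10) are Kronecker products, they can be written down in a few lines. The sketch below (variable names are ours) constructs $Z_0$, $Z_1$, $Z_2$ for a small choice of $(s_1,s_2,s_3,p)$ and confirms numerically that $X_1=Z_0Z_1$ and $X_2=Z_0Z_1Z_2$ have the stated closed forms.

```python
import numpy as np

s1, s2, s3, p = 2, 3, 4, 2     # schools, classes per school, students per class, courses

Z0 = np.kron(np.eye(s1 * s2), np.kron(np.ones((s3, 1)), np.eye(p)))   # (m0 p) x (m1 p)
Z1 = np.kron(np.eye(s1),      np.kron(np.ones((s2, 1)), np.eye(p)))   # (m1 p) x (m2 p)
Z2 = np.kron(np.ones((s1, 1)), np.eye(p))                              # (m2 p) x p

X1 = np.kron(np.eye(s1), np.kron(np.ones((s2 * s3, 1)), np.eye(p)))   # stated closed form
X2 = np.kron(np.ones((s1 * s2 * s3, 1)), np.eye(p))

assert np.allclose(Z0 @ Z1, X1)          # X_1 = Z_0 Z_1
assert np.allclose(Z0 @ Z1 @ Z2, X2)     # X_2 = Z_0 Z_1 Z_2
```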

Example 2.1 provides a simple illustration of how the hierarchical model and the mixed-effect model can be constructed from nested data, and the equivalence between the two models is also presented. In Appendix 1, we define a special LMM, which is a special case of model (Equation7) with $m_j=1$ for all $j\in[r]$; the theoretical investigation of this special LMM is distinct from that of the GHNL model. This special LMM could be common in applications, and we provide some theoretical results for interested readers. We still focus on the GHNL model in the following sections.

2.3. Priors on the hyperparameters

In order to implement a fully Bayesian analysis, we should specify hyperpriors on the parameters $(\eta,V)$. Following the recommendation of Berger et al. (Citation2020b), we assume priors on $(\eta,V)$ of the form
$$\pi(\eta)\propto\frac{1}{(1+\|\eta\|^2)^{(d-1)/2}},\quad \eta\in\mathbb{R}^d,\tag{12}$$
$$\pi(V_j)\propto\frac{1}{|V_j|^{1-1/(2k_j)}\prod_{s<t}(\omega_{js}-\omega_{jt})},\quad V_j>0,\ j\in[r],\tag{13}$$
where $\omega_{j1}>\omega_{j2}>\cdots>\omega_{jk_j}>0$ are the decreasingly ordered eigenvalues of $V_j$, $j\in[r]$. Apart from prior (Equation12), common choices of prior on $\eta$ include the constant prior and the conjugate prior. None of the three priors results in improper posteriors or difficulties in computation. However, among the three, prior (Equation12) is preferable from the perspective of admissibility for all dimensions $d$. Besides, as noted in Berger et al. (Citation2005), prior (Equation12) is a mixture-of-normal prior with the hierarchical structure
$$(\eta\mid\lambda)\ \sim\ N_d(0,\lambda I_d)\quad\text{and}\quad [\lambda]\propto\lambda^{-1/2}\exp\Big(-\frac{1}{2\lambda}\Big),\tag{14}$$
and such mixture-of-normal priors have shown great success in shrinkage estimation particularly (cf. Fourdrinier et al., Citation1998) and robust Bayesian estimation generally (cf. Berger, Citation1980). Therefore, prior (Equation12) was actually recommended by Berger et al. (Citation2005) for default use.

As for prior (Equation13) on the unknown covariance matrices $V_j$, $j\in[r]$, consider the transformation from $V_j$ to $\Omega_j=\mathrm{diag}(\omega_{j1},\ldots,\omega_{jk_j})$ and the orthogonal matrix $\Gamma_j$ of corresponding eigenvectors; the Jacobian is
$$\Big|\frac{\partial V_j}{\partial(\Omega_j,\Gamma_j)}\Big|=\prod_{s<t}(\omega_{js}-\omega_{jt}).\tag{15}$$
Consequently, the prior (Equation13) on $V_j$ becomes the prior density of $(\Omega_j,\Gamma_j)$,
$$\pi(\Omega_j,\Gamma_j)\propto\frac{1}{|\Omega_j|^{1-1/(2k_j)}},\tag{16}$$
with respect to Lebesgue measure on $(\omega_{j1},\ldots,\omega_{jk_j})$ and the invariant Haar measure over the space $\{\Gamma:\Gamma^\top\Gamma=I_{k_j}\}$. Note that the prior on $\Omega_j$ is improper and, independently, the prior on $\Gamma_j$ is constant. Use of a uniform prior for $\Gamma_j$, ranging over a compact space, is natural and non-controversial and has no influence on the eigenvalues. The term $\prod_{s<t}(\omega_{js}-\omega_{jt})$ is eliminated after changing variables for prior (Equation13). In contrast, the commonly used priors on a covariance matrix, such as the inverse Wishart, Jeffreys-rule and constant priors, contain the term $\prod_{s<t}(\omega_{js}-\omega_{jt})$ in the transformed space. This term gives low mass to close eigenvalues and hence effectively forces the eigenvalues apart, contrary to the common intuition, which would suggest choosing a prior that pushes the eigenvalues closer together. As a result, prior (Equation13) is essentially neutral as to expansion or shrinkage of the eigenvalues.
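For numerical work it is convenient to evaluate prior (Equation13) on the log scale through the eigenvalues of $V_j$; a minimal sketch (the function name is ours) follows.

```python
import numpy as np

def log_prior_V(V):
    """Log of the recommended prior (13) on a covariance matrix, up to an additive constant:
    pi(V) is proportional to 1 / ( |V|^{1 - 1/(2k)} * prod_{s<t} (omega_s - omega_t) )."""
    k = V.shape[0]
    omega = np.sort(np.linalg.eigvalsh(V))[::-1]       # decreasingly ordered eigenvalues
    log_det = np.sum(np.log(omega))
    log_vandermonde = sum(np.log(omega[s] - omega[t])
                          for s in range(k) for t in range(s + 1, k))
    return -(1.0 - 1.0 / (2 * k)) * log_det - log_vandermonde

print(log_prior_V(np.array([[2.0, 0.3], [0.3, 1.0]])))
```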

In the context of the 2-level hierarchical normal model, Theorem 1 of Berger et al. (Citation2020b) demonstrates that the combination of priors (Equation12) and (Equation13) on $(\eta,V)$ is on the boundary of admissibility, being as diffuse as possible without yielding inadmissible estimators. Furthermore, it is shown that the generalization allowing covariates at all levels of the hierarchical model does not affect the admissibility result (cf. Berger et al., Citation2020b). Nonetheless, the admissibility of the recommended prior for the $(r+1)$-level hierarchical model with $r\ge2$ is not clear. Generally speaking, this is a very difficult question to answer, and we mainly justify the recommended hyperpriors from the angles of posterior propriety and computation in the framework of the GHNL model.

3. Posterior propriety

Berger et al. (Citation2020b) showed that the resulting posterior of the recommended prior is proper for the 3-level hierarchical model (Equation2), but under a narrow set of assumptions. They also conjectured posterior propriety for a hierarchical model with any number of levels, a rigorous proof of which was not given. In this section, we comprehensively investigate the conditions for posterior propriety of the GHNL model (Equation4) using the recommended prior in more general situations. The dimension of $\eta$ affects the investigation of posterior propriety considerably, and the two cases $d\ge2$ and $d=1$ are discussed separately.

Based on (Equation5) and (Equation14), by integrating out $\eta$, we obtain the marginal distribution of $y$ conditioning on $(V,\lambda)$ as
$$(y\mid V,\lambda)\sim N_{m_0k_0}(0,\ \Delta+\lambda X_rX_r^\top).\tag{17}$$
Posterior propriety of the GHNL model (Equation4) employing priors (Equation13) and (Equation14) amounts to
$$m(y)=\int f(y\mid V,\lambda)\,\pi(\lambda)\prod_{s=1}^{r}\pi(V_s)\,d\lambda\prod_{t=1}^{r}dV_t<\infty,\tag{18}$$
where $m(y)$ denotes the marginal density of the observation vector. Next, we display some definitions and additional notations which are frequently used in this section.

More Notations Let $\mathrm{card}(A)$ denote the cardinality of a set $A$. For $0<I_1,I_2$, write $I_1\asymp I_2$ if there exist constants $0<C_1\le C_2$ such that $C_1I_2\le I_1\le C_2I_2$. For a symmetric matrix $A\in\mathbb{R}^{n\times n}$, $\lambda_i(A)$ represents the $i$-th largest eigenvalue of $A$, namely $\lambda_1(A)\ge\cdots\ge\lambda_n(A)$; let $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ denote the maximum and minimum eigenvalues of an arbitrary symmetric matrix $A$.

Definition

For convenience, let $\omega_{r+1,1}\equiv\lambda$, $m_{r+1}\equiv d$ and $k_{r+1}\equiv1$. Define, for an arbitrary non-empty set $E$, $F(E)\equiv\{D\mid D\subseteq E,\ D\neq\emptyset\}$, so that $F(E)$ denotes the set of all non-empty subsets of $E$. For any $R\in F([r+1])$, let $H(R)\equiv\{(j,l)\mid j\in R,\ l\in[k_j]\}$. Further define the composition of $F$ and $H$ as $S(R)\equiv(F\circ H)(R)=\{D\mid D\subseteq H(R),\ D\neq\emptyset\}$ for any $R\in F([r+1])$. Define
$$c_{jl,s}=1\{m_j(l-1)<s\le m_jl\},\tag{19}$$
for $j\in[r+1]$, $l\in[k_j]$ and $s\in[m_0k_0]$.

3.1. Two key lemmas

Before formally investigating posterior propriety, we first introduce two lemmas which are central to the proofs of the main theorems in this paper.

Lemma 3.1

Assume that the $A_j$ are $p_j\times p_j$ positive definite matrices, $j\in[r]$. Let the $X_j$ be $n\times p_j$ matrices of full column rank, $j\in[r]$. Define
$$H=\sum_{j=1}^{r}X_jA_jX_j^\top.\tag{20}$$
Then

  1. $\lambda_{\max}(H)\le C_1\sum_{j=1}^{r}\lambda_{\max}(A_j)$, where $C_1=\max_{j\in[r]}\lambda_{\max}(X_j^\top X_j)$.

  2. Also,
$$|H|\ge\Big(\frac{C_2}{r}\Big)^{n}\Big|\sum_{j=1}^{r}D_j\Big|,\tag{21}$$
where $C_2=\min_{j\in[r]}\lambda_{\min}(X_j^\top X_j)>0$. For any $j\in[r]$, $D_j=\mathrm{diag}(a_{j1},\ldots,a_{jn})$, where $a_{jk}=\lambda_k(A_j)$ for $k\in[p_j]$ and $a_{jk}=0$ for $p_j<k\le n$.

Lemma 3.1 mainly demonstrates two inequalities with respect to a sum of quadratic forms with matrix arguments. Since the $X_j$ have full column rank and the $A_j$ are positive definite, $n\ge p_j$ and $\mathrm{rank}(X_jA_jX_j^\top)=p_j$. It is worth noting that the non-zero diagonal elements of $D_j$ are the decreasingly ordered eigenvalues of $A_j$ in the lower bound of $|H|$, and this relation deeply influences the final result when we derive the sufficient condition for posterior propriety later. Besides, one can never find a constant $C>0$ such that $|H|_+\ge C\big|\sum_{j=1}^{r}D_j\big|_+$, where $|M|_+$ denotes the product of all non-zero eigenvalues of $M$. The proof of Lemma 3.1 can be found in Appendix A.2.

Lemma 3.2

Let $k$ be a positive integer. Suppose $M$ is a subset of $F([k])$ with cardinality $n$ and let $C_1,\ldots,C_n$ denote the elements of $M$. Define the integral
$$I=\int_{\Omega}\frac{1}{\big[\prod_{j=1}^{k}\lambda_j^{a_j}\big]\big[\prod_{i=1}^{n}\big(1+\sum_{r\in C_i}\lambda_r\big)^{b_i}\big]}\,d\lambda,$$
where the $a_j$, $j=1,2,\ldots,k$, and the $b_i$, $i=1,2,\ldots,n$, are real constants, and $\lambda=(\lambda_1,\ldots,\lambda_k)\in\Omega\equiv[0,\infty)^k$. Then the integral $I$ is finite if and only if (iff) the following two conditions are both satisfied.

  • $a_j<1$, $j\in[k]$;

  • the inequalities
$$\sum_{j\in D}a_j+\sum_{i\in G_D}b_i>\mathrm{card}(D)\tag{22}$$
hold for all $D\in F([k])$, where $G_D=\{i\mid D\cap C_i\neq\emptyset,\ i\in[n]\}$.

Here, the second condition of Lemma 3.2 may not be immediately transparent, so we use the following example to elaborate on how Lemma 3.2 can be employed.

Example 3.1

Consider the integral
$$I_0=\int_{\Omega}\frac{1}{\big[\prod_{j=1}^{3}\lambda_j^{a_j}\big](1+\lambda_1)^{b_1}(1+\lambda_1+\lambda_2)^{b_2}(1+\lambda_2+\lambda_3)^{b_3}}\,d\lambda_1\,d\lambda_2\,d\lambda_3,$$
where $a_1,a_2,a_3,b_1,b_2,b_3$ are all real constants. Following Lemma 3.2, we can take $M=\{\{1\},\{1,2\},\{2,3\}\}$. Then $I_0<\infty$ iff all the following inequalities hold: (a) $a_j<1$, $j\in[3]$; (b)
$$\begin{aligned}
&\text{for } D=\{1\}, && a_1+b_1+b_2>1;\\
&\text{for } D=\{2\}, && a_2+b_2+b_3>1;\\
&\text{for } D=\{3\}, && a_3+b_3>1;\\
&\text{for } D=\{1,2\}, && a_1+a_2+b_1+b_2+b_3>2;\\
&\text{for } D=\{1,3\}, && a_1+a_3+b_1+b_2+b_3>2;\\
&\text{for } D=\{2,3\}, && a_2+a_3+b_2+b_3>2;\\
&\text{for } D=\{1,2,3\}, && a_1+a_2+a_3+b_1+b_2+b_3>3.
\end{aligned}$$

Note that, no matter how $M$ is defined, we always need to check the inequalities corresponding to all $D\in F([k])$. Even though some inequalities turn out to be trivial once written down, for the sake of assurance we had better take all non-empty subsets of $[k]$ into account at the early stage. Lemma 3.2 plays a crucial role in obtaining the follow-up theorems; a detailed proof of Lemma 3.2 is given in Appendix A.2.
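The two conditions of Lemma 3.2 are mechanical to check by enumerating the non-empty subsets $D$; the sketch below (the function name is ours) does exactly that and can be applied to Example 3.1 with numeric exponents.

```python
from itertools import combinations

def lemma_3_2_finite(k, a, M, b):
    """Check the finiteness conditions of Lemma 3.2.

    k : number of lambda variables;  a : exponents a_1, ..., a_k;
    M : list of subsets C_1, ..., C_n of {1, ..., k};  b : exponents b_1, ..., b_n.
    """
    if any(aj >= 1 for aj in a):                        # first condition: a_j < 1
        return False
    for size in range(1, k + 1):                        # second condition: all non-empty D
        for D in combinations(range(1, k + 1), size):
            G_D = [i for i, Ci in enumerate(M) if set(D) & set(Ci)]
            if sum(a[j - 1] for j in D) + sum(b[i] for i in G_D) <= len(D):
                return False
    return True

# Example 3.1 with M = {{1}, {1,2}, {2,3}} and trial exponents:
print(lemma_3_2_finite(3, a=[0.5, 0.5, 0.5], M=[{1}, {1, 2}, {2, 3}], b=[1.0, 1.0, 1.0]))
```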

3.2. Conditions for the posterior to be proper when d2

In this subsection, the case $d\ge2$ is considered.

Theorem 3.1

Consider the GHNL model (Equation4) with priors (Equation12) and (Equation13) on $\eta$ and $V_i\in V$, respectively. When $d\ge2$, a sufficient condition for posterior propriety is given by
$$\sum_{s=1}^{m_0k_0}1\Big\{\sum_{(j,l)\in D}c_{jl,s}>0\Big\}>\sum_{(j,l)\in D}\frac{1}{k_j},\tag{23}$$
for any $D\in S([r+1])$.

Proof.

It follows from (Equation17) that the integrated likelihood of $(V,\lambda)$ after marginalizing out $(\theta_1,\ldots,\theta_r,\eta)$ is given by
$$L(V,\lambda;y)\propto\frac{1}{|\Delta+\lambda X_rX_r^\top|^{1/2}}\exp\Big\{-\frac{1}{2}y^\top(\Delta+\lambda X_rX_r^\top)^{-1}y\Big\}.\tag{24}$$
By dropping the exponential term involving $y$ (since it is less than one), we have
$$L(V,\lambda;y)<\frac{1}{|\Delta+\lambda X_rX_r^\top|^{1/2}}.$$
By applying Lemma 3.1 (b), we can further bound the integrated likelihood as
$$L(V,\lambda;y)\le\frac{C_1}{|M_1|^{1/2}},\tag{25}$$
where $C_1$ is a positive constant that depends only on the $X_j$ and
$$M_1=I_{m_0k_0}+\sum_{j=1}^{r+1}D_j\quad\text{and}\quad D_j=\begin{pmatrix}\Omega_j\otimes I_{m_j}&\\ &O_{q_j}\end{pmatrix}_{(m_0k_0)\times(m_0k_0)},\tag{26}$$
where the $\Omega_j$ are the diagonal matrices of the decreasingly ordered eigenvalues of $V_j$ for $j\in[r]$, $\Omega_{r+1}=(\omega_{r+1,1})$, the $O_{q_j}$ are $q_j\times q_j$ zero matrices and $q_j=m_0k_0-m_jk_j$ for $j\in[r+1]$. Since $\Omega_{r+1}=(\omega_{r+1,1})$ degenerates to the scalar $\lambda$, the prior on $\lambda$ becomes
$$\pi(\Omega_{r+1})\propto\frac{1}{|\Omega_{r+1}|^{1-1/(2k_{r+1})}}\exp\Big\{-\frac{1}{2}\mathrm{tr}(\Omega_{r+1}^{-1})\Big\}.\tag{27}$$
Combining (Equation16), (Equation18), (Equation27) and (Equation25), we have
$$m(y)\le C_1\int\frac{\exp\{-\mathrm{tr}(\Omega_{r+1}^{-1})/2\}\prod_{j=1}^{r+1}1\{\omega_{j1}>\cdots>\omega_{jk_j}>0\}}{|M_1|^{1/2}\prod_{j=1}^{r+1}|\Omega_j|^{1-\frac{1}{2k_j}}}\Big[\prod_{j=1}^{r}d\Omega_j\,d\Gamma_j\Big]d\Omega_{r+1}<C_1\int\frac{1}{|M_1|^{1/2}\prod_{j=1}^{r+1}|\Omega_j|^{1-\frac{1}{2k_j}}}\prod_{j=1}^{r+1}d\Omega_j\equiv C_1I_0.$$
The definition of $c_{jl,s}$ in (Equation19) yields
$$|M_1|=\prod_{s=1}^{m_0k_0}\Big(1+\sum_{j=1}^{r+1}\sum_{l=1}^{k_j}c_{jl,s}\omega_{jl}\Big).$$
Therefore,
$$I_0=\int\frac{\prod_{j=1}^{r+1}d\Omega_j}{\prod_{j=1}^{r+1}|\Omega_j|^{1-\frac{1}{2k_j}}\prod_{s=1}^{m_0k_0}\Big(1+\sum_{j=1}^{r+1}\sum_{l=1}^{k_j}c_{jl,s}\omega_{jl}\Big)^{\frac{1}{2}}},$$
which is finite iff
$$\sum_{(j,l)\in D}\Big(1-\frac{1}{2k_j}\Big)+\frac{1}{2}\sum_{s=1}^{m_0k_0}1\Big\{\sum_{(j,l)\in D}c_{jl,s}>0\Big\}>\mathrm{card}(D)\tag{28}$$
for any $D\in S([r+1])$, by Lemma 3.2. It is obvious that inequality (Equation28) is equivalent to that in (Equation23).

Applying Lemma 3.1 yields the upper bound $C_1|M_1|^{-1/2}$ on the integrated likelihood of the hyperparameters, where $M_1$ is the $(m_0k_0)\times(m_0k_0)$ matrix defined in (Equation26). Therefore, the special notation $c_{jl,s}$ can be understood as the indicator of whether the eigenvalue $\omega_{jl}$ appears in the $s$-th diagonal element of $M_1$, for $j\in[r+1]$, $l\in[k_j]$ and $s\in[m_0k_0]$. At the same time, the left-hand side of inequality (Equation23) is the cardinality of the set $\{s\mid\exists(j,l)\in D\text{ such that }c_{jl,s}>0\}$ for any $D\in S([r+1])$.

The cardinality of $S([r+1])$ is $2^{\sum_{j\in[r+1]}k_j}-1$, which means that the total number of inequalities to be checked in (Equation23) grows exponentially with $r$ and the dimensions $k_j$. Admittedly, this imposes a considerably heavy computational burden if Theorem 3.1 is applied directly. Nevertheless, researchers need not worry about this burden, because most of the inequalities in Theorem 3.1 are trivial. To make this point precise, we have the following corollary.

Corollary 3.1

Recursively define
$$R_1=\{j\mid m_j\le r+1,\ j\in[r+1]\};\quad R_2=\{j\mid m_j\le\mathrm{card}(R_1),\ j\in R_1\};\quad\ldots;\quad R_p=\{j\mid m_j\le\mathrm{card}(R_{p-1}),\ j\in R_{p-1}\},$$
where $p$ is the smallest positive integer $i$ such that $\{j\mid m_j>\mathrm{card}(R_i),\ j\in R_i\}=\emptyset$. We call the levels within $R_p$ kernel levels and denote $R_p$ by $R_{\ker}$. Inequality (Equation23) holds for any $D\in S([r+1])$ iff inequality (Equation23) holds for any $D\in S(R_{\ker})$. Consequently, if $R_{\ker}=\emptyset$, then the posterior is always proper.

Proof.

Let $R_1^c=[r+1]\setminus R_1$; thus $m_j>r+1$ for $j\in R_1^c$ by the definition of $R_1$. For any $D\in S([r+1])$, if there exists $j\in R_1^c$ such that $(j,l)\in D$ for some $l\in[k_j]$, then
$$\sum_{s=1}^{m_0k_0}1\Big\{\sum_{(j,l)\in D}c_{jl,s}>0\Big\}\ge\max_{(j,l)\in D}m_j>r+1=\max_{D\in S([r+1])}\sum_{(j,l)\in D}\frac{1}{k_j},$$
so that the corresponding inequality holds trivially. As a result, inequality (Equation23) holds for any $D\in S([r+1])$ iff inequality (Equation23) holds for any $D\in S(R_1)$. Since
$$\max_{D\in S(R_i)}\sum_{(j,l)\in D}\frac{1}{k_j}=\mathrm{card}(R_i),$$
it can be shown recursively that inequality (Equation23) holds for any $D\in S([r+1])$ iff inequality (Equation23) holds for any $D\in S(R_i)$ for $i\in[p]$, where $p$ is the smallest positive integer $i$ such that $R_i\setminus R_{i+1}=\emptyset$.

By the technique of extracting kernel levels, we dramatically narrow down the checking region for posterior propriety: we only need to check the inequalities for the levels within $R_{\ker}$, which reduces the number of inequalities from $2^{\sum_{j\in[r+1]}k_j}-1$ to $2^{\sum_{j\in R_{\ker}}k_j}-1$ (a code sketch of the recursion is given after the list below). Moreover, Corollary 3.1 also indicates two interesting conclusions, as follows.

  1. First, it reveals how three factors, the number of levels, the number of units in each level and the dimensions of the levels, jointly affect posterior propriety. Roughly speaking, in the context of the GHNL model, the more levels with fewer units of lower dimension, the less likely the posterior is to be proper. For example, if $m_{r-2}=m_{r-1}=m_r=2$ and $k_{r-2}=k_{r-1}=k_r=1$, then for $D=\{(j,1)\mid j=r-2,r-1,r\}$ we have
$$2=\sum_{s=1}^{m_0k_0}1\Big\{\sum_{(j,l)\in D}c_{jl,s}>0\Big\}<\sum_{(j,l)\in D}\frac{1}{k_j}=3.\tag{29}$$
Therefore, posterior propriety can hardly be guaranteed by Theorem 3.1. Conversely, if the units in each level are numerous enough that the set of kernel levels is empty, i.e., $R_{\ker}=\emptyset$, then the posterior is always proper. As a consequence, more attention should be focused on the levels with small numbers of units, namely the kernel levels.

  2. Second, the recommendation of this prior for use at any level in hierarchical modelling is further justified from the aspects of posterior propriety and ease of implementation. For instance, suppose we switch the prior on $V_j$, $j\in[r]$, from (Equation13) to
$$\pi(V_j)\propto\frac{1}{|V_j|^{1-a}\prod_{1\le s<t\le k_j}(\omega_{js}-\omega_{jt})},\quad V_j>0,\ j\in[r],\tag{30}$$
where $0<a\le1$ ($a$ has to be larger than zero by Lemma 3.2). Then the condition in Theorem 3.1 becomes
$$\sum_{s=1}^{m_0k_0}1\Big\{\sum_{(j,l)\in D}c_{jl,s}>0\Big\}>2a\times\mathrm{card}(D),\quad D\in S([r+1]).\tag{31}$$
On the one hand, when $a$ is greater than but close to zero (denoted $a\downarrow0$), all the inequalities in (Equation31) hold, so the posterior is always proper. However, it is impractical to decide how small $a$ should be and to find one fixed value that fits all levels. On the other hand, when $a\uparrow1$, meaning that $a$ is less than but close to 1, then $2a\,\mathrm{card}(D)>\sum_{(j,l)\in D}\frac{1}{k_j}$, so inequality (Equation31) is harder to satisfy than inequality (Equation23), especially for large dimensions $k_j$. Therefore, the posterior using prior (Equation30) is less likely to be proper than that using prior (Equation13). Similar to Corollary 3.1, we can recursively define
$$R_1=\{j\mid m_j\le2a\times\mathrm{card}(E_0),\ j\in[r+1]\};\quad R_2=\{j\mid m_j\le2a\times\mathrm{card}(E_1),\ j\in R_1\};\quad\ldots;\quad R_p=\{j\mid m_j\le2a\times\mathrm{card}(E_{p-1}),\ j\in R_{p-1}\},$$
where $E_0=H([r+1])$, $E_i=H(R_i)$ for $i\in[p]$, and $p$ is the smallest positive integer $l$ such that $\{j\mid m_j>2a\,\mathrm{card}(E_l),\ j\in R_l\}=\emptyset$. Then the posterior using prior (Equation30) is proper if (Equation31) holds for any $D\in S(R_p)$. When $a\uparrow1$, $\mathrm{card}(R_p)$ will be remarkably larger than $\mathrm{card}(R_{\ker})$ for large values of the $k_j$, imposing a dramatically heavier burden of checking inequalities than that for prior (Equation13). Above all, one sensible choice is to let $a$ be inversely proportional to $k_j$ at level $j$, which takes both practical and theoretical considerations into account.
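Returning to Corollary 3.1, the recursion defining $R_{\ker}$ is straightforward to mechanize. The sketch below (the function name is ours) extracts the kernel levels from the unit counts $m_1,\ldots,m_{r+1}$, with $m_{r+1}=d$.

```python
def kernel_levels(m):
    """Kernel levels R_ker of Corollary 3.1; m maps level j in {1, ..., r+1} to m_j."""
    R = {j for j, mj in m.items() if mj <= len(m)}      # R_1 = {j : m_j <= r+1}
    while True:
        R_next = {j for j in R if m[j] <= len(R)}       # R_{i+1} = {j in R_i : m_j <= card(R_i)}
        if R_next == R:                                  # recursion has stabilized
            return R
        R = R_next

# School example of Section 2.2 with s1 = 2 schools, 2 classes, 2 students, p = 2 courses:
# m_1 = s1*s2 = 4, m_2 = s1 = 2, m_3 = d = p = 2.
print(kernel_levels({1: 4, 2: 2, 3: 2}))    # {2, 3}: levels 2 and 3 are the kernel levels
```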

The upper bound $\mathrm{card}(R_i)$ on $\sum_{(j,l)\in D}\frac{1}{k_j}$ for $D\in S(R_i)$ leads to the effective way of extracting kernel levels presented in Corollary 3.1, but this bound is still rather rough. Next, a sharper upper bound on $\sum_{(j,l)\in D}\frac{1}{k_j}$ is demonstrated, and a sufficient condition of clean form for posterior propriety is then derived.

Theorem 3.2

Consider the GHNL model (Equation4) with priors (Equation12) and (Equation13) on $\eta$ and $V_i\in V$, respectively. Denote $m=\min_{j\in[r+1]}m_j=\min_{j\in R_{\ker}}m_j$. When $d\ge2$, the posterior is always proper if
$$\sum_{j\in R_{\ker}}\frac{1}{k_j}<m.\tag{32}$$

Proof.

For $D\in S([r+1])$, define
$$L(D)=\sum_{s=1}^{m_0k_0}1\Big\{\sum_{(j,l)\in D}c_{jl,s}>0\Big\}\quad\text{and}\quad R(D)=\{j\mid(j,l)\in D\}.\tag{33}$$
It follows from Corollary 3.1 that we only need to prove $\sum_{(j,l)\in D}\frac{1}{k_j}<L(D)$ for all $D\in S(R_{\ker})$. For any $D$ belonging to $S(R_{\ker})$, we have
$$\sum_{(j,l)\in D}\frac{1}{k_j}=\sum_{j\in R(D)}\frac{1}{k_j}\sum_{l=1}^{k_j}1\{(j,l)\in D\}\le\Big(\sum_{j\in R(D)}\frac{1}{k_j}\Big)\max_{j\in R(D)}\sum_{l=1}^{k_j}1\{(j,l)\in D\}.\tag{34}$$
Distinct eigenvalues from the same level never occur in the same row of the matrix $M_1$; mathematically, $c_{jl_1,s}c_{jl_2,s}=0$ for $1\le l_1<l_2\le k_j$, $j\in[r]$, $s\in[m_0k_0]$. Thus, for any $j\in[r]$,
$$\sum_{l=1}^{k_j}1\{(j,l)\in D\}\le\frac{1}{m_j}L(D),\tag{35}$$
which is also true for $j=r+1$ since $k_{r+1}=1$. It is obvious that $\min_{j\in[r+1]}m_j=\min_{j\in R_{\ker}}m_j$ by the definition of $R_{\ker}$, and this common value is denoted by $m$. Combining (Equation35) with (Equation34) yields
$$\sum_{(j,l)\in D}\frac{1}{k_j}\le\Big(\sum_{j\in R(D)}\frac{1}{k_j}\Big)\frac{L(D)}{\min_{j\in R(D)}m_j}\le\frac{1}{m}\Big(\sum_{j\in R_{\ker}}\frac{1}{k_j}\Big)L(D)<L(D).\tag{36}$$

From the proof of Theorem 3.2, it can be deduced that $\sum_{j\in[r+1]}\frac{1}{k_j}<m$ is also sufficient for posterior propriety; obviously, the condition in Theorem 3.2 is easier to satisfy. Theorem 3.2 reveals that, for fixed $m$, the posterior is more likely to be proper for higher dimensions of the units in the kernel levels. Theorem 3.2 also provides researchers with a powerful tool to check posterior propriety quickly.

Remark 3.1

Consider model (Equation4) with $r=1$, namely a two-level hierarchical model. When $d\ge2$, we have $m=\min\{d,m_1\}\ge2$. Then the posterior using the recommended prior is always proper for $k_1\ge2$ by Theorem 3.2.

Example 3.2

Continue with Example 2.1

Consider the GHNL modelling of the mixed-effect ANOVA as in (Equation10), which is a 3-level hierarchical model with $r=2$, $m_0=s_1s_2s_3$, $m_1=s_1s_2$, $m_2=s_1$, $m_3=p$, $k_0=k_1=k_2=p$ and $k_3=1$. It is natural to assume that we have at least two schools, each school has at least two classes and each class has at least two students, namely $s_1\ge2$, $s_2\ge2$ and $s_3\ge2$. If (a) $p>2$ and $s_1\ge2$ or (b) $p\ge2$ and $s_1>2$ holds, it can readily be derived that the set of kernel levels is empty; thus, the posterior is always proper according to Corollary 3.1. When $s_1=p=2$, the set of kernel levels is $R_2=\{2,3\}$; since $\frac{1}{k_2}+\frac{1}{k_3}<2$, the posterior is proper by applying Theorem 3.2. In conclusion, the posterior using the recommended prior is always proper when $p\ge2$.

In Berger et al. (Citation2020b)'s work, for a technical reason, they assumed $k=sp$ for the 3-level hierarchical normal model (Equation2), so that the design matrices for units within levels 2 and 3 are square matrices. They eventually concluded that the posterior employing the recommended prior in model (Equation2) is always proper for $k=sp$ and $p\ge2$ ($p$ being the dimension of the hypermean in model (Equation2)). However, the assumption that the design matrices for the units at higher levels are square appears unnatural and hard to interpret in practice. Nevertheless, we generalize Berger et al. (Citation2020b)'s result to the GHNL model below so as to draw a conclusion consistent with theirs.

Corollary 3.2

Consider the GHNL model (Equation4) with priors (Equation12) and (Equation13) on $\eta$ and $V_i\in V$, respectively. Assume that $d\ge2$ and $k_j=m_{j+1}k_{j+1}$, $j\in[r]$. Then the posterior is always proper.

Proof.

Since $m_j\ge2$, $j\in[r+1]$, we have $m\ge2$. By Theorem 3.2 it remains to show that $\sum_{j\in R_{\ker}}\frac{1}{k_j}<2$. Utilizing the condition $k_j=m_{j+1}k_{j+1}$ for $j\in[r]$ and $k_{r+1}=1$, we have $k_j=\prod_{s=j+1}^{r+1}m_s\ge2^{r+1-j}$ for $j\in[r]$. Thus,
$$\sum_{j\in R_{\ker}}\frac{1}{k_j}\le\sum_{j\in[r+1]}\frac{1}{k_j}\le\sum_{j\in[r+1]}\frac{1}{2^{r+1-j}}=2-\frac{1}{2^r}<2,$$
which completes the proof.

In summary, a general procedure for checking the posterior propriety of the GHNL model (Equation4) employing the recommended prior for $d\ge2$ can be summarized as follows.

Guidance for checking posterior propriety when $d\ge2$ (a code sketch implementing steps (b) and (c) follows the list):

  1. If the design matrices for each unit in each level are square matrices, then the posterior is proper; otherwise, turn to (b).

  2. Derive the set of kernel levels, $R_{\ker}$. If $R_{\ker}=\emptyset$ or inequality (Equation32) holds, the posterior is proper. If neither holds, turn to (c).

  3. Check inequality (Equation23) for all $D$ belonging to $S(R_{\ker})$. If it always holds, the posterior is proper. If not, posterior propriety can hardly be guaranteed.
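Steps (b) and (c) of the guidance can be combined into a single routine. The sketch below (function names are ours) re-implements the kernel-level recursion, tests inequality (Equation32) first, and otherwise enumerates condition (Equation23) over $S(R_{\ker})$ using the indicators $c_{jl,s}$ of (Equation19).

```python
from fractions import Fraction
from itertools import chain, combinations

def kernel_levels(m):
    """Kernel levels R_ker of Corollary 3.1; m maps level j in {1, ..., r+1} to m_j."""
    R = {j for j, mj in m.items() if mj <= len(m)}
    while True:
        R_next = {j for j in R if m[j] <= len(R)}
        if R_next == R:
            return R
        R = R_next

def posterior_proper_d_ge_2(m, k):
    """Guidance for d >= 2.  m, k are dicts over levels j = 0, 1, ..., r+1,
    with m[r+1] = d and k[r+1] = 1.  Returns True when propriety is guaranteed."""
    levels = [j for j in m if j >= 1]
    R_ker = kernel_levels({j: m[j] for j in levels})            # step (b)
    if not R_ker:
        return True
    if sum(Fraction(1, k[j]) for j in R_ker) < min(m[j] for j in levels):   # inequality (32)
        return True
    n = m[0] * k[0]
    H = [(j, l) for j in R_ker for l in range(1, k[j] + 1)]
    subsets = chain.from_iterable(combinations(H, sz) for sz in range(1, len(H) + 1))
    for D in subsets:                                            # step (c): inequality (23)
        lhs = sum(1 for s in range(1, n + 1)
                  if any(m[j] * (l - 1) < s <= m[j] * l for (j, l) in D))
        if lhs <= sum(Fraction(1, k[j]) for (j, l) in D):
            return False
    return True

# The s1 = p = 2 case of the school example (s2 = s3 = 2): proper, via inequality (32).
print(posterior_proper_d_ge_2(m={0: 8, 1: 4, 2: 2, 3: 2}, k={0: 2, 1: 2, 2: 2, 3: 1}))
```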

3.3. Conditions for the posterior to be proper when d = 1

It is quite common in practice that the dimension of the fixed effect $\eta$ is one. However, when $d=1$, note that
$$1=m_{r+1}=\sum_{s=1}^{m_0k_0}1\Big\{\sum_{(j,l)\in D}c_{jl,s}>0\Big\}=\sum_{(j,l)\in D}\frac{1}{k_j}=\frac{1}{k_{r+1}}=1$$
for $D=\{(r+1,1)\}$, so the sufficient condition in Theorem 3.1 fails. Therefore, in this subsection we reinvestigate the conditions for the posterior to be proper when $d=1$.

Theorem 3.3

Consider the GHNL model (Equation4) with the constant prior on $\eta$ and prior (Equation13) on $V_j\in V$. When $d=1$, the posterior is proper if
$$\sum_{s=1}^{m_0k_0}1\Big\{\sum_{(j,l)\in D}c_{jl,s}>0\Big\}>1\{\exists\,j\in[r],\ (j,1)\in D\}+\sum_{(j,l)\in D}\frac{1}{k_j}\tag{37}$$
holds for all $D\in S([r])$.

Proof.

When $d=1$, the vector $\eta$ degenerates into a scalar $\eta$, and the prior (Equation12) on $\eta$ becomes a constant prior. Based on (Equation5), integrating out $\eta$ and dropping the exponential term (which is less than one) gives the upper bound on the integrated likelihood of $V$:
$$L(V)<\frac{1}{|\Delta|^{\frac12}|X_r^\top\Delta^{-1}X_r|^{\frac12}}.$$
Since $|X_r^\top\Delta^{-1}X_r|\ge X_r^\top X_r\,\lambda_{\min}(\Delta^{-1})=X_r^\top X_r\,\lambda_{\max}(\Delta)^{-1}$, using Lemma 3.1 (a) and (b), we have
$$L(V)<\frac{C_0}{|\Delta|^{\frac12}}\Big(1+\sum_{j=1}^{r}\omega_{j1}\Big)^{\frac12}\le C_1\Big(1+\sum_{j=1}^{r}\omega_{j1}\Big)^{\frac12}|M_2|^{-\frac12},$$
where $C_0$ and $C_1$ are constants independent of the $\Omega_j$ for $j\in[r]$, $M_2=I_{m_0k_0}+\sum_{j=1}^{r}D_j$ and the $D_j$ are defined in (Equation26). Similar to the proof of Theorem 3.1, we can derive the upper bound on $m(y)$ as
$$m(y)\le C_1\int\frac{\big(1+\sum_{j=1}^{r}\omega_{j1}\big)^{\frac12}}{|M_2|^{1/2}\prod_{j=1}^{r}|\Omega_j|^{1-\frac{1}{2k_j}}}\prod_{j=1}^{r}d\Omega_j\equiv C_1I_1.$$
It follows from the definition of $c_{jl,s}$ in (Equation19) that
$$|M_2|=\prod_{s=1}^{m_0k_0}\Big(1+\sum_{j=1}^{r}\sum_{l=1}^{k_j}c_{jl,s}\omega_{jl}\Big).$$
Thus,
$$I_1=\int\frac{\big(1+\sum_{j=1}^{r}\omega_{j1}\big)^{\frac12}\prod_{j=1}^{r}d\Omega_j}{\prod_{j=1}^{r}|\Omega_j|^{1-\frac{1}{2k_j}}\prod_{s=1}^{m_0k_0}\Big(1+\sum_{j=1}^{r}\sum_{l=1}^{k_j}c_{jl,s}\omega_{jl}\Big)^{\frac12}},$$
which is finite iff
$$\sum_{(j,l)\in D}\Big(1-\frac{1}{2k_j}\Big)+\frac{1}{2}\sum_{s=1}^{m_0k_0}1\Big\{\sum_{(j,l)\in D}c_{jl,s}>0\Big\}-\frac{1}{2}1\{\exists\,j\in[r],\ (j,1)\in D\}>\mathrm{card}(D)\tag{38}$$
for any $D\in S([r])$, by Lemma 3.2. It is obvious that inequality (Equation38) is equivalent to that in (Equation37).

Resembling the interpretation of Theorem 3.1, the left-hand side of inequality (Equation37) is the cardinality of the set $\{s\mid\exists(j,l)\in D\text{ such that }c_{jl,s}>0\}$ for any $D\in S([r])$. To reduce the burden of checking inequalities, we have the following corollary.

Corollary 3.3

Recursively define
$$\tilde R_1=\{j\mid m_j\le r+1,\ j\in[r]\};\quad \tilde R_2=\{j\mid m_j\le\mathrm{card}(\tilde R_1)+1,\ j\in\tilde R_1\};\quad\ldots;\quad \tilde R_q=\{j\mid m_j\le\mathrm{card}(\tilde R_{q-1})+1,\ j\in\tilde R_{q-1}\},$$
where $q$ is the smallest positive integer $i$ such that $\{j\mid m_j>\mathrm{card}(\tilde R_i)+1,\ j\in\tilde R_i\}=\emptyset$. We call the levels within $\tilde R_q$ kernel levels and denote $\tilde R_q$ by $\tilde R_{\ker}$. Inequality (Equation37) holds for any $D\in S([r])$ iff inequality (Equation37) holds for any $D\in S(\tilde R_{\ker})$. Consequently, if $\tilde R_{\ker}=\emptyset$, then the resulting posterior is always proper.

In the process of extracting kernel levels, the thresholds on $m_j$ used to split up the levels are increased by one in Corollary 3.3 compared with Corollary 3.1, because the upper bound on the right-hand side of inequality (Equation37) is one larger than that of inequality (Equation23). Apart from this point, the proof of Corollary 3.3 is the same as that of Corollary 3.1. For good measure, a simple tool to check posterior propriety is given next, the counterpart of Theorem 3.2 for $d=1$.

Theorem 3.4

Consider the GHNL model (Equation4) with the constant prior on $\eta$ and prior (Equation13) on $V_j\in V$. When $d=1$, the posterior is always proper if
$$\sum_{j\in\tilde R_{\ker}}\frac{1}{k_j}<m-1,\tag{39}$$
where $m=\min_{j\in[r]}m_j$ and $\tilde R_{\ker}$ is the derived set of kernel levels.

Proof.

For any $D\in S([r])$, define $L(D)$ and $R(D)$ in the same way as in (Equation33). According to Corollary 3.3, it suffices to show that
$$L(D)>\sum_{(j,l)\in D}\frac{1}{k_j}+1$$
for any $D\in S(\tilde R_{\ker})$. Similar to (Equation36), we have
$$\sum_{(j,l)\in D}\frac{1}{k_j}\le\frac{1}{m}\Big(\sum_{j\in R(D)}\frac{1}{k_j}\Big)L(D)\le\frac{1}{m}\Big(\sum_{j\in\tilde R_{\ker}}\frac{1}{k_j}\Big)L(D)<L(D)-\frac{L(D)}{m},$$
for any $D\in S(\tilde R_{\ker})$. Since $L(D)\ge m$ always holds, the proof is complete.

Remark 3.2

In model (Equation4), when $r=1$ and $d=1$, suppose $m_1\ge2$ and $k_1\ge2$. The posterior using the recommended prior is always proper by Theorem 3.4.

Remark 3.3

If all the $k_j$ are equal to one, the sufficient condition in Theorem 3.3 simplifies to
$$\mathrm{card}(D)<\max_{j\in D}m_j-1,$$
where $D\in F([r])$. By employing the technique of extracting kernel levels, Theorem 3.4 is then equivalent to Theorem 3.3, rather than being a mere sufficient condition.

Example 3.3

Continue with Example 2.1

Consider model (Equation10), assuming that $s_1\ge2$, $s_2\ge2$ and $s_3\ge2$. If $p=1$, we have $k_0=k_1=k_2=k_3=m_3=1$. When $s_1>2$, the set of kernel levels is easily seen to be empty; thus the posterior is always proper by Corollary 3.3. If $s_1=2$, inequality (Equation37) fails, and posterior propriety can hardly be guaranteed. Consequently, when $p=1$, the posterior using the recommended prior is always proper for $s_1>2$.

Next, we generalize Berger et al. (Citation2020b)'s result to the GHNL models for $d=1$, assuming that the design matrices $Z_{ji_j}$ for the units are square matrices.

Corollary 3.4

When $d=1$, consider the same model and prior as in Theorem 3.3. Suppose $m_j\ge3$ and $k_j=m_{j+1}k_{j+1}$, $j\in[r]$. Then the posterior is always proper.

Proof.

It follows from Theorem 3.4 that we only need to show $\sum_{j\in[r]}\frac{1}{k_j}<m-1$. By the conditions $k_j=m_{j+1}k_{j+1}$, $j\in[r]$, and $m_{r+1}=k_{r+1}=1$, we have $k_j=\prod_{s=j+1}^{r}m_s$, $j\in[r]$. Thus,
$$\sum_{j\in[r]}\frac{1}{k_j}\le\sum_{j\in[r]}\frac{1}{3^{r-j}}=\frac{3}{2}\Big(1-\frac{1}{3^r}\Big)<\frac{3}{2}<2\le m-1.$$

Summing up the theoretical results above, a general procedure for checking the posterior propriety of the GHNL model (Equation4) employing the recommended prior for $d=1$ is as follows.

Guidance for checking the posterior propriety when d = 1:

  1. If the design matrices for each unit in each level are square matrices and $m_j\ge3$, $j\in[r]$, then the posterior is proper; otherwise, turn to (b).

  2. Derive the set of kernel levels $\tilde R_{\ker}$. If $\tilde R_{\ker}=\emptyset$ or inequality (Equation39) holds, then the posterior is proper. If neither holds, turn to (c).

  3. Check inequality (Equation37) for all $D$ belonging to $S(\tilde R_{\ker})$. If it always holds, then the posterior is proper. If not, posterior propriety can hardly be guaranteed.

4. Computation

In this section, we consider MCMC sampling from the posterior arising from the model in Section 2. For the GHNL model (Equation4) with priors (Equation14) and (Equation13) on $\eta$ and $V$, respectively, the joint posterior of $(\Theta,V,\eta,\lambda)$ can be written as
$$\pi(\Theta,V,\eta,\lambda\mid y)\propto f(y\mid\theta_1)\prod_{j=1}^{r-1}f(\theta_j\mid\theta_{j+1},V_j)\,f(\theta_r\mid\eta,V_r)\times\prod_{s=1}^{r}\pi(V_s)\,\pi(\eta\mid\lambda)\,\pi(\lambda).\tag{40}$$
Sampling $(\Theta,V,\eta,\lambda)$ from the posterior density (Equation40) can be handled by a Gibbs sampler. The main computational difficulty is to sample the covariance matrices $V_j$ efficiently.

4.1. Gibbs sampling for input effects

The full conditionals of the input effects (Θ,η) can be derived from the joint posterior (Equation40) and are illustrated as follows.

  1. Conditioning on $\theta_2$ and $V_1$, the posterior distribution of $\theta_1$ is
$$(\theta_1\mid\theta_2,V_1;y)\sim N_{m_1k_1}(\tilde\theta_1,\tilde V_1),\tag{41}$$
where $\tilde V_1=(Z_0^\top\Sigma^{-1}Z_0+I_{m_1}\otimes V_1^{-1})^{-1}$ and $\tilde\theta_1=\tilde V_1[Z_0^\top\Sigma^{-1}y+(I_{m_1}\otimes V_1^{-1})Z_1\theta_2]$.

  2. The full conditional posteriors of $\theta_j$, $j=2,\ldots,r$, have the form
$$(\theta_j\mid\theta_{j-1},\theta_{j+1},V_{j-1},V_j)\sim N_{m_jk_j}(\tilde\theta_j,\tilde V_j),\tag{42}$$
where $\tilde V_j=[Z_{j-1}^\top(I_{m_{j-1}}\otimes V_{j-1}^{-1})Z_{j-1}+I_{m_j}\otimes V_j^{-1}]^{-1}$, $\tilde\theta_j=\tilde V_j[Z_{j-1}^\top(I_{m_{j-1}}\otimes V_{j-1}^{-1})\theta_{j-1}+(I_{m_j}\otimes V_j^{-1})Z_j\theta_{j+1}]$, and $\theta_{r+1}\equiv\eta$.

  3. Using (Equation14), the full conditional of $\eta$ can be derived as
$$(\eta\mid\theta_r,\lambda,V_r)\sim N_d(\tilde\eta,\tilde V_\eta),\tag{43}$$
where $\tilde V_\eta=[Z_r^\top(I_{m_r}\otimes V_r^{-1})Z_r+\lambda^{-1}I_d]^{-1}$ and $\tilde\eta=\tilde V_\eta Z_r^\top(I_{m_r}\otimes V_r^{-1})\theta_r$.

The input effects $\theta_j\in\Theta$ and $\eta$ can be readily sampled from these conditionals during the Gibbs sampling procedure, as their full conditional posterior distributions are all standard distributions.
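A direct transcription of the conditionals (Equation41)-(Equation42) is straightforward in numpy; the sketch below (function and argument names are ours) performs one draw of $\theta_j$, with $\theta_{j-1}$ replaced by $y$ and $V_{j-1}$ by $\Sigma$ when $j=1$, and $\theta_{j+1}$ replaced by $\eta$ when $j=r$.

```python
import numpy as np

def draw_theta_j(theta_prev, theta_next, V_prev, V_j, Z_prev, Z_j, rng):
    """One Gibbs draw of theta_j from its full conditional (42).

    theta_prev : theta_{j-1} (or y when j = 1);  theta_next : theta_{j+1} (or eta when j = r);
    V_prev     : V_{j-1} (or Sigma when j = 1);  V_j : V_j;
    Z_prev     : Z_{j-1};                        Z_j : Z_j.
    """
    m_prev = Z_prev.shape[0] // V_prev.shape[0]
    m_j = Z_prev.shape[1] // V_j.shape[0]
    A = np.kron(np.eye(m_prev), np.linalg.inv(V_prev))     # I_{m_{j-1}} (x) V_{j-1}^{-1}
    B = np.kron(np.eye(m_j), np.linalg.inv(V_j))            # I_{m_j}     (x) V_j^{-1}
    cov = np.linalg.inv(Z_prev.T @ A @ Z_prev + B)           # tilde V_j
    mean = cov @ (Z_prev.T @ A @ theta_prev + B @ Z_j @ theta_next)
    return rng.multivariate_normal(mean, cov)
```

The draw of $\eta$ in (Equation43) has the same form, with $Z_r$, $I_{m_r}\otimes V_r^{-1}$ and $\lambda^{-1}I_d$ supplying the two precision blocks.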

4.2. Gibbs sampling for variance components

The variance components which include Vj's and λ can be updated from their full conditionals, and these conditionals have densities as follows.

  1. Given $\eta$, the conditional posterior density of $\lambda$ is
$$\pi(\lambda\mid\eta)\propto\lambda^{-\frac{d+1}{2}}\exp\Big\{-\frac{1+\|\eta\|^2}{2\lambda}\Big\},\tag{44}$$
which is an inverse gamma distribution, $\mathrm{IG}\big(\frac{d-1}{2},\frac{1+\|\eta\|^2}{2}\big)$.

  2. For $j\in[r]$, define $t_j=\frac{m_j}{2}+1-\frac{1}{2k_j}$. The full conditional posterior density of $V_j$ given $(\theta_j,\theta_{j+1})$ is
$$\pi(V_j\mid\theta_j,\theta_{j+1})\propto\frac{1}{|V_j|^{t_j}\prod_{1\le s<t\le k_j}(\omega_{js}-\omega_{jt})}\,\mathrm{etr}\Big\{-\frac{1}{2}V_j^{-1}H_j\Big\},\tag{45}$$
where $\mathrm{etr}(A)$ denotes $\exp(\mathrm{tr}(A))$ for a square matrix $A$, and $H_j\equiv H_j(\theta_j,\theta_{j+1})=\sum_{i=1}^{m_j}(\theta_{ji}-Z_{ji}\theta_{j+1})(\theta_{ji}-Z_{ji}\theta_{j+1})^\top$.

Updating $\lambda$ is simply carried out by sampling from an inverse gamma distribution. The full conditional posteriors of the $V_j$ in (Equation45) belong to a recently proposed class of distributions for covariance matrices, introduced by Berger et al. (Citation2020a) and called the Shrinkage Inverse Wishart (SIW) distributions. The class $\mathrm{SIW}(a,H)$ for a $k\times k$ covariance matrix $W$ has density
$$\pi_{\mathrm{SIW}}(W\mid a,H)\propto\frac{\mathrm{etr}\big(-\frac{1}{2}W^{-1}H\big)}{|W|^{a}\prod_{i<j}(\nu_i-\nu_j)},\tag{46}$$
where $\nu_1>\nu_2>\cdots>\nu_k>0$ are the ordered eigenvalues of $W$, $a$ is a real constant and $H$ is a $k\times k$ non-negative definite matrix. Thus, the $V_j$ are conditionally distributed as $\mathrm{SIW}(t_j,H_j)$, $j\in[r]$. To sample the covariance matrices from the full conditional posteriors, previously suggested methods include the Metropolis-Hastings algorithm (cf. Berger et al., Citation2005) and the hit-and-run method (cf. Yang & Berger, Citation1994). Both methods generate full candidate matrices using full-parameter proposal distributions, so they only work for moderate dimensions of the covariance matrices. To tackle this issue, Berger et al. (Citation2020a) proposed a powerful Gibbs method for efficiently sampling covariance matrices from such conditional densities, and this new method works for higher dimensions $k$. Readers can refer to Berger et al. (Citation2020a) or Appendix 3 for details of this Gibbs sampling method. According to the simulation results of Berger et al. (Citation2020a), the new Gibbs method outperforms the Metropolis-Hastings and hit-and-run methods for moderate dimensions and works for $k$ up to 100, while the other two algorithms break down at much lower dimensions.
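A sketch of the variance-component step is given below. The update of $\lambda$ follows (Equation44) directly; the update of each $V_j$ is delegated to a user-supplied sampler `draw_siw`, standing in for the eigenvalue-wise Gibbs step of Berger et al. (Citation2020a), which we do not reproduce here. Function and variable names are ours.

```python
import numpy as np

def update_variance_components(state, Z, d, rng, draw_siw):
    """Update (lambda, V_1, ..., V_r) given the current input effects.

    state : dict with 'theta' (list of theta_j), 'eta', 'lam', 'V' (list of V_j);
    Z     : list [Z_0, ..., Z_r] of stacked design matrices;
    draw_siw(t, H, current) : sampler for the SIW(t, H) full conditional of a covariance.
    """
    r = len(state["V"])
    # lambda | eta ~ IG((d-1)/2, (1 + ||eta||^2)/2), as in (44); requires d >= 2.
    shape = (d - 1) / 2.0
    scale = (1.0 + state["eta"] @ state["eta"]) / 2.0
    state["lam"] = scale / rng.gamma(shape)                 # inverse gamma via a gamma draw
    # V_j | theta_j, theta_{j+1} ~ SIW(t_j, H_j), as in (45), with theta_{r+1} = eta.
    for j in range(1, r + 1):
        theta_j = state["theta"][j - 1]
        parent = state["eta"] if j == r else state["theta"][j]
        k_j = state["V"][j - 1].shape[0]
        m_j = theta_j.size // k_j
        resid = theta_j.reshape(m_j, k_j) - (Z[j] @ parent).reshape(m_j, k_j)
        H_j = resid.T @ resid                               # sum of per-unit outer products
        t_j = m_j / 2.0 + 1.0 - 1.0 / (2 * k_j)
        state["V"][j - 1] = draw_siw(t_j, H_j, state["V"][j - 1])
    return state
```

Alternating this step with the input-effect draws of Section 4.1 gives one full Gibbs sweep.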

In the framework of the 2-level HNLM, Berger et al. (Citation2020b) compared the numerical performance, from the mean square error (MSE) perspective, of a dozen objective hyperpriors, namely the products of three objective hyperpriors for the hypermean and four objective hyperpriors for the hypercovariance matrix. Priors on the hypermean include the constant prior, the conjugate prior and the recommended prior (Equation12). Priors on the hypercovariance matrix include the constant prior, the hierarchical Jeffreys prior, the hierarchical reference prior and the recommended prior (Equation13). Their simulation results show that the recommended combination of hyperpriors dominates all the others in terms of Bayes risk, and the constant prior on the hypercovariance performs the worst; however, neither of the two remaining choices for the hypercovariance is computationally easy. Considering the 4-level HNLM, Song et al. (Citation2020) performed a numerical experiment comparing the recommended prior with the constant prior for the hypercovariance matrices, the other two priors being excluded because of intractable computation. Song et al. (Citation2020)'s results likewise show the dominance of the recommended hyperpriors over the other priors. In conclusion, both Berger et al. (Citation2020b) and Song et al. (Citation2020) have provided strong numerical evidence of the superiority of the recommended hyperpriors for use in the GHNL model, since the 2-level and 4-level HNLMs are both special cases of the GHNL model.

5. Discussions

We have proposed a generalized hierarchical normal linear model applicable to nested data with complex structures. The GHNL model proves to be equivalent to an LMM, while the GHNL model is more natural for researchers modelling nested data from scratch, especially when incorporating covariates at high levels. Like generalizations of the simple normal linear model, the GHNL model can be extended to a hierarchical model with a generalized linear model at the first level, so that discrete observations can be handled. Besides, the first level (or even higher levels) of the GHNL model can also be extended to the setting of semiparametric regression models, such as the single index model and the partially linear model. The technique of modelling and investigation in this paper can be applied to the linear part of the models mentioned above. The statistical analysis would be complicated, however, and such explorations are beyond the scope of this paper.

Berger et al. (Citation2020b) put an end to the long search for appropriate hyperpriors in hierarchical modelling and investigated their properties comprehensively to justify the recommendation. Nonetheless, when it came to the propriety of the resulting posterior, they only conjectured that it holds when the prior is used at every level of a general hierarchical normal model, and conditions were not given. To complete the story, we have studied the conditions for the posterior to be proper in more general situations than Berger et al. (Citation2020b) when employing the recommended prior for the GHNL model. Theorems 3.1 and 3.3 present the main results, and Corollaries 3.1 and 3.3 reduce the computational burden by defining kernel sets for $d\ge2$ and $d=1$, respectively. In addition, Theorems 3.2 and 3.4 provide powerful tools of simple form for checking posterior propriety for $d\ge2$ and $d=1$, separately. User-friendly guidance for checking posterior propriety is then supplied. Note that our results only present sufficient conditions; necessary conditions have not been discussed. The reason is that the derivation of a lower bound on the integrated likelihood of the hyperparameters is intractable. Moreover, it is not worthwhile to investigate necessary conditions, as the derived upper bounds are tight enough that the corresponding sufficient conditions are very modest, according to the remarks and examples in Section 3. Finally, an efficient and powerful Gibbs sampling method for sampling from the posterior is introduced, overcoming the computational bottleneck that previously proposed sampling methods work only in low dimensions, or inefficiently in moderate dimensions. Numerical evidence supporting the superiority of the recommended prior for hierarchical models was presented in Berger et al. (Citation2020b) and Song et al. (Citation2020).

Though we have made much progress in hierarchical linear modelling, a major obstacle to applying our results is that the variance component for the first level is supposed to be known, which is hardly satisfactory in practice. If we assume an unknown covariance matrix $\Sigma_0$ for the first level and specify prior (Equation13) on it, the exponential term within the likelihood can no longer simply be dropped when deriving the upper bound; otherwise, the resulting integral is always infinite. An upper bound on the exponential term with respect to the eigenvalues of the covariance matrices is very tricky to obtain, and the condition for the integrability of the resulting integral remains to be studied further. Thus, GHNL modelling with unknown $\Sigma_0$ is left as a follow-up study to this paper.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

The research was supported by the National Natural Science Foundation of China [grant number 11671146].

References

  • Berger, J. (1980). A robust generalized Bayes estimator and confidence region for a multivariate normal mean. Annals of Statistics, 8(4), 716–761. https://doi.org/10.1214/aos/1176345068
  • Berger, J., Strawderman, W., & Tang, D. (2005). Posterior propriety and admissibility of hyperpriors in normal hierarchical models. Annals of Statistics, 33(2), 606–646. https://doi.org/10.1214/009053605000000075
  • Berger, J., Sun, D., & Song, C. (2020a). Bayesian analysis of the covariance matrix of a multivariate normal distribution with a new class of priors. Annals of Statistics, 48(4), 2381–2403. https://doi.org/10.1214/19-AOS1891
  • Berger, J., Sun, D., & Song, C. (2020b). An objective prior for hyperparameters in normal hierarchical models. Journal of Multivariate Analysis, 178(2020), 1–13. https://doi.org/10.1016/j.jmva.2020.104606
  • Consonni, G., Fouskakis, D., Liseo, B., & Ntzoufras, I. (2018). Prior distributions for objective Bayesian analysis. Bayesian Analysis, 13(2), 627–679. https://doi.org/10.1214/18-BA1103
  • Daniels, M. J., & Kass, R. E. (1999). Nonconjugate Bayesian estimation of covariance matrices and its use in hierarchical models. Journal of the American Statistical Association., 94(448), 1254–1263. https://doi.org/10.1080/01621459.1999.10473878
  • Everson, P. J., & Morris, C. N. (2000). Inference for multivariate normal hierarchical models. Journal of the Royal Statistical Society: Series B, 62(2), 399–412. https://doi.org/10.1111/rssb.2000.62.issue-2
  • Fourdrinier, D., Strawderman, W. E., & Wells, M. T. (1998). On the construction of Bayes minimax estimators. Annals of Statistics, 26(2), 660–671. https://doi.org/10.1214/aos/1028144853
  • Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis, 1(3), 515–534. https://doi.org/10.1214/06-BA117A
  • Goldstein, H. (2011). Multilevel statistical models (Vol. 922). John Wiley & Sons.
  • Gustafson, P., Hossain, S., & Macnab, Y. C. (2006). Conservative prior distributions for variance parameters in hierarchical models. Canadian Journal of Statistics, 34(3), 377–390. https://doi.org/10.1002/cjs.v34:3
  • Hobert, J. P., & Casella, G. (1996). The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91(436), 1461–1473. https://doi.org/10.1080/01621459.1996.10476714
  • Hoff, P. D. (2009b). Simulation of the matrix Bingham-Von Mises-Fisher distribution, with applications to multivariate and relational data. Journal of Computational and Graphical Statistics, 18(2), 438–456. https://doi.org/10.1198/jcgs.2009.07177
  • Horn, R. A., & Johnson, C. R. (2012). Matrix analysis. Cambridge university press.
  • Lindenberger, U., & Pötter, U. (1998). The complex nature of unique and shared effects in hierarchical linear regression: Implications for developmental psychology. Psychological Methods, 3(2), 218–230. https://doi.org/10.1037/1082-989X.3.2.218
  • Michalak, S. E., & Morris, C. N. (2016). Posterior propriety for hierarchical models with log-Likelihoods that have norm bounds. Bayesian Analysis, 11(2), 545–571. https://doi.org/10.1214/15-BA962
  • Raudenbush, S., & Bryk, A. S. (1986). A hierarchical model for studying school effects. Sociology of Education, 59(1), 1–17. https://doi.org/10.2307/2112482
  • Shimotsu, K. (2010). Exact local Whittle estimation of fractional integration with unknown mean and time trend. Econometric Theory, 26(2), 501–540. https://doi.org/10.1017/S0266466609100075
  • Song, C., Sun, D., Fan, K., & Mu, R. (2020). Posterior propriety of an objective prior in a 4-Level normal hierarchical model. Mathematical Problems in Engineering, 2020. https://doi.org/10.1155/2020/8236934
  • Speckman, P. L., & Sun, D. (2003). Fully Bayesian spline smoothing and intrinsic autoregressive priors. Biometrika, 90(2), 289–302. https://doi.org/10.1093/biomet/90.2.289
  • Sun, D., Tsutakawa, R. K., & He, Z. (2001). Propriety of posteriors with improper priors in hierarchical linear mixed models. Statistica Sinica, 11(1), 77–95. http://www.jstor.org/stable/24306811
  • Xia, A., Ma, H., & Carlin, B. P. (2011). Bayesian hierarchical modeling for detecting safety signals in clinical trials. Journal of Biopharmaceutical Statistics, 21(5), 1006–1029. https://doi.org/10.1080/10543406.2010.520181
  • Yang, R., & Berger, J. (1994). Estimation of a covariance matrix using the reference prior. Annals of Statistics, 22(3), 1195–1211. https://doi.org/10.1214/aos/1176325625

Appendices

Appendix 1. A special LMM

Consider a special LMM of the form
(A1) y = Xβ + Z_1u_1 + ⋯ + Z_ru_r + ϵ,
where y is the n×1 vector of observations and β is the p×1 vector of fixed effects. For i ∈ [r], u_i is a q_i×1 vector of random effects, and the u_i's are assumed to be independently distributed as N_{q_i}(0, W_i), where the W_i's are unknown q_i×q_i positive definite matrices. X is an n×p matrix, the Z_i's are n×q_i matrices, and X and the Z_i's are known design matrices. ϵ is the vector of random errors, distributed as N_n(0, Σ), where Σ is a given n×n positive definite matrix.

It follows from Berger et al. (Citation2020b) that we can assign independent priors to (β, W_1, …, W_r) as
(A2) π(β) ∝ 1/(1 + ‖β‖²)^{(p−1)/2}, β ∈ R^p,
     π(W_j) ∝ 1/( |W_j|^{1−1/(2q_j)} ∏_{1≤s<t≤q_j} (ν_{js} − ν_{jt}) ), W_j > 0, j ∈ [r],
where ν_{j1} > ν_{j2} > ⋯ > ν_{jq_j} > 0 are the ordered eigenvalues of W_j, j ∈ [r]. The prior on β has a hierarchical structure of the form
(β | τ) ∼ N_p(0, τ I_p) and π(τ) ∝ τ^{−1/2} exp(−1/(2τ)).
The posterior propriety result for the special LMM (A1) is presented as follows. First, let τ ≡ ν_{01} and q_0 = 1. Denote the index set of the variance scale and of the eigenvalues of the covariance matrices by F = {(j, l) | j = 0, 1, …, r, l ∈ [q_j]}, and let T = {D | D ⊆ F, D ≠ ∅} be the collection of non-empty subsets of F. Define c_{jl,s} = 1{l = s} for j ∈ [r], l ∈ [q_j], and s ∈ [n].

Theorem A.1

Consider the linear mixed effect model (A1) with prior (A2) on (β, W_1, …, W_r). Assume p > 1. Then the posterior is proper if
(A3) ∑_{s=1}^n 1{ 1{(0,1) ∈ D} 1{s ≤ p} + ∑_{j≠0, (j,l)∈D} c_{jl,s} > 0 } > ∑_{(j,l)∈D} 1/q_j
holds for any D ∈ T.

The proof of Theorem A.1 is similar to that of Theorem 3.1 and is omitted here.

Fact A.1

When p > 1, (A3) holds for any D ∈ T iff
(A4) ∑_{j∈[r]} 1/q_j < 1 and p > 1 + ∑_{j∈[r]} min(p, q_j)/q_j.

Proof.

For any D ∈ T with (0,1) ∉ D, (A3) is equivalent to
(A5) ∑_{s=1}^n 1{ ∑_{(j,l)∈D} c_{jl,s} > 0 } > ∑_{(j,l)∈D} 1/q_j.
It can be deduced that inequality (A5) holds for any D ∈ T with (0,1) ∉ D iff
(A6) L > ∑_{j∈[r]} min(L, q_j)/q_j for all L ∈ [n],
which is equivalent to ∑_{j∈[r]} 1/q_j < 1, since q_j ≥ 1 for j ∈ [r].

Inequality (A3) holds for any D ∈ T with (0,1) ∈ D iff
(A7) L > 1 + ∑_{j∈[r]} min(L, q_j)/q_j for L = p, …, n.
Under the condition ∑_{j∈[r]} 1/q_j < 1, (A7) is equivalent to p > 1 + ∑_{j∈[r]} min(p, q_j)/q_j.

Corollary A.1

Consider model (A1) with prior (A2) on the parameters. The posterior is proper if one of the following conditions holds:

  (a) p > 1 + r and ∑_{j∈[r]} 1/q_j < 1;

  (b) p > 1 and ∑_{j∈[r]} 1/q_j < 1 − 1/p.

Proof.

Since ∑_{j∈[r]} min(p, q_j)/q_j ≤ r and ∑_{j∈[r]} min(p, q_j)/q_j ≤ p ∑_{j∈[r]} 1/q_j, conditions (a) and (b) follow from Fact A.1 directly.
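
To make these conditions easy to apply, the following sketch (a hypothetical helper written for this exposition, not code from the authors) checks the exact condition (A4) of Fact A.1 together with the two sufficient conditions of Corollary A.1, given p and the random-effect dimensions q_1, …, q_r.

```python
# A minimal sketch, assuming the notation of model (A1): p = dim(beta), q = [q_1, ..., q_r].
def propriety_check(p, q):
    """Return (Fact A.1 condition (A4), Corollary A.1 condition (a) or (b))."""
    cond_a4 = sum(1.0 / qj for qj in q) < 1 and p > 1 + sum(min(p, qj) / qj for qj in q)
    cond_a = p > 1 + len(q) and sum(1.0 / qj for qj in q) < 1           # Corollary A.1 (a)
    cond_b = p > 1 and sum(1.0 / qj for qj in q) < 1 - 1.0 / p          # Corollary A.1 (b)
    return cond_a4, (cond_a or cond_b)

# Example consistent with Remark A.1 (r = 1): p = 2, q_1 = 3 gives propriety, whereas p = 1 does not.
print(propriety_check(2, [3]))   # (True, True)
print(propriety_check(1, [3]))   # (False, False)
```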

Remark A.1

Consider model (A1) with r = 1. The posterior under prior (A2) is proper if either (a) p ≥ 2, q_1 ≥ 3 or (b) p ≥ 3, q_1 ≥ 2 holds. If p = 1, posterior propriety can hardly be achieved, for two reasons. First, inequality (A3) fails for D = {(0,1)}. Second, if we follow the thread of deriving the condition in Theorem 3.3, a sufficient condition can be derived as
(A8) ∑_{s=1}^n 1{ ∑_{(j,l)∈D} c_{jl,s} > 0 } > 1{∃ j ∈ [r]: (j,1) ∈ D} + ∑_{(j,l)∈D} 1/q_j,
for any D ∈ T with (0,1) ∉ D. However, inequality (A8) does not hold for D = {(j,1)}, j ∈ [r].

Appendix 2

Proofs of the lemmas in Section 3.1

Lemma A.1

(Min–max theorem; cf. Horn & Johnson, Citation2012) For an n×n symmetric matrix A and a non-zero n×1 vector x, the Rayleigh quotient of A and x is defined as
R(A, x) = ⟨Ax, x⟩ / ⟨x, x⟩,
where ⟨·, ·⟩ denotes the Euclidean inner product. Then
(A9) λ_k(A) = max_U { min_x { R(A, x) | x ∈ U, x ≠ 0 } | dim(U) = k }, k ∈ [n],
where U denotes a linear subspace of R^n. In particular,
(A10) λ_max(A) = max_x { R(A, x) | x ≠ 0 } and λ_min(A) = min_x { R(A, x) | x ≠ 0 }.

Lemma A.2

For n×n symmetric matrices A_j, j ∈ [r], we have

  • (a) λ_max(∑_{j=1}^r A_j) ≤ ∑_{j=1}^r λ_max(A_j);

  • (b) supposing A_j ≥ 0 (positive semi-definite) for j ∈ [r], then λ_k(∑_{j=1}^r A_j) ≥ (1/r) ∑_{j=1}^r λ_k(A_j), k ∈ [n].

Proof.

For (a), given an n×1 non-zero vector x, by (A10),
(A11) R(∑_{j=1}^r A_j, x) = ∑_{j=1}^r R(A_j, x) ≤ ∑_{j=1}^r max_{x_j ≠ 0} R(A_j, x_j) = ∑_{j=1}^r λ_max(A_j).
The proof of (a) is completed by using (A10) again.

For (b), it suffices to prove that for any j ∈ [r],
(A12) λ_k(∑_{j=1}^r A_j) ≥ λ_k(A_j), k ∈ [n],
since averaging (A12) over j ∈ [r] yields (b). Since A_l ≥ 0 for l ∈ [r], for any j ∈ [r] and any x ∈ R^n with x ≠ 0,
R(∑_{l=1}^r A_l, x) = ∑_{l=1}^r R(A_l, x) ≥ R(A_j, x).
Minimize both sides of the inequality above over {x | x ∈ U, x ≠ 0} first, and then take the maximum over {U | U ⊆ R^n, dim(U) = k}; (A12) then follows from Lemma A.1.
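
As a quick sanity check, the two inequalities of Lemma A.2 can be verified numerically on random matrices. The sketch below is a small illustration written for this exposition (not part of the paper); note that numpy returns eigenvalues in ascending order, which is immaterial because the inequalities hold for every k once the eigenvalues of all matrices are sorted consistently.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 5, 3

# (a): for symmetric A_j, lambda_max of the sum is at most the sum of the lambda_max's.
A_sym = [(M + M.T) / 2 for M in rng.standard_normal((r, n, n))]
assert np.linalg.eigvalsh(sum(A_sym))[-1] <= sum(np.linalg.eigvalsh(A)[-1] for A in A_sym) + 1e-12

# (b): for positive semi-definite A_j, lambda_k of the sum dominates the average of the lambda_k's.
A_psd = [B @ B.T for B in rng.standard_normal((r, n, n))]
lam_sum = np.linalg.eigvalsh(sum(A_psd))                        # ascending order
lam_avg = np.mean([np.linalg.eigvalsh(A) for A in A_psd], axis=0)
assert np.all(lam_sum >= lam_avg - 1e-12)
```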

A.1. Proof of Lemma 3.1

For (a), by Lemma A.2(a), it suffices to prove that λ_max(X_j A_j X_j') ≤ C_1 λ_max(A_j) for j ∈ [r]. For any j ∈ [r], applying (A10) yields
0 < λ_max(X_j A_j X_j') = max_{x ≠ 0} R(X_j A_j X_j', x) = R(X_j A_j X_j', x*)
for some x* ≠ 0. It is obvious that X_j'x* ≠ 0; otherwise R(X_j A_j X_j', x*) = 0, a contradiction. In addition, since ⟨X_j'x*, X_j'x*⟩ ≤ λ_max(X_j X_j')⟨x*, x*⟩ ≤ C_1⟨x*, x*⟩, we have
R(X_j A_j X_j', x*) ≤ C_1 R(A_j, X_j'x*) ≤ C_1 max{ R(A_j, z) | z ∈ R^{p_j}, z ≠ 0 } = C_1 λ_max(A_j).
Therefore, we have proved part (a).

For (b), we only need to prove that λ_k(H) ≥ (C_2/r) ∑_{j=1}^r a_{jk}, k ∈ [n]. It follows from Lemma A.2(b) that for any k ∈ [n],
λ_k(H) ≥ (1/r) ∑_{j=1}^r λ_k(X_j A_j X_j').
Since rank(X_j A_j X_j') = rank(A_j), we have λ_k(X_j A_j X_j') = 0 for p_j < k ≤ n. Thus, it remains to show that
(A13) λ_k(X_j A_j X_j') ≥ C_2 λ_k(A_j), j ∈ [r], k ∈ [p_j].
First, for any j ∈ [r], introduce the linear map L_j: R^n → R^{p_j} defined by L_j(ν) = X_j'ν, ν ∈ R^n, and denote its kernel by Ker(L_j) = {ν ∈ R^n : L_j(ν) = 0}. Since X_j has full column rank p_j ≤ n, the orthogonal complement Ker(L_j)^⊥ has dimension p_j, i.e., dim(Ker(L_j)^⊥) = p_j. Thus the restriction L_j: Ker(L_j)^⊥ → R^{p_j} is a one-to-one mapping. For any U ⊆ R^{p_j} with dim(U) = k, k ∈ [p_j], define
L_j^{-1}(U) = {ν ∈ Ker(L_j)^⊥ : L_j(ν) = x, x ∈ U}.
It is obvious that L_j^{-1}(U) ⊆ Ker(L_j)^⊥ and dim(L_j^{-1}(U)) = k. For any U ⊆ R^{p_j} with dim(U) = k and any x ∈ U, there exists one and only one ν ∈ L_j^{-1}(U) such that L_j(ν) = x. Since ⟨x, x⟩ = ν'(X_j X_j')ν ≥ C_2⟨ν, ν⟩, we have
(A14) R(X_j A_j X_j', ν) ≥ C_2 R(A_j, x).
It follows from Lemma A.1 that
(A15) λ_k(X_j A_j X_j') ≥ max_V { min_ν { R(X_j A_j X_j', ν) | ν ∈ V, ν ≠ 0 } | V ⊆ Ker(L_j)^⊥, dim(V) = k }, k ∈ [p_j].
Minimize both sides of inequality (A14) over {x | x ∈ U, x ≠ 0} first, and then take the maximum over {U | U ⊆ R^{p_j}, dim(U) = k}; (A13) then follows from (A15).
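
The key inequality (A13) can also be checked numerically. The sketch below is an illustration assumed for this appendix (not the authors' code); it takes C_2 = λ_min(X'X) for a single full-column-rank design matrix X and verifies λ_k(X A X') ≥ C_2 λ_k(A) for the leading p eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 8, 3
X = rng.standard_normal((n, p))          # full column rank with probability one
B = rng.standard_normal((p, p))
A = B @ B.T                              # a p x p positive semi-definite matrix

C2 = np.linalg.eigvalsh(X.T @ X)[0]                 # smallest eigenvalue of X'X
lam_XAX = np.linalg.eigvalsh(X @ A @ X.T)[-p:]      # p largest eigenvalues of X A X'
lam_A = np.linalg.eigvalsh(A)                       # eigenvalues of A (ascending)
assert np.all(lam_XAX >= C2 * lam_A - 1e-10)
```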

A.2. Proof of Lemma 3.2

The domain of the integral can be divided into
Ω_0 = {λ | 0 ≤ λ_j ≤ 1, j ∈ [k]},
Ω_D = {λ | λ_j > 1, j ∈ D, and 0 ≤ λ_i ≤ 1, i ∈ [k]∖D}, D ∈ F([k]),
i.e., Ω = (∪_{D∈F([k])} Ω_D) ∪ Ω_0. Thus, the integral I is finite iff the integrals over Ω_0 and over Ω_D for each D ∈ F([k]) are all finite.

Denote the integrand by F(λ). Then
∫_{Ω_0} F(λ) dλ ≍ ∫_{Ω_0} ∏_{j=1}^k λ_j^{−a_j} dλ,
which is finite iff condition (a) is satisfied.

To verify condition (b), we only need to justify the following statement; from now on, we assume that condition (a) always holds.

Fact A.2

For all D ∈ F([k]) with card(D) = L, 1 ≤ L ≤ k, the integrals ∫_{Ω_D} F(λ) dλ are finite iff the inequalities
(A16) ∑_{j∈E} a_j + ∑_{i∈G_E} b_i > card(E)
hold for all E ∈ F([k]) with card(E) ≤ L, where G_E = {i ∈ [n] | E ∩ C_i ≠ ∅}. Under the condition above,
(A17) ∫_{Θ_D(t)} G_D(λ) (∏_{r∈D} dλ_r) ≲ exp{ −(log t)( ∑_{j∈D} a_j + ∑_{i∈G_D} b_i − card(D) ) }
always holds (here ≲ denotes a bound up to a constant not depending on t), where
Θ_D(t) = {λ | λ_j ≥ t, j ∈ D, and 0 ≤ λ_i ≤ 1, i ∈ [k]∖D},
G_D(λ) = 1 / ( [∏_{j∈D} λ_j^{a_j}] [∏_{i∈G_D} (1 + ∑_{r∈C_i} λ_r)^{b_i}] ).

Formula (A17) is required because it plays an important role in verifying condition (A16) within the induction.

Proof.

We prove the result by mathematical induction. First, assume that the statement in Fact A.2 is true for L = l, 1 ≤ l ≤ (k−1). With this assumption, we show that the statement is true for its successor, L = l + 1. Write an arbitrary set D ∈ F([k]) with cardinality (l+1) as {j_1, …, j_{l+1}}, where 1 ≤ j_1 < ⋯ < j_{l+1} ≤ k, and denote D_{j_i} = D∖{j_i}, i = 1, …, (l+1).

Step 1: We first prove that ∫_{Ω_D} F(λ) dλ is finite iff the inequalities (A16) hold for L = l + 1.

Region Ω_D can be divided into
Σ_1 = {λ | λ_j ≥ λ_{j_1} > 1, j ∈ D_{j_1}, and 0 ≤ λ_i ≤ 1, i ∈ [k]∖D},
⋮
Σ_{l+1} = {λ | λ_j ≥ λ_{j_{l+1}} > 1, j ∈ D_{j_{l+1}}, and 0 ≤ λ_i ≤ 1, i ∈ [k]∖D}.
Therefore, the integral ∫_{Ω_D} F(λ) dλ is finite iff ∫_{Σ_i} F(λ) dλ < ∞ for every i = 1, …, (l+1). For i = 1, …, (l+1), we have
∫_{Σ_i} F(λ) dλ ≍ ∫_{Σ_i} ( ∏_{s∉D} λ_s^{−a_s} ) (1 + λ_{j_i})^{−( a_{j_i} + ∑_{r∈H_i} b_r )} G_{D_{j_i}}(λ) dλ,
where H_i = {r ∈ [n] | C_r ∩ D_{j_i} = ∅ and j_i ∈ C_r}; it is easy to see that G_D = G_{D_{j_i}} ∪ H_i and G_{D_{j_i}} ∩ H_i = ∅. Since
(A18) ∫_{Σ_i} G_{D_{j_i}}(λ) (∏_{r∈D_{j_i}} dλ_r) ≤ ∫_{Θ_{D_{j_i}}(λ_{j_i})} G_{D_{j_i}}(λ) (∏_{r∈D_{j_i}} dλ_r),
(A19) ∫_{Ω_{D_{j_i}}} F(λ) dλ ≍ ∫_{Ω_{D_{j_i}}} ( ∏_{s∉D_{j_i}} λ_s^{−a_s} ) G_{D_{j_i}}(λ) dλ,
and the RHS (right-hand side) of (A19) and of (A18) are finite simultaneously under condition (a), the LHS (left-hand side) of (A18) is finite iff ∫_{Ω_{D_{j_i}}} F(λ) dλ is finite.

Furthermore, by the induction assumption and (A17), we have
∫_{Σ_i} G_{D_{j_i}}(λ) (∏_{r∈D_{j_i}} dλ_r) ≲ exp{ −(log λ_{j_i})( ∑_{j∈D_{j_i}} a_j + ∑_{r∈G_{D_{j_i}}} b_r − l ) }.
Thus, under condition (a) and the induction assumption, we have
∫_{Σ_i} F(λ) dλ ≲ ∫_1^∞ (1 + λ_{j_i})^{−( ∑_{j∈D} a_j + ∑_{r∈G_D} b_r − l )} dλ_{j_i},
the RHS of which is finite iff ∑_{j∈D} a_j + ∑_{r∈G_D} b_r > 1 + l = card(D).

In conclusion, ∫_{Ω_D} F(λ) dλ is finite iff ∑_{j∈D} a_j + ∑_{r∈G_D} b_r > card(D) and ∫_{Ω_{D_{j_i}}} F(λ) dλ is finite for every i ∈ [l+1]. Since D is arbitrary and card(D_{j_i}) = l, the goal of Step 1 is accomplished.

Step 2: Next, we prove that formula (A17) holds for D with cardinality (l+1).

Region Θ_D(t) can be divided into
Θ_D^{(1)}(t) = {λ | λ_j ≥ λ_{j_1} ≥ t, j ∈ D_{j_1}, and 0 ≤ λ_i ≤ 1, i ∈ [k]∖D},
⋮
Θ_D^{(l+1)}(t) = {λ | λ_j ≥ λ_{j_{l+1}} ≥ t, j ∈ D_{j_{l+1}}, and 0 ≤ λ_i ≤ 1, i ∈ [k]∖D}.
Similarly to the proof of Step 1, we can show that for i = 1, …, (l+1),
∫_{Θ_D^{(i)}(t)} G_D(λ) (∏_{r∈D} dλ_r) ≲ ∫_t^∞ (1 + λ_{j_i})^{−( ∑_{j∈D} a_j + ∑_{r∈G_D} b_r − l )} dλ_{j_i} ≍ exp{ −(log t)( ∑_{j∈D} a_j + ∑_{r∈G_D} b_r − card(D) ) }.
This proves Step 2.

Step 3: To complete the proof by mathematical induction, it remains to verify that the statement is true for the base case L = 1.

Denote D = {r}, r = 1, …, k. Then
∫_{Ω_D} F(λ) dλ ≍ ∫_1^∞ (1 + λ_r)^{−( a_r + ∑_{i∈G_D} b_i )} dλ_r,
which is finite iff ∑_{j∈D} a_j + ∑_{i∈G_D} b_i > 1 = card(D). Under this condition,
∫_{Θ_D(t)} G_D(λ) (∏_{i∈D} dλ_i) ≲ ∫_t^∞ (1 + λ_r)^{−( a_r + ∑_{i∈G_D} b_i )} dλ_r ≍ exp{ −(log t)( ∑_{j∈D} a_j + ∑_{i∈G_D} b_i − card(D) ) },
which accomplishes the proof of Fact A.2.
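
Since condition (A16) is a finite combinatorial check, it can be verified by enumeration. The sketch below is a hypothetical brute-force checker written for this appendix (not the authors' code); it assumes the reading G_E = {i : E ∩ C_i ≠ ∅} used above and enumerates all non-empty subsets E of [k].

```python
from itertools import combinations

def check_A16(a, b, C):
    """a: exponents a_1..a_k; b: exponents b_1..b_n; C: list of subsets C_i of {1,...,k}."""
    k = len(a)
    for size in range(1, k + 1):
        for E in combinations(range(1, k + 1), size):
            G_E = [i for i, Ci in enumerate(C) if set(E) & set(Ci)]  # i with E and C_i overlapping
            if sum(a[j - 1] for j in E) + sum(b[i] for i in G_E) <= size:
                return False     # (A16) fails for this E
    return True

# Toy example: k = 2, n = 1, C_1 = {1, 2}; every subset's exponent sum exceeds its cardinality.
print(check_A16([0.6, 0.6], [1.0], [{1, 2}]))   # True
```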

Appendix 3

Gibbs sampling from the SIW distributions

As for the SIW distribution (46), we first consider the change of variables from W to Ξ = diag(ν_1, …, ν_k) and the orthogonal matrix O of corresponding eigenvectors. The Jacobian is
(A20) |∂W/∂(Ξ, O)| = ∏_{i<j} (ν_i − ν_j).
According to (A20) and Lemma 4 in Berger et al. (Citation2020a), (46) can be transformed to
(A21) π(Ξ, O) ∝ (1/|Ξ|^a) etr( −(1/2) Ξ^{-1} O'HO ).
Gibbs sampling of Ξ: We first sample Ξ given (O, H) from
π(Ξ | O, H) ∝ ( 1/∏_{i=1}^k ν_i^a ) etr( −(1/2) Ξ^{-1} O'HO ) = ∏_{i=1}^k ν_i^{−a} exp( −c_i/ν_i ),
where c_i is the i-th diagonal element of O'HO/2, i ∈ [k]. Therefore, we can sample the ν_i's independently from IG(a − 1, c_i).
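
A minimal sketch of this eigenvalue step is given below (hypothetical helper names, assuming a > 1 so that the inverse-gamma shape a − 1 is valid); it exploits the fact that if X ~ Gamma(a − 1, scale 1/c_i) then 1/X ~ IG(a − 1, c_i).

```python
import numpy as np

def sample_eigenvalues(O, H, a, rng):
    """Draw nu_1, ..., nu_k independently, nu_i ~ IG(a - 1, c_i) with c_i = (O' H O)_{ii} / 2."""
    c = np.diag(O.T @ H @ O) / 2.0
    return 1.0 / rng.gamma(shape=a - 1.0, scale=1.0 / c)

# Example usage with a random symmetric H and a random orthogonal O (assumed inputs):
rng = np.random.default_rng(0)
k, a = 4, 2.5
M = rng.standard_normal((k, k))
H = M @ M.T
O, _ = np.linalg.qr(rng.standard_normal((k, k)))
nu = sample_eigenvalues(O, H, a, rng)
```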

Gibbs sampling of O: Given (Ξ, H), the full conditional density of O has the form
π(O | Ξ, H) ∝ etr( −(1/2) H O Ξ^{-1} O' ).
Let H = LUL', where L'L = I_k and U = diag(u_1, …, u_k) is the diagonal matrix of the corresponding eigenvalues with u_1 ≥ ⋯ ≥ u_k. Define G = L'O. Since the invariant right Haar measure is invariant under orthonormal transformations, the conditional density of G is
(A22) π(G | Ξ, H) ∝ etr( −(1/2) U G Ξ^{-1} G' ).
The update of G from (A22) can be implemented by applying a Gibbs update to two randomly selected columns (cf. Hoff, Citation2009b) or rows (cf. Berger et al., Citation2020a). The two approaches are essentially equivalent when rank(H) = k, but the method of Berger et al. (Citation2020a) is considerably faster if rank(H) < k. Without loss of generality, assume that the two randomly selected rows are the first and the second. The updated value of G can be written as G^new = diag(Φ, I_{k−2}) (G_{12}^old; G_{−12}^old), where G_{12}^old denotes the first two rows of the old value G^old, G_{−12}^old denotes the remaining k−2 rows of G^old, and
Φ = D_ϵ Φ_0 = (ϵ_1, 0; 0, ϵ_2)(cos ϕ, −sin ϕ; sin ϕ, cos ϕ),
with ϕ ∈ (−π/2, π/2] and ϵ_i = ±1 for i = 1, 2. Let U_1 = diag(u_1, u_2). The full conditional density of ϕ has the form
π(ϕ | G^old, Ξ, H) ∝ etr{ −(1/2) U_1 Φ_0 G_{12}^old Ξ^{-1} (G_{12}^old)' Φ_0' }.
Write
G_{12}^old Ξ^{-1} (G_{12}^old)' = (cos θ, −sin θ; sin θ, cos θ)(s_1, 0; 0, s_2)(cos θ, sin θ; −sin θ, cos θ),
where θ ∈ (−π/2, π/2] and s_1 > s_2. Then the conditional density of ϕ can be rewritten as
π(ϕ | G^old, Ξ, H) ∝ exp{ −c_0 cos²(ϕ + θ) },
where c_0 = (1/2)(s_1 − s_2)(u_1 − u_2) ≥ 0. Define α = cos²(ϕ + θ). Then the full conditional density of α has the form
π(α | G^old, Ξ, H) ∝ exp{ −c_0 α } α^{−1/2} (1 − α)^{−1/2}, α ∈ [0, 1].
Simulating α ∈ [0, 1] can proceed with a rejection sampler using a Beta(1/2, 1/2) proposal.
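
A minimal sketch of this rejection step (a hypothetical helper written for this appendix): because the target density of α is proportional to exp(−c_0 α) α^{−1/2} (1 − α)^{−1/2} with c_0 ≥ 0, a Beta(1/2, 1/2) proposal can be accepted with probability exp(−c_0 α) ≤ 1.

```python
import numpy as np

def sample_alpha(c0, rng):
    """Rejection sampler for alpha = cos^2(phi + theta), given c0 >= 0."""
    while True:
        alpha = rng.beta(0.5, 0.5)                 # proposal ~ Beta(1/2, 1/2)
        if rng.uniform() < np.exp(-c0 * alpha):    # accept with probability exp(-c0 * alpha)
            return alpha

# Example usage; phi is then recovered from alpha up to the sign choices handled by epsilon_1, epsilon_2.
rng = np.random.default_rng(0)
alpha = sample_alpha(c0=1.3, rng=rng)
```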