ABSTRACT
Bayesian hierarchical models have been widely used in modern statistical applications. To deal with data having complex structures, we propose a generalized hierarchical normal linear (GHNL) model that accommodates arbitrarily many levels, usual design matrices and ‘vanilla’ covariance matrices. Objective hyperpriors can be employed for the GHNL model to express ignorance or to match frequentist properties, yet the common objective Bayesian approaches are infeasible or fraught with danger in hierarchical modelling. To tackle this issue, [Berger, J., Sun, D., & Song, C. (2020b). An objective prior for hyperparameters in normal hierarchical models. Journal of Multivariate Analysis, 178, 104606. https://doi.org/10.1016/j.jmva.2020.104606] proposed a particular objective prior and investigated its properties comprehensively. Posterior propriety is important in the choice of priors to guarantee the convergence of MCMC samplers. James Berger conjectured that the resulting posterior is proper for a hierarchical normal model with arbitrarily many levels, but no rigorous proof was given. In this paper, we complete this story and provide user-friendly guidance. The main contributions of this paper are a new technique for deriving an elaborate upper bound on the integrated likelihood, and a unified approach to checking posterior propriety for linear models. An efficient Gibbs sampling method is also introduced and outperforms other sampling approaches considerably.
1. Introduction
Bayesian hierarchical models (or multilevel models) have been extensively used in modern applications, including education (Raudenbush & Bryk, Citation1986), psychology (Lindenberger & Pötter, Citation1998), clinical trials (Xia et al., Citation2011), economics (Shimotsu, Citation2010) and many other applied statistical fields. The fundamental idea of hierarchical modelling is to think of the lowest-level units (smallest and most numerous) as organized into a hierarchy of successively higher-level units. For example, students are in classes, classes are in schools, schools are in districts, and districts are in states. Accordingly, hierarchical models are naturally applicable to survey, observational or experimental data involving complicated nesting. However, the most commonly used and fully discussed hierarchical models have only two levels. Goldstein (Citation2011) and Berger et al. (Citation2020b) defined 3-level hierarchical models and carried out statistical analysis on them. Hierarchical models with more levels have usually been avoided by researchers because of analytical difficulty and intractable computation. To the best of the authors' knowledge, a general hierarchical linear model with arbitrarily many levels seems never to have been defined or studied. In this paper, we will introduce the definition of a generalized hierarchical normal linear (GHNL) model and carry out an in-depth theoretical investigation of Bayesian inference for the GHNL models.
In order to implement fully Bayesian analysis, priors are supposed to be specified on the hyperparameters (parameters at higher levels of the hierarchical model). Improper (objective) priors are often used to express ignorance or to match frequentist properties (see the review article, Consonni et al., Citation2018). When improper priors are used, an important issue arises: whether the resulting posterior distributions are proper. As Hobert and Casella (Citation1996) stated, without proper precaution, misuse of improper priors, sometimes unknowingly, will result in practical difficulties, such as non-convergence of the Gibbs sampler. The enormous practical importance of posterior propriety motivates us to explore it in the framework of GHNL modelling. There is also a vast modern literature investigating the posterior propriety of improper priors applied to a large variety of models, such as Sun et al. (Citation2001), Speckman and Sun (Citation2003), Berger et al. (Citation2005) and Michalak and Morris (Citation2016).
A great deal of effort has been devoted to the development of objective hyperpriors in hierarchical modelling, such as Daniels and Kass (Citation1999), Everson and Morris (Citation2000), Gelman (Citation2006), Gustafson et al. (Citation2006), Berger et al. (Citation2005) and Berger et al. (Citation2020b). Formal objective Bayesian approaches, like the Jeffreys-rule prior or the reference prior, are only feasible in simple hierarchical settings. For instance, the exact Jeffreys-rule prior for covariance matrices at a higher level depends on the parameters from the lower level of the model, leading to plenty of difficulties in formulation and computation. Therefore, a common way is to use less formal approaches, such as applying formal objective priors from non-hierarchical models to hierarchical modelling. Unfortunately, the non-hierarchical Jeffreys-rule prior and reference prior typically yield improper posteriors in hierarchical settings (cf. Berger et al., Citation2005). Those who recognize this problem often use constant priors instead for higher-level variance components. However, the constant prior is so diffuse that it requires twice as many observations as logically needed to achieve posterior propriety (cf. Berger et al., Citation2005 and Berger et al., Citation2020b). In other words, the extra observations required are wasted on correcting the over-diffuse tail of the constant prior. The most powerful tool known for detecting over-diffuse hyperpriors is the frequentist notion of admissibility of the resulting estimators (see Berger et al., Citation2005 for discussions and references). Sensible choices of objective hyperpriors are on the boundary of admissibility, being as diffuse as possible without leading to inadmissible estimators.
Berger et al. (Citation2005) studied the propriety and admissibility of a number of hyperpriors, but no overall conclusion was reached as to a specific prior to recommend, for the following reasons: (a) the admissibility of the leading candidate prior could not be proved; (b) the proposed computational methods were only efficient for relatively low-dimensional covariance matrices and remained quite challenging for the candidate priors; (c) the hierarchical model discussed had only two levels, and the results do not carry over to a general hierarchical model with many levels. To address this issue, Berger et al. (Citation2020b) recommended a particular objective prior for use in all normal hierarchical models. Consider the following canonical form of the 2-level hierarchical normal model: suppose that, independently, y_i | θ_i ∼ N_k(θ_i, I_k) and θ_i ∼ N_k(β, Σ) for i = 1, …, m, where N_k denotes the k-dimensional normal distribution, the y_i are k × 1 observation vectors, the θ_i are the k × 1 unobserved mean vectors, β is a k × 1 ‘hypermean’ vector, and Σ is an unknown ‘hypercovariance’ matrix. Berger et al. (Citation2020b) proposed a particular combination of independent priors on the hyperparameters β and Σ as
(1)
where λ_1 ≥ ⋯ ≥ λ_k are the ordered eigenvalues of Σ. The recommendation (1) for hyperpriors was justified by Berger et al. (Citation2020b) from the aspects of admissibility, ease of computation and performance. Most importantly, prior (1) is adapted to being used at any level in hierarchical modelling, which is not true for the other proposed objective priors mentioned previously.
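As a minimal numpy sketch of the canonical 2-level model just described (the dimensions, seed and parameter values below are illustrative, not from the paper), one can simulate the two stages and note that the marginal distribution of each observation is normal with covariance Σ + I_k:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper): k-dimensional units, m groups.
k, m = 3, 50

beta = rng.normal(size=k)            # 'hypermean' vector
A = rng.normal(size=(k, k))
Sigma = A @ A.T + np.eye(k)          # 'hypercovariance' matrix (positive definite)

# Level 2: unobserved mean vectors theta_i ~ N_k(beta, Sigma), independently.
theta = rng.multivariate_normal(beta, Sigma, size=m)

# Level 1: observations y_i | theta_i ~ N_k(theta_i, I_k), independently.
y = theta + rng.normal(size=(m, k))

# Marginally, y_i ~ N_k(beta, Sigma + I_k).
marginal_cov = Sigma + np.eye(k)
print(y.shape, marginal_cov.shape)
```

The marginal covariance Σ + I_k is the quantity whose hyperprior treatment drives the propriety questions studied below.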
Since it is hazardous to skip demonstrating propriety, at the risk of making inferences from an improper posterior, Berger et al. (Citation2020b) showed the posterior propriety of a 3-level hierarchical model using prior (1), while assuming square design matrices for a technical reason. Berger et al. (Citation2020b) also conjectured that the posterior is proper when the recommended prior is utilized at all levels of a hierarchical normal model with arbitrarily many levels, but they were unable to provide a rigorous proof. In this paper, we complete this story and prove the posterior propriety for the GHNL models in general situations. Besides, as pointed out in Michalak and Morris (Citation2016), researchers have found it daunting and time-consuming to inspect posterior propriety when using improper priors, except in the simplest models. For this reason, we supply user-friendly guidance for checking posterior propriety to practitioners in different practical situations.
In Section 2, we give an explicit definition of the GHNL model, which accommodates arbitrarily many levels and usual design matrices. It is important to note that we consider the ‘vanilla’ covariance matrix problem herein: we do not assume any special structure or sparsity for the hypercovariance matrices. The association between the GHNL model and a linear mixed-effect model is also discussed. In Section 3, we demonstrate that the recommended prior yields a proper posterior in the framework of GHNL modelling. In addition, we exhibit guidance for checking posterior propriety. An efficient MCMC algorithm for sampling from the posterior is introduced in Section 4. Section 5 provides some concluding remarks and further generalizations.
2. Generalized hierarchical normal linear model
In this section, we will introduce the definition of a GHNL model with arbitrarily many levels. The association between the GHNL model and a linear mixed-effect model will be demonstrated as well, which brings insight into the GHNL model. Finally, the recommended prior on the hyperparameters of the GHNL model will be presented and discussed. First, we introduce some notation to be used in the main body of this paper.
Notations Let [k] = {1, …, k} for a positive integer k; 1(·) stands for the indicator function; N_k(μ, Σ) represents the k-dimensional normal distribution with mean μ and covariance Σ, and the same symbol is used for a k-dimensional normal random variable with that mean and covariance; for a symmetric matrix A, A ≻ 0 (A ≺ 0) means that A is a positive (negative) definite matrix, and A ⪰ 0 (A ⪯ 0) denotes that A is a non-negative (non-positive) definite matrix.
2.1. Model structure
Berger et al. (Citation2020b) proposed a 3-level hierarchical model of the form:
(2)
where the observation vectors at the lowest level are centred at unobserved mean vectors, the mean vectors are in turn centred at a ‘hypermean’ vector, the ‘hypercovariance’ matrices of the two upper levels are unknown, and the design matrices linking the levels are known. All the normal random variables in model (2) are mutually independent. Based on the 3-level hierarchical normal model, a more general hierarchical model with arbitrarily many levels can be constructed as
(3)
First, all the normal random variables in the above model are mutually independent. Within model (3), the output of each level consists of units whose values are vectors; by stacking the output units of a level on top of one another, we obtain the outcome vector of that level. In fact, only the outcome of the lowest level can be observed, and the outcomes of the higher levels are inaccessible, latent variables. Hence, the outcome variables of interest are always situated at the lowest level of the hierarchy. Different units in the same level share common input effects (an intercept can be included), which are exactly the outcome vectors from the upper level, except that the input effect of the top level is a vector of fixed effects. In addition, the units from the same level have the same variance component, which accounts for the magnitude of random variation within the corresponding level; the covariance matrices of the upper levels are unknown. The design matrices of observed covariates for the units in each level are known. It is natural to assume that there exist at least two units in each level and that the dimensions of all units are no less than one. Table 1 summarizes several important notations that mainly affect the results for the posterior propriety in Section 3.
Table 1. Summary of certain important notations within model (3).
The extensions from Berger et al. (Citation2020b)'s model (2) to model (3) are two-fold: model (3) accommodates arbitrarily many levels and usual covariate matrices. By stacking the design matrices across the units of each level, an alternative representation of the hierarchical model (3) is thereby given by
(4)
Remark 2.1
If we assume that the covariance matrix for the units from level 1 in model (3) is a known positive definite matrix instead of the identity matrix, the two assumptions are actually equivalent under a suitable reparameterization. Furthermore, for a technical reason, this covariance matrix must be assumed known throughout this paper; the reason will be explained in Section 5.
2.2. Connection with the linear mixed-effect model (LMM)
Two-level hierarchical normal models are often referred to as LMMs in many places. As for the GHNL model (4), collect the unobserved outcome vectors into one set and the unknown covariance matrices into another. If we take the intermediate outcome vectors as intermediate variables, then marginalizing them out yields
(5)
where the marginal covariance matrix is given in
(6)
and is built from the products of the design matrices across levels. Suppose that the design matrices are of full column rank. Then, by Sylvester's rank inequality, their products across levels are also of full column rank. In the rest of the paper, the design matrices are assumed to be of full column rank.
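The rank argument above rests on Sylvester's rank inequality, rank(AB) ≥ rank(A) + rank(B) − n for an m × n matrix A and an n × p matrix B. A quick numerical illustration (the matrices and dimensions are arbitrary, chosen only for the check):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two full-column-rank matrices: A is 6x4, B is 4x2.
A = rng.normal(size=(6, 4))
B = rng.normal(size=(4, 2))

rank = np.linalg.matrix_rank
n = A.shape[1]  # inner dimension shared by A and B

# Sylvester's rank inequality: rank(AB) >= rank(A) + rank(B) - n.
assert rank(A @ B) >= rank(A) + rank(B) - n

# Here rank(A) = n = 4 and rank(B) = 2, so the bound forces rank(AB) = 2,
# i.e. the product also has full column rank.
print(rank(A), rank(B), rank(A @ B))
```

In particular, when the left factor has full column rank equal to the inner dimension, the inequality guarantees that a full-column-rank right factor yields a full-column-rank product, which is exactly how the assumption propagates through the stacked design matrices.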
Consider a particular LMM of the form
(7)
where the first term carries the fixed effect, the random effects are independently normally distributed with covariances given by the hypercovariance matrices, and the vector of random errors is standard normal. By integrating out the random effects, the marginal distribution of the observations conditioning on the hyperparameters is identical to the distribution (5). In a word, the GHNL model is equivalent to an LMM in the sense of the marginal distribution of the observations after integrating out the intermediate outcome vectors or random effects. The equivalence between the GHNL models and the LMMs can also be illustrated by an example of a mixed-effect ANOVA model.
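The equivalence claimed above can be checked on a toy two-stage hierarchy (all matrices, dimensions and the seed are illustrative): composing the two normal stages and marginalizing the latent vector gives the same marginal mean and covariance as the corresponding one-shot LMM representation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy dimensions: observation length 6, latent level length 4, fixed effect length 2.
B1 = rng.normal(size=(6, 4))   # design matrix linking level 1 to level 2
B2 = rng.normal(size=(4, 2))   # design matrix linking level 2 to the fixed effect
beta = rng.normal(size=2)      # fixed-effect (hypermean) vector
L = rng.normal(size=(4, 4))
Sigma2 = L @ L.T + np.eye(4)   # level-2 covariance matrix (positive definite)

# Hierarchical view: theta = B2 beta + u, u ~ N(0, Sigma2); y = B1 theta + e, e ~ N(0, I).
# Marginalizing theta gives y ~ N(B1 B2 beta, B1 Sigma2 B1' + I).
mean_hier = B1 @ (B2 @ beta)
cov_hier = B1 @ Sigma2 @ B1.T + np.eye(6)

# LMM view: y = Z beta + B1 u + e with Z = B1 B2 -- the same marginal moments.
Z = B1 @ B2
mean_lmm = Z @ beta
cov_lmm = B1 @ Sigma2 @ B1.T + np.eye(6)

assert np.allclose(mean_hier, mean_lmm)
assert np.allclose(cov_hier, cov_lmm)
```

The same composition argument, applied level by level, produces the stacked design matrices and the marginal covariance appearing in (5) and (6).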
Example 2.1
Mixed-effect ANOVA model
Suppose we can observe the scores of p courses for student (ijk). The observed data are within a hierarchy of three levels: student (ijk) is nested within class (ij), and class (ij) is nested within school i. Thus, we have a number of schools in total, each school has its classes, and each class has its students. Consider a mixed-effect ANOVA model as
(8)
where the scores, the school effects, the class effects and the student-level errors are all p × 1 vectors; the overall mean is a fixed effect; the school and class effects are independent normal random effects; the student-level independent random errors are normal; and the design matrix is known. The school effects, class effects and errors are independently distributed. Consequently, the three variance components describe the school-level, class-level and student-level variations, respectively. Due to the hierarchical structure of the observations, we can naturally build a hierarchical model as
(9)
independently, across schools, classes and students. Stack the observations, the effects and the errors within each level on top of one another; here vec(A) denotes the column vector obtained by stacking the columns of the matrix A. Then the hierarchical normal model (9) can be expressed as a GHNL model of the form
(10)
whose design matrices are constructed from vectors of ones (1_k denotes the k × 1 vector with all elements being one) and the known course-level design matrix. Similarly, model (8) can be summarized as
(11)
with a fixed effect and two independent normal random effects. By integrating out the random effects, the marginal distributions of the observations for model (11) and model (10) are identical.
Example 2.1 provides a simple illustration of how a hierarchical model and a mixed-effect model can be constructed from nested data, and the equivalence between the two models is also presented. In Appendix 1, we define a special LMM, which is a special case of model (7); the theoretical investigation of this special LMM is distinct from that of the GHNL model. This special LMM could be common in applications, and we provide some theoretical results for interested readers to refer to. We will still focus on the GHNL model in the following sections.
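The nesting in Example 2.1 can be simulated directly; the sketch below follows the structure described there (overall mean as fixed effect, school effect, class effect nested in school, student-level error), while all sizes, means and standard deviations are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative sizes: 4 schools, 3 classes per school, 5 students per class, p = 2 courses.
n_school, n_class, n_student, p = 4, 3, 5, 2
mu = np.array([70.0, 75.0])                    # overall mean (fixed effect), made-up values

tau_school, tau_class, sigma = 2.0, 1.5, 1.0   # assumed per-level standard deviations

scores = np.empty((n_school, n_class, n_student, p))
for i in range(n_school):
    a_i = rng.normal(0.0, tau_school, size=p)          # school effect
    for j in range(n_class):
        b_ij = rng.normal(0.0, tau_class, size=p)      # class effect, nested in school i
        for k in range(n_student):
            e_ijk = rng.normal(0.0, sigma, size=p)     # student-level random error
            scores[i, j, k] = mu + a_i + b_ij + e_ijk

print(scores.shape)  # one p-vector of course scores per student
```

Stacking `scores` level by level, as described above, yields exactly the stacked outcome vectors of the GHNL representation (10).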
2.3. Priors on the hyperparameters
In order to implement a fully Bayesian analysis, we should specify hyperpriors on the hyperparameters. Following the recommendation from Berger et al. (Citation2020b), we can assume priors on the hypermean and the hypercovariance matrices as
(12)
(13)
where the prior (13) on each hypercovariance matrix is a function of its decreasingly ordered eigenvalues. Apart from prior (12), common choices of prior on the hypermean include the constant prior and the conjugate prior. None of the three priors results in an improper posterior or in computational difficulties. However, among the three, prior (12) is the best choice for all dimensions from the perspective of admissibility. Besides, as shown in Berger et al. (Citation2005), prior (12) is a mixture-of-normal prior with the hierarchical structure
(14)
and such mixture-of-normal priors have shown great success in shrinkage estimation particularly (cf. Fourdrinier et al., Citation1998) and in robust Bayesian estimation generally (cf. Berger, Citation1980). Therefore, prior (12) was actually recommended by Berger et al. (Citation2005) for default use.
As for prior (13) on the unknown hypercovariance matrices, consider the transformation from a covariance matrix to its decreasingly ordered eigenvalues and the orthogonal matrix of corresponding eigenvectors; the Jacobian of this transformation is given in
(15)
Consequently, the prior (13) becomes the prior density (16) on the ordered eigenvalues, with respect to Lebesgue measure on the eigenvalues and the invariant Haar measure over the space of orthogonal matrices. Note that the prior on the eigenvalues is improper and, independently, the prior on the orthogonal matrix is constant. Use of a uniform prior for the orthogonal matrix, which ranges over a compact space, is natural and non-controversial and has no influence on the eigenvalues. The term ∏_{i<j}(λ_i − λ_j) is eliminated after the change of variables for prior (13). In contrast, the commonly used priors on a covariance matrix, such as the inverse Wishart, Jeffreys-rule and constant priors, retain this term in the transformed space. It gives low mass to close eigenvalues and hence effectively forces the eigenvalues apart, contrary to the common intuition, which suggests choosing a prior that pushes the eigenvalues closer together. As a result, prior (13) is essentially neutral as to expansion or shrinkage of the eigenvalues.
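The repulsion effect of the term ∏_{i<j}(λ_i − λ_j) can be seen numerically; the two eigenvalue configurations below are illustrative:

```python
from itertools import combinations

def repulsion_term(eigvals):
    """Compute prod_{i<j} (lam_i - lam_j) for decreasingly ordered eigenvalues."""
    lam = sorted(eigvals, reverse=True)
    out = 1.0
    for i, j in combinations(range(len(lam)), 2):
        out *= lam[i] - lam[j]
    return out

close = [1.10, 1.05, 1.00]   # nearly equal eigenvalues
spread = [3.00, 2.00, 1.00]  # well-separated eigenvalues

# A prior carrying this factor assigns far less mass to nearly equal eigenvalues
# than to well-separated ones, which is the repulsion criticized above.
print(repulsion_term(close), repulsion_term(spread))
```

A prior proportional to this factor nearly vanishes in the first configuration, illustrating why removing the term, as prior (13) does, leaves the prior neutral with respect to eigenvalue separation.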
In the context of the 2-level hierarchical normal model, Theorem 1 of Berger et al. (Citation2020b) demonstrated that the combination of priors (12) and (13) on the hyperparameters is on the boundary of admissibility, being as diffuse as possible without yielding inadmissible estimators. Furthermore, it is shown that the generalization allowing covariates at all levels of the hierarchical model does not affect the admissibility result (cf. Berger et al., Citation2020b). Nonetheless, the admissibility of the recommended prior for a hierarchical model with more than two levels is not clear. Generally speaking, this is a very difficult question to answer, and in what follows we mainly justify the recommendation of hyperpriors from the angles of posterior propriety and computation in the framework of the GHNL model.
3. Posterior propriety
Berger et al. (Citation2020b) showed that the posterior resulting from the recommended prior is proper for the 3-level hierarchical model (2), but under a narrow set of assumptions. They also conjectured posterior propriety for a hierarchical model with any number of levels, although a rigorous proof was not given. In this section, we comprehensively investigate the conditions for the posterior propriety of the GHNL model (4) using the recommended prior in more general situations. The dimension d of the fixed effect affects the investigation of posterior propriety considerably, and the two cases d ≥ 2 and d = 1 will be discussed separately.
Based on (5) and (14), by integrating out the hypermean, we can obtain the marginal distribution of the observations conditioning on the remaining hyperparameters as
(17)
The posterior propriety of the GHNL model (4) employing priors (13) and (14) is defined as
(18)
namely, the finiteness of the marginal density of the observation vector. Next, we display some definitions and additional notation that are frequently used in this section.
More Notations Let |A| denote the cardinality of a set A. For two positive sequences, we write that they are of the same order if their ratio is bounded above and below by positive constants. For a symmetric matrix A, λ_i(A) represents the i-th largest eigenvalue of A, so that λ_1(A) ≥ ⋯ ≥ λ_k(A); let λ_max(A) and λ_min(A) denote the maximum and minimum eigenvalues of an arbitrary symmetric matrix A.
Definition
For convenience, let ,
and
. Define a function of an arbitrary non-empty set E as
, such that
denotes the set of all non-empty subsets of E. For any
, let
. Further define the composition of
and H as
for any
. Define that
(19)
(19) for
,
and
.
3.1. Two key lemmas
Before formally investigating the posterior propriety, we first introduce two important lemmas that are central to the proofs of the main theorems in this paper.
Lemma 3.1
Assume that are
positive definite matrices,
. Let
be
matrices of full column ranks,
. Define that
(20)
(20) Then
, where
.
Also,
(21)
(21) where
. For any
,
, where
for
and
for
.
Lemma 3.1 mainly demonstrates two inequalities with respect to the summation of a series of quadratic forms that have matrix inputs. Since the design matrices have full column rank and the covariance matrices are positive definite, both bounds are well defined. It is worth noting that, in the lower bound on (20), the non-zero diagonal elements are the decreasingly ordered eigenvalues, and this relation deeply influences the final result when we derive the sufficient condition for posterior propriety. Besides, the bound cannot be replaced, for any fixed constant, by one involving only the product of all non-zero eigenvalues. The proof of Lemma 3.1 can be found in Appendix A.2.
Lemma 3.2
Assume a positive integer k. Suppose is a subset of
with cardinality n and let
denote the sequence of all the elements within
. Define an integral
where
and
are real constants, and
. Then the integral I is finite if and only if (iff) the following two conditions are both satisfied.
,
;
inequalities
(22)
(22) for all
hold, where
.
Here, Lemma 3.2 (b) may not be straightforward, so we use the following example to elaborate on how Lemma 3.2 can be employed.
Example 3.1
Consider integral
where
are all real constants. Similar to Lemma 3.2, we can define
. Then
iff all the following inequalities hold: (a)
; (b)
Note that, no matter how the index set is defined, we always need to check the inequalities corresponding to all of its non-empty subsets. Even though some inequalities turn out to be trivial once written down, for the sake of assurance, we had better take all non-empty subsets into account at this early stage. Lemma 3.2 plays a crucial role in obtaining the follow-up theorems, and the detailed proof of Lemma 3.2 is given in Appendix A.2.
3.2. Conditions for the posterior to be proper when d ≥ 2
In this subsection, the case d ≥ 2 is mainly considered.
Theorem 3.1
Consider the GHNL model (Equation4(4)
(4) ) with priors (Equation12
(12)
(12) ) and (Equation13
(13)
(13) ) on
and
, respectively. When
, a sufficient condition for the posterior propriety is given by
(23)
(23) for any
.
Proof.
It follows (Equation17(17)
(17) ) that the integrated likelihood of
after marginalizing out
is given by
(24)
(24) By dropping the exponent term involving
(since it is less than one), we have
By applying Lemma 3.1 (b), we can further bound the integrated likelihood as
(25)
(25) where
is a positive constant that only depends on
's and
(26)
(26) where
are the diagonal matrices of the decreasingly ordered eigenvalues of
for
,
,
are
zero matrices and
for
. Since
, then
is a degenerate matrix as scalar λ and the prior on λ becomes
(27)
(27) Combining (Equation16
(16)
(16) ), (Equation18
(18)
(18) ), (Equation27
(27)
(27) ) and (Equation25
(25)
(25) ), we have
The definition of
in (Equation19
(19)
(19) ) yields
Therefore,
which is finite iff
(28)
(28) for any
by employing Lemma 3.2. It's obvious that the inequality (Equation28
(28)
(28) ) is equivalent to that in (Equation23
(23)
(23) ).
Applying Lemma 3.1 yields the upper bound on the integrated likelihood function of hyperparameters as , where
is an
matrix and defined in (Equation26
(26)
(26) ). Therefore, the special notation
can be understood as the indicator of whether eigenvalue
appears in the s-th diagonal element of
for
,
and
. At the same time, the left-hand side of inequality (Equation23
(23)
(23) ) actually stands for the cardinality of set
for any
.
The total number of inequalities to be checked in (23) grows exponentially with the number of levels r and the dimensions. Admittedly, applying Theorem 3.1 directly would therefore impose a considerably heavy computational burden in common practice. Nevertheless, researchers need not be anxious about this burden, because most of the inequalities in Theorem 3.1 are trivial. To make this point precise, we have the following corollary.
Corollary 3.1
Recursively define that
where p is the smallest positive integer i such that
. We call the levels within
as kernel levels and denote
by
. The inequality (Equation23
(23)
(23) ) holds for any
iff the inequality (Equation23
(23)
(23) ) holds for any
. Consequently, if
, then the posterior is always proper.
Proof.
Let , thus,
for
by the definition of
. For any
, if there exists
such that
for any
, then
which is a trivial one. As a result, inequality (Equation23
(23)
(23) ) holds for any
iff inequality (Equation23
(23)
(23) ) holds for any
. Since
it can be recursively shown that inequality (Equation23
(23)
(23) ) holds for any
iff inequality (Equation23
(23)
(23) ) holds for any
and
, where p is the smallest positive integer i such that
.
By using the technique of extracting kernel levels, we dramatically narrow down the checking region for posterior propriety: we only need to check the inequalities for the levels within the kernel set, which substantially reduces the number of inequalities to be checked. Moreover, Corollary 3.1 also indicates two interesting conclusions, described as follows.
First, it reveals how three factors, the number of levels, the numbers of units within levels and the dimensions of levels, affect the posterior propriety simultaneously. Roughly speaking, in the context of the GHNL model, the more levels there are with fewer units of lower dimensions, the less likely the posterior is to be proper. For example, when every level contains only a few low-dimensional units, we obtain
(29)
and the posterior propriety can hardly be guaranteed by Theorem 3.1. Conversely, if the units in each level are adequate enough that the set of kernel levels is empty, then the posterior is always proper. As a consequence, more attention should be focused on the levels with a small number of units, namely the kernel levels.
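As a small illustration of how extracting kernel levels shrinks the checking burden: a naive application of Theorem 3.1 requires one inequality per non-empty subset of the candidate levels, of which there are 2^n − 1, while Corollary 3.1 restricts attention to non-empty subsets of the (typically much smaller) kernel set. The counts below use hypothetical sizes:

```python
from itertools import combinations

def nonempty_subsets(elements):
    """All non-empty subsets of a finite collection, as tuples."""
    subs = []
    for size in range(1, len(elements) + 1):
        subs.extend(combinations(elements, size))
    return subs

all_levels = list(range(1, 7))   # say 6 candidate indices in total (illustrative)
kernel_levels = [2, 5]           # hypothetical kernel levels extracted via Corollary 3.1

full = nonempty_subsets(all_levels)
reduced = nonempty_subsets(kernel_levels)

print(len(full), len(reduced))   # 63 candidate inequalities shrink to 3
```

The exponential-versus-small contrast here is exactly the reduction the corollary delivers.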
Second, the recommended prior for use at any level in hierarchical modelling is further justified from the aspects of posterior propriety and ease of implementation. For instance, suppose we switch the prior on the hypercovariance matrices from (13) to the alternative prior
(30)
indexed by a constant a, which has to be larger than zero by Lemma 3.2. Then the condition in Theorem 3.1 becomes
(31)
On the one hand, when a is greater than but close to zero, all the inequalities in (31) always hold, and thus the posterior is always proper. However, it is impractical to decide how close to zero a should be and to find one fixed value that fits all levels. On the other hand, when a is less than and close to one, inequality (31) is harder to satisfy than inequality (23), especially for large dimensions; therefore, the posterior using prior (30) is less likely to be proper than that using prior (13). Similar to Corollary 3.1, we can recursively define the set of kernel levels associated with prior (30), and the posterior using prior (30) is proper if (31) holds for all non-empty subsets of that set. When a is close to one, this set will be remarkably larger than the set of kernel levels for prior (13) when the dimensions are large, imposing a dramatically heavier burden of checking inequalities than that for prior (13). Above all, one sensible choice is to let a be inversely proportional to the dimension of level j, which takes both practical and theoretical considerations into account.
The upper bound used above leads to an effective way to extract kernel levels, as presented in Corollary 3.1, but it is still too rough. Next, an elaborate upper bound on the integrated likelihood is demonstrated, and a sufficient condition of clean form for posterior propriety is then derived.
Theorem 3.2
Consider the GHNL model (Equation4(4)
(4) ) with priors (Equation12
(12)
(12) ) and (Equation13
(13)
(13) ) on
and
, respectively. Denote that
. When
, the posterior is always proper if
(32)
(32)
Proof.
For , define that
(33)
(33) It follows from Corollary 3.1 that we only need to prove
Also, for any D belonging to
, we have
(34)
(34) Distinct eigenvalues from the same level (
-th) never occur in the same row of matrix
. Mathematically,
, for
,
,
. Thus, for any
,
(35)
(35) which is also true for j = r + 1 since
. It is obvious that
by the definition of
, and that is denoted by
. Combining (Equation35
(35)
(35) ) with (Equation34
(34)
(34) ) yields
(36)
(36)
According to the proof of Theorem 3.2, a slightly weaker condition can also be deduced to be sufficient for posterior propriety; obviously, the condition in Theorem 3.2 is easier to satisfy. Theorem 3.2 reveals that, for fixed numbers of units, the posterior is more likely to be proper when the units in the kernel levels have higher dimensions. Theorem 3.2 also provides researchers with a powerful tool to check the posterior propriety quickly.
Remark 3.1
Consider the model (Equation4(4)
(4) ) with r = 1, namely, a two-level hierarchical model. When
, we have
. Then the posterior using the recommended prior is always proper for
referring to Theorem 3.2.
Example 3.2
Continue with Example 2.1
Consider the GHNL modelling of the mixed-effect ANOVA as (Equation10(10)
(10) ), which is a 3-level hierarchical model with r = 2,
,
,
,
,
and
. It is natural to assume that we have at least two schools, each school has at least two classes and each class has at least two students, namely,
,
and
. If (a) p>2 and
or (b)
and
holds, it can be readily derived that the set of kernel levels is empty. Thus, the posterior is always proper according to Corollary 3.1. When
, the set of kernel levels is
, since
, then the posterior is proper by applying Theorem 3.2. In conclusion, the posterior using the recommended prior is always proper when
.
In Berger et al. (Citation2020b)'s work, for a technical reason, they assumed k = sp for the 3-level hierarchical normal model (2), so that the design matrices for the units within levels 2 and 3 are square matrices. They eventually reached the conclusion that the posterior employing the recommended prior in model (2) is always proper for k = sp together with a condition on p, where p is the dimension of the hypermean in model (2). However, the assumption that the design matrices for the units at higher levels are square appears unnatural and hard to interpret in practice. Nevertheless, we generalize Berger et al. (Citation2020b)'s result to the GHNL model in the following, so as to draw a conclusion consistent with theirs.
Corollary 3.2
Consider the GHNL model (4) with priors (12) and (13) on
and
, respectively. Assume that
and
,
. Then the posterior is always proper.
Proof.
Since ,
, then
. It remains to show that
by Theorem 3.2. By utilizing the condition that
for
and
, we have
for
. Thus,
which completes the proof.
Above all, a general procedure for checking the posterior propriety of the GHNL model (4) employing the recommended prior for
can be summarized as follows.
Guidance for checking the posterior propriety when :
(a) If the design matrices for each unit in each level are square matrices, then the posterior is proper; otherwise, turn to (b).
(b) Derive the set of kernel levels,
. If
or inequality (32) holds, then the posterior is proper. If neither, turn to (c).
(c) Check inequality (23) for all D belonging to
. If it always holds, then the posterior is proper; if not, the posterior propriety can hardly be guaranteed.
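The three-step guidance above is purely a decision flow, so it can be sketched as a small helper. In the sketch below, the concrete inequalities (32) and (23) are passed in as callables, since their exact forms depend on the model's dimensions and design matrices; all function and argument names are hypothetical, for illustration only.

```python
from itertools import chain, combinations

def posterior_is_proper(all_designs_square, kernel_levels,
                        ineq32_holds, ineq23_holds_for):
    """Return True when the guidance certifies propriety, False when
    propriety cannot be guaranteed by the guidance."""
    # Step (a): every unit's design matrix in every level is square.
    if all_designs_square:
        return True
    # Step (b): the set of kernel levels is empty, or inequality (32) holds.
    if not kernel_levels or ineq32_holds:
        return True
    # Step (c): inequality (23) must hold for every non-empty subset D
    # drawn from the kernel levels.
    subsets = chain.from_iterable(
        combinations(sorted(kernel_levels), m)
        for m in range(1, len(kernel_levels) + 1))
    return all(ineq23_holds_for(set(D)) for D in subsets)
```

A False return only means the sufficient conditions fail, matching the guidance's "can hardly be guaranteed" outcome rather than a proof of impropriety.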
3.3. Conditions for the posterior to be proper when d = 1
It is quite common that the dimension of the fixed effect is only one in practice. However, when
, note that
for
, resulting in the failure of the sufficient condition in Theorem 3.1. Therefore, in this subsection, we mainly reinvestigate the conditions for the posterior to be proper for d = 1.
Theorem 3.3
Consider the GHNL model (4) with a constant prior on η and prior (13) on
. When d = 1, the posterior is proper if
(37) holds for all
.
Proof.
When d = 1, the vector degenerates into a scalar η. Hence the prior (12) on
becomes a constant prior on η. Based on (5), by integrating out over η, we can get the upper bound on the integrated likelihood of
after dropping the exponential term (which is less than one):
Since
, using Lemma 3.1 (a) and (b), we have
where
and
are constants that are independent of the
for
,
and
's are defined in (26). Similar to the proof of Theorem 3.1, we can derive the upper bound on
as
It follows from the definition of
in (19) that
Thus,
which is finite iff
(38) for any
by employing Lemma 3.2. It is obvious that inequality (38) is equivalent to that in (37).
Resembling the interpretation of Theorem 3.1, the left-hand side of inequality (37) actually denotes the cardinality of the set
for any
. To reduce the burden of checking inequalities, we have the following corollary.
Corollary 3.3
Recursively define that
where q is the smallest positive integer i such that
. We call the levels within
as kernel levels and denote
by
. Inequality (37) holds for any
iff inequality (37) holds for any
. Consequently, if
, then the resulting posterior is always proper.
In the process of extracting kernel levels, the thresholds of to split up levels in Corollary 3.3 are increased by one compared with those in Corollary 3.1, because the upper bound on the right-hand side of inequality (37) is one larger than that of inequality (23). Apart from this point, the proof of Corollary 3.3 is the same as that of Corollary 3.1. For good measure, a simple tool to check the posterior propriety is shown as follows, which is a counterpart of Theorem 3.2 for d = 1.
Theorem 3.4
Consider the GHNL model (4) with a constant prior on η and prior (13) on
. When
, the posterior is always proper if
(39) where
and
is the derived set of kernel levels.
Proof.
For any , define
and
in the same way as in (33). According to Corollary 3.3, it suffices to show that
for any
Similar to (36), we have
for any
. Since
always holds, the proof is completed.
Remark 3.2
In model (4), when r = 1 and d = 1, suppose
and
. The posterior using the recommended prior is always proper by applying Theorem 3.4.
Remark 3.3
If all 's are equal to one, the sufficient condition in Theorem 3.3 can be simplified as
where
. By employing the technique of extracting kernel levels, Theorem 3.4 is equivalent to Theorem 3.3, rather than being merely a sufficient condition.
Example 3.3
Continue with Example 2.1
Consider model (10), assuming that
,
and
. If p = 1, we have
. When
, we can easily derive the set of kernel levels as empty set. Thus, the posterior is always proper by Corollary 3.3. If
, inequality (37) fails, and the posterior propriety can hardly be guaranteed. Consequently, when p = 1, the posterior using the recommended prior is always proper for
.
Next, we generalize Berger et al. (Citation2020b)'s result to the GHNL models for d = 1, assuming that the design matrices 's for units are square matrices.
Corollary 3.4
When , consider the same model and prior as in Theorem 3.3. Suppose
and
,
. Then the posterior is always proper.
Proof.
It follows from Theorem 3.4 that we only need to show
According to the conditions that
,
and
, we have
,
. Thus,
Summing up the theoretical results above, a general procedure for checking the posterior propriety of the GHNL model (4) employing the recommended prior for d = 1 can be depicted as follows.
Guidance for checking the posterior propriety when d = 1:
(a) If the design matrices for each unit in each level are square matrices and
,
, then the posterior is proper; otherwise, turn to (b).
(b) Derive the set of kernel levels
. If
or inequality (39) holds, then the posterior is proper. If neither, turn to (c).
(c) Check inequality (37) for all D belonging to
. If it always holds, then the posterior is proper; if not, the posterior propriety can hardly be guaranteed.
4. Computation
In this section, we consider MCMC sampling from the posterior arising from the model in Section 2. For the GHNL model (4) with priors (14) and (13) on
and
, respectively, the joint posterior of
can be written as
(40) Sampling
from the posterior density (40) can be handled by a Gibbs sampling method. The main difficulty of the computation is to sample the covariance matrices
's efficiently.
4.1. Gibbs sampling for input effects
The full conditionals of the input effects can be derived from the joint posterior (40) and are given as follows.
Conditioning on
and
, the posterior distribution of
is
(41)
(41) where
The full conditional posteriors of
,
have the forms:
(42)
(42) where
and
.
By using (14), the full conditional of
can be derived as
(43)
(43) where
Input effects and
can be readily sampled from their conditionals during the Gibbs sampling procedure, as their full conditional posterior distributions are all standard distributions.
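Since the full conditionals above are all standard distributions, the Gibbs cycle itself is mechanical. As a minimal illustration (not the GHNL conditionals themselves, whose exact expressions depend on the design matrices and are not reproduced here), the following sketch runs a Gibbs sampler for the simplest two-level normal model y_ij ~ N(theta_j, sigma2) with sigma2 known, theta_j ~ N(mu, tau2), a flat prior on mu and an assumed IG(a, b) prior on tau2; all names and hyperparameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated two-level data: J groups with n observations each and a known
# error variance sigma2 (the paper likewise treats the first-level
# variance component as given).
J, n, sigma2 = 8, 20, 1.0
true_theta = rng.normal(5.0, 2.0, size=J)
y = true_theta[:, None] + rng.normal(0.0, np.sqrt(sigma2), size=(J, n))

def gibbs(y, sigma2, iters=2000, a=2.0, b=2.0):
    """Gibbs sampler for y_ij ~ N(theta_j, sigma2), theta_j ~ N(mu, tau2),
    flat prior on mu, assumed IG(a, b) prior on tau2."""
    J, n = y.shape
    ybar = y.mean(axis=1)
    mu, tau2 = ybar.mean(), 1.0                      # starting values
    theta_d = np.empty((iters, J))
    mu_d = np.empty(iters)
    tau2_d = np.empty(iters)
    for t in range(iters):
        # Group means: precision-weighted normal full conditionals.
        prec = n / sigma2 + 1.0 / tau2
        theta = rng.normal((n * ybar / sigma2 + mu / tau2) / prec,
                           np.sqrt(1.0 / prec))
        # Hypermean under the flat prior: normal full conditional.
        mu = rng.normal(theta.mean(), np.sqrt(tau2 / J))
        # Hypervariance: inverse gamma full conditional, drawn as the
        # reciprocal of a Gamma(shape, scale = 1/rate) variate.
        tau2 = 1.0 / rng.gamma(a + J / 2.0,
                               1.0 / (b + 0.5 * np.sum((theta - mu) ** 2)))
        theta_d[t], mu_d[t], tau2_d[t] = theta, mu, tau2
    return theta_d, mu_d, tau2_d

theta_d, mu_d, tau2_d = gibbs(y, sigma2)
```

Each sweep updates the group effects, the hypermean and the hypervariance from exact full conditionals, mirroring the alternating structure of Sections 4.1 and 4.2; the GHNL sampler replaces these scalar updates with the matrix-valued conditionals derived above.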
4.2. Gibbs sampling for variance components
The variance components, which include the 's and λ, can be updated from their full conditionals, whose densities are as follows.
Given
, the conditional posterior density of λ is
(44) which is actually an inverse gamma distribution as
.
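For reference, the inverse gamma family used for the λ update has the standard density below; the shape a and scale b here stand for whatever expressions (44) specifies, which are not reproduced in this text.

```latex
% Generic inverse gamma density; a is the shape and b the scale implied by (44).
p(\lambda \mid a, b) \;=\; \frac{b^{a}}{\Gamma(a)}\,\lambda^{-a-1}
\exp\!\left(-\frac{b}{\lambda}\right),
\qquad \lambda > 0,\; a > 0,\; b > 0.
```

A draw from IG(a, b) can be obtained as the reciprocal of a Gamma(a, scale = 1/b) draw, which is how the λ update is typically implemented.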
For
, define that
. The marginal posterior density of
given
is
(45) where
denotes
for a square matrix
, and
The updating of λ can be simply carried out by sampling from an inverse gamma distribution. The full conditional posteriors of , (45) actually follow a class of distributions recently proposed by Berger et al. (Citation2020a) as priors for the covariance matrix, called the Shrinkage Inverse Wishart (SIW) distributions. The new class
for a
covariance matrix
has the density as
(46) where
are the ordered eigenvalues of
, a is a real constant and
is a
non-negative definite matrix. Thus,
are distributed as
,
. To sample the covariance matrices from the full conditional posteriors, the previously suggested methods include the Metropolis-Hastings algorithm (cf. Berger et al., Citation2005) and the hit-and-run method (cf. Yang & Berger, Citation1994). Both methods generate full candidate matrices from full-parameter proposal distributions, so they only work for moderate dimensions of the covariance matrices. To tackle this issue, Berger et al. (Citation2020a) proposed a powerful Gibbs method for efficiently sampling the covariance matrices from their conditional densities, and this new method works for higher dimensions k. The reader can refer to Berger et al. (Citation2020a) or Appendix 3 for details of this Gibbs sampling method. According to the simulation results of Berger et al. (Citation2020a), the new Gibbs method outperforms the Metropolis-Hastings and hit-and-run methods for moderate dimensions and works for k up to 100, while the other two algorithms break down in much lower dimensions.
In the framework of the 2-level HNLM, Berger et al. (Citation2020b) compared the numerical performance, from the mean squared error (MSE) perspective, of a dozen objective hyperpriors, formed as the product of three objective hyperpriors for the hypermean and four objective hyperpriors for the hypercovariance matrix. Priors on the hypermean include the constant prior, the conjugate prior and the recommended prior (12). Priors on the hypercovariance matrix include the constant prior, the hierarchical Jeffreys prior, the hierarchical reference prior and the recommended prior (13). Their simulation results showed that the recommended combination of hyperpriors dominates all the others in terms of Bayes risk, and that the constant prior on the hypercovariance performs the worst. However, neither of the two remaining choices for the hypercovariance is computationally easy. Considering the 4-level HNLM, Song et al. (Citation2020) performed numerical experiments to compare the recommended prior with the constant prior for the hypercovariance matrices; the other two priors were excluded due to intractable computation. Song et al. (Citation2020)'s results also showed the dominance of the recommended hyperpriors over the other priors. In conclusion, both Berger et al. (Citation2020b) and Song et al. (Citation2020) have provided strong numerical evidence of the superiority of the recommended hyperpriors for use in the GHNL model, since the 2-level and 4-level HNLMs are both specific GHNL models.
5. Discussion
We have proposed a generalized hierarchical normal linear model applicable to nested data with complex structures. The GHNL model proves to be equivalent to an LMM, while the GHNL model is more natural for researchers modelling nested data from scratch, especially when incorporating covariates at high levels. Like generalizations of the simple normal linear model, the GHNL model can be generalized to a hierarchical model with a generalized linear model in the first level, so that discrete observations can be handled. Besides, the first level (or even higher levels) of the GHNL model can also be extended to the setting of semiparametric regression models, such as the single-index model and the partially linear model. The techniques of modelling and investigation in this paper can be applied to the linear part of the models mentioned above. The statistical analysis could be complicated, however, and such explorations are beyond the scope of this paper.
Berger et al. (Citation2020b) put an end to the endless search for appropriate hyperpriors in hierarchical modelling and comprehensively investigated their properties to justify the recommendation. Nonetheless, when it came to the propriety of the resulting posterior, they only conjectured that it holds at any level of a general hierarchical normal model; the conditions for this were not given. To complete the story, we have studied the conditions for the posterior to be proper in more general situations than Berger et al. (Citation2020b), when employing the recommended prior for the GHNL model. Theorems 3.1 and 3.3 state the main results, and Corollaries 3.1 and 3.3 reduce the computational burden by defining kernel sets for and d = 1, respectively. In addition, Theorems 3.2 and 3.4 provide powerful tools of simple form for checking posterior propriety for
and d = 1, separately. User-friendly guidance for checking posterior propriety is eventually supplied. Note that our results only present sufficient conditions; necessary conditions are not discussed. This is because the derivation of a lower bound on the integrated likelihood of the hyperparameters is intractable. Moreover, it is not worthwhile to investigate necessary conditions, as the derived upper bounds are tight enough that the corresponding sufficient conditions are very modest, according to the remarks and examples in Section 3. At last, an efficient and powerful Gibbs sampling method for sampling from the posterior is introduced, overcoming the computational bottleneck that the previously proposed sampling methods work only in low dimensions, or in moderate dimensions inefficiently. Numerical evidence supporting the superiority of the recommended prior for hierarchical models was presented in Berger et al. (Citation2020b) and Song et al. (Citation2020).
Though we have made much progress in hierarchical linear modelling, a major obstacle to applying our results is that the variance component for the first level is assumed to be known, which can hardly be satisfied in practice. If we assume an unknown covariance matrix for the first level and specify prior (13) on it, the exponential term within the likelihood can no longer be simply dropped when deriving the upper bound; otherwise, the resulting integral will always be infinite. The upper bound on the exponential term with respect to the eigenvalues of the covariance matrices is very tricky to obtain, and the condition for the integrability of the resulting integral remains to be further studied. Thus, GHNL modelling with unknown
can be taken as a follow-up study to this paper.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Additional information
Funding
References
- Berger, J. (1980). A robust generalized Bayes estimator and confidence region for a multivariate normal mean. Annals of Statistics, 8(4), 716–761. https://doi.org/10.1214/aos/1176345068
- Berger, J., Strawderman, W., & Tang, D. (2005). Posterior propriety and admissibility of hyperpriors in normal hierarchical models. Annals of Statistics, 33(2), 606–646. https://doi.org/10.1214/009053605000000075
- Berger, J., Sun, D., & Song, C. (2020a). Bayesian analysis of the covariance matrix of a multivariate normal distribution with a new class of priors. Annals of Statistics, 48(4), 2381–2403. https://doi.org/10.1214/19-AOS1891
- Berger, J., Sun, D., & Song, C. (2020b). An objective prior for hyperparameters in normal hierarchical models. Journal of Multivariate Analysis, 178, 104606. https://doi.org/10.1016/j.jmva.2020.104606
- Consonni, G., Fouskakis, D., Liseo, B., & Ntzoufras, I. (2018). Prior distributions for objective Bayesian analysis. Bayesian Analysis, 13(2), 627–679. https://doi.org/10.1214/18-BA1103
- Daniels, M. J., & Kass, R. E. (1999). Nonconjugate Bayesian estimation of covariance matrices and its use in hierarchical models. Journal of the American Statistical Association., 94(448), 1254–1263. https://doi.org/10.1080/01621459.1999.10473878
- Everson, P. J., & Morris, C. N. (2000). Inference for multivariate normal hierarchical models. Journal of the Royal Statistical Society: Series B, 62(2), 399–412. https://doi.org/10.1111/rssb.2000.62.issue-2
- Fourdrinier, D., Strawderman, W. E., & Wells, M. T. (1998). On the construction of Bayes minimax estimators. Annals of Statistics, 26(2), 660–671. https://doi.org/10.1214/aos/1028144853
- Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis, 1(3), 515–534. https://doi.org/10.1214/06-BA117A
- Goldstein, H. (2011). Multilevel statistical models (Vol. 922). John Wiley & Sons.
- Gustafson, P., Hossain, S., & Macnab, Y. C. (2006). Conservative prior distributions for variance parameters in hierarchical models. Canadian Journal of Statistics, 34(3), 377–390. https://doi.org/10.1002/cjs.v34:3
- Hobert, J. P., & Casella, G. (1996). The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91(436), 1461–1473. https://doi.org/10.1080/01621459.1996.10476714
- Hoff, P. D. (2009b). Simulation of the matrix Bingham-Von Mises-Fisher distribution, with applications to multivariate and relational data. Journal of Computational and Graphical Statistics, 18(2), 438–456. https://doi.org/10.1198/jcgs.2009.07177
- Horn, R. A., & Johnson, C. R. (2012). Matrix analysis. Cambridge University Press.
- Lindenberger, U., & Pötter, U. (1998). The complex nature of unique and shared effects in hierarchical linear regression: Implications for developmental psychology. Psychological Methods, 3(2), 218–230. https://doi.org/10.1037/1082-989X.3.2.218
- Michalak, S. E., & Morris, C. N. (2016). Posterior propriety for hierarchical models with log-Likelihoods that have norm bounds. Bayesian Analysis, 11(2), 545–571. https://doi.org/10.1214/15-BA962
- Raudenbush, S., & Bryk, A. S. (1986). A hierarchical model for studying school effects. Sociology of Education, 59(1), 1–17. https://doi.org/10.2307/2112482
- Shimotsu, K. (2010). Exact local Whittle estimation of fractional integration with unknown mean and time trend. Econometric Theory, 26(2), 501–540. https://doi.org/10.1017/S0266466609100075
- Song, C., Sun, D., Fan, K., & Mu, R. (2020). Posterior propriety of an objective prior in a 4-level normal hierarchical model. Mathematical Problems in Engineering, 2020, 8236934. https://doi.org/10.1155/2020/8236934
- Speckman, P. L., & Sun, D. (2003). Fully Bayesian spline smoothing and intrinsic autoregressive priors. Biometrika, 90(2), 289–302. https://doi.org/10.1093/biomet/90.2.289
- Sun, D., Tsutakawa, R. K., & He, Z. (2001). Propriety of posteriors with improper priors in hierarchical linear mixed models. Statistica Sinica, 11(1), 77–95. http://www.jstor.org/stable/24306811
- Xia, A., Ma, H., & Carlin, B. P. (2011). Bayesian hierarchical modeling for detecting safety signals in clinical trials. Journal of Biopharmaceutical Statistics, 21(5), 1006–1029. https://doi.org/10.1080/10543406.2010.520181
- Yang, R., & Berger, J. (1994). Estimation of a covariance matrix using the reference prior. Annals of Statistics, 22(3), 1195–1211. https://doi.org/10.1214/aos/1176325625
Appendices
Appendix 1. A special LMM
Consider a special LMM of the form:
(A1) where
denotes the observations and is an
vector,
is the vector of fixed effects and is an
vector. For
,
's are
vectors and represent the vectors of random effects, and
's are assumed to be independently distributed as
, where
's are
positive definite matrices and unknown.
is an
matrix,
's are
matrices, and
and
's are known design matrices.
is the vector of random errors and distributed as
,
is an
positive definite matrix and given.
It follows from Berger et al. (Citation2020b) that we can assume independent priors on as
(A2) where
are the ordered eigenvalues of
,
. The prior on
has a hierarchical structure of the form
The posterior propriety results for the special LMM (A1) are displayed as follows. Firstly, let
and
. Denote the index set of the variance scale or the eigenvalues of the covariance matrices by
, and
represents the set of the non-empty subsets of F. Define that
for
,
and
.
Theorem A.1
Consider the linear mixed effects model (A1) with prior (A2) on
. Assume p>1; then the posterior is proper if
(A3) holds for any
.
The proof of Theorem A.1 is similar to that of Theorem 3.1 and is omitted here.
Fact A.1
When p>1, (A3) holds for any
iff
(A4)
Proof.
For any and
, (A3) is equivalent to
(A5) It can be deduced that inequality (A5) holds for any
and
iff
(A6) which is equivalent to
since
for
.
Inequality (A3) holds for any
with
iff
(A7) Under the condition that
, (A7) is equivalent to
Corollary A.1
Consider model (A1) with prior (A2) on the parameters. The posterior is proper if one of the following conditions holds:
(a) p>1 + r and
;
(b) p>1 and
.
Proof.
Since
(a) and (b) follow from Fact A.1 directly.
Remark A.1
Consider model (A1) with r = 1. The posterior using prior (A2) is proper if either (a)
,
or (b)
,
holds. If p = 1, the posterior propriety can hardly be satisfied, for two reasons. First, inequality (A3) fails for
. Second, if we follow the thread of deriving the condition in Theorem 3.3, a sufficient condition can be derived as
(A8) for any
with
. However, inequality (A8) does not hold for
,
.
Appendix 2
Proof of lemmas in Section 3.1
Lemma A.1
Min–max theorem (cf. Horn & Johnson, Citation2012): For an symmetric matrix
and a non-zero
vector
, the Rayleigh quotient for
and
can be defined as
where
denotes the Euclidean inner product. Then
(A9) where U denotes a linear subspace of
. Especially,
(A10)
Lemma A.2
For symmetric matrices
,
, we have
;
supposing
,
, then
Proof.
For (a), given a non-zero vector
, by (A10),
(A11) The proof of (a) is completed by using (A10) again.
For (b), it suffices to prove that for any
(A12) Since
,
, for any
,
and
,
Minimize both sides of the inequality above over
first. Then take the maximum over
. (A12) can then be easily obtained by using Lemma A.1.
A.1. Proof of Lemma 3.1
For (a), by Lemma A.2 (a), it suffices to prove that ,
. For any
, applying (A10) yields
It is obvious that
, otherwise, it will result in
, which is a contradiction. In addition, since
, we have
Therefore, we have proved part (a).
For (b), we only need to prove that
It follows from Lemma A.2 (b) that for any
Since
, then
,
. Thus, it remains to show that
(A13) Firstly, for any
, we introduce a linear transformation
defined as
,
. The kernel space of
is denoted by
. Since
is of full column rank
, then the dimension of the complementary space of
is
, i.e.,
. Thus, the mapping
is a one-to-one mapping. For any
with
,
, define that
It is obvious that
and
. For any
with
and any
, there exists one and only one
such that
. Since
, we have
(A14) It follows from Lemma A.1 that
(A15) for
. Minimize both sides of inequality (A14) over
first. Then take the maximum over
, and (A13) can be easily obtained by using Lemma A.1 and (A15).
A.2. Proof of Lemma 3.2
The domain of the integral can be divided into
i.e.,
. Thus, the integral I is finite iff the integrals over
and
for each
are finite.
Denote the integrand as . Then
which is finite iff condition (a) is satisfied.
To verify condition (b), we only need to justify the following statement. Also, we assume condition (a) is always satisfied hereafter.
Fact A.2
For all with
,
, the integrals
are finite iff inequalities
(A16) hold for all
with
and
. Under the condition above,
(A17) always holds, where
The reason why formula (A17) is required is that it plays an important role in verifying condition (A16).
Proof.
We prove the result by the technique of mathematical induction. First, we assume that the statement in Fact A.2 is true for L = l, . With this assumption, we must show that the statement is true for its successor,
. Write an arbitrary set
with cardinality
as
, where
. Denote that
,
.
Step 1: We first prove that is finite iff inequalities (A16) hold for
.
Region can be divided into
Therefore, integral
is finite iff
for any
. For
, we have
where
, and it's easy to see that
and
. Since
(A18)
(A19) and the right-hand sides of (A19) and (A18) are finite simultaneously under condition (a). Hence, the left-hand side of (A18) is finite iff
is finite.
Furthermore, by the assumption and (A17), we have
Thus, under condition (a) and the assumption, we have
the RHS of which is finite iff
.
In conclusion, is finite iff
and
is finite for any
. Since D is arbitrary and
, we have accomplished the goal of Step 1.
Step 2: Next, we prove that formula (A17) holds for D with cardinality
.
Region can be divided into
Similar to the proof of Step 1, we can prove that for
,
Therefore, Step 2 is proved.
Step 3: To complete the proof by mathematical induction, we show that the statement is true for L = 1.
Denote that ,
. Then
which is finite iff
, under which,
which accomplishes the proof of Fact A.2.
Appendix 3
Gibbs sampling from the SIW distributions
As for the SIW distribution (46), we first consider the change of variables from
to
and the orthogonal matrix
of corresponding eigenvectors. The Jacobian is
(A20) According to (A20) and Lemma 4 in Berger et al. (Citation2020a), (46) can be transformed to
(A21) Gibbs sampling of
: We first sample
given
from
where
is the i-th diagonal element of
,
. Therefore, we can sample
independently from
.
Gibbs sampling of : Given
, the marginal density of
has the form:
Let
, where
and
is the diagonal matrix of corresponding eigenvalues with
. Define
. Since the invariant right Haar measure is invariant to the orthonormal transformation, the conditional density of
is
(A22) The updating of
from (A22) can be implemented by applying a Gibbs update to two randomly selected columns (cf. Hoff, Citation2009b) or rows (cf. Berger et al., Citation2020a). The two ways are essentially equivalent when
, but Berger et al. (Citation2020a)'s method is considerably faster if
. Without loss of generality, assume that the two randomly selected rows are the first and second rows. The updated value of
can be written as
, where
denotes the first two rows of the old value of
which is
,
is the remaining k−2 rows of
and
with
and
for i = 1, 2. Let
. The full conditional density of ϕ has the form:
Write
where
and
. Then the conditional density of ϕ can be rewritten as
where
. Define
. Then the full conditional density of α has the form:
Simulating
can proceed with a rejection sampler by setting the proposal distribution as
.
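Since the proposal distribution referred to above is not reproduced in this text, the following sketch shows a generic log-scale accept-reject sampler of the kind described, applied to a stand-in circular density proportional to exp(κ cos φ) with a uniform proposal on [0, 2π); the target, proposal and bound are illustrative assumptions, not the paper's actual choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def rejection_sample(log_target, sample_proposal, log_proposal, log_M, size):
    """Accept x ~ proposal with probability target(x) / (M * proposal(x)),
    where log_M upper-bounds log_target(x) - log_proposal(x)."""
    draws = []
    while len(draws) < size:
        x = sample_proposal()
        if np.log(rng.uniform()) < log_target(x) - log_proposal(x) - log_M:
            draws.append(x)
    return np.array(draws)

# Stand-in target on [0, 2*pi): unnormalized density exp(kappa * cos(phi)).
# With a uniform proposal, M = exp(kappa) * 2*pi bounds the density ratio,
# so the log acceptance ratio simplifies to kappa * (cos(phi) - 1) <= 0.
kappa = 1.0
log_target = lambda x: kappa * np.cos(x)
log_proposal = lambda x: -np.log(2 * np.pi)
sample_proposal = lambda: rng.uniform(0.0, 2 * np.pi)
log_M = kappa + np.log(2 * np.pi)

phi = rejection_sample(log_target, sample_proposal, log_proposal, log_M,
                       size=500)
```

The acceptance rate depends on how tightly M bounds the ratio; a well-matched proposal, as in the text's construction, keeps the expected number of proposals per accepted draw small.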