Research Article

Bayesian Multilevel Structural Equation Modeling: An Investigation into Robust Prior Distributions for the Doubly Latent Categorical Model

ABSTRACT

Bayesian estimation of multilevel structural equation models (MLSEMs) offers advantages in terms of sample size requirements and computational feasibility, but it does require careful specification of the prior distribution, especially for the random effects variance parameters. The traditional “non-informative” conjugate choice of an inverse-Gamma prior with small hyperparameters has been shown time and again to be problematic. In this paper, we investigate alternative, more robust prior distributions for the doubly latent categorical multilevel model. In contrast to multilevel models without latent variables, MLSEMs have multiple random effects variance parameters, both for the multilevel structure and for the latent variable structure. It is therefore even more important to construct reasonable priors for these parameters. We find that, although the robust priors outperform the traditional inverse-Gamma prior, their hyperparameters do require careful consideration.

Introduction

Multilevel structural equation modeling (MLSEM) has become increasingly popular for testing complex theories in samples that are hierarchically structured (Muthén, 1994). While SEM allows researchers to control for measurement error, which arises because only a finite number of items are sampled, multilevel analysis takes into account sampling error, which arises because only a finite number of individuals are sampled. Combined, MLSEM offers a powerful tool that is used throughout educational, psychological, and sociological research.

Traditionally, maximum likelihood (ML) algorithms have been used to estimate MLSEMs. However, in the case of categorical indicators, such algorithms require multidimensional numerical integration, which quickly becomes computationally infeasible (Asparouhov & Muthén, 2007). Weighted least squares (WLS) algorithms avoid the high-dimensional integration, but are restricted to random intercept models (Asparouhov & Muthén, 2012). Additionally, both ML and WLS rely on the frequentist asymptotic framework and thus require a substantial number of groups to obtain admissible and accurate parameter estimates (Hox & Maas, 2001). For example, Meuleman and Billiet (2009) concluded that for relatively simple MLSEMs, at least 60 groups are needed to detect large structural effects at the between level.

Due to these problems with frequentist estimation methods, Bayesian estimation of MLSEMs has become increasingly popular. In a Bayesian analysis, a prior distribution is specified for each of the parameters in the model. Combined with the likelihood of the data, this results in a posterior distribution, which is generally obtained through Markov chain Monte Carlo (MCMC) sampling. The Bayesian approach offers several advantages compared to the frequentist framework. First, MCMC sampling does not require multidimensional numerical integration, enabling complex models with categorical indicators to be estimated (see, e.g., Asparouhov & Muthén, 2012). Second, estimates of transformed parameters, such as indirect structural effects, can easily be obtained by transforming the MCMC samples, and credible intervals for them are obtained automatically (Yuan & MacKinnon, 2009). Credible intervals are the Bayesian alternative to confidence intervals but, unlike confidence intervals, do not rely on normality assumptions or asymptotic theory. Third, through the prior distribution, the problem of variances that are estimated to be negative (i.e., Heywood cases) can be solved; as long as the prior on the variance has zero mass at negative values, the estimate can never become negative. Moreover, prior information about parameters in the model can be included in the prior distribution. This information might be based on previous investigations or expert knowledge. Finally, it has been shown that Bayesian MLSEM can provide accurate estimates with a smaller number of groups than frequentist estimation (Hox et al., 2012; Zitzmann et al., 2016).

A main focus of the Bayesian literature on multilevel models is the specification of the prior for the random effects variance parameter. The random effects variance parameter determines how much pooling across observations takes place. A random effects variance equal to zero implies complete pooling of observations, that is, there are no random effects and we fit the same regression line for all higher-level units. As the random effects variance increases, the amount of pooling decreases until there is no pooling at all, which means that each higher-level unit has its own regression line, a solution that will generally overfit the data. The prior for the random effects variance parameter determines where the model lies on this continuum from complete pooling to no pooling. Historically, inverse-Gamma priors were popular choices for variance parameters, including those at the between level, since they are conjugate and thus result in a posterior that is an inverse-Gamma distribution as well. Specifically, an inverse-Gamma(ε, ε) prior with small ε has often been used as a default uninformative prior distribution. However, it is now well known in the literature that for random effects variances this specification is actually very informative and highly sensitive to the exact choice of the hyperparameters (see, e.g., Gelman, 2006; Klein & Kneib, 2016; Lunn et al., 2009). Moreover, as we will show later, the inverse-Gamma prior does not have support for random effects variances equal to zero and will therefore always favor more complex solutions with less pooling. A second often-used class of default priors for variance parameters is the class of uniform improper priors. This is in line with the “objective” Bayesian approach (Berger, 2006) since these priors will, in simple models, result in the same estimates as classical ML estimation. Various improper uniform prior options exist. The most straightforward approach is to specify a uniform prior directly on the variance, that is, $p(\sigma^2) \propto 1$. A second option, advocated by Berger (2006), Berger and Strawderman (1996), and Gelman (2006), is to specify a uniform prior on the standard deviation, which implies $p(\sigma^2) \propto \sigma^{-1}$. Finally, a uniform prior can be specified on the logarithm of the variance, which implies $p(\sigma^2) \propto \sigma^{-2}$. The main issue with specifying improper priors for random effects variances is that the resulting posterior might be improper as well in cases where there is a limited amount of information in the data on the higher level (Berger, 2006; Gelman, 2006). This has been shown empirically in an SEM where improper priors resulted in lower convergence rates (Van Erp et al., 2018). Note that the inverse-Gamma(ε, ε) prior approximates the improper prior $p(\sigma^2) \propto \sigma^{-2}$ as ε goes to zero. As a result, the inverse-Gamma prior can also lead to an unstable MCMC sampler.

Recently, several weakly informative prior distributions have been proposed as alternative priors for random effects variances. Compared to the uniform improper priors, weakly informative priors perform some shrinkage of the random effects variance parameter toward zero, that is, toward the complete pooling situation, to avoid overfitting. For example, Gelman (2006) and Polson and Scott (2012) propose the half-Cauchy prior, which is proper and thus avoids the issue of uniform improper priors. Additionally, the half-Cauchy prior has heavy tails and is therefore considered robust to misspecification of the scale. Simpson et al. (2017) discuss a general approach to construct so-called “scale-dependent” priors, which can be straightforwardly applied to random effects variances. Although some studies compare several of these priors in general multilevel models (see, e.g., Klein & Kneib, 2016), they have yet to be investigated in the context of MLSEM. Studies investigating prior choice in MLSEM generally focus on conjugate prior distributions that vary in their degree of informativeness (Depaoli & Clifton, 2015; Helm, 2018; Zitzmann et al., 2016). These studies show that informative priors, when correctly specified, perform best. However, in reality, we do not know the true population value and therefore require prior distributions that are robust against possible prior misspecification. Moreover, MLSEMs have multiple random effects variance parameters, both for the multilevel structure and for the latent variable structure. It is therefore even more important to construct reasonable and robust priors for these parameters.

The goal of this study is to compare robust prior distributions for random effects variances in MLSEM. We compare default specifications of the robust priors as well as accurate and inaccurate informative specifications. The outline of this paper is as follows: in Section 2 we describe the categorical MLSEM used throughout this paper along with the motivating application from education. The robust prior distributions are presented and compared in Section 3, and Section 4 presents two simulation studies investigating the performance of the priors, followed by a discussion in Section 5.

Bayesian doubly latent categorical multilevel model

Model

The MLSEM used in this investigation is based on the doubly latent multilevel model (DLMM) proposed by Marsh et al. (2009) to estimate contextual effects. A well-known application of the DLMM is to estimate the “Big-Fish-Little-Pond-Effect” (BFLPE), which predicts that at the individual (or within) level, student achievement has a positive effect on academic self-concept, whereas at the between level, school-average achievement has a negative effect on academic self-concept. The term “doubly latent” arises from the fact that the model takes into account both measurement and sampling error. Measurement error, which is a consequence of the fact that only a finite number of items are sampled, is controlled for by specifying a measurement model for each of the factors. Sampling error, on the other hand, is the result of sampling only a finite number of individuals and is taken into account by including random effects in the model. The model used is depicted in Figure 1. We use a similar representation to that in Marsh et al. (2009). Manifest variables are denoted by squares, while latent variables are denoted by circles. Black dots indicate a random intercept at the between level for that variable.

Figure 1. Graphic representation of the doubly latent multilevel model (DLMM) used throughout this paper. SA stands for student achievement and SC stands for academic self-concept in the context of the “Big-Fish-Little-Pond-Effect” (BFLPE)

We assume that academic self-concept is measured by six binary indicators $y_{ikj}$ for individual i in group j on item k.Footnote1 We assume an underlying continuous latent response variable $\tilde{y}_{ikj}$ that is linked to the binary observed response $y_{ikj}$ as follows:

$$y_{ikj} = \begin{cases} 0 & \text{if } -\infty < \tilde{y}_{ikj} \leq 0 \\ 1 & \text{if } 0 < \tilde{y}_{ikj} < \infty \end{cases} \qquad (1)$$

The measurement model can then be defined for the continuous latent responses $\tilde{y}_{ikj}$ at the within level:

$$\tilde{y}_{ikj} = \mu_{kj} + \lambda_k^W \eta_{ij}^W + \varepsilon_{ikj}^W, \qquad (2)$$

where $\mu_{kj}$ denotes the intercept for item k in group j, $\lambda_k^W$ is the loading for item k at the within level, $\eta_{ij}^W$ is the factor score for individual i in group j, and $\varepsilon_{ikj}^W$ is the residual at the within level for individual i in group j on item k. Throughout this paper, we assume a logistic distribution for the residuals at the within level, that is, $\varepsilon_{ikj}^W \sim \text{logistic}(0, \sigma^2_{W,k})$, and we fix this variance to that of the standard logistic distribution, that is, $\sigma^2_{W,k} = \pi^2/3$, for identification.Footnote2
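Combining the threshold rule in Equation (1) with Equation (2), the standard logistic residual implies a logit model for the probability of endorsing item k, since $P(y_{ikj} = 1) = P\{\varepsilon_{ikj}^W > -(\mu_{kj} + \lambda_k^W \eta_{ij}^W)\}$:

$$P(y_{ikj} = 1 \mid \eta_{ij}^W) = \frac{\exp(\mu_{kj} + \lambda_k^W \eta_{ij}^W)}{1 + \exp(\mu_{kj} + \lambda_k^W \eta_{ij}^W)}$$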

We focus on the random intercept model such that the measurement model at the between level is defined as:

$$\mu_{kj} = \mu_k + \lambda_k^B \eta_j^B + \varepsilon_{kj}^B, \qquad (3)$$

where $\mu_k$ reflects the overall intercept for item k and is fixed to 0 for identification, $\lambda_k^B$ is the loading for item k at the between level, $\eta_j^B$ is the factor score at the between level for group j, and $\varepsilon_{kj}^B$ denotes the residual at the between level for item k in group j, which we assume to be normally distributed, that is, $\varepsilon_{kj}^B \sim N(0, \sigma^2_{B,k})$.

The factor scores on self-concept are predicted by the academic achievement scores $x_{ij}$, leading to the following structural model:

$$\eta_{ij}^W \sim N(\beta_W x_{ij},\, \omega^2_W) \qquad (4)$$
$$\eta_j^B \sim N(\alpha + \beta_B x_{Bj},\, \omega^2_B) \qquad (5)$$

Here, $\beta_W$ represents the effect of achievement on self-concept at the within level, whereas $\beta_B$ represents the effect of achievement on self-concept at the between level. To identify the latent variables, we restrict the mean, that is, $\alpha = 0$ in Equation (5), and one loading at each level,Footnote3 that is, $\lambda_1^B = \lambda_1^W = 1$. As recommended by Asparouhov and Muthén (2019), we use latent mean centering, which estimates the group-mean achievement $x_{Bj}$ in each group j in order to take measurement error into account. The parameter of interest is the contextual effect, which equals $\beta_B - \beta_W$.

An important quantity in multilevel models is the variance partition coefficient (VPC; Goldstein et al., 2002), which is a measure of the correlation between students in the same school. In multilevel SEM, the VPC $\rho_k$ for item k is defined as (see Equation (8) in Muthén, 1991):

$$\rho_k = \frac{\lambda^2_{B,k}\,\omega^2_B + \sigma^2_{B,k}}{\left(\lambda^2_{B,k}\,\omega^2_B + \sigma^2_{B,k}\right) + \left(\lambda^2_{W,k}\,\omega^2_W + \sigma^2_{W,k}\right)} \qquad (6)$$

Prior distributions

In a Bayesian analysis, prior distributions need to be specified for each parameter in the model, that is, $p(\lambda_k, \sigma^2_{B,k}, \beta_W, \omega^2_W, \beta_B, \omega^2_B)$. The focus of this paper is on priors for the random effects variances $\sigma^2_{B,k}$, $\omega^2_W$, and $\omega^2_B$, and the next section discusses the priors we consider for these parameters in detail. For the other parameters in the model, we use the following independent, uninformative or weakly informative prior distributions throughout this paper: $p(\lambda_k) = \text{Normal}(0, 5)$ for $k = 1, \ldots, 6$, $p(\beta_W) = \text{Normal}(0, 1)$, and $p(\beta_B) = \text{Normal}(0, 1)$. Note that other choices are possible and it is always important to investigate the sensitivity of the results to the specification of the prior distribution (Van Erp et al., 2018).

Robust prior distributions for random effects variances

The problems with the traditional inverse-Gamma family and the improper uniform priors for random effects variances have inspired many researchers to investigate alternative, more robust prior distributions. Here, we focus on three such alternative classes of priors: 1) Student’s t priors; 2) F priors; and 3) scale-dependent priors. Note that some priors are specified on the variances while others are specified on the standard deviations. Throughout this section, we will use $\theta$ to denote standard deviations and $\theta^2$ to denote variance parameters.

Student’s t family

A widespread proposal for the prior on random effects is to specify a proper, weakly informative half-t distribution on the standard deviation (see, e.g., Gelman, 2006; Polson & Scott, 2012). The probability density function of the half-t distribution is given by:

$$p(\theta; \nu, \sigma^2_0) = \frac{2\,\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)\sqrt{\nu \pi \sigma^2_0}} \left(1 + \frac{\theta^2}{\nu \sigma^2_0}\right)^{-\frac{\nu+1}{2}}, \qquad (7)$$

where $\sigma^2_0$ denotes the scale and $\nu$ the degrees of freedom. Different choices for the scale parameter $\sigma^2_0$ are discussed in Subsection “Specification of the hyperparameters”. For the degrees of freedom $\nu$, smaller values result in heavier tails. For $\nu = \infty$, we obtain the half-normal distribution, which has been proposed, among others, by Frühwirth-Schnatter and Wagner (2010). Roos and Held (2011) have shown that the half-normal prior indeed results in estimates of the random effects variances that are less sensitive to the chosen hyperparameters compared to the inverse-Gamma prior. Nevertheless, the half-normal distribution still has very light tails and will therefore be less robust to variance parameters that are larger than expected. A more robust proposal in the literature is therefore to specify smaller degrees of freedom. Gelman (2006) notes the special case of the half-Cauchy prior with $\nu = 1$ as a reasonable default option. The half-Cauchy prior has been further investigated by Polson and Scott (2012), who have shown that it has excellent frequentist risk properties. There are some reports, however, that the heavy tails of the half-Cauchy prior can lead to numerical difficulties in MCMC sampling (Ghosh et al., 2018; Piironen & Vehtari, 2015). This issue generally arises when there is not enough information in the data, such that the parameter is weakly or non-identified; the heavy tails of the (half-)Cauchy prior do not perform enough shrinkage in such cases to identify the parameter.
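As a quick reference, a minimal R sketch of this density (our own code, not the authors’) illustrates the special cases:

```r
# Minimal sketch (not the authors' code) of the half-t density in Equation (7).
# nu = 1 gives the half-Cauchy; large nu approaches the half-Normal.
dhalf_t <- function(theta, nu, sigma0_sq) {
  2 * exp(lgamma((nu + 1) / 2) - lgamma(nu / 2)) / sqrt(nu * pi * sigma0_sq) *
    (1 + theta^2 / (nu * sigma0_sq))^(-(nu + 1) / 2)
}
theta <- seq(0, 5, length.out = 200)
all.equal(dhalf_t(theta, nu = 1, sigma0_sq = 1), 2 * dcauchy(theta))  # TRUE: half-Cauchy
```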

F family

A popular family of distributions for variance parameters in the Bayesian literature is the family of F priors, also known as scaled Beta2 or generalized beta prime priors (Fúquene et al., 2014; Mulder & Pericchi, 2018; Pérez et al., 2017; Polson & Scott, 2012). The probability density function of the F prior is given by (Mulder & Pericchi, 2018):Footnote4

$$p(\theta^2; \nu_1, \nu_2, \sigma^2_0) = \frac{\Gamma\!\left(\frac{\nu_1 + \nu_2}{2}\right)}{\Gamma\!\left(\frac{\nu_1}{2}\right)\Gamma\!\left(\frac{\nu_2}{2}\right)} \left(\sigma^2_0\right)^{-\frac{\nu_1}{2}} \left(\theta^2\right)^{\frac{\nu_1}{2}-1} \left(1 + \theta^2/\sigma^2_0\right)^{-\frac{\nu_1+\nu_2}{2}}, \qquad (8)$$

where $\sigma^2_0 > 0$ denotes a scale parameter, the degrees of freedom $\nu_1 > 0$ determine the behavior around zero, and the degrees of freedom $\nu_2 > 0$ influence the tail behavior in a similar manner as the degrees of freedom of Student’s t distribution. Specifically, $\nu_2 = 2$ results in tails similar to those of a Cauchy distribution. With regard to the behavior at zero, $\nu_1/2 > 1$ results in a density that is zero at the origin, for $\nu_1/2 = 1$ the density is bounded at the origin, and for $\nu_1/2 < 1$ the density goes to infinity at the origin.

Interestingly, if $p(\theta^2) = F(\nu_1 = 1, \nu_2, \sigma^2_0)$, then the prior on the standard deviation, $p(\theta)$, corresponds to a half-Student’s t distribution with degrees of freedom equal to $\nu_2$ and scale $\sigma^2_0$. Thus, the F family can be seen as a generalization of the Student’s t family. Moreover, if $p(\theta^2) = F(\nu_1, \nu_2, \sigma^2_0)$, then the prior on the precision $h = 1/\theta^2$ will also be an F distribution, with a scale of $1/\sigma^2_0$ (i.e., the reciprocality property).

To implement the F prior, the following Gamma mixture of inverse-Gamma parametrization can be used:

$$p(\theta^2) = IG\!\left(\tfrac{\nu_2}{2}, \tau^2\right) \qquad (9)$$
$$p(\tau^2) = G\!\left(\tfrac{\nu_1}{2}, \sigma^2_0\right), \qquad (10)$$

which results in the F prior when $\tau^2$ is integrated out.
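The following R sketch (our own code, not the authors’) checks this mixture representation against the density in Equation (8). It assumes that the Gamma in Equation (10) is parametrized by its scale and the inverse-Gamma in Equation (9) by its scale, and uses the default hyperparameters adopted later in the paper ($\nu_1 = 1$, $\nu_2 = 0.5$, $\sigma^2_0 = 1$):

```r
# Sketch: draw from the Gamma mixture of inverse-Gamma in Equations (9)-(10) and
# compare with the analytic F density in Equation (8). Parametrization assumptions:
# Gamma with scale sigma0_sq, inverse-Gamma with scale tau_sq.
set.seed(123)
nu1 <- 1; nu2 <- 0.5; sigma0_sq <- 1
n <- 1e5

tau_sq   <- rgamma(n, shape = nu1 / 2, scale = sigma0_sq)   # Equation (10)
theta_sq <- 1 / rgamma(n, shape = nu2 / 2, rate = tau_sq)   # Equation (9): inverse-Gamma draws

# Analytic F density of Equation (8)
df_F <- function(x, nu1, nu2, s2) {
  exp(lgamma((nu1 + nu2) / 2) - lgamma(nu1 / 2) - lgamma(nu2 / 2)) *
    s2^(-nu1 / 2) * x^(nu1 / 2 - 1) * (1 + x / s2)^(-(nu1 + nu2) / 2)
}

# Both representations should give (approximately) the same probabilities:
mean(theta_sq <= 1)                                                 # Monte Carlo estimate
integrate(df_F, 0, 1, nu1 = nu1, nu2 = nu2, s2 = sigma0_sq)$value   # analytic
2 * pt(1, df = nu2) - 1   # same value via the half-t correspondence (since nu1 = 1)
```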

Scale dependent family

Simpson et al. (2017) have proposed the use of penalized complexity priors as general default priors. The basic idea of these priors is that deviations from a simpler base model are penalized, thereby avoiding a model that overfits. This characteristic is especially useful in the context of random effects variance parameters, which are generally weakly identified due to the small amount of information available at the higher level. Simpson et al. (2017) derive the penalized complexity prior for the precision of a random effect as follows: first, define the base model. For example, for the random effects component $\varepsilon_{kj}^B \sim N(0, \sigma^2_{B,k})$ in Equation (3), the base model, which is the simplest model in its class, would correspond to $\sigma^2_{B,k} = 0$, that is, the absence of random effects. Next, the distance of the flexible extension of the base model, that is, $\varepsilon_{kj}^B \sim N(0, \sigma^2_{B,k})$, to the base model is computed based on the Kullback-Leibler divergence (KLD). Simpson et al. (2017) define the distance between the two models with densities f and g as $d(f\,\|\,g) = \sqrt{2\,\text{KLD}(f\,\|\,g)}$. Then, a prior is specified for the distance d such that deviations from the base model are penalized. By assuming a constant rate of penalization, Simpson et al. (2017) specify an exponential prior for the distance, $p(d) = \lambda \exp(-\lambda d)$, although they note that it is possible to relax this assumption if needed. For example, when interest lies in variable selection, the exponential tails are too light and a heavier-tailed prior can be more sensible; for example, a half-Cauchy prior on the distance recovers the well-known horseshoe prior. Transforming the prior on the distance to the original space leads to a type-2 Gumbel prior for the precision or, equivalently, an exponential distribution with scale $\sigma^2_0$ for the standard deviation:Footnote5

$$p(\theta; \sigma^2_0) = \frac{1}{\sigma^2_0} \exp\!\left(-\frac{\theta}{\sigma^2_0}\right) \qquad (11)$$

Lower values of $\sigma^2_0$ result in stronger penalization for deviating from the base model. The choice of $\sigma^2_0$ is discussed in Subsection 3.4.

Specification of the hyperparameters

An overview of the different priors considered is provided in Table 1. It is clear from Table 1 that the priors vary in flexibility. Each prior has a parameter that determines the scale, that is, how spread out the prior is. In addition, the half-Student’s t and F priors have a parameter that influences the tail behavior of these priors. Specifically, this degrees of freedom hyperparameter determines how heavy the tails are and thus how much prior mass is put on extreme values. In general, a prior with heavier tails is more robust since extreme values will not be shrunken toward zero as much as under a prior with thinner tails. For the half-Student’s t prior, we consider the special cases of the half-normal prior ($\nu = \infty$) and the half-Cauchy prior ($\nu = 1$), but other values for $\nu$ can be used, with higher degrees of freedom leading to thinner tails. For the F prior, we consider $\nu_2 = 0.5$ throughout this paper to obtain tails that are heavier than those of the Cauchy; the larger $\nu_2$, the thinner the tails. Additionally, the F prior has a parameter that determines the behavior at the origin. This degrees of freedom hyperparameter, $\nu_1$, is set to 1 so that the density goes to infinity at zero. Although other values for $\nu_1$ are possible, it is important to ensure that $\nu_1/2 \leq 1$ so that the prior has support at zero.

Table 1. Overview of robust prior distributions for random effects variances

Throughout this paper, we consider and compare various settings for the scale $\sigma^2_0$ of each prior. A general informal recommendation for a default prior is a half-Student’s t prior with scale equal to 1 (see, e.g., Stan Development Team, 2015, the section on priors for scale parameters in hierarchical models). Therefore, we use $\sigma^2_0 = 1$ throughout this paper as the default prior specification. Additionally, if prior information is available, it can be included in the priors as follows. Specify a value C for the standard deviation and a threshold $\alpha$ such that the prior probability $P(\theta > C) = \alpha$. This equation can then be solved for the prior scale using the cumulative distribution function. Section 4.2 describes which values for C and $\alpha$ we consider in the simulation study and provides some examples of the resulting prior scales.
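As an illustration of this construction, the sketch below (not the authors’ code; the function and object names are ours) solves $P(\theta > C) = \alpha$ in closed form for the priors in Table 1. The returned values are scales on the standard deviation ($\theta$) scale; mapping them onto the $\sigma^2_0$ notation of Equations (7), (8), and (11) is prior-specific.

```r
# Illustrative sketch: solve P(theta > C) = alpha for the prior scale on the theta scale.
# For the F prior the returned value is the scale of the implied half-t on theta
# (nu1 = 1, nu2 = 0.5); the corresponding scale on theta^2 is its square.
prior_scale <- function(C, alpha) {
  c(
    half_normal = C / qnorm(1 - alpha / 2),        # theta = scale * |Z|
    half_cauchy = C / qcauchy(1 - alpha / 2),      # theta = scale * |standard Cauchy|
    exponential = -C / log(alpha),                 # P(theta > C) = exp(-C / scale)
    F_nu2_0.5   = C / qt(1 - alpha / 2, df = 0.5)  # F(1, 0.5) on theta^2 = half-t(0.5) on theta
  )
}
prior_scale(C = 1, alpha = 0.50)  # analogous to the "correct" informative setting of Study 2
prior_scale(C = 1, alpha = 0.05)  # analogous to the "incorrect" setting: most mass below C
```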

Comparison of the priors

In this section, we compare the probability density functions of the various priors to better understand their behavior. Figure 2 shows the densities on the standard deviation scale. We use a scale of 1 for the robust priors and hyperparameters equal to 0.1 for the inverse-Gamma prior. The main difference between the inverse-Gamma prior and the more robust alternatives is that the inverse-Gamma prior has zero density at zero. As a result, this prior favors standard deviations that are away from 0, which is problematic, especially for standard deviations of random effects, which are often very close to 0. The inverse-Gamma prior thus automatically favors more complex models with less pooling. The other priors do place prior mass around standard deviations equal to zero and are therefore more in line with what we might expect in reality and less likely to overfit.

Figure 2. Densities of the priors on the standard deviations

There are multiple standard deviations in the model for which a prior needs to be specified, specifically $\sigma_{B,k}$, $\omega_W$, and $\omega_B$. Together with the factor loadings at the between and within level, $\lambda_k^B$ and $\lambda_k^W$, these standard deviations play a role in computing the VPC $\rho_k$ for each item (see Equation (6)). As a result, specifying a prior on the standard deviations and factor loadings implies a certain prior on the VPC. This implied prior is shown in Figure 3. Recall from Section 2.1 that we fixed the loading of one item to 1 for identification purposes. The left panel shows the VPC of the restricted item, while the right panel shows the VPC of the unrestricted items. It is clear that this restriction has consequences for the implied prior of the VPC, because the VPC depends on the loadings (see Equation (6)). Specifically, we see that some priors put less probability on a VPC close to 1 for the restricted item compared to the free items. This is most pronounced for the priors with the thinnest tails, that is, the half-normal and exponential priors. Furthermore, the implied priors on the VPC are not symmetrical. This is due to the fact that we assume a standard logistic distribution for the residuals at the within level; as a result, $\sigma^2_{W,k}$, which appears in the denominator of Equation (6), is fixed as well.

Figure 3. Densities of the implied priors on the variance partition coefficient (VPC) of the restricted and free items
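A minimal sketch (our own code, not the authors’) of how such an implied VPC prior can be obtained by forward simulation from the priors. It assumes half-Normal(0, 1) priors on the standard deviations, Normal(0, 5) priors on the free loadings (the 5 is treated as a standard deviation here), the restricted item’s loadings fixed to 1, and $\sigma^2_{W,k} = \pi^2/3$:

```r
# Forward-simulate the implied prior on the VPC of Equation (6).
set.seed(1)
n <- 1e5
omega_B <- abs(rnorm(n, 0, 1))   # between-level latent variable sd
omega_W <- abs(rnorm(n, 0, 1))   # within-level latent variable sd
sigma_B <- abs(rnorm(n, 0, 1))   # between-level item residual sd

vpc <- function(lam_B, lam_W, omega_B, omega_W, sigma_B, sigma_W2 = pi^2 / 3) {
  between <- lam_B^2 * omega_B^2 + sigma_B^2            # numerator of Equation (6)
  between / (between + lam_W^2 * omega_W^2 + sigma_W2)
}

vpc_restricted <- vpc(1, 1, omega_B, omega_W, sigma_B)  # item with loadings fixed to 1
lam_B <- rnorm(n, 0, 5); lam_W <- rnorm(n, 0, 5)        # free loadings
vpc_free <- vpc(lam_B, lam_W, omega_B, omega_W, sigma_B)

plot(density(vpc_restricted, from = 0, to = 1), main = "Implied prior on the VPC")
lines(density(vpc_free, from = 0, to = 1), lty = 2)
```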

Simulation studies

In order to investigate the performance of the robust priors for the random effects variances, we conduct two simulation studies. In the first study, we vary the population values for various parameters while keeping the number of groups fixed. We focus on comparing the traditional priors for variance parameters to a default setting of the more robust alternatives. In the second study, we use the same generated data sets as in the first study, but now we focus on informative prior settings with either correct or incorrect information to determine how robust the priors are to misspecifications. In both studies, we generate data according to the model in Section 2 using the following population values.

Population values for the variance parameters

The population value for the latent variable variance parameters $\omega^2$ is equal to 0.01 or 1. The sensitivity of the inverse-Gamma priors to the hyperparameter choice arises especially in situations in which the random effects variance is close to zero (Gelman, 2006). Depending on the value specified for each variance parameter, the VPC will vary. Hedges and Hedberg (2007) provide an overview of VPC values for educational achievement based on longitudinal surveys conducted in the United States; they found VPCs ranging from .03 to .3. These values are in line with VPCs used in other simulation studies of multilevel SEMs (Depaoli & Clifton, 2015; Helm, 2018; Zitzmann et al., 2016). Therefore, we consider several combinations of population values for the variance parameters that result in VPCs within this range. Note that the VPC depends on the following parameters: $\text{VPC} = f(\lambda_k^B, \omega^2_B, \sigma^2_{B,k}, \lambda_k^W, \omega^2_W, \sigma^2_{W,k})$ (see Equation (6)). We fix the population values for $\lambda_k^B$ and $\lambda_k^W$ to 1, and $\sigma^2_{W,k}$ to $\pi^2/3$, for all items. We consider different combinations of population values for the variances of the latent variables, $\omega^2_B$ and $\omega^2_W$, and subsequently compute $\sigma^2_{B,k}$ given these values and a VPC of .03 or .3. An overview of the different population values for the variance parameters and the resulting VPC is presented in Table 2.

Table 2. Overview of population values for the variance parameters and the variance partition coefficient
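To make this computation explicit: with the loadings fixed to 1 and $\sigma^2_{W,k} = \pi^2/3$, solving Equation (6) for the between-level residual variance given a target VPC $\rho_k$ yields (provided the right-hand side is nonnegative)

$$\sigma^2_{B,k} = \frac{\rho_k}{1 - \rho_k}\left(\omega^2_W + \frac{\pi^2}{3}\right) - \omega^2_B.$$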

Population values for the contextual effect

When using group-mean centering, the contextual effect is defined as $\beta_B - \beta_W$. Marsh et al. (2009) provide three different standardized effect size measures that can be interpreted similarly to Cohen’s d. They recommend using one of the two more conservative effect sizes. Here, we rely on the second effect size in Marsh et al. (2009) to define the population values for the regression coefficients at the between and within level. This effect size standardizes the contextual effect with respect to the total variance of self-concept at the within level, that is, of $\eta^W$, and is defined as:

$$ES = \frac{2\,\beta_C\,\sigma_{xB}}{\sqrt{\sigma^2_{xW}\,\beta_W^2 + \omega^2_W}}, \qquad (12)$$

where $\beta_C$ denotes the contextual effect, which is equal to $\beta_B - \beta_W$ in the case of group-mean centering, and $\sigma_{xB}$ and $\sigma_{xW}$ denote the standard deviations of the predictor at the between and within level, respectively. We consider a small negative standardized contextual effect and no contextual effect, that is, $ES \in \{-0.2, 0\}$. We choose these values such that we are able to investigate the power to detect a small contextual effect as well as the type 1 error rate. We fix $\beta_W$ to 0.2 and $\sigma_{xB}$ and $\sigma_{xW}$ to 1 and compute the value for $\beta_B$ that yields the required standardized effect size. The resulting values are shown in .
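A small illustrative computation (our own sketch, assuming Equation (12) as reconstructed above) shows how $\beta_B$ follows from a target effect size:

```r
# Sketch: rearrange Equation (12) to obtain beta_B from a target ES, assuming the
# reconstructed formula above; beta_C = ES * sqrt(sigma_xW^2 * beta_W^2 + omega_W^2) / (2 * sigma_xB).
beta_B_from_ES <- function(ES, beta_W = 0.2, sigma_xB = 1, sigma_xW = 1, omega_W2) {
  beta_C <- ES * sqrt(sigma_xW^2 * beta_W^2 + omega_W2) / (2 * sigma_xB)  # contextual effect
  beta_W + beta_C                                                         # beta_B = beta_W + beta_C
}
beta_B_from_ES(ES = -0.2, omega_W2 = 1)     # beta_B when omega_W^2 = 1
beta_B_from_ES(ES = -0.2, omega_W2 = 0.01)  # beta_B when omega_W^2 = 0.01
```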

Combinations of the various population values result in 16 different conditions (2 values of $\omega^2_B$ × 2 values of $\omega^2_W$ × 2 values of the VPC × 2 values of ES). We consider these 16 conditions for a balanced design with 20 groupsFootnote6 and a sample size of 20 within each group. We use 20 groups based on Hox et al. (2012), who concluded that 20 groups are sufficient for multilevel SEM. Moreover, 20 groups is still feasible in terms of data collection. The sample size of 20 within each group is based on educational contexts, in which this can be seen as a realistic size for a school class.

Study 1: Influence of population values

In the first study, we analyze the data sets using the traditional uniform prior and the inverse-Gamma prior with ε = .1, .01, .001 (denoted IG.1, IG.01, and IG.001, respectively), as well as the robust half-Normal (HNdef), half-Cauchy (HCdef), F (Fdef), and Exponential (EXPdef) priors. For the robust priors, we consider a default setting in which the scale equals $\sigma^2_0 = 1$, following a general recommendation often made for a robust default specification (see also Subsection 3.4).

Study 2: Influence prior misspecifications

In the second study, we focus on informative prior distributions. Specifically, we choose the prior scale such that $P(\theta > \sigma_{\text{true}}) = \alpha$, where $\sigma_{\text{true}}$ equals the population value of the standard deviation under which the data were simulated. We consider two settings: in the “correct” informative prior (i.e., HNinf, HCinf, Finf, and EXPinf), $\alpha = 0.5$, such that the resulting prior has sufficient probability around the population value. In the “incorrect” informative prior (i.e., HNinc, HCinc, Finc, and EXPinc), on the other hand, $\alpha = 0.05$, such that the prior has only very little mass around the population value. Since we never know the true value in practice, it is important to consider this situation to assess the robustness of the various priors to possible misspecifications. In total, we consider eight different prior specifications (four types of priors × two settings) for 16 population conditions in Study 2. Figures 4 and 5 show the various informative prior densities used in the simulation study when the population value for the standard deviation equals 0.1 or 1, respectively. Note that the incorrect F prior specification is missing when the population standard deviation equals 0.1, since this setting resulted in a scale $\sigma^2_0$ of 0, whereas the scale should be positive. However, in order to be consistent across priors in specifying the incorrect setting, we did not consider an alternative specification that did result in a positive scale. It might appear that both the half-Cauchy and F priors have almost no prior mass around the true population value, but this is just a visual consequence of these priors being very spread out due to their heavy tails.

Figure 4. Prior densities for the informative specifications when the population value for the standard deviation equals 0.1

Figure 5. Prior densities for the informative specifications when the population value for the standard deviation equals 1

Outcomes

We consider the (relative) bias, the mean squared error, and the 95% coverage rates,Footnote7 which are all computed based on the posterior median estimate. The 95% coverage rates are computed based on the equal-tailed credible intervals defined by the 2.5% and 97.5% quantiles of the posterior distribution. We use 500 replications per condition and computed the Monte Carlo SE for every outcome measure to quantify the uncertainty in the simulation results (Morris et al., 2019). All Monte Carlo SEs were sufficiently small to conclude that 500 replications were enough (the maximum MCSE was 0.03 for the coverage percentages).
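For concreteness, one common way to compute these outcomes from the replication-level estimates is sketched below; the helper functions and the exact definitions (e.g., relative bias as the mean deviation divided by the true value) are our own assumptions, not taken from the paper.

```r
# Illustrative helper functions, given vectors of posterior medians and 95% equal-tailed
# credible interval bounds across replications, and the true population value.
rel_bias <- function(est, truth) mean(est - truth) / truth          # multiply by 100 for %
mse      <- function(est, truth) mean((est - truth)^2)
coverage <- function(lower, upper, truth) mean(lower <= truth & truth <= upper)
# Monte Carlo SE of a coverage proportion p over R replications: sqrt(p * (1 - p) / R)
```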

All analyses were run using the R interface to Stan, RStan (Stan Development Team, 2018). The Hamiltonian Monte Carlo algorithm used in Stan converges much more quickly to the target distribution than, for example, Gibbs samplers (Hoffman & Gelman, 2014). As a result, fewer iterations are generally required to obtain a sufficient number of independent samples from the posterior. For each analysis, we ran two MCMC chains with 3,000 iterations each, half of which were used as burn-in.Footnote8 We used a maximum treedepth of 10 and set adapt_delta to 0.90.
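A sketch of the corresponding RStan call is given below; "dlmm.stan" and dat are placeholders for the model file and data list, not the authors’ actual files.

```r
# Sketch of an RStan call with the sampler settings reported above.
library(rstan)
fit <- stan(
  file    = "dlmm.stan",          # Stan model for the doubly latent multilevel model
  data    = dat,                  # data list for one simulated data set
  chains  = 2,
  iter    = 3000,                 # total iterations per chain
  warmup  = 1500,                 # first half discarded as burn-in
  control = list(adapt_delta = 0.90, max_treedepth = 10)
)
```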

A replication is considered converged if the split $\hat{R}$ (a version of the potential scale reduction factor, PSRF; Gelman & Rubin, 1992) is smaller than or equal to 1.05 and there are no divergent transitions. If either of these criteria is not met, that replication is removed from the results. For each condition, only the converged replications are used to compute the results, and results are presented only for conditions with at least 50% convergence (i.e., at least 250 converged replications). We also considered other, less strict convergence criteria, the results of which are available at https://osf.io/pq8gm/. We only report the results for ES = −0.2 and VPC = 0.03 in cases where these variables did not have a substantial influence, and we refer to the online materials for the full results, the tables on which the figures are based, and all code (https://osf.io/pq8gm/).
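Using standard rstan utilities, this convergence rule can be checked per replication roughly as follows (a sketch, not the authors’ code), for a fitted object fit as above:

```r
# Strict convergence criteria: split R-hat <= 1.05 for all parameters and zero divergences.
rhat_ok      <- all(summary(fit)$summary[, "Rhat"] <= 1.05, na.rm = TRUE)
no_divergent <- rstan::get_num_divergent(fit) == 0
converged    <- rhat_ok && no_divergent   # replication kept only if TRUE
```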

Results Study 1: Influence population values

Convergence

Figure 6 shows the percentages of converged replications for each prior, with dashed lines indicating 50% convergence. For ES = 0, the inverse-Gamma prior sometimes showed substantially different convergence percentages. In general, the inverse-Gamma prior was most problematic, with multiple conditions resulting in less than 50% convergence. With regard to the VPC, the smaller VPC of 0.03 generally led to higher convergence than the larger VPC of 0.3.

Figure 6. Percentages of converged replications per condition according to the strict convergence criteria in Study 1 for ES = −0.2

IG stands for inverse-Gamma prior with the number corresponding to the value for the hyperparameters. UN stands for uniform prior. Fdef, EXPdef, HCdef, and HNdef stand for the F, Exponential, half-Cauchy, and half-Normal priors with the default specification $\sigma^2_0 = 1$, respectively.

Bias

Table 3 shows the relative bias, which can be multiplied by 100 to obtain a percentage, with the absolute bias in brackets, for the variance parameters and the parameter of interest, $\beta_B - \beta_W$. For the standard deviation of the latent variable at the between level, $\omega_B$, all priors resulted in some bias, although the values are comparable across most priors, with the exception of the inverse-Gamma priors. Most priors underestimated $\omega_B$ when the population value equals 1 and when the population value equals 0.1 combined with a VPC of 0.03. The IG(0.1, 0.1) prior, however, largely overestimated $\omega_B$ when the population value equals 0.1, regardless of the value of the VPC. For the standard deviation of the latent variable at the within level, $\omega_W$, the robust priors showed only small biases when the population value equals 0.1, with more substantial underestimation when the population value equals 1, whereas the IG(0.1, 0.1) and IG(0.01, 0.01) priors showed substantial overestimation when the population value equals 0.01. For the standard deviation of the items at the between level, $\sigma_{yB}$, we see a similar picture across all items. Specifically, the bias was generally small except for the IG(0.1, 0.1) prior when $\omega_B = 0.1$ and VPC = 0.03. All priors underestimated the parameter of interest, the contextual effect $\beta_B - \beta_W$, regardless of the population condition. The relative bias was close to −100% when $\omega_W$ equals 1 for all priors, but differed more across the priors when $\omega_W$ equals 0.1.

Table 3. Relative bias (multiply by 100 to obtain percentages) with absolute bias in brackets for selected parameters, based on the strict convergence criteria and all converged replications in Study 1, for conditions in which ES = −0.2

Mean squared error (MSE)

Figure 7 shows the MSE for selected parameters in Study 1. For both $\omega_B$ and $\omega_W$, lower population values for the corresponding parameter resulted in a smaller MSE. Differences between the priors are negligible, except for the IG(.1, .1) prior, which showed a smaller MSE than the other priors when the population value equals 1. For $\sigma_{y1B}$, the MSE was slightly higher when $\omega_B = 1$ and when VPC = 0.3 (available online at https://osf.io/pq8gm/) for all priors, while for the contextual effect, the MSE was close to zero across conditions and priors.

Figure 7. Mean squared error (MSE) per condition according to the strict convergence criteria in Study 1 for ES = −0.2 and VPC = 0.03

IG stands for inverse-Gamma prior with the number corresponding to the value for the hyperparameters. UN stands for uniform prior. Fdef, EXPdef, HCdef, and HNdef stand for the F, Exponential, half-Cauchy, and half-Normal priors with the default specification $\sigma^2_0 = 1$, respectively.

Coverage

Figure 8 shows the 95% coverage rates for the selected parameters. For $\omega_B$ and $\omega_W$, coverage was much too low when the corresponding population values equal 1 and around 100% when the population values equal 0.1, for all priors except the IG(.1, .1) prior. Note that, for $\omega_B$, the coverage was higher when $\omega_B = 1$ and VPC = 0.3 (available online at https://osf.io/pq8gm/), although it was still not close to the nominal 95% in this situation. For $\sigma_{y1B}$, coverage was above the nominal 95% for all priors.

Figure 8. 95% coverage per condition according to the strict convergence criteria in Study 1 for ES = −0.2 and VPC = 0.03 and selected parameters. Dashed lines indicate the nominal 95%

IG stands for inverse-Gamma prior with the number corresponding to the value for the hyperparameters. UN stands for uniform prior. Fdef, EXPdef, HCdef, and HNdef stand for the F, Exponential, half-Cauchy, and half-Normal priors with the default specification $\sigma^2_0 = 1$, respectively.

Figure 9 shows the 95% coverage rates for the contextual effect $\beta_B - \beta_W$. For this parameter, coverage rates depended on the effect size and the VPC. Specifically, coverage rates were too low when $\omega_W = 1$ (triangles), but only when ES = −0.2 and not for the IG(.1, .1) prior.

Figure 9. 95% coverage per condition according to the strict convergence criteria in Study 1 for the contextual effect $\beta_B - \beta_W$. Dashed lines indicate the nominal 95%

IG stands for inverse-Gamma prior with the number corresponding to the value for the hyperparameters. UN stands for uniform prior. Fdef, EXPdef, HCdef, and HNdef stand for the F, Exponential, half-Cauchy, and half-Normal priors with the default specification $\sigma^2_0 = 1$, respectively.

Results Study 2: Influence prior misspecifications

We now turn to the results of the second simulation study in which we investigated the same population conditions as in Study 1, but with informative priors. The informative priors were specified such that they are either in line with the population values or not.

Convergence

Figure 10 shows the percentages of converged replications for each prior in the second study. Again, only the results for ES = −0.2 are shown, since the percentages for ES = 0 were generally very similar, except for the correctly specified exponential prior (EXPinf). It is clear that, compared to the default priors in Study 1, the informative priors have more convergence problems. This is especially the case for the incorrectly specified F and half-Cauchy priors. For the incorrectly specified F prior, the convergence of 0% when at least one of the population standard deviations equals 0.1 is due to the fact that in these situations the prior scale equals zero, whereas it should be positive (see also Subsection 3.7). Only the informative half-Normal priors and the incorrectly specified Exponential prior obtained more than 50% convergence in all conditions.

Figure 10. Percentages of converged replications per condition according to the strict convergence criteria in Study 2 for ES = −0.2

Finc, EXPinc, HCinc, and HNinc stand for the F, Exponential, half-Cauchy, and half-Normal priors with an informative prior specification not in line with the population values, respectively. Finf, EXPinf, HCinf, and HNinf stand for the F, Exponential, half-Cauchy, and half-Normal priors with an informative prior specification in line with the population values, respectively.

Bias

Table 4 shows the relative bias with the absolute bias in brackets for the variance parameters and the parameter of interest, $\beta_B - \beta_W$. As many conditions are missing due to non-convergence, it is difficult to compare the correctly and incorrectly specified informative priors. However, for the half-normal prior, the bias for $\omega_B$ was lower for the correctly specified prior when $\omega_B = 0.1$. Especially when VPC = 0.3, the difference between the correct and incorrect specifications was substantial. When $\omega_B = 1$, the two specifications performed similarly. A similar picture arises for $\omega_W$, with the bias being lower for the correctly specified priors when the population value is small, and similar performance when the population value equals 1. For $\sigma_{y1B}$, the biases were lower for the correctly specified half-normal prior across all population conditions, while the contextual effect $\beta_B - \beta_W$ was underestimated across the board, regardless of the prior specification. Compared to the default half-normal prior in Study 1, the informative specification resulted in lower bias for $\omega_B$ mainly when $\omega_B = 0.1$ and VPC = 0.3, while in the other conditions the bias was comparable or slightly higher. For $\omega_W$, on the other hand, the default half-normal prior led to lower bias than the correct informative specification when $\omega_W = 0.1$ and to comparable results otherwise.

Table 4. Relative bias (multiply by 100 to obtain percentages) with absolute bias in brackets for selected parameters, based on the strict convergence criteria and all converged replications in Study 2, for conditions in which ES = −0.2

Mean squared error (MSE)

Figure 11 shows the MSE for selected parameters in Study 2. The results were very similar to those in Study 1, with negligible differences between the priors and differences mainly due to the population values for the latent variable variance parameters, with higher population values (i.e., $\omega_B = \omega_W = 1$) resulting in larger MSEs for the corresponding parameter.

Figure 11. Mean squared error (MSE) per condition according to the strict convergence criteria in Study 2 for ES = −0.2 and VPC = 0.03

Finc, EXPinc, HCinc, and HNinc stand for the F, Exponential, half-Cauchy, and half-Normal priors with an informative prior specification not in line with the population values, respectively. Finf, EXPinf, HCinf, and HNinf stand for the F, Exponential, half-Cauchy, and half-Normal priors with an informative prior specification in line with the population values, respectively.

Coverage

Figure 12 shows the 95% coverage rates for selected parameters in Study 2. Again, the coverage for $\omega_B$ and $\omega_W$ was much too low when the corresponding population values equal 1, but, contrary to the coverage for the default priors (Study 1), the coverage remained too low for most priors when the population values equal 0.1, with the exception of the correctly specified exponential and half-normal priors. For $\sigma_{y1B}$, coverage was close to the nominal 95% for all informative priors when $\omega_B = 1$, but generally lower when $\omega_B = 0.1$.

Figure 12. 95% coverage per condition according to the strict convergence criteria in Study 2 for ES = −0.2 and VPC = 0.03 and selected parameters. Dashed lines indicate the nominal 95%

Finc, EXPinc, HCinc, and HNinc stand for the F, Exponential, half-Cauchy, and half-Normal priors with an informative prior specification not in line with the population values, respectively. Finf, EXPinf, HCinf, and HNinf stand for the F, Exponential, half-Cauchy, and half-Normal priors with an informative prior specification in line with the population values, respectively.

Figure 13 shows the 95% coverage rates for the contextual effect $\beta_B - \beta_W$. As in Study 1, coverage rates were close to the nominal 95% when ES = 0, but they were too low when ES = −0.2 and $\omega_W = 1$, especially when VPC = 0.03.

Figure 13. 95% coverage per condition according to the strict convergence criteria in Study 2 for the contextual effect $\beta_B - \beta_W$. Dashed lines indicate the nominal 95%

Finc, EXPinc, HCinc, and HNinc stand for the F, Exponential, half-Cauchy, and half-Normal priors with an informative prior specification not in line with the population values, respectively. Finf, EXPinf, HCinf, and HNinf stand for the F, Exponential, half-Cauchy, and half-Normal priors with an informative prior specification in line with the population values, respectively.

Discussion

The popularity of Bayesian MLSEM has inspired various investigations into the required number of groups (see, e.g., Depaoli & Clifton, 2015; Helm, 2018; Hox et al., 2012). However, these studies mainly rely on traditional default prior specifications, which have been shown to be unreliable, or on correctly specified informative priors that are not feasible in practice. The goal of this study was to investigate more robust prior distributions for the random effects variances in the context of MLSEM. To do so, we have investigated default and (in)accurate informative prior specifications of popular robust prior distributions in a doubly latent categorical multilevel model.

The differences between the robust prior distributions and the uniform prior were generally small. The main differences in results in the simulation studies appear to arise due to the population conditions rather than the prior distribution. Only the inverse-Gamma specifications showed substantially different, and often worse, performance. The lack of differentiation in results between the robust priors and the uniform prior might be due to the simulated data sets being too large and thereby overruling the influence of the prior on the posterior distribution. It might be expected that in more extreme conditions, either in terms of a smaller number of groups or a smaller number of observations within each group, more substantial differences between the priors will arise. However, given the complexity of the model, we believe that more extreme conditions than the ones considered in this paper would not be realistic scenarios where such a model should be used.

The default specification for the robust priors did show bias for the random effects variance parameters as well as for the parameter of interest, the contextual effect. The MSE of the random effects variance parameters was high in certain conditions but close to zero for the contextual effect across all conditions. The coverage for the random effects variance parameters was much too low, whereas the coverage for the contextual effect was too low only in certain population conditions. These results indicate that, although the robust prior distributions are better options for the random effects variance parameters than the inverse-Gamma prior, their hyperparameters still require careful consideration in order to obtain the best results. This was further investigated in Study 2, where we considered (in)accurate informative specifications for the robust priors. Unfortunately, the informative priors resulted in low convergence percentages, which complicated a thorough investigation and comparison of their performance. Again, this illustrates the difficulty of specifying the hyperparameters for these robust prior distributions. Future research should further investigate these and other informative specifications to fully assess their robustness against misspecification.

One limitation of the simulation study lies in the fact that we removed the non-converged replications from the results. If the non-convergence in these removed replications is due to particular sample values, this approach might bias the simulation results. We tried to partially remedy this problem by also considering weaker convergence criteria, which resulted in fewer non-converged replications but generally did not show substantial differences in terms of bias, MSE, and coverage. Still, non-convergence and how to deal with it remains an issue in any simulation study investigating extreme conditions in terms of sample size or population values. Compared to other MCMC algorithms, the Hamiltonian Monte Carlo algorithm used in Stan offers more convergence diagnostics. An advantage is that convergence can be assessed more accurately; however, a replication will also be flagged as non-converged more easily. Especially for large models, such as the model considered here, a few divergent transitions can easily occur, resulting in a replication that is flagged as non-converged. It can therefore be useful to compare the results obtained using various convergence criteria.

Throughout this paper, we have focused on the prior for the random effects variance or standard deviation. In regular multilevel models, several authors have proposed specifying priors on the VPC, which leads to an implied prior on the random effects variance (see, e.g., Daniels, 1999; Gustafson et al., 2006; Mulder & Fox, 2018; Natarajan & Kass, 2000). Although we have considered specifying a prior directly on the VPC, this approach is not straightforward in MLSEM. Specifically, since the VPC depends on multiple other model parameters (see Equation (6)), a choice needs to be made for which of those model parameters a prior is directly specified and for which of the model parameters the prior is implied. Future research is needed to investigate these choices and their consequences.

In this study, we have focused on the doubly latent categorical multilevel model, a popular multilevel SEM that is often used in practice. Even though some general principles apply regarding the sensitivity of the results to the prior distribution, and we might thus expect some of the results found here to generalize to other multilevel structural equation models (such as the poor performance of the inverse-Gamma priors), the complexity of SEMs makes generalizations beyond the model considered here difficult. The complex interplay of parameters in SEMs, combined with the small amount of information at the higher levels in multilevel models, can result in possibly unexpected consequences of particular prior specifications. This illustrates the importance of understanding the behavior of the joint prior that a researcher specifies (e.g., through the use of prior predictive checks; Gabry et al., 2019), as well as of conducting extensive prior sensitivity checks to investigate the sensitivity of the final results to the specified prior (Van Erp et al., 2018).

Bayesian MLSEM offers a powerful modeling framework to incorporate both measurement and sampling error within one model. However, the results presented in this paper indicate that some caution is warranted in applying these models. When the sample size is small, Bayesian estimation does not necessarily perform well (Smid et al., 2019). In order for Bayesian estimation to outperform classical estimation methods, the prior distribution needs to be well thought out. This is especially the case for the complex MLSEM considered in this paper: with only 20 groups, it is crucial for the prior distribution to add reasonable prior information to the analysis. More research is needed to investigate how we can achieve this.

Acknowledgments

We would like to thank Irene Klugkist, Duco Veen, and Mariëlle Zondervan-Zwijnenburg for helpful comments on an earlier version of this manuscript.

Additional information

Funding

This research was supported by a Research Talent Grant [406-15-264] from the Netherlands Organisation for Scientific Research.

Notes

1 Note that, generally, empirical data investigating the BFLPE, such as the PISA data (OECD, 2007), use 4-point Likert scale indicators for academic self-concept. However, we assume binary indicators instead to keep the computation time feasible in our simulation studies.

2 Alternatively, a probit link function can be used, in which case $\varepsilon_{ikj}^W \sim N(0, 1)$.

3 These restrictions are, in theory, sufficient to identify the model. However, it is generally recommended to additionally impose cross-level invariance, that is, $\lambda^W = \lambda^B$, to improve interpretability of the factors at both levels and to avoid estimation issues (Jak, 2018).

4 This corresponds to the parametrization used in Pérez et al. (2017) when $p = \nu_1/2$, $q = \nu_2/2$, and $b = \sigma^2_0$.

5 We parametrize the exponential distribution in terms of the scale $\sigma^2_0$ for consistency, but generally it is defined using the rate parameter $\lambda$, which is the inverse of the scale, that is, $\lambda = 1/\sigma^2_0$.

6 A third simulation study investigated the influence of the number of groups and analyzed a subset of conditions for 50 and 100 groups. However, as the number of groups increased, we ran into more convergence issues. Results for the converged conditions are available online at https://osf.io/pq8gm/.

7 We also considered the power and type 1 error rates. However, power to find a small contextual effect was low across the board (ranging from 0.01 to 0.09 even for 50 groups) and type 1 error rates were around the nominal 5% for all conditions, so these results are only available online at https://osf.io/pq8gm/. Initially, we also compared the precision of the various priors through the empirical SE, which is computed as $\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(\hat{\theta}_i - \bar{\theta})^2}$, with $\bar{\theta}$ denoting the mean of the $\hat{\theta}_i$. However, the empirical SE was sufficiently small in all conditions and for all priors (in the order of 1e-14) that we do not include this outcome in the results.

8 To assess the influence of the number of iterations on convergence, we reran one condition with 30,000 iterations per chain. However, this did not increase the convergence percentage or lead to substantial differences in the results. The results for this comparison are available at https://osf.io/pq8gm/.

References