Research Article

Evaluation of frequentist test statistics using constrained statistical inference in the context of the generalized linear model

Article: 2222164 | Received 12 Oct 2022, Accepted 30 May 2023, Published online: 22 Jun 2023

ABSTRACT

When faced with a binary or count outcome, informative hypotheses can be tested in the generalized linear model using the distance statistic as well as modified versions of the Wald, the Score and the likelihood-ratio test (LRT). In contrast to classical null hypothesis testing, informative hypotheses allow researchers to directly examine the direction or the order of the regression coefficients. Since knowledge about the practical performance of informative test statistics is missing in the theoretically oriented literature, we aim to close this gap using simulation studies in the context of logistic and Poisson regression. We examine the effect of the number of constraints as well as the sample size on type I error rates when the hypothesis of interest can be expressed as a linear function of the regression parameters. The LRT shows the best performance in general, followed by the Score test. Furthermore, both the sample size and especially the number of constraints affect the type I error rates considerably more in logistic than in Poisson regression. We provide an empirical data example together with R code that can be easily adapted by applied researchers. Moreover, we discuss informative hypothesis testing about effects of interest that are a non-linear function of the regression parameters, and we demonstrate this by means of a second empirical data example.

1. Introduction

A researcher wants to examine the relevance of five health indicators for the ability to live self-sufficiently, as opposed to living in a nursing home, after the age of 80. The health indicators are continuous variables, which are assessed by means of questionnaires and include ‘access to health care’ (x1), ‘use of preventive services’ (x2), ‘mental health’ (x3), ‘physical activity’ (x4) and ‘nutritional status’ (x5). These indicators, amongst others, have been described as the leading health indicators by the Centers for Disease Control and Prevention (CDC, Citation2020). The outcome Y, being able to live alone after the age of 80, is binary, where 1 represents success and 0 represents failure. This scenario will be used as a running example throughout the paper. A common technique to analyze data with a binary outcome is logistic regression (Hosmer et al., Citation2013). If the outcome is a count variable, for example the number of days spent in an intensive care unit in a hospital, Poisson regression can be used (Agresti, Citation2003).

In situations like this, researchers typically test a variety of hypotheses related to the regression coefficients. For example, the researcher can test whether ‘access to health care’ (x1) has a significant effect on the outcome Y after controlling for the other variables. Or the researcher can test whether a subset of regression coefficients, for example those of ‘physical activity’ (x4) and ‘nutritional status’ (x5), has a significant effect on the outcome Y after controlling for the other variables. In other words, the researcher can assess whether regression parameters of interest are significant in the model. For that, standard test statistics like the Wald, the Score or the LRT (see, e.g., Buse, Citation1982) can be used.

However, in some situations, a certain ordering or certain signs of the regression coefficients can be expected. Considering our exemplary predictors x1, …, x5 and the binary outcome Y, being able to live self-sufficiently after the age of 80, the researcher assumes that all regression coefficients will be positive. Better access to health care or greater physical activity will lead to a higher probability of success. In the case of regular null hypothesis testing, the researcher can test the null hypothesis that the regression coefficients of all predictors are zero: (1) H0: β1 = 0, β2 = 0, β3 = 0, β4 = 0, β5 = 0, against the alternative hypothesis that at least one of the regression coefficients is nonzero in the model: (2) Ha: β1 ≠ 0, β2 ≠ 0, β3 ≠ 0, β4 ≠ 0, β5 ≠ 0. If H0 can be rejected in favor of Ha, the researcher can assess whether the regression coefficients are greater or smaller than zero via post-hoc tests. This procedure is somewhat unfortunate, since the researcher assumed a positive sign for all the regression coefficients right from the start.

In contrast to regular null hypothesis testing, constrained statistical inference (Hoijtink, Citation2012; Silvapulle & Sen, Citation2005) allows the researcher to take the ordering or the signs of the regression coefficients into account using equality and inequality constraints. In other words, H0: β1 = 0, β2 = 0, β3 = 0, β4 = 0, β5 = 0 can be tested against the ‘informative’ hypothesis Ha: β1 ≥ 0, β2 ≥ 0, β3 ≥ 0, β4 ≥ 0, β5 ≥ 0, where at least one of the inequality constraints must be strictly true, whereas the remaining ones may be equalities. Thus, researchers can formulate their hypotheses of interest directly, instead of making a detour via another hypothesis. This means that researchers avoid the inflated type I error rates that come with post-hoc testing. Furthermore, informative hypothesis testing provides the researcher with greater power compared to regular null hypothesis testing (see, e.g., Vanbrabant et al., Citation2015).

The method of informative hypothesis testing is especially useful since many research questions in the social and behavioral sciences implicitly include an expectation of the researcher about the sign or the ordering of regression coefficients. Further examples of such research questions are the following: An organizational psychologist might want to assess the effect of ‘job satisfaction’ and ‘workload’ on the number of absence days at work. The expectation could be that the former reduces, whereas the latter increases the number of absence days. Or a clinical psychologist might want to evaluate the relevance of ‘therapy motivation’ or ‘functioning of interpersonal support systems’ on successful hospital discharge after a stationary psychotherapy program. Here, it could be expected that both increase the rates of successful hospital discharge.

In the context of the generalized linear model, informative hypothesis testing can be conducted by means of modified versions of the Wald, the Score and the LRT (Silvapulle & Sen, Citation2005). To calculate the p-value of these statistics, different approaches have been proposed (Silvapulle & Sen, Citation2005). Unfortunately, informative hypothesis testing is rarely used despite the extensive literature resources. This may be because software is lacking or because the constrained statistical inference literature is mainly focused on theory. Therefore, we lack knowledge about the practical performance of informative test statistics under different circumstances. This concerns, for example, the number of constraints, even though larger numbers are quite common, especially when multiple regression coefficients are assumed to be in a certain order. Assume, for instance, that a researcher expects the following ordering of five regression coefficients: β1 > β2, β2 > β3, β3 > β4, β4 > β5, which includes four constraints. In that case, the researcher might want to know how including more or fewer constraints in the hypotheses will affect type I error rates.

Furthermore, we do not know the impact of small sample sizes on the performance of informative test statistics, as the literature primarily describes their asymptotic behavior. This is unfortunate, since applied researchers are typically interested in knowing whether their available sample size suffices to obtain reasonable results. For the standard linear regression model, there exists some literature that focuses on the impact of sample size on the practical performance of informative hypothesis testing (Keck et al., Citation2021, Citation2022; Vanbrabant et al., Citation2015). However, to this point, similar work is missing for the generalized linear model.

In this paper, we aim to close this gap. We want to assess the performance of various informative test statistics in the context of the generalized linear model by means of simulation studies. We consider the distance statistic (D-statistic) as well as the informative test versions of the Wald, the Score and the LRT. Furthermore, we consider different conditions regarding the sample size and the number of constraints. Note that the test statistics that are used in this paper work equally well for all members of the family of generalized linear models. However, we choose to limit our study to logistic and Poisson regression, as these are very widely used.

This paper is structured as follows. First, we briefly review the generalized linear model and discuss ways of parameter estimation. Subsequently, we present ‘regular’ as well as informative test statistics. Then, we introduce the design of our simulation studies and give an overview of the obtained results. In the subsequent sections, we present an empirical data example and explain informative hypothesis testing with non-linear constraints. We finish with a short discussion. All R code that was used is available on the OSF project site for this paper.Footnote1

2. Generalized linear regression model

The generalized linear model has been described by, for example, Agresti (Citation2003), Agresti (Citation2018), McCullagh and Nelder (Citation1989), Nelder and Wedderburn (Citation1972). It differs from the linear regression model in two aspects. First, it can handle non-normally distributed outcomes and second, it models non-linear functions of the mean response variable Y (Agresti, Citation2003). Three features constitute the generalized linear model, namely a random component, a systematic component and a link function.

The random component refers to the response variable Y and its probability distribution from the exponential family. The systematic component specifies a linear function of the explanatory variables and is called the linear predictor: (3) β0 xi0 + β1 xi1 + ⋯ + βp xip = Σ_{j=0}^p βj xij. Usually xi0 = 1, which makes β0 the intercept of the model.

The link function g(·) connects the systematic and the random component: (4) g[E(Yi)] = Σ_{j=0}^p βj xij, where g(·) is a possibly non-linear, monotone, differentiable function. Since we focus on logistic and Poisson regression in this paper, we present only these models in more depth.

The logistic regression model is a generalized linear model with a Bernoulli or binomial random component. If Yi has a Bernoulli distribution, the distribution is specified by the parameter πi, which is the probability of success P(Yi = 1), while 1 − πi represents the probability of failure. The canonical link function is the logit link: (5) g(πi) = logit(πi) = log[πi / (1 − πi)], (0 < πi < 1), where the value of πi changes with the values of the explanatory variables: (6) πi = exp(Σ_{j=0}^p βj xij) / [1 + exp(Σ_{j=0}^p βj xij)]. The Poisson regression model is a generalized linear model with a count random component. If Yi has a Poisson distribution, the distribution is specified by the parameter μi, which represents the expected count. The canonical link function is the log link: (7) g(μi) = log(μi), (μi > 0), where the value of μi changes with the values of the explanatory variables: (8) μi = exp(Σ_{j=0}^p βj xij). Given a random sample of n observations (yi, xi), i = 1, …, n, the maximum likelihood (ML) estimates of the regression coefficients βˆ = (βˆ0, …, βˆp)′ are usually obtained using iteratively reweighted least squares or (quasi-)Newton methods. This can be done using standard software, for example the glm() function in R (R Core Team, Citation2020). To compute standard errors, we also need the unit information matrix: (9) I₁ = (1/n) X′WX, where X is the design matrix and W is a diagonal matrix of size n × n whose elements depend on the model (Agresti, Citation2003, p. 135).
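For readers who want to see the mechanics behind glm(), the logistic-regression ML fit and the information matrix X′WX can be sketched with plain iteratively reweighted least squares. This is an illustrative Python/numpy sketch (simulated data; the seed and coefficient values are arbitrary), not the paper's R code:

```python
import numpy as np

# Simulate a logistic regression data set (illustrative values only)
rng = np.random.default_rng(1)
n, p = 500, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # x_i0 = 1 (intercept)
beta_true = np.array([0.2, 0.5, -0.3])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

# Iteratively reweighted least squares (IRLS) for the logit link
beta = np.zeros(p + 1)
for _ in range(25):
    eta = X @ beta
    mu = 1 / (1 + np.exp(-eta))          # pi_i, cf. Equation (6)
    w = mu * (1 - mu)                    # diagonal of W for the Bernoulli model
    z = eta + (y - mu) / w               # working response
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))

info = X.T @ (w[:, None] * X)            # X'WX; dividing by n gives the unit information
se = np.sqrt(np.diag(np.linalg.inv(info)))
```

With a moderate sample size the estimates land close to the data-generating coefficients, and the square roots of the diagonal of the inverse information reproduce the standard errors reported by standard software.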

To obtain β~, the vector of inequality constrained regression coefficients, different approaches can be followed. We created custom functions using constrained optimization algorithms (Nocedal & Wright, Citation1999). More specifically, we employed a quasi-Newton method with box constraints, as implemented in the R function nlminb(). Note that this approach can only handle informative hypotheses specifying the sign of regression coefficients (for instance Ha: β1 > 0). There are, however, other types of informative hypotheses (Hoijtink, Citation2012), for example those assuming a certain ordering of the regression coefficients (such as Ha: β1 ≥ β2 ≥ β3).
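The box-constrained estimation described above can be mimicked outside R. The following Python sketch uses scipy's L-BFGS-B (a quasi-Newton method with box constraints, analogous to nlminb()) to obtain a sign-constrained logistic fit; the simulated data, seed, and choice of constraints are illustrative assumptions, not the authors' setup:

```python
import numpy as np
from scipy.optimize import minimize

# Data generated under H0 (all coefficients zero), as in the simulations later on
rng = np.random.default_rng(2)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.binomial(1, 0.5, size=n)

def negloglik(beta):
    """Negative Bernoulli log-likelihood with logit link (numerically stable)."""
    eta = X @ beta
    return np.sum(np.logaddexp(0.0, eta) - y * eta)

# Unconstrained ML estimate (beta-hat)
fit_hat = minimize(negloglik, np.zeros(3), method="L-BFGS-B")

# Inequality constrained estimate (beta-tilde): slopes >= 0, intercept free,
# i.e. the box-constraint analogue of H_a: beta_1 >= 0, beta_2 >= 0
bounds = [(None, None), (0.0, None), (0.0, None)]
fit_tilde = minimize(negloglik, np.zeros(3), method="L-BFGS-B", bounds=bounds)
```

Because the constrained parameter space is a subset of the unconstrained one, the constrained fit can never attain a higher likelihood than the unconstrained fit.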

A more flexible approach is implemented in the R package restriktor (Vanbrabant, Citation2020). It includes ML estimation comparable to iteratively reweighted least squares (IRLS), where the least squares solver is replaced by a quadratic program. This approach can handle all kinds of informative hypotheses, as long as they can be expressed as a linear function of the model parameters.

3. Hypothesis testing

In this section, we present regular test statistics used in classical null hypothesis testing, as well as informative test statistics used in informative hypothesis testing. Note that the test statistics from classical null hypothesis testing are denoted by means of a ‘reg’ subscript, whereas the test statistics used in informative hypothesis testing are specified by means of an ‘info’ subscript. Furthermore, R is the constraint matrix specifying the linear combination of regression coefficients expressing the hypothesis of interest.

For example, assume the researcher wants to test whether the effect of ‘physical activity’ (x4) is more than twice as large as the effect of ‘access to health care’ (x1) and the effect of ‘mental health’ (x3) is more than twice as large as the effect of ‘use of preventive services’ (x2) after standardizing all variables. In that case, the researcher aims to test H0: β4 = 2β1, β3 = 2β2 against Ha: β4 ≠ 2β1, β3 ≠ 2β2 in the context of classical null hypothesis testing, or against Ha: β4 ≥ 2β1, β3 ≥ 2β2 in the context of informative hypothesis testing. Then, in both the regular and the informative case, the rows of R are specified as (10) r1 = (0, −2, 0, 0, 1, 0) and (11) r2 = (0, 0, −2, 1, 0, 0), leading to the full constraint matrix: (12)
R = ( 0 −2  0 0 1 0
      0  0 −2 1 0 0 )
with row rank h = 2.

3.1. Classical null hypothesis testing

The test statistics from classical null hypothesis testing that will be presented include the Wald, the Score and the LRT. These large-sample test statistics are explained in Buse (Citation1982) and are defined as follows: (13) Waldreg = n (Rβˆ)′ (R Iˆ₁⁻¹ R′)⁻¹ (Rβˆ), (14) Scorereg = (1/n) S(β¯)′ I¯₁⁻¹ S(β¯), (15) LRTreg = 2[ℓ(βˆ) − ℓ(β¯)], where S(β¯) = ∂ℓ(β)/∂β evaluated at β¯, the vector of equality constrained estimates, ℓ(β¯) is the log-likelihood evaluated at β¯ and ℓ(βˆ) is the log-likelihood evaluated at βˆ. All three test statistics asymptotically follow a χ²-distribution under the null hypothesis with df = h, if the model is correct.
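As an illustration of the regular Wald statistic in Equation (13), the following Python sketch tests H0: β1 = β2 = 0 in a simulated logistic model. It works with the total information X′WX instead of the unit information Iˆ₁, which absorbs the factor n; the data and seed are arbitrary assumptions:

```python
import numpy as np
from scipy.stats import chi2
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.binomial(1, 0.5, size=n)          # H0 is true in this simulated data

negloglik = lambda b: np.sum(np.logaddexp(0.0, X @ b) - y * (X @ b))
beta_hat = minimize(negloglik, np.zeros(3), method="BFGS").x

# Fisher information X'WX at beta-hat (Bernoulli model: w_i = pi_i (1 - pi_i))
mu = 1 / (1 + np.exp(-X @ beta_hat))
info = X.T @ ((mu * (1 - mu))[:, None] * X)

# Constraint matrix R for H0: beta_1 = beta_2 = 0, so h = 2
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
rb = R @ beta_hat
wald = rb @ np.linalg.solve(R @ np.linalg.inv(info) @ R.T, rb)
p_value = chi2.sf(wald, df=R.shape[0])    # chi^2 reference with df = h
```

Under the null hypothesis and a correct model, repeating this over many simulated data sets would reject at roughly the nominal rate for large n.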

The Wald, the Score and the LRT are asymptotically equivalent. However, it has been shown that the values of the Wald test are always slightly larger than the values of the LRT, which in turn are always slightly larger than the values of the Score test (Buse, Citation1982, p. 157). Thus, using the same critical χ² value, the tests may have different power properties. This can be one aspect guiding the choice between them. Another aspect may be the computational resources needed to compute the three tests. For the Wald test, we need to fit the unconstrained model, whereas for the Score test, we need to fit the equality constrained model, and for the LRT, we need to fit both the unconstrained and the equality constrained model. Oftentimes, fitting the unconstrained model takes the least amount of time, which is why the Wald test is chosen frequently. However, in some cases, for example if the equality constrained model has far fewer parameters than the unconstrained model, fitting the equality constrained model may require fewer computational resources than fitting the unconstrained model. Furthermore, the LRT is, in contrast to the Wald and Score tests, scale-invariant (see, e.g., Lehmann, Citation1986).

3.2. Informative hypothesis testing

Often, the informative test statistics are modified versions of the regular test statistics. For example, Waldinfo can be found in Silvapulle and Sen (Citation2005, p. 154): (16) Waldinfo = n (Rβ~)′ (R Iˆ₁⁻¹ R′)⁻¹ (Rβ~). It uses the same constraint matrix R but a different vector of regression coefficients compared to the regular Wald test. While Waldreg uses βˆ, the vector of unconstrained regression coefficients, Waldinfo uses β~, the vector of inequality constrained regression coefficients. Scoreinfo can be computed as follows (Silvapulle & Sen, Citation2005, p. 159): (17) Scoreinfo = (1/n) [S(β~) − S(β¯)]′ Iˆ₁⁻¹ [S(β~) − S(β¯)]. While Scorereg only involves the score function evaluated at β¯, Scoreinfo additionally involves the score function evaluated at β~. LRTinfo is defined as (Silvapulle & Sen, Citation2005, p. 157): (18) LRTinfo = 2[ℓ(β~) − ℓ(β¯)], where ℓ(β¯) is the log-likelihood evaluated at β¯ and ℓ(β~) is the log-likelihood evaluated at β~. The difference between LRTreg and LRTinfo is that the former uses ℓ(βˆ), whereas the latter uses ℓ(β~).
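To make the roles of β¯ and β~ concrete, the following Python sketch computes LRTinfo for Ha: β1 ≥ 0, β2 ≥ 0 in a simulated logistic model, fitting the equality constrained model (slopes fixed at zero) and the inequality constrained model (slopes bounded below by zero) with scipy. The data and seed are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.binomial(1, 0.5, size=n)          # data generated under H0

def negloglik(beta):
    eta = X @ beta
    return np.sum(np.logaddexp(0.0, eta) - y * eta)

# beta-bar: equality constrained fit under H0 (slopes fixed at zero)
fit_bar = minimize(negloglik, np.zeros(3), method="L-BFGS-B",
                   bounds=[(None, None), (0.0, 0.0), (0.0, 0.0)])
# beta-tilde: inequality constrained fit under H_a (slopes >= 0)
fit_tilde = minimize(negloglik, np.zeros(3), method="L-BFGS-B",
                     bounds=[(None, None), (0.0, None), (0.0, None)])

# LRT_info = 2[l(beta-tilde) - l(beta-bar)]; nonnegative because the equality
# constrained parameter space is contained in the inequality constrained one
lrt_info = 2.0 * (fit_bar.fun - fit_tilde.fun)
```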

Finally, lesser known is the D-statistic. It is calculated as (Silvapulle & Sen, Citation2005, p. 159): (19) Dinfo = 2n [d(β¯) − d(β~)], where d(β¯) and d(β~) are the values of the following two functions at their solutions: (20) f(β) = (βˆ − β)′ Iˆ₁ (βˆ − β) under the constraint Rβ = 0, (21) f(β) = (βˆ − β)′ Iˆ₁ (βˆ − β) under the constraint Rβ ≥ 0. When minimizing these functions, we treat βˆ and Iˆ₁ as known constants. Note that to compute the D-statistic, we have to use quadratic programming.
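In the special case where Rβ ≥ 0 reduces to box constraints on individual coefficients, the two quadratic programs in Equations (20) and (21) can be solved with a bounded quasi-Newton routine; general R matrices require a proper QP solver. A Python sketch with hypothetical values for βˆ and the unit information (not taken from any fitted model):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical ingredients, standing in for an unconstrained fit
beta_hat = np.array([0.10, -0.20, 0.15])         # made-up estimates
I1 = np.array([[0.25, 0.02, 0.01],               # made-up unit information matrix
               [0.02, 0.24, 0.03],
               [0.01, 0.03, 0.26]])
n = 400

def f(beta):
    """Quadratic objective of Equations (20)/(21), beta_hat and I1 held fixed."""
    d = beta_hat - beta
    return d @ I1 @ d

# R beta = 0: slopes fixed at zero; R beta >= 0: slopes nonnegative (box case)
d_bar = minimize(f, np.zeros(3), method="L-BFGS-B",
                 bounds=[(None, None), (0.0, 0.0), (0.0, 0.0)]).fun
d_tilde = minimize(f, np.zeros(3), method="L-BFGS-B",
                   bounds=[(None, None), (0.0, None), (0.0, None)]).fun

D_info = 2 * n * (d_bar - d_tilde)               # Equation (19)
```

Because the equality constrained feasible set is contained in the inequality constrained one, d(β¯) ≥ d(β~) and the statistic is nonnegative.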

In case the model is correct, the informative Wald, Score and LRT as well as the D-statistic asymptotically follow a χ¯2-distribution under the null hypothesis. This is a mixture of χ2-distributions.

3.2.1. P-values

Silvapulle and Sen (Citation2005) present two approaches for calculating the p-value of informative test statistics. In the first part of this paper, where the informative hypothesis of interest can be expressed as a linear function of the regression coefficients, we use the approach where we first calculate the weights w0, …, wq of the χ¯2 mixture distribution (Silvapulle & Sen, Citation2005, p. 79). The sum of the weights from 0 to q is one, where q is the rank of X under the null hypothesis. Once we have computed the weights, the p-value of the observed χ¯2-value (χ¯2_obs) is obtained as follows (Silvapulle & Sen, Citation2005, p. 86): (22) Pr(χ¯2 ≥ χ¯2_obs) = Σ_{i=0}^q wi(H0, Ha) Pr[χ2_{h−q+i} ≥ χ¯2_obs]. The second approach to calculate the p-value of informative test statistics is described and demonstrated in Keck et al. (Citation2021). We use this approach in the second part of this paper, where the informative hypothesis of interest is expressed as a non-linear function of the regression coefficients. Both approaches are explained in more detail in the document ‘A-pValues.pdf’ on the OSF project site.

Note that if the hypothesis of interest only refers to the sign of one regression coefficient or to the sign of one quantity of interest, which is defined as a function of regression coefficients, we have a special case. Then the informative p-value equals the regular (non-informative) p-value divided by 2.
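In the simplest setting, where the q sign constraints are independent (a diagonal covariance matrix of Rβˆ) and h = q, the χ¯2 weights reduce to binomial probabilities, and the mixture p-value of Equation (22) can be sketched as follows. This is an assumption-laden special case; the general weights are the level probabilities of Silvapulle and Sen (2005, p. 79):

```python
import numpy as np
from math import comb
from scipy.stats import chi2

# Special case: q independent sign constraints. The chi-bar-square weights are
# then binomial, w_i = C(q, i) / 2^q. General covariance structures need the
# level-probability computation instead.
q = 5
w = np.array([comb(q, i) for i in range(q + 1)]) / 2.0**q

def chibarsq_sf(t, w):
    """P(chibar^2 >= t) as a weighted mixture of chi^2 tail probabilities.
    The chi^2 with 0 df is a point mass at zero."""
    p = w[0] * float(t <= 0)
    for i in range(1, len(w)):
        p += w[i] * chi2.sf(t, df=i)
    return p

p_value = chibarsq_sf(7.5, w)    # p-value for an observed statistic of 7.5
```

A quick sanity check: the weights sum to one, and the survival function evaluated at zero equals one.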

4. Simulation studies

We conducted several simulation studies using logistic and Poisson regression. The goal was to compare the presented informative test statistics in terms of their type I error rates under different conditions. One of the design factors in our simulation studies was sample size. This was of interest, as it is only known how the test statistics behave asymptotically, but not in finite sample sizes. The other design factor was the number of constraints that was included in the informative hypothesis. This coincides with h, the row rank of R. As a benchmark, the presented regular test statistics were also included in the simulation studies. By means of these simulation studies, we would like to give applied users a sense of what they might expect when using informative hypothesis testing in the context of the generalized linear model.

4.1. Design

The model we used had a single outcome Y and five predictors x1, …, x5. We assumed that all predictors were continuous and normally distributed. Since we are interested in type I error, we generated data under the null hypothesis and thus set all regression coefficients β0, …, β5 to 0. For the logistic regression, Y was sampled from the binomial distribution with probability .50. For the Poisson regression, Y was sampled from the Poisson distribution with μ = 1. Sample sizes of 10, 25, 50, 100, 200, 300, 400, 500, 1000, 2000 and 10000 were examined. This way, we regarded small sample sizes that are typical in the social sciences (see, e.g., Van de Schoot & Miočević, Citation2020) as well as medium and large sample sizes. Even though the large sample sizes are unrealistic, they provide insights into the pattern emerging when increasing the sample size. Furthermore, we considered three constraint matrices, namely (23)
R1 = (0 1 0 0 0 0),
(24)
R2 = ( 0 1 0 0 0 0
       0 0 1 0 0 0
       0 0 0 1 0 0
       0 0 0 0 1 0
       0 0 0 0 0 1 )
and (25)
R3 = ( 0 1 −1  0  0  0
       0 0  1 −1  0  0
       0 0  0  1 −1  0
       0 0  0  0  1 −1 ).
The first constraint matrix represents the hypothesis that only β1 is greater than zero: Ha: β1 > 0. The second constraint matrix states that at least one regression coefficient, except the intercept, is greater than zero: Ha: β1 ≥ 0, β2 ≥ 0, β3 ≥ 0, β4 ≥ 0, β5 ≥ 0. And the third constraint matrix specifies a hypothesis where regression parameters are assumed to be in a certain order, namely Ha: β1 ≥ β2, β2 ≥ β3, β3 ≥ β4, β4 ≥ β5. These constraint matrices were chosen to cover a typical range of numbers of constraints (see, e.g., Vanbrabant et al., Citation2015).
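The three constraint matrices are easy to build programmatically. A short Python sketch, with each row of R3 encoding βj − βj+1 ≥ 0, that also verifies the row ranks (1, 5 and 4):

```python
import numpy as np

# R1: H_a: beta_1 > 0 (columns correspond to beta_0, ..., beta_5)
R1 = np.array([[0, 1, 0, 0, 0, 0]], dtype=float)

# R2: H_a: beta_j >= 0 for j = 1, ..., 5 (intercept unconstrained)
R2 = np.hstack([np.zeros((5, 1)), np.eye(5)])

# R3: H_a: beta_1 >= beta_2 >= ... >= beta_5,
# each row encoding beta_j - beta_{j+1} >= 0
R3 = np.zeros((4, 6))
for j in range(4):
    R3[j, j + 1] = 1.0
    R3[j, j + 2] = -1.0

ranks = [np.linalg.matrix_rank(R) for R in (R1, R2, R3)]
```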

For each condition, we generated data, fitted the model and computed the test statistics as well as the corresponding p-values. This was repeated 10000 times. Subsequently, we defined the type I error rate as the proportion of p-values that were lower than the significance level. In this paper, we used α=.05.
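As a miniature version of this design, the following Python sketch estimates the type I error rate for the single-constraint case Ha: β1 > 0 in logistic regression, exploiting the special case noted in Section 3.2.1 in which the informative p-value is half the regular one (here, the one-sided Wald z-test). For speed it uses 2000 replications and a single sample size; the seed is arbitrary:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
n, reps, alpha = 200, 2000, 0.05
rejections = 0
for _ in range(reps):
    x = rng.normal(size=n)
    y = rng.binomial(1, 0.5, size=n)          # H0: beta_1 = 0 is true
    X = np.column_stack([np.ones(n), x])
    beta = np.zeros(2)
    for _ in range(20):                       # IRLS fit of the logistic model
        eta = X @ beta
        mu = 1 / (1 + np.exp(-eta))
        w = mu * (1 - mu)
        z = eta + (y - mu) / w
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    se = np.sqrt(np.linalg.inv(X.T @ (w[:, None] * X))[1, 1])
    p_info = norm.sf(beta[1] / se)            # one-sided p = regular p / 2
    rejections += p_info < alpha
rate = rejections / reps                      # empirical type I error rate
```

With α = .05 and 2000 replications, the Monte Carlo standard error is about .005, so the empirical rate should land near the nominal level.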

In the following sections, the results are presented for logistic and Poisson regression. Note that we only report the results of the simulation studies using the first two constraint matrices. The reason is that using the second and third constraint matrices led to very similar results, with absolute differences smaller than 0.01, given that the flexible approach to parameter estimation implemented in restriktor was used. This intuitively makes sense, since both matrices include multiple inequality constraints. Furthermore, the constraints of R3, such as β1 ≥ β2, can be re-formulated to β1 − β2 ≥ 0, which makes them very similar to the constraints used in R2, such as β1 ≥ 0. In the end, all matrices are based on inequality constraints. The interested reader is referred to the R scripts of the simulation studies using R3, which can be found on the OSF project site (‘B-orderingExampleLogistic.R’ and ‘C-orderingExamplePoisson.R’). Note that the files ‘D-weights.R’ and ‘E-lavUvpois.R’ are also needed to run the scripts. The latter belongs to the R package lavaan (Rosseel, Citation2012), but at the time of writing, the functions are not (yet) included in the current public version of the package.

4.2. Type I results logistic regression

Tables 1 and 2 show the type I error rates resulting from running the simulation studies using logistic regression. When considering R1, we can see that LRTreg and Scorereg show too large type I error rates for N = 50 and below, whereas Waldreg shows too small type I error rates for N = 25 and below. In contrast, when considering R2, Waldreg shows too small type I error rates for N = 200 and below and LRTreg shows too large type I error rates already for N = 100 and below. Thus, in the context of logistic regression, LRTreg and Waldreg seem to be sensitive to the number of constraints. This does not apply to Scorereg, which only shows too small type I error rates for N = 25 and below when using R2.

Table 1. Type I error rates in logistic regression when using R1. Bold values are above .06 and underlined values are below .04.

Table 2. Type I error rates in logistic regression when using R2. Bold values are above .06 and underlined values are below .04.

The sensitivity to the number of constraints is something that we can also observe for all informative test statistics except LRTinfo. That is, LRTinfo shows too large type I error rates for N = 50 and below, no matter whether R1 or R2 is used. In contrast, Waldinfo and Dinfo only show too small type I error rates for N = 10 when using R1. When using R2, Waldinfo shows too small type I error rates already for N = 100 and below and Dinfo shows too small type I error rates already for N = 200 and below. In conclusion, it seems that most regular as well as informative test statistics, except Scorereg and LRTinfo, are sensitive to the number of constraints and the sample size.

4.3. Type I results Poisson regression

Tables 3 and 4 show the type I error rates resulting from running the simulation studies using Poisson regression. First of all, it can be observed that the results show more stable type I error rates for all test statistics, compared to logistic regression. The regular test statistics are not sensitive to the number of constraints. More specifically, when using R1, LRTreg shows a too large type I error rate and Waldreg shows a too small type I error rate only at a sample size of N = 10. Scorereg even shows no problematic type I error rate at all. When using R2, LRTreg shows no problematic type I error rate at all, but Waldreg shows too small type I error rates for N = 25 and below and Scorereg shows a too small type I error rate for N = 10.

Table 3. Type I error rates in Poisson regression when using R1. Bold values are above .06 and underlined values are below .04.

Table 4. Type I error rates in Poisson regression when using R2. Bold values are above .06 and underlined values are below .04.

Moreover, the informative test statistics are likewise not sensitive to the number of constraints. That is, when using R1, LRTinfo, Waldinfo and Dinfo show type I error rates that deviate from the nominal level for N = 25 and below. In contrast, Scoreinfo already shows one slightly too small type I error rate for N = 50. When using R2, Waldinfo, Dinfo and Scoreinfo only show problematic type I error rates for N = 10. LRTinfo even shows no problematic type I error rates at all.

To conclude, the sensitivity to the number of constraints that was observed for most regular and informative test statistics in the context of logistic regression could not be observed in the context of Poisson regression. Furthermore, type I error rates were also more stable with respect to the sample size when using Poisson regression compared to logistic regression.

5. Empirical data example

The ‘DoctorVisits’ data set (Cameron, Citation1986; Cameron & Trivedi, Citation1998; Mullahy, Citation1997) comes with the AER package (Kleiber & Zeileis, Citation2022) and contains data about Australian health service use. The sample size is n = 5190. For the empirical data example of this paper, we chose the number of doctor visits as the dependent variable. Age, the number of illnesses and the number of days with reduced activity due to illness served as the predictors. The computations were conducted both for Poisson and logistic regression. In the latter case, the dependent variable was dichotomized, meaning that all values greater than zero were re-coded to 1, whereas 0 values stayed the same.

Following the design of the simulation studies, the following constraint matrix was used: (26)
R = ( 0 1 0 0
      0 0 1 0
      0 0 0 1 ),
stating that at least one regression coefficient, except the intercept, is greater than zero. Using Poisson regression, we obtained Dinfo = 1640.05, p < .001, and using logistic regression, we obtained Dinfo = 504.66, p < .001. The R code of the empirical data example can be found on the OSF project site (‘F-doctorVisits.R’).

6. Informative hypothesis testing with non-linear constraints

Informative hypotheses can not only be specified for linear combinations of the regression coefficients, that is Rβ, but also for non-linear combinations of the regression coefficients, namely c(β). In the latter case, c(β) can be either a scalar- or a vector-valued non-linear function of the model parameters that computes a quantity (or a vector of quantities) of interest. Important examples of such non-linear functions include ‘effects of interest’ such as risk ratios, odds ratios, conditional effects, and average or marginal effects. Keck et al. (Citation2021) already demonstrated how to test informative hypotheses about various effects of interest in the context of the linear regression model. In the context of the generalized linear model, we need to use generalizations of the informative test statistics for this purpose. The generalized Wald test is defined as follows (Silvapulle & Sen, Citation2005, p. 166): (27) Waldinfo,gen = n c(β~)′ [C(β~) I~₁⁻¹ C(β~)′]⁻¹ c(β~), where c is a non-linear function of β~, C is the Jacobian matrix of c and I~₁ is the unit information matrix. Note that if c(β) were linear, it would be equal to Rβ in Waldinfo (see Equation (16)).

The generalized D-statistic (Silvapulle & Sen, Citation2005, p. 164) can be computed as (28) Dinfo,gen = 2n [d(β¯) − d(β~)], where d(β¯) and d(β~) are the values of the following two functions at their solutions: (29) f(β) = (βˆ − β)′ Iˆ₁ (βˆ − β) under the constraint c(β) = 0, (30) f(β) = (βˆ − β)′ Iˆ₁ (βˆ − β) under the constraint c(β) ≥ 0. The constraints differ from the ones used for Dinfo. In this case, we have to use non-linear optimization methods to compute the D-statistic.

The other generalized informative test statistics still look the same as presented before. That is, the generalized Score test can be computed as (Silvapulle & Sen, Citation2005, p. 166): (31) Scoreinfo,gen = (1/n) [S(β~) − S(β¯)]′ Iˆ₁⁻¹ [S(β~) − S(β¯)], and the generalized LRT is calculated as follows (Silvapulle & Sen, Citation2005, p. 164): (32) LRTinfo,gen = 2[ℓ(β~) − ℓ(β¯)]. To compute β~, we can employ non-linear optimization algorithms (see, e.g., Nocedal & Wright, Citation1999). An example can be found on the OSF project site in ‘G-estimationExampleNonlin.R’, where data from the next section's empirical data example are used. The additional R files ‘H-elrEffects.R’ and ‘E-lavUvpois.R’ are needed. The former belongs to the R package EffectLiteR (Mayer & Dietzfelbinger, Citation2019; Mayer et al., Citation2016), but at the time of writing, the functions are not (yet) included in the current public version of this package.

6.1. Empirical data example

To demonstrate informative hypothesis testing concerning effects of interest, we present a second empirical data example, where we use data from the ACTIVE studyFootnote2 (Ball et al., Citation2002; Jobe et al., Citation2001; Tennstedt et al., Citation2005). The ACTIVE study is a large randomized controlled trial designed to examine the effectiveness of cognitive interventions among older adults. It also served as an empirical data example in Kiefer and Mayer (Citation2021a, Citation2021b), where effects on count outcomes with non-normal as well as latent covariates are discussed.

For our example, we consider two levels of the treatment variable X, where X = 0 is the control group and X = 1 denotes the group that receives memory training. This leads to a sample size of n = 1401. Additionally, we consider the continuous predictor variable Z, which is a depression score. The outcome variable Y is a count variable, which describes the performance on an inductive reasoning assessment. We are interested in testing H0: AE10 = 0 against Ha: AE10 > 0. In other words, we would like to test whether the average effect of memory training is greater than zero. An average effect can be defined as the unconditional expectation of the difference between expected outcomes under treatment and under control, that is, E[E(Y|X = 1, Z) − E(Y|X = 0, Z)]. In our example, we use a Poisson regression, where the regressors are related to the logarithm of the conditional expectation of the count outcome (see also Equation (8)): (33) log[E(Y|X, Z)] = β0 + β1X + β2Z + β3XZ, (34) E(Y|X, Z) = exp(β0 + β1X + β2Z + β3XZ). Consequently, the average effect can be defined as (35) AE10 = E[exp(β0 + β1 + β2Z + β3Z)] − E[exp(β0 + β2Z)], which can be estimated as follows: (36) AEˆ10(βˆ) = (1/n) Σ_{i=1}^n [exp(βˆ0 + βˆ1 + βˆ2zi + βˆ3zi) − exp(βˆ0 + βˆ2zi)]. As discussed before, this is a non-linear function of the regression coefficients β. When testing H0: AE10 = 0 against Ha: AE10 > 0, we obtain Waldinfo,gen = 1.80, p = .09. The computations can be found in the R script ‘I-averageEffect.R’ on the OSF project site. Since the hypothesis of interest only concerns one effect of interest, the p-value equals half of the regular (non-informative) p-value (0.18). For more complicated informative hypotheses, we demonstrate how to compute the p-value in the R script ‘J-pValueSimulation.R’, which can also be found on the OSF project site.
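Since AE10 is a scalar effect, the generalized Wald computation reduces to a one-sided z-test via the delta method. The Python sketch below illustrates this with simulated data in the style of this example (binary X, continuous Z, Poisson Y); all coefficient values, the seed and the numerical-Jacobian step are illustrative assumptions, and the data are not the ACTIVE data:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Simulated stand-in for the design (made-up coefficients)
rng = np.random.default_rng(6)
n = 800
X = rng.binomial(1, 0.5, size=n)
Z = rng.normal(size=n)
beta_true = np.array([1.0, 0.15, 0.1, 0.05])
mu = np.exp(beta_true[0] + beta_true[1] * X + beta_true[2] * Z + beta_true[3] * X * Z)
Y = rng.poisson(mu)

# Poisson ML fit with log link
D = np.column_stack([np.ones(n), X, Z, X * Z])
negloglik = lambda b: np.sum(np.exp(D @ b) - Y * (D @ b))   # Poisson kernel
beta_hat = minimize(negloglik, np.zeros(4), method="BFGS").x

def avg_effect(b):
    """Average effect: mean of E(Y|X=1,Z=z_i) - E(Y|X=0,Z=z_i)."""
    return np.mean(np.exp(b[0] + b[1] + (b[2] + b[3]) * Z) - np.exp(b[0] + b[2] * Z))

ae_hat = avg_effect(beta_hat)

# Delta method: numerical Jacobian C of the scalar effect function c(beta)
eps = 1e-6
C = np.array([(avg_effect(beta_hat + eps * np.eye(4)[j]) - ae_hat) / eps
              for j in range(4)])
info = D.T @ (np.exp(D @ beta_hat)[:, None] * D)            # Poisson information
se_ae = np.sqrt(C @ np.linalg.inv(info) @ C)
p_one_sided = norm.sf(ae_hat / se_ae)                       # H_a: AE > 0
```

For a single effect of interest, this one-sided p-value is half the regular two-sided p-value, matching the special case described in Section 3.2.1.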

7. Discussion

In this paper, we reported on simulation studies examining the impact of the sample size and the number of constraints when performing informative hypothesis testing in the context of the generalized linear model. We focused on logistic and Poisson regression, as these are widely used. As a benchmark, we also included regular test statistics in the simulation studies. We considered various sample sizes as well as three different constraint matrices: one included a single constraint, whereas the other two included five constraints.

Our findings differ between logistic and Poisson regression. In logistic regression, most regular and informative test statistics were sensitive to the number of constraints and the sample size: in small samples, type I error rates were closer to the nominal level when the hypothesis included only one constraint than when it included five. In Poisson regression, this sensitivity to the number of constraints and the sample size was much less pronounced for both regular and informative test statistics.

The following general recommendations can be made for applied researchers who want to use informative hypothesis testing. First, the more computationally intensive test statistics tend to perform better in terms of type I error rates than the less computationally intensive ones. That is, the LRT, which requires fitting both the inequality-constrained and the equality-constrained model, performs better than the Score test, which requires fitting only the inequality-constrained model, which in turn performs better than the Wald test, which requires fitting only the unconstrained model.
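In practice, these statistics need not be implemented by hand. The sketch below shows how such tests might be requested from the restriktor package (Vanbrabant, Citation2020); the data are simulated, and the constraint syntax and the `iht()` arguments are our reading of the package interface, so they should be checked against the package documentation and the OSF scripts.

```r
# Illustrative sketch using the restriktor package (Vanbrabant, 2020);
# the data and the constraints are hypothetical.
library(restriktor)

set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(0.2 + 0.5 * x1 + 0.2 * x2))
fit <- glm(y ~ x1 + x2, family = binomial)

# Informative test of H0: beta1 = beta2 = 0 vs Ha: beta1 > 0, beta2 > 0
iht(fit, constraints = "x1 > 0; x2 > 0", test = "LRT")
iht(fit, constraints = "x1 > 0; x2 > 0", test = "score")
iht(fit, constraints = "x1 > 0; x2 > 0", test = "Wald")
```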

Specifically, in the context of logistic regression, applied researchers should use sample sizes that are larger than n = 50 when dealing with a single constraint and larger than n = 100 when dealing with multiple constraints. In the context of Poisson regression, applied researchers should use sample sizes that are larger than n = 25 when dealing with a single or with multiple constraints.

By means of using R1, R2 and R3 in our simulation studies, we only investigated hypotheses including inequality and equality constraints concerning regression coefficients. However, there are further options to formulate informative hypotheses (Hoijtink, Citation2012). For example, effect sizes can be included as follows:

(37) Ha: β1 > β2 + dσ,

where d is an effect size according to Cohen (Citation1988) and σ is the standard deviation. Furthermore, ‘about equality’ constraints can be used to test informative hypotheses such as

(38) Ha: |β1 − β2| < dσ,

which corresponds to

(39) Ha: β1 − β2 > −dσ, β1 − β2 < dσ.

Range constraints are a generalization of ‘about equality’ constraints. They can be used to test informative hypotheses such as

(40) Ha: β1 − β2 ≥ η1, β1 − β2 ≤ η2,

where the difference between β1 and β2 is supposed to lie in an interval with lower bound η1 and upper bound η2. Of course, combinations of different types of informative hypotheses are also possible.
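As a concrete illustration, the range constraint on β1 − β2 discussed above can be rewritten in the standard form Rβ ≥ r, with one row of the constraint matrix per inequality. The bounds and the candidate coefficient vector in this base-R sketch are hypothetical:

```r
# Encoding the range constraint eta1 <= beta1 - beta2 <= eta2
# as two rows of R %*% beta >= rhs (all values are illustrative).
eta1 <- 0.1
eta2 <- 0.5
R   <- rbind(c( 1, -1),    # beta1 - beta2 >=  eta1
             c(-1,  1))    # beta2 - beta1 >= -eta2
rhs <- c(eta1, -eta2)

beta <- c(0.8, 0.5)        # hypothetical coefficients: difference is 0.3
all(R %*% beta >= rhs)     # TRUE: 0.3 lies in [0.1, 0.5]
```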

Even though these hypotheses are formulated in a different way compared to the ones assessed in our paper, we expect that they should lead to similar results concerning type I error rates in future simulation studies. More specifically, we expect that the number of constraints as well as the sample size will remain the decisive factors. This is because in the end, all these different options to formulate informative hypotheses are based on equality and inequality constraints in their null and alternative hypotheses. Still, this needs to be confirmed in future simulation studies.

When designing these future simulation studies, it should be noted that constraint matrices in informative hypothesis testing must be of full row rank, that is, their rows must be linearly independent, which implies no more rows than columns. This ensures that the product RĨ⁻¹R′ in the Wald statistic can be inverted (see Equation 16). In contrast, a rank-deficient R would lead to a rank-deficient product, which could not be inverted.
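This full-row-rank requirement is easy to verify numerically. In the base-R sketch below, the constraint matrix and the stand-in for the inverse information matrix are illustrative:

```r
# Full row rank check for a constraint matrix R (2 constraints, 3 parameters)
R <- rbind(c(1, -1,  0),
           c(0,  1, -1))
stopifnot(qr(R)$rank == nrow(R))        # R has full row rank

I_inv <- diag(3)                        # illustrative stand-in for the inverse information matrix
M <- R %*% I_inv %*% t(R)
M_inv <- solve(M)                       # invertible, as required by the Wald statistic

# A rank-deficient R (here, a duplicated row) yields a singular product:
R_bad <- rbind(c(1, -1, 0),
               c(1, -1, 0))
qr(R_bad %*% I_inv %*% t(R_bad))$rank   # 1 < 2, so solve() would fail here
```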

Furthermore, future research should repeat the simulation studies of this paper for other members of the family of generalized linear models. In this paper, we refrain from drawing conclusions about other generalized linear models, such as gamma regression, that were not considered in our simulation studies.

Another interesting future research endeavour would be to assess the impact of different numbers of predictors on type I error rates. It might be that a higher number of constraints is less problematic for a smaller number of predictors compared to a larger number of predictors. This could be, for example, three constraints on three parameters compared to three constraints on thirty parameters.

As mentioned before, informative hypothesis testing provides the researcher with greater power compared to regular null hypothesis testing (see, e.g., Vanbrabant et al., Citation2015). Exactly how much power is gained by using informative hypothesis testing in the generalized linear model, compared to regular null hypothesis testing, has to be examined in future simulation studies. For these studies, our paper has provided some groundwork to refer to.

The simulation studies of this paper are complemented by an empirical data example using the ‘DoctorVisits’ data set. Furthermore, we discussed informative hypothesis testing with non-linear constraints, in which case generalizations of the Wald statistic and the D-statistic have to be used. We demonstrated this by means of a second empirical data example using data from the ACTIVE study, where we showed how informative hypotheses about effects of interest can be tested.

We have provided extensive R code on this paper's OSF project site, which can be adapted by researchers. We hope that the paper, together with the R code, makes it easier for interested readers to use this technology in the future.

Supplemental material


Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The data set ‘DoctorVisits’ that has been used in the first empirical data example can be obtained via the R package AER. The data set of the ACTIVE study that has been used in the second empirical data example can be downloaded from https://doi.org/10.3886/E128941. The data sets that were used in the simulation studies are available upon request from the corresponding author. Furthermore, the scripts ‘A-orderingExampleLogistic.R’ and ‘B-orderingExamplePoisson.R’ on the OSF project site (https://osf.io/6svrm/?view_only=e3ee3ecd9cb442b1bf0eb175f319a815) show the data generating mechanism.

Correction Statement

This article has been corrected with minor changes. These changes do not impact the academic content of the article.

Additional information

Funding

This work has been supported by the Research Foundation Flanders (FWO, grant G002819N to Yves Rosseel and Axel Mayer).

References

  • Agresti, A. (2003). Categorical data analysis. John Wiley & Sons.
  • Agresti, A. (2018). An introduction to categorical data analysis. John Wiley & Sons.
  • Ball, K., Berch, D. B., Helmers, K. F., Jobe, J. B., Leveck, M. D., Marsiske, M., …S. L. Willis (2002). Effects of cognitive training interventions with older adults. Journal of the American Medical Association, 288(18), 2271. doi:10.1001/jama.288.18.2271
  • Buse, A. (1982). The likelihood ratio, Wald and Lagrange multiplier tests: an expository note. The American Statistician, 36(3), 153–157. doi:10.2307/2683166
  • Cameron, A. C. (1986). Econometric models based on count data: Comparisons and applications of some estimators and tests. Journal of Applied Econometrics, 1(1), 29–53. doi:10.1002/jae.3950010104
  • Cameron, A. C., & Trivedi, P. K. (1998). Regression analysis of count data. Cambridge University Press.
  • CDC. (2020). Leading health indicators.
  • Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Erlbaum.
  • Hoijtink, H. (2012). Informative hypotheses: Theory and practice for behavioral and social scientists. Chapman & Hall/CRC.
  • Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression. John Wiley & Sons.
  • Jobe, J. B., Smith, D. M., Ball, K., Tennstedt, S. L., Marsiske, M., Willis, S. L., …Kleinman, K. (2001). ACTIVE. Controlled Clinical Trials, 22(4), 453–479. doi:10.1016/S0197-2456(01)00139-8
  • Keck, C., Mayer, A., & Rosseel, Y. (2021). Integrating informative hypotheses into the EffectLiteR framework. Methodology, 17(4), 307–325. doi:10.5964/meth.7379
  • Keck, C., Mayer, A., & Rosseel, Y. (2022). Overview and evaluation of various frequentist test statistics using constrained statistical inference in the context of linear regression. Frontiers in Psychology. doi:10.3389/fpsyg.2022.899165
  • Kiefer, C., & Mayer, A. (2021a). Accounting for latent covariates in average effects from count regressions. Multivariate Behavioral Research, 56(4), 579–594. doi:10.1080/00273171.2020.1751027
  • Kiefer, C., & Mayer, A. (2021b). Treatment effects on count outcomes with non-normal covariates. British Journal of Mathematical and Statistical Psychology, 74(3), 513–540. doi:10.1111/bmsp.12237
  • Kleiber, C., & Zeileis, A. (2022). AER: Applied econometrics with R. R package version 1.2-10.
  • Lehmann, E. (1986). Testing statistical hypotheses. Springer.
  • Mayer, A., & Dietzfelbinger, L. (2019). EffectLiteR: Average and conditional effects. R package version 0.4-4.
  • Mayer, A., Dietzfelbinger, L., Rosseel, Y., & Steyer, R. (2016). The EffectLiteR approach for analyzing average and conditional effects. Multivariate Behavioral Research, 51(2–3), 374–391. doi:10.1080/00273171.2016.1151334
  • McCullagh, P., & Nelder, J. A. (1989). Generalized linear models. Chapman & Hall.
  • Mullahy, J. (1997). Heterogeneity, excess zeros and the structure of count data models. Journal of Applied Econometrics, 12, 337–350. doi:10.1002/(SICI)1099-1255(199705)12:33.0.CO;2-G
  • Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society, Series A, 135(3), 370–384. doi:10.2307/2344614
  • Nocedal, J., & Wright, S. J. (1999). Numerical optimization. Springer.
  • R Core Team (2020). R: A language and environment for statistical computing. R foundation for statistical computing.
  • Rosseel, Y. (2012). Lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36. doi:10.18637/jss.v048.i02
  • Silvapulle, M. J., & Sen, P. K. (2005). Constrained statistical inference: Order, inequality, and shape restrictions. Wiley.
  • Tennstedt, S., Morris, J., Unverzagt, F., Rebok, G., Willis, S., Ball, K., & Marsiske, M. (2005). ACTIVE (Advanced Cognitive Training for Independent and Vital Elderly), United States, 1999–2001. Interuniversity Consortium for Political and Social Research.
  • Vanbrabant, L. (2020). Restriktor: Constrained statistical inference. R package version 0.2-800.
  • Vanbrabant, L., Van de Schoot, R., & Rosseel, Y. (2015). Constrained statistical inference: Sample-size tables for ANOVA and regression. Frontiers in Psychology, 5, 1–8. doi:10.3389/fpsyg.2014.01565
  • Van de Schoot, R., & Miočević, M., (Ed.). (2020). Small sample size solutions: A guide for applied researchers and practitioners. Routledge.