Original Articles

Testing Inference from Logistic Regression Models in Data with Unobserved Heterogeneity at Cluster Levels

Pages 1202-1211 | Received 19 Nov 2008, Accepted 18 Feb 2009, Published online: 09 Apr 2009

Abstract

Clustering due to unobserved heterogeneity may seriously affect inference from binary regression models. We examined the performance of the logistic and the logistic-normal models for data with such clustering. For the logistic model, the total variance of unobserved heterogeneity, rather than the level of clustering, determines the size of the bias of the maximum likelihood (ML) estimator. Incorrectly specifying the clustering as level 2 in the logistic-normal model gives biased estimates of the structural and random parameters, whereas specifying level 1 gives unbiased estimates of the former and adequate estimates of the latter. The proposed procedure is relevant to many research areas.

Mathematics Subject Classification:

1. Introduction

Inferring causal effects from repeated measures data is highly relevant to a number of research areas, including economics, the social sciences, and epidemiology. Grouping of observations, or clustering, due to repeated measures, geographical location, personal characteristics, or other factors is often used in study design. In multistage surveys, for example, we may select geographical areas at the first stage, general practices (GPs) at the second stage, and patients at the third stage. If repeated observations were taken for patients, we would expect correlation at the patient level (level 1), at the GP level (level 2), and at the geographical area level (level 3). Data resulting from such designs do not comply with the independent and identically distributed (IID) assumption on which standard statistical methods are based. The inappropriate use of standard methods for such data may lead to biased estimates, loss of precision, and misleading hypothesis tests, as discussed in Skinner et al. (1989), where procedures for correcting the estimates were developed. Multilevel models are well suited to data of this type (Goldstein et al., 2000; Rasbash et al., 2005; Steele, 2008; Yang et al., 2000) and are useful for understanding the causal as well as the hierarchical relationships.

Simple random sampling (SRS) may be used to obtain data that follow the IID assumption required by standard statistical procedures, and it is often seen as advantageous because of its simplicity. In practice, however, an SRS design does not necessarily guarantee the IID assumption: correlation may enter the data at any stage, possibly entirely because of unobserved heterogeneity, so the simplicity of SRS has to be carefully balanced against its disadvantages. In social, economic, or biological investigations, for example, there are factors (exogenous or endogenous, independent of a process or part of it, time varying or time invariant, personal or contextual) that may be unobserved at the individual or cluster level. There is much interest in the problem of unobserved heterogeneity in econometrics (Allenby and Lenk, 1994; Allenby and Rossi, 1999; Arana and Leon, 2006; Aprahamian et al., 2007; Cramer, 2007; Garcia and Hernandez, 2007). Awareness in the public health sciences is considerably lower (Zohoori and Savitz, 1997), or perhaps the problem has been addressed differently and viewed in terms of confounders, with methods in epidemiology and biostatistics oriented toward adjustment for observed confounders (Hogan and Lancaster, 2004). Even in the richest model specification, however, it is not possible to adjust for all determinants of the outcome, as several factors will be unobserved, immeasurable, or unknown (Lee and Lee, 2003; Zohoori and Savitz, 1997). If it is not accounted for, unobserved heterogeneity can lead to spurious causal inference when standard binary regression models are used, whether the heterogeneity has produced a complex data structure or is linked to individuals (Heckman, 2008; Peters, 2004).

The standard logistic model is one of the most popular models for the analysis of binary response data where the IID assumption holds, and for clustered data mixed effects logistic models come as a natural choice (Kahn and Raftery, 1996). While much attention has been given to the choice between alternative mixed effects models and the relevant estimation procedures (Arana and Leon, 2005), a systematic investigation of the performance of any of these models under a variety of conditions is remarkably lacking. A basic characteristic of mixed effects models is the inclusion of random subject effects in the regression model to account for correlation, whether it is due to a clustering factor, such as litter, household, or genotype, or to repeated measurements. Several reviews have discussed and compared some of these models and their estimation procedures, for example Donald and Gibbons (2006).

In this study we introduce the use of the (random effect) logistic-normal model, which is designed to deal with repeated measurements or clustering at one level, for data with two levels of clustering that may occur purely because of unobserved heterogeneity. This approach is motivated by our theoretical and empirical findings that the leading term of the bias of the ML estimate of the logistic model due to unobserved heterogeneity is the same whether the observations are correlated or not (Ayis, 1995). The method may have wide application because it is simple and readily available in many standard statistical software packages.

2. Methods

2.1. Data Simulation

Data were simulated from a logistic model with a response y, an explanatory variable x, and an extra term representing unobserved heterogeneity. A Fortran program was written to generate data from the model with one, and then two, levels of unobserved heterogeneity. For one level the model is:

$$\mathrm{logit}\, P(y_{jk} = 1) = \alpha + \beta x_{jk} + \epsilon_j$$

where $x_{jk}$ is a binary variable for subjects j = 1, 2, …, n and repeated observations k = 1, …, K, and $\epsilon_j$ remains the same for all replications within a cluster. If the explanatory variable is assumed to be constant over time, $x_{jk} = x_j$. The situation with two levels of unobserved heterogeneity may be represented by:

$$\mathrm{logit}\, P(y_{jk} = 1) = \alpha + \beta x_{jk} + \epsilon_{j1} + \epsilon_{j2}$$

where α and β are the logistic regression parameters and $\epsilon_{j1}$ and $\epsilon_{j2}$ are extra terms representing unobserved heterogeneity at cluster levels 1 and 2. The two terms were assumed to be normally distributed random variables with zero mean and variances $\sigma_1^2$ and $\sigma_2^2$, respectively, and at each level the term was kept constant within a cluster. Several parameter values were considered so that they covered a range of probabilities, including situations where one of the probabilities was 0.5, where the two probabilities lay on one side of 0.5, and where they lay further apart on the two sides of 0.5. We also considered a range of values for the variance of the unobserved heterogeneity term. The covariate x was allowed to be time invariant as well as time varying. We examined estimation based on: (1) the standard logistic model, that is, assuming the data are IID; (2) the logistic-normal model with level 2 specified as the cluster level; and (3) the logistic-normal model with level 1 specified as the cluster level. Estimates were calculated from 200 simulations, each based on a sample of around 784 observations; this number was chosen to maintain an equal sample size for all designs considered.
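The data-generating process described above can be sketched in a few lines of Python. This is only an illustration, not the Fortran program used in the study. The nesting is an assumption: each level-2 cluster is taken to hold J level-1 units, each observed K times. The function name `simulate`, its arguments, and the choice of 32 level-2 clusters are hypothetical.

```python
import math
import random


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def simulate(alpha, beta, var1, var2, n_level2, J, K, seed=0):
    """Return rows (level2_id, level1_id, x, y).

    Assumed nesting (hypothetical): each of n_level2 level-2 clusters holds
    J level-1 units, and each level-1 unit is observed K times.
    """
    rng = random.Random(seed)
    rows = []
    unit = 0
    for g in range(n_level2):
        eps2 = rng.gauss(0.0, math.sqrt(var2))  # shared by the level-2 cluster
        for _ in range(J):
            eps1 = rng.gauss(0.0, math.sqrt(var1))  # shared by the level-1 unit
            x = rng.randint(0, 1)  # time-invariant binary covariate
            p = expit(alpha + beta * x + eps1 + eps2)
            for _ in range(K):
                y = 1 if rng.random() < p else 0
                rows.append((g, unit, x, y))
            unit += 1
    return rows


# One simulated data set with the heterogeneity split (0.6, 0.4) over the levels.
rows = simulate(alpha=-1.0, beta=2.0, var1=0.6, var2=0.4, n_level2=32, J=12, K=2)
print(len(rows))
```

Setting `var2=0.0` places all of the heterogeneity at level 1, and `var1=0.0` places it all at level 2, which mirrors the range of designs compared in the study.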

3. Results

We examined estimates of the structural parameter from the standard logistic model, and of the structural and random parameters from the logistic-normal model; the results are presented in Table 1. The true parameters were (α, β) = (−1, 2), that is, (p₀, p₁) = (0.27, 0.73), and the replications were K = 2 at level 1 and J = 12 at level 2. For the first set, clustering due to unobserved heterogeneity was at level 1 only, with a total variance of 1.0. The estimate of β from the standard logistic model was biased, as expected; the bias was a function of the variance of the unobserved heterogeneity and the difference between the two probabilities, and it may be estimated using a formula we derived in an earlier study (Ayis, 1995). Estimates of β and σ from the logistic-normal model with level 2 incorrectly specified as the cluster level were similarly biased: both the structural and the random parameter were underestimated, and the estimate of β was very similar to that obtained from the standard logistic model. When level 1 was correctly specified as the cluster level, the logistic-normal model performed well, as expected, for both the estimates and their standard errors, with rejection rates for both within the expected 5% range. For subsequent sets, where the unobserved heterogeneity was divided between the two levels, for example with true variances $(\sigma_1^2, \sigma_2^2)$ = (0.6, 0.4), estimation assuming level 2 as the cluster level improved, specifically for the structural parameter β. Further improvement was seen for both the structural and the random parameter as the variance of the unobserved heterogeneity at level 2 increased. Estimates from the logistic-normal model with level 1 specified as the cluster level, on the other hand, continued to be fairly good, even when the clustering due to unobserved heterogeneity was entirely at level 2.
Estimates based on the standard logistic model continued to be biased, and the size of the bias remained the same regardless of how the clustering due to unobserved heterogeneity was divided between the two cluster levels.
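A short numerical check may help to show why the bias of the standard logistic model depends only on the total variance. With a single binary covariate that is independent of the heterogeneity terms, the ML estimate of β from the standard logistic model is the difference in empirical log-odds between the two covariate groups, so in large samples it approaches the marginal log-odds ratio. Because the two normal terms add, only their total variance enters that limit. The sketch below is an illustration under those assumptions, not the bias formula of Ayis (1995); the function names and the grid integration are hypothetical choices.

```python
import math


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    return math.log(p / (1.0 - p))


def marginal_prob(eta, total_var, n_grid=2001, width=8.0):
    """E[expit(eta + eps)] for eps ~ N(0, total_var), by a normalised grid sum."""
    if total_var == 0.0:
        return expit(eta)
    sd = math.sqrt(total_var)
    num = den = 0.0
    for i in range(n_grid):
        u = -width + 2.0 * width * i / (n_grid - 1)  # standard-normal scale
        w = math.exp(-0.5 * u * u)
        num += w * expit(eta + sd * u)
        den += w
    return num / den


def marginal_log_odds_ratio(alpha, beta, total_var):
    """Large-sample limit of the standard logistic estimate of beta (binary x)."""
    p0 = marginal_prob(alpha, total_var)
    p1 = marginal_prob(alpha + beta, total_var)
    return logit(p1) - logit(p0)


# Probabilities implied by (alpha, beta) = (-1, 2) when the extra term is zero.
p0, p1 = expit(-1.0), expit(-1.0 + 2.0)
print(round(p0, 2), round(p1, 2))

# The limit is below beta = 2 and is identical for any split of a total variance of 1.
limit = marginal_log_odds_ratio(-1.0, 2.0, total_var=0.6 + 0.4)
print(limit)
```

The first line printed reproduces (p₀, p₁) = (0.27, 0.73). The second value is smaller than the true β = 2 and would be unchanged for any other split of the same total variance between the two levels.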

Table 1 Estimation of β using the logistic model, and of β and σ using the logistic-normal model, for two specifications of the cluster level (level 1 and level 2) under unobserved heterogeneity; (p₀, p₁) = (0.27, 0.73), n = 364 individuals, K = 2 (level 1), and J = 12 (level 2)

In Table 2, the true parameters and variance components considered in Table 1 were examined again, but with different replications at levels 1 and 2: K = 12 replications at level 1 and J = 16 replications at level 2. Estimates from the standard logistic model were similar to those obtained for Table 1. A similar pattern was also found for the logistic-normal model. In summary, estimates of the structural as well as the random parameter were good if the cluster level was specified as level 1, even if the clustering was in fact at level 2. If the cluster level was specified as level 2, estimation was good only when the clustering was fully or mostly at level 2, that is, when the clustering level was reasonably correctly defined.

Table 2 Estimation of β using the logistic model, and of β and σ using the logistic-normal model, for two specifications of the cluster level (level 1 and level 2) under unobserved heterogeneity; (p₀, p₁) = (0.27, 0.73), n = 364 individuals, K = 12 (level 1), and J = 16 (level 2)

In Table 3, results are based on a different structure of probabilities: the true parameters were (α, β) = (−2, 1.0), that is, (p₀, p₁) = (0.12, 0.27), and the replications were K = 2 at level 1 and J = 12 at level 2. Estimation of the structural and random parameters was again fairly good if level 1 was specified as the cluster level, whether clustering was truly at level 1 or level 2. If level 2 was specified as the cluster level, estimation was good only if that was the correct specification, and it became poorer as clustering at level 1 increased. Estimates from the standard logistic model were, as before, biased regardless of how the unobserved heterogeneity was split between the two levels; the bias was, however, relatively small because the difference between the two probabilities was small.

Table 3 Estimation of β using the logistic model, and of β and σ using the logistic-normal model, with correct and incorrect specification of the cluster level for unobserved heterogeneity at levels 1 and 2; (p₀, p₁) = (0.12, 0.27), n = 364 individuals, K = 2 (level 1), and J = 12 (level 2)

Other combinations of probabilities and a range of variances for the unobserved heterogeneity were examined but are not presented here for reasons of space. In general, in all situations examined, estimates from the standard logistic model were biased, and the bias increased with the total variance of the unobserved heterogeneity, with the difference between the two probabilities, or with both. If level 2 was the major clustering level, good estimates were obtained from the logistic-normal model when level 2 was correctly assumed to be the clustering level. However, if level 2 was assumed to be the clustering level while there was clustering at both levels, or clustering was at level 1 rather than level 2, the estimates of the structural parameter as well as those of the random parameter were biased, and the bias increased as the variance of the unobserved heterogeneity at level 1 increased. On the other hand, specifying level 1 as the clustering level appears to give correct estimates of the structural parameter in all situations, even if clustering was only at level 2. Estimates of the random parameter were also fairly good, with a slight underestimation as the variance of the unobserved heterogeneity at level 2 increased. The rejection rate for these estimates was around 10% where the unobserved heterogeneity term had equal variance at the two levels, and around 20% where clustering was only at level 2. Further details may be found in Ayis (1995).

4. Discussion

We introduced the use of a simple approach, the logistic-normal model, which is designed to take account of one level of clustering, for data with two levels of clustering due to unobserved heterogeneity. We examined estimation from the model in situations where clustering was mostly at level 1, mostly at level 2, and at the two levels simultaneously. The approach was motivated by our earlier theoretical and empirical findings that the leading term of the bias for the standard logistic model, which is based on the IID assumption, is the same whether the observations are correlated or not. We conjecture that single-level methods of adjusting for unobserved heterogeneity may be satisfactory when there are two or more levels.
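To make concrete what specifying the cluster level means for the logistic-normal model, the following sketch evaluates its marginal log-likelihood with one normally distributed random intercept per cluster, integrating the intercept out numerically. It is a minimal illustration and not the software used in the study: the function names, the row layout, and the simple grid integration (in place of, say, Gauss-Hermite quadrature) are all assumptions, and the toy rows are made-up values.

```python
import math
from collections import defaultdict


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_grid(n_grid=161, width=8.0):
    """Nodes and normalised weights approximating a standard normal."""
    nodes = [-width + 2.0 * width * i / (n_grid - 1) for i in range(n_grid)]
    raw = [math.exp(-0.5 * u * u) for u in nodes]
    total = sum(raw)
    return nodes, [w / total for w in raw]


def logistic_normal_loglik(alpha, beta, sigma, rows, cluster_index):
    """Marginal log-likelihood with one normal random intercept per cluster.

    rows: tuples whose last two fields are (x, y);
    cluster_index: position in each row of the id used as the cluster level.
    """
    clusters = defaultdict(list)
    for row in rows:
        clusters[row[cluster_index]].append((row[-2], row[-1]))
    nodes, weights = make_grid()
    total = 0.0
    for obs in clusters.values():
        lik = 0.0
        for u, w in zip(nodes, weights):
            prod = 1.0  # conditional likelihood of the cluster given u
            for x, y in obs:
                p = expit(alpha + beta * x + sigma * u)
                prod *= p if y == 1 else 1.0 - p
            lik += w * prod
        total += math.log(lik)
    return total


# Toy rows: (level2_id, level1_id, x, y); values are illustrative only.
rows = [(0, 0, 0, 0), (0, 0, 0, 1), (0, 1, 1, 1), (0, 1, 1, 1),
        (1, 2, 0, 0), (1, 2, 0, 0), (1, 3, 1, 0), (1, 3, 1, 1)]
ll_level1 = logistic_normal_loglik(-1.0, 2.0, 1.0, rows, cluster_index=1)
ll_level2 = logistic_normal_loglik(-1.0, 2.0, 1.0, rows, cluster_index=0)
print(ll_level1, ll_level2)
```

Changing `cluster_index` switches between treating level 1 or level 2 as the cluster level while leaving the data untouched. With `sigma=0.0` the function reduces to the ordinary logistic log-likelihood, whichever level is chosen.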

The standard logistic model, and then the random effect logistic-normal model, were fitted to simulated data for a range of probabilities, for time-varying and time-invariant covariates, for a range of variances of unobserved heterogeneity, and for different combinations of variances of unobserved heterogeneity at levels 1 and 2. The main objective in fitting the standard logistic model was to test the effect of clustering at two levels on estimation. In all situations investigated, regardless of the level of clustering due to unobserved heterogeneity, the estimates were biased. As the two random error terms are additive, the bias was a function of $\sigma^2$ (the total variance of the error term, $\sigma^2 = \sigma_1^2 + \sigma_2^2$) and β, as expected. The approach proposed for correction, using the logistic-normal model, was found to perform well in all situations examined when the first level of correlation was specified a priori, suggesting that level 1 is the more important level for the fitting procedure to recognise when there is more than one level of clustering.

Complex data structures in epidemiology, economics, or clinical research may arise for several reasons, suggesting that our proposed approach has good potential for application because of its simplicity. In genetic epidemiology, for example, causal inference when data exhibit unobserved heterogeneity has been a cause of concern, and attention has increased in recent years (Fewell et al., 2007; Palmer et al., 2008). The method of instrumental variables (IV) is used in econometrics as a standard method for estimating linear and nonlinear models in which the error term may be correlated with an observed covariate as a result of unobserved heterogeneity (Hogan and Lancaster, 2004). It is gaining popularity among genetic epidemiologists, who argue that it is possible to replace an exposure that is likely to be confounded by unmeasured or unknown behavioural and socio-economic factors with a genotype that is believed, for good biological reasons, to be strongly associated with the exposure of interest and with the outcome of interest, utilizing the principle of “Mendelian randomization” (Lawlor et al., 2008; Palmer et al., 2008; Sheehan et al., 2008). According to Mendelian randomization, the random assignment of an individual's genotype from the parental genotypes occurs before birth and is therefore unlikely to be correlated with confounders, making genotype a suitable candidate to serve as an IV. The method, however, faces many challenges that await further research and exploration. Some of these were described in Lawlor et al. (2008), including the availability of genotype data, although the authors estimated that such data would become available in the coming 5–10 years.

Using the logistic-normal approach may be worthwhile when researchers are aware of, or in doubt about, the presence of clustering at one or more levels due to unobserved heterogeneity. Of particular relevance may be the analysis of panel data with genotype information, where the approach may be used to improve causal inference in situations where the IV approach is considered inappropriate because of a weak association with the exposure of interest, and where genotype is thought (but possibly not confirmed) to be associated with the outcome. In such situations the approach will rid the data of the effect of confounding at cluster level 1 or higher, if there is any. Further tests are, however, important. For example, the distribution of the genotype needs to be assessed, and the use of other mixed effects specifications also needs to be considered, depending on the shape of the genotype data. A Monte Carlo simulation was used in Arana and Leon (2005) to test the performance of a Bayesian mixture of normal distributions (a semiparametric model) in comparison with other parametric models (including the logit) and nonparametric models, using alternative assumptions for the error distribution, when data exhibit unobserved heterogeneity. They found that the mean square error (MSE) was considerably large in all models, reflecting the difficulty of modelling this type of data; the Bayesian specification, however, performed better than the competing models.

Acknowledgments

I am grateful to Professor D. Holt for the advice and suggestions he gave on the original study. I am very grateful to Dr. Marie South and Dr. Peter Egger for considerable help with the Fortran programs used in the simulation.

Notes

Note: Reject %: the percentage of simulations in which the true parameter lay outside the 95% confidence interval of the simulated estimator. Estimates are based on 200 simulations. J = 12 (replication at level 2), and K = 2 (replication at level 1), n = 768 individuals, for (α, β) = (−1, 2).
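The Reject % defined in the note can be computed as in the sketch below. It assumes Wald-type intervals of the form estimate ± 1.96 × standard error, which the note does not state explicitly; the function name and the example numbers are hypothetical and are not results from the study.

```python
def reject_percent(estimates, std_errors, true_value, z=1.96):
    """Percentage of simulations whose 95% Wald interval misses true_value."""
    misses = 0
    for est, se in zip(estimates, std_errors):
        lower, upper = est - z * se, est + z * se
        if not (lower <= true_value <= upper):
            misses += 1
    return 100.0 * misses / len(estimates)


# Illustrative numbers only, not results from the study.
estimates = [2.1, 1.6, 2.4, 1.9]
std_errors = [0.2, 0.15, 0.3, 0.1]
print(reject_percent(estimates, std_errors, true_value=2.0))  # prints 25.0
```

For a well-calibrated estimator and standard error, this percentage should be close to 5 over a large number of simulations.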

References

  • Allenby, G. M., Lenk, P. J. (1994). Modeling household purchase behavior with logistic normal regression. Journal of the American Statistical Association 89:1218–1231.
  • Allenby, G. M., Rossi, P. E. (1999). Marketing models of consumer heterogeneity. Journal of Econometrics 89:57–78.
  • Aprahamian, F., Chanel, O., Luchini, S. (2007). Modeling starting point bias as unobserved heterogeneity in contingent valuation surveys: An application to air pollution. American Journal of Agricultural Economics 89:533–547.
  • Arana, J. E., Leon, C. J. (2005). Flexible mixture distribution modeling of dichotomous choice contingent valuation with heterogeneity. Journal of Environmental Economics and Management 50:170–188.
  • Arana, J. E., Leon, C. J. (2006). Modelling unobserved heterogeneity in contingent valuation of health risks. Applied Economics 38:2315–2325.
  • Ayis, S. A. M. (1995). Modelling Unobserved Heterogeneity: Theoretical and Practical Aspects. Southampton, UK: University of Southampton.
  • Cramer, J. S. (2007). Robustness of logit analysis: Unobserved heterogeneity and mis-specified disturbances. Oxford Bulletin of Economics and Statistics 69:545–555.
  • Donald, H., Gibbons, R. D. (2006). Longitudinal Data Analysis. Hoboken, NJ: John Wiley & Sons, Inc.
  • Fewell, Z., Davey, S. G., Sterne, J. A. (2007). The impact of residual and unmeasured confounding in epidemiologic studies: A simulation study. American Journal of Epidemiology 166:646–655.
  • Garcia, J. A. B., Hernandez, J. E. R. (2007). Housing and urban location decisions in Spain: An econometric analysis with unobserved heterogeneity. Urban Studies 44:1657–1676.
  • Goldstein, H., Rasbash, J., Browne, W., Woodhouse, G., Poulain, M. (2000). Multilevel models in the study of dynamic household structures. European Journal of Population-Revue Europeenne de Demographie 16:373–387.
  • Heckman, J. J. (2008). Econometric causality. International Statistical Review 76:1–27.
  • Hogan, J. W., Lancaster, T. (2004). Instrumental variables and inverse probability weighting for causal inference from longitudinal observational studies. Statistical Methods in Medical Research 13:17–48.
  • Kahn, M. J., Raftery, A. E. (1996). Discharge rates of Medicare stroke patients to skilled nursing facilities: Bayesian logistic regression with unobserved heterogeneity. Journal of the American Statistical Association 91:29–41.
  • Lawlor, D. A., Harbord, R. M., Sterne, J. A. C., Timpson, N., Smith, G. D. (2008). Mendelian randomization: Using genes as instruments for making causal inferences in epidemiology. Statistics in Medicine 27:1133–1163.
  • Lee, S., Lee, S. (2003). Testing heterogeneity for frailty distribution in shared frailty model. Communications in Statistics-Theory and Methods 32:2245–2253.
  • Palmer, T. M., Thompson, J. R., Tobin, M. D., Sheehan, N. A., Burton, P. R. (2008). Adjusting for bias and unmeasured confounding in Mendelian randomization studies with binary responses. International Journal of Epidemiology 37:1161–1168.
  • Peters , B. L. (2004). Is there a wage bonus from drinking? Unobserved heterogeneity examined. Applied Economics 36:2299–2315.
  • Rasbash, J., Browne, W., Healy, M., Cameron, B., Charlton, C. (2005). MLwiN (Multilevel Models Project) [Computer software].
  • Sheehan, N. A., Didelez, V., Burton, P. R., Tobin, M. D. (2008). Mendelian randomisation and causal inference in observational epidemiology. PLoS Medicine 5:1205–1210.
  • Skinner, C. J., Holt, D., Smith, T. F. M. (1989). Analysis of Complex Surveys. Chichester: John Wiley & Sons, Ltd.
  • Steele, F. (2008). Multilevel models for longitudinal data. Journal of the Royal Statistical Society Series A-Statistics in Society 171:5–19.
  • Yang, M., Goldstein, H., Heath, A. (2000). Multilevel models for repeated binary outcomes: Attitudes and voting over the electoral cycle. Journal of the Royal Statistical Society Series A-Statistics in Society 163:49–62.
  • Zohoori, N., Savitz, D. A. (1997). Econometric approaches to epidemiologic data: Relating endogeneity and unobserved heterogeneity to confounding. Annals of Epidemiology 7:251–257.