Using Robust Standard Errors for the Analysis of Binary Outcomes with a Small Number of Clusters

Journal of Research on Educational Effectiveness, Vol. 16, No. 2

Abstract

Binary outcomes are often analyzed in cluster randomized trials (CRTs) using logistic regression, and cluster robust standard errors (CRSEs) are routinely used to account for the dependent nature of nested data in such models. However, CRSEs can be problematic when the number of clusters is low (e.g., < 50), and a low number of clusters is quite common in CRTs. We investigate the use of the CR2 CRSE and an empirical degrees of freedom adjustment (dof_BM) proposed by Bell and McCaffrey in a simulation using binary outcomes and illustrate their use with an applied example. Findings show that the CR2 standard errors (with dof_BM) are relatively unbiased, with coverage and power rates for group-level predictors comparable to those of a multilevel logistic regression model, and can be used even with as few as 10 clusters. To promote their use, a free graphical SPSS extension is provided that can fit logistic (and linear) regression models with a variety of CRSEs and dof adjustments.
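
The dof_BM adjustment is a Satterthwaite-type approximation. As a minimal sketch, assume that the nonnegative eigenvalues $\lambda_1, \dots, \lambda_G$ entering the Bell and McCaffrey approximation for a single coefficient have already been computed (that matrix step is omitted here); the dof is then $(\sum_g \lambda_g)^2 / \sum_g \lambda_g^2$. The function name bm_dof and the example eigenvalues are hypothetical.

```python
def bm_dof(eigenvalues):
    """Satterthwaite-type dof: (sum of eigenvalues)^2 / sum of squared eigenvalues.

    `eigenvalues` is assumed to hold the nonnegative eigenvalues for one
    coefficient's CR2 variance estimate (computing them is not shown here).
    """
    lam = [float(v) for v in eigenvalues]
    total = sum(lam)
    total_sq = sum(v * v for v in lam)
    if total_sq == 0.0:
        raise ValueError("at least one eigenvalue must be nonzero")
    return total * total / total_sq


# Equal eigenvalues across 10 clusters: the dof equals the number of clusters.
print(bm_dof([1.0] * 10))  # 10.0
# One dominant cluster out of 4: the dof drops to about 2.29.
print(bm_dof([5.0, 1.0, 1.0, 1.0]))
```

Because at most $G$ eigenvalues are nonzero, this quantity cannot exceed the number of clusters, and it falls below $G$ as the eigenvalues become more unequal (for example, when clusters differ in size or leverage).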


Open Research Statements

Study and Analysis Plan Registration

There is no study and analysis plan registration associated with this manuscript.

Data, Code, and Materials Transparency

The data set is available in the CR2 package for R (on CRAN): https://cran.r-project.org/package=CR2. All code for the simulation and applied analysis is available from the first author’s GitHub repository: https://github.com/flh3/CR2/JREE2022.

Design and Analysis Reporting Guidelines

This manuscript was not required to disclose use of reporting guidelines because it was initially submitted prior to JREE mandating open research statements in April 2022.

Transparency Declaration

The lead author (the manuscript’s guarantor) affirms that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned (and, if relevant, registered) have been explained.

Replication Statement

This manuscript reports an original study.

Notes

1 In certain disciplines such as medicine, binary outcomes (e.g., patient recovered or not) are the most common type of response variable used (Pedroza & Truong, 2017, p. 9). In a review of 80 studies with a focus on health and well-being, 78% of studies used noncontinuous outcomes (Dron et al., 2021).

2 Probit models have also been used but are uncommon in education and psychology. In addition, logistic regression results are easier to interpret, and the two models yield nearly the same results (Long & Freese, 2006).

3 The GEE approach is popular in biostatistics and is an extension of the GLM developed specifically to account for violations of the independence of observations (Ghisletta & Spini, 2004). With GEEs, regression coefficients are first estimated by solving a set of estimating equations, and CRSEs are then applied to account for the clustered nature of the data. For a primer on GEEs in an educational setting, see Huang (2022).

4 The low number of clusters is, however, similar to that of CRTs in the health sciences, where the median number of clusters (across 285 CRTs) was 21 (range = 2–605) (Ivers et al., 2011). In a more recent review of health-related CRTs, Fiero et al. (2016) indicated that, of 86 studies, approximately 50% had 24 or fewer clusters (range = 2–1,552). Although not experimental, a review of 78 articles published from 2011 to 2014 in three leading sociology journals indicated that around a quarter of studies had fewer than 20 clusters (Heisig et al., 2017).

5 The Stata package xtgeebcv (Gallis et al., 2020) is free but works only with Stata 16 or later, requiring users with older licenses to upgrade.

6 Also available in Stata.

7 The CR2 correction, also referred to as HC2 in a clustered setting, is available in the GLIMMIX procedure in SAS, which uses a likelihood-based method for model fitting. The option is not available in the GENMOD procedure, which is used for GEEs.

8 Readers who want to see the step-by-step computations, written as R syntax in matrix notation, can download the syntax from the first author’s website.

9 In a logistic regression, working residuals can be estimated using $z - X\hat{\beta}$. Premultiplying the working residuals by the square root of the working weights, $W^{1/2}(z - X\hat{\beta})$ (McCaffrey et al., 2001, p. 2), yields the Pearson residuals, which can be obtained in R using residuals(mod, "working") * sqrt(weights(mod, "working")), where mod is the model object. This can also be done using residuals(mod, "pearson"). McCaffrey et al. (2001) refer to the Pearson residuals as $r^{*}$.
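
For the logit link, the working residual is $(y - \hat{p}) / [\hat{p}(1 - \hat{p})]$ and the working weight is $\hat{p}(1 - \hat{p})$, so multiplying the working residual by the square root of the weight reduces to the Pearson residual $(y - \hat{p}) / \sqrt{\hat{p}(1 - \hat{p})}$. The minimal pure-Python sketch below mirrors the R expression in this note rather than calling R; it assumes the fitted linear predictor is already available, and the function name logistic_residuals and the example values are hypothetical.

```python
import math


def logistic_residuals(y, eta):
    """Working residuals, working weights, and Pearson residuals for a logit model.

    y   : observed 0/1 outcomes
    eta : fitted linear predictor (X times beta-hat) for each observation
    """
    working, weights, pearson = [], [], []
    for yi, ei in zip(y, eta):
        p = 1.0 / (1.0 + math.exp(-ei))   # fitted probability
        w = p * (1.0 - p)                  # working weight under the logit link
        r_work = (yi - p) / w              # working residual, z - X beta-hat
        working.append(r_work)
        weights.append(w)
        pearson.append(math.sqrt(w) * r_work)  # W^{1/2}(z - X beta-hat)
    return working, weights, pearson


# Hypothetical outcomes and fitted linear predictors.
y = [1, 0, 1, 1, 0]
eta = [0.4, -1.2, 2.0, -0.3, 0.8]
_, _, r_star = logistic_residuals(y, eta)
for yi, ei, r in zip(y, eta, r_star):
    p = 1.0 / (1.0 + math.exp(-ei))
    # Matches the usual Pearson residual (y - p) / sqrt(p (1 - p)).
    assert abs(r - (yi - p) / math.sqrt(p * (1.0 - p))) < 1e-12
```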

10 Also referred to as the projection matrix.

11 This adjustment matrix is similar to the Kauermann and Carroll (2001) correction used in GEEs. If $[I_{N_g} - H_g]^{-1}$ is used instead of the inverse of the symmetric square root, the result is the Mancl and DeRouen (2001) correction used with GEEs.
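
With singleton clusters, $H_g$ is just the leverage $h_i$, so the CR2 adjustment $(1 - h_i)^{-1/2}$ inflates each squared residual by $1/(1 - h_i)$ (the HC2 form), while the Mancl and DeRouen adjustment $(1 - h_i)^{-1}$ inflates it by $1/(1 - h_i)^2$ (the HC3 form). A minimal pure-Python sketch of this special case, assuming an ordinary least squares model with an intercept and one predictor; the function name and data are hypothetical, and the general multi-observation cluster case, which needs a matrix inverse square root, is not shown.

```python
def slope_robust_variances(x, y):
    """Sandwich variances of the OLS slope when every cluster has one observation.

    Returns the unadjusted (CR0/HC0), CR2-style (HC2), and
    Mancl-DeRouen-style (HC3) variance estimates for the slope.
    """
    n = len(x)
    # X'X for the design [1, x] and its inverse (2 x 2, computed by hand).
    sx = sum(x)
    sxx = sum(v * v for v in x)
    det = n * sxx - sx * sx
    inv = [[sxx / det, -sx / det], [-sx / det, n / det]]
    # OLS coefficients: (X'X)^{-1} X'y
    sy = sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    b0 = inv[0][0] * sy + inv[0][1] * sxy
    b1 = inv[1][0] * sy + inv[1][1] * sxy
    out = {"CR0": 0.0, "CR2": 0.0, "MD": 0.0}
    for xi, yi in zip(x, y):
        e = yi - (b0 + b1 * xi)                                # raw residual
        h = inv[0][0] + 2 * inv[0][1] * xi + inv[1][1] * xi * xi  # leverage x_i'(X'X)^{-1}x_i
        c = inv[1][0] + inv[1][1] * xi                         # slope row of (X'X)^{-1} x_i
        out["CR0"] += (c * e) ** 2
        out["CR2"] += (c * e) ** 2 / (1.0 - h)          # (1 - h)^{-1/2}, squared
        out["MD"] += (c * e) ** 2 / (1.0 - h) ** 2      # (1 - h)^{-1}, squared
    return out


x = [1, 2, 3, 4, 5, 6]
y = [1.2, 1.9, 3.2, 3.8, 5.3, 5.9]   # hypothetical data
print(slope_robust_variances(x, y))  # CR0 < CR2 < MD for these data
```

The ordering follows because each leverage lies strictly between 0 and 1 here, so the Mancl and DeRouen version always inflates more than CR2.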

12 For level-1 variables and cross-level interactions, older versions of HLM (Raudenbush et al., 2013) used N – k, where k was the total number of predictors (at all levels, including the intercept). The nlme package in R (Pinheiro et al., 2014) uses N – G – L1 – 1, where L1 is the number of level-1 predictors. Though rarely of interest, the intercept is treated as a higher-level variable in HLM but as a level-1 variable in nlme, so the intercept's dof differ between the two programs.
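
The two rules in this note are simple arithmetic. A minimal sketch, assuming that L1 counts level-1 predictors excluding the intercept and that k counts every predictor plus the intercept; the function name level1_dof and the example values are hypothetical.

```python
def level1_dof(n_obs, n_groups, n_level1, n_level2):
    """Denominator dof for a level-1 predictor under the two rules in note 12.

    n_obs    : N, total number of observations
    n_groups : G, number of clusters
    n_level1 : L1, number of level-1 predictors (intercept excluded)
    n_level2 : number of level-2 predictors (intercept excluded)
    """
    k = n_level1 + n_level2 + 1                 # all predictors plus the intercept
    older_hlm = n_obs - k                       # older HLM versions: N - k
    nlme = n_obs - n_groups - n_level1 - 1      # nlme: N - G - L1 - 1
    return older_hlm, nlme


# 10 clusters of 20, one level-1 covariate, one level-2 treatment indicator.
print(level1_dof(n_obs=200, n_groups=10, n_level1=1, n_level2=1))  # (197, 188)
```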

13 SPSS can estimate models using GEEs with CRSEs (Huang, 2022), but it offers no adjustments for a small number of clusters.

14 The outcome is also a random variable because it reflects both the random assignment used in experiments (resulting in $TR_j$) and sampling variability.

15 Another drawback of PQL is that likelihoods are not estimated, which precludes the use of likelihood ratio tests for model comparisons. However, this is not necessarily a problem when evaluating CRTs.

16 We also estimated the models using maximum likelihood as a basis for comparison (not shown). Note that neither the Laplace approximation nor GHQ has a REML analog, so they may not be well suited for CRTs with few clusters.

17 A supplementary analysis (see Appendix A), in which data generated from a random slope DGP were analyzed using a random slope GLMM, resulted in much higher non-convergence rates when the overall sample size was low, driven primarily by the number of observations per cluster (i.e., 20 vs. 100). With only 10 groups and 20 observations per group, about half of the GLMMs did not converge. Convergence rates also improved as the number of groups increased. Non-convergence was less of an issue when group size (GS) was 100 and the number of groups (NG) was at least 20, except when the prevalence rate was high (i.e., 90%). The more extreme the prevalence rate (i.e., the closer to 0% or 100%), the higher the non-convergence rate.

18 We also estimated this model using maximum likelihood, which produced underestimated standard errors (not shown).

19 The synthetic data allow us to provide a realistic case of how results may differ across methods. The synthetic data are used for pedagogical purposes only; for a substantive interpretation of results, readers should consult the full article using the original data (Gregory et al., 2021).

20 There were no differential effects.

21 We thank an anonymous reviewer for pointing this out.

22 Other simulations have set $\tau_{11} = \tau_{00}$ for simplicity (e.g., Clarke, 2008; Maas & Hox, 2004). In Moineddin et al.’s (2007) simulation, $\tau_{11}$ was set to a constant of 1. However, in practice, the random slope variance is often far smaller than the random intercept variance (e.g., one fifth as large; Muthén & Muthén, 2002).
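
A data-generating process with $\tau_{11}$ set to a fraction of $\tau_{00}$ can be sketched as follows. This is a hypothetical illustration, not the simulation code used in the study: it assumes a two-level logistic model with a cluster-level treatment indicator $TR_j$, a standard normal level-1 covariate, uncorrelated random intercepts and slopes, and arbitrary fixed-effect values. The function name simulate_crt and all default parameters are illustrative.

```python
import math
import random


def simulate_crt(n_groups, group_size, tau00, slope_ratio=0.2,
                 g00=0.0, g01=0.5, g10=0.3, seed=1):
    """Simulate one binary-outcome CRT data set (hypothetical parameter values).

    tau11 is set to slope_ratio * tau00 (e.g., one fifth), and the random
    intercept and slope are drawn independently (zero covariance assumed).
    Returns a list of (cluster, TR_j, x_ij, y_ij) tuples.
    """
    rng = random.Random(seed)
    tau11 = slope_ratio * tau00
    # Random assignment: half of the clusters receive treatment (TR_j = 1).
    tr = [1] * (n_groups // 2) + [0] * (n_groups - n_groups // 2)
    rng.shuffle(tr)
    rows = []
    for j in range(n_groups):
        u0 = rng.gauss(0.0, math.sqrt(tau00))   # random intercept, variance tau00
        u1 = rng.gauss(0.0, math.sqrt(tau11))   # random slope, variance tau11
        for _ in range(group_size):
            x = rng.gauss(0.0, 1.0)             # level-1 covariate
            eta = g00 + g01 * tr[j] + u0 + (g10 + u1) * x
            p = 1.0 / (1.0 + math.exp(-eta))    # inverse logit
            y = 1 if rng.random() < p else 0    # Bernoulli outcome
            rows.append((j, tr[j], x, y))
    return rows


data = simulate_crt(n_groups=10, group_size=20, tau00=1.0)
print(len(data))  # 200
```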
