Research Articles

Reference-Based Multiple Imputation—What is the Right Variance and How to Estimate It

Pages 178-186 | Received 28 Apr 2021, Accepted 15 Sep 2021, Published online: 12 Nov 2021

Abstract

Reference-based multiple imputation methods have become popular for handling missing data in randomized clinical trials. Rubin’s variance estimator is well known to be biased compared to the reference-based imputation estimator’s true repeated sampling (frequentist) variance. Somewhat surprisingly given the increasing popularity of these methods, there has been relatively little debate in the literature as to whether Rubin’s variance estimator or alternative (smaller) variance estimators targeting the repeated sampling variance are more appropriate. We review the arguments made on both sides of this debate, and argue that the repeated sampling variance is more appropriate. We review different approaches for estimating the frequentist variance, and suggest a recent proposal for combining bootstrapping with multiple imputation as a widely applicable general solution. At the same time, in light of the consequences of reference-based assumptions for frequentist variance, we believe further scrutiny of these methods is warranted to determine whether the strength of their assumptions is generally justifiable.

1 Introduction

Reference-based imputation approaches have become popular for handling missing data in randomized trials (Carpenter, Roger, and Kenward 2013). They have typically been used as a sensitivity analysis to a primary analysis which assumes data are missing at random (MAR). Unlike most missing not at random (MNAR) sensitivity analysis methods, which often require specification of the value of sensitivity parameters, reference-based methods make assumptions which can be described somewhat more qualitatively, by specifying the distribution of missing data in the active arm by reference to the distribution in the reference or control arm. In the case of a continuous outcome which is measured repeatedly over time, popular reference-based approaches include Jump to reference (J2R) and copy reference (CR). In J2R, as described in further detail in Section 2, the imputation distribution is constructed by assuming that, for active treatment group patients with missing data, their marginal visit-specific means before dropout are equal to the active group means and after dropout are equal to the reference group (e.g., control group) visit-specific means. In CR, all means are taken from the reference group. Thus, broadly speaking, reference-based methods impute post-dropout outcomes for patients in the active group as if they switched at the point of dropout to the control treatment.

Although reference-based MI has since its inception tended to be used for sensitivity analyses for missing data, its core idea has more recently been adopted to handle intercurrent events for estimation of certain estimands (Mallinckrodt et al. 2019). Moreover, a recent review of trials published in the Lancet and New England Journal of Medicine between 2014 and 2019 showed that the use of reference- or control-based MI is increasing in clinical trials (Tan et al. 2021).

An important question when using reference-based MI is how to estimate the variance of the resulting estimator. Conventionally the variance of MI estimators is obtained using Rubin’s combination rules. However, as observed by Seaman, White, and Leacy (2014), for reference-based MI Rubin’s variance estimator is biased upward relative to the repeated sampling variance of the MI estimator. This bias is due to uncongeniality between the imputation and analysis models (Meng 1994). Subsequently, a number of authors have discussed the merits of using either Rubin’s variance estimator (Carpenter et al. 2014; Cro, Carpenter, and Kenward 2019) or the frequentist variance (Seaman, White, and Leacy 2014; Lu 2014; Tang 2017; White, Joseph, and Best 2020), but no consensus has emerged. As we describe in more detail in Section 4, the frequentist variance can be estimated by analytical approaches such as the delta method (e.g., as described by Tang 2017) or by bootstrapping (e.g., as described by Gao, Liu, Zeng, Diao, Heyse, and Ibrahim 2017).

In Section 2, we review reference-based MI methods for continuous endpoints, the definition of congeniality, and why Rubin’s variance estimator is biased upward relative to the frequentist variance of the reference-based point estimator of treatment effect. In Section 3, we review arguments made in favor of both Rubin’s variance estimator and the frequentist variance, concluding that if the assumptions made by reference-based MI are employed, the frequentist variance is the right one. In Section 4, we review different approaches for estimating the frequentist variance of reference-based estimators. In Section 5, we report the results of a simulation study for a recurrent event endpoint, and in Section 6 give some discussion.

2 Reference-Based Multiple Imputation and Congeniality

In this section we review reference-based multiple imputation methods and the congeniality issue. The approach was originally proposed in the context of a repeatedly measured continuous endpoint assuming a multivariate normal model (Carpenter, Roger, and Kenward 2013). Subsequently, the idea has been extended to other endpoint types, including recurrent events (Keene et al. 2014) and survival times (Atkinson et al. 2019). To help make the following arguments regarding congeniality concrete yet (relatively) simple, we first review the J2R approach for a repeatedly measured continuous endpoint, following Carpenter, Roger, and Kenward (2013).

2.1 J2R Imputation

We assume that $i = 1, \ldots, n$ patients are randomized to either reference (denoted $X_i = 0$) or active treatment ($X_i = 1$). Patients are scheduled to have the outcome measured at times $j = 0, \ldots, J$. The outcomes intended to be measured are thus $Y_i = (Y_{i0}, Y_{i1}, Y_{i2}, \ldots, Y_{iJ})$. Often there may exist additional baseline covariates which are to be adjusted for, but to simplify the key points we do not include these.

For each patient, some measurements may be missing. We note that depending on the chosen estimand, the actual outcomes may be observed, but the potential outcomes of interest under the chosen estimand are missing. We assume that only monotone missingness occurs, although we note that implementations of J2R typically handle intermediate missingness using MAR imputation. Let $D_i$ denote the time of the last observation for patient $i$. A patient with complete follow-up thus has $D_i = J$. The MAR assumption says that the probability of each pattern of missingness occurring depends only on the data observed under that pattern (Tsiatis 2006), which here means
$$P(D_i = j \mid Y_{i0}, Y_{i1}, \ldots, Y_{iJ}, X_i) = P(D_i = j \mid Y_{i0}, Y_{i1}, \ldots, Y_{ij}, X_i),$$
that is, that the (marginal) probability of dropping out immediately after time $j$ does not (statistically) depend on $Y_{j+1}, \ldots, Y_J$, conditional on the randomized treatment $X_i$ and the outcome measurements obtained up to and including time $j$. Alternatively, as described by Daniels and Hogan (2008), it can be equivalently stated as
$$P(D_i = j \mid D_i \geq j, Y_{i0}, Y_{i1}, \ldots, Y_{iJ}, X_i) = P(D_i = j \mid D_i \geq j, Y_{i0}, Y_{i1}, \ldots, Y_{ij}, X_i), \quad j = 1, \ldots, J - 1.$$
That is, among those who have not yet “dropped out” at time $j$, the probability that they drop out before time $j + 1$ does not depend on the outcomes $Y_{j+1}, \ldots, Y_J$, conditional on treatment group and the outcomes measured through to time $j$.
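To make the monotone MAR dropout mechanism concrete, the following minimal sketch (with arbitrary, made-up sample size and parameter values) simulates trajectories and lets the probability of dropping out after each visit depend only on the outcome observed at that visit:

```python
import numpy as np

rng = np.random.default_rng(2021)
n, J = 1000, 3  # patients and post-baseline visits (illustrative values)

# Generate complete outcome trajectories (an arbitrary AR(1)-like model)
Y = np.zeros((n, J + 1))
Y[:, 0] = rng.normal(0, 1, n)
for j in range(1, J + 1):
    Y[:, j] = 0.7 * Y[:, j - 1] + rng.normal(0, 1, n)

# Monotone MAR dropout: the hazard of dropping out after visit j
# depends only on Y_j, which is observed for those still in follow-up
D = np.full(n, J)  # D_i = time of last observation; D_i = J is complete follow-up
for j in range(J):
    still_in = D == J
    p_drop = 1 / (1 + np.exp(2.0 - 0.5 * Y[:, j]))  # depends on observed Y_j only
    drop_now = still_in & (rng.uniform(size=n) < p_drop)
    D[drop_now] = j  # last observed visit is j

# Blank out the post-dropout outcomes to form the observed data
Y_obs = Y.copy()
for j in range(1, J + 1):
    Y_obs[D < j, j] = np.nan
```

Here the dropout probability never looks at unobserved future outcomes, which is exactly the conditional-independence statement in the MAR definitions above.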

Multiple imputation as originally conceived is based on taking a Bayesian approach, with noninformative priors specified for the imputation model parameters. In general, to create a single imputed dataset in this Bayesian paradigm, step one consists of taking a draw from the observed data posterior distribution of the imputation model parameter. This sometimes requires use of Markov chain Monte Carlo (MCMC) methods. Step two consists of drawing from the conditional distribution of the missing data given the observed, with the model parameter set to the value drawn in step one.

In J2R MI, in the reference arm we assume that $Y_i = (Y_{i0}, Y_{i1}, \ldots, Y_{iJ})$ is distributed multivariate normal with a distinct mean at each follow-up time and an unstructured covariance matrix, and this model is fitted to the observed data using maximum likelihood, assuming MAR. The same model is separately fitted to the data in the active arm, with distinct parameters. To generate an imputed dataset, from each model a posterior draw is taken of the respective model parameters. Following Carpenter, Roger, and Kenward (2013), we let $\mu_r = (\mu_{r,0}, \ldots, \mu_{r,J})^T$ and unstructured covariance matrix $\Sigma_r$ denote the resulting posterior draws of the mean and covariance matrix in the reference arm, and $\mu_a$ and $\Sigma_a$ the corresponding parameters in the active arm. We then partition the covariance matrices at time $D_i$ as
$$\Sigma_r = \begin{pmatrix} R_{11} & R_{12} \\ R_{21} & R_{22} \end{pmatrix}, \qquad \Sigma_a = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}.$$

Consider a patient with observed values $Y_{i0}, Y_{i1}, \ldots, Y_{iD_i}$ and missing values $Y_{i(D_i+1)}, \ldots, Y_{iJ}$. If they are in the reference arm, then their missing values are imputed from the conditional distribution implied by the assumed model under MAR and the reference arm model parameter posterior draws $\mu_r$ and $\Sigma_r$. If the patient is in the active treatment arm, then they are imputed using the multivariate normal conditional distribution implied by assuming that their full data vector has marginal mean equal to
$$\tilde{\mu}_i = (\mu_{a,0}, \ldots, \mu_{a,D_i}, \mu_{r,D_i+1}, \ldots, \mu_{r,J})^T$$
and variance-covariance matrix equal to
$$\tilde{\Sigma} = \begin{pmatrix} \tilde{\Sigma}_{11} & \tilde{\Sigma}_{12} \\ \tilde{\Sigma}_{21} & \tilde{\Sigma}_{22} \end{pmatrix},$$
where
$$\tilde{\Sigma}_{11} = A_{11}, \qquad \tilde{\Sigma}_{21} = R_{21} R_{11}^{-1} A_{11}, \qquad \tilde{\Sigma}_{22} = R_{22} - R_{21} R_{11}^{-1} (R_{11} - A_{11}) R_{11}^{-1} R_{12}.$$

The latter values are those that ensure that the sub-matrix of Σ˜ corresponding to the observed measurements matches that in the active arm and the conditional covariance matrix of the missing components given the observed matches that in the reference arm.
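The construction above can be sketched numerically. The snippet below (an illustrative sketch with arbitrary parameter values, not production imputation code) builds the J2R marginal mean and covariance for a patient who drops out early, and checks the two matching properties just described:

```python
import numpy as np

J = 3          # visits 0..J; illustrative
D_i = 1        # patient's last observed visit (so 2 observed, 2 missing components)
k = D_i + 1    # number of observed components

# Arbitrary illustrative "posterior draws" of the two arms' parameters
mu_r = np.array([0.0, 0.5, 1.0, 1.5])
mu_a = np.array([0.0, 1.0, 2.0, 3.0])
Sigma_r = np.array([[1.0, 0.5, 0.4, 0.3],
                    [0.5, 1.2, 0.6, 0.4],
                    [0.4, 0.6, 1.3, 0.7],
                    [0.3, 0.4, 0.7, 1.4]])
Sigma_a = Sigma_r * 1.5  # any positive-definite matrix would do

R11, R12 = Sigma_r[:k, :k], Sigma_r[:k, k:]
R21, R22 = Sigma_r[k:, :k], Sigma_r[k:, k:]
A11 = Sigma_a[:k, :k]

# J2R marginal mean: active means up to dropout, reference means after
mu_tilde = np.concatenate([mu_a[:k], mu_r[k:]])

# J2R covariance blocks
R11inv = np.linalg.inv(R11)
S11 = A11
S21 = R21 @ R11inv @ A11
S22 = R22 - R21 @ R11inv @ (R11 - A11) @ R11inv @ R12
Sigma_tilde = np.block([[S11, S21.T], [S21, S22]])

# Property 1: the observed-data block matches the active arm
assert np.allclose(Sigma_tilde[:k, :k], A11)
# Property 2: the conditional covariance of missing given observed
# matches that implied by the reference arm
cond_tilde = S22 - S21 @ np.linalg.inv(S11) @ S21.T
cond_ref = R22 - R21 @ R11inv @ R12
assert np.allclose(cond_tilde, cond_ref)
```

The two assertions verify algebraically why these particular block formulas are used: the observed components keep the active-arm covariance, while the imputation (conditional) covariance is taken from the reference arm.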

The preceding describes the J2R approach. Carpenter, Roger, and Kenward (2013) also proposed a number of other variants, including last mean carried forward (LMCF), copy increments in reference, and CR.

2.2 Uncongeniality of Reference-Based Imputation

Seaman, White, and Leacy (2014) and Lu (2014) both noted that reference-based imputation approaches were uncongenial with what would be the standard analysis of the resulting imputed data, namely a linear regression model of the final time point outcome, with randomized treatment as covariate (plus, typically, other baseline covariates). As such, Rubin’s variance estimator is not unbiased (even asymptotically) for the true repeated sampling variance of the estimator of treatment effect obtained after using reference-based MI to impute missing values.

To investigate why reference-based MI leads to uncongeniality, we first review the definition of congeniality (Meng 1994; Xie and Meng 2017). We let $Z_{com} = \{(X_i, D_i, Y_{i0}, \ldots, Y_{iJ}); i = 1, \ldots, n\}$ denote the complete data, $Z_{mis}$ denote the missing component, and $Z_{obs}$ denote the observed component. Let $\theta$ denote the parameter of interest, which here is the difference in mean outcomes between randomized groups at the final time point, $\theta = E(Y_J \mid X = 1) - E(Y_J \mid X = 0)$. In the context considered by Meng (1994), the imputer and analyst in general are distinct entities, whereas in the present context of clinical trials they are the same person. In the absence of any other baseline covariates, the analysis model is a linear regression model for $Y_J$ with $Y_0$ and $X$ as covariates. In Meng’s terminology, the “analyst’s complete data procedure” is the ordinary least squares estimator of the coefficient of $X$ in this regression, which we denote $\hat{\theta}_A(\tilde{Z}_{com})$, where $A$ stands for analyst and $\tilde{Z}_{com}$ is an arbitrary complete dataset. The analyst’s variance estimator, $W_A(\tilde{Z}_{com})$, is the standard model-based variance estimator from linear regression for the coefficient of randomized treatment group $X$.

Following Xie and Meng (2017) and Bartlett and Hughes (2020), the imputation model and the analyst’s complete data procedure are said to be congenial if there exists a unifying Bayesian model (denoted $IA$) which embeds the imputer’s imputation model and the analyst’s complete data procedure, in the sense that

  1. For all possible complete datasets $\tilde{Z}_{com}$,
$$\hat{\theta}_A(\tilde{Z}_{com}) = E_{IA}(\theta \mid \tilde{Z}_{com}) \quad \text{and} \quad W_A(\tilde{Z}_{com}) = \mathrm{var}_{IA}(\theta \mid \tilde{Z}_{com}) \tag{1}$$

    where $E_{IA}$ and $\mathrm{var}_{IA}$ denote posterior expectation and variance with respect to the embedding Bayesian model;

  2. For all possible $\tilde{Z}_{mis}$,
$$f_I(\tilde{Z}_{mis} \mid Z_{obs}) = f_{IA}(\tilde{Z}_{mis} \mid Z_{obs}) \tag{2}$$

where $f_I(\tilde{Z}_{mis} \mid Z_{obs})$ denotes the predictive distribution for the missing data used by the imputation model and $f_{IA}(\tilde{Z}_{mis} \mid Z_{obs})$ is the predictive distribution for the missing data given the observed data under the embedding Bayesian model.

To see more clearly why reference-based MI leads to uncongeniality, following Seaman, White, and Leacy (2014) and Carpenter et al. (2014), we consider an unrealistic but instructive situation where we omit the baseline measurement $Y_0$ and set $J = 1$. Thus now $D = 0$ indicates that the single outcome $Y$ is missing and $D = 1$ indicates $Y$ is observed. With no baseline measurement, the analyst’s complete data procedure reduces to calculating the difference in mean of $Y$ between those randomized to active ($X = 1$) and those randomized to reference ($X = 0$):
$$\hat{\theta}_A(\tilde{Z}_{com}) = \frac{\sum_{i=1}^n Y_i X_i}{\sum_{i=1}^n X_i} - \frac{\sum_{i=1}^n Y_i (1 - X_i)}{\sum_{i=1}^n (1 - X_i)} \tag{3}$$

The J2R imputation model in this highly simplified case assumes that
$$Y \mid D = 0, X = 0 \sim N(\mu_r, \sigma_r^2), \quad Y \mid D = 1, X = 0 \sim N(\mu_r, \sigma_r^2), \quad Y \mid D = 0, X = 1 \sim N(\mu_r, \sigma_r^2), \quad Y \mid D = 1, X = 1 \sim N(\mu_a, \sigma_a^2), \tag{4}$$
such that all outcomes have mean $\mu_r$ except those in the active arm with $D = 1$, who have mean $\mu_a$. The J2R imputation model for the missing data can be embedded in a model for the complete data with $f(Y, D, X) = f(Y \mid D, X) P(D \mid X) P(X)$ in which $f(Y \mid D, X)$ is given by the normal models in Equation (4) and $P(D = 0 \mid X = x) = \pi_x$, $x = 0, 1$, so that $E(D \mid X = x) = 1 - \pi_x$. We do not specify $P(X)$ (although we know its distribution from the randomization scheme), but rather perform inference conditional on $X$. With this embedding model, the J2R imputation procedure satisfies the second part (Equation (2)) of the congeniality definition by construction.

Given an arbitrary complete dataset $\tilde{Z}_{com}$, under the model embedding J2R imputation the MLE of $\mu_r$ is the mean outcome combining the reference arm patients with those in the active arm with $D = 0$, which can be expressed as
$$\hat{\mu}_r^{com} = \frac{\sum_{i=1}^n Y_i (1 - D_i X_i)}{\sum_{i=1}^n (1 - D_i X_i)},$$
and the MLE of $\mu_a$ is the mean outcome in those with $X = 1$ and $D = 1$:
$$\hat{\mu}_a = \frac{\sum_{i=1}^n Y_i D_i X_i}{\sum_{i=1}^n D_i X_i}.$$

Lastly, the MLE of $\pi_x$ for $x = 0, 1$ is simply the sample proportion with $D = 0$ in the $X = 0$ and $X = 1$ treatment groups. Then under the embedding model we have that
$$\begin{aligned} \theta &= E(Y \mid X = 1) - E(Y \mid X = 0) \\ &= E[E(Y \mid X = 1, D) \mid X = 1] - E[E(Y \mid X = 0, D) \mid X = 0] \\ &= E[\mu_r + (\mu_a - \mu_r) D \mid X = 1] - E(\mu_r \mid X = 0) \\ &= \mu_r + (\mu_a - \mu_r) E(D \mid X = 1) - \mu_r \\ &= (\mu_a - \mu_r)(1 - \pi_1). \end{aligned} \tag{5}$$

The MLE of $\theta$ given complete data under the embedding model follows from this expression by the invariance property of the MLE. Moreover, for large $n$ the posterior mean under the embedding model is (essentially) equal to the MLE, so that for large $n$ we have
$$E_{IA}(\theta \mid \tilde{Z}_{com}) = (\hat{\mu}_a - \hat{\mu}_r^{com})(1 - \hat{\pi}_1) = \left( \frac{\sum_{i=1}^n Y_i D_i X_i}{\sum_{i=1}^n D_i X_i} - \frac{\sum_{i=1}^n Y_i (1 - D_i X_i)}{\sum_{i=1}^n (1 - D_i X_i)} \right) \frac{\sum_{i=1}^n D_i X_i}{\sum_{i=1}^n X_i}, \tag{6}$$
which, unlike the analyst’s complete data estimator, uses $D$, and is not equal to the analyst’s complete data estimator $\hat{\theta}_A(\tilde{Z}_{com})$. Indeed, suppose that in the active arm virtually all patients were missing their outcomes, such that $\hat{\pi}_1 \approx 1$. In this case $E_{IA}(\theta \mid \tilde{Z}_{com}) \approx 0$, whereas the analyst’s complete data estimator is not (in general) close to zero. Thus the first part of the first condition in the congeniality definition is not satisfied.
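To make the difference between the two complete data procedures concrete, the following small simulation (a sketch with arbitrary, illustrative parameter values) generates a complete dataset in which active arm outcomes have mean mu_a regardless of D. The analyst's estimator, Equation (3), then lands near mu_a - mu_r, while the embedding-model MLE, Equation (6), is shrunk toward zero because it treats active arm patients with D = 0 as having mean mu_r:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
X = rng.integers(0, 2, n)                    # randomized arm
pi1, pi0 = 0.9, 0.2                          # dropout probabilities (illustrative)
D = (rng.uniform(size=n) >= np.where(X == 1, pi1, pi0)).astype(int)

mu_r, mu_a, sd = 0.0, 1.0, 1.0
# A complete dataset in which active-arm outcomes have mean mu_a regardless of D
Y = rng.normal(np.where(X == 1, mu_a, mu_r), sd)

# Analyst's complete-data estimator, Equation (3): ignores D
theta_A = Y[X == 1].mean() - Y[X == 0].mean()

# Embedding-model complete-data MLE, Equation (6): uses D
active_obs = (X == 1) & (D == 1)
mu_a_hat = Y[active_obs].mean()
mu_r_hat_com = Y[~active_obs].mean()         # pools reference arm + active dropouts
pi1_hat = (D[X == 1] == 0).mean()
theta_IA = (mu_a_hat - mu_r_hat_com) * (1 - pi1_hat)
```

With these values theta_A is close to 1, while theta_IA is close to zero, directly illustrating that the analyst's procedure and the embedding-model posterior mean are different functions of the complete data.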

Despite the fact that the analyst’s complete data estimator does not in general match the complete data posterior mean under the embedding model, the J2R MI estimator for $\theta$ which uses the analyst’s complete data estimator is equivalent (with an infinite number of imputations) to the Bayesian posterior mean under the embedding model. To see this, following Seaman, White, and Leacy (2014) and Carpenter et al. (2014), consider the MLE/posterior mean of $\theta$ given the observed data under the model embedding J2R imputation. The only change moving from the complete data to the observed data is that the MLE of $\mu_r$ is now based only on reference arm patients with $D = 1$, so that
$$E_{IA}(\theta \mid Z_{obs}) = (\hat{\mu}_a - \hat{\mu}_r^{obs})(1 - \hat{\pi}_1). \tag{7}$$

As the number of imputations goes to infinity, and provided $n$ is sufficiently large for the priors to have essentially no impact, under J2R MI the active group mean converges to $(1 - \hat{\pi}_1)\hat{\mu}_a + \hat{\pi}_1 \hat{\mu}_r^{obs}$, whereas the control group mean converges to $\hat{\mu}_r^{obs}$. Thus as the number of imputations tends to infinity the J2R MI estimator of $\theta$ converges to
$$(1 - \hat{\pi}_1)\hat{\mu}_a + \hat{\pi}_1 \hat{\mu}_r^{obs} - \hat{\mu}_r^{obs} = (\hat{\mu}_a - \hat{\mu}_r^{obs})(1 - \hat{\pi}_1) = E_{IA}(\theta \mid Z_{obs}). \tag{8}$$
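This convergence can be checked by direct simulation in the simplified single-outcome setting (an illustrative sketch with arbitrary values; for simplicity imputation is done at the MLE rather than from posterior draws, which does not affect the limit being illustrated):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 4000
X = rng.integers(0, 2, n)
D = (rng.uniform(size=n) >= np.where(X == 1, 0.5, 0.3)).astype(int)
Y = np.where(D == 1, rng.normal(np.where(X == 1, 2.0, 0.0), 1.0), np.nan)

# Observed-data estimates
mu_a_hat = Y[(X == 1) & (D == 1)].mean()
mu_r_obs = Y[(X == 0) & (D == 1)].mean()
sd_r = Y[(X == 0) & (D == 1)].std()
pi1_hat = (D[X == 1] == 0).mean()

# J2R MI in the simplified setting: every missing outcome, in either arm,
# is imputed from the reference distribution
M = 5000
estimates = []
for _ in range(M):
    Y_imp = Y.copy()
    miss = np.isnan(Y_imp)
    Y_imp[miss] = rng.normal(mu_r_obs, sd_r, miss.sum())
    estimates.append(Y_imp[X == 1].mean() - Y_imp[X == 0].mean())
theta_MI = np.mean(estimates)

# The limit given by Equation (8)
theta_eq8 = (mu_a_hat - mu_r_obs) * (1 - pi1_hat)
```

With a large number of imputations M, theta_MI agrees with theta_eq8 up to negligible Monte Carlo error.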

Turning to the variance, we must check whether the analyst’s complete data variance $W_A(\tilde{Z}_{com})$ matches the posterior variance given complete data under the embedding model. The analyst’s complete data variance estimator could either assume equal variances for $Y$ in the two groups (i.e., the standard t-test) or could use the variance estimator which relaxes this assumption, by estimating the variance separately in each group. Like Carpenter et al. (2014), we will assume the analyst does the latter, so that
$$W_A(\tilde{Z}_{com}) = \frac{\widehat{\mathrm{var}}(Y \mid X = 1)}{n_a} + \frac{\widehat{\mathrm{var}}(Y \mid X = 0)}{n_r}, \tag{9}$$
where $n_a$ and $n_r$ denote the numbers randomized to active and control, and $\widehat{\mathrm{var}}(Y \mid X = 1)$ and $\widehat{\mathrm{var}}(Y \mid X = 0)$ are the sample variances in each treatment group. Under the J2R assumptions (Equation (4)), we have $\mathrm{var}(Y \mid X = 0) = \sigma_r^2$. For $\mathrm{var}(Y \mid X = 1)$, we can use the law of total variance to give that
$$\begin{aligned} \mathrm{var}(Y \mid X = 1) &= \mathrm{var}[E(Y \mid X = 1, D) \mid X = 1] + E[\mathrm{var}(Y \mid X = 1, D) \mid X = 1] \\ &= \mathrm{var}[\mu_r + (\mu_a - \mu_r) D \mid X = 1] + E[\sigma_a^2 D + \sigma_r^2 (1 - D) \mid X = 1] \\ &= (\mu_a - \mu_r)^2 \pi_1 (1 - \pi_1) + \sigma_a^2 (1 - \pi_1) + \sigma_r^2 \pi_1. \end{aligned} \tag{10}$$

For the complete data posterior variance under the embedding model, again suppose $n$ is large so that this matches the variance estimated by maximum likelihood based on the observed information matrix. Some algebra shows that this is equal to
$$\mathrm{var}_{IA}(\theta \mid \tilde{Z}_{com}) = (1 - \hat{\pi}_1) \left[ \frac{\hat{\sigma}_r^2 (1 - \hat{\pi}_1)}{n_r + n_a \hat{\pi}_1} + \frac{\hat{\sigma}_a^2}{n_a} + \frac{(\hat{\mu}_r^{com} - \hat{\mu}_a)^2 \hat{\pi}_1}{n_a} \right], \tag{11}$$
where $\hat{\sigma}_a^2$ is the estimated variance of $Y$ from those with $X = 1$ and $D = 1$ and $\hat{\sigma}_r^2$ is the estimated variance from the remaining patients (i.e., $X = 0$, or $X = 1$ and $D = 0$). Equations (9) and (11) are not the same, as required by the second part of the first condition in the definition of congeniality. For example, consider again the case that almost all patients in the active arm have missing data, such that $\hat{\pi}_1 \approx 1$. Then from Equation (6) the complete data posterior mean is approximately zero and from Equation (11) its estimated variance is also approximately zero. In contrast, if $\pi_1 \approx 1$, from Equation (10) $\mathrm{var}(Y \mid X = 1) \approx \sigma_r^2$, and the analyst’s complete data variance estimator of Equation (9) will (on average) estimate $\sigma_r^2 (n_a^{-1} + n_r^{-1})$, that is, greater than zero. Thus the second part of the first condition in the congeniality definition is also not satisfied.
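The law-of-total-variance decomposition in Equation (10) is easy to verify by Monte Carlo (a sketch with arbitrary, illustrative parameter values):

```python
import numpy as np

rng = np.random.default_rng(3)
mu_r, mu_a = 0.0, 2.0
sd_r, sd_a = 1.0, 1.5
pi1 = 0.4                      # P(D = 0 | X = 1), illustrative

# Simulate the active arm under the J2R model, Equation (4)
n = 1_000_000
D = (rng.uniform(size=n) >= pi1).astype(int)
Y = np.where(D == 1, rng.normal(mu_a, sd_a, n), rng.normal(mu_r, sd_r, n))

empirical = Y.var()
formula = ((mu_a - mu_r) ** 2 * pi1 * (1 - pi1)
           + sd_a ** 2 * (1 - pi1)
           + sd_r ** 2 * pi1)
```

The empirical variance and the closed form in Equation (10) agree up to Monte Carlo error; the first term is the between-pattern contribution that inflates the active arm marginal variance whenever mu_a differs from mu_r.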

Rubin’s rules variance estimator is based on decomposing the posterior variance of $\theta$ under the embedding model as
$$\mathrm{var}_{IA}(\theta \mid Z_{obs}) = E_{IA}[\mathrm{var}_{IA}(\theta \mid \tilde{Z}_{com}) \mid Z_{obs}] + \mathrm{var}_{IA}[E_{IA}(\theta \mid \tilde{Z}_{com}) \mid Z_{obs}].$$

Rubin’s rules approximates the first part, the within-imputation variance, by substituting $W_A(\tilde{Z}_{com})$ for $\mathrm{var}_{IA}(\theta \mid \tilde{Z}_{com})$. Since, as we have seen, $W_A(\tilde{Z}_{com})$ is too large, this component will be biased upward. Rubin’s rules approximates the second part, the between-imputation variance, by substituting $\hat{\theta}_A(\tilde{Z}_{com})$ for $E_{IA}(\theta \mid \tilde{Z}_{com})$. Consider again the case where $\hat{\pi}_1 \approx 1$. As we have noted previously, from Equation (6), in this case $E_{IA}(\theta \mid \tilde{Z}_{com}) \approx 0$, and so $\mathrm{var}_{IA}[E_{IA}(\theta \mid \tilde{Z}_{com}) \mid Z_{obs}] \approx 0$. In contrast, the value of $\hat{\theta}_A(\tilde{Z}_{com})$ will vary across imputations, so that the estimated between-imputation variance will be larger than zero.

In conclusion, the observed data posterior mean of θ under the embedding model essentially matches the J2R MI estimator of θ. Rubin’s rules variance estimator is however larger than the observed data posterior variance under the embedding model. Assuming the embedding model is correct, the latter will (asymptotically) estimate the true frequentist variance of the point estimator correctly, and thus we conclude Rubin’s variance estimator is biased upward compared to the true frequentist variance of the point estimator.

Regarding other variants of reference-based MI, positive bias in Rubin’s variance estimator for a CR-type approach was shown through simulation by Lu (2014), and Tang (2017) derived analytical expressions for the frequentist bias in Rubin’s variance estimator for J2R and CR MI. Gao, Liu, Zeng, Diao, Heyse, and Ibrahim (2017) showed by simulation upward bias in Rubin’s variance estimator in the case of repeated binary outcomes. We note that, as shown for example by Robins and Wang (2000), there are other settings where Rubin’s variance estimator can be biased downward compared to the repeated sampling variance of the MI estimator.

3 What is the Right Variance for Reference-Based Multiple Imputation?

As described in the previous section, for reference-based MI, Rubin’s variance estimator is biased relative to the true repeated sampling variance of the point estimator of the treatment effect. How large the estimated variance of the treatment effect estimator should be is obviously important, since it affects the advertised precision of the estimated treatment effect and consequently Type I error and power to detect an effect. As expected given the upward (frequentist) bias in Rubin’s variance estimator, simulation studies have shown that use of reference-based MI with Rubin’s rules leads to conservative Type I error control under the null and the potential for substantial power loss (compared to using the frequentist variance) under the alternative (Gao, Liu, Zeng, Diao, Heyse, and Ibrahim 2017; Lu 2014; Tang 2017).

Carpenter et al. (2014) argued that the frequentist repeated sampling variance is inappropriate in the context of using reference-based MI as a sensitivity analysis to a primary analysis which handles the missing data under a “baseline” assumption, for example, MAR. They proposed a principle that for missing data sensitivity analyses, the variance should be no lower (on average) than the complete data variance estimator, and they showed that reference-based MI with the frequentist variance violates this principle.

Cro, Carpenter, and Kenward (2019) developed this principle further, considering a trial in which missing data are first handled using a “primary” set of assumptions about missingness and second handled using an alternative “sensitivity” set of assumptions. They defined an information anchored sensitivity analysis (e.g., an analysis using J2R) as one in which the relative loss in information about $\theta$ caused by missing data is the same as the loss in the primary analysis. Cro, Carpenter, and Kenward (2019) argued that in the context of trials, information anchored sensitivity analyses seem appropriate because, relative to the primary analysis assumptions, they neither add nor remove information. They showed that reference-based MI methods such as J2R are approximately information anchored when Rubin’s variance estimator is used for inferences, whereas the repeated sampling variance is information positive: information is being added relative to an MAR-based analysis.

In contrast, others have argued that the repeated sampling variance may be more appropriate. White, Joseph, and Best (2020) and Gao, Liu, Zeng, Diao, Heyse, and Ibrahim (2017) pointed out that using reference-based MI with Rubin’s rules leads to Type I error rates which are too small under the null and a loss of power under the alternative, and as such when used for the primary analysis, the frequentist variance may be preferable.

Consider the use of reference-based MI as a primary analysis estimator of treatment effect. Cro, Carpenter, and Kenward (2019) quite reasonably point out that it seems very counterintuitive to use a method which apparently is able to make more precise inferences the more data are lost. As we have seen, however, this is a logical consequence of the strength of the assumption made by reference-based MI methods. If such behavior is viewed as undesirable, and we believe that in many settings it may be, then we believe the correct response is to conclude that the assumptions made by the reference-based approach are inappropriate, rather than to assign blame to a variance estimator. Indeed, the uncongeniality issue here is caused by the fact that we are happy to make a (strong) assumption in the imputation model but not in the analysis. If we truly believe in the assumption, or at least want to perform an analysis supposing it is true, then we should use it throughout our analysis (i.e., at both the imputation and analysis stages). If we do not believe in it, or feel it is too strong, then we should not use it.

Turning next to the use of reference-based MI for sensitivity analyses, we agree with Cro, Carpenter, and Kenward (2019) that ensuring that sensitivity analyses do not inject or take away information (precision) relative to a primary set of missing data assumptions seems like a reasonable principle to adhere to. However, when considering this statement we believe it is critical to be careful about the precise meaning of “information”. Cro, Carpenter, and Kenward (2019) implicitly take the view that information corresponds to estimates of the variance of point estimators, rather than the true repeated sampling variance of the point estimators. Relative to an MAR analysis, reference-based MI estimators such as J2R do inject information when information is judged in terms of the true repeated sampling variance of the estimator. Using reference-based MI with Rubin’s rules to estimate the variance in our view amounts to pretending that reference-based assumptions about missing data are information anchored to an MAR analysis, when in actual fact they are information positive.

In summary, we believe that under a frequentist inference paradigm, information (precision) should be judged in terms of the true repeated sampling variance of estimators. If one wishes to perform information anchored sensitivity analyses, then we believe the correct solution is to construct missing data assumptions which differ from those made by the primary analysis but which genuinely neither add nor remove information, with information being judged in terms of the estimator’s true repeated sampling variance. Cro, Carpenter, and Kenward (2019) propose one possible route to this: adding additional random noise to the reference-based MI estimator so that its true repeated sampling variance matches the primary analysis method’s variance. A drawback of this approach is that it would then seem difficult to readily communicate the totality of the missing data assumptions made by such an “added noise” reference-based MI estimator. We thus believe further research is warranted to develop sensitivity analysis methods which are information anchored in the sense described above but which, like reference-based methods, can be relatively easily communicated.

4 Estimating the Repeated Sampling Variance

In this section we review methods for estimating the frequentist variance of reference-based MI estimators, considering their relative advantages and disadvantages.

4.1 Analytical Variance Estimators

A number of authors have developed analytical variance estimators for reference-based estimators for various endpoint types. For a continuous endpoint, Lu (2014) developed a maximum likelihood estimator with delta method variance under a CR assumption. Tang (2015) derived equivalent matrix versions of Lu’s (2014) CR estimator and accompanying delta method variance estimator. Tang (2017) derived analytical variance estimators for J2R and CR methods. Gao, Liu, Zeng, Diao, Heyse, and Ibrahim (2017) applied the general theory developed by Robins and Wang (2000) to derive analytical variance estimators for reference-based MI in the setting of repeated binary data. In all the preceding articles, the analytical variance estimators show good Type I error control under the null in simulations, and improved power under the alternative compared to using Rubin’s rules.

Analytical variance estimators have the major advantage of being computationally fast. Their drawback, however, is that they must be derived specifically for each case and implemented in software. Moreover, as noted by Gao, Liu, Zeng, Diao, Heyse, and Ibrahim (2017), there are situations where it may be difficult to derive such variance estimators, for example when intermediate missing values are imputed assuming MAR in a first stage followed by use of reference-based imputation, or when different imputation strategies are used for different types of intercurrent event.

4.2 Congenial Bayesian Approach

An alternative approach is to perform (congenial) Bayesian inference for the treatment effect under a model which embeds the reference-based assumptions. This approach was developed by Lu (2014) and Liu and Pang (2016). Since it is relatively straightforward to obtain posterior draws of the MAR MMRM model parameters using existing software, provided one can express the treatment effect (under the assumed reference-based assumption) as a function of the model parameters (Equation (5) being an example), one can obtain posterior draws of the treatment effect under this assumption by simply applying this function to the posterior draws of the MMRM model parameters. For large $n$ this approach results in accurate frequentist inferences, provided the assumed model is correct. We emphasize that congeniality is not an issue here, because one constructs the expression for the treatment effect directly under the assumed reference-based assumption: there is no uncongenial analyst complete data estimator as there is with the reference-based MI approach.
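In the simplified single-outcome setting of Section 2, this approach amounts to drawing from the posteriors of the observed-data parameters and transforming each draw by Equation (5). The sketch below uses simple large-n normal and conjugate beta approximations in place of MCMC, with arbitrary illustrative data; it is a sketch of the idea, not the full MMRM implementation of Lu (2014) or Liu and Pang (2016):

```python
import numpy as np

rng = np.random.default_rng(11)

# Simulated observed data in the simplified single-outcome setting (illustrative)
n = 2000
X = rng.integers(0, 2, n)
D = (rng.uniform(size=n) >= np.where(X == 1, 0.4, 0.2)).astype(int)
Y = np.where(D == 1, rng.normal(np.where(X == 1, 2.0, 0.0), 1.0), np.nan)

y_a = Y[(X == 1) & (D == 1)]
y_r = Y[(X == 0) & (D == 1)]
n1 = (X == 1).sum()
d1 = (D[X == 1] == 0).sum()

# Approximate posterior draws (large-n normal approximations; Jeffreys beta prior)
S = 10000
mu_a_draws = rng.normal(y_a.mean(), y_a.std() / np.sqrt(len(y_a)), S)
mu_r_draws = rng.normal(y_r.mean(), y_r.std() / np.sqrt(len(y_r)), S)
pi1_draws = rng.beta(0.5 + d1, 0.5 + n1 - d1, S)

# Transform each draw by Equation (5): theta = (mu_a - mu_r)(1 - pi1)
theta_draws = (mu_a_draws - mu_r_draws) * (1 - pi1_draws)
theta_hat = theta_draws.mean()
theta_se = theta_draws.std()
ci = np.percentile(theta_draws, [2.5, 97.5])
```

Because the treatment effect is defined directly as a function of the model parameters, the posterior standard deviation of the transformed draws targets the frequentist variance of the estimator, with no separate analysis model to be uncongenial with.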

A possible drawback with this approach, however, is that, like analytical variance estimators, expressing the treatment effect as a function of the MMRM model parameters may become complex and setup-specific, for example if one wanted to make a variety of different imputation assumptions to handle different types of intercurrent events.

4.3 Bootstrap Variance Estimators

As noted previously, analytical variance estimators and the congenial Bayesian approach require problem-specific derivations and implementations. An alternative which avoids this, at the expense of computational cost, is to use bootstrapping. Gao, Liu, Zeng, Diao, Heyse, and Ibrahim (2017) proposed applying nonparametric bootstrapping to reference-based MI estimators in the context of repeated binary endpoints. Simulations showed it gave Type I error control close to the nominal level under the null and superior power to using Rubin’s rules under the alternative. Gao, Liu, Zeng, Xu, Lin, Diao, Golm, Heyse, and Ibrahim (2017) and Diao et al. (2020) similarly used the same bootstrapping approach for reference-based MI estimators for recurrent event data, while Quan et al. (2018) examined its use for reference-based MI with continuous endpoints. Zhang, Golm, and Liu (2020) examined the performance of bootstrapping when used with a return-to-baseline type MI approach. We emphasize that under uncongeniality it is critical for the bootstrapping to be applied first, followed by multiple imputation. Approaches based on first multiply imputing missing data and then bootstrapping are not generally valid under uncongeniality (Bartlett and Hughes 2020). Also, it is important to note that consistency of bootstrap variance estimators relies on the statistical functional in question being smooth in a suitable sense (Shao and Tu 1995). While these conditions have not been verified for reference-based MI estimators, there is extensive empirical evidence that the nonparametric bootstrap variance estimator can deliver accurate frequentist inference when used with reference-based MI methods.
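The critical ordering (resample patients first, then impute within each bootstrap sample) can be sketched as follows, again in the simplified single-outcome J2R setting with arbitrary illustrative values:

```python
import numpy as np

def j2r_estimate(X, D, Y, rng, n_imp=2):
    """Fit the simplified J2R model to one dataset and return the MI estimate."""
    mu_r = Y[(X == 0) & (D == 1)].mean()
    sd_r = Y[(X == 0) & (D == 1)].std()
    ests = []
    for _ in range(n_imp):
        Y_imp = Y.copy()
        miss = np.isnan(Y_imp)
        Y_imp[miss] = rng.normal(mu_r, sd_r, miss.sum())  # J2R: reference mean
        ests.append(Y_imp[X == 1].mean() - Y_imp[X == 0].mean())
    return np.mean(ests)

rng = np.random.default_rng(5)
n = 1000
X = rng.integers(0, 2, n)
D = (rng.uniform(size=n) >= np.where(X == 1, 0.4, 0.2)).astype(int)
Y = np.where(D == 1, rng.normal(np.where(X == 1, 2.0, 0.0), 1.0), np.nan)

theta_hat = j2r_estimate(X, D, Y, rng)

# Bootstrap FIRST, then refit and impute within each resampled dataset
B = 500
boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, n)            # nonparametric resample of patients
    boot[b] = j2r_estimate(X[idx], D[idx], Y[idx], rng)
var_boot = boot.var(ddof=1)                 # frequentist variance estimate
```

Because the imputation model is refitted inside every bootstrap sample, the between-bootstrap spread captures the repeated sampling variability of the whole fit-then-impute procedure, which is what Rubin's rules fail to target under uncongeniality.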

A major practical drawback to using the bootstrap is its large computational cost. Whereas standard MI is often performed using a relatively small number of imputations, bootstrapping requires a much larger number of bootstrap samples to give accurate inferences. Moreover, in most implementations of MI, draws from the observed data posterior distribution of the model parameters are required in order to make the imputations “proper” and for Rubin’s variance estimator to be valid. For certain models this may require running computationally intensive methods such as Markov chain Monte Carlo (MCMC), which also necessitate consideration of how many iterations to run the chains for to achieve stationarity and independence of draws. However, if the bootstrap is used for inference, this can all be avoided since bootstrapping does not rely on Rubin’s rules: it suffices to generate each imputed dataset conditional on efficient estimates (e.g., the MLE) of the imputation model parameters, as proposed by von Hippel and Bartlett (Citation2021). To implement this one needs to make a generally minor modification to existing software, by skipping the step in the algorithm that performs the posterior draw.
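The distinction between proper imputation and imputation conditional on the MLE can be sketched as follows, again with a toy normal imputation model. For simplicity only the mean is drawn from its approximate posterior in the proper version; a fully proper scheme would also draw the variance:

```python
import numpy as np

rng = np.random.default_rng(2)
obs = rng.normal(0.0, 1.0, 150)  # observed outcomes used to fit the imputation model
n_mis = 50                       # number of missing values to impute

mu_hat, sd_hat = obs.mean(), obs.std()  # MLEs of the imputation model parameters

def impute_proper(rng):
    # "Proper" imputation, as needed for Rubin's rules: condition each imputed
    # dataset on a fresh draw from the (approximate) posterior of the parameters.
    mu_draw = rng.normal(mu_hat, sd_hat / np.sqrt(len(obs)))
    return rng.normal(mu_draw, sd_hat, n_mis)

def impute_mle(rng):
    # For bootstrap-based inference the posterior draw step can be skipped:
    # impute conditional on the MLE itself.
    return rng.normal(mu_hat, sd_hat, n_mis)
```

The only difference is the extra posterior draw line, which is the step that can be deleted from existing software when the bootstrap is used for inference.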

One might be tempted to reduce computational cost further by reducing the number of imputations performed on each bootstrap sample. This however increases the Monte-Carlo noise in the treatment effect estimator, leading to a somewhat less precise effect estimate and wider confidence intervals than necessary (Bartlett and Hughes Citation2020). An alternative bootstrap approach which overcomes this issue was proposed by von Hippel and Bartlett (Citation2021). Their approach performs a small number (e.g., two) of imputations of each bootstrap sample. The point estimator is taken as the average of estimates across all bootstraps and imputations. To estimate the variance of this estimator, they fit a simple random intercepts model to estimate the between-bootstrap and within-bootstrap (between imputation) variances. Bartlett and Hughes (Citation2020) demonstrated through simulation that the approach of von Hippel and Bartlett (Citation2021) provided efficient frequentist valid inferences under uncongeniality yet is substantially quicker to run compared to applying standard nonparametric bootstrapping to an MI estimator which uses a large number of imputations.
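In outline, the pooling calculation on the B × D grid of estimates can be performed via a one-way ANOVA decomposition, as sketched below. The exact pooling formula and degrees of freedom used here are our reading of the proposal; see von Hippel and Bartlett (Citation2021) for the precise variance estimator:

```python
import numpy as np

def vhb_point_and_var(theta):
    """theta: (B, D) array, row b holding the D imputation-specific estimates
    from bootstrap sample b. Returns the pooled point estimate and an
    approximate variance via a one-way ANOVA decomposition."""
    B, D = theta.shape
    grand = theta.mean()                      # pooled point estimate
    boot_means = theta.mean(axis=1)
    msb = D * boot_means.var(ddof=1)          # between-bootstrap mean square
    msw = ((theta - boot_means[:, None]) ** 2).sum() / (B * (D - 1))  # within
    sigma2_b = max((msb - msw) / D, 0.0)      # between-bootstrap variance component
    sigma2_w = msw                            # within-bootstrap (between-imputation)
    # Sampling variance plus Monte-Carlo noise from finite B and D:
    var = (1 + 1 / B) * sigma2_b + sigma2_w / (B * D)
    return float(grand), float(var)
```

The between-bootstrap component estimates the repeated sampling variance of the estimator, while the remaining terms account for Monte-Carlo error from using finite numbers of bootstraps and imputations.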

5 Simulations

In this section, we report the results of simulations demonstrating the performance of Rubin’s rules and the von Hippel and Bartlett bootstrap approach when applied to reference-based imputation methods. We consider the case of recurrent event data. Trials of one-year duration were simulated with 250 patients in each treatment group. Event counts were simulated using a negative binomial model with the R package dejaVu. Under the null hypothesis event rates were specified as 0.01 (on the days time scale) in both treatment groups, while under the alternative hypothesis the event rate was 0.01 in the control arm and 0.005 in the active arm. A dispersion parameter of 0.25 was used for all scenarios. Dropout completely at random was simulated by drawing each patient’s dropout time from an exponential distribution with rate either 0.00025 (leading to 91% of patients completing follow-up) or 0.0025 (leading to 40% of patients completing follow-up).
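A data-generating mechanism consistent with this description can be sketched as follows. The actual simulations used dejaVu in R, which simulates individual event times; here, as an illustrative simplification, the negative binomial is generated as a gamma-frailty Poisson count over each patient's observed follow-up:

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_arm(n, rate, disp, dropout_rate, t_max=365.0, rng=rng):
    # Gamma frailty with mean 1 and variance `disp` gives negative binomial
    # counts with the stated dispersion parameter.
    frailty = rng.gamma(shape=1.0 / disp, scale=disp, size=n)
    # Dropout completely at random: exponential dropout times, censored at 1 year.
    dropout = rng.exponential(1.0 / dropout_rate, size=n)
    followup = np.minimum(dropout, t_max)
    # Event count over the observed follow-up period.
    counts = rng.poisson(rate * frailty * followup)
    return counts, followup

# Low dropout scenario, null-hypothesis event rate.
counts, fu = simulate_arm(n=250, rate=0.01, disp=0.25, dropout_rate=0.00025)
completion = float(np.mean(fu == 365.0))  # proportion completing one year
```

With rate 0.00025 the expected completion proportion is exp(−0.00025 × 365) ≈ 0.91, matching the low dropout scenario; rate 0.0025 gives exp(−0.9125) ≈ 0.40.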

For those patients who dropped out, the event count in the missing part of their follow-up was imputed either using J2R or CR assumptions as described by Keene et al. (Citation2014). For MI with Rubin’s rules ten imputations were generated. Each imputed dataset was generated conditional on a draw from the approximate posterior distribution of the model parameters. For the von Hippel bootstrap approach, two imputations were generated for each bootstrap sample (200 bootstraps), conditional on the maximum likelihood estimates of the fitted model to the corresponding bootstrap sample, as described by von Hippel and Bartlett (Citation2021). In both cases, a negative binomial regression model was fitted to each of the imputed datasets, with treatment as covariate. R code used for the simulations is available at https://github.com/jwb133/refBasedVar.

Table 1 shows the average log rate ratio estimates for each method under the null hypothesis, the empirical standard error (SE) (i.e., the standard deviation) of these estimates (“Emp. SE”), and the average of the method’s estimated SE (“Est. SE”). The final column indicates the ratio of the SEs. Results are based on 5000 simulations. In the low dropout scenario, both the Rubin’s rules and bootstrap SEs had small bias relative to the empirical SE. However, in the high dropout scenario, the Rubin’s rules SE was biased upward, for both J2R and CR, compared to the empirical SE. In contrast, the bootstrap SEs remained essentially unbiased. In such a scenario, use of the Rubin’s rules SEs would lead to the Type I error rate being controlled at a level lower than the nominal level.

Table 1 Simulation results from 5000 simulations under the null hypothesis.

Table 2 shows the results under the alternative scenario. In the low dropout scenario, the J2R and CR estimated effects were close to the effect that would be seen under MAR (i.e., a log rate ratio of log(0.5) = −0.69). Under the high dropout scenario, the treatment effect was diluted considerably toward the null, with CR less conservative than J2R, as has been seen previously by others. Again both Rubin’s rules and the bootstrap gave approximately unbiased SEs in the low dropout scenario, while in the high dropout scenario Rubin’s rules SEs were again substantially biased upward and the bootstrap SEs were unbiased. In this scenario use of the bootstrap SEs would lead to materially increased power to detect a treatment effect.

Table 2 Simulation results from 5000 simulations under the alternative hypothesis.

6 Discussion

It has been known for almost 10 years that Rubin’s variance estimator is biased upward relative to the true repeated sampling variance of reference-based estimators of treatment effects in randomized trials, but there remains no settled view on what is the right variance to use. Given the increasing use of reference-based MI in trials, this is not particularly satisfactory. We have argued that the frequentist variance is the correct variance for reference-based estimators. If the behavior of the frequentist variance does not seem appropriate to the analyst, for example, because it decreases as the amount of missing data in the active arm increases, then our view is that this means the analyst does not really believe the assumptions made by the reference-based approach.

In the context of performing missing data sensitivity analyses, the proposed principle that sensitivity analyses should be information neutral, or anchored, seems eminently sensible. However, we believe that provided we are operating under the frequentist paradigm, information here must be judged in terms of repeated sampling variance. In our view, using reference-based MI with Rubin’s variance estimator amounts to using a point estimator that is not information anchored but then using a variance estimator that falsely suggests it is information anchored. Further research is warranted to develop new sensitivity analyses which retain the attractive features of a reference-based type approach, where assumptions are structured qualitatively, rather than quantitatively, but which are information anchored in the frequentist variance sense.

Historically, reference-based estimators have tended to be used mostly as sensitivity analyses to a primary analysis which adopts the MAR assumption. However, such methods, and combinations of them, are increasingly being used to develop estimators which might serve as the primary analysis method of trials (e.g., Darken et al. Citation2020). In this context, it is clearly important to assess whether the potentially strong missing data assumptions of such approaches are justifiable, particularly in light of what they imply for the repeated sampling variance of the treatment effect estimator.

We have suggested that a particular way of combining bootstrapping with MI can be used to obtain estimates of the frequentist variance of reference-based MI estimators. Because Rubin’s rules are no longer used, the imputation process does not need to be “proper,” and imputation can instead be performed conditional on maximum likelihood estimates. As described by von Hippel and Bartlett (Citation2021), implementing this in software should in most cases require relatively small changes, since the part that performs the draw from the posterior distribution (e.g., via MCMC sampling) can simply be skipped. In R, J2R MI is implemented for continuous endpoints using this approach in the mlmi package, while the RefBasedMI package (https://github.com/UCL/RefbasedMI), currently under development, has an option that allows the user to impute conditional on the MLE for a much wider range of reference-based assumptions for continuous endpoints. The R package dejaVu implements reference-based MI for recurrent event data, as proposed by Keene et al. (Citation2014), and also includes an option to impute conditional on the maximum likelihood estimates. The calculations to implement the bootstrap/MI approach proposed by von Hippel and Bartlett (Citation2021) are relatively simple, but for R users they are implemented in the bootImpute package.

Acknowledgments

The author’s institution has received consultancy fees for the author’s advice on statistical methodology from AstraZeneca, Bayer, Novartis, Roche. The author has received consultancy fees from Bayer. The author has received fees for provision of online courses from Roche.

Funding

This work was supported by the UK Medical Research Council (Grant MR/T023953/1).

References

  • Atkinson, A., Kenward, M. G., Clayton, T., and Carpenter, J. R. (2019), “Reference-Based Sensitivity Analysis for Time-to-Event Data,” Pharmaceutical Statistics, 18, 645–658. DOI: 10.1002/pst.1954.
  • Bartlett, J. W., and Hughes, R. A. (2020), “Bootstrap Inference for Multiple Imputation Under Uncongeniality and Misspecification,” Statistical Methods in Medical Research, 29, 3533–3546. DOI: 10.1177/0962280220932189.
  • Carpenter, J. R., Roger, J. H., and Kenward, M. G. (2013), “Analysis of Longitudinal Trials With Protocol Deviation: A Framework for Relevant, Accessible Assumptions, and Inference Via Multiple Imputation,” Journal of Biopharmaceutical Statistics, 23, 1352–1371. DOI: 10.1080/10543406.2013.834911.
  • Carpenter, J., Roger, J., Cro, S., and Kenward, M. (2014), “Response to Comments by Seaman et al. on ‘Analysis of Longitudinal Trials With Protocol Deviation: A Framework for Relevant, Accessible Assumptions, and Inference Via Multiple Imputation,’ Journal of Biopharmaceutical Statistics, 23, 1352–1371,” Journal of Biopharmaceutical Statistics, 24, 1363–1369. DOI: 10.1080/10543406.2014.960085.
  • Cro, S., Carpenter, J. R., and Kenward, M. G. (2019), “Information-Anchored Sensitivity Analysis: Theory and Application,” Journal of the Royal Statistical Society, Series A, 182, 623–645. DOI: 10.1111/rssa.12423.
  • Daniels, M. J., and Hogan, J. W. (2008), Missing Data in Longitudinal Studies, Boca Raton, FL: Chapman & Hall/CRC.
  • Darken, P., Nyberg, J., Ballal, S., and Wright, D. (2020), “The Attributable Estimand: A New Approach to Account for Intercurrent Events,” Pharmaceutical Statistics, 19. DOI: 10.1002/pst.2019.
  • Diao, G., Liu, G. F., Zeng, D., Zhang, Y., Golm, G., Heyse, J. F., and Ibrahim, J. G. (2020), “Efficient Multiple Imputation for Sensitivity Analysis of Recurrent Events Data With Informative Censoring,” Statistics in Biopharmaceutical Research, 1–9. DOI: 10.1080/19466315.2020.1819403.
  • Gao, F., Liu, G. F., Zeng, D., Xu, L., Lin, B., Diao, G., Golm, G., Heyse, J. F., and Ibrahim, J. G. (2017), “Control-Based Imputation for Sensitivity Analyses in Informative Censoring for Recurrent Event Data,” Pharmaceutical Statistics, 16, 424–432. DOI: 10.1002/pst.1821.
  • Gao, F., Liu, G., Zeng, D., Diao, G., Heyse, J. F., and Ibrahim, J. G. (2017), “On Inference of Control-Based Imputation for Analysis of Repeated Binary Outcomes With Missing Data,” Journal of Biopharmaceutical Statistics, 27, 358–372. DOI: 10.1080/10543406.2017.1289957.
  • Keene, O. N., Roger, J. H., Hartley, B. F., and Kenward, M. G. (2014), “Missing Data Sensitivity Analysis for Recurrent Event Data Using Controlled Imputation,” Pharmaceutical Statistics, 13, 258–264. DOI: 10.1002/pst.1624.
  • Liu, G. F., and Pang, L. (2016), “On Analysis of Longitudinal Clinical Trials With Missing Data Using Reference-Based Imputation,” Journal of Biopharmaceutical Statistics, 26, 924–936. DOI: 10.1080/10543406.2015.1094810.
  • Lu, K. (2014), “An Analytic Method for the Placebo-Based Pattern-Mixture Model,” Statistics in Medicine, 33, 1134–1145. DOI: 10.1002/sim.6008.
  • Mallinckrodt, C., Bell, J., Liu, G., Ratitch, B., O’Kelly, M., Lipkovich, I., Singh, P., Xu, L., and Molenberghs, G. (2019), “Aligning Estimators With Estimands in Clinical Trials: Putting the ICH E9 (R1) Guidelines Into Practice,” Therapeutic Innovation & Regulatory Science, 54, 353–364.
  • Meng, X. L. (1994), “Multiple-Imputation Inferences With Uncongenial Sources of Input” (with discussion), Statistical Science, 9, 538–573.
  • Quan, H., Qi, L., Luo, X., and Darchy, L. (2018), “Considerations of Multiple Imputation Approaches for Handling Missing Data in Clinical Trials,” Contemporary Clinical Trials, 70, 62–71. DOI: 10.1016/j.cct.2018.05.008.
  • Robins, J. M., and Wang, N. (2000), “Inference for Imputation Estimators,” Biometrika, 87, 113–124. DOI: 10.1093/biomet/87.1.113.
  • Seaman, S. R., White, I. R., and Leacy, F. P. (2014), “Comment on ‘Analysis of Longitudinal Trials With Protocol Deviations: A Framework for Relevant, Accessible Assumptions, and Inference Via Multiple Imputation’, by Carpenter, Roger, and Kenward,” Journal of Biopharmaceutical Statistics, 24, 1358–1362. DOI: 10.1080/10543406.2014.928306.
  • Shao, J., and Tu, D. (1995), The Jackknife and Bootstrap, New York: Springer.
  • Tan, P.-T., Cro, S., Van Vogt, E., Szigeti, M., and Cornelius, V. R. (2021), “A Review of the Use of Controlled Multiple Imputation in Randomised Controlled Trials With Missing Outcome Data,” BMC Medical Research Methodology, 21, 1–17. DOI: 10.1186/s12874-021-01261-6.
  • Tang, Y. (2015), “Short Notes on Maximum Likelihood Inference for Control-Based Pattern-Mixture Models,” Pharmaceutical Statistics, 14, 395–399. DOI: 10.1002/pst.1698.
  • Tang, Y. (2017), “On the Multiple Imputation Variance Estimator for Control-Based and Delta-Adjusted Pattern Mixture Models,” Biometrics, 73, 1379–1387.
  • Tsiatis, A. A. (2006), Semiparametric Theory and Missing Data, New York: Springer.
  • von Hippel, P. T., and Bartlett, J. W. (2021), “Maximum Likelihood Multiple Imputation: Faster Imputations and Consistent Standard Errors Without Posterior Draws,” Statistical Science, 36, 400–420. DOI: 10.1214/20-STS793.
  • White, I., Joseph, R., and Best, N. (2020), “A Causal Modelling Framework for Reference-Based Imputation and Tipping Point Analysis in Clinical Trials With Quantitative Outcome,” Journal of Biopharmaceutical Statistics, 30, 334–350. DOI: 10.1080/10543406.2019.1684308.
  • Xie, X., and Meng, X.-L. (2017), “Dissecting Multiple Imputation From a Multi-Phase Inference Perspective: What Happens When God’s, Imputer’s and Analyst’s Models Are Uncongenial?” Statistica Sinica, 27, 1485–1545.
  • Zhang, Y., Golm, G., and Liu, G. (2020), “A Likelihood-Based Approach for the Analysis of Longitudinal Clinical Trials With Return-to-Baseline Imputation,” Statistics in Biosciences, 12, 23–36. DOI: 10.1007/s12561-020-09269-0.