Short Communications

A discussion on “A selective review of statistical methods using calibration information from similar studies” by Qin, Liu and Li

Pages 193-195 | Received 20 Apr 2022, Accepted 02 May 2022, Published online: 10 Jun 2022

We congratulate Qin, Liu and Li (QLL) on a thoughtful and much needed review of many interesting methods for combining information from similar studies, and we appreciate the opportunity to contribute a discussion. QLL cover a variety of settings and methods. Building on their review, we briefly survey some additional relevant literature, focusing on methods that deal with population heterogeneity. In practice, different studies most likely sample from different populations, and whether information should be combined depends on how similar those populations are, among many other considerations. To keep the discussion focussed, we follow the setting in Section 5 of QLL, although most of these methods can be applied more broadly.

We adopt the notation in Section 5.1 of QLL with some variations. Let $(Y_i, X_i^T, Z_i^T)^T$, $i=1,\ldots,n$, denote the individual data based on a random sample from the internal study, where $Y$ is the response and $X$ and $Z$ are vectors of covariates. The model of interest is $f(Y|X,Z;\beta)$ for $f(Y|X,Z)$ with parameter $\beta$. The external study fitted a (possibly misspecified) model $f(Y|X;\theta)$ for $f^*(Y|X)$ with covariates $X$ alone and parameter $\theta$. Throughout this discussion we use a $*$-superscript to denote distributions/expectations/quantities associated with the external study population. The external model fitting information is summarized by

$$E^*\{h(Y,X;\theta^*)\}=0, \tag{1}$$

where $h(Y,X;\theta)$ is the score function for $f(Y|X;\theta)$ and $\theta^*$ is the solution to the score equation based on the external study sample. Individual data from the external sample are not available. We assume the external sample size is very large so that the uncertainty in $\theta^*$ is negligible compared to the internal study, i.e. Case I in QLL.

QLL in Section 5 give an excellent review of some methods and their comparisons in terms of asymptotic efficiency when the internal and external study populations are the same, based on both the empirical likelihood (EL) formulation (Owen, 1988; Qin & Lawless, 1994) and the constrained maximum likelihood (CML) formulation (Chatterjee et al., 2016; Qin, 2000). These two formulations are closely connected (Han & Lawless, 2016). For ease of discussion, we provide the CML formulation here, which is already covered in QLL. When $f^*(Y,X,Z)=f(Y,X,Z)$, (1) can be transformed into

$$E\{\psi(X,Z;\beta)\}=0, \tag{2}$$

where

$$\psi(X,Z;\beta)=E\{h(Y,X;\theta^*)\mid X,Z\}=\int h(Y,X;\theta^*)\,f(Y|X,Z;\beta)\,dY. \tag{3}$$

The CML estimator $\hat\beta_{cml}$ is defined through

$$\max_{\beta,\,p_1,\ldots,p_n}\ \prod_{i=1}^n f(Y_i|X_i,Z_i;\beta)\,p_i \quad\text{subject to}\quad p_i\ge 0,\ \ \sum_{i=1}^n p_i=1,\ \ \sum_{i=1}^n p_i\,\psi(X_i,Z_i;\beta)=0, \tag{4}$$

where $p_i\equiv dF(X_i,Z_i)$, $i=1,\ldots,n$.
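To make transformation (3) concrete, the following is a minimal sketch for one special case in which the integral has a closed form. It assumes, purely for illustration, that both the internal and the external working models are linear in the mean and that $h$ is the least-squares estimating function for the external mean parameters; the function names `psi_linear` and `constraint_value` are hypothetical and do not come from QLL or the cited papers. In this case integrating $Y$ out of $h$ simply replaces $Y$ by its internal-model mean.

```python
import numpy as np


def psi_linear(X, Z, beta, theta_star):
    """psi(X, Z; beta) = E{h(Y, X; theta*) | X, Z} under linear working models.

    Illustrative assumptions (not from the discussion itself):
      internal mean: E(Y | X, Z) = beta[0] + X @ beta_x + Z @ beta_z
      external mean: E(Y | X)    = theta*[0] + X @ theta*_x
      h(Y, X; theta) = (1, X^T)^T (Y - theta[0] - X @ theta_x)
    Returns an (n, 1 + dim(X)) array, one row of psi per subject.
    """
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    beta = np.asarray(beta, dtype=float)
    theta_star = np.asarray(theta_star, dtype=float)
    n, p = X.shape
    mean_internal = beta[0] + X @ beta[1:1 + p] + Z @ beta[1 + p:]
    mean_external = theta_star[0] + X @ theta_star[1:]
    design = np.column_stack([np.ones(n), X])
    # Integrating Y out of h replaces Y by its internal-model mean.
    return design * (mean_internal - mean_external)[:, None]


def constraint_value(p, psi):
    """Left-hand side of the CML constraint sum_i p_i psi(X_i, Z_i; beta)."""
    return np.asarray(p, dtype=float) @ psi
```

A CML fit would then maximize the objective in (4) over $\beta$ and the weights $p_i$ while forcing `constraint_value` to zero; the optimization itself is not shown here.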

When $f^*(Y|X,Z)=f(Y|X,Z)$ but $f^*(X,Z)\ne f(X,Z)$, applying the transformation (3) to (1) will lead to

$$E^*\{\psi(X,Z;\beta)\}=0, \tag{5}$$

where the expectation $E^*(\cdot)$ is under $f^*(X,Z)$, in contrast to the $E(\cdot)$ in (2) that is under $f(X,Z)$. In this case, although (2) no longer holds, there are ways to make use of the external study information summarized by (5). One way is to collect a small supplementary sample $(X_j^{*T},Z_j^{*T})^T$, $j=1,\ldots,n^*$, from the external study population (Chatterjee et al., 2016; Han & Lawless, 2019), and define the CML estimator through

$$\max_{\beta,\,p_1^*,\ldots,p_{n^*}^*}\ \prod_{i=1}^n f(Y_i|X_i,Z_i;\beta)\prod_{j=1}^{n^*}p_j^* \quad\text{subject to}\quad p_j^*\ge 0,\ \ \sum_{j=1}^{n^*}p_j^*=1,\ \ \sum_{j=1}^{n^*}p_j^*\,\psi(X_j^*,Z_j^*;\beta)=0,$$

where $p_j^*\equiv dF^*(X_j^*,Z_j^*)$, $j=1,\ldots,n^*$. Another way is to specify a density ratio model of the form $f^*(X,Z)=\exp(W^T\alpha)f(X,Z)$, where $W$ is a subset of $(X^T,Z^T)^T$ and $\alpha$ is a newly introduced parameter, to link the external distribution $f^*(X,Z)$ to the internal distribution $f(X,Z)$ (Sheng et al., 2021). This approach models the heterogeneity between $f^*(X,Z)$ and $f(X,Z)$ by $\exp(W^T\alpha)$, in the same spirit as the exponential tilting model that links the case distribution to the control distribution in Section 7.1 of QLL. With the density ratio model, the CML estimator can be defined through

$$\max_{\beta,\,\alpha,\,p_1,\ldots,p_n}\ \prod_{i=1}^n f(Y_i|X_i,Z_i;\beta)\,p_i \quad\text{subject to}\quad p_i\ge 0,\ \ \sum_{i=1}^n p_i=1,\ \ \sum_{i=1}^n p_i\{\exp(W_i^T\alpha)-1\}=0,\ \ \sum_{i=1}^n p_i\,\psi(X_i,Z_i;\beta)\exp(W_i^T\alpha)=0,$$

where $p_i\equiv dF(X_i,Z_i)$, $i=1,\ldots,n$. Here the last two constraints are based on the fact that $f^*(X,Z)=\exp(W^T\alpha)f(X,Z)$ is a density and on (5), respectively. Note that the dimension of $\alpha$ needs to be no larger than the dimension of $\psi$ plus one for $\alpha$ to be identifiable.
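The step from the density ratio model to the last two constraints can be written out as follows. This is a short worked derivation under the stated model, with integrals taken over $(X,Z)$; it only spells out what the paragraph above asserts.

```latex
% Normalization: f^*(X,Z) = \exp(W^T\alpha) f(X,Z) must integrate to one.
\int f^*(X,Z)\,dX\,dZ
  = \int \exp(W^T\alpha)\, f(X,Z)\,dX\,dZ
  = E\{\exp(W^T\alpha)\} = 1
\;\Longrightarrow\;
E\{\exp(W^T\alpha) - 1\} = 0 .

% External moment condition (5), rewritten under the internal distribution.
E^*\{\psi(X,Z;\beta)\}
  = \int \psi(X,Z;\beta)\,\exp(W^T\alpha)\, f(X,Z)\,dX\,dZ
  = E\{\psi(X,Z;\beta)\exp(W^T\alpha)\} = 0 .

% Replacing E(\cdot) by the discrete weights p_i \equiv dF(X_i,Z_i) gives
\sum_{i=1}^n p_i\{\exp(W_i^T\alpha) - 1\} = 0 ,
\qquad
\sum_{i=1}^n p_i\,\psi(X_i,Z_i;\beta)\exp(W_i^T\alpha) = 0 .
```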

When $f^*(Y|X,Z)\ne f(Y|X,Z)$, applying transformation (3) to (1) leads to neither (2) nor (5) in general, regardless of whether $f^*(X,Z)$ is the same as $f(X,Z)$. In this case, the aforementioned CML estimators are biased. However, since the external study sample size is large, the reduction in variance from using the external summary information may still benefit the internal study parameter estimation from a mean squared error perspective. Based on consideration of such a bias-variance trade-off, Estes et al. (2018) proposed an empirical Bayes shrinkage estimator of the form

$$\hat V_0(\hat V_{mle}+\hat V_0)^{-1}\hat\beta_{mle}+\hat V_{mle}(\hat V_{mle}+\hat V_0)^{-1}\hat\beta_{cml}$$

that is a weighted average of $\hat\beta_{mle}$, the MLE using the internal study data alone, and $\hat\beta_{cml}$ defined through (4). Here $\hat V_{mle}$ and $\hat V_0$ are estimated variance matrices for $\hat\beta_{mle}$ and for the prior normal distribution on $\beta$, respectively. This method shrinks the final estimate towards $\hat\beta_{mle}$ in the presence of population heterogeneity and towards $\hat\beta_{cml}$ otherwise. Gu et al. (2021) extended the idea in Estes et al. (2018) to the case of multiple external studies, and proposed an estimator that is a weighted average of the empirical Bayes estimators resulting from using each external study separately.
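The weighted average above is simple to compute once its four ingredients are available. The sketch below takes the two estimates and the two variance matrices as given inputs; how $\hat V_0$ is estimated is left to Estes et al. (2018) and is not reproduced here, and the function name `eb_shrinkage` is hypothetical.

```python
import numpy as np


def eb_shrinkage(beta_mle, beta_cml, V_mle, V_0):
    """Weighted average V_0 (V_mle + V_0)^{-1} beta_mle + V_mle (V_mle + V_0)^{-1} beta_cml.

    All four inputs are assumed to be supplied by the user; this function
    does not estimate them.
    """
    beta_mle = np.asarray(beta_mle, dtype=float)
    beta_cml = np.asarray(beta_cml, dtype=float)
    V_mle = np.asarray(V_mle, dtype=float)
    V_0 = np.asarray(V_0, dtype=float)
    S_inv = np.linalg.inv(V_mle + V_0)
    # The two matrix weights add up to the identity matrix.
    return V_0 @ S_inv @ beta_mle + V_mle @ S_inv @ beta_cml
```

When `V_0` is the zero matrix the result equals `beta_cml`, and as `V_0` grows relative to `V_mle` the result moves towards `beta_mle`, which matches the shrinkage behaviour described above.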

To deal with arbitrary population heterogeneity when information is available from multiple external studies but without causing estimation bias, Zhai and Han (2022) developed an estimation procedure that simultaneously selects the studies that give (2) and incorporates the corresponding information into internal model fitting. Their method also applies under the current setting of only one external study. When (2) does not hold because of an unknown form of heterogeneity, some components of $E\{\psi(X,Z;\beta)\}$ may still be zero even if $E\{\psi(X,Z;\beta)\}$ is not a zero vector, and these components still contain useful information to improve internal model estimation efficiency. This observation makes sense because the association between the same response and certain covariates may not differ much across populations with certain specific heterogeneity. Some general examples of this observation are given in Zhai and Han (2022). Let $\gamma\equiv E\{\psi(X,Z;\beta)\}$; then the components of $\psi(X,Z;\beta)$ that correspond to zero-components of $\gamma$ should be selected to compute $\hat\beta_{cml}$. To shrink the estimate of the zero-components of $\gamma$ to exactly zero for information integration, under the current setting, the estimator in Zhai and Han (2022) is defined through

$$\max_{\beta,\,\gamma}\left[\max_{p_1,\ldots,p_n}\log\left\{\prod_{i=1}^n f(Y_i|X_i,Z_i;\beta)\,p_i\right\}-n\sum_{k=1}^{\dim(\gamma)}\lambda_n\frac{|\gamma_k|}{|\tilde\gamma_k|^{w}}\right] \quad\text{subject to}\quad p_i\ge 0,\ \ \sum_{i=1}^n p_i=1,\ \ \sum_{i=1}^n p_i\{\psi(X_i,Z_i;\beta)-\gamma\}=0,$$

with an adaptive Lasso penalty (Zou, 2006) on $\gamma$ with tuning parameter $\lambda_n>0$, where $\tilde\gamma_k$ is the $k$th component of $\tilde\gamma=n^{-1}\sum_{i=1}^n\psi(X_i,Z_i;\hat\beta_{mle})$ and $w>0$ is a user-specified number typically taken to be 1 or 2. A similar idea has also been considered for integrating external summary information into survival analysis with a different penalty (Chen et al., 2021).
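The penalty term can be illustrated in isolation. The sketch below reads it in the usual adaptive Lasso form of Zou (2006), in which each $|\gamma_k|$ is divided by $|\tilde\gamma_k|^w$, so that components whose initial estimate is near zero are penalized heavily. It only evaluates the penalty and the initial estimate $\tilde\gamma$; it does not carry out the constrained maximization, and the function names are hypothetical.

```python
import numpy as np


def initial_gamma(psi_at_mle):
    """gamma_tilde: column means of psi(X_i, Z_i; beta_mle), an (n, d) array."""
    return np.asarray(psi_at_mle, dtype=float).mean(axis=0)


def adaptive_lasso_penalty(gamma, gamma_tilde, lam, n, w=1.0):
    """n * sum_k lam * |gamma_k| / |gamma_tilde_k|**w.

    Assumes every component of gamma_tilde is nonzero, which holds with
    probability one for continuous data but is not checked here.
    """
    gamma = np.asarray(gamma, dtype=float)
    gamma_tilde = np.asarray(gamma_tilde, dtype=float)
    weights = 1.0 / np.abs(gamma_tilde) ** w
    return n * lam * np.sum(weights * np.abs(gamma))
```

A component with a small $|\tilde\gamma_k|$ receives a large weight, which is what pushes its estimate to exactly zero and so selects the corresponding component of $\psi$ for use in estimating $\beta$.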

All of the aforementioned methods make use of the external summary information through transformation (3). Taylor et al. (2022) took a different approach when both $f(Y|X,Z;\beta)$ and $f(Y|X;\theta)$ are generalized linear models (GLM), namely

$$g\{E(Y|X,Z)\}=\beta_0+X^T\beta_X+Z^T\beta_Z \tag{6}$$

and $l\{E(Y|X)\}=\theta_0+X^T\theta_X$, with possibly different link functions $g(\cdot)$ and $l(\cdot)$ (note that the second GLM may be misspecified). Here the notation $E(\cdot)$ for expectation is generic and simply presents the form of the model. For ease of discussion, assume covariates $Z$ have been orthogonalized to $X$, which can be done by taking $Z$ to be the vector of residuals of the least squares regression of each covariate in $Z$ on $X$ using the internal data. Taylor et al. (2022) showed that $\beta_X\approx c\,\theta_X$ for some unknown constant $c$ when both GLMs are fitted to the same infinitely large sample and when $\beta_X$, $\beta_Z$ and $\theta_X$ are all close to zero (in this sentence, with some abuse of notation, $\beta_X$, $\beta_Z$ and $\theta_X$ are the values after fitting both GLMs to the same infinitely large sample). See also Neuhaus and Jewell (1993). Based on this result, when $f(Y|X)$ and $f^*(Y|X)$ are similar in the sense that the relative effects of $X$ on $Y$ (but not necessarily the absolute magnitudes) are the same between the two populations, the $\theta_X^*$ produced by the external study can still be used to improve the internal estimation efficiency. With $Z$ orthogonalized to $X$, instead of fitting (6), Taylor et al. (2022) proposed to fit

$$g\{E(Y|X,Z)\}=\beta_0+\alpha(X^T\theta_X^*)+Z^T\beta_Z$$

with coefficients $(\beta_0,\alpha,\beta_Z^T)^T$ to the internal study data, which is equivalent to letting $\beta_X=\alpha\theta_X^*$.
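The two steps in this paragraph, orthogonalizing $Z$ to $X$ and then fitting a model in which $X$ enters only through the single index built from the external coefficients, can be sketched as follows. The sketch assumes an identity link so that ordinary least squares suffices, and it includes an intercept in the orthogonalizing regression; both are simplifying assumptions made here, not choices taken from Taylor et al. (2022). The function names are hypothetical.

```python
import numpy as np


def orthogonalize(Z, X):
    """Residuals from least squares regression of each column of Z on (1, X)."""
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, Z, rcond=None)
    return Z - design @ coef


def fit_single_index(Y, X, Z, theta_x_ext):
    """Fit E(Y | X, Z) = beta0 + alpha * (X @ theta_x_ext) + Z_res @ beta_z.

    theta_x_ext holds the slope coefficients reported by the external study.
    Identity link (linear model) is assumed for illustration only.
    Returns (beta0, alpha, beta_z).
    """
    X = np.asarray(X, dtype=float)
    index = X @ np.asarray(theta_x_ext, dtype=float)
    Z_res = orthogonalize(Z, X)
    design = np.column_stack([np.ones(len(X)), index, Z_res])
    coef, *_ = np.linalg.lstsq(design, np.asarray(Y, dtype=float), rcond=None)
    return coef[0], coef[1], coef[2:]
```

The fit estimates one scalar `alpha` in place of the full vector of $X$ coefficients, which is where the efficiency gain comes from; the implied coefficients are `alpha * theta_x_ext`. For a non-identity link the least squares call would be replaced by a GLM fit.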

In the presence of different study populations, a crucial question to ask before combining information is which population is of primary interest. Most of the methods reviewed in this discussion, if not all, explicitly or implicitly assume that the internal study population is of primary interest and that the external summary information is used for efficiency improvement without causing (too much) bias. This is reasonable, for example, when the internal study has a clear target population and is based on a careful design with well-controlled sampling. In practice, with data usually collected by convenience sampling, the internal study sample may not be representative of the target population, or there may even be ambiguity about the target population itself. Therefore, caution is always needed when applying these methods to combine information in the presence of population heterogeneity, and more methodological development is needed.

Disclosure statement

No potential conflict of interest was reported by the author.

References

  • Chatterjee, N., Chen, Y. H., Maas, P., & Carroll, R. J. (2016). Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources. Journal of the American Statistical Association, 111(513), 107–117. https://doi.org/10.1080/01621459.2015.1123157
  • Chen, Z., Ning, J., Shen, Y., & Qin, J. (2021). Combining primary cohort data with external aggregate information without assuming comparability. Biometrics, 77(3), 1024–1036. https://doi.org/10.1111/biom.v77.3
  • Estes, J. P., Mukherjee, B., & Taylor, J. M. G. (2018). Empirical Bayes estimation and prediction using summary-level information from external big data sources adjusting for violations of transportability. Statistics in Biosciences, 10(3), 568–586. https://doi.org/10.1007/s12561-018-9217-4
  • Gu, T., Taylor, J. M. G., & Mukherjee, B. (2021). A meta-inference framework to integrate multiple external models into a current study. Biostatistics, kxab017. https://doi.org/10.1093/biostatistics/kxab017
  • Han, P., & Lawless, J. F. (2016). Discussion of “Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources”. Journal of the American Statistical Association, 111(513), 118–121. https://doi.org/10.1080/01621459.2016.1149399
  • Han, P., & Lawless, J. F. (2019). Empirical likelihood estimation using auxiliary summary information with different covariate distributions. Statistica Sinica, 29, 1321–1342.
  • Neuhaus, J. M., & Jewell, N. P. (1993). A geometric approach to assess bias due to omitted covariates in generalized linear models. Biometrika, 80(4), 807–815. https://doi.org/10.1093/biomet/80.4.807
  • Owen, A. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika, 75(2), 237–249. https://doi.org/10.1093/biomet/75.2.237
  • Qin, J. (2000). Combining parametric and empirical likelihoods. Biometrika, 87(2), 484–490. https://doi.org/10.1093/biomet/87.2.484
  • Qin, J., & Lawless, J. (1994). Empirical likelihood and general estimating equations. The Annals of Statistics, 22(1), 300–325. https://doi.org/10.1214/aos/1176325370
  • Sheng, Y., Sun, Y., Huang, C.-Y., & Kim, M.-O. (2021). Synthesizing external aggregated information in the presence of population heterogeneity: A penalized empirical likelihood approach. Biometrics. https://doi.org/10.1111/biom.13429
  • Taylor, J. M. G., Choi, K., & Han, P. (2022). Data integration – exploiting ratios of parameter estimates from a reduced external model. Biometrika. https://doi.org/10.1093/biomet/asac022
  • Zhai, Y., & Han, P. (2022). Data integration with oracle use of external information from heterogeneous populations. Journal of Computational and Graphical Statistics. https://doi.org/10.1080/10618600.2022.2050248
  • Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429. https://doi.org/10.1198/016214506000000735