We congratulate Qin, Liu and Li (QLL) on a thoughtful and much-needed review of many interesting methods for combining information from similar studies, and we appreciate the opportunity to contribute a discussion. QLL cover a variety of settings and methods. Building on their review, we briefly survey some additional relevant literature, focusing on methods that deal with population heterogeneity: in practice, different studies most likely sample from different populations, and whether information should be combined depends, among many other considerations, on how similar those populations are. To keep the discussion focussed, we follow the setting in Section 5 of QLL, although most of these methods can be applied more broadly.
We adopt the notation in Section 5.1 of QLL with some variations. Let $(Y_i, X_i, Z_i)$, $i = 1, \ldots, n$, denote the individual data based on a random sample from the internal study, where $Y$ is the response and $X$ and $Z$ are vectors of covariates. The model of interest is $f(y \mid x, z; \beta)$ for $Y \mid X, Z$ with parameter $\beta$. The external study fitted a (possibly misspecified) model $g(y \mid x; \theta)$ for $Y \mid X$ with covariates $X$ alone and parameter $\theta$. Throughout this discussion we use a *-superscript to denote distributions/expectations/quantities associated with the external study population. The external model fitting information is summarized by

$$E^*\{h(Y, X; \theta^*)\} = 0, \tag{1}$$

where $h(y, x; \theta) = \partial \log g(y \mid x; \theta) / \partial \theta$ is the score function for $g(y \mid x; \theta)$ and $\theta^*$ is the solution to the score equation based on the external study sample. Individual data from the external sample are not available. We assume the external sample size is very large so that the uncertainty in $\theta^*$ is negligible compared to the internal study, i.e. Case I in QLL.
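QLL in Section 5 give an excellent review of some methods and their comparisons in terms of asymptotic efficiency when the internal and external study populations are the same, based on both the empirical likelihood (EL) formulation (Owen, 1988; Qin & Lawless, 1994) and the constrained maximum likelihood (CML) formulation (Chatterjee et al., 2016; Qin, 2000). These two formulations are closely connected (Han & Lawless, 2016). For ease of discussion, we provide the CML formulation here, which is already covered in QLL. When $f^*(y, x, z) = f(y, x, z)$, (1) can be transformed into

$$E\{u(X, Z; \beta, \theta^*)\} = 0, \tag{2}$$

where

$$u(X, Z; \beta, \theta^*) = \int h(y, X; \theta^*)\, f(y \mid X, Z; \beta)\, dy. \tag{3}$$

The CML estimator $\hat\beta_{\mathrm{CML}}$ is defined through

$$\max_{\beta,\, p_1, \ldots, p_n} \sum_{i=1}^n \log f(Y_i \mid X_i, Z_i; \beta) + \sum_{i=1}^n \log p_i \quad \text{subject to} \quad p_i \ge 0, \ \sum_{i=1}^n p_i = 1, \ \sum_{i=1}^n p_i\, u(X_i, Z_i; \beta, \theta^*) = 0, \tag{4}$$

where $p_i = dF(X_i, Z_i)$, $i = 1, \ldots, n$, with $F(x, z)$ the internal distribution of $(X, Z)$.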
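To make the transformation in (3) concrete, the following is a minimal sketch under the assumption that the response is binary and that both the internal model and the external working model are logistic regressions; this special case, the helper names (`expit`, `fit_logistic`, `u_fun`) and all simulated values are illustrative assumptions rather than part of the methods reviewed here. Writing $\beta$ for the internal parameter and $\theta^*$ for the external estimate, the integral over $y$ then reduces to a two-term sum and gives $u = \tilde x\,\{\mathrm{expit}(\beta^{\mathrm T} \tilde w) - \mathrm{expit}(\theta^{*\mathrm T} \tilde x)\}$ with $\tilde x = (1, X)$ and $\tilde w = (1, X, Z)$. The code checks numerically that the sample mean of $u$ at the true $\beta$ is close to zero when the two populations coincide, which is the content of (2).

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(D, y, n_iter=25):
    """Newton-Raphson MLE for logistic regression with design matrix D."""
    coef = np.zeros(D.shape[1])
    for _ in range(n_iter):
        p = expit(D @ coef)
        score = D.T @ (y - p)
        info = (D * (p * (1 - p))[:, None]).T @ D
        coef = coef + np.linalg.solve(info, score)
    return coef

def u_fun(X, Z, beta, theta_star):
    """u(X, Z; beta, theta*) for two logistic models, one row per subject."""
    n = X.shape[0]
    x_tilde = np.column_stack([np.ones(n), X])   # (1, X)
    w_tilde = np.column_stack([x_tilde, Z])      # (1, X, Z)
    diff = expit(w_tilde @ beta) - expit(x_tilde @ theta_star)
    return x_tilde * diff[:, None]

# Simulated population (all values are hypothetical).
rng = np.random.default_rng(0)
n_big = 200_000
X = rng.normal(size=(n_big, 1))
Z = 0.5 * X + rng.normal(size=(n_big, 1))
beta_true = np.array([-0.5, 0.4, 0.6])
W = np.column_stack([np.ones(n_big), X, Z])
Y = rng.binomial(1, expit(W @ beta_true))

# "External" fit of the reduced model Y ~ X on a very large sample.
theta_star = fit_logistic(np.column_stack([np.ones(n_big), X]), Y)

# Sample analogue of E{u(X, Z; beta, theta*)} at the true beta.
u_bar = u_fun(X, Z, beta_true, theta_star).mean(axis=0)
print(u_bar)
```

In a CML fit, the same `u_fun` would supply the constraint in (4); here it is only evaluated at the true parameter to illustrate why the constraint carries information about $\beta$.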
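When $f^*(y \mid x, z) = f(y \mid x, z)$ but $f^*(x, z) \neq f(x, z)$, applying the transformation (3) to (1) will lead to

$$E^*\{u(X, Z; \beta, \theta^*)\} = 0, \tag{5}$$

where the expectation $E^*$ is under $f^*(x, z)$ in contrast to the $E$ in (2) that is under $f(x, z)$. In this case, although (2) no longer holds, there are ways to make use of the external study information summarized by (5). One way is to collect a small supplementary sample $(X_j^*, Z_j^*)$, $j = 1, \ldots, m$, from the external study population (Chatterjee et al., 2016; Han & Lawless, 2019), and define the CML estimator through

$$\max_{\beta,\, p_1^*, \ldots, p_m^*} \sum_{i=1}^n \log f(Y_i \mid X_i, Z_i; \beta) + \sum_{j=1}^m \log p_j^* \quad \text{subject to} \quad p_j^* \ge 0, \ \sum_{j=1}^m p_j^* = 1, \ \sum_{j=1}^m p_j^*\, u(X_j^*, Z_j^*; \beta, \theta^*) = 0,$$

where $p_j^* = dF^*(X_j^*, Z_j^*)$, $j = 1, \ldots, m$. Another way is to specify a density ratio model of the form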
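$f^*(x, z) / f(x, z) = \exp(\alpha_0 + \alpha_1^{\mathrm T} s)$, where $s$ is a subset of $(x, z)$ and $\alpha = (\alpha_0, \alpha_1^{\mathrm T})^{\mathrm T}$ is a newly introduced parameter, to link the external distribution $f^*(x, z)$ to the internal distribution $f(x, z)$ (Sheng et al., 2021). This approach models the heterogeneity between $f^*(x, z)$ and $f(x, z)$ by $\alpha$, and is in the same spirit as the exponential tilting model that links the case distribution to the control distribution in Section 7.1 of QLL. With the density ratio model, the CML estimator can be defined through

$$\max_{\beta,\, \alpha,\, p_1, \ldots, p_n} \sum_{i=1}^n \log f(Y_i \mid X_i, Z_i; \beta) + \sum_{i=1}^n \log p_i \quad \text{subject to} \quad p_i \ge 0, \ \sum_{i=1}^n p_i = 1, \ \sum_{i=1}^n p_i \exp(\alpha_0 + \alpha_1^{\mathrm T} S_i) = 1, \ \sum_{i=1}^n p_i \exp(\alpha_0 + \alpha_1^{\mathrm T} S_i)\, u(X_i, Z_i; \beta, \theta^*) = 0,$$

where $p_i = dF(X_i, Z_i)$, $i = 1, \ldots, n$. Here the last two constraints are based on the fact that $f^*(x, z)$ is a density and (5), respectively. Note that the dimension of $\alpha$ needs to be no larger than the dimension of $\theta^*$ plus one for $\alpha$ to be identifiable.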
When $f^*(y \mid x, z) \neq f(y \mid x, z)$, applying transformation (3) to (1) leads to neither (2) nor (5) in general, regardless of whether $f^*(x, z)$ is the same as $f(x, z)$. In this case, the aforementioned CML estimators are biased. However, since the external study sample size is large, the reduction in variance from using the external summary information may still benefit the internal study parameter estimation from a mean squared error perspective. Based on consideration of such a bias-variance trade-off, Estes et al. (2018) proposed an empirical Bayes shrinkage estimator of the form

$$\hat\beta_{\mathrm{EB}} = \hat A (\hat V + \hat A)^{-1} \hat\beta_{\mathrm{I}} + \hat V (\hat V + \hat A)^{-1} \hat\beta_{\mathrm{CML}}$$

that is a weighted average of $\hat\beta_{\mathrm{I}}$, the MLE using the internal study data alone, and $\hat\beta_{\mathrm{CML}}$ defined through (4). Here $\hat V$ and $\hat A$ are estimated variance matrices for $\hat\beta_{\mathrm{I}}$ and for the prior normal distribution on $\beta$, respectively. This method shrinks the final estimate towards $\hat\beta_{\mathrm{I}}$ in the presence of population heterogeneity and towards $\hat\beta_{\mathrm{CML}}$ otherwise. Gu et al. (2021) extended the idea in Estes et al. (2018) to the case of multiple external studies, and proposed an estimator that is a weighted average of the empirical Bayes estimators resulting from using each external study separately.
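The shrinkage behaviour described above can be sketched numerically. The snippet below is a hedged illustration, not the implementation of Estes et al. (2018): it computes the posterior mean of a normal-normal model in which the internal estimate has variance `V_hat` around the truth and the truth has a normal prior centred at the CML estimate with variance `A_hat`. The function name `eb_shrinkage` and the default plug-in choice of `A_hat` as the outer product of the discrepancy between the two estimates are assumptions made for illustration.

```python
import numpy as np

def eb_shrinkage(beta_int, beta_cml, V_hat, A_hat=None):
    """Combine A(V+A)^{-1} beta_int + V(V+A)^{-1} beta_cml.

    beta_int : internal-only estimate
    beta_cml : CML estimate using the external summary information
    V_hat    : estimated variance matrix of beta_int
    A_hat    : prior variance matrix; if None, an assumed plug-in is used
    """
    beta_int = np.asarray(beta_int, dtype=float)
    beta_cml = np.asarray(beta_cml, dtype=float)
    d = beta_int - beta_cml
    if A_hat is None:
        # Assumption: estimate the prior variance by the observed discrepancy.
        A_hat = np.outer(d, d)
    # A (V + A)^{-1}, obtained by solving with the symmetric matrix V + A.
    W_int = np.linalg.solve(V_hat + A_hat, A_hat).T
    # Equivalent to W_int @ beta_int + (I - W_int) @ beta_cml.
    return beta_cml + W_int @ d

# Hypothetical numbers: a large discrepancy relative to V pulls towards beta_int.
V = 5.0 * np.eye(2)
beta_eb = eb_shrinkage([1.0, 2.0], [0.0, 0.0], V)
print(beta_eb)  # [0.5 1. ]
```

With `V_hat` equal to 5 times the identity and a squared discrepancy of 5, the weight on the internal estimate is 5 / (5 + 5) = 0.5; when the two estimates agree, the discrepancy is zero and the CML estimate is returned unchanged.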
To deal with arbitrary population heterogeneity when information is available from multiple external studies but without causing estimation bias, Zhai and Han (2022) developed an estimation procedure that simultaneously selects the studies that give (2) and incorporates the corresponding information into internal model fitting. Their method also applies under the current setting of only one external study. When (2) does not hold because of an unknown form of heterogeneity, some components of $E\{u(X, Z; \beta, \theta^*)\}$ may still be zero even if $E\{u(X, Z; \beta, \theta^*)\}$ is not a zero vector, and these components still contain useful information to improve internal model estimation efficiency. This observation makes sense because the association between the same response and certain covariates may not differ much across populations with certain specific heterogeneity. Some general examples of this observation are given in Zhai and Han (2022). Let $\gamma = E\{u(X, Z; \beta, \theta^*)\}$; then the components of $u(X, Z; \beta, \theta^*)$ that correspond to zero-components of $\gamma$ should be selected to compute $\hat\beta_{\mathrm{CML}}$. To shrink the estimate of the zero-components of $\gamma$ to exactly zero for information integration, under the current setting, the estimator in Zhai and Han (2022) is defined through

$$\max_{\beta,\, \gamma,\, p_1, \ldots, p_n} \sum_{i=1}^n \log f(Y_i \mid X_i, Z_i; \beta) + \sum_{i=1}^n \log p_i - \lambda \sum_k \frac{|\gamma_k|}{|\tilde\gamma_k|^{w}} \quad \text{subject to} \quad p_i \ge 0, \ \sum_{i=1}^n p_i = 1, \ \sum_{i=1}^n p_i \{u(X_i, Z_i; \beta, \theta^*) - \gamma\} = 0,$$

with an adaptive Lasso penalty (Zou, 2006) on $\gamma$ with tuning parameter $\lambda$, where $\tilde\gamma_k$ is the $k$th component of $\tilde\gamma$, an initial estimate of $\gamma$, and $w > 0$ is some user-specified positive number typically taken to be 1 or 2. A similar idea has also been considered for integrating external summary information into survival analysis with a different penalty (Chen et al., 2021).
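As a small illustration of how an adaptive Lasso penalty of the kind in Zou (2006) treats components differently, the sketch below evaluates a penalty of the form tuning parameter times the sum over components of the absolute value of each component divided by the absolute initial estimate raised to the power w. The function name `adaptive_lasso_penalty`, the argument names and the numbers are hypothetical, and the exact scaling of the penalty used by Zhai and Han (2022) is not reproduced here.

```python
import numpy as np

def adaptive_lasso_penalty(gamma, gamma_init, lam, w=1.0):
    """lam * sum_k |gamma_k| / |gamma_init_k|**w (assumed penalty form)."""
    gamma = np.asarray(gamma, dtype=float)
    gamma_init = np.asarray(gamma_init, dtype=float)
    # Components with small initial estimates receive large weights,
    # so they are pushed towards exactly zero more strongly.
    weights = 1.0 / np.abs(gamma_init) ** w
    return lam * np.sum(weights * np.abs(gamma))

# Hypothetical initial estimate: first component near zero, second clearly not.
gamma_init = np.array([0.01, 0.5])
weights_w1 = 1.0 / np.abs(gamma_init) ** 1.0   # per-component weights for w = 1
weights_w2 = 1.0 / np.abs(gamma_init) ** 2.0   # per-component weights for w = 2
pen = adaptive_lasso_penalty(gamma_init, gamma_init, lam=0.1, w=1.0)
print(weights_w1, weights_w2, pen)
```

The near-zero component is weighted 50 times more heavily than the other when w = 1 and 2500 times more heavily when w = 2, which is what lets the procedure set such components to zero and use the corresponding constraints.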
All of the aforementioned methods make use of the external summary information through transformation (3). Taylor et al. (2022) took a different approach when both $f(y \mid x, z; \beta)$ and $g(y \mid x; \theta)$ are generalized linear models (GLM), namely

$$\psi_1\{E(Y \mid X, Z)\} = \beta_0 + \beta_X^{\mathrm T} X + \beta_Z^{\mathrm T} Z \tag{6}$$

and $\psi_2\{E(Y \mid X)\} = \theta_0 + \theta_X^{\mathrm T} X$, with possibly different link functions $\psi_1$ and $\psi_2$ (note that the second GLM may be misspecified). Here the notation $E$ for expectation is generic to simply present the form of the model. For ease of discussion, assume covariates $Z$ have been orthogonalized to $X$, which can be done by taking $Z$ to be the vector of residuals of the least squares regression of each covariate in $Z$ on $X$ using the internal data. Taylor et al. (2022) showed that $\beta_X \approx c\, \theta_X$ for some unknown constant $c$ when both GLMs are fitted to the same infinitely large sample and when $\beta_X$, $\beta_Z$ and $\theta_X$ are all close to zero (in this sentence, with some abuse of notation, $\beta_X$, $\beta_Z$ and $\theta_X$ are the values after fitting both GLMs to the same infinitely large sample). See also Neuhaus and Jewell (1993). Based on this result, when $f^*(y \mid x, z)$ and $f(y \mid x, z)$ are similar in the sense that the relative effects of $X$ on $Y$ (but not necessarily the absolute magnitudes) are the same between the two populations, the $\theta_X^*$ produced by the external study can still be used to improve the internal estimation efficiency. With $Z$ orthogonalized to $X$, instead of fitting (6), Taylor et al. (2022) proposed to fit $\psi_1\{E(Y \mid X, Z)\} = \beta_0 + c\,(\theta_X^{*\mathrm T} X) + \beta_Z^{\mathrm T} Z$ with coefficients $(\beta_0, c, \beta_Z)$ to the internal study data, which is equivalent to letting $\beta_X = c\, \theta_X^*$.
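The two steps described above, orthogonalizing the extra covariates and then fitting a reduced model in which the external slopes enter through a single index, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation of Taylor et al. (2022): the logistic link, the inclusion of an intercept in the orthogonalizing regression, the helper names (`fit_logistic`, `orthogonalize`), the simulated data and the external slope vector `theta_x_ext` are all hypothetical.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(D, y, n_iter=25):
    """Newton-Raphson MLE for logistic regression with design matrix D."""
    coef = np.zeros(D.shape[1])
    for _ in range(n_iter):
        p = expit(D @ coef)
        score = D.T @ (y - p)
        info = (D * (p * (1 - p))[:, None]).T @ D
        coef = coef + np.linalg.solve(info, score)
    return coef

def orthogonalize(Z, X):
    """Residuals from least squares regression of each column of Z on (1, X)."""
    D = np.column_stack([np.ones(X.shape[0]), X])
    coef, *_ = np.linalg.lstsq(D, Z, rcond=None)
    return Z - D @ coef

# Simulated internal data (hypothetical values).
rng = np.random.default_rng(1)
n = 20_000
X = rng.normal(size=(n, 2))
Z = 0.5 * X[:, [0]] + rng.normal(size=(n, 1))
lin = -0.3 + X @ np.array([0.3, -0.2]) + 0.4 * Z[:, 0]
Y = rng.binomial(1, expit(lin))

# Hypothetical external slopes, proportional to the internal X-effects
# after orthogonalization, which are about (0.5, -0.2) in this simulation.
theta_x_ext = np.array([1.0, -0.4])

Z_perp = orthogonalize(Z, X)
index = X @ theta_x_ext                      # single index built from X
D_reduced = np.column_stack([np.ones(n), index, Z_perp])
beta0_hat, c_hat, beta_z_hat = fit_logistic(D_reduced, Y)
print(c_hat, beta_z_hat)
```

The reduced fit estimates one scale factor `c_hat` in place of a separate coefficient for each component of $X$, which is where the efficiency gain comes from when the proportionality assumption is reasonable.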
In the presence of different study populations, a crucial question to ask before combining information is which population is of primary interest. Most of the methods reviewed in this discussion, if not all, explicitly or implicitly assume that the internal study population is of primary interest and that the external summary information is used for efficiency improvement without causing (too much) bias. This is reasonable, for example, when the internal study has a clear target population and is based on a careful design with well-controlled sampling. In practice, with data usually collected through convenience sampling, the internal study sample may not be representative of the target population, or there may even be ambiguity about the target population itself. Therefore, caution is always needed when applying these methods to combine information in the presence of population heterogeneity, and more methodological developments are definitely needed.
Disclosure statement
No potential conflict of interest was reported by the author.
References
- Chatterjee, N., Chen, Y. H., Maas, P., & Carroll, R. J. (2016). Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources. Journal of the American Statistical Association, 111(513), 107–117. https://doi.org/10.1080/01621459.2015.1123157
- Chen, Z., Ning, J., Shen, Y., & Qin, J. (2021). Combining primary cohort data with external aggregate information without assuming comparability. Biometrics, 77(3), 1024–1036. https://doi.org/10.1111/biom.v77.3
- Estes, J. P., Mukherjee, B., & Taylor, J. M. G. (2018). Empirical Bayes estimation and prediction using summary-level information from external big data sources adjusting for violations of transportability. Statistics in Biosciences, 10(3), 568–586. https://doi.org/10.1007/s12561-018-9217-4
- Gu, T., Taylor, J. M. G., & Mukherjee, B. (2021). A meta-inference framework to integrate multiple external models into a current study. Biostatistics, kxab017. https://doi.org/10.1093/biostatistics/kxab017
- Han, P., & Lawless, J. F. (2016). Discussion of “Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources”. Journal of the American Statistical Association, 111(513), 118–121. https://doi.org/10.1080/01621459.2016.1149399
- Han, P., & Lawless, J. F. (2019). Empirical likelihood estimation using auxiliary summary information with different covariate distributions. Statistica Sinica, 29, 1321–1342.
- Neuhaus, J. M., & Jewell, N. P. (1993). A geometric approach to assess bias due to omitted covariates in generalized linear models. Biometrika, 80(4), 807–815. https://doi.org/10.1093/biomet/80.4.807
- Owen, A. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika, 75(2), 237–249. https://doi.org/10.1093/biomet/75.2.237
- Qin, J. (2000). Combining parametric and empirical likelihoods. Biometrika, 87(2), 484–490. https://doi.org/10.1093/biomet/87.2.484
- Qin, J., & Lawless, J. (1994). Empirical likelihood and general estimating equations. The Annals of Statistics, 22(1), 300–325. https://doi.org/10.1214/aos/1176325370
- Sheng, Y., Sun, Y., Huang, C.-Y., & Kim, M.-O. (2021). Synthesizing external aggregated information in the presence of population heterogeneity: A penalized empirical likelihood approach. Biometrics. https://doi.org/10.1111/biom.13429
- Taylor, J. M. G., Choi, K., & Han, P. (2022). Data integration – exploiting ratios of parameter estimates from a reduced external model. Biometrika. https://doi.org/10.1093/biomet/asac022
- Zhai, Y., & Han, P. (2022). Data integration with oracle use of external information from heterogeneous populations. Journal of Computational and Graphical Statistics. https://doi.org/10.1080/10618600.2022.2050248
- Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429. https://doi.org/10.1198/016214506000000735