A discussion on “A selective review of statistical methods using calibration information from similar studies” by Qin, Liu and Li


We congratulate Qin, Liu and Li (QLL) on a thoughtful and much-needed review of many interesting methods for combining information from similar studies, and we appreciate the opportunity to contribute a discussion. QLL cover a variety of settings and methods. Building on their review, we briefly survey some additional relevant literature, focusing on methods that deal with population heterogeneity: in practice, different studies most likely sample from different populations, and whether information should be combined depends, among many other considerations, on how similar those populations are. To keep the discussion focussed, we follow the setting in Section 5 of QLL, although most of these methods can be applied more broadly.

We adopt the notation in Section 5.1 of QLL with some variations. Let $(Y_{i}, X_{i}^{T}, Z_{i}^{T})^{T}$, $i = 1, \dots, n$, denote the individual data based on a random sample from the internal study, where $Y$ is the response and $X$ and $Z$ are vectors of covariates. The model of interest is $f(Y \mid X, Z; β)$ for $f(Y \mid X, Z)$ with parameter $β$. The external study fitted a (possibly misspecified) model $f(Y \mid X; θ)$ for $f^{*}(Y \mid X)$ with covariates $X$ alone and parameter $θ$. Throughout this discussion we use a *-superscript to denote distributions, expectations and other quantities associated with the external study population. The external model fitting information is summarized by

$$E^{*}\{h(Y, X; θ^{*})\} = 0, \tag{1}$$

where $h(Y, X; θ)$ is the score function for $f(Y \mid X; θ)$ and $θ^{*}$ is the solution to the score equation based on the external study sample. Individual data from the external sample are not available. We assume the external sample size is so large that the uncertainty in $θ^{*}$ is negligible compared with that in the internal study, i.e. Case I in QLL.

QLL in Section 5 give an excellent review of some methods and their comparisons in terms of asymptotic efficiency when the internal and external study populations are the same, based on both the empirical likelihood (EL) formulation (Owen, 1988; Qin & Lawless, 1994) and the constrained maximum likelihood (CML) formulation (Chatterjee et al., 2016; Qin, 2000). These two formulations are closely connected (Han & Lawless, 2016). For ease of discussion, we present the CML formulation here, which is already covered in QLL. When $f(Y, X, Z) = f^{*}(Y, X, Z)$, (1) can be transformed into

$$E\{ψ(X, Z; β)\} = 0, \tag{2}$$

where

$$ψ(X, Z; β) = E\{h(Y, X; θ^{*}) \mid X, Z\} = \int h(Y, X; θ^{*}) f(Y \mid X, Z; β) \, dY. \tag{3}$$

The CML estimator $\hat{β}_{\mathrm{cml}}$ is defined through

$$\begin{aligned} \max_{β, p_{1}, \dots, p_{n}} \quad & \prod_{i=1}^{n} f(Y_{i} \mid X_{i}, Z_{i}; β) p_{i} \\ \text{subject to} \quad & p_{i} \geq 0, \quad \sum_{i=1}^{n} p_{i} = 1, \\ & \sum_{i=1}^{n} p_{i} ψ(X_{i}, Z_{i}; β) = 0, \end{aligned} \tag{4}$$

where $p_{i} \equiv dF(X_{i}, Z_{i})$, $i = 1, \dots, n$.
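
To make (3) and the constraints in (4) concrete, the following is a minimal sketch under assumptions that are ours rather than QLL's: both models are normal linear regressions with known error variance (so only the mean part of the score is used), $X$ and $Z$ are scalar, and $Z = 0.5X + \text{noise}$ so that $θ^{*}$ has a closed form. The function names `psi_linear` and `el_weights` are hypothetical. For a fixed $β$, the weights use the standard Lagrangian form $p_{i} = 1/[n\{1 + λ^{T} ψ_{i}\}]$ from the empirical likelihood literature (Qin & Lawless, 1994), with $λ$ found by a damped Newton iteration.

```python
import numpy as np

def psi_linear(X, Z, beta, theta_star):
    """psi(X, Z; beta) of (3) when both models are normal linear regressions.

    Assumption: the mean part of the reduced-model score is
    h(Y, X; theta) = (1, X)^T (Y - theta_0 - X^T theta_X), up to a constant,
    so integrating Y out under the full model replaces Y by its full-model mean.
    beta is ordered as (beta_0, beta_X, beta_Z).
    """
    X1 = np.column_stack([np.ones(len(X)), X])        # design (1, X)
    full_mean = np.column_stack([X1, Z]) @ beta       # E(Y | X, Z; beta)
    reduced_mean = X1 @ theta_star                    # reduced-model mean at theta*
    return X1 * (full_mean - reduced_mean)[:, None]   # one row of psi per subject

def el_weights(psi, n_iter=50, tol=1e-10):
    """Weights maximizing prod p_i s.t. sum p_i = 1 and sum p_i psi_i = 0."""
    n, q = psi.shape
    lam = np.zeros(q)
    for _ in range(n_iter):
        d = 1.0 + psi @ lam
        g = (psi / d[:, None]).sum(axis=0)            # must vanish at the solution
        if np.linalg.norm(g) < tol:
            break
        scaled = psi / d[:, None]
        step = np.linalg.solve(scaled.T @ scaled, g)  # Newton direction
        t = 1.0
        # Halve the step until every implied weight stays in (0, 1].
        while np.any(1.0 + psi @ (lam + t * step) <= 1.0 / n):
            t /= 2.0
        lam = lam + t * step
    return 1.0 / (n * (1.0 + psi @ lam))

# Hypothetical internal sample in which (2) holds at the true beta.
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=n)
Z = 0.5 * X + rng.normal(size=n)
beta_true = np.array([1.0, 2.0, -1.0])
theta_star = np.array([1.0, 2.0 + 0.5 * (-1.0)])      # implied by Z = 0.5 X + noise

psi = psi_linear(X, Z, beta_true, theta_star)
p = el_weights(psi)
print(p.sum(), p @ psi)
```

A full CML fit would additionally maximize over $β$; the sketch only evaluates the inner step at one value of $β$.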

When $f(Y \mid X, Z) = f^{*}(Y \mid X, Z)$ but $f(X, Z) \neq f^{*}(X, Z)$, applying the transformation (3) to (1) leads to

$$E^{*}\{ψ(X, Z; β)\} = 0, \tag{5}$$

where the expectation $E^{*}(\cdot)$ is under $f^{*}(X, Z)$, in contrast to the $E(\cdot)$ in (2), which is under $f(X, Z)$. In this case, although (2) no longer holds, there are ways to make use of the external study information summarized by (5). One way is to collect a small supplementary sample $(X_{j}^{*T}, Z_{j}^{*T})^{T}$, $j = 1, \dots, n^{*}$, from the external study population (Chatterjee et al., 2016; Han & Lawless, 2019), and define the CML estimator through

$$\begin{aligned} \max_{β, p_{1}^{*}, \dots, p_{n^{*}}^{*}} \quad & \prod_{i=1}^{n} f(Y_{i} \mid X_{i}, Z_{i}; β) \prod_{j=1}^{n^{*}} p_{j}^{*} \\ \text{subject to} \quad & p_{j}^{*} \geq 0, \quad \sum_{j=1}^{n^{*}} p_{j}^{*} = 1, \\ & \sum_{j=1}^{n^{*}} p_{j}^{*} ψ(X_{j}^{*}, Z_{j}^{*}; β) = 0, \end{aligned}$$

where $p_{j}^{*} \equiv dF^{*}(X_{j}^{*}, Z_{j}^{*})$, $j = 1, \dots, n^{*}$. Another way is to specify a density ratio model of the form $f^{*}(X, Z) = \exp(W^{T} α) f(X, Z)$, where $W$ is a subset of $(X^{T}, Z^{T})^{T}$ and $α$ is a newly introduced parameter, to link the external distribution $f^{*}(X, Z)$ to the internal distribution $f(X, Z)$ (Sheng et al., 2021). This approach models the heterogeneity between $f^{*}(X, Z)$ and $f(X, Z)$ by $\exp(W^{T} α)$, in the same spirit as the exponential tilting model that links the case distribution to the control distribution in Section 7.1 of QLL. With the density ratio model, the CML estimator can be defined through

$$\begin{aligned} \max_{β, α, p_{1}, \dots, p_{n}} \quad & \prod_{i=1}^{n} f(Y_{i} \mid X_{i}, Z_{i}; β) p_{i} \\ \text{subject to} \quad & p_{i} \geq 0, \quad \sum_{i=1}^{n} p_{i} = 1, \\ & \sum_{i=1}^{n} p_{i} \{\exp(W_{i}^{T} α) - 1\} = 0, \\ & \sum_{i=1}^{n} p_{i} ψ(X_{i}, Z_{i}; β) \exp(W_{i}^{T} α) = 0, \end{aligned}$$

where $p_{i} \equiv dF(X_{i}, Z_{i})$, $i = 1, \dots, n$. Here the last two constraints are based, respectively, on the fact that $f^{*}(X, Z) = \exp(W^{T} α) f(X, Z)$ is a density and on (5). Note that the dimension of $α$ needs to be no larger than the dimension of $ψ$ plus one for $α$ to be identifiable.
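
The role of the two extra constraints can be illustrated with a small simulation. This is only a sketch under assumptions of our own choosing: a single covariate $X \sim N(0, 1)$ in the internal population, $W = (1, X)^{T}$ (the constant term is our addition so that the tilt integrates to one), and $α = (-a^{2}/2, a)^{T}$, which makes the external distribution $N(a, 1)$. Empirical weights $1/n$ stand in for $p_{i}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
X = rng.normal(size=n)                  # internal sample, X ~ N(0, 1)

# Assumed tilt: exp(alpha_0 + alpha_1 X) with alpha = (-a^2/2, a)
# turns the N(0, 1) density into the N(a, 1) density.
a = 0.5
alpha = np.array([-a**2 / 2.0, a])
W = np.column_stack([np.ones(n), X])
tilt = np.exp(W @ alpha)                # f*(X_i) / f(X_i) for each internal subject

p = np.full(n, 1.0 / n)                 # empirical weights standing in for dF(X_i)

# Constraint that exp(W^T alpha) f is a density: sum p_i {exp(W_i^T alpha) - 1} = 0.
norm_check = np.sum(p * (tilt - 1.0))

# Tilted internal averages estimate external expectations, here E*(X) = a.
ext_mean = np.sum(p * X * tilt)

print(norm_check, ext_mean)
```

The same reweighting is what allows the external moment condition (5) to be expressed through internal data alone in the last constraint.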

When $f(Y \mid X, Z) \neq f^{*}(Y \mid X, Z)$, applying transformation (3) to (1) leads to neither (2) nor (5) in general, regardless of whether $f(X, Z)$ is the same as $f^{*}(X, Z)$. In this case, the aforementioned CML estimators are biased. However, since the external study sample size is large, the reduction in variance from using the external summary information may still benefit the internal study parameter estimation from a mean squared error perspective. Based on this bias-variance trade-off, Estes et al. (2018) proposed an empirical Bayes shrinkage estimator of the form

$$\hat{V}_{0} (\hat{V}_{\mathrm{mle}} + \hat{V}_{0})^{-1} \hat{β}_{\mathrm{mle}} + \hat{V}_{\mathrm{mle}} (\hat{V}_{\mathrm{mle}} + \hat{V}_{0})^{-1} \hat{β}_{\mathrm{cml}},$$

which is a weighted average of $\hat{β}_{\mathrm{mle}}$, the MLE using the internal study data alone, and $\hat{β}_{\mathrm{cml}}$ defined through (4). Here $\hat{V}_{\mathrm{mle}}$ and $\hat{V}_{0}$ are estimated variance matrices for $\hat{β}_{\mathrm{mle}}$ and for the prior normal distribution on $β$, respectively. This method shrinks the final estimate towards $\hat{β}_{\mathrm{mle}}$ in the presence of population heterogeneity and towards $\hat{β}_{\mathrm{cml}}$ otherwise. Gu et al. (2021) extended the idea in Estes et al. (2018) to the case of multiple external studies, and proposed an estimator that is a weighted average of the empirical Bayes estimators resulting from using each external study separately.
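
The weighted average above is straightforward to compute once its four ingredients are available. The sketch below takes them as given inputs; the function name `eb_shrinkage` and the numerical values are hypothetical, and how $\hat{V}_{0}$ is estimated is left to Estes et al. (2018).

```python
import numpy as np

def eb_shrinkage(beta_mle, beta_cml, V_mle, V_0):
    """V_0 (V_mle + V_0)^{-1} beta_mle + V_mle (V_mle + V_0)^{-1} beta_cml."""
    S_inv = np.linalg.inv(V_mle + V_0)
    return V_0 @ S_inv @ beta_mle + V_mle @ S_inv @ beta_cml

# Hypothetical inputs for a two-dimensional beta.
beta_mle = np.array([1.0, 2.0])
beta_cml = np.array([0.8, 2.3])
V_mle = np.array([[0.04, 0.01],
                  [0.01, 0.09]])

# A prior variance of zero returns the CML estimate; a very large one
# returns (almost exactly) the internal-data MLE.
no_heterogeneity = eb_shrinkage(beta_mle, beta_cml, V_mle, np.zeros((2, 2)))
strong_heterogeneity = eb_shrinkage(beta_mle, beta_cml, V_mle, 1e8 * np.eye(2))

print(no_heterogeneity, strong_heterogeneity)
```

The two limiting cases correspond to the shrinkage behaviour described in the text.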

To deal with arbitrary population heterogeneity when information is available from multiple external studies, without introducing estimation bias, Zhai and Han (2022) developed an estimation procedure that simultaneously selects the studies for which (2) holds and incorporates the corresponding information into internal model fitting. Their method also applies under the current setting of only one external study. When (2) does not hold because of an unknown form of heterogeneity, some components of $E\{ψ(X, Z; β)\}$ may still be zero even if $E\{ψ(X, Z; β)\}$ is not a zero vector, and these components still contain useful information for improving internal model estimation efficiency. This observation makes sense because the association between the same response and certain covariates may not differ much across populations under certain specific forms of heterogeneity. Some general examples of this observation are given in Zhai and Han (2022). Let $γ \equiv E\{ψ(X, Z; β)\}$; then the components of $ψ(X, Z; β)$ that correspond to zero-components of $γ$ should be selected to compute $\hat{β}_{\mathrm{cml}}$. To shrink the estimate of the zero-components of $γ$ to exactly zero for information integration, under the current setting, the estimator in Zhai and Han (2022) is defined through

$$\begin{aligned} \max_{β, γ} \quad & \left[ \max_{p_{1}, \dots, p_{n}} \log \left\{ \prod_{i=1}^{n} f(Y_{i} \mid X_{i}, Z_{i}; β) p_{i} \right\} - n \sum_{k=1}^{\dim(γ)} λ_{n} \frac{|γ_{k}|}{|\tilde{γ}_{k}|^{w}} \right] \\ \text{subject to} \quad & p_{i} \geq 0, \quad \sum_{i=1}^{n} p_{i} = 1, \\ & \sum_{i=1}^{n} p_{i} \{ψ(X_{i}, Z_{i}; β) - γ\} = 0, \end{aligned}$$

with an adaptive Lasso penalty (Zou, 2006) on $γ$ with tuning parameter $λ_{n} > 0$, where $\tilde{γ}_{k}$ is the $k$th component of $\tilde{γ} = n^{-1} \sum_{i=1}^{n} ψ(X_{i}, Z_{i}; \hat{β}_{\mathrm{mle}})$ and $w > 0$ is a user-specified number typically taken to be 1 or 2. A similar idea, with a different penalty, has also been considered for integrating external summary information into survival analysis (Chen et al., 2021).
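
As a small illustration of how the adaptive weights act, the sketch below evaluates only the penalty term $n \sum_{k} λ_{n} |γ_{k}| / |\tilde{γ}_{k}|^{w}$ for given inputs. The function names and numbers are hypothetical, and the joint maximization over $(β, γ)$ is not attempted here.

```python
import numpy as np

def initial_gamma(psi_at_mle):
    """gamma-tilde: column means of psi(X_i, Z_i; beta_mle), one row per subject."""
    return psi_at_mle.mean(axis=0)

def adaptive_lasso_penalty(gamma, gamma_tilde, lam_n, n, w=1.0):
    """n * sum_k lam_n * |gamma_k| / |gamma_tilde_k|^w."""
    return n * lam_n * np.sum(np.abs(gamma) / np.abs(gamma_tilde) ** w)

# Hypothetical initial estimates: the first component looks like a true zero,
# the second does not.
gamma_tilde = np.array([0.01, 0.5])
gamma = np.array([0.1, 0.1])

per_component_weight = 1.0 / np.abs(gamma_tilde) ** 1.0   # 100 versus 2
penalty = adaptive_lasso_penalty(gamma, gamma_tilde, lam_n=0.05, n=100, w=1.0)

print(per_component_weight, penalty)
```

Components with a small initial estimate $\tilde{γ}_{k}$ receive a much heavier penalty, which is what drives them to exactly zero while leaving clearly non-zero components nearly unpenalized.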

All of the aforementioned methods make use of the external summary information through transformation (3). Taylor et al. (2022) took a different approach when both $f(Y \mid X, Z; β)$ and $f(Y \mid X; θ)$ are generalized linear models (GLMs), namely

$$g\{E(Y \mid X, Z)\} = β_{0} + X^{T} β_{X} + Z^{T} β_{Z} \tag{6}$$

and $l\{E(Y \mid X)\} = θ_{0} + X^{T} θ_{X}$, with possibly different link functions $g(\cdot)$ and $l(\cdot)$ (note that the second GLM may be misspecified). Here the notation $E(\cdot)$ for expectation is generic and simply presents the form of the model. For ease of discussion, assume the covariates $Z$ have been orthogonalized to $X$, which can be done by taking $Z$ to be the vector of residuals from the least squares regression of each covariate in $Z$ on $X$ using the internal data. Taylor et al. (2022) showed that $β_{X} \approx c θ_{X}$ for some unknown constant $c$ when both GLMs are fitted to the same infinitely large sample and when $β_{X}$, $β_{Z}$ and $θ_{X}$ are all close to zero (in this sentence, with some abuse of notation, $β_{X}$, $β_{Z}$ and $θ_{X}$ are the values obtained after fitting both GLMs to the same infinitely large sample). See also Neuhaus and Jewell (1993). Based on this result, when $f(Y \mid X)$ and $f^{*}(Y \mid X)$ are similar in the sense that the relative effects of $X$ on $Y$ (but not necessarily the absolute magnitudes) are the same in the two populations, the $θ_{X}^{*}$ produced by the external study can still be used to improve the internal estimation efficiency. With $Z$ orthogonalized to $X$, instead of fitting (6), Taylor et al. (2022) proposed to fit $g\{E(Y \mid X, Z)\} = β_{0} + α (X^{T} θ_{X}^{*}) + Z^{T} β_{Z}$ with coefficients $(β_{0}, α, β_{Z}^{T})^{T}$ to the internal study data, which is equivalent to letting $β_{X} = α θ_{X}^{*}$.
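
The two steps described above (orthogonalizing $Z$ to $X$, then regressing on the single index $X^{T} θ_{X}^{*}$) can be sketched as follows. For brevity we assume an identity link so that ordinary least squares suffices; the simulated data and the choice $θ_{X}^{*} = 2 β_{X}$ (same relative effects, different magnitude) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
X = rng.normal(size=(n, 2))
Z = 0.5 * X[:, [0]] + rng.normal(size=(n, 1))      # Z correlated with X

# Step 1: orthogonalize Z to X using least squares residuals (internal data).
X1 = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(X1, Z, rcond=None)
Z_res = Z - X1 @ coef

# Hypothetical internal truth, stated in terms of the orthogonalized Z.
beta_X = np.array([1.0, -0.5])
Y = 1.0 + X @ beta_X + 0.8 * Z_res[:, 0] + rng.normal(size=n)

# External summary: same relative effects of X, twice the magnitude.
theta_X_star = 2.0 * beta_X

# Step 2: fit beta_0 + alpha * (X^T theta_X*) + Z^T beta_Z by least squares.
index = X @ theta_X_star
design = np.column_stack([np.ones(n), index, Z_res])
fit, *_ = np.linalg.lstsq(design, Y, rcond=None)
beta_0_hat, alpha_hat, beta_Z_hat = fit[0], fit[1], fit[2:]

# Implied estimate of beta_X under the restriction beta_X = alpha * theta_X*.
beta_X_hat = alpha_hat * theta_X_star
print(alpha_hat, beta_X_hat)
```

Only one coefficient, $α$, is estimated for $X$ regardless of the dimension of $X$, which is the source of the efficiency gain; with a non-identity link the least squares fit in step 2 would be replaced by a GLM fit.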

In the presence of different study populations, a crucial question to ask before combining information is which population is of primary interest. Most of the methods reviewed in this discussion, if not all, explicitly or implicitly assume that the internal study population is of primary interest and that the external summary information is used for efficiency improvement without causing (too much) bias. This is reasonable, for example, when the internal study has a clear target population and is based on a careful design with well-controlled sampling. In practice, with data usually collected through convenience sampling, the internal study sample may not be representative of the target population, or there may even be ambiguity about the target population itself. Therefore, some caution is always needed when applying these methods to combine information in the presence of population heterogeneity, and more methodological developments are definitely needed.

Disclosure statement

No potential conflict of interest was reported by the author.

References

Chatterjee, N., Chen, Y. H., Maas, P., & Carroll, R. J. (2016). Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources. Journal of the American Statistical Association, 111(513), 107–117. https://doi.org/10.1080/01621459.2015.1123157
Chen, Z., Ning, J., Shen, Y., & Qin, J. (2021). Combining primary cohort data with external aggregate information without assuming comparability. Biometrics, 77(3), 1024–1036. https://doi.org/10.1111/biom.v77.3
Estes, J. P., Mukherjee, B., & Taylor, J. M. G. (2018). Empirical Bayes estimation and prediction using summary-level information from external big data sources adjusting for violations of transportability. Statistics in Biosciences, 10(3), 568–586. https://doi.org/10.1007/s12561-018-9217-4
Gu, T., Taylor, J. M. G., & Mukherjee, B. (2021). A meta-inference framework to integrate multiple external models into a current study. Biostatistics, kxab017. https://doi.org/10.1093/biostatistics/kxab017
Han, P., & Lawless, J. F. (2016). Discussion of “Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources”. Journal of the American Statistical Association, 111(513), 118–121. https://doi.org/10.1080/01621459.2016.1149399
Han, P., & Lawless, J. F. (2019). Empirical likelihood estimation using auxiliary summary information with different covariate distributions. Statistica Sinica, 29, 1321–1342.
Neuhaus, J. M., & Jewell, N. P. (1993). A geometric approach to assess bias due to omitted covariates in generalized linear models. Biometrika, 80(4), 807–815. https://doi.org/10.1093/biomet/80.4.807
Owen, A. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika, 75(2), 237–249. https://doi.org/10.1093/biomet/75.2.237
Qin, J. (2000). Combining parametric and empirical likelihoods. Biometrika, 87(2), 484–490. https://doi.org/10.1093/biomet/87.2.484
Qin, J., & Lawless, J. (1994). Empirical likelihood and general estimating equations. The Annals of Statistics, 22(1), 300–325. https://doi.org/10.1214/aos/1176325370
Sheng, Y., Sun, Y., Huang, C.-Y., & Kim, M.-O. (2021). Synthesizing external aggregated information in the presence of population heterogeneity: A penalized empirical likelihood approach. Biometrics. https://doi.org/10.1111/biom.13429
Taylor, J. M. G., Choi, K., & Han, P. (2022). Data integration – exploiting ratios of parameter estimates from a reduced external model. Biometrika. https://doi.org/10.1093/biomet/asac022
Zhai, Y., & Han, P. (2022). Data integration with oracle use of external information from heterogeneous populations. Journal of Computational and Graphical Statistics. https://doi.org/10.1080/10618600.2022.2050248
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429. https://doi.org/10.1198/016214506000000735
