Full article: A selective review of statistical methods using calibration information from similar studies

Formulae display: $MathJax Logo$ ?Mathematical formulae have been encoded as MathML and are displayed in this HTML version using MathJax in order to improve their display. Uncheck the box to turn MathJax off. This feature requires Javascript. Click on a formula to zoom.

Abstract

In the era of big data, divide-and-conquer, parallel, and distributed inference methods have become increasingly popular. How to effectively use the calibration information from each machine in parallel computation has become a challenging task for statisticians and computer scientists. Many newly developed methods have roots in traditional statistical approaches that make use of calibration information. In this paper, we first review some classical statistical methods for using calibration information, including simple meta-analysis methods, parametric likelihood, empirical likelihood, and the generalized method of moments. We further investigate how these methods incorporate summarized or auxiliary information from previous studies, related studies, or populations. We find that the methods based on summarized data usually have little or nearly no efficiency loss compared with the corresponding methods based on all-individual data. Finally, we review some recently developed big data analysis methods including communication-efficient distributed approaches, renewal estimation, and incremental inference as examples of the latest developments in methods using calibration information.

Keywords:

1. Introduction

Statistical inference with big data can be extremely challenging owing to the high volume and large variety of observed quantities. Currently, one of the most popular approaches to this problem in statistics and computer science is the divide-and-conquer paradigm. The basic idea of this method is to break down a problem recursively into two or more sub-problems of the same or related type, such that each sub-problem becomes simple enough to be solved easily. The solution to the original problem is the optimal combination of the solutions to the sub-problems. A closely related statistical method is called parallel and distributed inference. In essence, large amounts of observed data are stored in different machines in a distributed manner. The computation is often relatively inexpensive in each machine. Then, communication is essential to enable assembly of the available results from all machines. Many related references can be found in, for example, Jordan et al. (Citation2019). Although many new statistical methods have been developed for big data analysis, most of them have roots in traditional statistical methods of combining auxiliary information.

Combining information from similar studies has been and will continue to be an extremely important strategy in statistical inference. The most popular example of such methods is meta-analysis, in which the published results of multiple similar scientific studies are pooled to produce an enhanced estimate without using the raw individual data from each study. We refer to Borenstein et al. (Citation2009) for a comprehensive introduction to meta-analysis. For various reasons such as privacy or capacity of computer storage, in massive data inference, only summarized data rather than the original individual data may be available. This poses a very challenging problem: how to conduct efficient updated inference by making full use of the summarized data? In recent years, many methods of combining information have been developed in economic studies, machine learning, and distributed statistical inference. The goal of this paper is to selectively review a few popular methods that are able to integrate information in different disciplines.

Utilizing external summary data or auxiliary information to obtain more accurate inference is an old and effective method in survey sampling. Owing to restrictions such as cost effectiveness or convenience, the variable of interest Y may be available for only a small portion of individuals. However, the explanatory variable X associated with Y may readily be available for all individuals. Cochran (Citation1977) presented a comprehensive discussion on regression-type estimators making use of the summarized information from X. Chen and Qin (Citation1993), Chen et al. (Citation2002), and Wu and Sitter (Citation2001) used empirical likelihood (EL; Owen, Citation1988) to incorporate such information in finite populations.

With advances in technology, many summarized statistical results have become available in public domains. For example, many aggregated demographic and socioeconomic status data are provided in the US census reports. The Surveillance, Epidemiology, and End Results (SEER) programme of the National Cancer Institute provides population-based cancer survival statistics such as covariate-specific survival probabilities. Imbens and Lancaster (Citation1994) combined micro and macro data in economic studies through the generalized method of moments (GMM). Chaudhuri et al. (Citation2008) showed that inclusion of population-level information could reduce bias and increase the efficiency of the parameter estimates in a generalized linear model setup. Wu and Thompson (Citation2020) published an excellent monograph on combining auxiliary information in survey sampling.

In this paper, we consider two situations. In the first, the summarized information from different studies was derived using the same statistical model. Second, the summarized information was derived using statistical models that were similar but not exactly the same. In general, combining information in the former case is easier. The latter case is more complex, as one has to take into consideration the heterogeneity among different studies.

The rest of this paper is organized as follows. In Section 2, we briefly review two simple and popular meta-analysis methods for combining similar results. In Section 3, we review Owen's (Citation1988) EL method and Qin and Lawless's (Citation1994) over-identified parameter problem as examples of general tools for synthesizing information from summarized data. In particular, we present a new way of deriving the lower information bound for the over-identified parameter problem. Section 4 discusses enhanced inference by utilizing auxiliary information. Section 5 presents results on more flexible meta-analyses where information on different covariates are available in similar studies. Calibration of information from previous studies is described in Section 6. We discuss methods of using disease prevalence information for more efficient estimation in case–control studies in Section 7. The popular communication-efficient distributed statistical inference method used in machine learning is discussed in Section 8. Renewal estimation and incremental inference are briefly presented in Section 9. Finally, some further discussion is presented in Section 10.

2. Two simple information-combining methods

2.1. Convex combination

Suppose that ${\hat{θ}}_{1}$ and ${\hat{θ}}_{2}$ are two asymptotically unbiased estimators for θ from two independent studies, and that they satisfy ${\hat{θ}}_{i} \sim N (θ, σ_{i}^{2}), i = 1, 2$ . The most straightforward way of combining ${\hat{θ}}_{1}$ and ${\hat{θ}}_{2}$ is a convex combination, $\hat{θ} = α {\hat{θ}}_{1} + (1 - α) {\hat{θ}}_{2}, 0 < α < 1.$ The asymptotic variance of $\hat{θ}$ is $σ^{2} = α^{2} σ_{1}^{2} + (1 - α)^{2} σ_{2}^{2},$ which takes its minimum at $α = σ_{2}^{2} / (σ_{1}^{2} + σ_{2}^{2})$ . This suggests combining ${\hat{θ}}_{1}$ and ${\hat{θ}}_{2}$ by $\hat{θ} = \frac{σ_{2}^{2}}{σ_{1}^{2} + σ_{2}^{2}} {\hat{θ}}_{1} + \frac{σ_{1}^{2}}{σ_{1}^{2} + σ_{2}^{2}} {\hat{θ}}_{2} = \frac{{\hat{θ}}_{1} / σ_{1}^{2} + {\hat{θ}}_{2} / σ_{2}^{2}}{1 / σ_{1}^{2} + 1 / σ_{2}^{2}},$ an inverse-variance weighting estimator. In general, $σ_{1}^{2}$ and $σ_{2}^{2}$ are unknown; we may replace them by their estimators ${\hat{σ}}_{1}^{2}$ and ${\hat{σ}}_{2}^{2}$ , respectively, which leads to $\hat{θ} = \frac{{\hat{σ}}_{2}^{2}}{{\hat{σ}}_{1}^{2} + {\hat{σ}}_{2}^{2}} {\hat{θ}}_{1} + \frac{{\hat{σ}}_{1}^{2}}{{\hat{σ}}_{1}^{2} + {\hat{σ}}_{2}^{2}} {\hat{θ}}_{2} = \frac{{\hat{θ}}_{1} / {\hat{σ}}_{1}^{2} + {\hat{θ}}_{2} / {\hat{σ}}_{2}^{2}}{1 / {\hat{σ}}_{1}^{2} + 1 / {\hat{σ}}_{2}^{2}} .$ As an alternative method, we may use the maximum likelihood method to argue that this is the best estimator. We can treat ${\hat{θ}}_{i}$ as a direct observation from ${\hat{θ}}_{i} | θ \sim N (θ, σ_{i}^{2})$ , i = 1, 2. Then, the log-likelihood (regarding $σ_{1}^{2}$ and $σ_{2}^{2}$ as known constants) is $- ({\hat{θ}}_{1} - θ)^{2} / (2 σ_{1}^{2}) - ({\hat{θ}}_{2} - θ)^{2} / (2 σ_{2}^{2}) .$ Maximizing this likelihood with respect to θ or setting the score function to be zero, we end up with the same inverse-variance weighting estimator.

2.2. Random-effect meta-analysis

Dersimonian and Laird (Citation1986) proposed a moment-based estimation method using a random-effect model for meta-analysis. Let ${\hat{θ}}_{i}$ be an estimator of $θ_{i}$ from the i-th study, $i = 1, 2, \dots, K$ . For example, ${\hat{θ}}_{i}$ could be the estimated mean response from the i-th study. When the sample size $n_{i}$ in the i-th study is reasonably large, we may assume that $\begin{aligned} {\hat{θ}}_{i} | θ_{i} \sim N (θ_{i}, w_{i}^{- 1}), θ_{i} \sim N (θ, τ^{2}), \\ i = 1, 2, \dots, K, \end{aligned}$ where the $w_{i}^{- 1}$ s are treated as known. Although the normal models hold to be true approximately, we assume that they are all true for ease of theoretical development. The goal here is to better estimate θ by combining the results from all the studies.

Unconditionally, we have ${\hat{θ}}_{i} \sim N (θ, w_{i}^{- 1} + τ^{2}) .$ Consider the following inverse-variance weighting estimator for θ: $\hat{θ} = \frac{\sum_{i = 1}^{K} {\hat{θ}}_{i} w_{i}}{\sum_{i = 1}^{K} w_{i}}$ with variance $V a r (\hat{θ}) = \sum_{i = 1}^{K} w_{i}^{2} (w_{i}^{- 1} + τ^{2}) / {(\sum_{i = 1}^{K} w_{i})}^{2} .$ Define $\begin{aligned} Q & = \sum_{i = 1}^{K} w_{i} ({\hat{θ}}_{i} - \hat{θ})^{2} = \sum_{i = 1}^{K} w_{i} ({\hat{θ}}_{i} - θ)^{2} \\ - (\hat{θ} - θ)^{2} \sum_{i = 1}^{K} w_{i} . \end{aligned}$ We can easily check that $E (Q) = (K - 1) + τ^{2} (\sum_{i = 1}^{K} w_{i} - \sum_{i = 1}^{K} w_{i}^{2} / \sum_{j = 1}^{K} w_{j}),$ which implies that a natural estimator of $τ^{2}$ is ${\hat{τ}}^{2} = \frac{Q - (K - 1)}{\sum_{i = 1}^{K} w_{i} - \sum_{i = 1}^{K} w_{i}^{2} / \sum_{j = 1}^{K} w_{j}} .$ For small sample sizes, there is no guarantee that this estimator is non-negative; one may replace it by $max ({\hat{τ}}^{2}, 0) .$

Alternatively, we may estimate τ using the likelihood approach. The joint likelihood based on the ${\hat{θ}}_{i}$ s is $ℓ (θ, τ) = - \frac{1}{2} \sum_{i = 1}^{K} \frac{({\hat{θ}}_{i} - θ)^{2}}{τ^{2} + w_{i}^{- 1}} - \frac{1}{2} \sum_{i = 1}^{K} \log (τ^{2} + w_{i}^{- 1}) .$ Maximizing ℓ with respect to θ and $τ^{2}$ gives their maximum likelihood estimators (MLEs).

Lin and Zeng (Citation2010) compared the relative efficiency of using summary statistics versus individual-level data in meta-analysis. They found that in general there was no information loss when using the summarized information compared with inference based on the original individual data when available.

3. Empirical likelihood and general estimating equations

In this section we briefly review Owen's (Citation1988) EL and Qin and Lawless' (Citation1994) estimating equations approaches, as those methods represent general tools for assembly of information from different sources. The maximum likelihood method for regular parametric models is among the most popular methods in statistical inference, as it has many nice properties. However, model mis-specification is a major concern, as a mis-specified model may lead to biased results. For the case when the underlying distribution is multinomial, Hartely and Rao (Citation1968) proposed a mean constrained estimator for the population total in survey sampling problems. To mimic the parametric likelihood but discard parametric model assumptions, Owen (Citation1988) and Owen (Citation1990) proposed the EL method, which is a natural generalization of the multinomial likelihood when the number of categories is equal to the sample size. The EL approach can be thought of as a bootstrap that does not resample, or as a likelihood without parametric assumptions (Owen, Citation2001).

3.1. Definition of empirical likelihood

Suppose that $X_{1}, \dots, X_{n}$ are n independent and identically distributed observations from X, with cumulative distribution F. For convenience, we assume there are no ties, i.e., any two observations are unequal to each other. The techniques developed below can be easily adapted to handle ties. Let $d F (X_{i}), i = 1, 2, \dots, n,$ be the jumps of $F (x)$ at the observed data points. The nonparametric likelihood is $L (F) = \prod_{i = 1}^{n} p_{i} .$ It is clear that if any $p_{i} = 0$ , then $L (F) = 0$ , and if $\sum_{i = 1}^{n} p_{i} < 1$ , then $L (F) < L (F_{*})$ , where $F_{*} (x) = \sum_{i = 1}^{n} p_{i} I (X_{i} \leq x) / \sum_{i = 1}^{n} p_{i}$ . According to the likelihood principle (that parameters with larger likelihoods are preferable), one need only consider the distribution functions $F (x)$ with $p_{i} > 0$ and $\sum_{i = 1}^{n} p_{i} = 1$ .

If we maximize the log-likelihood (1) $ℓ (F) = \sum_{i = 1}^{n} \log p_{i}$ (1) subject to the constraints (2) $\sum_{i = 1}^{n} p_{i} = 1, p_{i} \geq 0,$ (2) then we obtain $p_{i} = 1 / n, i = 1, 2, \dots, n .$ Therefore, the maximum EL estimator of F is $F_{n} (x) = \sum_{i = 1}^{n} p_{i} I (X_{i} \leq x) = n^{- 1} \sum_{i = 1}^{n} I (X_{i} \leq x) .$ This is why the empirical distribution is called the nonparametric MLE of $F (x)$ .

Suppose we are interested in constructing a confidence interval for $μ = E (X) = \int x d F (x)$ , the mean of X. Since we have discretized F at each of the observed data points, the integral becomes $μ = \sum_{i = 1}^{n} p_{i} X_{i}$ . Next, we maximize the nonparametric log-likelihood subject to an extra constraint: (3) $\sum_{i = 1}^{n} p_{i} (X_{i} - μ) = 0.$ (3) Maximizing the log-likelihood (Equation1(1) $ℓ (F) = \sum_{i = 1}^{n} \log p_{i}$ (1) ) subject to constraints (Equation2(2) $\sum_{i = 1}^{n} p_{i} = 1, p_{i} \geq 0,$ (2) ) and (Equation3(3) $\sum_{i = 1}^{n} p_{i} (X_{i} - μ) = 0.$ (3) ), the Lagrange multiplier method gives the profile log-likelihood of μ, (4) $ℓ_{n} (μ) = - \sum_{i = 1}^{n} \log {1 + λ^{⊤} (X_{i} - μ)} - n \log n,$ (4) where λ is the solution to $\sum_{i = 1}^{n} (X_{i} - μ) / {1 + λ^{⊤} (X_{i} - μ)} = 0.$ We can treat $ℓ_{n} (μ)$ as a parametric likelihood of μ. Based on this likelihood, the maximum EL estimator of μ is $\hat{μ} = \bar{X} = n^{- 1} \sum_{i = 1}^{n} X_{i}$ , which is exactly the sample mean. We define the likelihood ratio function as $R_{n} (μ) = 2 {max_{μ} ℓ_{n} (μ) - ℓ_{n} (μ)} = 2 {ℓ_{n} (\bar{X}) - ℓ_{n} (μ)} .$ Under the regularity conditions specified in Owen (Citation1988) and Owen (Citation1990), as n goes to infinity, $R_{n} (μ_{0})$ converges to the $χ^{2}$ distribution with p degrees of freedom, where p is the dimension of μ, and $μ_{0}$ is the true value of μ.

3.2. General estimating equations

The original EL was mainly used to make inference for linear functionals of the underlying population distribution such as the population mean (Owen, Citation1988, Citation1990). Qin and Lawless (Citation1994) applied this method to general estimating models, which greatly broadened its applications. Specifically, suppose the population of interest satisfies a general estimating equation (5) $E {g (X, θ)} = 0,$ (5) for a $r \times 1$ vector-valued function g and some θ, which is a $p \times 1$ parameter to be estimated. We assume $r \geq p$ as otherwise the true parameter value of θ would be undefined.

For general estimating equations with r>p or over-identified models, Hansen (Citation1982) proposed the celebrated GMM, which has become one of the most popular methods in the econometric community. In essence, the GMM minimizes ${\sum_{i = 1}^{n} g (X_{i}, θ)}^{⊤} Σ^{- 1} {\sum_{i = 1}^{n} g (X_{i}, θ)}$ with respect to θ, where Σ is the variance matrix of the estimating equation $g (X, θ)$ . If Σ is unknown, we may replace it by the sample variance $\hat{Σ} = \frac{1}{n} \sum_{i = 1}^{n} g (X_{i}, \tilde{θ}) g^{⊤} (X_{i}, \tilde{θ}),$ where $\tilde{θ}$ is an initial and consistent estimate of θ.

Instead of GMM, Qin and Lawless (Citation1994) used the EL to make inferences for parameters defined by a general estimating equation. For discretized $F (x)$ satisfying (Equation2(2) $\sum_{i = 1}^{n} p_{i} = 1, p_{i} \geq 0,$ (2) ), Equation (Equation5(5) $E {g (X, θ)} = 0,$ (5) ) becomes (6) $\sum_{i = 1}^{n} p_{i} g (X_{i}, θ) = 0.$ (6) Maximizing the log-likelihood (Equation1(1) $ℓ (F) = \sum_{i = 1}^{n} \log p_{i}$ (1) ) subject to (Equation2(2) $\sum_{i = 1}^{n} p_{i} = 1, p_{i} \geq 0,$ (2) ) and (Equation6(6) $\sum_{i = 1}^{n} p_{i} g (X_{i}, θ) = 0.$ (6) ), we have the following profile log-likelihood of θ (up to a constant): $ℓ_{n} (θ) = - \sum_{i = 1}^{n} \log {1 + λ^{⊤} g (X_{i}, θ)},$ where λ is the Lagrange multiplier determined by $\sum_{i = 1}^{n} g (X_{i}, θ) / {1 + λ^{⊤} g (X_{i}, θ)} = 0.$ We then estimate θ by the maximizer $\hat{θ} = \arg max_{θ} ℓ_{n} (θ)$ , whose limiting distribution is established in the following theorem. Hereafter, we use $\nabla_{θ}$ to denote the differentiation operator with respect to θ.

Theorem 3.1

Qin & Lawless, Citation1994

Denote $g = g (X, θ_{0})$ and $\nabla_{θ^{⊤}} g = \nabla_{θ^{⊤}} g (X, θ_{0})$ . Suppose that (1) $E (g g^{⊤})$ is positive definite, (2) $\nabla_{θ^{⊤}} g (X, θ)$ is continuous in a neighbourhood of $θ_{0}$ , (3) $‖ \nabla_{θ^{⊤}} g (X, θ) ‖$ and $‖ g (X, θ) ‖^{3}$ can be bounded by some integrable function $G (X)$ in this neighbourhood, and (4) $E (\nabla_{θ^{⊤}} g)$ is of full rank. Then, as $n \to \infty$ , $\sqrt{n} (\hat{θ} - θ) \overset{d}{⟶} N (0, V),$ where $\overset{d}{⟶}$ means ‘convergence in distribution’ and (7) $V = {E (\nabla_{θ} g^{⊤}) (E g g^{⊤})^{- 1} E (\nabla_{θ^{⊤}} g)}^{- 1} .$ (7)

3.3. Calculation of the information bound

Assuming that the parameter of interest satisfies the general estimating equation $E {g (X, θ)} = 0$ , we next consider how well we can estimate θ based on this model, and whether the maximum EL estimator is optimal. To answer these questions, we consider an ideal situation, where the probability function X has a parametric form $f (x, θ)$ , which is known up to θ. We define $\begin{aligned} h (x, η, θ) & = \exp {η^{⊤} g (x, θ)} f (x, θ) \div \\ \int \exp {η^{⊤} g (t, θ)} f (t, θ) d t, \end{aligned}$ implicitly assuming that $\int \exp {η^{⊤} g (t, θ)} f (t, θ) d t < \infty .$ Clearly, $h (x, η, θ)$ is an enlarged parametric model of $f (x, θ)$ as it reduces to $f (x, θ)$ when $η = 0$ . As the parametric form $f (x, θ)$ is unknown in practice, we anticipate that any estimator based on the moment constraints $E {g (X, θ)} = 0$ should have a variance that is no less than that of the MLE derived from the enlarged model. We show that even if the form of $f (x, θ)$ is available, the MLE of θ based on $h (x, η, θ)$ has the same asymptotic variance as the maximum EL estimator.

With the parametric model h, we can estimate θ by maximizing $L (θ, η) = \prod_{i = 1}^{n} h (X_{i}, η, θ)$ with respect to $(θ, η)$ . We denote the resulting MLE by $(\tilde{η}, \tilde{θ})$ . We show in Section 3.4 that under some regularity conditions on h (see, e.g., Theorems 14 and 23 of van de Vaart (Citation2000)), as $n \to \infty$ , (8) $\sqrt{n} (\tilde{θ} - θ) \overset{d}{⟶} N (0, V),$ (8) where V is defined in (Equation7(7) $V = {E (\nabla_{θ} g^{⊤}) (E g g^{⊤})^{- 1} E (\nabla_{θ^{⊤}} g)}^{- 1} .$ (7) ). In general, the parametric form $f (x, θ)$ is unknown; hence, we expect that the best estimator of θ should have an asymptotic variance at least as large as V. As the maximum EL estimator of θ of Qin and Lawless (Citation1994) has asymptotic variance V, we conclude that it achieves the lower information bound.

Remark 3.1

If $g (x, θ)$ is an unbounded function of x for each θ, we may construct a new density $h (x, θ, η) = \frac{ψ {η^{⊤} g (x, θ)} f (x, θ)}{\int ψ {η^{⊤} g (t, θ)} f (t, θ) d t},$ where $ψ (x) = 2 (1 + e^{- 2 x})^{- 1}$ with $ψ (0) = ψ^{'} (0) = 1$ . Clearly, ψ is bounded. We may go through the same derivations to get the same conclusion.

Remark 3.2

Back and Brown (Citation1992) established a similar result by constructing an exponential family. In particular, they defined $h (x, θ) = \exp {ξ^{⊤} (θ) g (x, θ_{0}) - a (θ)} f_{0} (x),$ where $f_{0} (x) = f (x, θ_{0})$ and $ξ (θ)$ is determined implicitly by the following conditions: $ξ (θ_{0}) = 0$ , $a (θ_{0}) = 0$ , $\int \exp {ξ^{⊤} (θ) g (x, θ_{0}) - a (θ)} f_{0} (x) d x = 1,$ and $\int g (x, θ) \exp {ξ^{⊤} (θ) g (x, θ_{0}) - a (θ)} f_{0} (x) d x = 0.$ In Back & Brown's approach, $ξ (θ)$ is determined implicitly by the above constraint equation, whereas in our new approach, η is an independent parameter.

3.4. A sketched proof of (8)

The log-likelihood based on the enlarged model is $ℓ (θ, η) = \sum_{i = 1}^{n} \log {h (X_{i}, η, θ)}$ , where $\begin{aligned} \log {h (x, η, θ)} & = η^{⊤} g (x, θ) + \log f (x, θ) \\ - \log [\int \exp {η^{⊤} g (t, θ)} f (t, θ) d t] . \end{aligned}$ If $\log {h (x, η, θ)}$ satisfies the conditions of Theorem 14 of van de Vaart (Citation2000) on $m_{θ} (x)$ , then $(\tilde{θ}, \tilde{η})$ is consistent with $(θ_{0}, 0)$ .

Result (Equation8(8) $\sqrt{n} (\tilde{θ} - θ) \overset{d}{⟶} N (0, V),$ (8) ) follows from Theorem 23 of van de Vaart (Citation2000). With tedious algebra, we find that $\begin{aligned} \nabla_{θ} \log {h (x, θ_{0}, 0)} = 0, \\ \nabla_{η} \log {h (x, θ_{0}, 0)} = g (x, θ_{0}), \\ E \nabla_{θ θ^{⊤}} \log {h (X, θ_{0}, 0)} = 0, \\ E \nabla_{η η^{⊤}} \log {h (X, θ_{0}, 0)} = - E (g g^{⊤}), \\ E \nabla_{η θ^{⊤}} \log {h (X, θ_{0}, 0)} = E \nabla_{θ^{⊤}} g (X, θ_{0}) \\ - E {\nabla_{θ^{⊤}} g (X, θ_{0}) + g (X, θ_{0}) \nabla_{θ^{⊤}} \log f (X, θ_{0})} . \end{aligned}$ Under some mild assumptions, such as that $\int g (x, θ) \times f (x, θ) d x = 0$ holds for θ in a neighbourhood of $θ_{0}$ , differentiating both sides with respect to θ leads to $E {\nabla_{θ} g (X, θ)} + E {\nabla_{θ} g (X, θ) \nabla_{θ} \log f (X, θ)} = 0,$ which means $E \nabla_{η θ^{⊤}} \log {h (X, θ_{0}, 0)} = E {\nabla_{θ^{⊤}} g (X, θ_{0})} .$ As $(\tilde{θ}, \tilde{η})$ is consistent with $(θ_{0}, 0)$ , by Theorem 5.23 of van de Vaart (Citation2000), we have (9) $\begin{aligned} \sqrt{n} (\begin{matrix} \tilde{θ} - θ_{0} \\ \tilde{η} - 0 \end{matrix}) = {(\begin{array}{cc} 0 & E (\nabla_{θ} g^{⊤}) \\ E (\nabla_{θ^{⊤}} g) & - E (g g^{⊤}) \end{array})}^{- 1} \\ \times (\begin{matrix} 0 \\ n^{- 1 / 2} \sum_{i = 1}^{n} g (X_{i}, θ_{0}) \end{matrix}) + o_{p} (1) . \end{aligned}$ (9) This, together with the fact that $n^{- 1 / 2} \sum_{i = 1}^{n} g (X_{i}, θ_{0}) \overset{d}{⟶} N (0, E (g g^{⊤}))$ as n goes to infinity, implies (Equation8(8) $\sqrt{n} (\tilde{θ} - θ) \overset{d}{⟶} N (0, V),$ (8) ).

3.5. Empirical entropy family

Again we assume that the available information is given by the estimating equation $E {g (X, θ)} = 0$ . The enlarged parametric model $h (x, η, θ)$ satisfies $\int h (x, η, θ) g (x, θ) d x = 0$ only if $η = 0$ . Naturally, one may require $η = η (θ)$ to satisfy $\int g (x, θ) \exp {η^{⊤} g (x, θ)} f (x, θ) d x = 0.$ It is often too restrictive to assume a known underlying parametric model $f (x, θ)$ in the construction of the enlarged parametric model $h (x, η, θ)$ . We may replace the cumulative distribution function $F (x, θ) = \int_{- \infty}^{x} f (t, θ) d t$ by the empirical distribution $F_{n} (x) = n^{- 1} \sum_{i = 1}^{n} I (X_{i} \leq x)$ . In this situation, $η = η (θ)$ is the solution to $\sum_{i = 1}^{n} g (x_{i}, θ) \exp {η^{⊤} g (x_{i}, θ)} = 0.$

Let $H (x, η, θ) = \int_{- \infty}^{x} h (t, η, θ) d t$ . For fixed parameter values $(η, θ)$ , the jump of H at $x = X_{i}$ is $\begin{aligned} d H (X_{i}, η, θ) & = \exp {η^{⊤} (θ) g (X_{i}, θ)} \div \\ [\sum_{j = 1}^{n} \exp {η^{⊤} (θ) g (X_{j}, θ)}], \end{aligned}$ and the likelihood becomes $\prod_{i = 1}^{n} d H (X_{i}, η, θ) = \prod_{i = 1}^{n} \frac{\exp {η^{⊤} (θ) g (X_{i}, θ)}}{\sum_{j = 1}^{n} \exp {η^{⊤} (θ) g (X_{j}, θ)}} .$ In fact, this is equivalent to the EL $\prod_{i = 1}^{n} p_{i}$ , where the $p_{i}$ s minimize the Kullback–Leibler divergence (up to a constant) or minus the exponential titling likelihood $\sum_{i = 1}^{n} p_{i} \log (p_{i})$ subject to the constraints $\sum_{i = 1}^{n} p_{i} = 1$ , $p_{i} \geq 0$ , and $\sum_{i = 1}^{n} p_{i} g (X_{i}, θ) = 0.$ See Susanne (Citation2007) for more details. We call this the empirical entropy family induced by the estimating equation $E {g (X, θ)} = 0$ .

4. Enhancing efficiency using auxiliary information

In this section, we discuss methods of incorporating auxiliary information to enhance estimation efficiency. This aspect was also investigated by Qin (Citation2000). We assume a parametric model $f (y | x, β)$ for the conditional density function of Y given X and leave the marginal distribution $G (x)$ of X unspecified. We wish to make inferences for β when some auxiliary information is summarized through an estimating equation $E {ϕ (X, β)} = 0.$ For example, if we know the mean μ of Y, then we can construct an estimating equation $E (Y - μ) = 0.$ We can take $\begin{aligned} ϕ (X, β) & = \int (y - μ) f (y | X, β) d y \\ = \int y f (y | X, β) d y - μ . \end{aligned}$ Furthermore, we allow that the response Y may have missing values. Let D be the non-missingness indicator, which takes the value 1 if Y is available, and 0 otherwise. We assume a missing-at-random model $\begin{aligned} p r (D = 1 | Y = y, X = x) = p r (D = 1 | X = x) \\ = π (x), \end{aligned}$ where $π (x)$ depends only on x. We denote the observed data by $(d_{i}, d_{i} y_{i}, x_{i})$ ( $i = 1, 2, \dots, n$ ) and $p_{i} = d G (x_{i})$ . The likelihood of $(β, G)$ is $\begin{aligned} L & = \prod_{i = 1}^{n} {π (x_{i}) f (y_{i} | x_{i}, β) d G (x_{i})}^{d_{i}} \\ \times [{1 - π (x_{i})} d G (x_{i})]^{1 - d_{i}} \\ = \prod_{j = 1}^{n} {π (x_{j})}^{d_{j}} {1 - π (x_{j})}^{1 - d_{j}} \\ \cdot \prod_{i = 1}^{n} {f (y_{i} | x_{i}, β)}^{d_{i}} \cdot p_{i} . \end{aligned}$ We can maximize this likelihood subject to the constraints $\sum_{i = 1}^{n} p_{i} = 1, p_{i} \geq 0, \sum_{i = 1}^{n} p_{i} ϕ (x_{i}, β) = 0.$ As $\prod_{j = 1}^{n} {π (x_{j})}^{d_{j}} {1 - π (x_{j})}^{1 - d_{j}}$ is not a function of β, the profile hybrid empirical log-likelihood (up to a constant) is (10) $\begin{aligned} ℓ (β) = \sum_{i = 1}^{n} [d_{i} \log f (y_{i} | x_{i}, β) - \log {1 + λ^{⊤} ϕ (x_{i}, β)}], \end{aligned}$ (10) where λ is the Lagrange multiplier determined by (11) $\sum_{i = 1}^{n} \frac{ϕ (x_{i}, β)}{1 + λ^{⊤} ϕ (x_{i}, β)} = 0.$ (11) For the special case where data are missing completely at random, i.e., $π (x)$ is a constant function of x, Qin (Citation2000) established the following theorem.

Theorem 4.1

Let $β_{0}$ be the true parameter value, let $\hat{β}$ be the maximum hybrid EL estimator, i.e., the maximizer of (Equation10(10) $\begin{aligned} ℓ (β) = \sum_{i = 1}^{n} [d_{i} \log f (y_{i} | x_{i}, β) - \log {1 + λ^{⊤} ϕ (x_{i}, β)}], \end{aligned}$ (10) ), and let $\hat{λ}$ be the corresponding Lagrange multiplier. Denote $ϕ = ϕ (X, β_{0})$ , $\nabla_{β} ϕ = \nabla_{β} ϕ (X, β_{0}),$ and $\begin{aligned} J & = - E {d_{i} \nabla_{β β^{⊤}} \log f (y_{i} | x_{i}, β_{0})} \\ = V a r {d_{i} \nabla_{β} \log f (y_{i} | x_{i}, β_{0})} . \end{aligned}$ Under some regularity conditions, when n goes to infinity, we have $\sqrt{n} ((\hat{β} - β_{0})^{⊤}, {\hat{λ}}^{⊤})^{⊤} \overset{d}{⟶} N (0, Σ),$ where $Σ = d i a g (Σ_{11}, Σ_{22})$ with (12) $\begin{aligned} Σ_{11} & = {J + E (\nabla_{β} ϕ^{⊤}) (E ϕ ϕ^{⊤})^{- 1} E (\nabla_{β^{⊤}} ϕ)}^{- 1}, \\ Σ_{22} & = {E (\nabla_{β} ϕ^{⊤}) J^{- 1} E (\nabla_{β^{⊤}} ϕ) + E (ϕ ϕ^{⊤})}^{- 1} . \end{aligned}$ (12)

Remark 4.1

Imbens and Lancaster (Citation1994) studied the same problem using GMM. In particular, they directly combined the conditional score estimating equation $\nabla_{β} \log f (y | x, β)$ and $ϕ (x, β)$ . Even though the first-order large-sample results are the same, the hybrid EL based approach is more appealing as it respects the parametric conditional likelihood and replaces only the marginal likelihood with the EL. See Qin (Citation2000) for numerical comparisons of results of the two methods.

5. Combining summary information: a more flexible method for meta-analysis

Developing systematic methods for combining published information is one of the main goals of meta-analysis, which has become increasingly popular since little extra cost is needed. The main restriction in meta-analysis is that all studies must include the same variables in their analyses. The only difference allowed is in the sample sizes. Thus, studies must be discarded if they contain different variables from those in other studies.

Summarized information is often available from publications such as census reports and results of national health studies. For reasons including confidentiality, it is typically not possible to gain access to the original data, only the summarized reports. Suppose we are interested in conducting a new study that may contain some new variables of interest that are not available in the summarized information, for example, a genetic study involving newly discovered biomarkers or genes. Below we discuss a more flexible method that could be used to combine published information and individual study data for enhanced inference in such cases. Chatterjee et al. (Citation2016) discussed a related problem on the utilization of auxiliary information. As Han and Lawless (Citation2016) pointed out, however, their methodology and theoretical results had already been developed by Imbens and Lancaster (Citation1994) and Qin (Citation2000) in the absence of selection bias in sampling.

We consider two cases. (I) The sample size for the summarized information is much larger than that of the new study. (II) Sample sizes from the two data sources are comparable. In Case I, we can treat the summarized information as known, i.e., the variation in the summarized data is negligible compared with the variation in the new study. In Case II, we have to take the variation in the summarized information into consideration as it is comparable to the variation in the new study. We focus on Case I in this section and study Case II in Section 6.

5.1. Setup and solution

Suppose that the summarized results were obtained from statistical analyses of response Y and covariate variables X (although the original data are not available), and that the new study includes an extra covariate Z in addition to $(Y, X)$ . We are interested in fitting a parametric model $f (y | x, z, β)$ for the conditional density function of Y given X and Z. Let $(y_{1}^{*}, x_{1}^{*}), \dots ., (y_{N}^{*}, x_{N}^{*})$ be the historic data even though they are unavailable. The published information can be summarized in two ways:

$\bar{h} = N^{- 1} \sum_{i = 1}^{N} h (y_{i}^{*}, x_{i}^{*})$ is known; and
$γ^{*}$ is the solution of an estimating equation $\sum_{i = 1}^{N} h (y_{i}^{*}, x_{i}^{*}, γ) = 0,$ where the function $h (y, x, γ)$ is known up to γ.

Let $(y_{1}, x_{1}, z_{1}), \dots, (y_{n}, x_{n}, z_{n})$ be observed data from the new study. The basic assumption is that $(y_{i}, x_{i}), i = 1, 2, \dots, n$ , and $(y_{i}^{*}, x_{i}^{*})$ have the same distribution. To utilize the summarized information, we can define estimating functions $\begin{matrix} g = (g_{1}, g_{3}), g_{1} (y, x, z) = \nabla_{β} \log f (y | x, z, β), \\ g_{3} (y, x) = h (y, x) - \bar{h} \end{matrix}$ in Scenario (I), and $\begin{aligned} g & = (g_{1}, g_{3}), \\ g_{1} (y, x, z) & = \nabla_{β} \log f (y | x, z, β), \\ g_{3} (y, x) & = h (y, x, γ^{*}) \end{aligned}$ in Scenario (II). We consider only the situation where $n / N \to 0$ . In other words, the variation in the auxiliary information is negligible.

The EL approach amounts to maximizing $\sum_{i = 1}^{n} \log p_{i}$ subject to the constraint $\sum_{i = 1}^{n} p_{i} g (y_{i}, x_{i}, z_{i}, β) = 0, p_{i} \geq 0, \sum_{i = 1}^{n} p_{i} = 1.$ According to Qin and Lawless (Citation1994), the asymptotic variance of the maximum EL estimator $\hat{β}$ based on estimating equation g is $[E (\nabla_{β} g^{⊤}) {E (g g^{⊤})}^{- 1} E (\nabla_{β^{⊤}} g)]^{- 1},$ where $\nabla_{β} g = \partial g (y, x, z, β) / \partial β |_{β = β_{0}}$ , $g = g (y, x, z, β_{0})$ , and $β_{0}$ is the true value of β. We denote $\begin{aligned} A & = E (g g^{⊤}) = (\begin{array}{cc} A_{11} & A_{12} \\ A_{12}^{⊤} & A_{22} \end{array}), \\ A_{22.1} & = A_{22} - A_{12}^{⊤} A_{11}^{- 1} A_{12} . \end{aligned}$ Equivalently, the asymptotic variance can be written as $\begin{aligned} [E (\nabla_{β} g_{1}^{⊤}) A_{11}^{- 1} E (\nabla_{β^{⊤}} g_{1}) + E (\nabla_{β} g_{1}^{⊤}) A_{11}^{- 1} A_{12} \\ \times A_{22.1}^{- 1} A_{21} A_{11}^{- 1} E (\nabla_{β^{⊤}} g_{1})]^{- 1}, \end{aligned}$ or $(J + A_{12} A_{22.1}^{- 1} A_{21})^{- 1},$ where $A_{11} = J$ is Fisher's information matrix.

In the above approach, the estimating equation $g_{3} = h (y, x) - \bar{h}$ does not involve the parameter β. However, there are ways to achieve higher efficiency. For example, we define $g_{2} (x, z, β) = ψ (x, z, β)$ with $\begin{aligned} ψ (x, z, β) & = E {h (Y, X) | X = x, Z = z} - \bar{h} \\ = \int h (y, x) f (y | x, z, β) d y - \bar{h} . \end{aligned}$ Then, $E {g_{2} (x, z, β)} = 0.$ If we combine the empirical log-likelihood based on the estimating equation $g_{2}$ and the log-likelihood $\sum_{i = 1}^{n} \log f (y_{i} | x_{i}, z_{i}, β)$ as in the previous section (see Equation (Equation12(12) $\begin{aligned} Σ_{11} & = {J + E (\nabla_{β} ϕ^{⊤}) (E ϕ ϕ^{⊤})^{- 1} E (\nabla_{β^{⊤}} ϕ)}^{- 1}, \\ Σ_{22} & = {E (\nabla_{β} ϕ^{⊤}) J^{- 1} E (\nabla_{β^{⊤}} ϕ) + E (ϕ ϕ^{⊤})}^{- 1} . \end{aligned}$ (12) )), then the asymptotic variance of the resulting MLE $\hat{β}$ is given by ${J + E (\nabla_{β} ψ^{⊤}) (E ψ ψ^{⊤})^{- 1} E (\nabla_{β^{⊤}} ψ)}^{- 1} .$ In general, this approach can achieve better efficiency.

5.2. A comparison

Given two pairs of estimation functions, ${g_{1}, g_{3}}$ and ${g_{1}, g_{2}}$ , we may wonder combining which pair leads to a better estimator if we directly compare their asymptotic variance formulae. Alternatively, we may enquire whether we should combine all three constraints $g = (g_{1}, g_{2}, g_{3})$ together. Write $g_{12} = g_{21}^{⊤} = (g_{1}, g_{2})$ , $a = E {h^{⊤} (y, x) \nabla_{β} \log f (y | x, z, β)}$ , and $\begin{aligned} E (g g^{⊤}) & = (\begin{array}{ccc} J & 0 & a \\ 0 & E (ψ ψ^{⊤}) & E (ψ ψ^{⊤}) \\ a^{⊤} & E (ψ ψ^{⊤}) & E (h h^{⊤}) \end{array}) \\ = (\begin{array}{cc} B_{11} & B_{12} \\ B_{12}^{⊤} & B_{22} \end{array}), \\ B_{11} & = (\begin{array}{cc} J & 0 \\ 0 & E (ψ ψ^{⊤}) \end{array}), B_{12} = (\begin{matrix} a \\ E (ψ ψ^{⊤}) \end{matrix}) . \end{aligned}$ Using results from Qin and Lawless (Citation1994) and $\begin{aligned} {(\begin{array}{cc} B_{11} & B_{12} \\ B_{21} & B_{22} \end{array})}^{- 1} & = (\begin{array}{cc} I & - B_{11}^{- 1} B_{12} \\ 0 & I \end{array}) \\ \times (\begin{array}{cc} B_{11}^{- 1} & 0 \\ 0 & B_{22.1}^{- 1} \end{array}) (\begin{array}{cc} I & 0 \\ - B_{21} B_{11}^{- 1} & I \end{array}) \end{aligned}$ with $B_{22.1} = B_{11} - B_{12}^{⊤} B_{11}^{- 1} B_{12}$ , we find that the asymptotic variance of $\hat{β}$ obtained by combining the three estimating equations and $\sum_{i = 1}^{n} \log f (y_{i} | x_{i}, z_{i}, β)$ is $\begin{aligned} [J + E (\nabla_{β} ψ^{⊤}) {E (ψ ψ^{⊤})}^{- 1} E (\nabla_{β^{⊤}} ψ) \\ + E (\nabla_{β} g_{21}) B_{11}^{- 1} B_{12} B_{22.1}^{- 1} B_{21} B_{11}^{- 1} E (\nabla_{β^{⊤}} g_{12})]^{- 1} . \end{aligned}$ It can be shown that $E (\nabla_{β} g) = (- J, E (\nabla_{β} ψ), 0)$ and $E (\nabla_{β} g_{12}) = (- J, a) .$ Immediately, we have $\begin{aligned} E (\nabla_{β} g_{12}) B_{11}^{- 1} B_{12} & = (- J, a) (\begin{array}{cc} J^{- 1} & 0 \\ 0 & {E (ψ ψ^{⊤})}^{- 1} \end{array}) \\ \times (\begin{matrix} a \\ E (ψ ψ^{⊤}) \end{matrix}) = 0, \end{aligned}$ which implies that the asymptotic variance in the case where $g_{1}$ , $g_{2}$ , and $g_{3}$ are combined is the same as that in the case where $g_{1}$ and $g_{2}$ only are combined. This indicates that taking $g_{3}$ into account leads to no efficiency gain in the estimation of β.

The method of combining $g_{2}$ and the parametric likelihood $\prod_{i = 1}^{n} f (y_{i} | x_{i}, z_{i}, β)$ is better than that of combining $g_{1}$ , $g_{3}$ , and the parametric likelihood. To see this, recall that the asymptotical variances for the MLEs of β with the two methods are $V_{1} = {J + E (\nabla_{β} ψ^{⊤}) (E ψ ψ^{⊤})^{- 1} E (\nabla_{β^{⊤}} ψ)}^{- 1}$ and $V_{2} = (J + A_{12} A_{22.1}^{- 1} A_{21})^{- 1} .$ It suffices to show that $V_{2} - V_{1} \geq 0$ , namely, $V_{2} - V_{1}$ is non-negative definite.

5.3. Proof of $V_{2} - V_{1} \geq 0$

For convenience, we assume that $E (h) = 0$ . As $E (\nabla_{β} ψ^{⊤}) = A_{12}$ and $ψ = E (h | X, Z)$ , it suffices to show that (13) $\begin{aligned} A_{22.1} - E (ψ ψ^{⊤}) = (A_{22} - A_{21} A_{11}^{- 1} A_{12}) \\ - E [{E (h | X, Z)}^{\otimes 2}] \geq 0. \end{aligned}$ (13) Let $E_{*}$ and ${V a r}_{*}$ denote $E (\cdot | X, Z)$ and $V a r (\cdot | X, Z)$ , respectively. As $\begin{aligned} (\begin{array}{cc} A_{11} & A_{12} \\ A_{21} & A_{22} \end{array}) & = E {{(\begin{matrix} g_{1} \\ h \end{matrix})}^{\otimes 2}} \\ = E {{V a r}_{*} (\begin{matrix} g_{1} \\ h \end{matrix})} + V a r {E_{*} (\begin{matrix} g_{1} \\ h \end{matrix})} \end{aligned}$ and $E_{*} (g_{1}) = 0$ , it follows that $\begin{aligned} (\begin{array}{cc} A_{11} & A_{12} \\ A_{21} & A_{22} \end{array}) & \geq V a r {E_{*} (\begin{matrix} g_{1} \\ h \end{matrix})} \\ = E (\begin{array}{cc} 0 & 0 \\ 0 & E_{*} (h) E_{*} (h^{⊤}) \end{array}) . \end{aligned}$ Multiplying both sides by $(- A_{21} A_{11}^{- 1}, I)$ from the left and by $(- A_{21} A_{11}^{- 1}, I)^{⊤}$ from the right, we arrive at $A_{22} - A_{21} A_{11}^{- 1} A_{12} \geq E {E_{*} (h) E_{*} (h^{⊤})},$ that is, inequality (Equation13(13) $\begin{aligned} A_{22.1} - E (ψ ψ^{⊤}) = (A_{22} - A_{21} A_{11}^{- 1} A_{12}) \\ - E [{E (h | X, Z)}^{\otimes 2}] \geq 0. \end{aligned}$ (13) ) holds, which implies $V_{2} - V_{1} \geq 0$ .

6. Calibration of information from previous studies

We consider calibration of information using parametric likelihood, EL (Owen, Citation1988), and GMM (Hansen, Citation1982). When only summary information from previous studies is available, these three well-known methods can be used to calibrate such summary information and to make inferences about the unknown parameters of interest. We may wonder whether doing so results in efficiency loss compared with inferences based on the pooled data if they were all available. Zeng and Lin (Citation2015) found that parametric-likelihood-based meta-analysis of summarized information retained first-order asymptotic efficiency compared with analysis based on individual data. We show here that EL and GMM also possess this property. This is extremely important, as individual data may involve privacy issues, whereas summarized information does not.

6.1. Efficiency comparison

Suppose that $(Y_{i j}, X_{i j})$ ( $j = 1, 2, \dots, n_{i}; i = 1, 2, \dots, K$ ) are independent observations from the same population. We consider two scenarios according to the model's assumption about the population.

The conditional probability function (i.e., the probability density/mass function of a continuous/discrete random variable) of Y given X has a parametric form $f (y | x, β)$ .
The population satisfies $E {g (Y, X, β)} = 0.$

Here, β is a finite-dimensional unknown parameter, and $β_{*}$ is its true value. Assume that data are available batch by batch, and that $n_{i} / n = ρ_{i} \in (0, 1)$ , where $n = \sum_{i = 1}^{K} n_{i}$ . For the i-th batch ( $i = 1, 2, \dots, K$ ) of data:

under assumption (I), the parametric log-likelihood function of β is $ℓ_{i, P L} (β) = \sum_{j = 1}^{n_{i}} \log {f (Y_{i j} | X_{i j}, β)};$
under assumption (II), we define an empirical log-likelihood function $\begin{aligned} ℓ_{i, E L} (β) & = sup {\sum_{j = 1}^{n_{i}} \log (n_{i} p_{j}) : p_{j} \geq 0, \\ \sum_{j = 1}^{n_{i}} p_{j} = 1, \\ \sum_{j = 1}^{n_{i}} p_{j} g (Y_{i j}, X_{i j}; β) = 0} \\ = - \sum_{j = 1}^{n_{i}} \log {1 + λ_{i}^{⊤} g (Y_{i j}, X_{i j}; β)} \\ - n_{i} \log (n_{i}), \end{aligned}$ where $λ_{i}$ satisfies $\sum_{j = 1}^{n_{i}} \frac{g (Y_{i j}, X_{i j}; β)}{1 + λ_{i}^{⊤} g (Y_{i j}, X_{i j}; β)} = 0$ ;
under assumption (II), we define the objective function of the GMM method (GMM log-likelihood for short) as $\begin{aligned} ℓ_{i, G M M} (β) & = - {\sum_{j = 1}^{n_{i}} g (Y_{i j}, X_{i j}; β)}^{⊤} \\ \times Ω^{- 1} {\sum_{j = 1}^{n_{i}} g (Y_{i j}, X_{i j}; β)}, \end{aligned}$ where $Ω = V a r {g (Y, X, β_{*})}$ and $β_{*}$ is the true value of β. In practice, $β_{*}$ is generally replaced by a consistent estimator of β in the expression for Ω. Using the true value $β_{*}$ of β does not affect the theoretical analysis presented in this section.

Let $ℓ_{i} (β) = ℓ_{i, P L} (β)$ , $ℓ_{i, E L} (β)$ , or $ℓ_{i, G M M} (β)$ . Under certain regularity conditions, it can be verified that for $β = β_{*} + O_{p} (n^{- 1 / 2})$ , (14) $\begin{aligned} ℓ_{i} (β) & = U_{i}^{⊤} \sqrt{n_{i}} (β - β_{*}) - \frac{n_{i}}{2} (β - β_{*})^{⊤} \\ \times V (β - β_{*}) + o_{p} (1) . \end{aligned}$ (14) In Case (a), $\begin{aligned} U_{i} & = n_{i}^{- \frac{1}{2}} \sum_{j = 1}^{n_{i}} \nabla_{β} \log {f (Y_{i j} | X_{i j}, β_{*})}, \\ V & = V a r [\nabla_{β} \log {f (Y | X, β_{*})}] . \end{aligned}$ In Case (b) $U_{i} = n_{i}^{- \frac{1}{2}} \sum_{j = 1}^{n_{i}} g (Y_{i j}, X_{i j}; β_{*}), V = A_{12} A_{22}^{- 1} A_{21},$ where $\begin{aligned} A & = (\begin{array}{cc} 0 & E {\nabla_{β} g^{⊤} (Y, X; β_{*})} \\ E {\nabla_{β^{⊤}} g (Y, X; β_{*})} & E {g (Y, X; β_{*}) g (Y, X; β_{*})} \end{array}) \\ \equiv (\begin{array}{cc} A_{11} & A_{12} \\ A_{21} & A_{22} \end{array}) . \end{aligned}$ In Case (c), $\begin{aligned} U_{i} & = - {E \nabla_{θ} g^{⊤} (Y, X, β_{*})} Ω^{- 1} n_{i}^{- \frac{1}{2}} \sum_{j = 1}^{n_{i}} g (Y_{i j}, X_{i j}, β_{*}), \\ V & = {E \nabla_{θ} g^{⊤} (Y, X, β_{*})} Ω^{- 1} {E \nabla_{β^{⊤}} g (Y, X, β_{*})} . \end{aligned}$ We denote the MLE of β based on the i-th batch of data by ${\hat{β}}_{i} = \arg max ℓ_{r} (β) .$ The above approximation implies that $\sqrt{n_{i}} ({\hat{β}}_{i} - β_{*}) = V^{- 1} U_{i} + o_{p} (1) \overset{d}{⟶} N (0, V^{- 1}) .$ When the K-th batch of individual data are available, we no longer have access to the individual data of the previous K−1 batches but only have summarized information $({\hat{β}}_{i}, {\hat{Σ}}_{i}), i = 1, 2, \dots, K - 1$ , where ${\hat{β}}_{i}$ is the MLE based on the i-th batch of data and ${\hat{Σ}}_{i} = V^{- 1} / n_{i} + o (n^{- 1})$ , and we can define an augmented log-likelihood $ℓ_{A} (β) = ℓ_{K} (β) - \frac{1}{2} \sum_{i = 1}^{K - 1} ({\hat{β}}_{i} - β)^{⊤} {\hat{Σ}}_{i}^{- 1} ({\hat{β}}_{i} - β)$ and the corresponding MLE ${\hat{β}}_{A} = \arg max ℓ_{A} (β) .$ For $β = β_{*} + O_{p} (n^{- 1 / 2})$ , using the approximation in (Equation14(14) $\begin{aligned} ℓ_{i} (β) & = U_{i}^{⊤} \sqrt{n_{i}} (β - β_{*}) - \frac{n_{i}}{2} (β - β_{*})^{⊤} \\ \times V (β - β_{*}) + o_{p} (1) . \end{aligned}$ (14) ), we have $\begin{aligned} ℓ_{A} (β) & = U_{K}^{⊤} \sqrt{n_{K}} (β - β_{*}) - \frac{n_{K}}{2} (β - β_{*}) V (β - β_{*}) \\ - \frac{1}{2} \sum_{i = 1}^{K - 1} n_{i} (β - β_{*})^{⊤} V (β - β_{*}) \\ + \sum_{i = 1}^{K - 1} n_{i} ({\hat{β}}_{i} - β_{*})^{⊤} V (β - β_{*}) + C + o_{p} (1) \\ = n^{- 1 / 2} \sum_{i = 1}^{K} \sqrt{n_{i}} U_{i}^{⊤} \cdot \sqrt{n} (β - β_{*}) \\ - \frac{n}{2} (β - β_{*})^{⊤} V (β - β_{*}) + C + o_{p} (1), \end{aligned}$ where the constant C differs in different equations.

For comparison, based on the pooled data, in Case (a) we define the parametric log-likelihood as $ℓ_{P L} (β) = \sum_{i = 1}^{K} \sum_{j = 1}^{n_{i}} \log {f (Y_{i j} | X_{i j}, β)};$ in Case (b) we define the empirical log-likelihood function as $\begin{aligned} ℓ_{E L} (β) & = sup {\sum_{i = 1}^{K} \sum_{j = 1}^{n_{i}} \log (n p_{i j}) : p_{i j} \geq 0, \\ \sum_{i = 1}^{K} \sum_{j = 1}^{n_{i}} p_{i j} = 1, \\ \sum_{i = 1}^{K} \sum_{j = 1}^{n_{i}} p_{i j} g (Y_{i j}, X_{i j}; β) = 0} \\ = - \sum_{i = 1}^{K} \sum_{j = 1}^{n_{i}} \log {1 + λ^{⊤} g (Y_{i j}, X_{i j}; β)} \\ - \sum_{i = 1}^{K} n_{i} \log (n_{i}), \end{aligned}$ where λ satisfies $\sum_{i = 1}^{K} \sum_{j = 1}^{n_{i}} \frac{g (Y_{i j}, X_{i j}; β)}{1 + λ^{⊤} g (Y_{i j}, X_{i j}; β)} = 0$ ; and in Case (c) we define the GMM log-likelihood as $\begin{aligned} ℓ_{G M M} (β) & = - {\sum_{i = 1}^{K} \sum_{j = 1}^{n_{i}} g (Y_{i j}, X_{i j}; β)}^{⊤} \\ \times Ω^{- 1} {\sum_{i = 1}^{K} \sum_{j = 1}^{n_{i}} g (Y_{i j}, X_{i j}; β)} . \end{aligned}$ Let the log-likelihood based on the pooled data be $ℓ_{p o o l} (β) = ℓ_{P L} (β)$ , $ℓ_{E L} (β)$ , and $ℓ_{G M M} (β)$ in Cases (a), (b), and (c), respectively. Then, it can be shown that $\begin{aligned} ℓ_{p o o l e d} (β) & = n^{- 1 / 2} \sum_{j = 1}^{K} \sqrt{n_{j}} U_{j}^{⊤} \cdot \sqrt{n} (β - β_{*}) \\ - \frac{n}{2} (β - β_{*})^{⊤} V (β - β_{*}) + C + o_{p} (1), \end{aligned}$ for some constant C. Let ${\hat{β}}_{p o o l e d} = \arg max ℓ_{p o o l e d} (β)$ . By comparing $ℓ_{p o o l e d} (β)$ and $ℓ_{A} (β)$ , we obtain $ℓ_{p o o l e d} (β) = ℓ_{A} (β) + C + o_{p} (1)$ and $\begin{aligned} \sqrt{n} ({\hat{β}}_{A} - β_{*}) & = \sqrt{n} ({\hat{β}}_{p o o l e d} - β_{*}) + o_{p} (1) \\ = V^{- 1} \cdot n^{- 1 / 2} \sum_{j = 1}^{K} \sqrt{n_{j}} U_{j}^{⊤} + o_{p} (1) \\ \overset{d}{⟶} N (0, V^{- 1}) . \end{aligned}$ This indicates that compared with the methods, including parametric likelihood, EL, and GMM, based on all individual data, the calibration method based on the last batch of individual data and all summary results of the previous batches has no efficiency loss.

6.2. When nuisance parameters are present

For batch i, assume that the data $(Y_{i j}, X_{i j})$ ( $j = 1, 2, \dots, n_{i}$ ) satisfy either $p r (Y_{i j} = y | X_{i j} = x) = f (y | x, β, γ_{i})$ or $E {g (Y, X, β, γ_{i})} = 0,$ where β is common but $γ_{i}$ is a batch-specific parameter. We define $ℓ_{r} (β, γ_{r})$ in the same way as $ℓ_{r} (β)$ . Let $({\hat{β}}_{i}, {\hat{γ}}_{i})$ be the MLE of $(β, γ_{i})$ based on the i-th batch of data, and assume that approximately $(({\hat{β}}_{i} - β)^{⊤}, ({\hat{γ}}_{i} - γ_{i})^{⊤})^{⊤} \sim N (0, {\hat{Σ}}_{i})$ with ${\hat{Σ}}_{i} = ({\hat{Σ}}_{i, r s})_{1 \leq r, s \leq 2}$ .

We have two ways of combining information from previous studies. If we use all the previous summary information, we can define (15) $\begin{aligned} ℓ_{A}^{(1)} (β, γ_{1}, \dots, γ_{K}) = ℓ_{K} (β, γ_{K}) - \frac{1}{2} \sum_{i = 1}^{K - 1} (({\hat{β}}_{i} - β)^{⊤}, \\ ({\hat{γ}}_{i} - γ_{i})^{⊤}) {\hat{Σ}}_{i}^{- 1} (({\hat{β}}_{i} - β)^{⊤}, ({\hat{γ}}_{i} - γ_{i})^{⊤})^{⊤} . \end{aligned}$ (15) As ${\hat{β}}_{i} | {\hat{γ}}_{i} \sim N (β, {\hat{Σ}}_{i, 11 \cdot 2}),$ where ${\hat{Σ}}_{i, 11 \cdot 2} = {\hat{Σ}}_{i, 11} - {\hat{Σ}}_{i, 12} {\hat{Σ}}_{i, 22}^{- 1} {\hat{Σ}}_{i, 21},$ using only this summary information, we can define $\begin{aligned} ℓ_{A}^{(2)} (β, γ_{K}) & = ℓ_{K} (β, γ_{K}) - \frac{1}{2} \sum_{i = 1}^{K - 1} ({\hat{β}}_{i} - β)^{⊤} \\ \times {\hat{Σ}}_{i, 11 \cdot 2}^{- 1} ({\hat{β}}_{i} - β) . \end{aligned}$ Below we show that the MLEs of β based on these two likelihoods are actually equal to each other. In other words, there is no efficiency loss when estimating β based on $ℓ_{A}^{(2)} (β, γ_{K})$ instead of $ℓ_{A}^{(1)} (β, γ_{1}, \dots, γ_{K})$ .

To see this, it suffices to show that (16) $sup_{γ_{1}, \dots, γ_{K - 1}} ℓ_{A}^{(1)} (β, γ_{1}, \dots, γ_{K}) = ℓ_{A}^{(2)} (β, γ_{K}) .$ (16) We denote the inverse matrix of $Σ_{i}$ by $Σ_{i}^{- 1} = (Σ_{i}^{r s})_{1 \leq r, s \leq 2},$ where $\begin{aligned} Σ_{i}^{11} & = Σ_{i, 11 \cdot 2}^{- 1}, Σ_{i}^{21} = - Σ_{i, 22}^{- 1} Σ_{i, 21} Σ_{i, 11 \cdot 2}^{- 1}, \\ Σ_{i}^{12} & = - Σ_{i, 11 \cdot 2}^{- 1} Σ_{i, 12} Σ_{i, 22}^{- 1}, \\ Σ_{i}^{22} & = Σ_{i, 22}^{- 1} + Σ_{i, 22}^{- 1} Σ_{i, 21} Σ_{i, 11 \cdot 2}^{- 1} Σ_{i, 12} Σ_{i, 22}^{- 1} . \end{aligned}$ It can be seen that $\begin{aligned} ℓ_{A}^{(1)} (β, γ_{1}, \dots, γ_{K}) \\ = ℓ_{K} (β, γ_{K}) - \frac{1}{2} \sum_{i = 1}^{K - 1} ({\hat{β}}_{i} - β)^{⊤} Σ_{i}^{11} ({\hat{β}}_{i} - β) \\ + \sum_{i = 1}^{K - 1} ({\hat{β}}_{i} - β)^{⊤} Σ_{i}^{12} (γ_{i} - {\hat{γ}}_{i}) \\ - \frac{1}{2} \sum_{i = 1}^{K - 1} (γ_{i} - {\hat{γ}}_{i})^{⊤} Σ_{i}^{22} (γ_{i} - {\hat{γ}}_{i}) . \end{aligned}$ Setting $\partial ℓ_{A}^{(1)} (β, γ_{1}, \dots, γ_{K}) / \partial γ_{i} = 0$ ( $1 \leq i \leq K - 1$ ) gives $(γ_{i} - {\hat{γ}}_{i}) = (Σ_{i}^{22})^{- 1} Σ_{i}^{21} ({\hat{β}}_{i} - β) .$ Putting this back into $ℓ_{A}^{(1)} (β, γ_{1}, \dots, γ_{K})$ gives $\begin{aligned} sup_{γ_{1}, \dots, γ_{K - 1}} ℓ_{A}^{(1)} (β, γ_{1}, \dots, γ_{K}) \\ = ℓ_{K} (β, γ_{K}) - \frac{1}{2} \sum_{i = 1}^{K - 1} ({\hat{β}}_{i} - β)^{⊤} \\ \times {Σ_{i}^{11} - Σ_{i}^{12} (Σ_{i}^{22})^{- 1} Σ_{i}^{21}} ({\hat{β}}_{i} - β) + C \\ = ℓ_{K} (β, γ_{K}) - \frac{1}{2} \sum_{i = 1}^{K - 1} ({\hat{β}}_{i} - β)^{⊤} \\ \times Σ_{i, 11 \cdot 2}^{- 1} ({\hat{β}}_{i} - β) + C, \end{aligned}$ where we used the definition of $Σ_{i, 11 \cdot 2}$ in the last equation. We arrive at Equation (Equation16(16) $sup_{γ_{1}, \dots, γ_{K - 1}} ℓ_{A}^{(1)} (β, γ_{1}, \dots, γ_{K}) = ℓ_{A}^{(2)} (β, γ_{K}) .$ (16) ) after comparing this with the definition of $ℓ_{A}^{(2)} (β, γ_{K})$ .

7. Using covariate-specific disease prevalent information

As discussed in the previous section, summarized statistics from previous studies can sometimes be utilized to enhance the estimation efficiency in a current study. This is especially important in the big data era, when many types of information can be found through the internet. More specifically, suppose the prevalence of a disease is known at various levels of a known risk factor X. In this section, we combine this type of information in a case–control biased sampling setup.

7.1. Induced estimating equations under case–control sampling

Case–control sampling is among the most popular methods in cancer epidemiological studies. This is mainly because it is the most convenient, economic, and effective method. In the study of rare diseases in particular, one has to collect large samples in order to get a reasonable number of cases by using prospective sampling, which may not be practical. Using case–control sampling, a pre-specified number of cases ( $n_{1}$ ) and controls ( $n_{0}$ ) are collected retrospectively from case and control populations, respectively. Typically, this can be accomplished by sampling cases from hospitals and controls from the general disease-free population.

For a given risk factor X, let $F_{i} (x) = p r (X \leq x | D = i)$ for i = 0, 1. Given X in a range $(a, b]$ , the disease prevalence is $p r (D = 1 | a < X \leq b) = ϕ (a, b),$ where $ϕ (a, b)$ is known. Using Bayes' formula, we have $\begin{aligned} ϕ (a, b) & = \frac{π \int_{a}^{b} d F_{1} (x)}{p r (a < X \leq b)}, \\ 1 - ϕ (a, b) & = \frac{(1 - π) \int_{a}^{b} d F_{0} (x)}{p r (a < X \leq b)} \end{aligned}$ with $π = p r (D = 1) .$ It follows that $\int_{a}^{b} d F_{1} (x) = \frac{1 - π}{π} \frac{ϕ (a, b)}{1 - ϕ (a, b)} \int_{a}^{b} d F_{0} (x),$ or $\begin{aligned} E_{1} [I (a < X \leq b)] \\ = \frac{1 - π}{π} \frac{ϕ (a, b)}{1 - ϕ (a, b)} E_{0} [I (a < X \leq b)], \end{aligned}$ where $E_{0}$ and $E_{1}$ denote the expectation operators with respect to $F_{0}$ and $F_{1}$ , respectively.

We assume that given covariates X and Y, the underlying disease model is given by the conventional logistic regression (17) $\begin{aligned} p r (D = 1 | x, y) = \frac{\exp (α^{*} + x β + y γ + y x ξ)}{1 + \exp (α^{*} + x β + y γ + y x ξ)} . \end{aligned}$ (17) Let $α = α^{*} - η$ with $η = \log {(1 - π) / π}$ . It can be shown (see Qin, Citation2017) that this is equivalent to the exponential tilting model $\begin{aligned} f_{1} (x, y) & = f (x, y | D = 1) \\ = \exp (α + x β + y γ + y x ξ) f_{0} (x, y), \end{aligned}$ where $f_{0} (x, y) = f (x, y | D = 0)$ . As a consequence, $\begin{aligned} E_{0} {I (a < X \leq b) \exp (η + α + β X + γ Y + ξ X Y)} \\ = \frac{1 - π}{π} \frac{ϕ (a, b)}{1 - ϕ (a, b)} E_{0} [I (a < X \leq b)] \end{aligned}$ or (18) $\begin{aligned} E_{0} {I (a < X \leq b) \exp (α + β X + γ Y + ξ X Y) \\ - \frac{ϕ (a, b)}{1 - ϕ (a, b)} I (a < X \leq b)} = 0. \end{aligned}$ (18) We denote $g_{0} (X, Y) = \exp (η + α + β X_{i} + γ Y_{i} + ξ X_{i} Y_{i}) - 1$ and the summarized auxiliary information equations as $\begin{aligned} g_{i} (X, Y) & = I (a_{i - 1} < X \leq a_{i}) \exp (α + β X + γ Y \\ + ξ X Y) - \frac{ϕ (a_{i - 1}, a_{i})}{1 - ϕ (a_{i - 1}, a_{i})} \\ \times I (a_{i - 1} < X \leq a_{i}) \end{aligned}$ with $i = 1, 2, \dots, I$ . Then $E_{0} {g (X, Y)} = 0$ , where $g (X, Y) = (g_{0} (X, Y), g_{1} (X, Y), \dots, g_{I} (X, Y))^{⊤} .$

7.2. Empirical likelihood approach

The log-likelihood is (19) $\begin{aligned} ℓ = \sum_{i = 1}^{n} d_{i} (η + α + β x_{i} + γ y_{i} + ξ x_{i} y_{i}) + \sum_{i = 1}^{n} \log (p_{i}), \end{aligned}$ (19) where $p_{i} = d F_{0} (x_{i}), i = 1, 2, \dots, n$ , and the constraints are $p_{i} \geq 0, \sum_{i = 1}^{n} p_{i} = 1, \sum_{i = 1}^{n} p_{i} g (x_{i}, y_{i}) = 0.$ The profile log-likelihood is $\begin{aligned} ℓ & = \sum_{i = 1}^{n} d_{i} (η + α + β x_{i} + γ y_{i} + ξ x_{i} y_{i}) \\ - \sum_{i = 1}^{n} \log {1 + λ^{⊤} g (x_{i}, y_{i})}, \end{aligned}$ where the Lagrange multiplier λ is determined by $\sum_{i = 1}^{n} \frac{g (x_{i}, y_{i})}{1 + λ^{⊤} g (x_{i}, y_{i})} = 0.$ Finally, the underlying parameters can be obtained by maximizing ℓ.

If the overall disease prevalence probability $π = p r (D = 1)$ is known, then $η = \log {(1 - π) / π}$ is known. On the other hand, if it is unknown but $I \geq 1$ , then π is identifiable. If I>1, then we have an over-identified equation problem. This can be treated as a generalization of the EL method for estimating functions (Qin & Lawless, Citation1994) for biased sampling problems. Qin et al. (Citation2015) considered the case where η is unknown and $I \geq 1$ .

Let $ω = (η, α, β, γ, ξ, λ)$ and let $\hat{ω}$ be its maximum EL estimator. As the first estimating function $g_{0}$ corrects biased sampling in a case–control study, the remaining estimating functions $g_{1}, \dots, g_{I}$ are used for improving efficiency. When n goes to infinity, it can be shown that the limit of λ is a $(I + 1)$ -dimensional vector where the first component is $lim_{n \to \infty} (n_{1} / n)$ and the remainder are all zero. Qin et al. (Citation2015) showed that if $ρ = n_{1} / n_{0}$ remains constant as $n \to \infty$ and $ρ \in (0, 1)$ , then under suitable regularity conditions $\sqrt{n} (\hat{ω} - ω_{0})$ is asymptotically normally distributed with mean zero. Moreover, the estimation of the logistic regression parameters $(β, γ, ξ)$ improves as the number I of estimating functions increases. This means that a richer set of auxiliary information leads to better estimators. In practice, however, this consideration must be balanced with the numerical difficulty of solving a larger number of equations.

Notably, auxiliary information is informative for estimating β and ξ but not for estimating γ. This can be observed through the following equations: $\begin{aligned} \int I (a < x < b) \exp (α + β x + γ y + ξ x y) d F_{0} (x, y) \\ = \int I (a < x < b) \exp {α + β x + s + ξ x (s / γ)} \\ \times d F_{0} (x, s / γ) . \end{aligned}$ As the underlying distribution $F_{0} (x, y)$ is unspecified, we can treat $F_{0} (x, s / γ)$ as a new underlying distribution $F_{0}^{*} (x, s)$ . With $F_{0}^{*}$ profiled out, the auxiliary information equation does not involve γ if $ξ = 0$ . Hence, even if $ξ \neq 0$ , the information for γ is minimal as γ and ξ cannot be separated.

7.3. Generalizations

The simulation results of Qin et al. (Citation2015) indicate that when covariate-specific auxiliary information is employed, the estimator of the coefficient β of X has the maximum variance reduction, whereas the variance reductions for other coefficients are small. If the auxiliary information $p r (D = 1 | b_{j - 1} < Y \leq b_{j}) = ψ_{j}, j = 1, 2, \dots, J$ is also available, we can combine them through estimating equations $\begin{aligned} g_{i} (X, Y) & = I (a_{i - 1} < X \leq a_{i}) e^{α + β X + γ Y + ξ X Y)} \\ - \frac{ϕ (a_{i - 1}, a_{i})}{1 - ϕ (a_{i - 1}, a_{i})} I (a_{i - 1} < X \leq a_{i}), \\ h_{j} (X, Y) & = I (b_{j - 1} < Y \leq b_{j}) e^{α + β X + γ Y + ξ X Y} \\ - \frac{ψ (b_{j - 1}, b_{j})}{1 - ψ (b_{j - 1}, b_{j})} I (b_{j - 1} < Y \leq b_{j}) . \end{aligned}$ It would be more informative if the auxiliary information $p r (D = 1 | a < X < b, c < Y < d)$ is available.

7.4. More on the use of auxiliary information

Under a logistic regression model, the case and control densities are linked by the exponential tilting model (20) $\begin{aligned} p r (x, y | D = 1) = p r (x, y | D = 0) \\ \times \exp (α + x β + y γ + ξ x y) . \end{aligned}$ (20) Suppose that for the general population $E (X) = μ_{1}$ , $E (Y) = μ_{2}$ , and $E (X Y) = μ_{3}$ are all known, and $π = p r (D = 1)$ is known or can be estimated using external data. Under the exponential tilting model (Equation20(20) $\begin{aligned} p r (x, y | D = 1) = p r (x, y | D = 0) \\ \times \exp (α + x β + y γ + ξ x y) . \end{aligned}$ (20) ), the density $f (x, y)$ in the general population and the density $p r (x, y | D = 0)$ in the control population are linked by $\begin{aligned} p r (x, y) & = {π e^{α + x β + y γ + ξ x y} + (1 - π)} \\ \times p r (x, y | D = 0) . \end{aligned}$ As a consequence $E (X) = E_{0} [X {π e^{α + X β + Y γ + ξ X Y} + (1 - π)}] = μ_{1},$ where $E_{0}$ is an expectation with respect to $p r (x, y | D = 0)$ . Let $h (x, y) = (x - μ_{1}, y - μ_{2}, x y - μ_{3})$ with known $μ_{1}$ , $μ_{2}$ , and $μ_{3}$ . The log-likelihood under case–control data is still (Equation19(19) $\begin{aligned} ℓ = \sum_{i = 1}^{n} d_{i} (η + α + β x_{i} + γ y_{i} + ξ x_{i} y_{i}) + \sum_{i = 1}^{n} \log (p_{i}), \end{aligned}$ (19) ), where the $p_{i}$ s satisfy the following constraints: $\begin{aligned} \sum_{i = 1}^{n} p_{i} = 1, p_{i} \geq 0, \sum_{i = 1}^{n} p_{i} e^{α + x_{i} β + y_{i} γ + x_{i} y_{i} ξ} = 1, \\ \sum_{i = 1}^{n} p_{i} h (x_{i}, y_{i}) {π e^{α + x_{i} β + y_{i} γ + x_{i} y_{i} ξ} + (1 - π)} = 0. \end{aligned}$ More generally, any information in the general population such as $E [ψ (Y, X)] = 0$ can be converted to an equation for the control population, $E_{0} [{π e^{α + X β + Y γ + ξ X Y} + (1 - π)} ψ (Y, X)] = 0.$ Therefore, the results developed by Qin et al. (Citation2015) can be applied. The results of Chatterjee et al. (Citation2016) for case–control data can be considered as a special case of Qin et al. (Citation2015).

8. Communication-efficient distributed inference

In the era of big data, it is commonplace for data analyses to run on hundreds or thousands of machines, with the data distributed across those machines and no longer available in a single central location. Recently, parallel and distributed inference has become popular in the statistical literature in both frequentist and Bayesian settings. In essence, the data-parallel procedures are intended to break the overall dataset into subsets that are processed independently. To the extent that communication-avoiding procedures have been discussed explicitly, the focus has been on one-shot or embarrassingly parallel approaches that use only one round of communication in which estimators or posterior samples are first obtained in parallel on local machines, then communicated to a centre node, and finally combined to form a global estimator or approximation to the posterior distribution (Lee et al., Citation2017; Neiswanger et al., Citation2015; Wang & Dunson, Citation2015; Zhang et al., Citation2013). In the frequentist setting, most one-shot approaches rely on averaging (Zhang et al., Citation2013), where the global estimator is the average of the local estimators. Lee et al. (Citation2017) extend this idea to high-dimensional sparse linear regression by combining local debiased Lasso estimates (van de Geer et al., Citation2014). Recent work by Duchi et al. (Citation2015) shows that under certain conditions, these averaging estimators can attain the information-theoretic complexity lower bound for linear regression, and at least $O (d k)$ bits must be communicated in order to attain the minimax rate of parameter estimation, where d is the dimension of the parameter and k is the number of machines. This result holds even in the sparse setting (Braverman et al., Citation2016).

The method of Jordan et al. (Citation2019) proceeds as follows. Suppose the big data consists of N observations and there are k machines. For the convenience of presentation, we assume that each machine has n observations, i.e., N = nk. Denote the full-data likelihood by $ℓ_{N} (θ) = \frac{1}{k} \sum_{j = 1}^{k} ℓ_{j} (θ),$ where $ℓ_{j} (θ)$ is the log-likelihood based on the data from the j-th machine. For θ near its target value $\bar{θ}$ , $\begin{aligned} ℓ_{N} (θ) & = ℓ_{N} (\bar{θ}) + \nabla_{θ} ℓ_{N} (θ) |_{θ = \bar{θ}} (θ - \bar{θ}) + R_{N} (θ), \\ ℓ_{1} (θ) & = ℓ_{1} (\bar{θ}) + \nabla_{θ} ℓ_{1} (θ) |_{θ = \bar{θ}} (θ - \bar{θ}) + R_{1} (θ), \end{aligned}$ where $R_{N} (θ)$ and $R_{1} (θ)$ are remainders. Observing that $R_{N} \approx R_{1}$ , define a surrogate log-likelihood $\begin{aligned} \bar{ℓ} (θ) & = ℓ_{N} (\bar{θ}) + (θ - \bar{θ})^{⊤} \nabla_{θ} ℓ_{N} (θ) |_{θ = \bar{θ}} \\ + {ℓ_{1} (θ) - ℓ_{1} (\bar{θ}) - (θ - \bar{θ})^{⊤} \nabla_{θ} ℓ_{1} (θ) |_{θ = \bar{θ}}} . \end{aligned}$ Ignoring the constant terms, the surrogate log-likelihood is $\bar{ℓ} (θ) = ℓ_{1} (θ) + θ^{⊤} {\nabla_{θ} ℓ_{N} (θ) |_{θ = \bar{θ}} - \nabla_{θ} ℓ_{1} (θ) |_{θ = \bar{θ}}} .$ The score equation based on the surrogate likelihood is $\begin{aligned} \nabla_{θ} \bar{ℓ} (θ) & = \nabla_{θ} ℓ_{1} (θ) + {\nabla_{θ} ℓ_{N} (θ) |_{θ = \bar{θ}} - \nabla_{θ} ℓ_{1} (θ) |_{θ = \bar{θ}}} \\ = 0. \end{aligned}$ Let $\hat{θ}$ be the solution. Expanding it at $θ_{0}$ and using the fact that $\begin{aligned} N^{- 1} {\nabla_{θ θ^{⊤}} ℓ_{1} (θ_{0}) - \nabla_{θ θ^{⊤}} ℓ_{N} (θ_{0})} \\ \to 0 in probability, \end{aligned}$ we can easily show that if $\bar{θ} - θ_{0} = O_{p} (N^{- 1 / 2})$ then $(\hat{θ} - θ_{0}) = {\nabla_{θ θ^{⊤}} ℓ_{N} (θ_{0})}^{- 1} \nabla_{θ} ℓ_{N} (θ_{0}) + o_{p} (N^{- 1 / 2}) .$ If we let $\bar{θ}$ be the MLE based on $ℓ_{1} (θ)$ , the surrogate log-likelihood can be simplified to $\bar{ℓ} (θ) = ℓ_{1} (θ) + θ^{⊤} \nabla_{θ} ℓ_{N} (\bar{θ}),$ because $\nabla_{θ} ℓ_{1} (\bar{θ}) = 0$ .

If the dimension of θ is high, one may add a penalty function in the surrogate log-likelihood and estimate θ by $\tilde{θ} = \arg max_{θ \in Θ} {\bar{ℓ} (θ) - λ ‖ θ ‖_{1}},$ where $‖ θ ‖_{1}$ is the $L_{1}$ -norm of θ. Similarly, Bayesian inference can be adapted to the surrogate likelihood as well.

Duan et al. (Citation2020) proposed distributed algorithms that account for heterogeneous distributions by allowing site-specific nuisance parameters. The proposed methods extend the surrogate likelihood approach (Jordan et al., Citation2019; Wang et al., Citation2017) to the heterogeneous setting by applying a novel density ratio tilting method to the efficient score function. Asymptotically, the approach described in Section 6.2 on nuisance parameters is equivalent to that of Duan et al. (Citation2020).

9. Renewal estimation and incremental inference

Let $U (D_{1}, β) = \nabla_{β} M (D_{1}, β)$ be a score function of β based on some objective function $M (D_{1}, β)$ from the first batch of data, where M can be either the log-likelihood $M (D_{1}, β) = \sum_{i = 1}^{n_{1}} \log f (y_{1 i} | x_{1 i}, β)$ or a pseudo log-likelihood.

Let ${\hat{β}}_{1}$ be the solution to $U (D_{1}, β) = 0,$ when only the first batch of data $D_{1}$ is available. Let $D_{2}$ denote the second batch of data. If both of them are available, we let ${\hat{β}}_{2}$ be the solution to the pooled score equation, $U (D_{1}, β) + U (D_{2}, β) = 0.$ Clearly, ${\hat{β}}_{2}$ is the most efficient estimator of β when $D_{1}$ and $D_{2}$ are both available.

If $D_{2}$ is available but $D_{1}$ is not, with only some summary information ${\hat{β}}_{1}$ and ${\hat{Σ}}_{1}$ in its place, how can we utilize the summary information efficiently? It is not feasible to estimate β by directly solving $U (β) \equiv U (D_{1}, β) + U (D_{2}, β) = 0,$ which involves the individual data of the unavailable $D_{1}$ . Luo (Citation2020) considers expanding $U (D_{1}, β)$ at $β = {\hat{β}}_{1}$ , i.e., $\begin{aligned} U (D_{1}, β) & = U (D_{1}, {\hat{β}}_{1}) + (β - {\hat{β}}_{1})^{⊤} \nabla_{β} U (D_{1}, {\hat{β}}_{1}) \\ + O (‖ β - {\hat{β}}_{1} ‖^{2}) \end{aligned}$ for β close to ${\hat{β}}_{1}$ . As $U (D_{1}, {\hat{β}}_{1}) = 0$ , it follows that $\begin{aligned} U (β) & = U (D_{2}, β) + (β - {\hat{β}}_{1})^{⊤} \nabla_{β} U (D_{1}, {\hat{β}}_{1}) \\ + O (‖ β - {\hat{β}}_{1} ‖^{2}) . \end{aligned}$ Luo (Citation2020) proposes obtaining an updated estimator of β by solving (21) $(β - {\hat{β}}_{1})^{⊤} \nabla_{β} U (D_{1}, {\hat{β}}_{1}) + U (D_{2}, β) = 0.$ (21) Alternatively, we may understand this renewal estimation strategy in the manner of Zhang et al. (Citation2020), who propose estimating β by maximizing (22) $\begin{aligned} \sum_{i = 1}^{n_{2}} \log f (y_{2 i} | x_{2 i}, β) - \frac{1}{2} n_{1} ({\hat{β}}_{1} - β)^{⊤} Σ ({\hat{β}}_{1} - β), \end{aligned}$ (22) where $Σ = E {\nabla_{β} \log f (Y | X, β) \nabla_{β^{⊤}} \log f (Y | X, β)}$ is the Fisher information. If both batches are available, the score for β is $\begin{aligned} S (β) & = \sum_{i = 1}^{n_{1}} \nabla_{β} \log f (y_{1 i} | x_{1 i}, β) \\ + \sum_{i = 1}^{n_{2}} \nabla_{β} \log f (y_{2 i} | x_{2 i}, β) . \end{aligned}$ After recording ${\hat{β}}_{1}$ and Σ, we no longer have the raw data ${(y_{1 i}, x_{1 i}), i = 1, 2, \dots, n_{1}}$ . As $\begin{aligned} {\hat{β}}_{1} - β & = - n_{1}^{- 1} Σ^{- 1} \sum_{i = 1}^{n_{1}} \nabla_{β} \log f (y_{1 i} | x_{1 i}, β) \\ + o_{p} (n_{1}^{- 1 / 2}), \end{aligned}$ differentiating (Equation22(22) $\begin{aligned} \sum_{i = 1}^{n_{2}} \log f (y_{2 i} | x_{2 i}, β) - \frac{1}{2} n_{1} ({\hat{β}}_{1} - β)^{⊤} Σ ({\hat{β}}_{1} - β), \end{aligned}$ (22) ) with respect to β gives $\begin{aligned} \sum_{i = 1}^{n_{2}} \nabla_{β} \log f (y_{2 i} | x_{2 i}, β) - n_{1} Σ ({\hat{β}}_{1} - β) \\ = \sum_{i = 1}^{n_{1}} \nabla_{β} \log f (y_{1 i} | x_{1 i}, β) \\ + \sum_{i = 1}^{n_{2}} \nabla_{β} \log f (y_{2 i} | x_{2 i}, β) + o_{p} (n^{1 / 2}) . \end{aligned}$ Here, we have assumed that $n_{1} = O (n_{2}) = O (n)$ . This indicates that estimating β by maximizing (Equation22(22) $\begin{aligned} \sum_{i = 1}^{n_{2}} \log f (y_{2 i} | x_{2 i}, β) - \frac{1}{2} n_{1} ({\hat{β}}_{1} - β)^{⊤} Σ ({\hat{β}}_{1} - β), \end{aligned}$ (22) ) results in no efficiency loss asymptotically compared with the MLE based on all individual data, where the latter is infeasible in the current situation.

10. Concluding remarks

Rapid growth in hardware technology has made data collection much easier and more effective. In many applications, data often arrive in streams and chunks, which leads to batch-by-batch data or streaming data. For example, web sites served by widely distributed web servers may need to coordinate many distributed clickstream analyses, e.g., to track heavily accessed web pages as part of their real-time performance monitoring. Other examples include financial applications, network monitoring, security, telecommunications data management, manufacturing, and sensor networks (Babcock et al., Citation2002; Nguyen et al., Citation2021). The continuous arrival of such data in multiple, rapid, time-varying, possibly unpredictable and unbounded streams not only yields many fundamentally new research problems but provides contains various forms of auxiliary information.

Assembling information from different data sources has become indispensable in big data and artificial intelligence research. Statistical tools play an essential part in updating information. In this paper, we have presented a selective review of several traditional statistical methods, including meta-analysis, calibration information methods in survey sampling, and EL together with over-identified estimating equations and GMM. We have also briefly reviewed some recently developed statistical methods, including communication-efficient distributed statistical inference and renewal estimation and incremental inference, which can be regarded as the latest developments of calibration information methods in the era of big data. Although these methods were developed in different fields and in different statistical frameworks, in principle, they are asymptotically equivalent to well-known methods developed for meta-analysis. These methods result in almost no or little information loss compared with the case when full data are available.

Finally, we apologize to people whose work has inadvertently have been left out of our reference list.

Acknowledgments

The authors thank the editor and two referees for constructive comments and suggestions that led to significant improvements in this paper.

Correction Statement

This article has been republished with minor changes. These changes do not impact the academic content of the article.

Additional information

Funding

This research was supported by the National Natural Science Foundation of China [grant numbers 71931004, 12171157, and 32030063], the 111 Project [grant number B14019], the Development Fund for Shanghai Talents and the Natural Sciences and Engineering Research Council of Canada (grant number RGPIN-2020-04964).

References

Babcock, B., Babu, S., Datar, M., Motwani, R., & Widom, J. (2002). Models and issues in data stream systems. In Proceedings of the 21 ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems (pp. 1–16). ACM.
Google Scholar
Back, K., & Brown, D. P. (1992). GMM, maximum likelihood, and nonparametric efficiency. Economics Letters, 39(1), 23–28. https://doi.org/10.1016/0165-1765(92)90095-G
Web of Science ®Google Scholar
Braverman, M., Garg, A., Ma, T., Nguyen, H., & Woodruff, D. (2016). Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In Proceedings of the 48th annual ACM symposium on theory of computing (pp. 1011–1020). ACM.
Google Scholar
Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. (2009). Introduction to meta-analysis. Wiley.
Google Scholar
Chatterjee, N., Chen, Y.-H., Maas, P., & Carroll, R. J. (2016). Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources. Journal of the American Statistical Association, 111(513), 107–117. https://doi.org/10.1080/01621459.2015.1123157
PubMed Web of Science ®Google Scholar
Chaudhuri, S., Handcock, M. S., & Rendall, M. S. (2008). Generalized linear models incorporating population level information: an empirical likelihood based approach. Journal of the Royal Statistical Society: Series B, 70(2), 311–328. https://doi.org/10.1111/rssb.2008.70.issue-2
Google Scholar
Chen, J., & Qin, J. (1993). Empirical likelihood estimation for finite populations and the effective usage of auxiliary information. Biometrika, 80(1), 107–116. https://doi.org/10.1093/biomet/80.1.107
Web of Science ®Google Scholar
Chen, J., Sitter, R., & Wu, C. (2002). Using empirical likelihood methods to obtain range restricted weights in regression estimators for surveys. Biometrika, 89(1), 230–237. https://doi.org/10.1093/biomet/89.1.230
Web of Science ®Google Scholar
Cochran, W. G. (1977). Sampling techniques (3rd ed.). Wiley.
Google Scholar
Dersimonian, R., & Laird, N. (1986). Meta-analysis in clinical trials. Controlled Clinical Trials, 7(3), 177–188. https://doi.org/10.1016/0197-2456(86)90046-2
PubMedGoogle Scholar
Duan, R., Ning, Y., & Chen, Y. (2020). Heterogeneity-aware and communication-efficient distributed statistical inference. arXiv:1912.09623v1.
Google Scholar
Duchi, J., Jordan, M., Wainwright, M., & Zhang, Y. (2015). Optimality guarantees for distributed statistical estimation. arXiv:1405.0782.
Google Scholar
Han, P., & Lawless, J. (2016). Comment. Journal of the American Statistical Association, 111(513), 118–121. https://doi.org/10.1080/01621459.2016.1149399
Google Scholar
Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica, 50(4), 1029–1054. https://doi.org/10.2307/1912775
Web of Science ®Google Scholar
Hartely, H. O., & Rao, J. N. K. (1968). A new estimation theory for sample surveys. Biometrika, 55(3), 547–557. https://doi.org/10.1093/biomet/55.3.547
Web of Science ®Google Scholar
Imbens, G., & Lancaster, T. (1994). Combining micro and macro data in microeconometric models. Review of Economic Studies, 61(4), 655–680. https://doi.org/10.2307/2297913
Web of Science ®Google Scholar
Jordan, M. I., Lee, J. D., & Yang, Y. (2019). Communication-efficient distribution statistical inference. Journal of the American Statistical Association, 114(526), 668–681. https://doi.org/10.1080/01621459.2018.1429274
Web of Science ®Google Scholar
Lee, J., Liu, Q., Sun, Y., & Taylor, J. (2017). Communication-efficient sparse regression. Journal of Machine Learning Research, 18, 1–30. http://jmlr.org/papers/v18/16-002.html
Web of Science ®Google Scholar
Lin, D. Y., & Zeng, D. (2010). On the relative efficiency of using summary statistics versus individual-level data in meta-analysis. Biometrika, 97(2), 321–332. https://doi.org/10.1093/biomet/asq006
PubMed Web of Science ®Google Scholar
Luo, L. (2020). Renewable estimation and incremental inference in generalized linear models with streaming data sets. Journal of the Royal Statistical Society, Series B, 82(1), 69–97. https://doi.org/10.1111/rssb.12352
Google Scholar
Neiswanger, W., Wang, C., & Xing, E. (2015). Asymptotically exact, embarrassingly parallel MCMC. In Proceedings of the 30th conference on uncertainty in artificial intelligence (pp. 623–632). AUAI Press.
Google Scholar
Nguyen, T. D., Shih, M. H., Srivastava, D., Tirthapura, S., & Xu, B. (2021). Stratified random sampling from streaming and stored data. Distributed and Parallel Databases, 39(3), 665–710. https://doi.org/10.1007/s10619-020-07315-w
Web of Science ®Google Scholar
Owen, A. B. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika, 75(2), 237–249. https://doi.org/10.1093/biomet/75.2.237
Web of Science ®Google Scholar
Owen, A. B. (1990). Empirical likelihood ratio confidence regions. Annals of Statistics, 18(1), 90–120. https://doi.org/10.1214/aos/1176347494
Web of Science ®Google Scholar
Owen, A. B. (2001). Empirical likelihood. CRC.
Google Scholar
Qin, J. (2000). Combining parametric and empirical likelihoods. Biometrika, 87(2), 484–490. https://doi.org/10.1093/biomet/87.2.484
Web of Science ®Google Scholar
Qin, J. (2017). Biased sampling, over-identified parameter problems and beyond. Springer.
Google Scholar
Qin, J., & Lawless, J. (1994). Empirical likelihood and general equations. Annals of Statistics, 22(1), 300–325. https://doi.org/10.1214/aos/1176325370
Web of Science ®Google Scholar
Qin, J., Zhang, H., Li, P., Albanes, D., & Yu, K. (2015). Using covariate specific disease prevalence information to increase the power of case-control study. Biometrika, 102(1), 169–180. https://doi.org/10.1093/biomet/asu048
Web of Science ®Google Scholar
Susanne, M. S. (2007). Point estimation with exponentially tilted empirical likelihood. Annals of Statistics, 35(2), 634–672. https://doi.org/10.1214/009053606000001208
Web of Science ®Google Scholar
Tian, L., & Gu, Q. (2016). Communication-efficient distributed sparse linear discriminant analysis. arXiv:1610.04798.
Google Scholar
van de Geer, S., Buhlmann, P., Ritov, Y., & Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high dimensional models. Annals of Statistics, 42(3), 1166–1202. https://doi.org/10.1214/14-AOS1221
Web of Science ®Google Scholar
van de Vaart, V. W. (2000). Asymptotic statistics. Cambridge University Press.
Google Scholar
Wang, X., & Dunson, D. (2015). Parallelizing MCMC via Weierstrass sampler. arXiv:1312.4605.
Google Scholar
Wang, J., Kolar, M., Srebro, N., & Zhang, T. (2017). Efficient distributed learning with sparsity. In Proceedings of the 34th international conference on machine learning, Sydney, Australia, PMLR 70 (pp. 3636–3645).
Google Scholar
Wu, C., & Sitter, R. R. (2001). A model-calibration approach to using complete auxiliary information from survey data. Journal of the American Statistical Association, 96(453), 185–193. https://doi.org/10.1198/016214501750333054
Web of Science ®Google Scholar
Wu, C., & Thompson, M. E. (2020). Sampling theory and practice. Springer.
Google Scholar
Zeng, D. & Lin, D. Y. (2015). On random-effects meta-analysis. Biometrika, 102(2), 281–294.
Google Scholar
Zhang, Y., Duchi, J., & Wainwright, M. (2013). Communication-efficient algorithms for statistical optimization. Journal of Machine Learning Research, 14, 3321–3363.
Web of Science ®Google Scholar
Zhang, H., Deng, L., Schiffman, M., Qin, J., & Yu, K. (2020). Generalized integration model for improved statistical inference by leveraging external summary data. Biometrika, 107(3), 689–703. https://doi.org/10.1093/biomet/asaa014
Web of Science ®Google Scholar

A selective review of statistical methods using calibration information from similar studies

Abstract

1. Introduction

2. Two simple information-combining methods

2.1. Convex combination

2.2. Random-effect meta-analysis

3. Empirical likelihood and general estimating equations

3.1. Definition of empirical likelihood

3.2. General estimating equations

Qin & Lawless, Citation1994

3.3. Calculation of the information bound

3.4. A sketched proof of (8)

3.5. Empirical entropy family

4. Enhancing efficiency using auxiliary information

5. Combining summary information: a more flexible method for meta-analysis

5.1. Setup and solution

5.2. A comparison

5.3. Proof of $V_{2} - V_{1} \geq 0$

6. Calibration of information from previous studies

6.1. Efficiency comparison

6.2. When nuisance parameters are present

7. Using covariate-specific disease prevalent information

7.1. Induced estimating equations under case–control sampling

7.2. Empirical likelihood approach

7.3. Generalizations

7.4. More on the use of auxiliary information

8. Communication-efficient distributed inference

9. Renewal estimation and incremental inference

10. Concluding remarks

Acknowledgments

References

Information for

Open access

Opportunities

Help and information

A selective review of statistical methods using calibration information from similar studies

Abstract

1. Introduction

2. Two simple information-combining methods

2.1. Convex combination

2.2. Random-effect meta-analysis

3. Empirical likelihood and general estimating equations

3.1. Definition of empirical likelihood

3.2. General estimating equations

Qin & Lawless, Citation1994

3.3. Calculation of the information bound

3.4. A sketched proof of (8)

3.5. Empirical entropy family

4. Enhancing efficiency using auxiliary information

5. Combining summary information: a more flexible method for meta-analysis

5.1. Setup and solution

5.2. A comparison

5.3. Proof of V2−V1≥0

6. Calibration of information from previous studies

6.1. Efficiency comparison

6.2. When nuisance parameters are present

7. Using covariate-specific disease prevalent information

7.1. Induced estimating equations under case–control sampling

7.2. Empirical likelihood approach

7.3. Generalizations

7.4. More on the use of auxiliary information

8. Communication-efficient distributed inference

9. Renewal estimation and incremental inference

10. Concluding remarks

Acknowledgments

Correction Statement

Additional information

Funding

References

Related research

To cite this article:

Download citation

Your download is now in progress and you may close this window

Login or register to access this feature

Information for

Open access

Opportunities

Help and information

Keep up to date

5.3. Proof of $V_{2} - V_{1} \geq 0$