MLE with datasets from populations having shared parameters


Abstract

We consider maximum likelihood estimation with two or more datasets sampled from different populations with shared parameters. Although more datasets with shared parameters can increase statistical accuracy, this paper shows how to handle heterogeneity among different populations for correctness of estimation and inference. Asymptotic distributions of maximum likelihood estimators are derived under either regular cases where regularity conditions are satisfied or some non-regular situations. A bootstrap variance estimator for assessing performance of estimators and/or making large sample inference is also introduced and evaluated in a simulation study.


1. Introduction

With advanced technologies in data collection and storage, modern statistical analyses often involve multiple datasets that are independent samples from different populations with shared parameters. Typically, one of these datasets is primary, with carefully collected data from a population of interest. The other datasets come from external sources, such as other studies, administrative records, and publicly available information on the internet.

On one hand, the fact that populations share common parameters provides a great opportunity to increase statistical accuracy by using multiple datasets instead of a single one. On the other hand, because of differences in data collection, study purpose, and/or time of investigation, heterogeneity often exists among populations, so we cannot simply combine all datasets into a single large dataset for analysis; instead, we must develop or modify statistical methodology to use multiple datasets correctly. Research on analysis with multiple datasets fits into the general framework of data integration (Kim et al., 2021; Lohr & Raghunathan, 2017; Merkouris, 2004; Rao, 2021; Yang & Kim, 2020; Zhang et al., 2017; Zieschang, 1990).

In this article, we study maximum likelihood estimation (MLE) for independent datasets with parametric populations sharing some (not necessarily all) parameters. For simplicity of presentation, we focus on the case of two independent datasets. The main idea and result can be extended to multiple datasets. Our research can also be extended to semi-parametric estimation, such as empirical likelihood or Cox regression for survival data.

Throughout, we consider two independent random samples. One random sample of size n, resulting in a dataset $\{X_{1}, \dots, X_{n}\}$, is drawn from a parametric population with probability density $f (x, θ, ϕ)$ (for either continuous or discrete x), where f is a known function and θ and ϕ are unknown parameter vectors. Another random sample of size m, resulting in a dataset $\{Y_{1}, \dots, Y_{m}\}$, is drawn from a population with probability density $g (y, θ, φ)$, where g is a known function and θ and φ are unknown parameter vectors. Note that $X_{i}$ and $Y_{j}$ can be vectors. The shared parameter θ can be either the main parameter vector of interest or a nuisance parameter vector, and ϕ and φ are the other parameter vectors in the two populations.

Let ϑ denote the vector with θ, ϕ, and φ as sub-vectors. In Section 2, we derive the maximum likelihood estimator (MLE) of ϑ based on two datasets, which is expected to be asymptotically more efficient than each MLE based on a single dataset, since more data are used for estimating the shared parameter θ, a component of ϑ. The asymptotic normality of the MLE of ϑ is established when densities f and g satisfy regularity conditions that are typically assumed for MLEs. Applications to location-scale problems are discussed in Section 3, where we also present a situation in which f or g does not satisfy the regularity conditions. Section 4 contains an example in which regularity conditions do not hold and the MLE is not asymptotically normal. A discrete data problem with a common mean is considered in Section 5. Section 6 is devoted to the scenario where an additional uncertainty exists in the second population density g. To handle situations where asymptotic normality of the MLE of ϑ is not available, we introduce a bootstrap variance estimator in Section 7 and provide some simulation results to examine its finite-sample performance.

2. MLEs with two datasets

The following are regularity conditions for a probability density $p (x, ϑ)$ (with a fixed ϑ) of a continuous or discrete random variable/vector X, typically assumed for MLEs in parametric populations (Shao, 2003).

(R1) For every x in the range of X, $p (x, ϑ)$ is twice continuously differentiable with respect to ϑ in an open set of the Euclidean space with a fixed dimension.
(R2) $\frac{\partial}{\partial ϑ} \int p (x, ϑ) \, d x = \int \frac{\partial}{\partial ϑ} p (x, ϑ) \, d x$ and $\frac{\partial}{\partial ϑ} \int \frac{\partial}{\partial ϑ^{⊤}} p (x, ϑ) \, d x = \int \frac{\partial^{2}}{\partial ϑ \partial ϑ^{⊤}} p (x, ϑ) \, d x$, where $C^{⊤}$ denotes the transpose of a vector or matrix C and the integral should be replaced by an appropriate summation when X is discrete.
(R3) The Fisher information matrix $- E \{\frac{\partial^{2}}{\partial ϑ \partial ϑ^{⊤}} \log p (X, ϑ)\}$ exists and is positive definite.
(R4) For any given ϑ, there exist a positive number $c_{ϑ}$ and a positive function $h_{ϑ}$ such that $E \{h_{ϑ} (X)\} < \infty$ and $\sup_{γ : ‖ γ - ϑ ‖ < c_{ϑ}} ‖ \frac{\partial^{2} \log p (x, γ)}{\partial γ \partial γ^{⊤}} ‖ \leq h_{ϑ} (x)$ for all x in the range of X, where $‖ A ‖ = \sqrt{\mathrm{trace} (A^{⊤} A)}$ for any matrix A.

In this section, we assume that both f and g satisfy regularity conditions (R1)–(R4). When some regularity conditions are not satisfied, we have to deal with the problem case by case. See, for example, the problem of normal and Laplace distributions in Section 3.2 and the problem of two truncation distributions in Section 4.

The log-likelihood function of ϑ is $ℓ (ϑ) = \sum_{i = 1}^{n} \log f (X_{i}, θ, ϕ) + \sum_{j = 1}^{m} \log g (Y_{j}, θ, φ)$ and the score function is $s (ϑ) = \frac{\partial ℓ (ϑ)}{\partial ϑ} = \begin{pmatrix} \sum_{i = 1}^{n} \frac{\partial \log f (X_{i}, θ, ϕ)}{\partial θ} + \sum_{j = 1}^{m} \frac{\partial \log g (Y_{j}, θ, φ)}{\partial θ} \\ \sum_{i = 1}^{n} \frac{\partial \log f (X_{i}, θ, ϕ)}{\partial ϕ} \\ \sum_{j = 1}^{m} \frac{\partial \log g (Y_{j}, θ, φ)}{\partial φ} \end{pmatrix}.$ If $\hat{ϑ}$ is a solution to the score equation $s (ϑ) = 0$, then we call $\hat{ϑ}$ an MLE of ϑ, although traditionally an MLE is defined as a maximizer of $ℓ (ϑ)$ over the range of ϑ, and a $\hat{ϑ}$ satisfying $s (\hat{ϑ}) = 0$ may not be a maximizer.

A solution to the score equation often does not have an explicit form, even when each MLE of a single population has an explicit solution.

Under regularity conditions (R1)–(R4), $E \{s (ϑ)\} = 0$ and $\mathrm{Var} \{s (ϑ)\} = - E \{\frac{\partial s (ϑ)}{\partial ϑ^{⊤}}\} = n I (ϑ)$, which is the Fisher information matrix contained in the two samples. Let $\begin{aligned} I_{θ θ} (θ, ϕ) & = - E \{\frac{\partial^{2} \log f (X_{i}, θ, ϕ)}{\partial θ \partial θ^{⊤}}\}, & I_{θ θ} (θ, φ) & = - E \{\frac{\partial^{2} \log g (Y_{j}, θ, φ)}{\partial θ \partial θ^{⊤}}\}, \\ I_{θ ϕ} (θ, ϕ) & = - E \{\frac{\partial^{2} \log f (X_{i}, θ, ϕ)}{\partial θ \partial ϕ^{⊤}}\}, & I_{θ φ} (θ, φ) & = - E \{\frac{\partial^{2} \log g (Y_{j}, θ, φ)}{\partial θ \partial φ^{⊤}}\}, \\ I_{ϕ ϕ} (θ, ϕ) & = - E \{\frac{\partial^{2} \log f (X_{i}, θ, ϕ)}{\partial ϕ \partial ϕ^{⊤}}\}, & I_{φ φ} (θ, φ) & = - E \{\frac{\partial^{2} \log g (Y_{j}, θ, φ)}{\partial φ \partial φ^{⊤}}\}. \end{aligned}$ Then $I (ϑ) = \begin{pmatrix} I_{θ θ} (θ, ϕ) + a I_{θ θ} (θ, φ) & I_{θ ϕ} (θ, ϕ) & a I_{θ φ} (θ, φ) \\ I_{θ ϕ} (θ, ϕ)^{⊤} & I_{ϕ ϕ} (θ, ϕ) & 0 \\ a I_{θ φ} (θ, φ)^{⊤} & 0 & a I_{φ φ} (θ, φ) \end{pmatrix}$ is positive definite, where $a = m / n$ and, without loss of generality, we assume that m = an for a fixed a > 0. It can be seen that $I (ϑ)$ is increasing in a with respect to the ordering in which $A \geq B$ for two non-negative definite matrices A and B if and only if A − B is non-negative definite.

Using the standard argument in asymptotic theory, e.g., Theorem 4.17 in Shao (2003), we obtain the following result.

Theorem 2.1

Assume (R1)–(R4) and that m = an with a remaining fixed as n increases. Then there exists $\hat{ϑ}$ (depending on n) such that, as $n \to \infty$, $P \{s (\hat{ϑ}) = 0\} \to 1$ and $\sqrt{n} (\hat{ϑ} - ϑ) \xrightarrow{d} N (0, \{I (ϑ)\}^{- 1}), \qquad (1)$ where $\xrightarrow{d}$ denotes convergence in distribution and $N (C, D)$ is the normal distribution with mean C and covariance matrix D.

The asymptotic result (1) enables us to assess the performance of $\hat{ϑ}$ and to carry out large-sample statistical inference on the parameter ϑ or any of its components θ, ϕ, and φ. When some of the regularity conditions (R1)–(R4) are not satisfied, however, we may apply the bootstrap method (see Sections 3.2 and 7 for the normal and Laplace problem) or directly derive the asymptotic distribution of $\hat{ϑ}$ (see Section 4 for the problem of two truncation distributions).
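As a worked consequence of the block structure of $I (ϑ)$, the asymptotic covariance of the shared component $\hat{θ}$ can be written explicitly. The display below is not in the source; it is the standard block-inverse (Schur complement) identity applied to the matrix $I (ϑ)$ defined in this section, and it uses only the fact that $I_{ϕ ϕ} (θ, ϕ)$ and $I_{φ φ} (θ, φ)$ are invertible, which follows from the positive definiteness of $I (ϑ)$.

```latex
\begin{aligned}
\big[\{I(\vartheta)\}^{-1}\big]_{\theta\theta}
 &= \Big\{ I_{\theta\theta}(\theta,\phi)
      - I_{\theta\phi}(\theta,\phi)\, I_{\phi\phi}(\theta,\phi)^{-1} I_{\theta\phi}(\theta,\phi)^{\top} \\
 &\qquad + a \big[ I_{\theta\theta}(\theta,\varphi)
      - I_{\theta\varphi}(\theta,\varphi)\, I_{\varphi\varphi}(\theta,\varphi)^{-1} I_{\theta\varphi}(\theta,\varphi)^{\top} \big] \Big\}^{-1}.
\end{aligned}
```

Read this way, the information about θ after accounting for the unshared parameters adds across the two samples, with the second sample weighted by $a = m / n$. This is one way to see why the two-sample MLE of θ is asymptotically at least as efficient as either single-sample MLE.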

3. Application to location-scale problems

An application of our general result in Section 2 is to the case where $f (x, θ, ϕ) = \frac{1}{σ} f (\frac{x - μ}{σ})$ and $g (y, θ, φ) = \frac{1}{τ} g (\frac{y - ν}{τ})$ for two continuous probability density functions f and g on the real line, i.e., both populations are in location-scale families. We have several scenarios.

Two location-scale families sharing the same location and scale parameters: $μ = ν$ , $σ = τ$ , $θ = (μ, σ)^{⊤}$ , and both ϕ and φ are constants.
Two location-scale families sharing the same location parameter but having different scale parameters: $μ = ν$ , $θ = μ$ , $ϕ = σ$ , and $φ = τ$ .
Two location-scale families sharing the same scale parameter but having different location parameters: $σ = τ$ , $θ = σ$ , $ϕ = μ$ , and $φ = ν$ .

In a location-scale problem, it is often true that $I_{θ ϕ} (θ, ϕ) = 0$ and $I_{θ φ} (θ, φ) = 0$ and, hence, the inverse of $I (ϑ)$ can be easily obtained. For example, if both f and g are continuously differentiable functions symmetric about 0, then it follows from Example 3.9 in Shao (2003) that both $I_{θ ϕ} (θ, ϕ)$ and $I_{θ φ} (θ, φ)$ vanish.

In the following we consider a special case in detail.
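To make the second scenario concrete, the following is a minimal sketch, assuming both base densities are standard normal, so that the X sample is $N (μ, σ^{2})$ and the Y sample is $N (μ, τ^{2})$. This choice, the function names, and the simulated data are illustrative assumptions and are not taken from the source. Each single-sample MLE is explicit here, yet the joint score equations have no closed-form solution, so the sketch alternates between the explicit scale updates and the precision-weighted location update. The plug-in standard error uses the fact that the cross-information terms vanish for symmetric densities.

```python
import random

def shared_location_mle(x, y, iters=200, tol=1e-12):
    """Solve the joint score equations for N(mu, sigma^2) and N(mu, tau^2) samples.

    Illustrative fixed-point iteration (an assumption of this sketch):
    scales given mu are sample root-mean-square deviations, and mu given
    the scales is a precision-weighted average of the two sample means.
    """
    n, m = len(x), len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / m
    mu = (sum(x) + sum(y)) / (n + m)          # pooled mean as starting value
    for _ in range(iters):
        s2 = sum((v - mu) ** 2 for v in x) / n
        t2 = sum((v - mu) ** 2 for v in y) / m
        w_x, w_y = n / s2, m / t2
        new_mu = (w_x * x_bar + w_y * y_bar) / (w_x + w_y)
        converged = abs(new_mu - mu) < tol
        mu = new_mu
        if converged:
            break
    s2 = sum((v - mu) ** 2 for v in x) / n
    t2 = sum((v - mu) ** 2 for v in y) / m
    # Plug-in standard error of mu-hat from the block-diagonal information
    se_mu = (1.0 / (n / s2 + m / t2)) ** 0.5
    return mu, s2 ** 0.5, t2 ** 0.5, se_mu

def location_score(mu, sigma, tau, x, y):
    """Derivative of the joint log-likelihood with respect to mu."""
    return (sum(v - mu for v in x) / sigma ** 2
            + sum(v - mu for v in y) / tau ** 2)

rng = random.Random(42)
x = [rng.gauss(2.0, 1.0) for _ in range(100)]
y = [rng.gauss(2.0, 3.0) for _ in range(150)]
mu_hat, sigma_hat, tau_hat, se_mu = shared_location_mle(x, y)
print(mu_hat, sigma_hat, tau_hat, se_mu)
```

The iteration never decreases the likelihood, since each step maximizes over one block of parameters, but convergence to the global maximizer is not guaranteed in general and should be checked in practice.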

3.1. Normal and Laplace densities with a single scale parameter

Suppose that $f (x, θ) = \frac{1}{\sqrt{2 π} θ} e^{- x^{2} / (2 θ^{2})}$, $x \in (- \infty, \infty)$, which is the normal distribution $N (0, θ^{2})$, and that $g (y, θ) = \frac{1}{2 θ} e^{- | y | / θ}$, $y \in (- \infty, \infty)$, which is the Laplace distribution (also called the double exponential distribution) with mean zero and standard deviation $\sqrt{2} θ$. The two densities share the common scale parameter $θ > 0$.

The MLEs of θ based on data from f and g, respectively, are ${\hat{θ}}_{N} = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} X_{i}^{2}}$ and ${\hat{θ}}_{E} = \frac{1}{m} \sum_{j = 1}^{m} | Y_{j} |.$ In this particular case, we can obtain an explicit form of the MLE $\hat{θ}$ of θ based on all data from the two samples. The log-likelihood based on the two samples is $ℓ (θ) = - \sum_{i = 1}^{n} \frac{X_{i}^{2}}{2 θ^{2}} - \sum_{j = 1}^{m} \frac{| Y_{j} |}{θ} - \log \{(2 π)^{n / 2} 2^{m} θ^{n + m}\}.$ The score function is $s (θ) = \frac{1}{θ^{3}} \sum_{i = 1}^{n} X_{i}^{2} + \frac{1}{θ^{2}} \sum_{j = 1}^{m} | Y_{j} | - \frac{n + m}{θ}.$ Setting $s (θ) = 0$, multiplying by $θ^{3} / (n + m)$, and using the form of the MLE from each sample, we obtain $θ^{2} - (\frac{m {\hat{θ}}_{E}}{n + m}) θ - (\frac{n {\hat{θ}}_{N}^{2}}{n + m}) = 0.$ Since $θ > 0$ and only one root is positive, the MLE of θ is $\hat{θ} = \frac{1}{2} \{\frac{a {\hat{θ}}_{E}}{a + 1} + \sqrt{(\frac{a {\hat{θ}}_{E}}{a + 1})^{2} + \frac{4 {\hat{θ}}_{N}^{2}}{a + 1}}\}, \qquad (2)$ where $a = m / n$. Note that $\hat{θ}$ is a nonlinear function of ${\hat{θ}}_{N}$ and ${\hat{θ}}_{E}$. In general, the MLE of the shared parameter based on two datasets is not a simple function of the separate MLEs based on each single dataset.

To derive the asymptotic distribution of $\hat{θ}$, we can use the general result (1), because regularity conditions (R1)–(R4) are satisfied for f and g. Since $\hat{θ}$ has an explicit form, we can also derive it directly. Because the $X_{i}$'s and $Y_{j}$'s are independent and $a = m / n$, $\sqrt{n} \begin{pmatrix} {\hat{θ}}_{N} - θ \\ \sqrt{a} {\hat{θ}}_{E} - \sqrt{a} θ \end{pmatrix} \xrightarrow{d} N (0, \begin{pmatrix} θ^{2} / 2 & 0 \\ 0 & θ^{2} \end{pmatrix}).$ Define the function (of two real arguments, not to be confused with the density g) $g (t, s) = \frac{1}{2} \{\frac{\sqrt{a} s}{a + 1} + \sqrt{\frac{a s^{2}}{(a + 1)^{2}} + \frac{4 t^{2}}{a + 1}}\}.$ Then $g ({\hat{θ}}_{N}, \sqrt{a} {\hat{θ}}_{E}) = \hat{θ}$ and $g (θ, \sqrt{a} θ) = θ$. Hence, by the delta method, e.g., Theorem 1.12 in Shao (2003), $\sqrt{n} (\hat{θ} - θ) \xrightarrow{d} N (0, \nabla g^{⊤} \begin{pmatrix} θ^{2} / 2 & 0 \\ 0 & θ^{2} \end{pmatrix} \nabla g),$ where $\nabla g$ is the gradient of g evaluated at $(t, s) = (θ, \sqrt{a} θ)$, i.e., $\begin{aligned} \frac{\partial g}{\partial t} & = \frac{2 t}{a + 1} / \sqrt{\frac{a s^{2}}{(a + 1)^{2}} + \frac{4 t^{2}}{a + 1}}, \\ \frac{\partial g}{\partial s} & = \frac{1}{2} \{\frac{\sqrt{a}}{a + 1} + \frac{a s}{(a + 1)^{2}} / \sqrt{\frac{a s^{2}}{(a + 1)^{2}} + \frac{4 t^{2}}{a + 1}}\}, \\ \nabla g |_{(t, s) = (θ, \sqrt{a} θ)} & = (\frac{2}{a + 2}, \frac{\sqrt{a}}{a + 2})^{⊤}. \end{aligned}$ This leads to the following result.
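The closed form (2) is easy to check numerically. The sketch below computes $\hat{θ}$ from two small made-up samples and evaluates the score $s (θ)$ at the result; the function names and the data values are illustrative assumptions, not part of the source.

```python
import math

def combined_scale_mle(x, y):
    """Positive root of the joint score equation, i.e. formula (2).

    x: sample assumed to come from N(0, theta^2)
    y: sample assumed to come from the Laplace density exp(-|y|/theta) / (2 theta)
    """
    n, m = len(x), len(y)
    a = m / n
    theta_n_sq = sum(v * v for v in x) / n      # squared normal-only MLE
    theta_e = sum(abs(v) for v in y) / m        # Laplace-only MLE
    b = a * theta_e / (a + 1.0)
    return 0.5 * (b + math.sqrt(b * b + 4.0 * theta_n_sq / (a + 1.0)))

def score(theta, x, y):
    """Score function s(theta) of the joint log-likelihood."""
    return (sum(v * v for v in x) / theta ** 3
            + sum(abs(v) for v in y) / theta ** 2
            - (len(x) + len(y)) / theta)

x = [0.5, -1.2, 0.3, 2.0]     # hypothetical normal sample
y = [1.0, -0.4, 0.7]          # hypothetical Laplace sample
theta_hat = combined_scale_mle(x, y)
print(theta_hat, score(theta_hat, x, y))   # the score should be numerically zero
```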

Corollary 3.1

Assume that m = an with a remaining fixed as n increases. Then, as $n \to \infty$, $\sqrt{n} (\hat{θ} - θ) \xrightarrow{d} N (0, \frac{θ^{2}}{a + 2}).$

The asymptotic relative efficiency of ${\hat{θ}}_{N}$ with respect to $\hat{θ}$ is $2 / (a + 2)$ , which is decreasing in a and bounded between 0 and 1. The asymptotic relative efficiency of ${\hat{θ}}_{E}$ with respect to $\hat{θ}$ is $a / (a + 2)$ , which is increasing in a and bounded between 0 and 1.
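Corollary 3.1 can be checked by a small Monte Carlo experiment. The sketch below is an illustration under assumed settings (θ = 1, n = m = 200, 1000 replications, a fixed seed) that are not taken from the source; it compares n times the simulated variance of $\hat{θ}$ with the limit $θ^{2} / (a + 2)$ and also evaluates the two asymptotic relative efficiencies.

```python
import math
import random

def combined_scale_mle(x, y):
    """Closed-form MLE (2) for N(0, theta^2) data x and Laplace(0, theta) data y."""
    n, m = len(x), len(y)
    a = m / n
    theta_n_sq = sum(v * v for v in x) / n
    theta_e = sum(abs(v) for v in y) / m
    b = a * theta_e / (a + 1.0)
    return 0.5 * (b + math.sqrt(b * b + 4.0 * theta_n_sq / (a + 1.0)))

def draw_laplace(rng, theta):
    """Laplace(0, theta) draw: exponential magnitude with mean theta, random sign."""
    magnitude = rng.expovariate(1.0 / theta)
    return magnitude if rng.random() < 0.5 else -magnitude

def scaled_mc_variance(theta=1.0, n=200, a=1.0, reps=1000, seed=0):
    """Return n times the Monte Carlo variance of the two-sample MLE."""
    rng = random.Random(seed)
    m = int(round(a * n))
    estimates = []
    for _ in range(reps):
        x = [rng.gauss(0.0, theta) for _ in range(n)]
        y = [draw_laplace(rng, theta) for _ in range(m)]
        estimates.append(combined_scale_mle(x, y))
    mean = sum(estimates) / reps
    var = sum((e - mean) ** 2 for e in estimates) / (reps - 1)
    return n * var

theta, a = 1.0, 1.0
mc_value = scaled_mc_variance(theta=theta, a=a)
limit_value = theta ** 2 / (a + 2.0)       # variance in Corollary 3.1
are_normal_only = 2.0 / (a + 2.0)          # efficiency of the normal-only MLE
are_laplace_only = a / (a + 2.0)           # efficiency of the Laplace-only MLE
print(mc_value, limit_value, are_normal_only, are_laplace_only)
```

With equal sample sizes (a = 1), the efficiencies are 2/3 and 1/3, so in this setting the normal sample carries twice as much information about θ as the Laplace sample.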

3.2. Normal and Laplace densities with shared scale and location parameters

Consider a more general case where f and g share a scale parameter and a location parameter. That is, $f (x, θ, μ) = \frac{1}{\sqrt{2 π} θ} e^{- (x - μ)^{2} / (2 θ^{2})}$, $x \in (- \infty, \infty)$, which is the normal distribution $N (μ, θ^{2})$, and $g (y, θ, μ) = \frac{1}{2 θ} e^{- | y - μ | / θ}$, $y \in (- \infty, \infty)$, which is the Laplace distribution with mean μ and standard deviation $\sqrt{2} θ$. Note that regularity conditions (R1)–(R4) are not satisfied for g, since g is not differentiable in μ at $μ = y$.

For the parameter vector $ϑ = (μ, θ)^{⊤}$, the log-likelihood is $ℓ (ϑ) = - \sum_{i = 1}^{n} \frac{(X_{i} - μ)^{2}}{2 θ^{2}} - \sum_{j = 1}^{m} \frac{| Y_{j} - μ |}{θ} - \log \{(2 π)^{n / 2} 2^{m} θ^{n + m}\}.$ Although $ℓ (ϑ)$ is not always differentiable in μ, it is concave in μ for each fixed θ and, hence, the MLE $\hat{μ}$ of μ exists, though it does not have an explicit form. The MLE of θ is given by (2) with ${\hat{θ}}_{N}$ and ${\hat{θ}}_{E}$ replaced by, respectively, $\sqrt{\frac{1}{n} \sum_{i = 1}^{n} (X_{i} - \hat{μ})^{2}}$ and $\frac{1}{m} \sum_{j = 1}^{m} | Y_{j} - \hat{μ} |.$ The asymptotic distribution of $\hat{ϑ} = (\hat{μ}, \hat{θ})^{⊤}$ cannot be obtained from (1), since g does not satisfy conditions (R1)–(R4). For assessing the performance of $\hat{ϑ}$ and/or making inference, we recommend a bootstrap method, which is discussed in Section 7 and studied by simulation.
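Since $\hat{μ}$ has no explicit form, it has to be computed numerically. The following is one possible sketch, not the authors' algorithm: it alternates between a ternary search over μ (valid because the objective is convex in μ for fixed θ) and the closed-form update (2) for θ given μ. The function names, iteration counts, and simulated data are assumptions made for illustration.

```python
import math
import random

def neg_loglik_in_mu(mu, theta, x, y):
    """The mu-dependent part of -l(mu, theta); convex in mu for fixed theta."""
    return (sum((v - mu) ** 2 for v in x) / (2.0 * theta ** 2)
            + sum(abs(v - mu) for v in y) / theta)

def argmin_convex(fun, lo, hi, iters=60):
    """Ternary search for a minimizer of a convex function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if fun(m1) <= fun(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

def theta_given_mu(mu, x, y):
    """Formula (2) with both single-sample scale estimates centred at mu."""
    n, m = len(x), len(y)
    a = m / n
    t_n_sq = sum((v - mu) ** 2 for v in x) / n
    t_e = sum(abs(v - mu) for v in y) / m
    b = a * t_e / (a + 1.0)
    return 0.5 * (b + math.sqrt(b * b + 4.0 * t_n_sq / (a + 1.0)))

def normal_laplace_mle(x, y, sweeps=20):
    """Coordinate ascent on the joint log-likelihood (illustrative sketch)."""
    lo, hi = min(x + y), max(x + y)      # the minimizer in mu lies in the data range
    mu = sum(x) / len(x)                 # start from the normal-sample mean
    theta = theta_given_mu(mu, x, y)
    for _ in range(sweeps):
        mu = argmin_convex(lambda t: neg_loglik_in_mu(t, theta, x, y), lo, hi)
        theta = theta_given_mu(mu, x, y)
    return mu, theta

rng = random.Random(7)
x = [rng.gauss(1.0, 1.0) for _ in range(200)]
y = [1.0 + (1 if rng.random() < 0.5 else -1) * rng.expovariate(1.0)
     for _ in range(200)]
mu_hat, theta_hat = normal_laplace_mle(x, y)
print(mu_hat, theta_hat)
```

Each sweep cannot decrease the likelihood, but this sketch does not prove convergence to the global maximizer; a profile-likelihood search over μ would be a reasonable alternative.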

4. Application to two truncation distributions

Let $f (x, θ)$ and $g (y, θ)$ be positive density functions on the interval $(0, θ)$ and zero outside $(0, θ)$, where $θ > 0$ is an unknown scale parameter common to both populations, and f and g are known when θ is known. The likelihood is $\prod_{i = 1}^{n} f (X_{i}, θ) I_{\{X_{i} < θ\}} \prod_{j = 1}^{m} g (Y_{j}, θ) I_{\{Y_{j} < θ\}} = \{\prod_{i = 1}^{n} f (X_{i}, θ) \prod_{j = 1}^{m} g (Y_{j}, θ)\} I_{\{X_{(n)} < θ\}} I_{\{Y_{(m)} < θ\}},$ where $I_{A}$ is the indicator of event A, $X_{(n)} = \max (X_{1}, \dots, X_{n})$ and $Y_{(m)} = \max (Y_{1}, \dots, Y_{m})$. This likelihood is not always differentiable in θ, but it can be seen that the MLE of θ is $\hat{θ} = \max (X_{(n)}, Y_{(m)})$, a maximizer of the likelihood.

This is an example in which regularity conditions (R1)–(R4) in Section 2 are not satisfied, so that result (1) does not hold. The MLE $\hat{θ}$ is not even asymptotically normal. In the following we directly derive the asymptotic distribution of $\hat{θ}$.

It follows from the result in Example 2.34 of Shao (2003), the independence of the $X_{i}$'s and $Y_{j}$'s, and m = an that $n \begin{pmatrix} θ - X_{(n)} \\ θ - Y_{(m)} \end{pmatrix} \xrightarrow{d} \begin{pmatrix} \frac{ϵ_{1}}{f (θ, θ)} \\ \frac{ϵ_{2}}{a g (θ, θ)} \end{pmatrix},$ where $ϵ_{1}$ and $ϵ_{2}$ are independent random variables with the same exponential distribution having density $e^{- x}$, x > 0. Because $\min \{n (θ - X_{(n)}), n (θ - Y_{(m)})\} = n \{θ - \max (X_{(n)}, Y_{(m)})\} = n (θ - \hat{θ}),$ we obtain that $n (θ - \hat{θ}) \xrightarrow{d} \min \{\frac{ϵ_{1}}{f (θ, θ)}, \frac{ϵ_{2}}{a g (θ, θ)}\}.$ From the independence of $ϵ_{1}$ and $ϵ_{2}$, for any t > 0, $\begin{aligned} P \{\min \{\frac{ϵ_{1}}{f (θ, θ)}, \frac{ϵ_{2}}{a g (θ, θ)}\} > t\} & = P \{\frac{ϵ_{1}}{f (θ, θ)} > t\} P \{\frac{ϵ_{2}}{a g (θ, θ)} > t\} \\ & = \exp [- t \{f (θ, θ) + a g (θ, θ)\}]. \end{aligned}$ This leads to the following result.

Theorem 4.1

Under the assumed conditions on f and g in this section, $n (θ - \hat{θ}) \xrightarrow{d} E (θ, a),$ where $E (θ, a)$ is the exponential distribution with scale parameter $1 / \{f (θ, θ) + a g (θ, θ)\}$.

Inference on θ can be made using this asymptotic result.

The asymptotic relative efficiency of the MLE $X_{(n)}$ based on the first dataset with respect to the MLE $\hat{θ}$ based on two datasets is $\{1 + a g (θ, θ) / f (θ, θ)\}^{- 2}$, which is decreasing in a and bounded between 0 and 1. The asymptotic relative efficiency of the MLE $Y_{(m)}$ based on the second dataset with respect to the MLE $\hat{θ}$ based on two datasets is $\{1 + a^{- 1} f (θ, θ) / g (θ, θ)\}^{- 2}$, which is increasing in a and bounded between 0 and 1.
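To illustrate Theorem 4.1, the sketch below uses a hypothetical pair of densities that satisfies the assumptions of this section: $f (x, θ) = 1 / θ$ (uniform) and $g (y, θ) = 2 y / θ^{2}$ on $(0, θ)$, so that $f (θ, θ) + a g (θ, θ) = (1 + 2 a) / θ$. These densities, the function names, and the simulation settings are assumptions for illustration only. The code computes $\hat{θ}$, a one-sided upper confidence limit based on the exponential limit with a plug-in rate, and a Monte Carlo check of the limiting mean.

```python
import math
import random

def truncation_mle(x, y):
    """MLE of the common endpoint theta: the largest observation overall."""
    return max(max(x), max(y))

def limit_rate(theta, a):
    """f(theta, theta) + a * g(theta, theta) for the illustrative pair
    f = 1/theta (uniform) and g = 2y/theta^2 on (0, theta)."""
    return 1.0 / theta + a * 2.0 / theta

def upper_confidence_limit(x, y, level=0.95):
    """Upper limit from P{n(theta - theta_hat) <= q} ~ 1 - exp(-rate * q),
    with the unknown rate replaced by its plug-in estimate."""
    n, a = len(x), len(y) / len(x)
    theta_hat = truncation_mle(x, y)
    q = -math.log(1.0 - level) / limit_rate(theta_hat, a)
    return theta_hat + q / n

def mc_mean_of_scaled_error(theta=1.0, n=200, a=1.0, reps=1000, seed=1):
    """Monte Carlo mean of n(theta - theta_hat); the limit is 1 / rate."""
    rng = random.Random(seed)
    m = int(round(a * n))
    total = 0.0
    for _ in range(reps):
        x = [theta * rng.random() for _ in range(n)]             # uniform(0, theta)
        y = [theta * math.sqrt(rng.random()) for _ in range(m)]  # density 2y/theta^2
        total += n * (theta - truncation_mle(x, y))
    return total / reps

mc_mean = mc_mean_of_scaled_error()
limit_mean = 1.0 / limit_rate(1.0, 1.0)     # equals 1/3 when theta = 1 and a = 1
rng = random.Random(2)
x = [rng.random() for _ in range(50)]
y = [math.sqrt(rng.random()) for _ in range(50)]
print(mc_mean, limit_mean, truncation_mle(x, y), upper_confidence_limit(x, y))
```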

5. Application to Poisson and binomial samples

Here we consider a discrete data problem, where $X_{i}$ has the Poisson distribution with mean θ, $Y_{j}$ is binary with $P (Y_{j} = 1) = θ$, and $θ \in (0, 1)$ is the shared parameter. Let $\bar{X}$ be the sample mean of the $X_{i}$'s and $\bar{Y}$ be the sample mean of the $Y_{j}$'s. The score function based on the two samples is $s (θ) = n \{\frac{\bar{X}}{θ} - 1 + \frac{a \bar{Y}}{θ} - \frac{a (1 - \bar{Y})}{1 - θ}\},$ where $a = m / n$. Setting $s (θ) = 0$ and multiplying by $θ (1 - θ) / n$, we obtain the score equation $θ^{2} - (1 + a + \bar{X}) θ + \bar{X} + a \bar{Y} = 0.$ Since the score equation is a quadratic equation, it has two distinct real solutions if and only if $(1 + a + \bar{X})^{2} - 4 (\bar{X} + a \bar{Y}) > 0.$ By the law of large numbers, as $n \to \infty$, both $\bar{X}$ and $\bar{Y}$ converge to θ almost surely and $(1 + a + \bar{X})^{2} - 4 (\bar{X} + a \bar{Y}) \to (1 + a + θ)^{2} - 4 (1 + a) θ = (1 + a - θ)^{2} > 0$ almost surely. This shows that, with probability tending to 1 as $n \to \infty$, the score equation has two real solutions, $\{1 + a + \bar{X} \pm \sqrt{(1 + a + \bar{X})^{2} - 4 (\bar{X} + a \bar{Y})}\} / 2.$ The solution with the + sign in front of the square root is at least 1, because the quadratic is non-positive at $θ = 1$, and is therefore outside the range $(0, 1)$ for θ in this problem. Hence, we conclude that the MLE of θ is $\hat{θ} = \min \{1, \frac{1 + a + \bar{X} - \sqrt{(1 + a + \bar{X})^{2} - 4 (\bar{X} + a \bar{Y})}}{2}\}.$ The minimum is taken because $0 < θ < 1$. Again, the MLE $\hat{θ}$ is a nonlinear function of the separate MLEs, $\bar{X}$ and $\bar{Y}$.

The asymptotic distribution of $\hat{θ}$ can be derived using the delta method, but because regularity conditions (R1)–(R4) are satisfied, it is a corollary of Theorem 2.1 in Section 2.

Corollary 5.1

Under the Poisson and binary assumptions for the two datasets and m = an, as $n \to \infty$, $\sqrt{n} (\hat{θ} - θ) \xrightarrow{d} N (0, \frac{θ (1 - θ)}{1 - θ + a}).$
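For readers who want the intermediate step, the variance in Corollary 5.1 can be recovered from Theorem 2.1 by adding the per-observation Fisher informations of the two models. The derivation below is a worked sketch written for this purpose; f and g here denote the Poisson and binary probability functions.

```latex
\begin{aligned}
-E\Big\{\frac{\partial^{2}}{\partial\theta^{2}} \log f(X_{i},\theta)\Big\}
  &= \frac{E(X_{i})}{\theta^{2}} = \frac{1}{\theta}, \\
-E\Big\{\frac{\partial^{2}}{\partial\theta^{2}} \log g(Y_{j},\theta)\Big\}
  &= \frac{E(Y_{j})}{\theta^{2}} + \frac{1 - E(Y_{j})}{(1-\theta)^{2}}
   = \frac{1}{\theta(1-\theta)}, \\
I(\theta) &= \frac{1}{\theta} + \frac{a}{\theta(1-\theta)}
   = \frac{1-\theta+a}{\theta(1-\theta)},
\qquad
\{I(\theta)\}^{-1} = \frac{\theta(1-\theta)}{1-\theta+a}.
\end{aligned}
```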

The asymptotic relative efficiency of the MLE $\bar{X}$ based on the first dataset with respect to the MLE $\hat{θ}$ based on two datasets is $(1 - θ) / (1 - θ + a)$ , which is decreasing in a and bounded between 0 and 1. The asymptotic relative efficiency of the MLE $\bar{Y}$ based on the second dataset with respect to the MLE $\hat{θ}$ based on two datasets is $a / (1 - θ + a)$ , which is increasing in a and bounded between 0 and 1.
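The estimator of this section and its large-sample standard deviation from Corollary 5.1 can be computed in a few lines. The sketch below is illustrative; the function names and the numerical inputs are assumptions, not values from the source.

```python
import math

def poisson_bernoulli_mle(x_bar, y_bar, a):
    """Smaller root of the quadratic score equation, capped at 1.

    x_bar: mean of the Poisson sample, y_bar: mean of the binary sample,
    a = m / n. The discriminant is non-negative whenever 0 <= y_bar <= 1,
    because the quadratic is non-positive at theta = 1; max(..., 0.0)
    only guards against rounding error.
    """
    s = 1.0 + a + x_bar
    disc = max(s * s - 4.0 * (x_bar + a * y_bar), 0.0)
    return min(1.0, (s - math.sqrt(disc)) / 2.0)

def asymptotic_sd(theta, a, n):
    """Large-sample standard deviation of the MLE from Corollary 5.1."""
    return math.sqrt(theta * (1.0 - theta) / ((1.0 - theta + a) * n))

# Hypothetical summary statistics with n = m = 100
theta_hat = poisson_bernoulli_mle(x_bar=0.3, y_bar=0.3, a=1.0)
sd_hat = asymptotic_sd(theta_hat, a=1.0, n=100)
print(theta_hat, sd_hat)
```

When the two sample means agree, as in this made-up input, the combined estimate coincides with the common value, which is a quick sanity check on the formula.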

6. MLEs with two samples and an additional uncertainty

In this section, we consider a scenario in which the first sample is obtained under a controlled study so that we know the form of probability density $f (x, θ, ϕ)$ , but the form of $g (y, θ, φ)$ for the second sample has an additional uncertainty, because the second sample may be obtained through a past study and/or public records. We assume that the additional uncertainty comes from an unknown parameter ζ taking two possible values, 0 and 1, i.e., the probability density of the second sample is $g (y, θ, φ, ζ)$ , where $ζ = 0$ or 1 and g is still a known density when θ, φ, and ζ are known.

How do we derive the MLE of $ϑ = (θ^{⊤}, ϕ^{⊤}, φ^{⊤})^{⊤}$? If ζ is known, then the MLE can be obtained using the method in Section 2. Since ζ takes only two values, if $\hat{ζ}$ is a consistent estimator of ζ, i.e., $\lim_{n \to \infty} P (\hat{ζ} = ζ) = 1, \qquad (3)$ then we obtain the MLE of ϑ as $\hat{ϑ} = \begin{cases} \hat{ϑ} (0), & \hat{ζ} = 0, \\ \hat{ϑ} (1), & \hat{ζ} = 1, \end{cases}$ where $\hat{ϑ} (0)$ and $\hat{ϑ} (1)$ are the MLEs under $ζ = 0$ and $ζ = 1$, respectively.

A suggested consistent estimator of ζ is the MLE of ζ based on the second sample, the $Y_{j}$'s. Let $\hat{θ} (ζ)$ and $\hat{φ} (ζ)$ be the MLEs of θ and φ, respectively, based on the $Y_{j}$'s, when the value of ζ is fixed. Then the MLE of ζ is $\hat{ζ} = \begin{cases} 0, & \prod_{j = 1}^{m} g (Y_{j}, \hat{θ} (0), \hat{φ} (0), 0) \geq \prod_{j = 1}^{m} g (Y_{j}, \hat{θ} (1), \hat{φ} (1), 1), \\ 1, & \prod_{j = 1}^{m} g (Y_{j}, \hat{θ} (0), \hat{φ} (0), 0) < \prod_{j = 1}^{m} g (Y_{j}, \hat{θ} (1), \hat{φ} (1), 1). \end{cases}$ The following result gives the asymptotic distribution of the MLE $\hat{ϑ}$.

Theorem 6.1

If (3) holds and regularity conditions (R1)–(R4) are satisfied when $ζ = 0$ or 1, and if $m = a n$ with a remaining fixed as n increases, then $\sqrt{n} (\hat{ϑ} - ϑ) \xrightarrow{d} N (0, \{I (ϑ, ζ)\}^{- 1}),$ where $I (ϑ, ζ)$ is the Fisher information as defined in Section 2 under the true value of ζ.

Condition (3) has to be checked for each particular problem. The following is an example.

Suppose that $f (x, θ)$ is the density of $N (0, θ^{2})$, $g (y, θ, 0)$ is the same normal density for $N (0, θ^{2})$, but $g (y, θ, 1)$ is the density of the Laplace distribution with zero mean and standard deviation $\sqrt{2} θ$ given in Section 3.1. In other words, sample one is from the main study, whereas sample two is from an external source in which the data may follow the same distribution as sample one or may deviate from it. The parameters ϕ and φ are constants (that is, absent).

In this example, when $\hat{ζ} = 0$, we can simply combine the two samples and the MLE of θ is $\sqrt{(\sum_{i = 1}^{n} X_{i}^{2} + \sum_{j = 1}^{m} Y_{j}^{2}) / (n + m)}$; on the other hand, when $\hat{ζ} = 1$, the MLE of θ is given by (2). To check (3), note that $\hat{θ} (0) = \sqrt{\frac{1}{m} \sum_{j = 1}^{m} Y_{j}^{2}}$ and $\hat{θ} (1) = \frac{1}{m} \sum_{j = 1}^{m} | Y_{j} |.$ Then $\log \{\prod_{j = 1}^{m} g (Y_{j}, \hat{θ} (0), 0)\} = - \frac{m}{2} - m \log \hat{θ} (0) - \frac{m \log (2 π)}{2}$ and $\log \{\prod_{j = 1}^{m} g (Y_{j}, \hat{θ} (1), 1)\} = - m - m \log \hat{θ} (1) - m \log 2.$ When $ζ = 0$, $\hat{θ} (0) \xrightarrow{p} θ$ and $\hat{θ} (1) \xrightarrow{p} (2 / π)^{1 / 2} θ$, where $\xrightarrow{p}$ denotes convergence in probability as $n \to \infty$. Hence $\frac{1}{m} \log \{\prod_{j = 1}^{m} g (Y_{j}, \hat{θ} (0), 0)\} - \frac{1}{m} \log \{\prod_{j = 1}^{m} g (Y_{j}, \hat{θ} (1), 1)\} \xrightarrow{p} \frac{1}{2} + \log \frac{2}{π} > 0,$ which implies that $P (\hat{ζ} = 0) \to 1$. On the other hand, when $ζ = 1$, $\hat{θ} (0) \xrightarrow{p} \sqrt{2} θ$, $\hat{θ} (1) \xrightarrow{p} θ$, and $\frac{1}{m} \log \{\prod_{j = 1}^{m} g (Y_{j}, \hat{θ} (0), 0)\} - \frac{1}{m} \log \{\prod_{j = 1}^{m} g (Y_{j}, \hat{θ} (1), 1)\} \xrightarrow{p} \frac{1}{2} - \frac{\log π}{2} < 0,$ which implies that $P (\hat{ζ} = 1) \to 1$. This shows that (3) always holds in this example.

Still in this example, the results here and in Section 3.1 indicate that $\sqrt{n} (\hat{θ} - θ) \xrightarrow{d} \begin{cases} N (0, \frac{θ^{2}}{2 a + 2}), & ζ = 0, \\ N (0, \frac{θ^{2}}{a + 2}), & ζ = 1. \end{cases}$ The result can obviously be extended to the situation where the second sample is from a population that is one of k populations with $k \geq 3$.
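The two-step procedure in this example (choose $\hat{ζ}$ from the second sample, then compute the corresponding MLE of θ) can be sketched as follows. The function names, sample sizes, and seed are illustrative assumptions; the maximized log-likelihoods are the two expressions given in this section.

```python
import math
import random

def select_zeta(y):
    """MLE of zeta from the second sample: 0 = normal model, 1 = Laplace model."""
    m = len(y)
    theta0 = math.sqrt(sum(v * v for v in y) / m)     # MLE of theta under zeta = 0
    theta1 = sum(abs(v) for v in y) / m               # MLE of theta under zeta = 1
    loglik0 = -m / 2.0 - m * math.log(theta0) - m * math.log(2.0 * math.pi) / 2.0
    loglik1 = -m - m * math.log(theta1) - m * math.log(2.0)
    return 0 if loglik0 >= loglik1 else 1

def adaptive_scale_mle(x, y):
    """Return (zeta_hat, theta_hat) for the normal sample x and the external sample y."""
    n, m = len(x), len(y)
    zeta_hat = select_zeta(y)
    if zeta_hat == 0:
        # Both samples treated as N(0, theta^2): pool them
        theta_hat = math.sqrt((sum(v * v for v in x) + sum(v * v for v in y)) / (n + m))
    else:
        # Normal plus Laplace: closed form (2)
        a = m / n
        theta_n_sq = sum(v * v for v in x) / n
        theta_e = sum(abs(v) for v in y) / m
        b = a * theta_e / (a + 1.0)
        theta_hat = 0.5 * (b + math.sqrt(b * b + 4.0 * theta_n_sq / (a + 1.0)))
    return zeta_hat, theta_hat

rng = random.Random(3)
x = [rng.gauss(0.0, 1.0) for _ in range(500)]
y_normal = [rng.gauss(0.0, 1.0) for _ in range(2000)]
y_laplace = [(1 if rng.random() < 0.5 else -1) * rng.expovariate(1.0)
             for _ in range(2000)]
result_normal = adaptive_scale_mle(x, y_normal)
result_laplace = adaptive_scale_mle(x, y_laplace)
print(result_normal, result_laplace)
```

The limiting gaps between the two average log-likelihoods are small (about 0.05 and 0.07 in absolute value), so a fairly large second sample is needed before $\hat{ζ}$ is reliable; the sample size used here is chosen with that in mind.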

7. Bootstrap variance estimation

In situations where regularity conditions (R1)–(R4) are not satisfied for f or g, the asymptotic distribution of the MLE $\hat{ϑ}$ may not be available, either because it does not exist or because it has not been established. Here, we introduce a bootstrap variance estimator that can be used for assessing the performance of $\hat{ϑ}$ or making large-sample inference. A description of the general bootstrap methodology can be found, for example, in Efron and Tibshirani (1993) and Shao (2003).

Let $\{X_{1}^{* b}, \dots, X_{n}^{* b}\}$ and $\{Y_{1}^{* b}, \dots, Y_{m}^{* b}\}$ be two independent simple random samples with replacement from $\{X_{1}, \dots, X_{n}\}$ and $\{Y_{1}, \dots, Y_{m}\}$, respectively, and let ${\hat{ϑ}}^{* b}$ be the MLE of ϑ based on the dataset $\{X_{1}^{* b}, \dots, X_{n}^{* b}, Y_{1}^{* b}, \dots, Y_{m}^{* b}\}$. If we independently repeat this for $b = 1, \dots, B$, where B is called the bootstrap replication size and is typically large, then the bootstrap variance estimator for $\hat{ϑ}$ is the sample covariance matrix of ${\hat{ϑ}}^{* b}$, $b = 1, \dots, B$.
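The resampling scheme just described can be sketched for a scalar component of $\hat{ϑ}$ as below; for a vector estimator the same loop would be run and the sample covariance matrix of the replicates taken instead. As a check, the sketch applies the bootstrap to the closed-form estimator of Section 3.1, where the asymptotic standard deviation $θ / \sqrt{n (a + 2)}$ is known. The function names, B = 300, the sample sizes, and the seeds are illustrative assumptions.

```python
import math
import random

def bootstrap_sd(estimator, x, y, B=300, seed=0):
    """Bootstrap standard deviation of a scalar two-sample estimator.

    The two samples are resampled with replacement independently of each
    other, each keeping its own size, and the estimator is recomputed on
    every pair of resamples.
    """
    rng = random.Random(seed)
    n, m = len(x), len(y)
    replicates = []
    for _ in range(B):
        x_star = rng.choices(x, k=n)
        y_star = rng.choices(y, k=m)
        replicates.append(estimator(x_star, y_star))
    mean = sum(replicates) / B
    return math.sqrt(sum((r - mean) ** 2 for r in replicates) / (B - 1))

def combined_scale_mle(x, y):
    """Closed-form MLE (2) of Section 3.1, used here only as a test case."""
    n, m = len(x), len(y)
    a = m / n
    theta_n_sq = sum(v * v for v in x) / n
    theta_e = sum(abs(v) for v in y) / m
    b = a * theta_e / (a + 1.0)
    return 0.5 * (b + math.sqrt(b * b + 4.0 * theta_n_sq / (a + 1.0)))

rng = random.Random(11)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]
y = [(1 if rng.random() < 0.5 else -1) * rng.expovariate(1.0) for _ in range(200)]
sd_boot = bootstrap_sd(combined_scale_mle, x, y)
sd_asymptotic = 1.0 / math.sqrt(200 * (1.0 + 2.0))    # theta = 1, a = 1, n = 200
print(sd_boot, sd_asymptotic)
```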

We carry out a simulation study to examine the performance of this bootstrap variance estimator in the normal-Laplace problem considered in Section 3.2. At the same time, we also check the performance of the MLE $(\hat{μ}, \hat{θ})$ based on the two datasets, the $X_{i}$'s and $Y_{j}$'s, and compare it with $(\bar{X}, {\hat{θ}}_{X})$ and $(\tilde{Y}, {\hat{θ}}_{Y})$, which are the MLEs based on the single dataset of $X_{i}$'s and the single dataset of $Y_{j}$'s, respectively, where $\bar{X}$ is the sample mean of the $X_{i}$'s, $\tilde{Y}$ is the sample median of the $Y_{j}$'s, ${\hat{θ}}_{X} = \{\sum_{i = 1}^{n} (X_{i} - \bar{X})^{2} / n\}^{1 / 2}$, and ${\hat{θ}}_{Y} = \sum_{j = 1}^{m} | Y_{j} - \tilde{Y} | / m$. The bootstrap is applied to obtain an estimate $\hat{S D}$ of the standard deviation (SD) of each point estimator.

The simulation results with 1000 replications are shown in Table 1. A summary is given as follows.

The MLEs $\hat{μ}$, $\bar{X}$, and $\tilde{Y}$ all have almost no bias as estimators of μ ($= 1$ in the simulation). In terms of SD, the MLE $\hat{μ}$ is the best among the three. The sample median based on the $Y_{j}$'s is substantially worse than the other two, although asymptotically it is as efficient as the sample mean $\bar{X}$ of the $X_{i}$'s.
The MLE $\hat{θ}$ of θ has a non-negligible bias, although its performance is acceptable with total sample size n + m = 200 and its SD is slightly smaller than the SD of ${\hat{θ}}_{X}$. The bias of $\hat{θ}$ mainly comes from the MLE ${\hat{θ}}_{Y}$ for the Laplace dataset, which has large bias and SD.
The bootstrap SD estimator $\hat{S D}$ performs very well for all estimators (see the rows under “SD by simulation” and “mean of $\hat{S D}$ by simulation” in Table 1), even when the point estimator has non-negligible bias.

Table 1. Results from 1000 simulations for the normal-Laplace problem with location $μ = 1$ and scale $θ = 1$ (n = m = 100, SD = standard deviation, $(\hat{μ}, \hat{θ}) =$ the MLE of $(μ, θ)$ based on the $X_{i}$'s and $Y_{j}$'s, $(\bar{X}, {\hat{θ}}_{X}) =$ the MLE of $(μ, θ)$ based on the $X_{i}$'s, $(\tilde{Y}, {\hat{θ}}_{Y}) =$ the MLE of $(μ, θ)$ based on the $Y_{j}$'s, and $\hat{S D}$ is by bootstrap with B = 500).


The histogram of the 1000 values of $\hat{μ}$ from the simulation is shown in Figure 1, together with a Q–Q plot. The result suggests that $\hat{μ}$ is asymptotically normal, although such a theoretical result has not been established.

Figure 1. Histogram and Q–Q plot of 1000 simulated values of $\hat{μ}$ .


Acknowledgments

The authors would like to thank two anonymous referees for helpful comments and suggestions.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

Jun Shao's research was partially supported by the National Natural Science Foundation of China [Grant Number 11831008] and the U.S. National Science Foundation [Grant Number DMS-1914411].

References

Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman and Hall/CRC.
Kim, H. J., Wang, Z., & Kim, J. K. (2021). Survey data integration for regression analysis using model calibration. arXiv 2107.06448.
Lohr, S. L., & Raghunathan, T. E. (2017). Combining survey data with other data sources. Statistical Science, 32(2), 293–312. https://doi.org/10.1214/16-STS584
Merkouris, T. (2004). Combining independent regression estimators from multiple surveys. Journal of the American Statistical Association, 99(468), 1131–1139. https://doi.org/10.1198/016214504000000601
Rao, J. N. K. (2021). On making valid inferences by integrating data from surveys and other sources. Sankhya B, 83(1), 242–272. https://doi.org/10.1007/s13571-020-00227-w
Shao, J. (2003). Mathematical statistics. 2nd ed. Springer.
Yang, S., & Kim, J. K. (2020). Statistical data integration in survey sampling: A review. Japanese Journal of Statistics and Data Science, 3(2), 625–650. https://doi.org/10.1007/s42081-020-00093-w
Zhang, Y., Ouyang, Z., & Zhao, H. (2017). A statistical framework for data integration through graphical models with application to cancer genomics. The Annals of Applied Statistics, 11(1), 161–184. https://doi.org/10.1214/16-AOAS998
Zieschang, K. D. (1990). Sample weighting methods and estimation of totals in the consumer expenditure survey. Journal of the American Statistical Association, 85(412), 986–1001. https://doi.org/10.1080/01621459.1990.10474969
