Nonignorable item nonresponse in panel data

Abstract

To estimate unknown population parameters based on panel data having nonignorable item nonresponse, we propose an innovative data grouping approach according to the number of observed components in the multivariate outcome $y$ when the joint distribution of $y$ and associated covariate $x$ is nonparametric and the nonresponse probability conditional on $y$ and $x$ has a parametric form. To deal with the identifiability issue, we utilise a nonresponse instrument $z$ , an auxiliary variable related to $y$ but not related to the nonresponse probability conditional on $y$ and $x$ . We apply a modified generalised method of moments to obtain estimators of the parameters in the nonresponse probability, and a generalised regression estimation to utilise covariate information for efficient estimation of population parameters. Consistency and asymptotic normality of the proposed estimators of the population parameters are established. Simulation and real data results are presented.

1. Introduction

Panel data are collected in many statistical applications, such as sample surveys, clinical trials, economics and social sciences. For example, cluster sampling results in panel data, which occurs in social studies and sample surveys when mutual homogeneity within clusters is evident in the population of interest. A multivariate outcome from a single sampled unit also leads to panel data.

Item nonresponse is a common phenomenon in panel data, i.e., some components of the panel, not necessarily the entire panel, may be missing. For example, in survey studies, subjects may not respond to all questions; in cluster sampling, some units within a cluster may not respond; with a multivariate outcome, some components are measured while the others are not. Estimation and statistical inference without taking nonresponse into consideration could lead to seriously biased estimators and conclusions.

Consider a k-dimensional response or outcome vector $y$ of interest that is subject to item nonresponse. Let $r$ be the response indicator vector of $y$, i.e., the jth component of $r$ is 1 (or 0) if the jth component of $y$ is observed (or not observed), $j = 1, \dots, k$. Statistical approaches dealing with missing data usually depend on the nonresponse propensity (or mechanism), i.e., the conditional distribution of $r$ given $(y, x)$, denoted by $p (r | y, x)$, where $x$ is a covariate vector associated with $y$ and is always observed. If $p (r | y, x) = p (r | y_{o}, x)$, where $y_{o}$ is the observed part of $y$, then nonresponse is ignorable (Little & Rubin, 2002; Rubin, 1976). Otherwise, nonresponse is nonignorable. While there is a rich literature for valid inference on unknown $p (y)$ (the distribution of $y$) or $p (y | x)$ (the conditional distribution of $y$ given $x$) under ignorable nonresponse (S. X. Chen et al., 2008; Little & Rubin, 2002; Robins & Ritov, 1997; Rotnitzky & Robins, 1997; Rubin, 1976), statistical inference faces serious challenges under nonignorable nonresponse when $p (r | y, x)$ depends on $y$ as well as some components of $x$.

We provide a brief review of the progress in research on general nonignorable nonresponse in $y$. Greenlees et al. (1982) proposed to handle nonignorable item nonresponse by maximum likelihood estimation, assuming both $p (r | y, x)$ and $p (y | x)$ are parametric; however, the non-identifiability issue caused by nonignorable nonresponse is not well addressed and, thus, the result is not rigorous. Besides, a fully parametric approach is sensitive to the parametric model assumptions. Since the population is not identifiable when both $p (r | y, x)$ and $p (y | x)$ are nonparametric (Robins & Ritov, 1997), efforts have been made in situations where one of $p (r | y, x)$ and $p (y | x)$ is parametric or semiparametric. Tang et al. (2003) considered the situation where $p (y | x)$ is parametric but $p (r | y, x)$ is nonparametric, and provided a rigorous treatment of the identifiability issue for the first time; but they assumed that the nonresponse propensity depends only on $y$, i.e., $p (r | y, x) = p (r | y)$, which may be impractical. This result was extended by Zhao and Shao (2015) to the more realistic situation where $p (r | y, x) = p (r | y, u)$ and $u$ is a sub-vector of $x$. While both previously cited papers assumed a parametric $p (y | x)$ but an unspecified $p (r | y, x)$, parallel results were established by Wang et al. (2014) and J. Shao and L. Wang (2016) under a univariate $y$ (k = 1) with a nonparametric $p (y | x)$ and a parametric or semiparametric $p (r | y, x)$, which are particularly useful in sample surveys where it is difficult to find a suitable parametric model for $p (y | x)$.
Other than the results under a parametric model on $p (y | x)$, there is no general result on multivariate $y$ having nonignorable item nonresponse, although Wu and Carroll (1988), Xu and Shao (2009), and Shao and Zhang (2015) obtained some results when the dependence of $r$ on $y$ is through an unobserved random effect $b$, i.e., $p (r | y, x) = p (r | b, x)$.

Under nonparametric $p (y)$ and $p (y | x)$, in this paper we propose an innovative data grouping approach to construct valid estimators of population parameters in the presence of nonignorable item nonresponse in $y$, under the following two main assumptions.

(A1)	Given $(y, x)$ , components of $r$ are conditionally independent and identically distributed.
(A2)	Given $(y, x)$, the conditional probability of observing a component of $y$ is $π_{θ} (y, u)$, where $π_{θ}$ is a parametric function of $(y, u)$ with an unknown parameter vector θ and $x = (u, z)$ with $p (y | x)$ depending on $z$.

Our main methodology is introduced in Section 2, followed by some simulation results in Section 3 and two real data examples in Section 4. Section 5 contains some technical proofs.

2. Methodology

We use the notation developed in Section 1. Our inference is based on a training sample of size n, $(y_{i}, x_{i}, r_{i})$ , $i = 1, \dots, n$ , which are independent and identically distributed with $(y, x, r)$ . Values of $x_{i}$ are always observed and components of $y_{i}$ are observed if and only if the corresponding components of $r_{i}$ are equal to 1.

2.1. Grouping

When there is no nonresponse, values in the entire set $\{(y_{i}, x_{i}), i = 1, \dots, n\}$ are exchangeable. But this does not hold in the presence of nonignorable item nonresponse in $y$. Although $(y_{i}, x_{i})$'s with the same nonresponse pattern $r_{i}$ are exchangeable, there are a total of $2^{k}$ different nonresponse patterns, where k is the dimension of $y$. Thus, although grouping according to nonresponse pattern is natural to achieve within-group homogeneity, each group may not have enough units for efficient estimation or inference.

Our main idea is to divide data into k + 1 groups with within-group homogeneity, using the following key lemma under assumption (A1).

Lemma 2.1.

Let Δ be the number of observed components in $y$ . Under (A1), $p (y, x | r) = p (y, x | Δ)$ , i.e., the conditional distribution of $(y, x)$ given $r$ is the same as the conditional distribution of $(y, x)$ given Δ.

Proof.

Let $π = π (y, x)$ be the conditional probability of observing a component of $y$ given $(y, x)$. Under (A1), $p (r | y, x) = π^{Δ} (1 - π)^{k - Δ}$ and Δ follows a binomial distribution with probability π and size k conditioned on $(y, x)$. The result follows from

$$\begin{aligned} \frac{p (y, x | r)}{p (y, x | Δ)} &= \frac{p (r | y, x) p (y, x)}{\iint p (r | y, x) p (y, x) \, d y \, d x} \Big/ \frac{p (Δ | y, x) p (y, x)}{\iint p (Δ | y, x) p (y, x) \, d y \, d x} \\ &= \frac{p (r | y, x)}{p (Δ | y, x)} \, \frac{\iint p (Δ | y, x) p (y, x) \, d y \, d x}{\iint p (r | y, x) p (y, x) \, d y \, d x} \\ &= \frac{π^{Δ} (1 - π)^{k - Δ}}{\binom{k}{Δ} π^{Δ} (1 - π)^{k - Δ}} \, \frac{\binom{k}{Δ} \iint π^{Δ} (1 - π)^{k - Δ} p (y, x) \, d y \, d x}{\iint π^{Δ} (1 - π)^{k - Δ} p (y, x) \, d y \, d x} \\ &= 1. \end{aligned}$$

According to Lemma 2.1, we can partition the whole dataset into k + 1 groups, $\{(y_{i}, x_{i}), Δ_{i} = d\}$, $d = 0, 1, \dots, k$, where $Δ_{i}$ is the number of observed components in $y_{i}$. Each group $\{(y_{i}, x_{i}), Δ_{i} = d\}$ contains exchangeable values and enough units for inference as long as k is much smaller than n.
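As a minimal illustration of this partition, the Python sketch below assigns units to the k + 1 groups by counting observed components. The data layout (an n × k 0/1 array `r` of response indicators) and the function name `group_by_observed_count` are assumptions made for this example only.

```python
import numpy as np

def group_by_observed_count(r):
    """Map each d in 0..k to the indices i of units with Delta_i = d."""
    r = np.asarray(r)
    k = r.shape[1]
    delta = r.sum(axis=1)  # Delta_i: number of observed components of y_i
    return {d: np.flatnonzero(delta == d) for d in range(k + 1)}

# Toy response indicators for n = 5 units and k = 3 components.
r = np.array([[1, 1, 1],
              [1, 0, 1],
              [0, 1, 1],
              [0, 0, 0],
              [0, 0, 1]])
groups = group_by_observed_count(r)
```

Units 1 and 2 have different nonresponse patterns but fall in the same group (d = 2), which is how the grouping avoids the $2^{k}$ sparsely populated pattern cells.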

2.2. Estimation under cluster sampling

We consider the situation where components of $y$ have the same distribution (e.g., we have panel data under cluster sampling) and estimation of a parameter in the population of $y$ is our interest. To illustrate, we focus on the estimation of μ, the mean of a component of $y$. For $d = 1, \dots, k$ and each group with $Δ_{i} = d$, the within-group sample mean of observed values is

$${\bar{y}}_{d} = \frac{1}{d n_{d}} \sum_{i : Δ_{i} = d} \sum_{j = 1}^{k} r_{i j} y_{i j}, \tag{1}$$

where $y_{i j}$ and $r_{i j}$ are the jth components of $y_{i}$ and $r_{i}$, respectively, and $n_{d}$ is the number of units with $Δ_{i} = d$. Each ${\bar{y}}_{d}$ is an estimator of $μ_{d} = E (y_{i j} | Δ_{i} = d)$. Note that ${\bar{y}}_{0}$ is not defined.

If $μ_{0} = E (y_{i j} | Δ_{i} = 0)$ is known, then the overall population mean $μ = \sum_{d = 0}^{k} p_{d} μ_{d}$, where $p_{d} = P (Δ = d)$, can be estimated by

$$\tilde{μ} = \frac{n_{0}}{n} μ_{0} + \sum_{d = 1}^{k} \frac{n_{d}}{n} {\bar{y}}_{d}. \tag{2}$$

The proof of the following result is deferred to Section 5.
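The within-group means in (1) and the combination in (2) can be sketched as follows. This is a minimal illustration, assuming `y` is an n × k array with `NaN` in missing cells, `r` is the matching 0/1 indicator array, and $μ_{0}$ is supplied as a known number; the function names are hypothetical.

```python
import numpy as np

def group_means(y, r):
    """Return (ybar, n_d): ybar[d] is the mean in (1) for d = 1..k; ybar[0] is undefined."""
    y = np.asarray(y, dtype=float)
    r = np.asarray(r)
    k = r.shape[1]
    delta = r.sum(axis=1)
    y_obs = np.where(r == 1, y, 0.0)  # r_ij * y_ij without propagating NaN
    n_d = np.array([(delta == d).sum() for d in range(k + 1)])
    ybar = np.full(k + 1, np.nan)
    for d in range(1, k + 1):
        if n_d[d] > 0:
            ybar[d] = y_obs[delta == d].sum() / (d * n_d[d])
    return ybar, n_d

def mu_tilde(y, r, mu0):
    """Estimator (2), treating mu0 = E(y_ij | Delta_i = 0) as known."""
    ybar, n_d = group_means(y, r)
    n = n_d.sum()
    observed_part = sum(n_d[d] * ybar[d] for d in range(1, len(n_d)) if n_d[d] > 0)
    return (n_d[0] * mu0 + observed_part) / n

y = np.array([[1.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [np.nan, np.nan, np.nan],
              [np.nan, np.nan, 9.0]])
r = np.array([[1, 1, 1], [1, 0, 1], [0, 0, 0], [0, 0, 1]])
ybar, n_d = group_means(y, r)
estimate = mu_tilde(y, r, mu0=10.0)
```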

Theorem 2.1.

Assume (A1) holds and that components of $y$ have the same distribution with finite second-order moment. Then, as $n \to \infty$,

$$\sqrt{n} (\tilde{μ} - μ) \to N \Big(0, \; p_{0} μ_{0}^{2} + \sum_{d = 1}^{k} p_{d} (σ_{d}^{2} + μ_{d}^{2}) - μ^{2}\Big) \quad \text{in distribution}, \tag{3}$$

where $σ_{d}^{2} = \mathrm{Var} (y_{i j} | Δ_{i} = d)$, $d = 1, \dots, k$.

However, since $μ_{0} = E (y_{i j} | Δ_{i} = 0)$ is usually unknown, $\tilde{μ}$ cannot be computed from the data, and we need to find a way to estimate $μ_{0}$. In the group with $Δ_{i} = 0$, all components of $y_{i}$ are missing. Thus some assumption is needed to relate this group to the other groups. Under assumption (A2), our idea is to solve this problem using data in the group with $Δ_{i} = k$, the group with completely observed $y_{i}$'s. From

$$p (y, x) = \frac{p (y, x | Δ = 0) P (Δ = 0)}{P (Δ = 0 | y, x)} = \frac{p (y, x | Δ = k) P (Δ = k)}{P (Δ = k | y, x)},$$

we obtain the following relationship:

$$\begin{aligned} p (y, x | Δ = 0) &= \frac{P (Δ = k)}{P (Δ = 0)} \, \frac{P (Δ = 0 | y, x)}{P (Δ = k | y, x)} \, p (y, x | Δ = k) \\ &= \frac{P (Δ = k)}{P (Δ = 0)} \, \frac{\{1 - π_{θ} (y, u)\}^{k}}{\{π_{θ} (y, u)\}^{k}} \, p (y, x | Δ = k), \end{aligned}$$

where the second equality follows from (A1)–(A2) and $π_{θ} (y, u)$, defined in (A2), is the conditional probability of observing a component of $y$ given $(y, x)$. The ratio $P (Δ = k) / P (Δ = 0)$ can be estimated by $n_{k} / n_{0}$. If we can obtain an estimator $\hat{θ}$ of θ, then characteristics in $p (y, x | Δ = 0)$ can be estimated using this relationship, $n_{k} / n_{0}$, $\hat{θ}$, and estimators of characteristics in $p (y, x | Δ = k)$ with completely observed $(y, x)$.

Thus, $μ_{0} = E (y_{i j} | Δ_{i} = 0)$ can be estimated by

$$\begin{aligned} {\hat{μ}}_{0} &= \frac{n_{k}}{n_{0}} \int \frac{\{1 - π_{\hat{θ}} (y, u)\}^{k}}{\{π_{\hat{θ}} (y, u)\}^{k}} \, y \, d {\hat{F}}_{k} (y, x) \\ &= \frac{1}{k n_{0}} \sum_{i : Δ_{i} = k} \sum_{j = 1}^{k} \frac{\{1 - π_{\hat{θ}} (y_{i}, u_{i})\}^{k}}{\{π_{\hat{θ}} (y_{i}, u_{i})\}^{k}} \, r_{i j} y_{i j}, \end{aligned}$$

where the integrand y is a component of $y$ and ${\hat{F}}_{k}$ is the empirical distribution based on the data set $\{(y_{i}, x_{i}), Δ_{i} = k\}$.

Once $μ_{0}$ is estimated by ${\hat{μ}}_{0}$, the overall population mean μ can be estimated by

$$\hat{μ} = \frac{n_{0}}{n} {\hat{μ}}_{0} + \sum_{d = 1}^{k} \frac{n_{d}}{n} {\bar{y}}_{d}. \tag{4}$$

Other population characteristics can be estimated in the same way. For example, if we want to estimate the distribution function of a component of $y$ at a point t, then we just need to replace $y_{i j}$ by the indicator of $y_{i j} \leq t$ in the previous discussion. Quantiles of that distribution can then be estimated. Estimators of correlation between two components of $y$ and between $y$ and $x$ can be similarly derived. We can also estimate parameters via estimating equations.
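To make the reweighting concrete, here is a minimal Python sketch of (4), assuming the fitted propensities $π_{\hat{θ}} (y_{i}, u_{i})$ have already been computed and are passed in as an array `pi` (only the entries for complete cases are read, so the others may be `NaN`). The function name and data layout are assumptions for illustration.

```python
import numpy as np

def mu_hat(y, r, pi):
    """Estimator (4); pi[i] is the fitted propensity for unit i (used for complete cases only)."""
    y = np.asarray(y, dtype=float)
    r = np.asarray(r)
    pi = np.asarray(pi, dtype=float)
    n, k = r.shape
    delta = r.sum(axis=1)
    y_obs = np.where(r == 1, y, 0.0)

    # n_0 * mu0_hat = (1/k) * sum over complete cases of {(1 - pi)/pi}^k * sum_j y_ij
    full = delta == k
    odds_k = ((1.0 - pi[full]) / pi[full]) ** k
    n0_mu0 = (odds_k * y_obs[full].sum(axis=1)).sum() / k
    if (delta == 0).sum() == 0:
        n0_mu0 = 0.0  # no fully missing units, so the first term of (4) vanishes

    # sum over d >= 1 of n_d * ybar_d, using n_d * ybar_d = (group total) / d from (1)
    observed_part = sum(y_obs[delta == d].sum() / d for d in range(1, k + 1))
    return (n0_mu0 + observed_part) / n

y = np.array([[2.0, 4.0], [5.0, np.nan], [np.nan, np.nan]])
r = np.array([[1, 1], [1, 0], [0, 0]])
est_half = mu_hat(y, r, pi=np.array([0.5, np.nan, np.nan]))
est_high = mu_hat(y, r, pi=np.array([0.8, np.nan, np.nan]))
```

With `pi = 0.5` the complete case stands in for the fully missing unit with weight one; with `pi = 0.8` its weight drops to $(0.2 / 0.8)^{2}$, so the fully missing group contributes much less to the estimate.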

2.3. Estimation of θ in propensity

To complete our proposed methodology, we need to construct an estimator $\hat{θ}$ of θ under (A1)–(A2). To estimate θ, we follow the approach of generalised method of moments (GMM) in Wang et al. (2014) for the univariate response, but add a novel modification by utilising the multivariate structure of $y$.

Define an L-dimensional estimating function $G (θ) = (g_{1} (y, x, r, θ), \dots, g_{L} (y, x, r, θ))'$, where $a'$ is the transpose of a, L is an integer no smaller than the dimension of θ, and the form of $g_{l}$ is specified later. These functions are chosen so that, at the true parameter value θ, $E \{G (θ)\} = 0$ and $E \{\partial G (θ) / \partial θ\}$ is of full rank. Let

$$G_{n} (θ) = \Big(\frac{1}{n} \sum_{i = 1}^{n} g_{1} (y_{i}, x_{i}, r_{i}, θ), \dots, \frac{1}{n} \sum_{i = 1}^{n} g_{L} (y_{i}, x_{i}, r_{i}, θ)\Big)' .$$

If L is the same as the dimension of θ, then we estimate θ by $\hat{θ}$ such that $G_{n} (\hat{θ}) = 0$. If L is larger than the dimension of θ, we apply the two-step GMM (Hall, 2005; Hansen, 1982) as follows:

Step 1. Obtain ${\hat{θ}}^{(1)}$ by minimising $G_{n} (θ)' G_{n} (θ)$.
Step 2. Obtain $\hat{θ}$ by minimising $G_{n} (θ)' \hat{W} G_{n} (θ)$, where $\hat{W}$ is the inverse of the $L \times L$ matrix whose $(l, m)$ element is $n^{- 1} \sum_{i = 1}^{n} g_{l} (y_{i}, x_{i}, r_{i}, {\hat{θ}}^{(1)}) g_{m} (y_{i}, x_{i}, r_{i}, {\hat{θ}}^{(1)})$.

The optimisation can be solved by using the MATLAB function fminsearch, which is applied in our simulation and data analysis in Sections 3 and 4.
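For readers without MATLAB, the two-step procedure can be sketched in Python. This is an illustrative stand-in rather than the authors' code: it assumes SciPy is available and uses `scipy.optimize.minimize` with `method="Nelder-Mead"`, the same simplex algorithm family that `fminsearch` implements. The function `two_step_gmm`, the callback `g`, and the toy moment conditions are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def two_step_gmm(g, theta0):
    """g(theta) returns an (n, L) array whose i-th row is (g_1, ..., g_L) for unit i."""
    def objective(theta, W):
        G_n = g(theta).mean(axis=0)
        return G_n @ W @ G_n

    theta0 = np.asarray(theta0, dtype=float)
    L = g(theta0).shape[1]
    opts = {"xatol": 1e-8, "fatol": 1e-12, "maxiter": 5000}

    # Step 1: identity weight matrix.
    step1 = minimize(objective, theta0, args=(np.eye(L),),
                     method="Nelder-Mead", options=opts)
    # Step 2: weight by the inverse second-moment matrix of g at the step-1 estimate.
    gi = g(step1.x)
    W_hat = np.linalg.inv(gi.T @ gi / gi.shape[0])
    step2 = minimize(objective, step1.x, args=(W_hat,),
                     method="Nelder-Mead", options=opts)
    return step2.x

# Toy over-identified problem (L = 2, one parameter): data with mean theta and variance 1.
rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, 500)

def g(theta):
    t = theta[0]
    return np.column_stack([x - t, (x - t) ** 2 - 1.0])

theta_hat = two_step_gmm(g, np.array([1.0]))
```

In the setting of this paper, `g` would return the rows of the estimating function specified below, evaluated at each unit.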

It remains to specify the form of $G (θ)$. Suppose first that $z$ is discrete and has q categories, say $z = 1, \dots, q$. A straightforward extension of the approach in Wang et al. (2014) (from univariate response to multivariate $y$) is to use

$$G (θ) = v \Big\{\frac{r_{1} \cdots r_{k}}{[π_{θ} (y, u)]^{k}} - 1\Big\}, \tag{5}$$

where $r_{j}$ is the jth component of the vector $r$ of response indicators and $v$ is a vector whose first q components are indicators of $z = 1, \dots, q$ and whose last p components are the p-dimensional covariate vector $u$ in (A2). With this choice of G, $E \{G (θ)\} = 0$ under (A1)–(A2).

However, there are two problems. First, the partially observed responses in $y$ are not used in (5), since $r_{1} \cdots r_{k} = 1$ if and only if all components of $y$ are observed. Second, a more serious issue is that L may be smaller than the dimension of θ. For example, if $u$ is continuous and

$$π_{θ} (y, u) = \{1 + \exp (α + β' y + γ' u)\}^{- 1}, \tag{6}$$

where $θ = (α, β', γ')'$, α is univariate, β is k-dimensional, and γ is p-dimensional, then the dimension of θ is p + k + 1 and L = p + q; in this case $L \geq p + k + 1$ requires that q > k. That is, we are not able to apply GMM if z does not have more than k categories.

To overcome this difficulty, we consider the following modification. First, we construct k overlapping subsets $D_{1}, \dots, D_{k}$ of the entire data set, where $D_{h}$ contains data from units whose $y_{i h}$ may be missing but all other components are observed, $h = 1, \dots, k$. With the notation $r_{j} =$ the jth component of $r$, $D_{h} = \{r_{1} = \dots = r_{h - 1} = r_{h + 1} = \dots = r_{k} = 1\}$. Table 1 provides an example of $D_{1}, D_{2}, D_{3}$, where a check mark indicates an observed datum and a question mark indicates a nonresponse.

Table 1. Example of $D_{1}, D_{2}, D_{3}$ when k = 3 and n = 30.

For each fixed h, we consider

$$G^{(h)} (θ) = v_{h} \Big\{\frac{r_{h}}{π_{θ} (y, u)} - 1\Big\}, \tag{7}$$

where L = p + q + k−1, $v_{h}$ is the L-dimensional vector whose first p + q components are the same as those of $v$ in (5), the remaining k−1 components are $r_{1} y_{1}, \dots, r_{h - 1} y_{h - 1}, r_{h + 1} y_{h + 1}, \dots, r_{k} y_{k}$, and $y_{h}$ is the hth component of $y$. To see why the function $G^{(h)} (θ)$ in (7) can be used as an estimating function, note that

$$\begin{aligned} E \{G^{(h)} (θ)\} &= E \Big\{E \Big\{v_{h} \Big[\frac{r_{h}}{π_{θ} (y, u)} - 1\Big] \Big| \, y, u, D_{h}\Big\}\Big\} \\ &= E \Big\{E (v_{h} | y, u, D_{h}) \Big[\frac{E (r_{h} | y, u, D_{h})}{π_{θ} (y, u)} - 1\Big]\Big\} \\ &= 0, \end{aligned}$$

where the second equality follows from the independence between z and $r_{h}$ conditioned on $(y, u, D_{h})$ and the last equality follows from the fact that $E (r_{h} | y, u, D_{h}) = E (r_{h} | y, u) = π_{θ} (y, u)$ under (A1)–(A2).
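The estimating function in (7) can be evaluated unit by unit on $D_{h}$ as in the sketch below. It is a minimal illustration under stated assumptions: the logistic propensity (6), z coded as 1, ..., q, `y` stored as an n × k array with `NaN` in missing cells, and a zero-based component index `h`. The function name `moments_h` is hypothetical. When $r_{h} = 0$ the ratio $r_{h} / π_{θ}$ is zero, so the unobserved $y_{h}$ is never needed.

```python
import numpy as np

def moments_h(theta, y, r, z, u, h, q):
    """Rows of G^(h)(theta) in (7) for units in D_h, under the logistic propensity (6)."""
    y = np.asarray(y, dtype=float)
    r = np.asarray(r)
    z = np.asarray(z)
    u = np.asarray(u, dtype=float).reshape(len(z), -1)
    theta = np.asarray(theta, dtype=float)
    k = r.shape[1]
    others = [j for j in range(k) if j != h]

    in_Dh = r[:, others].all(axis=1)  # D_h: every component except possibly h is observed
    yD, rD, zD, uD = y[in_Dh], r[in_Dh], z[in_Dh], u[in_Dh]

    alpha, beta, gamma = theta[0], theta[1:1 + k], theta[1 + k:]
    y_filled = np.where(rD == 1, yD, 0.0)  # placeholder where y_h is missing
    pi = 1.0 / (1.0 + np.exp(alpha + y_filled @ beta + uD @ gamma))
    ratio = np.where(rD[:, h] == 1, 1.0 / pi, 0.0)  # r_h / pi, zero when y_h is missing

    z_ind = (zD[:, None] == np.arange(1, q + 1)[None, :]).astype(float)
    v_h = np.hstack([z_ind, uD, yD[:, others]])  # q + p + (k - 1) components
    return v_h * (ratio - 1.0)[:, None]

y = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan]])
r = np.array([[1, 1], [0, 1], [1, 0]])
z = np.array([1, 2, 1])
u = np.array([0.5, -1.0, 0.0])
g_rows = moments_h(np.zeros(4), y, r, z, u, h=0, q=2)
```

Averaging `g_rows` over its rows gives the sample analogue of $E \{G^{(h)} (θ)\}$ that a GMM routine would drive towards zero.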

Note that the key difference between $G (θ)$ in (5) and $G^{(h)} (θ)$ in (7) is that the components of $y$ other than the hth component are used as ‘covariates’ and included in the vector $v_{h}$ in (7). In this way, we have more estimating functions and do not need the restriction q > k in the case of (6), because $L = p + q + k - 1 \geq p + k + 1$ is easily satisfied as long as $q \geq 2$.
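As a quick check of this dimension count, consider the setting later used in Section 3.1: k = 3, no covariate $u$ (so p = 0), q = 2 categories for z, and a logistic propensity of the form (6). Writing $L_{(5)}$ and $L_{(7)}$ for the number of estimating functions supplied by (5) and (7) (notation introduced here only for this example),

$$\dim (θ) = p + k + 1 = 4, \qquad L_{(5)} = p + q = 2 < 4, \qquad L_{(7)} = p + q + k - 1 = 4 \geq 4.$$

So (5) alone would leave θ under-identified, whereas each $G^{(h)}$ in (7) supplies exactly as many estimating functions as parameters. With q = 3, $L_{(7)} = 5 > 4$ and the two-step GMM applies.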

If we apply the GMM algorithm with $G (θ)$ in (5) replaced by $G^{(h)} (θ)$ in (7), we can obtain a GMM estimator ${\hat{θ}}^{(h)}$ for every h. Our proposed final GMM estimator of θ is then the weighted average estimator

$$\hat{θ} = \sum_{h = 1}^{k} m_{h} {\hat{θ}}^{(h)} \Big/ \sum_{h = 1}^{k} m_{h},$$

where $m_{h}$ is the number of units in $D_{h}$.
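The subset sizes $m_{h}$ and the weighted average can be computed as in the following sketch. The per-subset estimates in `theta_by_h` are made-up numbers used only to show the arithmetic, and the function names are hypothetical.

```python
import numpy as np

def subset_sizes(r):
    """m_h = number of units in D_h, i.e. with every component other than h observed."""
    r = np.asarray(r)
    k = r.shape[1]
    total = r.sum(axis=1)
    return np.array([((total - r[:, h]) == k - 1).sum() for h in range(k)])

def combine(theta_by_h, m):
    """Weighted average of the per-subset GMM estimates with weights m_h."""
    return np.average(np.asarray(theta_by_h, dtype=float), axis=0, weights=m)

r = np.array([[1, 1, 1],
              [0, 1, 1],
              [1, 0, 1],
              [0, 0, 1]])
m = subset_sizes(r)  # complete cases belong to every D_h
theta_by_h = [[1.0, 0.0], [2.0, 0.0], [4.0, 5.0]]  # illustrative values only
theta_hat = combine(theta_by_h, m)
```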

When $z$ has continuous components, we can apply the method by discretising $z$ into q categories or use moments of $z$ as components of $v$ .

Under the same regularity conditions assumed in Wang et al. (2014), consistency and asymptotic normality of the estimator $\hat{θ}$ can be established and details are omitted. For a point estimator such as $\hat{μ}$ defined in (4), its consistency and asymptotic normality can also be established but its asymptotic variance does not have a simple explicit form such as the one for $\tilde{μ}$ given in (2). The complication comes from the estimation of $μ_{0}$, the correlation between ${\hat{μ}}_{0}$ and ${\bar{y}}_{k}$ in (4), and the estimation of θ that produces $\hat{θ}$ correlated with ${\hat{μ}}_{0}$ and all ${\bar{y}}_{d}$'s in (4).

Thus we do not try to obtain an explicit form of the asymptotic variance of $\hat{μ}$ defined by (4). Instead, we recommend the bootstrap method for variance estimation or inference. Since our point estimators are all functions of averages and GMM estimators, the general bootstrap theory ensures that the bootstrap variance estimators are consistent and can be effectively applied to avoid the complicated derivation of asymptotic variances of estimators such as $\hat{μ}$ in (4), at the expense of a large amount of computation. In Section 3, the performance of bootstrap variance estimators is evaluated by simulations.
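A generic version of this bootstrap can be sketched as follows. It is a minimal illustration, assuming whole units $(y_{i}, x_{i}, r_{i})$ are resampled with replacement and the full estimation procedure is rerun on each resample; `bootstrap_se` and its arguments are hypothetical names, and the toy estimator (a plain mean) merely stands in for $\hat{μ}$.

```python
import numpy as np

def bootstrap_se(estimator, data, n_boot=100, seed=0):
    """Bootstrap standard error of estimator(*data).

    data is a tuple of arrays sharing the first (unit) axis; estimator returns a float.
    """
    rng = np.random.default_rng(seed)
    n = len(data[0])
    reps = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample whole units with replacement
        reps[b] = estimator(*(a[idx] for a in data))
    return reps.std(ddof=1)

x = np.arange(100.0)
se = bootstrap_se(lambda v: v.mean(), (x,), n_boot=200)
se_constant = bootstrap_se(lambda v: v.mean(), (np.ones(50),))
```

For $\hat{μ}$ in (4), the `estimator` callback would re-estimate θ by GMM on each resample before recomputing the point estimate, which is where the computational cost comes from.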

2.4. Estimation for multivariate outcomes

In Sections 2.2 and 2.3, we consider the situation where components of $y$ have the same distribution and a population parameter such as the mean μ of a component of $y$ can be estimated using the observed values from all components within each group under assumption (A1) to compensate for the missing components. We now consider a multivariate outcome $y$ whose components have different distributions, and we need to estimate population parameters of the jth component $y_{j}$ of $y$, $j = 1, \dots, k$. To illustrate, we focus on the estimation of the population mean $μ_{j} = E (y_{j})$ for a fixed $j = 1, \dots, k$.

To handle the nonignorable nonresponse under assumption (A1), we still group data according to the value of Δ, the number of observed components in $y$, as described in Section 2.1. However, we cannot make use of observations from different components of $y$ within each group; instead, to estimate $μ_{j}$ we can only use observed values from the fixed jth component. Assuming that $μ_{j 0} = E (y_{j} | Δ = 0)$ is known, an analog of $\tilde{μ}$ in (2) is

$${\tilde{μ}}_{j} = \frac{n_{0}}{n} μ_{j 0} + \sum_{d = 1}^{k} \frac{n_{d}}{n} {\bar{y}}_{j d}, \tag{8}$$

where ${\bar{y}}_{j d}$ is the sample mean of observed values of the jth component of $y$ within group $Δ = d$. The number of observations used for ${\bar{y}}_{j d}$, $n_{j d}$, is smaller than the number of observations $d n_{d} = \sum_{j = 1}^{k} n_{j d}$ used for ${\bar{y}}_{d}$ in (1). Hence, ${\tilde{μ}}_{j}$ in (8) may be unstable when the sample size n is not very large. To overcome this difficulty, we consider making use of the always observed covariate $x$ to improve the estimation efficiency.

If a correct parametric model between $y$ and $x$ is imposed, then covariate information can be effectively utilised through the model. Although a linear or parametric relationship between $y$ and $x$ for the whole dataset without nonresponse might be possible, it is unrealistic to expect such a relationship to still exist between $y$ and $x$ in each group with $Δ = d$. A purely nonparametric regression between $y$ and $x$ in each group may be applied, but a nonparametric method may be inefficient and suffer from the well-known curse of dimensionality.

A popular approach in sample surveys for improving efficiency without relying on any model between $y$ and $x$ is the Generalised Regression (GREG) method. The GREG was first discussed in Cassel et al. (1976) and has been studied extensively in the literature; see, for example, Sarndal et al. (2003) and J. Shao and S. Wang (2014). Since this approach is model-assisted but not model-based, i.e., a model is used to derive efficient estimators that are still asymptotically valid even if the model is incorrect, it suits our purpose of utilising covariates without modelling within each group.

For each d and j, let ${\bar{y}}_{j d}$ be the sample mean of observed values of the jth component of $y$ within group $Δ = d$, ${\bar{x}}_{j d}$ be the sample mean vector of $x$ values corresponding to the observed values used in computing ${\bar{y}}_{j d}$ within group $Δ = d$, ${\bar{x}}_{d}$ be the sample mean of $x$ values based on all units in group $Δ = d$, and

$${\hat{β}}_{j d} = \Big[\sum_{i : Δ_{i} = d} r_{i j} (x_{i} - {\bar{x}}_{j d}) (x_{i} - {\bar{x}}_{j d})'\Big]^{- 1} \sum_{i : Δ_{i} = d} (x_{i} - {\bar{x}}_{j d}) r_{i j} y_{i j}, \tag{9}$$

which is a least squares estimator based on observed data from the jth component of $y$ and $x$ within group $Δ = d$. Assuming that $μ_{j 0}$ is known, our proposed GREG estimator of the population mean $μ_{j}$ is

$${\tilde{μ}}_{j}^{G R} = \frac{n_{0}}{n} μ_{j 0} + \sum_{d = 1}^{k} \frac{n_{d}}{n} \big\{{\bar{y}}_{j d} + {\hat{β}}_{j d}' ({\bar{x}}_{d} - {\bar{x}}_{j d})\big\} . \tag{10}$$

The following theorem summarises the asymptotic behaviour of the proposed GREG estimator ${\tilde{μ}}_{j}^{G R}$ in (10), for each fixed $j = 1, \dots, k$. Note that no model assumption is imposed on the relationship between $y$ and $x$.
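The GREG estimator in (9)–(10) can be sketched as below. This is a minimal illustration, assuming `y_j` holds the jth component for all units (`NaN` where missing), `x` holds the always-observed covariates, `j` is zero-based, and $μ_{j 0}$ is supplied as known. One deliberate deviation, chosen here for numerical robustness: the slope is obtained with `np.linalg.lstsq` rather than an explicit matrix inverse, which agrees with (9) whenever the matrix in (9) is invertible.

```python
import numpy as np

def greg_mean(y_j, r, x, j, mu_j0):
    """GREG estimator (10) of mu_j, with mu_j0 = E(y_j | Delta = 0) treated as known."""
    y_j = np.asarray(y_j, dtype=float)
    r = np.asarray(r)
    x = np.asarray(x, dtype=float).reshape(len(y_j), -1)
    n, k = r.shape
    delta = r.sum(axis=1)
    total = (delta == 0).sum() * mu_j0
    for d in range(1, k + 1):
        in_d = delta == d
        if not in_d.any():
            continue
        obs = in_d & (r[:, j] == 1)  # units in group d whose jth component is observed
        if not obs.any():
            raise ValueError(f"no observed values of component {j} in group d={d}")
        xbar_d = x[in_d].mean(axis=0)
        xbar_jd = x[obs].mean(axis=0)
        ybar_jd = y_j[obs].mean()
        xc = x[obs] - xbar_jd
        # Least squares slope (9); lstsq returns the same value when xc'xc is invertible.
        beta = np.linalg.lstsq(xc.T @ xc, xc.T @ y_j[obs], rcond=None)[0]
        total += in_d.sum() * (ybar_jd + beta @ (xbar_d - xbar_jd))
    return total / n

# Toy data where y_0 = 2 + 3x exactly, so the regression adjustment is exact.
x = np.array([0.0, 1.0, 2.0, 3.0, 5.0, 7.0, 10.0])
r = np.array([[1, 1], [1, 1], [1, 1], [1, 0], [1, 0], [0, 1], [0, 0]])
y0 = np.where(r[:, 0] == 1, 2.0 + 3.0 * x, np.nan)
estimate = greg_mean(y0, r, x, j=0, mu_j0=32.0)
```

In this toy example the estimate equals the full-data mean of $2 + 3 x$, because the adjustment term ${\hat{β}}_{j d}' ({\bar{x}}_{d} - {\bar{x}}_{j d})$ accounts for the unit in group d = 1 whose first component is missing.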

Theorem 2.2.

Assume (A1) and that, for each $j = 1, \dots, k$, the second-order moments of $x$ and $x y_{j}$ are finite, where $y_{j}$ is the jth component of $y$. Assume also that, for every $d = 1, \dots, k$, $Σ_{d} = \mathrm{Var} (x | Δ = d)$, the conditional variance of $x$ given $Δ = d$, is positive definite. Then, as $n \to \infty$,

$$\sqrt{n} ({\tilde{μ}}_{j}^{G R} - μ_{j}) \to N \Big(0, \; τ_{j}^{2} + \sum_{d = 0}^{k} p_{d} μ_{j d}^{2} - μ_{j}^{2}\Big) \quad \text{in distribution}, \tag{11}$$

where $μ_{j d} = E (y_{j} | Δ = d)$,

$$τ_{j}^{2} = \frac{1}{n} E \Big\{n_{Δ}^{2} \Big(\frac{σ_{j Δ}^{2}}{n_{j Δ}} - \frac{n_{Δ} - n_{j Δ}}{n_{Δ} n_{j Δ}} β_{j Δ}' Σ_{Δ} β_{j Δ}\Big)\Big\}, \tag{12}$$

$n_{j d}$ is the number of observed $y_{i j}$'s within group $Δ = d$, $σ_{j d}^{2} = \mathrm{Var} (y_{j} | Δ = d)$, $β_{j d} = Σ_{d}^{- 1} \mathrm{Cov} (x, y_{j} | Δ = d)$, $\mathrm{Cov} (x, y_{j} | Δ = d)$ is the conditional covariance between $x$ and $y_{j}$ given $Δ = d$, $d = 1, \dots, k$, and $β_{j 0}$ and $σ_{j 0}^{2}$ are defined to be 0. In addition, result (11) holds with ${\tilde{μ}}_{j}^{G R}$ replaced by ${\tilde{μ}}_{j}$ in (8) and $β_{j Δ}$ in (12) replaced by 0.

As indicated by Theorem 2.2, the GREG estimator ${\tilde{μ}}_{j}^{G R}$ is always asymptotically more efficient than ${\tilde{μ}}_{j}$ unless $β_{j d} = 0$ for all $d = 1, \dots, k - 1$. It can also be seen that $n_{d} = n_{j d}$ when d = k, the group with all completely observed response vectors. This means that the GREG approach does not help in the group $Δ = k$.

Note that we still need to estimate $μ_{j 0}$ for each fixed j. But this can be done using the same approach we discussed in Sections 2.2 and 2.3. Also, the final estimator of $μ_{j}$ (after replacing $μ_{j 0}$ in (10) by its estimator) can be shown to be consistent and asymptotically normal under the same regularity conditions assumed in Wang et al. (2014), but its asymptotic variance does not have a simple explicit form such as the one given in Theorem 2.2. Thus we do not try to obtain an explicit form of the asymptotic variance of the GREG estimator of $μ_{j}$. Instead, we recommend the bootstrap method for variance estimation, as we discussed at the end of Section 2.3.

3. Simulation results

In this section, simulation results are presented to investigate the finite sample performance of our proposed estimators developed in Section 2. We consider some different settings. In all simulation studies, the proposed GMM estimator $\hat{θ}$ is calculated using the MATLAB function fminsearch with initial value $θ = 0$ .

3.1. Results for a single covariate $x = z$ and $y$ with identically distributed components

We first present simulation results under situations where k = 3, $y = (y_{1}, y_{2}, y_{3})$ , components $y_{j}$ 's are identically distributed, and there is only a single covariate $x = z$ satisfying (A2), i.e., there is no covariate $u$ . Our interest is to estimate the marginal population mean μ of a component of $y$ , without applying GREG.

For comparison, in addition to the proposed estimator $\hat{μ}$ in (4), we also include the naive estimate ${\hat{μ}}^{N}$, the sample mean of observed $y$-values, and ${\hat{μ}}^{F}$, the sample mean when there is no nonresponse, used as a benchmark.

In the first simulation study, z is discrete with q = 2 categories, $P (z = 1) = 0.4$ and $P (z = 2) = 0.6$. Conditional on z, k = 3 components of $y$ are independently generated from $N (20 + 10 z, 8^{2})$. Note that components of $y$ are conditionally independent, but are dependent unconditionally, and have the same distribution with unconditional mean $μ = 36$. Given the generated data, the nonrespondents are generated according to the propensity

$$π_{θ} (y, z) = [1 + \exp (α + β_{1} y_{1} + β_{2} y_{2} + β_{3} y_{3})]^{- 1}, \tag{13}$$

where $θ = (α, β_{1}, β_{2}, β_{3})$ with value $(2.5, - 0.03, - 0.03, - 0.03)$ in case I and value $(- 3, 0.02, 0.02, 0.02)$ in case II. These values of θ are chosen so that β's have different signs in the two cases and the unconditional nonresponse probability is approximately between 30% and 40%.
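The data-generating process of cases I and II can be reproduced with a short Python sketch. This is an illustrative reconstruction from the description above rather than the authors' simulation code: the function name, the seed, and the sample size in the usage lines are assumptions, and response indicators are drawn independently across components given $(y, z)$ as in (A1).

```python
import numpy as np

def simulate_case(n, theta, seed=0):
    """Generate (y, y_seen, r, z) for the q = 2 design with propensity (13)."""
    rng = np.random.default_rng(seed)
    alpha, beta = theta[0], np.asarray(theta[1:], dtype=float)
    z = rng.choice([1, 2], size=n, p=[0.4, 0.6])
    # Given z, the k = 3 components are independent N(20 + 10 z, 8^2).
    y = rng.normal(loc=(20 + 10 * z)[:, None], scale=8.0, size=(n, 3))
    pi = 1.0 / (1.0 + np.exp(alpha + y @ beta))  # probability of observing a component
    r = (rng.random((n, 3)) < pi[:, None]).astype(int)  # i.i.d. given (y, z), as in (A1)
    y_seen = np.where(r == 1, y, np.nan)  # what the analyst actually observes
    return y, y_seen, r, z

# Case I parameter values; n here is larger than in the paper to make the check stable.
y, y_seen, r, z = simulate_case(20000, (2.5, -0.03, -0.03, -0.03))
full_mean = y.mean()               # benchmark without nonresponse
naive_mean = np.nanmean(y_seen)    # mean of observed values only
nonresponse_rate = 1.0 - r.mean()
```

With negative β's, larger values of y are more likely to be observed, so `naive_mean` sits above `full_mean`, which is the direction of bias expected for the naive estimator in this case.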

The population in cases III and IV is the same except that z has q = 3 categories with $P (z = 1) = 0.3$ , $P (z = 2) = 0.3$ , and $P (z = 3) = 0.4$ , the unconditional population mean is 41, and $θ = (α, β_{1}, β_{2}, β_{3}) = (2.8, - 0.03, - 0.03, - 0.03)$ and $(- 3.3, 0.02, 0.02, 0.02)$ in cases III and IV, respectively.

Table 2 reports simulation results for n = 2000 with 1000 simulation runs. The reported quantities are the estimate, the bias in percentage, and the standard deviation (SD) for the estimators of μ and of the parameters in the propensity. For the estimation of μ, we also calculate the simulation average of the standard error (SE) and the coverage probability (CP) of the approximate 95% confidence interval, using the bootstrap variance estimator with bootstrap sample size 100. We do not compute SE and CP for estimators of α and β's as parameters in the propensity are not the main parameters of interest.

Table 2. Simulation results for a single discrete covariate $x = z$ and $y$ with identically distributed components (n = 2000 with 1000 simulations).

The results in Table 2 show that the GMM estimator $\hat{θ}$ and $\hat{μ}$ in (4) work well for all cases, in terms of estimation bias, SD, and CP. In addition, the bootstrap SE performs well. The naive estimator ${\hat{μ}}^{N}$ has a serious positive bias when β's are negative (larger y has smaller nonresponse probability) and has a negative bias when β's are positive (larger y has larger nonresponse probability). Although ${\hat{μ}}^{N}$ may have a small SD, its bias has a serious effect on inference as its related CP is far from the nominal level 95%.

We next turn to a continuous $z \sim N (0, 4^{2})$ and compare different ways to use z in the estimating equations in (7). Conditional on z, components of $y$ are independent and identically distributed as $N (30 + 1.5 z, 8^{2})$ , which gives the unconditional mean $μ = 30$ . Given the generated data, the nonrespondents are generated according to (13) with $θ = (α, β_{1}, β_{2}, β_{3}) = (1.8, - 0.03, - 0.03, - 0.03)$ . For the continuous z, we consider three ways of using z in the GMM estimation of θ. In case V, z is discretised into q = 2 categories at the median of z. In case VI, z is discretised into q = 3 categories at the 33rd and 66th percentiles of z. In case VII, we use a moment of z, i.e., the vector $v_{h}$ in (7) has $(1, z)$ as its first two components. Results for n = 2000 with 1000 simulation runs are given in Table 3, with the same quantities as in Table 2.
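The three ways of turning a continuous z into instrument components (cases V–VII) can be sketched as follows. The helper names `z_features` and `sample_moment` are hypothetical, and only the z-part of $v_{h}$ is built here; the remaining components of $v_{h}$ and the grouping behind $r_{h}$ are defined in Section 2.3 and are not reproduced. The second helper evaluates the sample average of $v \{r / π - 1\}$, which has mean zero at the true propensity because $E (r | y, z) = π$ .

```python
import numpy as np

def z_features(z, mode):
    """z-part of the instrument vector for cases V ('median'),
    VI ('tercile') and VII ('moment')."""
    z = np.asarray(z, dtype=float)
    if mode == "moment":
        return np.column_stack([np.ones_like(z), z])  # (1, z)
    if mode == "median":
        cuts = [np.median(z)]
    elif mode == "tercile":
        cuts = np.quantile(z, [0.33, 0.66])
    else:
        raise ValueError(f"unknown mode: {mode}")
    cat = np.digitize(z, cuts)         # category labels 0..q-1
    return np.eye(len(cuts) + 1)[cat]  # one indicator column per category

def sample_moment(v, r, pi):
    """Sample average of v * (r / pi - 1), the form of G in (7)."""
    return (v * (r / pi - 1.0)[:, None]).mean(axis=0)

rng = np.random.default_rng(3)
n = 200_000
z = rng.normal(0.0, 4.0, size=n)
y = rng.normal((30 + 1.5 * z)[:, None], 8.0, size=(n, 3))
pi = 1.0 / (1.0 + np.exp(1.8 - 0.03 * y.sum(axis=1)))  # true propensity
r1 = (rng.random(n) < pi).astype(float)  # indicator for one component

moments = {m: sample_moment(z_features(z, m), r1, pi)
           for m in ("median", "tercile", "moment")}
print(moments)
```

All three moment vectors are close to zero at the true θ; the cases differ in how much information about z each set of components carries, which is what drives the efficiency comparison.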

Table 3. Simulation results for a single continuous covariate $x = z$ and $y$ with identically distributed components (n = 2000 with 1000 simulations).

From the results in Table 3, we can see that cutting z into three categories results in a smaller SD than discretising z into two categories. Using $(1, z)$ for $v_{h}$ with a continuous z results in the most efficient estimator of μ among the three ways of using z in (7).

3.2. Results for $x = (u, z)$ and $y$ with identically distributed components

We now add a covariate u into the cases in Section 3.1 and consider $x = (u, z)$ with a univariate continuous u and a categorical z. We consider four cases. In cases VIII–IX, z is a discrete covariate having q = 2 categories, $P (z = 1) = 0.4$ , and $P (z = 2) = 0.6$ . Given z, $u \sim N (10 z, 10^{2})$ . Given z = 1 and u, components of $y$ are independent and identically distributed as $N (u + 5 z, 8^{2})$ ; given z = 2 and u, components of $y$ are independent and identically distributed as $N (10 + 0.5 u + 5 z, 8^{2})$ . The unconditional mean μ is 24. The propensity is $π_{θ} (y, u) = [1 + \exp (α + β_{1} y_{1} + β_{2} y_{2} + β_{3} y_{3} + γ u)]^{- 1}$ (14) where $θ = (α, β_{1}, β_{2}, β_{3}, γ) = (0.6, - 0.03, - 0.03, - 0.03, 0.04)$ in case VIII and $(1.7, - 0.03, - 0.03, - 0.03, - 0.04)$ in case IX. These values are chosen so that γ has different signs in the two cases and the unconditional nonresponse probability is approximately between 30% and 40%.

In cases X–XI, z has q = 3 categories, $P (z = 1) = 0.3$ , $P (z = 2) = 0.3$ and $P (z = 3) = 0.4$ . Given z, $u \sim N (10 z, 10^{2})$ . Given z = 1 and u, components of $y$ are independent and identically distributed as $N (u + 5 z, 3^{2})$ ; given z = 2 and u, components of $y$ are independent and identically distributed as $N (1.5 u + 5 z, 3^{2})$ ; given z = 3 and u, components of $y$ are independent and identically distributed as $N (10 + 0.5 u + 5 z, 3^{2})$ . The unconditional mean μ is 32.5. The propensity is given by (14) with $θ = (α, β_{1}, β_{2}, β_{3}, γ) = (1, - 0.03, - 0.03, - 0.03, 0.05)$ in case X and $(2.8, - 0.03, - 0.03, - 0.03, - 0.05)$ in case XI.
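As a check on the two stated unconditional means, the law of total expectation with $E (u | z) = 10 z$ gives the following worked arithmetic, where each bracket is the conditional mean of a component of $y$ given z:

$\begin{aligned} \text{cases VIII–IX:} \quad μ & = 0.4 \, (10 + 5) + 0.6 \, (10 + 0.5 \cdot 20 + 10) = 0.4 \cdot 15 + 0.6 \cdot 30 = 24, \\ \text{cases X–XI:} \quad μ & = 0.3 \, (10 + 5) + 0.3 \, (1.5 \cdot 20 + 10) + 0.4 \, (10 + 0.5 \cdot 30 + 15) = 4.5 + 12 + 16 = 32.5 . \end{aligned}$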

Results for n = 2000 with 1000 simulation runs are given in Table 4. The conclusions from Table 4 are similar to those from Tables 2 and 3.

Table 4. Simulation results for $x = (u, z)$ with a categorical z and a continuous u (n = 2000 with 1000 simulations).

3.3. Results for a multivariate outcome $y$

In this section, we present simulation results under situations where k = 3, components of $y = (y_{1}, y_{2}, y_{3})$ have different distributions, and our interest is to estimate each marginal population mean $μ_{j} = E (y_{j})$ , $j = 1, \dots, k$ . We consider the proposed GREG estimator ${\hat{μ}}_{j}^{G R}$ as well as the estimator ${\hat{μ}}_{j}$ without applying GREG, $j = 1, \dots, k$ . The naive estimator ${\hat{μ}}_{j}^{N}$ , the sample mean of observed values of $y_{j}$ , and ${\hat{μ}}_{j}^{F}$ , the sample mean of $y_{j}$ when there is no nonresponse, are also included.

We consider $x = (u, z)$ with independent u and z, where u is continuous and distributed as $N (3, 5^{2})$ . In cases XII–XIII, z is continuous and distributed as $N (2, 1)$ and, given z and u, $y_{1} \sim N (u + 3 z, 3^{2})$ , $y_{2} \sim N (u + 4 z, 3^{2})$ , $y_{3} \sim N (2 u + 5 z, 3^{2})$ and the $y_{j}$ 's are independent. The unconditional mean vector $E (y)$ is $(9, 11, 16)$ . In cases XIV–XV, z is discrete with q = 3 categories, $P (z = 1) = 0.3$ , $P (z = 2) = 0.3$ and $P (z = 3) = 0.4$ ; given z and u, $y_{1} \sim N (2 u + 2 z, 3^{2})$ , $y_{2} \sim N (2 u + 4 z, 3^{2})$ , $y_{3} \sim N (4 u + 2 z, 3^{2})$ , and the $y_{j}$ 's are independent. The unconditional mean vector $E (y)$ is $(10.2, 14.4, 16.2)$ . The propensity is given by (14) with $θ = (α, β_{1}, β_{2}, β_{3}, γ) = (0.1, - 0.02, - 0.02, - 0.02, 0.05)$ in cases XII and XIV and $(- 1.2, 0.02, 0.02, 0.02, - 0.1)$ in cases XIII and XV. These values are chosen so that γ has different signs and the unconditional nonresponse probability is approximately between 30% and 40%.
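Because each conditional mean is linear in $(u, z)$ , the stated mean vectors follow from $E (u) = 3$ and $E (z)$ alone. The short check below is an illustration; the array names are hypothetical and the coefficient matrices are read off the model description.

```python
import numpy as np

E_u = 3.0

# Cases XII-XIII: continuous z with E(z) = 2; rows hold the (u, z) coefficients.
coef_cont = np.array([[1.0, 3.0], [1.0, 4.0], [2.0, 5.0]])
mean_cont = coef_cont @ np.array([E_u, 2.0])

# Cases XIV-XV: discrete z on {1, 2, 3} with probabilities (0.3, 0.3, 0.4).
E_z_disc = np.array([1, 2, 3]) @ np.array([0.3, 0.3, 0.4])
coef_disc = np.array([[2.0, 2.0], [2.0, 4.0], [4.0, 2.0]])
mean_disc = coef_disc @ np.array([E_u, E_z_disc])

print(mean_cont, mean_disc)
```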

Results for n = 2000 with 1000 simulation runs are given in Table 5. The results show that both proposed estimators ${\hat{μ}}_{j}$ and ${\hat{μ}}_{j}^{G R}$ perform well for each component of $y$ under all cases, with coverage probabilities close to the nominal level 0.95. They are much better than the biased naive estimator ${\hat{μ}}_{j}^{N}$ . Also, the estimator ${\hat{μ}}_{j}^{G R}$ with GREG shows a respectable improvement in standard deviation over ${\hat{μ}}_{j}$ without GREG.

Table 5. Simulation results for $x = (u, z)$ and multivariate $y$ (n = 2000 with 1000 simulations); SDimp(%) $= (1 - \text{SD of } {\hat{μ}}_{j}^{G R} / \text{SD of } {\hat{μ}}_{j}) \times 100\%$ .

4. Real data examples

We apply our proposed estimators to two real data sets from the National Longitudinal Survey of Mature and Young Women (NLSW) and the National Health and Nutrition Examination Survey (NHANES). The estimation approach introduced in Section 2.2 is applied to the NLSW survey data, since the components of the outcome we choose from the dataset can be treated as coming from the same distribution. The estimation method introduced in Section 2.4, with or without GREG, is applied to the NHANES data, since the outcome we choose from the dataset is multivariate.

We present the estimated marginal means with their bootstrap standard errors (SE), as well as the estimated values of the parameters in the nonresponse propensity. Our results and conclusions are based on assumptions (A1)–(A2) which, unfortunately, cannot be checked using available data. The assumption that components of $r$ are conditionally independent and identically distributed given $(y, x)$ seems reasonable for the specific problems under investigation.

4.1. Application to NLSW data

The NLSW started in the mid-1960s because the U.S. Department of Labor was interested in studying the employment patterns of non-institutionalised civilian women in the United States. We focus on the survey of mature women cohort with ages from 30s to early 40s. A detailed description of this survey can be found at https://www.bls.gov/nls/original-cohorts/mature-and-young-women.htm.

Among many topics, we focus on the variable of women's weight in pounds (ERNYR-P) from the health topic as our example. More specifically, we consider the outcome $y = (y_{1}, y_{2}, y_{3})$ $(k = 3)$ , where the $y_{j}$ 's are the weights (in lbs) of a respondent in 1997, 1999 and 2001, respectively. The outcome values are self-reported roughly every 2 years. Since the participants are mature women, the three components of $y$ have almost the same distribution. We are interested in estimating the overall population mean μ of the weight using the proposed method in Section 2.2. We use the age of the participant when she joined the NLSW survey as the nonresponse instrument z.

In the dataset, each of the three components of $y$ has about 29% nonresponse probability, while the covariate has no nonresponse. The number of observed values in each nonresponse pattern for the outcome $y$ is shown in Table 6.
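A tabulation of nonresponse patterns of this kind can be produced from an n × k matrix of response indicators. The sketch below uses a small made-up indicator matrix, not the NLSW data; `pattern_table` and `group_sizes` are hypothetical helper names, and the second helper counts units by the number of observed components, the quantity on which the grouping approach is based.

```python
import numpy as np

def pattern_table(r):
    """Count units for each distinct response pattern (1 = observed)."""
    r = np.asarray(r, dtype=int)
    patterns, counts = np.unique(r, axis=0, return_counts=True)
    return {tuple(p.tolist()): int(c) for p, c in zip(patterns, counts)}

def group_sizes(r):
    """n_d for d = 0..k, where d is the number of observed components."""
    r = np.asarray(r, dtype=int)
    delta = r.sum(axis=1)
    return np.bincount(delta, minlength=r.shape[1] + 1)

# Made-up indicators for six units and k = 3 components.
r = [[1, 1, 1], [1, 0, 1], [0, 0, 0], [1, 1, 1], [0, 1, 0], [1, 0, 1]]
table = pattern_table(r)
sizes = group_sizes(r)
print(table)
print(sizes)
```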

Table 6. The number of observed values in each nonresponse pattern

We computed the proposed estimator $\hat{μ}$ in Sections 2.2 and 2.3. Since the covariate $x =$ ‘age of respondent when joining the survey’ is univariate and continuous, we treat $x = z$ and use the moments of z directly in the GMM algorithm. The results are given in Table 7, and the SE is computed as the square root of the bootstrap variance estimate with bootstrap size 100.

Table 7. Estimation based on NLSW data.

For comparison, we include the naive estimator ${\hat{μ}}^{N}$ , the sample mean of observed $y$ values. We can see that our proposed estimator $\hat{μ}$ has a significant difference from the naive estimate ${\hat{μ}}^{N}$ .

4.2. Application to NHANES data

The NHANES is a major program of the National Center for Health Statistics, which is a part of the Centers for Disease Control and Prevention responsible for producing vital and health statistics for the United States. The NHANES is a program designed to assess the health and nutritional status of adults and children in the non-institutionalised civilian resident population of the United States. A description of this survey can be found at https://www.cdc.gov/nchs/nhanes/about_nhanes.htm.

The NHANES program began in the early 1960s and had been conducted as a series of surveys focusing on different population groups or health topics. In 1999, the survey became a continuous program that has a changing focus on a variety of health and nutrition measurements to meet emerging needs. The survey is unique in that it combines interviews and physical examinations. The home-interview part collects answers from demographic, socioeconomic, dietary, and health-related questions. The examination component conducted in a mobile examination centre consists of medical, dental, and physiological measurements, as well as laboratory tests administered by highly trained medical personnel.

The data set we focused on is for 2013–2014 consisting of 9100 persons who completed both interview and examination. We consider a multivariate outcome $y$ with k = 3 and two demographic covariates from the dataset. The three components of $y$ are the total cholesterol (mg/dL), ‘LBXSCH’, the first reading of systolic blood pressure (mm Hg), ‘BPXSY1’, and the average sagittal abdominal diameter (cm), ‘BMDAVSAD’. The two covariates are the age in years of the household reference person, ‘DMDHRAGE’, and the total household income (reported as a range value in dollars), ‘INDHHIN2’.

Each of the three components of $y$ has about 28% missing values, while the two covariates have no missing values. The number of observed values in each nonresponse pattern for $y$ is shown in Table 8.

Table 8. The number of observed values in each nonresponse pattern.

In this example, the three components of $y$ have different distributions and we are interested in estimating the population mean for each $y_{j}$ . Therefore, we apply our proposed estimator in Section 2.4 with GREG, denoted by ${\hat{μ}}_{j}^{G R}$ , and the estimator without generalised regression, denoted by ${\hat{μ}}_{j}$ . For comparison, we also include the naive estimate ${\hat{μ}}_{j}^{N}$ , the sample mean of observed $y_{j}$ values.

Since $x$ is two dimensional, we try two scenarios, $z =$ DMDHRAGE and $u =$ INDHHIN2 in case 1, and $z =$ INDHHIN2 and $u =$ DMDHRAGE in case 2. The propensity model we used is given by (14).

The results for the two cases are given in Table 9, where SE is computed as the square root of the bootstrap variance estimate with bootstrap size 100.

Table 9. Estimation based on NHANES data.

From both cases, we can see that estimators ${\hat{μ}}_{j}$ and ${\hat{μ}}_{j}^{G R}$ are very similar but are significantly different from the naive estimator ${\hat{μ}}_{j}^{N}$ , indicating that the naive estimator is biased according to our theory and empirical results. The fact that different ways of defining z in (A2) result in very similar estimates of $μ_{j}$ 's indicates that both covariates DMDHRAGE and INDHHIN2 are suitable to be used as z in (A2), although different z's produce different estimates of parameters in propensity. In this example, covariates may not help very much in estimating the marginal population means, although they are very helpful in handling nonignorable nonresponse.

5. Technical proofs

Proof of Theorem 2.1

The asymptotic normality result (3) follows from the Central Limit Theorem. Hence, it remains to show that the asymptotic mean and variance are of the given form. Let $Δ = \{Δ_{1}, \dots, Δ_{n}\}$ . From conditioning, $\begin{aligned} E (\tilde{μ}) & = E \left\{E \left(\frac{n_{0}}{n} μ_{0} + \sum_{d = 1}^{k} \frac{n_{d}}{n} {\bar{y}}_{d} \,\Big|\, Δ\right)\right\} \\ & = E \left\{\sum_{d = 0}^{k} \frac{n_{d}}{n} μ_{d}\right\} = \sum_{d = 0}^{k} p_{d} μ_{d} = μ \end{aligned}$ so that the mean of $\tilde{μ} - μ$ is 0. To derive the asymptotic variance, we calculate $\begin{aligned} \mathrm{Var} (\tilde{μ}) & = \mathrm{Var} \left(\frac{n_{0}}{n} μ_{0} + \sum_{d = 1}^{k} \frac{n_{d}}{n} {\bar{y}}_{d}\right) \\ & = E \left\{\mathrm{Var} \left(\sum_{d = 1}^{k} \frac{n_{d}}{n} {\bar{y}}_{d} \,\Big|\, Δ\right)\right\} + \mathrm{Var} \left\{E \left(\frac{n_{0}}{n} μ_{0} + \sum_{d = 1}^{k} \frac{n_{d}}{n} {\bar{y}}_{d} \,\Big|\, Δ\right)\right\} \\ & = E \left\{\frac{1}{n^{2}} \sum_{d = 1}^{k} n_{d} σ_{d}^{2}\right\} + \mathrm{Var} \left\{\sum_{d = 0}^{k} \frac{n_{d}}{n} μ_{d}\right\} \\ & = \frac{1}{n} \sum_{d = 1}^{k} p_{d} σ_{d}^{2} + \frac{1}{n^{2}} \sum_{d = 0}^{k} μ_{d}^{2} \mathrm{Var} (n_{d}) + \frac{1}{n^{2}} \sum_{d \neq l} μ_{d} μ_{l} \mathrm{Cov} (n_{d}, n_{l}) \\ & = \frac{1}{n} \sum_{d = 1}^{k} p_{d} σ_{d}^{2} + \frac{1}{n} \sum_{d = 0}^{k} μ_{d}^{2} p_{d} (1 - p_{d}) - \frac{1}{n} \sum_{d \neq l} μ_{d} μ_{l} p_{d} p_{l} \end{aligned}$ where the last equality follows from the fact that the vector $(n_{0}, n_{1}, \dots, n_{k})$ follows a multinomial distribution so that $\mathrm{Var} (n_{d}) = n p_{d} (1 - p_{d})$ and $\mathrm{Cov} (n_{d}, n_{l}) = - n p_{d} p_{l}$ for any $d \neq l$ . Then, the result follows from $\sum_{d = 0}^{k} μ_{d}^{2} p_{d}^{2} + \sum_{d \neq l} μ_{d} μ_{l} p_{d} p_{l} = {\left(\sum_{d = 0}^{k} p_{d} μ_{d}\right)}^{2} = μ^{2} .$
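The variance expression derived in this proof can be checked numerically. The sketch below is a Monte Carlo illustration under stated assumptions, not part of the proof: the values of $p_{d}$ , $μ_{d}$ and $σ_{d}$ are hypothetical, the group sizes are drawn from a multinomial distribution, and ${\bar{y}}_{d}$ is drawn as normal with mean $μ_{d}$ and variance $σ_{d}^{2} / n_{d}$ given the group sizes, which matches the conditional variance step used above.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 500, 4000

# Hypothetical group probabilities, means and SDs for d = 0, 1, 2, 3.
p = np.array([0.1, 0.2, 0.3, 0.4])
mu = np.array([30.0, 34.0, 36.0, 38.0])
sigma = np.array([0.0, 9.0, 8.0, 7.0])  # group 0 enters through mu_0 only

counts = rng.multinomial(n, p, size=reps)  # (n_0, ..., n_3) per replicate
noise = rng.standard_normal((reps, 4))
# n_d * ybar_d = n_d * mu_d + sqrt(n_d) * sigma_d * Z, so no division by n_d.
totals = counts * mu + np.sqrt(counts) * sigma * noise
mu_tilde = totals.sum(axis=1) / n

mu_pop = p @ mu
var_formula = (p[0] * mu[0] ** 2
               + np.sum(p[1:] * (sigma[1:] ** 2 + mu[1:] ** 2))
               - mu_pop ** 2)
var_mc = n * mu_tilde.var(ddof=1)  # Monte Carlo estimate of n * Var(mu_tilde)
print(round(float(var_formula), 2), round(float(var_mc), 2))
```

The two numbers should agree up to Monte Carlo error, and the average of `mu_tilde` should be close to `mu_pop`, in line with the unbiasedness step of the proof.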

Lemma 5.1.

Under the conditions of Theorem 2.2, for each $j = 1, \dots, k$ and each $d = 1, \dots, k$ , ${\hat{β}}_{j d} \to β_{j d}$ in probability as $n \to \infty$ , where ${\hat{β}}_{j d}$ is defined in (9) and $β_{j d} = Σ_{d}^{- 1} \mathrm{Cov} (x, y_{j} | Δ = d)$ .

Proof of Lemma 5.1

For fixed j and d, by (A1), the weak law of large numbers for independent random variables, and Lemma 2.1 in Section 2.1, as $n \to \infty$ , $\begin{aligned} \frac{1}{n_{j d}} \sum_{i : Δ_{i} = d} r_{i j} x_{i} y_{i j} & \to E (x y_{j} | Δ = d), \\ {\bar{x}}_{j d} \to E (x | Δ = d), \quad {\bar{y}}_{j d} & \to E (y_{j} | Δ = d) \end{aligned}$ in probability. Therefore, as $n \to \infty$ , $\begin{aligned} \frac{1}{n_{j d}} \sum_{i : Δ_{i} = d} (x_{i} - {\bar{x}}_{j d}) r_{i j} y_{i j} & \to E (x y_{j} | Δ = d) - E (x | Δ = d) E (y_{j} | Δ = d) \\ & = \mathrm{Cov} (x, y_{j} | Δ = d) \end{aligned}$ in probability. Similarly, it can be shown that $\frac{1}{n_{j d}} \sum_{i : Δ_{i} = d} r_{i j} (x_{i} - {\bar{x}}_{j d}) (x_{i} - {\bar{x}}_{j d})^{'} \to Σ_{d}$ in probability. The proof is completed by combining the results and using the definitions of ${\hat{β}}_{j d}$ and $β_{j d}$ .
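The consistency statement can be illustrated by computing the slope in (9) on simulated data for a single group. This is a sketch under stated assumptions: `beta_hat` is a hypothetical helper that centres x at the respondent mean ${\bar{x}}_{j d}$ , and in the toy data the response indicator is drawn independently of $(x, y)$ within the group, with unobserved y values masked as NaN to confirm that only observed values are used.

```python
import numpy as np

def beta_hat(x, y, r):
    """Slope from (9) for one group: x is (m, p), y is (m,), r is 0/1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.asarray(r, dtype=bool)
    xc = x[r] - x[r].mean(axis=0)  # x_i - xbar_{jd}, respondents only
    return np.linalg.solve(xc.T @ xc, xc.T @ y[r])

rng = np.random.default_rng(5)
m = 5000
x = rng.normal(size=(m, 2))
true_beta = np.array([2.0, -1.0])          # hypothetical true slope
y = 1.0 + x @ true_beta + rng.normal(size=m)
r = rng.random(m) < 0.7                    # toy response indicator
y_obs = np.where(r, y, np.nan)             # nonrespondents are never used

b = beta_hat(x, y_obs, r)
print(b)
```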

Proof of Theorem 2.2

From (10), $\begin{aligned} {\tilde{μ}}_{j}^{G R} - μ_{j} & = \frac{n_{0}}{n} μ_{j 0} + \sum_{d = 1}^{k} \frac{n_{d}}{n} [{\bar{y}}_{j d} + {\hat{β}}_{j d}^{'} ({\bar{x}}_{d} - {\bar{x}}_{j d})] - \sum_{d = 0}^{k} p_{d} μ_{j d} \\ & = \sum_{d = 0}^{k} \left(\frac{n_{d}}{n} - p_{d}\right) μ_{j d} + \sum_{d = 1}^{k} \frac{n_{d}}{n} [({\bar{y}}_{j d} - μ_{j d}) + {\hat{β}}_{j d}^{'} ({\bar{x}}_{d} - {\bar{x}}_{j d})] \\ & = U_{j} + V_{j} + W_{j}, \end{aligned}$ where $\begin{aligned} U_{j} & = \sum_{d = 1}^{k} \frac{n_{d}}{n} [({\bar{y}}_{j d} - μ_{j d}) + β_{j d}^{'} ({\bar{x}}_{d} - {\bar{x}}_{j d})], \\ V_{j} & = \sum_{d = 0}^{k} \left(\frac{n_{d}}{n} - p_{d}\right) μ_{j d}, \\ W_{j} & = \sum_{d = 1}^{k} \frac{n_{d}}{n} [({\hat{β}}_{j d} - β_{j d})^{'} ({\bar{x}}_{d} - {\bar{x}}_{j d})] . \end{aligned}$ By Lemma 5.1, $W_{j}$ is asymptotically negligible compared with $U_{j}$ and $V_{j}$ . Hence, to prove (11), it suffices to show that $\sqrt{n} (U_{j} + V_{j})$ converges in distribution to the limiting normal distribution in (11).

Consider $V_{j}$ first. Note that $V_{j} = \sum_{d = 0}^{k} \frac{n_{d}}{n} μ_{j d} - μ_{j} = \frac{1}{n} \sum_{d = 0}^{k} \sum_{i : Δ_{i} = d} μ_{j d} - μ_{j} = \frac{1}{n} \sum_{i = 1}^{n} (μ_{j Δ_{i}} - μ_{j}) .$ Then $E (V_{j}) = 0$ and $\mathrm{Var} (V_{j}) = \frac{1}{n} \left(\sum_{d = 0}^{k} p_{d} μ_{j d}^{2} - μ_{j}^{2}\right) .$ From the Central Limit Theorem, $\frac{V_{j}}{\sqrt{\mathrm{Var} (V_{j})}} \to N (0, 1)$ in distribution.

We now turn to $U_{j}$ . Let $Δ = \{Δ_{1}, \dots, Δ_{n}\}$ . Conditioned on $Δ$ , $\begin{aligned} E (U_{j} | Δ) & = E \left\{\sum_{d = 1}^{k} \frac{n_{d}}{n} [({\bar{y}}_{j d} - μ_{j d}) + β_{j d}^{'} ({\bar{x}}_{d} - {\bar{x}}_{j d})] \,\Big|\, Δ\right\} \\ & = \sum_{d = 1}^{k} \frac{n_{d}}{n} E ({\bar{y}}_{j d} - μ_{j d} | Δ) + \sum_{d = 1}^{k} \frac{n_{d}}{n} β_{j d}^{'} E ({\bar{x}}_{d} - {\bar{x}}_{j d} | Δ) \\ & = 0, \end{aligned}$ where the last equality follows from $E ({\bar{y}}_{j d} - μ_{j d} | Δ) = 0$ and $E ({\bar{x}}_{d} - {\bar{x}}_{j d} | Δ) = 0$ , since, given $Δ$ , the $x$ and $y_{j}$ values in group $Δ = d$ are exchangeable. It follows from the Central Limit Theorem that, conditioned on $Δ$ , $\frac{U_{j}}{\sqrt{\mathrm{Var} (U_{j} | Δ)}} \to N (0, 1)$ in distribution. Then, unconditionally, $\frac{U_{j}}{\sqrt{E \{\mathrm{Var} (U_{j} | Δ)\}}} \to N (0, 1)$ in distribution. To complete the proof, it remains to show two items. One is that $n E \{\mathrm{Var} (U_{j} | Δ)\}$ equals $τ_{j}^{2}$ given in (12); the other is that $\mathrm{Cov} (U_{j}, V_{j}) = 0$ . The latter follows from $\begin{aligned} \mathrm{Cov} (U_{j}, V_{j}) & = \mathrm{Cov} \{E (U_{j} | Δ), E (V_{j} | Δ)\} + E \{\mathrm{Cov} (U_{j}, V_{j} | Δ)\} \\ & = \mathrm{Cov} \{0, E (V_{j} | Δ)\} + 0 \\ & = 0, \end{aligned}$ where the second equality follows from the fact that $V_{j}$ is a function of $Δ$ so that $\mathrm{Cov} (U_{j}, V_{j} | Δ) = 0$ almost surely.

To calculate $E \{\mathrm{Var} (U_{j} | Δ)\}$ , note that, for any fixed d, ${\bar{y}}_{j d} + β_{j d}^{'} ({\bar{x}}_{d} - {\bar{x}}_{j d}) = {\bar{y}}_{j d} + \frac{n_{d} - n_{j d}}{n_{d}} β_{j d}^{'} ({\tilde{x}}_{j d} - {\bar{x}}_{j d})$ where ${\bar{x}}_{j d} = \frac{1}{n_{j d}} \sum_{i : Δ_{i} = d} r_{i j} x_{i}$ and ${\tilde{x}}_{j d} = \frac{1}{n_{d} - n_{j d}} \sum_{i : Δ_{i} = d} (1 - r_{i j}) x_{i} .$ Since the observations in ${\bar{x}}_{j d}$ and ${\tilde{x}}_{j d}$ do not overlap, conditioned on $Δ$ , $\begin{aligned} & \mathrm{Var} ({\bar{y}}_{j d} + β_{j d}^{'} ({\bar{x}}_{d} - {\bar{x}}_{j d}) | Δ) \\ & \quad = \mathrm{Var} \left\{{\bar{y}}_{j d} + \frac{n_{d} - n_{j d}}{n_{d}} β_{j d}^{'} ({\tilde{x}}_{j d} - {\bar{x}}_{j d}) \,\Big|\, Δ\right\} \\ & \quad = \mathrm{Var} ({\bar{y}}_{j d} | Δ) + \frac{(n_{d} - n_{j d})^{2}}{n_{d}^{2}} β_{j d}^{'} \mathrm{Var} ({\tilde{x}}_{j d} - {\bar{x}}_{j d} | Δ) β_{j d} + \frac{2 (n_{d} - n_{j d})}{n_{d}} β_{j d}^{'} \mathrm{Cov} ({\bar{y}}_{j d}, {\tilde{x}}_{j d} - {\bar{x}}_{j d} | Δ) \\ & \quad = \frac{σ_{j d}^{2}}{n_{j d}} + \frac{(n_{d} - n_{j d})^{2}}{n_{d}^{2}} \frac{n_{d}}{(n_{d} - n_{j d}) n_{j d}} β_{j d}^{'} Σ_{d} β_{j d} - \frac{2 (n_{d} - n_{j d})}{n_{d}} β_{j d}^{'} \mathrm{Cov} ({\bar{y}}_{j d}, {\bar{x}}_{j d} | Δ) \\ & \quad = \frac{σ_{j d}^{2}}{n_{j d}} - \frac{n_{d} - n_{j d}}{n_{d} n_{j d}} β_{j d}^{'} Σ_{d} β_{j d} . \end{aligned}$ Since the terms for different d involve disjoint sets of units, multiplying by $(n_{d} / n)^{2}$ , summing over d, and taking expectations gives the desired result and completes the proof.
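The within-group variance identity at the end of this proof can be checked by simulation. The sketch below fixes one group with hypothetical sizes $n_{d} = 200$ and $n_{j d} = 120$ , hypothetical values of β and Σ, and a linear model $y_{j} = β^{'} x + e$ , so that $β = Σ^{- 1} \mathrm{Cov} (x, y_{j})$ . The first $n_{j d}$ units are treated as the respondents, which assumes exchangeability of units within the group.

```python
import numpy as np

rng = np.random.default_rng(4)
n_d, n_jd, reps = 200, 120, 5000

# Hypothetical parameters for a single group d.
beta = np.array([1.0, -0.5])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
sigma_e = 1.5
L = np.linalg.cholesky(Sigma)

x = rng.standard_normal((reps, n_d, 2)) @ L.T  # rows of x are N(0, Sigma)
y = x @ beta + sigma_e * rng.standard_normal((reps, n_d))

xbar_d = x.mean(axis=1)             # mean of x over all units in the group
xbar_jd = x[:, :n_jd].mean(axis=1)  # mean of x over respondents
ybar_jd = y[:, :n_jd].mean(axis=1)  # mean of y_j over respondents
stat = ybar_jd + (xbar_d - xbar_jd) @ beta

bSb = beta @ Sigma @ beta
sigma2 = bSb + sigma_e ** 2         # Var(y_j) in this group
var_formula = sigma2 / n_jd - (n_d - n_jd) / (n_d * n_jd) * bSb
var_mc = stat.var(ddof=1)
print(var_formula, var_mc, sigma2 / n_jd)
```

The last printed value is the variance of the unadjusted respondent mean; the gap between it and `var_formula` is the reduction contributed by the regression adjustment in this toy setting.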

Acknowledgments

We would like to thank the two referees for comments. The authors' research was partially supported by the National Natural Science Foundation of China grant 11831008 and the U.S. National Science Foundation grant DMS-1914411.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

The authors' research was partially supported by the National Natural Science Foundation of China grant 11831008 and the U.S. National Science Foundation grant DMS-1914411.

Notes on contributors

Sijing Li

Dr. Sijing Li holds a PhD in statistics from the University of Wisconsin-Madison. She is now a statistician at Roche in Shanghai, China. Her research interest is in missing data.

Jun Shao

Dr. Jun Shao holds a PhD in statistics from the University of Wisconsin-Madison. He is a Professor of Statistics at the University of Wisconsin-Madison. His research interests include variable selection and inference with high dimensional data, sample surveys, and missing data problems.

References

Cassel, C. M., Sarndal, C. E., & Wretman, J. H. (1976). Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika, 63, 615–620. https://doi.org/https://doi.org/10.1093/biomet/63.3.615
Web of Science ®Google Scholar
Chen, S. X., Leung, D. H., & Qin, J. (2008). Improving semiparametric estimation by using surrogate data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70, 803–823. https://doi.org/https://doi.org/10.1111/rssb.2008.70.issue-4
Web of Science ®Google Scholar
Greenlees, J. S., Reece, W. S., & Zieschang, K. D. (1982). Imputation of missing values when the probability of response depends on the variable being imputed. Journal of the American Statistical Association, 77, 251–261. https://doi.org/https://doi.org/10.1080/01621459.1982.10477793
Web of Science ®Google Scholar
Hall, A. R. (2005). Generalized method of moments. Oxford University Press.
Google Scholar
Hansen, L. (1982). Large sample properties of generalized method of moments estimators. Econometrica, 50, 1029–1054. https://doi.org/https://doi.org/10.2307/1912775
Web of Science ®Google Scholar
Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data ( 2nd ed.). Wiley.
Google Scholar
Robins, J. M., & Rotiv, Y. (1997). Toward a curse of dimensionality appropriate (CODA) asymptotic theory for semiparametric models. Statistics in Medicine, 16, 285–319. https://doi.org/https://doi.org/10.1002/(ISSN)1097-0258
PubMed Web of Science ®Google Scholar
Rotnitzky, A., & Robins, J. M. (1997). Analysis of semiparametric regression models with nonignorable nonresponse. Statistics in Medicine, 16, 81–102. https://doi.org/https://doi.org/10.1002/(ISSN)1097-0258
PubMed Web of Science ®Google Scholar
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592. https://doi.org/https://doi.org/10.1093/biomet/63.3.581
Web of Science ®Google Scholar
Sarndal, C. E., Swensson, B., & Wretman, J. (2003). Model assisted survey sampling. Springer-Verlag.
Google Scholar
Shao, J., & Wang, L. (2016). Semiparametric inverse propensity weighting for nonignorable missing data. Biometrika, 103, 175–187. https://doi.org/https://doi.org/10.1093/biomet/asv071
Web of Science ®Google Scholar
Shao, J., & Wang, S. (2014). Efficiency of model-assisted regression estimators in sample surveys. Statistica Sinica, 24, 395–414.
Web of Science ®Google Scholar
Shao, J., & Zhang, J. (2015). A transformation approach in linear mixed-effect models with informative missing responses. Biometrika, 102, 107–119. https://doi.org/https://doi.org/10.1093/biomet/asu069
Web of Science ®Google Scholar
Tang, G., Little, R. J. A., & Raghunathan, T. E. (2003). Analysis of multivariate missing data with nonignorable nonresponse. Biometrika, 90, 747–764. https://doi.org/https://doi.org/10.1093/biomet/90.4.747
Web of Science ®Google Scholar
Wang, S., Shao, J., & Kim, J. K. (2014). An instrumental variable approach for identification and estimation with nonignorable nonresponse. Statistica Sinica, 24, 1097–1116.
Web of Science ®Google Scholar
Wu, M. C., & Carroll, R. J. (1988). Estimation and comparison of changes in the presence of informative right censoring by modeling the censoring process. Biometrics, 44, 175–188. https://doi.org/https://doi.org/10.2307/2531905
Web of Science ®Google Scholar
Xu, L., & Shao, J. (2009). Estimation in longitudinal or panel data models with random-effect-based missing responses. Biometrics, 65, 1175–1183. https://doi.org/https://doi.org/10.1111/j.1541-0420.2009.01195.x
Web of Science ®Google Scholar
Zhao, J., & Shao, J. (2015). Semiparametric pseudo likelihoods in generalized linear models with nonignorable missing data. Journal of the American Statistical Association, 110, 1577–1590. https://doi.org/https://doi.org/10.1080/01621459.2014.983234
Web of Science ®Google Scholar
