691
Views
1
CrossRef citations to date
0
Altmetric
Articles

Nonignorable item nonresponse in panel data

&
Pages 58-71 | Received 11 Jun 2020, Accepted 24 Nov 2020, Published online: 17 Dec 2020

Abstract

To estimate unknown population parameters based on panel data having nonignorable item nonresponse, we propose an innovative data grouping approach according to the number of observed components in the multivariate outcome y when the joint distribution of y and associated covariate x is nonparametric and the nonresponse probability conditional on y and x has a parametric form. To deal with the identifiability issue, we utilise a nonresponse instrument z, an auxiliary variable related to y but not related to the nonresponse probability conditional on y and x. We apply a modified generalised method of moments to obtain estimators of the parameters in the nonresponse probability, and a generalised regression estimation to utilise covariate information for efficient estimation of population parameters. Consistency and asymptotic normality of the proposed estimators of the population parameters are established. Simulation and real data results are presented.

1. Introduction

Panel data are collected in many statistical applications, such as sample surveys, clinical trials, economics and social sciences. For example, cluster sampling results in panel data, which occurs in social studies and sample surveys when mutual homogeneity within clusters is evident in the population of interest. Multivariate outcome from a single sampled unit also leads to panel data.

Item nonresponse is a common phenomena in panel data, i.e., some components of the panel, not necessary the entire panel, may be missing. For example, in survey studies, subjects may not respond to all questions; in cluster sampling, some units within a cluster may not respond; in multivariate outcome, some components are measured while the others are not. Estimation and statistical inference without taking nonresponse into consideration could lead to seriously biased estimators and conclusions.

Consider a k-dimensional response or outcome vector y of interest that is subject to item nonresponse. Let r be the response indicator vector of y, i.e., the jth component of r is 1 (or 0) if the jth component of y is observed (or not observed), j=1,,k. Statistical approaches dealing with missing data usually depend on the nonresponse propensity (or mechanism), i.e., the conditional distribution of r given (y,x), denoted by p(r|y,x), where x is a covariate vector associated with y and is always observed. If p(r|y,x)=p(r|yo,x), where yo is the observed part of y, then nonresponse is ignorable (Little & Rubin, Citation2002; Rubin, Citation1976). Otherwise, nonresponse is nonignorable. While there is a rich literature for valid inference on unknown p(y) (the distribution of y) or p(y|x) (the conditional distribution of y given x) under ignorable nonresponse (S. X. Chen et al., Citation2008; Little & Rubin, Citation2002; Robins & Rotiv, Citation1997; Rotnitzky & Robins, Citation1997; Rubin, Citation1976), statistical inference faces serious challenges under nonignorable nonresponse when p(r|y,x) depends on y as well as some components of x.

We provide a brief review of the progress in research on general nonignorable nonresponse in y. Greenlees et al. (Citation1982) proposed to handle nonignorable item nonresponse by maximum likelihood estimation, assuming both p(r|y,x) and p(y|x) are parametric; however, the non-identifiability issue caused by nonignorable nonresponse is not well-addressed and, thus, the result is not rigorous. Besides, a fully parametric approach is sensitive to the parametric model assumptions. Since the population is not identifiable when both p(r|y,x) and p(y|x) are nonparametric (Robins & Rotiv, Citation1997), efforts have been made in situations where one of p(r|y,x) and p(y|x) is parametric or semiparametric. Tang et al. (Citation2003) considered the situation where p(y|x) is parametric but p(r|y,x) is nonparametric, and provided a rigorous treatment of the identifiability issue for the first time; but they assumed that the nonresponse propensity depends only on y, i.e., p(r|y,x)=p(r|y), which may be impractical. This result was extended by Zhao and Shao (Citation2015) to more realistic situation where p(r|y,x)=p(r|y,u) and u is a sub-vector of x. While both previously cited papers assumed a parametric p(y|x) but a unspecified p(r|y,x), parallel results were established by Wang et al. (Citation2014) and J. Shao and L. Wang (Citation2016) under a univariate y (k = 1) with a nonparametric p(y|x) and a parametric or semi-parametric p(r|y,x), which are particularly useful in sample surveys where it is difficult to find a suitable parametric model for p(y|x). Other than the results under a parametric model on p(y|x), there is no general result on multivariate y having nonignorable item nonresponse, although Wu and Carroll (Citation1988), Xu and Shao (Citation2009), and Shao and Zhang (Citation2015) obtained some results when the dependence of r on y is through an unobserved random effect b, i.e., p(r|y,x)=p(r|b,x).

Under nonparametric p(y) and p(y|x), in this paper we propose an innovative data grouping approach to construct valid estimators of population parameters in the presence of nonignorable item nonresponse in y, assuming the following two main assumptions.

(A1)

Given (y,x), components of r are conditionally independent and identically distributed.

(A2)

Given (y,x), the conditional probability of observing a component of y is πθ(y,u), where πθ is a parametric function of (y,u) with an unknown parameter vector θ and x=(u,z) with p(y|x) depending on z.

Our main methodology is introduced in Section 2, followed by some simulation results in Section 3 and two real data examples in Section 4. Section 5 contains some technical proofs.

2. Methodology

We use the notation developed in Section 1. Our inference is based on a training sample of size n, (yi,xi,ri), i=1,,n, which are independent and identically distributed with (y,x,r). Values of xi are always observed and components of yi are observed if and only if the corresponding components of ri are equal to 1.

2.1. Grouping

When there is no nonresponse, values in the entire set {(yi,xi), i=1,,n} are exchangeable. But this does not hold in the presence of nonignorable item nonresponse in y. Although (yi,xi)'s with the same nonresponse pattern ri are exchangeable, there are a total of 2k different nonresponse patterns when k is the dimension of y. Thus although grouping according to nonresponse patter is natural to achieve within-group homogeneity, each group may not have enough units for efficient estimation or inference.

Our main idea is to divide data into k + 1 groups with within-group homogeneity, using the following key lemma under assumption (A1).

Lemma 2.1.

Let Δ be the number of observed components in y. Under (A1), p(y,x|r)=p(y,x|Δ), i.e., the conditional distribution of (y,x) given r is the same as the conditional distribution of (y,x) given Δ.

Proof.

Let π=π(y,x) be the conditional probability of observing a component of y given (y,x). Under (A1), p(r|y,x)=πΔ(1π)kΔ and Δ follows a binomial distribution with probability π and size k conditioned on (y,x). The result follows from p(y,x|r)p(y,x|Δ)=p(r|y,x)p(y,x)p(r|y,x)p(y,x)dydx/p(Δ|y,x)p(y,x)p(Δ|y,x)p(y,x)dydx=p(r|y,x)p(Δ|y,x)p(Δ|y,x)p(y,x)dydxp(r|y,x)p(y,x)dydx=πΔ(1π)kΔkΔπΔ(1π)kΔkΔπΔ(1π)kΔp(y,x)dydxπΔ(1π)kΔp(y,x)dydx=1. According to Lemma 2.1, we can partition the whole dataset into k + 1 groups, {(yi,xi),Δi=d}, d=0,1,,k, where Δi is the number of observed components in yi. Each group {(yi,xi),Δi=d} contains exchangeable values and enough units for inference as long as k is much smaller than n.

2.2. Estimation under cluster sampling

We consider the situation where components of y have the same distribution (e.g., we have panel data under cluster sampling) and estimation of a parameter in the population of y is our interest. To illustrate, we focus on the estimation of μ, the mean of a component of y. For d=1,,k, and each group with Δi=d, the within-group sample mean of observed values is (1) y¯d=1dndi:Δi=dj=1krijyij,(1) where yij and rij are the jth components of yi and ri, respectively, and nd is the number of units with Δi=d. Each y¯d is an estimator of μd=E(yij|Δi=d). Note that y¯0 is not defined.

If μ0=E(yij|Δi=0) is known, then the overall population mean μ=d=0kpdμd, where pd=P(Δ=d), can be estimated by (2) μ~=n0nμ0+d=1kndny¯d(2) The proof of the following result is deferred to Section 5.

Theorem 2.1.

Assume (A1) holds and that components of y have the same distribution with finite second-order moment Then, as n, (3) n(μ~μ)N0,p0μ02+d=1kpd(σd2+μd2)μ2 in distribution,(3) where σd2=Var(yij|Δi=d), d=1,,k.

Since μ0=E(yij|Δi=0) is usually unknown, however, μ~ is not an estimator and we need to find a way to estimate μ0. In the group with Δi=0, all components of yi are missing. Thus some assumption is needed to relate this group with other groups. Under assumption (A2), our idea is to solve this problem using data in the group with Δi=k, the group with completely observed yi's. From p(y,x)=p(y,x|Δ=0)P(Δ=0)P(Δ=0|y,x)=p(y,x|Δ=k)P(Δ=k)P(Δ=k|y,x), we obtain the following relationship: p(y,x|Δ=0)=P(Δ=k)P(Δ=0)P(Δ=0|y,x)P(Δ=k|y,x)p(y,x|Δ=k)=P(Δ=k)P(Δ=0){1πθ(y,u)}k{πθ(y,u)}kp(y,x|Δ=k) where the second equality follows from (A1)–(A2) and πθ(y,u) is defined in (A2), the conditional probability of observing a component of y given (y,x). The ratio P(Δ=k)/P(Δ=0) can be estimated by nk/n0. If we can obtain an estimator θˆ of θ, then characteristics in p(y,x|Δ=0) can be estimated using this relationship, nk/n0, θˆ, and estimators of characteristics in p(y,x|Δ=k) with completely observed (y,x).

Thus, μ0=E(yij|Δi=0) can be estimated by μˆ0=nkn0{1πθˆ(y,u)}k{πθˆ(y,u)}kydFˆk(y,x)=1kn0i:Δi=kj=1k{1πθˆ(yi,ui)}k{πθˆ(yi,ui)}krijyij where y is a component of y and Fˆk is the empirical distribution based on the data set {(yi,xi),Δi=k}.

Once μ0 is estimated by μˆ0, the overall population mean μ can be estimated by (4) μˆ=n0nμˆ0+d=1kndny¯d(4) In this way, other population characteristics can be similarly estimated. For example, if we want to estimate the distribution of a component of y at a point t, then we just need to replace yij by the indicator of yijt in the previous discussion. Quantiles of F can then be estimated. Estimators of correlation between two components of y and between y and x can be similarly derived. We can also estimate parameters via estimating equations.

2.3. Estimation of θ in propensity

To complete our proposed methodology, we need to construct an estimator θˆ of θ under (A1)–(A2). To estimate θ, we follow the approach of generalised method of moments (GMM) in Wang et al. (Citation2014) for the univariate response, but add a novel modification by utilising the multivariate structure of y.

Define an L-dimensional estimating function G(θ)=(g1(y,x,r,θ),,gL(y,x,r,θ)) where a is the transpose of a, L is an integer the dimension of θ and the form of gl is specified later. These functions are chosen so that, at the true parameter value θ, E{G(θ)}=0 and E{G(θ)/θ} is of full rank. Let Gn(θ)=1ni=1ng1(yi,xi,ri,θ),,1ni=1ngL(yi,xi,ri,θ). If L is the same as the dimension of θ, then we estimate θ by θˆ such that Gn(θˆ)=0. If L is larger than the dimension of θ, we apply the two-step GMM (Hall, Citation2005; Hansen, Citation1982) as follows:

  1. Obtain θˆ(1) by minimising {Gn(θ)}Gn(θ).

  2. Obtain θˆ by minimising {Gn(θ)}WˆGn(θ), where Wˆ is the inverse of L×L matrix whose (l,m) element is n1i=1ngl(yi,xi,ri,θˆ(1))gm(yi,xi,ri,θˆ(1)).

The optimisation can be solved by using the MATLAB function fminsearch, which is applied in our simulation and data analysis in Sections 3 and 4.

It remains to specify the form of G(θ). Suppose first that z is discrete and has q categories, say z=1,,q. A straightforward extension of the approach in Wang et al. (Citation2014) (from univariate response to multivariate y) is using (5) G(θ)=vr1rk[πθ(y,u)]k1,(5) where rj is the jth component of the vector r of response indicators and v is a vector whose first q components are indicators of z=1,,q and the last p components are the p-dimensional covariate vector u in (A2). With this choice of G, E{G(θ)}=0 under (A1)–(A2).

However, there are two problems. First, the partially observed responses in y are not used in (Equation5), since r1rk=1 if and only if all components of y are observed. Second, a more serious issue is that L may be smaller than the dimension of θ. For example, if u is continuous and (6) πθ(y,u)={1+exp(α+βy+γu)}1,(6) where θ=(α,β,γ), α is univariate, β is k-dimensional, and γ is p-dimensional, then the dimension of θ is p + k + 1 and L = p + q; in this case Lp+k+1 requires that q>k. That is, we are not able to apply GMM if z does not have more than k categories.

To overcome this difficulty, we consider the following modification. First, we construct k overlapped subsets D1,,Dk of the entire data set, where Dh contains data from units whose yih may be missing but all other components are observed, h=1,,k. With the notation rj= the jth component of r, Dh={r1==rh1=rh+1==rk=1}. Table  provides an example of D1,D2,D3, where a check mark indicates an observed datum and a question mark indicates a nonresponse.

Table 1. Example of D1,D2,D3 when k = 3 and n = 30.

For each fixed h, we consider (7) G(h)(θ)=vhrhπθ(y,u)1,(7) where L = p + q + k−1, vh is the L-dimensional vector whose first p + q components are the same as those of v in (Equation5), the rest k−1 components are r1y1,,rh1yh1, rh+1yh+1,,rkyk, and yh is the hth component of y. To see why the function G(h)(θ) in (Equation7) can be used in estimation equation, note that E{G(h)(θ)}=EEvhrhπθ(y,u)1|y,u,Dh=EE(vh|y,u,Dh)E(rh|y,u,Dh)πθ(y,u)1=0, where the second equality follows from the independence between z and rh conditioned on (y,u,Dh) and the last equality follows from the fact that E(rh|y,u,Dh) =E(rh|y,u)=πθ(y,u) under (A1)–(A2).

Note that the key difference between G(θ) in (Equation5) and G(h)(θ) in (Equation7) is that the components of y other than the hth component are used as ‘covariates’ and included in the vector vh in (Equation7). In this way, we have more estimating functions and does not need to have the restriction q>k in the case of (Equation6), because L=p+q+k1p+k+1 is easily satisfied as long as q2.

If we apply the GMM algorithm with G(θ) in (Equation5) replaced by G(h)(θ) in (Equation7), we can obtain a GMM estimator θˆ(h) for every h. Our proposed final GMM estimator of θ is then the weighted average estimator θˆ=h=1kmhθˆ(h)/h=1kmh where mh is the number of units in Dh.

When z has continuous components, we can apply the method by discretising z into q categories or use moments of z as components of v.

Under the same regularity conditions assumed in Wang et al. (Citation2014), consistency and asymptotic normality of the estimator θˆ can be established and details are omitted. For a point estimator such as μˆ defined in (Equation4), its consistency and asymptotic normality can also be established but its asymptotic variance does not have a simple explicit form such as the one for μ~ given in (Equation2). The complication comes from the estimation of μ0, the correlation between μˆ0 and y¯k in (Equation4), and the estimation of θ that produces θˆ correlated with μˆ0 and all y¯d's in (Equation4).

Thus we do not try to obtain an explicit form of the asymptotic variance of μˆ defined by (Equation4). Instead, we recommend the bootstrap method for variance estimation or inference. Since our point estimators are all functions of averages and GMM estimators, the general bootstrap theory ensures that the bootstrap variance estimators are consistent and can be effectively applied to avoid the complicated derivation of asymptotic variances of estimators such as μˆ in (Equation4), at the expense of a large amount of computations. In Section 3, the performance of bootstrap variance estimators is evaluated by simulations.

2.4. Estimation for multivariate outcomes

In Sections 2.2 and 2.3, we consider the situation where components of y have the same distribution and the population parameter such as the mean μ of a component of y can be estimated using the observed values from all components within each group under assumption (A1) to compensate the missing components. We now consider a multivariate outcome y whose components have different distributions, and we need to estimate population parameters of the jth component yj of y, j=1,,k. To illustrate, we focus on the estimation of population mean μj=E(yj) with a fixed j=1,,k.

To handle the nonignorable nonresponse under assumption (A1), we still group data according to the value of Δ, the number of observed components in y, as described in Section 2.1. However, we cannot make use of observations from different components of y within each group; instead, to estimate μj we can only use observed values from the fixed jth component. Assuming that μj0=E(yj|Δ=0) is known, an analog of μ~ in (Equation2) is (8) μ~j=n0nμj0+d=1kndny¯jd,(8) where y¯jd is the sample mean of observed values of the jth component of y within group Δ=d. The number of observations used for y¯jd, njd, is smaller than the number of observations dnd=j=1knjd used for y¯d in (Equation1). Hence, μ~j in (Equation8) may be not stable when the sample size n is not very large. To overcome this difficulty, we consider making use of the always observed covariate x to improve the estimation efficiency.

If a correct parametric model between y and x is imposed, then covariate information can be effectively utilised through the model. Although a linear or parametric relationship between y and x for the whole dataset without nonresponse might be possible, it is unrealistic to expect such relationship still exists between y and x in each group with Δ=d. A purely nonparametric regression between y and x in each group may be applied, but a nonparametric method may be inefficient and suffers from the well-known curse of dimensionality.

A popular approach in sample surveys for improving efficiency without relying on any model between y and x is the Generalised Regression (GREG) method. The GREG is first discussed in Cassel et al. (Citation1976) and studied extensively in the literature; for example, Sarndal et al. (Citation2003) and J. Shao and S. Wang (Citation2014). Since this approach is model-assisted but not model-based, i.e., a model is used to derive efficient estimators that are still asymptotically valid even if the model is incorrect, it suits our purpose of utilising covariates without modelling within each group.

For each d and j, let y¯jd be the sample mean of observed values of the jth component of y within group Δ=d, x¯jd be the sample mean vector of x values corresponding to the observed values used in computing y¯jd within group Δ=d, x¯d be the sample mean of x values based on all units in group Δ=d, and (9) βˆjd=i:Δi=drij(xix¯jd)(xix¯jd)1i:Δi=d(xix¯jd)rijyij,(9) which is a least squares estimator based on observed data from jth component of y and x within group Δ=d. Assuming that μj0 is known, our proposed GREG estimator of population mean μj is (10) μ~jGR=n0nμj0+d=1kndn{y¯jd+βˆjd(x¯dx¯jd)}.(10) The following theorem summarises the asymptotic behaviour of the proposed GREG estimator μ~jGR in (Equation10), for each fixed j=1,,k. Note that no model assumption is imposed on the relationship between y and x.

Theorem 2.2.

Assume (A1) and that, for each j=1,,k, the second-order moments of x and xyj are finite, where yj is the jth component of y. Assume also that, for every d=1,,k, Σd=Var(x|Δ=d), the conditional variance of x given Δ=d, is positive definite. Then, as n, (11) n(μ~jGRμj)N0,τj2+d=0kpdμjd2μj2 in distribution,(11) where μjd=E(yj|Δ=d), (12) τj2=1nEnΔ2σjΔ2njΔnΔnjΔnΔnjΔβjΔΣΔβjΔ(12) njd is the number of observed yij's within group Δ=d, σjd2=Var(yj|Δ=d), βjd=Σd1Cov(x,yj|Δ=d), Cov(x,yj|Δ=d) is the conditional covariance between x and yj given Δ=d, d=1,,k, and βj0 and σj02 are defined to be 0. In addition, result (Equation11) holds with μ~jGR replaced by μ~j in (Equation8) and βjΔ in (Equation12) replaced by 0.

As indicated by Theorem 2.2, the GREG estimator μ~jGR is always asymptotically more efficient than μ~j unless βdj=0 for all d=1,,k1. It can also be seen that nd=ndj when d = k, the group with all completely observed response vectors. This means that the GREG approach does not help in the group Δ=k.

Note that we still need to estimate μj0 for each fixed j. But this can be done using the same approach we discussed in Sections 2.2 and 2.3. Also, the final estimator of μj (after replacing μj0 in (Equation10) by its estimator) can be shown to be consistency and asymptotically normal under the same regularity conditions assumed in Wang et al. (Citation2014), but its asymptotic variance does not have a simple explicit form such as the one given in Theorem 2.2. Thus we do not try to obtain an explicit form of the asymptotic variance of the GREG estimator of μj. Instead, we recommend the bootstrap method for variance estimation, as we discussed in the end of Section 2.3.

3. Simulation results

In this section, simulation results are presented to investigate the finite sample performance of our proposed estimators developed in Section 2. We consider some different settings. In all simulation studies, the proposed GMM estimator θˆ is calculated using the MATLAB function fminsearch with initial value θ=0.

3.1. Results for a single covariate x=z and y with identically distributed components

We first present simulation results under situations where k = 3, y=(y1,y2,y3), components yj's are identically distributed, and there is only a single covariate x=z satisfying (A2), i.e., there is no covariate u. Our interest is to estimate the marginal population mean μ of a component of y, without applying GREG.

For comparison, in addition to the proposed estimator μˆ in (Equation4), we also include the naive estimate μˆN, the sample mean of observed y-values, and μˆF, the sample mean when there is no nonresponse, used as a benchmark.

In the first simulation study, z is discrete with q = 2 categories, P(z=1)=0.4 and P(z=2)=0.6. Conditional on z, k = 3 components of y are independently generated from N(20+10z,82). Note that components of y are conditionally independent, but are dependent unconditionally, and have the same distribution with unconditional mean μ=36. Given the generated data, the nonrespondents are generated according to the propensity (13) πθ(y,z)=[1+exp(α+β1y1+β2y2+β3y3)]1(13) where θ=(α,β1,β2,β3) with value (2.5,0.03,0.03,0.03) in case I and value (3,0.02,0.02,0.02) in case II. These values of θ are chosen so that β's have different signs and the unconditional nonresponse probability is approximately between 30% and 40%.

The population in cases III and IV is the same except that z has q = 3 categories with P(z=1)=0.3, P(z=2)=0.3, and P(z=3)=0.4, the unconditional population mean is 41, and θ=(α,β1,β2,β3)=(2.8,0.03,0.03,0.03) and (3.3,0.02,0.02,0.02) in cases III and IV, respectively.

Table  reports simulation results for n = 2000 with 1000 simulation runs. The reported quantities are values of estimate, bias in percentage, and standard deviation (SD) for the estimators of μ and parameters in the propensity, based on 1000 simulations. For the estimation of μ, we also calculate the simulation average of the standard error (SE) and coverage probability (CP) of the approximate 95% confidence interval, using the bootstrap variance estimator with bootstrap sample size 100. We do not compute SE and CP for estimators of α and β's as parameters in propensity are not the main parameters of interest.

Table 2. Simulation results for a single discrete covariate x=z and y with identically distributed components (n = 2000 with 1000 simulations).

The results in Table  show that the GMM estimator θˆ and μˆ in (Equation4) work well for all cases, in terms of estimation bias, SD, and CP. In addition, the bootstrap SE performs well. The naive estimator μˆN has a serious positive bias when β's are negative (larger y has smaller nonresponse probability) and has a negative bias when β's are positive (larger y has larger nonresponse probability). Although μˆN may have a small SD, its bias have a serious effect on inference as its related CP is far from the nominal level 95%.

We next turn to a continuous zN(0,42) and compare different ways to use z in estimation equations in (Equation7). Conditional on z, components of y are independent and identically distributed as N(30+1.5z,82), which gives the unconditional mean μ=30. Given the generated data, the nonrespondents are generated according to (Equation13) with θ=(α,β1,β2,β3)=(1.8,0.03,0.03,0.03). For the continuous z, we consider three ways of using z in the GMM estimation of θ. In case V, z is discretised into q = 2 categories according to the median of z. In case VI, z is discretised into q = 3 categories according to the 33% and 66% percentiles of z. In case VII, we use a moment of z, i.e., the vector vh in (Equation7) has its first two components as (1,z). Results for n = 2000 with 1000 simulation runs are given in Table , with the same quantities in Table .

Table 3. Simulation results for a single continuous covariate x=z and y with identically distributed components (n = 2000 with 1000 simulations).

From the results in Table , we can see that cutting z into three categories results in a smaller SD compared with that for discretising z into two categories. Using (1,z) for vh with a continuous z results in the most efficient estimators of μ among the three ways of using z in (Equation7).

3.2. Results for x=(u,z) and y with identically distributed components

We now add a covariate u into the cases in Section 3.1 and consider x=(u,z) with a univariate continuous u and a categorical z. We consider four cases. In cases VIII–IX, z is a discrete covariate having q = 2 categories, P(z=1)=0.4, and P(z=2)=0.6. Given z, uN(10z,102). Given z = 1 and u, components of y are independent and identically distributed as N(u+5z,82); given z = 2 and u, components of y are independent and identically distributed as N(10+0.5u+5z,82). The unconditional mean μ is 24. The propensity is (14) πθ(y,u)=[1+exp(α+β1y1+β2y2+β3y3+γu)]1(14) where θ=(α,β1,β2,β3,γ)=(0.6,0.03,0.03,0.03,0.04) in case VIII and (1.7,0.03,0.03,0.03,0.04) in case IX. These values are chosen so that γ has different signs and the unconditional nonresponse probability is approximately between 30% and 40%.

In cases X–XI, z has q = 3 categories, P(z=1)=0.3, P(z=2)=0.3 and P(z=3)=0.4. Given z, uN(10z,102). Given z = 1 and u, components of y are independent and identically distributed as N(u+5z,32); given z = 2 and u, components of y are independent and identically distributed as N(1.5u+5z,32); given z = 3 and u, components of y are independent and identically distributed as N(10+0.5u+5z,32). The unconditional mean μ is 32.5. The propensity is given by (Equation14) with θ=(α,β1,β2,β3,γ)=(1,0.03,0.03,0.03,0.05) in case X and (2.8,0.03,0.03,0.03,0.05) in case XI.

Results for n = 2000 with 1000 simulation runs are given in Table . Conclusions for results in Table  are similar to those in Tables  and .

Table 4. Simulation results for x=(u,z) with a categorical z and a continuous u (n = 2000 with 1000 simulations).

3.3. Results for a multivariate outcome y

In this section, we present simulation results under situations where k = 3, components of y=(y1,y2,y3) have different distributions, and our interest is to estimate each marginal population mean μj=E(yj), j=1,,k. We consider the proposed GREG estimator μˆjGR as well as the estimator μˆj without applying GREG, j=1,,k. The naive estimator μˆjN, the sample mean of observed values of yj, and μˆjF, the sample mean of yj when there is no nonresponse, are also included.

We consider x=(u,z) with independent u and z, where u is continuous and distributed as N(3,52). In cases XII–XIII, z is continuous and distributed as N(2,1) and given z and u, y1N(u+3z,32), y2N(u+4z,32), y3N(2u+5z,32) and yj's are independent. The unconditional mean vector E(y) is (9,11,16). In cases XIV–XV, z is discrete with q = 3 categories, P(z=1)=0.3, P(z=2)=0.3 and P(z=3)=0.4; given z and u, y1N(2u+2z,32), y2N(2u+4z,32), y3N(4u+2z,32), and yj's are independent. The unconditional mean μ is (10.2,14.4,16.2). The propensity is given by (Equation14) with θ=(α,β1,β2,β3,γ)=(0.1,0.02,0.02,0.02,0.05) in cases XII and XIV and (1.2,0.02,0.02,0.02,0.1) in cases XIII and XV. These values are chosen so that γ has different signs and the unconditional nonresponse probability is approximately between 30% and 40%.

Results for n = 2000 with 1000 simulation runs are given in Table . The results show that both proposed estimators μˆj and μˆjGR perform well for each component of y under all cases with coverage probabilities close to the nominal level 0.95. They are much better compared with the naive biased estimator μˆjN. Also, the estimator μˆjGR with GREG has a respectable improvement in standard deviation compared with μˆj without GREG.

Table 5. Simulation results for x=(u,z) and multivariate y (n = 2000 with 1000 simulations); SDimp(%) =(1 SD of μˆjGR/ SD of μˆj)×100%.

4. Real data examples

We apply our proposed estimators to two real data sets from the National Longitudinal Survey of Mature and Young Women (NLSW) and the National Health and Nutrition Examination Survey (NHANES). The proposed estimation approach introduced in Section 2.2 is applied on the NLSW survey data since components of the outcome we choose from the dataset can be treated as from the same distribution. The proposed estimation method introduced in Section 2.4 with or without the GREG is applied on the NHANES data since the outcome we choose from the dataset is multivariate.

We present the estimated values and standard error (SE) under bootstrap method of the marginal means as well as the estimated values of the parameters in the nonresponse propensity. Our results and conclusions are based on assumptions (A1)–(A2) which, unfortunately, cannot be checked using available data. The assumption that components of r are conditionally independent and identically distributed given (y,x) seems reasonable from the specific problems under investigation.

4.1. Application to NLSW data

The NLSW started in the mid-1960s because the U.S. Department of Labor was interested in studying the employment patterns of non-institutionalised civilian women in the United States. We focus on the survey of mature women cohort with ages from 30s to early 40s. A detailed description of this survey can be found at https://www.bls.gov/nls/original-cohorts/mature-and-young-women.htm.

Among many topics, we focus on the variable of women's weight in pounds (ERNYR-P) from heath topic as our example. More specifically, we consider the outcome y=(y1,y2,y3) (k=3), where yj's are weights (in lbs) of respondent in 1997, 1999 and 2001, respectively. The outcome values are self-reported in roughly every 2 years. Since the participants are matured women, the three components of y have almost the same distribution. We are interested in estimating the overall population mean μ of the weight using the proposed method in Section 2.2. We use the age of participant when she joined the NLSW survey as the nonresponse instrument z.

In the dataset, each of three components of y has about 29% nonresponse probability while the covariate has no nonresponse. The number of observed values in each nonresponse pattern for the outcome y is shown in Table .

Table 6. The number of observed values in each nonresponse pattern

We computed the proposed estimator μˆ in Sections 2.2 and 2.3. Since the covariate x= ‘age of respondent when joining the survey’ is univariate and continuous, we treat x=z and use the moments of z directly in the GMM algorithm. The results are given in Table  and the SE is computed as the squared root of the bootstrap variance estimate with bootstrap size 100.

Table 7. Estimation based on NLSW data.

For comparison, we include the naive estimator μˆN, the sample mean of observed y values. We can see that our proposed estimator μˆ has a significant difference from the naive estimate μˆN.

4.2. Application to NHANES data

The NHANES is a major program of the National Center for Health Statistics, which is a part of the Centers for Disease Control and Prevention responsible for producing vital and health statistics for the United States. The NHANES is a program designed to assess the health and nutritional status of adults and children in the non-institutionalised civilian resident population of the United States. A description of this survey can be found at https://www.cdc.gov/nchs/nhanes/about_nhanes.htm.

The NHANES program began in the early 1960s and had been conducted as a series of surveys focusing on different population groups or health topics. In 1999, the survey became a continuous program that has a changing focus on a variety of health and nutrition measurements to meet emerging needs. The survey is unique in that it combines interviews and physical examinations. The home-interview part collects answers from demographic, socioeconomic, dietary, and health-related questions. The examination component conducted in a mobile examination centre consists of medical, dental, and physiological measurements, as well as laboratory tests administered by highly trained medical personnel.

The data set we focused on is for 2013–2014 consisting of 9100 persons who completed both interview and examination. We consider a multivariate outcome y with k = 3 and two demographic covariates from the dataset. The three components of y are the total cholesterol (mg/dL), ‘LBXSCH’, the first reading of systolic blood pressure (mm Hg), ‘BPXSY1’, and the average sagittal abdominal diameter (cm), ‘BMDAVSAD’. The two covariates are the age in years of the household reference person, ‘DMDHRAGE’, and the total household income (reported as a range value in dollars), ‘INDHHIN2’.

Each of the three components of y has about 28% missing values while two covariates have no missing value. The number of observed values in each of nonresponse pattern for y is shown in Table .

Table 8. The number of observed values in each of nonresponse pattern.

In this example, the three components of y have different distributions and we are interested in estimating the population mean for each yj. Therefore, we apply our proposed estimator in Section 2.4 with GREG, denoted by μˆjGR, and the estimator without generalised regression, denoted by μˆj. For comparison, we also include the naive estimate μˆjN, the sample mean of observed yj values.

Since x is two dimensional, we try two scenarios, z= DMDHRAGE and u= INDHHIN2 in case 1, and z= INDHHIN2 and u= DMDHRAGE in case 2. The propensity model we used is given by (Equation14).

The results for two cases are given in Table , where SE is computed as the squared root of the bootstrap variance estimate with bootstrap size 100.

Table 9. Estimation based on NHANES data.

From both cases, we can see that estimators μˆj and μˆjGR are very similar but are significantly different from the naive estimator μˆjN, indicating that the naive estimator is biased according to our theory and empirical results. The fact that different ways of defining z in (A2) result in very similar estimates of μj's indicates that both covariates DMDHRAGE and INDHHIN2 are suitable to be used as z in (A2), although different z's produce different estimates of parameters in propensity. In this example, covariates may not help very much in estimating the marginal population means, although they are very helpful in handling nonignorable nonresponse.

5. Technical proofs

Proof of Theorem 2.1

The asymptotic normality result (Equation3) follows from the Central Limit Theorem. Hence, it remains to show that the asymptotic mean and variance are of the given form. Let Δ={Δi,,Δi,i=1,,n}. From conditioning, E(μ~)=EEn0nμ0+d=1kndny¯d|Δ=Ed=0kndnμd=d=0kpdμd=μ so that the mean of μ~μ is 0. To derive the asymptotic variance, we calculate Var(μ~)=Varn0nμ0+d=1kndny¯d=EVard=1kndny¯d|Δ+VarEn0nμ0+d=1kndny¯d|Δ=E1n2d=1kndσd2+Vard=0kndnμd=1nd=1kpdσd2+1n2d=0kμd2Var(nd)+1n2dlμdμlCov(nd,nl)=1nd=0kpdσd2+1nd=0kμd2pd(1pd)1ndlμdμlpdpl where the last equality follows from the fact that the vector (n0,n1,,nk) follows a multinomial distribution so that Var(nd)=npd(1pd) and Cov(nd,nl)=npdpl for any dl. Then, the result follows from d=0kμd2pd2+dlμdμlpdpl=d=0kpdμd2=μ2.

Lemma 5.1.

Under the conditions of Theorem 2.2, for each j=1,,k and each d=1,,k, βˆjdβjd in probability as n, where βˆjd is defined in (Equation9) and βjd=Σd1Cov(x,yj|Δ=d).

Proof of Lemma 5.1

For fixed j and d, by (A1), the weak law of large numbers for independent random variables, and Lemma 2.1 in Section 2.1, as n, 1njdi:Δi=drijxiyijE(xyj|Δ=d),x¯jdE(x|Δ=d),y¯jdE(yj|Δ=d) in probability. Therefore, as n, 1njdi:Δi=d(xix¯jd)rijyijE(xyj|Δ=d)E(x|Δ=d)E(yj|Δ=d)=Cov(x,yj|Δ=d) in probability. Similarly, it can be shown that 1njdi:Δi=drij(xix¯i)(xix¯i)Σd in probability.  The proof is completed by combining the results and using the definitions of βˆjd and βjd.

Proof of Theorem 2.2

From (Equation10), μ~jGRμj=n0nμj0+d=1kndn[y¯jd+βˆjd(x¯dx¯jd)]d=0kpdμjd=d=0kndnpdμjd+d=1kndn[(y¯jdμjd)+βˆjd(x¯dx¯jd)]=Uj+Vj+Wj, where Uj=d=1kndn[(y¯jdμjd)+βjd(x¯dx¯jd)]Vj=d=0kndnpdμjdWj=d=1kndn[(βˆjdβjd)(x¯dx¯jd)] By Lemma 5.1, Wj is asymptotically negligible compared with Uj and Vj. Hence, to prove (Equation11), it suffices to show that n(Uj+Vj) converges in distribution to the limiting normal distribution in (Equation11). Consider Vj first. Note that Vj=d=0kndnμjdμj=1nd=0ki:Δi=dμjdμj=1ni:Δi=dd=0kμjdμj. Then E(Vj)=0andVar(Vj)=1nd=0kpdμjd2μj2 From the Central Limit Theorem, nVjVar(Vj)N0,1 in distribution. We now turn to Uj. Let Δ={Δ1,,Δn}. Conditioned on Δ, E(Uj|Δ)=Ed=1kndn[(y¯jdμjd)+βjd(x¯dx¯jd)]|Δ=d=1kndnEy¯jdμjd|Δ+d=1kndnβjdEx¯dx¯jd|Δ=0 where the last equality follows from E(y¯jdμjd|Δ)=0 and E(x¯dx¯jd|Δ)=0 as given Δ, x and yj values in group Δ=d are exchangeable. It follows from the Central Limit Theorem that, conditioned on Δ, nUjVar(Uj|Δ)N0,1 in distribution. Then, unconditionally, nUjE{Var(Uj|Δ)}N0,1 in distribution. To complete the proof, it remains to show two items. One is n1E{Var(Uj|Δ)}=τj2 given in (Equation12); the other is that Cov(Uj,Vj)=0. The latter follows from Cov(Uj,Vj)=Cov{E(Uj|Δ),E(Vj|Δ)}+E{Cov(Uj,Vj|Δ)}=Cov{0,E(Vj|Δ)}+0=0, where the second equality follows from the fact that Vj is a function of Δ so that Cov(Uj,Vj|Δ)=0 almost surely. To calculate E{Var(Uj|Δ)}, note that, for any fixed d, y¯jd+βjd(x¯dx¯jd)=y¯jd+ndnjdndβjd(x~jdx¯jd) where x¯jd=1njdi:Δi=drijxi,x~jd=1ndnjdi:Δi=d(1rij)xi. Since observations in x¯jd and x~jd are not overlapped, conditioned on Δ, Vary¯jd+βjd(x¯dx¯jd)|Δ=Vary¯jd+ndnjdndβjd(x~jdx¯jd)|Δ=Vary¯jd|Δ+(ndnjd)2nd2βjdVarx~jdx¯jd|Δβjd+2(ndnjd)ndβjdCovy¯jd,x~jdx¯jd|Δ=σjd2ndj+(ndnjd)2nd2nd(ndnjd)njdβjdΣdβjd2(ndnjd)ndβjdCovy¯jd,x¯jd|Δ=σjd2njdndnjdndnjdβjdΣdβjd, This shows the desired result and completes the proof.

Acknowledgments

We would like to thank the two referees for comments. The authors' research was partially supported by the National Natural Science Foundation of China grant 11831008 and the U.S. National Science Foundation grant DMS-1914411.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

The authors' research was partially supported by the National Natural Science Foundation of China grant 11831008 and the U.S. National Science Foundation grant DMS-1914411.

Notes on contributors

Sijing Li

Dr. Sijing Li holds a Ph.D. in statistics from University of Wisconsin-Madison. She is now a statistician at Roche in Shanghai, China. Her research interest is in missing data.

Jun Shao

Dr. Jun Shao holds a PhD in statistics from the University of Wisconsin-Madison. He is a Professor of Statistics at the University of Wisconsin-Madison. His research interests include variable selection and inference with high dimensional data, sample surveys, and missing data problems.

References