Nearest neighbour imputation under single index models

Jun Shao & Lei Wang
Pages 208-212 | Received 14 Sep 2019, Accepted 30 Sep 2019, Published online: 11 Oct 2019

ABSTRACT

A popular imputation method used to compensate for item nonresponse in sample surveys is the nearest neighbour imputation (NNI) method, which uses a covariate to define neighbours. When the covariate is multivariate, however, NNI suffers from the well-known curse of dimensionality and gives unstable results. As a remedy, we propose a single-index NNI when the conditional mean of the response given the covariates follows a single index model. For estimating the population mean or quantiles, we establish the consistency and asymptotic normality of the single-index NNI estimators. Some limited simulation results are presented to examine the finite-sample performance of the proposed estimator of the population mean.

1. Introduction

Let $P$ be a finite population containing $N$ units indexed by $i$, $y_i$ be a univariate outcome or response of interest from unit $i \in P$, $x_i$ be a covariate vector associated with $y_i$, and let $S \subset P$ be a sample of size $n$ taken from $P$ according to some sampling design. We consider the situation where $x_i$ is always observed if $i \in S$ but $y_i$ is subject to nonresponse, i.e., $y_i$ is observed if and only if $i \in R \subset S$. In sample surveys, imputation is commonly applied to compensate for nonresponse (Kalton & Kasprzyk, 1986; Rubin, 1987; Sedransk, 1985). The nearest neighbour imputation (NNI) method imputes a missing $y_j$ by $y_l$, where $l \in R$ is the nearest neighbour of $j$ in the sense that $d(x_j, x_l) = \min_{i \in R} d(x_i, x_j)$ and $d(x_i, x_j)$ is a distance between $x_i$ and $x_j$, e.g., the Euclidean distance. It is a popular method in many survey agencies and has a long history of applications in surveys such as the Census 2000 and the Current Population Survey conducted by the U.S. Census Bureau (Farber & Griffin, 1998; Fay, 1999), the Job Openings and Labor Turnover Survey and the Employee Benefits Survey conducted by the U.S. Bureau of Labor Statistics (Montaquila & Ponikowski, 1993), and the Unified Enterprise Survey, the Survey of Household Spending, and the Financial Farm Survey conducted by Statistics Canada (Rancourt, 1999).
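The NNI rule defined in this paragraph can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and a single pool of donors; the function name `nearest_neighbour_impute`, the use of `None` for a missing response, and the toy data are illustrative choices, not taken from the paper.

```python
import math


def nearest_neighbour_impute(x, y):
    """Fill each missing y[j] (None) with the y-value of the respondent
    whose covariate vector is closest to x[j] in Euclidean distance."""
    respondents = [i for i, yi in enumerate(y) if yi is not None]
    if not respondents:
        raise ValueError("at least one respondent is required")
    imputed = list(y)
    for j, yj in enumerate(y):
        if yj is None:
            # Donor l minimises d(x_i, x_j) over respondents i in R
            # (ties are broken by taking the first such respondent).
            donor = min(respondents, key=lambda i: math.dist(x[i], x[j]))
            imputed[j] = y[donor]
    return imputed


# Toy data (hypothetical): unit 1 is a nonrespondent.
x = [(0.0, 0.0), (1.0, 1.0), (0.9, 1.2), (5.0, 5.0)]
y = [2.0, None, 3.5, 10.0]
print(nearest_neighbour_impute(x, y))  # prints [2.0, 3.5, 3.5, 10.0]
```

The imputed value is always an observed y-value from a donor, which is the first of the features discussed next.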

The NNI method has some nice features. First, imputed values are actually occurring y-values, not constructed values; they may not be perfect substitutes, but are unlikely to be nonsensical values. Second, the NNI method may be more efficient than imputation not using x-values, such as mean imputation or random imputation, when x provides useful auxiliary information. Third, the NNI method does not assume a parametric regression model between y and x and, hence, it is more robust against model violations than ratio or regression imputation based on a linear regression model. Finally, under some conditions NNI estimators (i.e., estimators calculated using standard formulas and treating nearest neighbour imputed values as observed data) are asymptotically valid not only for moments of $y_i$ but also for the distribution and quantiles of $y_i$, an advantage over other non-random imputation methods (such as mean, ratio or regression imputation) that lead to valid moment estimators only.

For a univariate covariate $x_i$, some asymptotic properties of NNI are established in Chen and Shao (2000, 2001) and Shao and Wang (2008). When $x_i$ is multivariate, however, NNI runs into the curse of dimensionality. The purpose of this paper is to propose a single-index NNI method for multivariate $x_i$ and derive its asymptotic properties, under the following single index model assumption:

  (A1) The population $P$ can be partitioned into $K$ sub-populations, $P_1, \ldots, P_K$, such that within each $P_k$, the $(x_i, y_i)$'s are independent and identically distributed (i.i.d.) from a superpopulation with $E(y_i \mid x_i) = \mu_k(\beta_k^\top x_i)$, where $\beta_k^\top$ is the transpose of an unknown parameter vector $\beta_k$ with the same dimension as $x_i$ and $\mu_k(\cdot)$ is an unspecified function, $k = 1, \ldots, K$.

Imputation for nonrespondents is typically done within each $P_k$ and, hence, the $P_k$'s are often referred to as imputation classes. They are usually constructed using a categorical variable whose values are observed for all sampled units; for example, under stratified sampling, strata or unions of strata are often used as imputation classes. Each imputation class should contain a large number of sampled units. When there are many strata of small sizes, imputation classes are often obtained through poststratification (Valliant, 1993) and/or combining small strata. The superpopulation assumption on $(x_i, y_i)$ within each imputation class ensures exchangeability of units within each $P_k$. The single index model assumption is a semiparametric assumption, since $\mu_k$ is unspecified.

Details of the proposed method are presented in Section 2, where we also show that estimators based on single-index NNI are consistent and asymptotically normal under some limiting process as the sample size n increases to infinity. To complement the theory, some simulation results are presented in Section 3 to examine the finite sample performance of proposed estimators.

2. Method and theory

We consider one-stage sampling without clusters. Let $w_i$ be the survey weight for unit $i \in P$, which is equal to the inverse of the probability that unit $i$ is selected, a known quantity from the sampling design. When there is no nonresponse, a simple and popular estimator of the unknown population total $Y = \sum_{i \in P} y_i$ is the Horvitz–Thompson estimator $\hat{Y} = \sum_{i \in S} w_i y_i$, which has the unbiasedness property

$$E_s(\hat{Y}) = E_s\Big(\sum_{i \in S} w_i y_i\Big) = \sum_{i \in P} y_i = Y, \tag{1}$$

where $E_s$ is the expectation with respect to sampling. If the total number of units in $P$, $N$, is known, then the population mean $Y/N$ is estimated by $\hat{Y}/N$. If $N$ is unknown, then $Y/N$ is estimated by $\hat{Y}/\hat{N}$, where $\hat{N} = \sum_{i \in S} w_i$ satisfies $E_s(\hat{N}) = N$.
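Property (1) can be checked numerically on a small example. The sketch below assumes simple random sampling without replacement from a hypothetical population of four units, so that every sample is equally likely and every weight equals $N/n$; the population values and the function name `horvitz_thompson_total` are illustrative only.

```python
from itertools import combinations


def horvitz_thompson_total(y_sample, w_sample):
    """Weighted sum of sampled y-values: sum of w_i * y_i over i in S."""
    return sum(w * y for w, y in zip(w_sample, y_sample))


# Hypothetical population of N = 4 units; simple random sampling without
# replacement of n = 2 units gives every unit the weight w_i = N / n = 2.
population = [1.0, 2.0, 3.0, 4.0]
N, n = len(population), 2
weight = N / n

estimates = [
    horvitz_thompson_total(sample, [weight] * n)
    for sample in combinations(population, n)
]

# All 6 possible samples are equally likely under this design, so the
# sampling expectation E_s is the plain average of the 6 estimates.
expected_estimate = sum(estimates) / len(estimates)
print(expected_estimate, sum(population))  # prints 10.0 10.0
```

Individual estimates range from 6 to 14, but their average over the sampling design equals the true total, which is what unbiasedness with respect to sampling means.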

The most important population parameter in a survey study concerning a variable y is the population mean. Estimation of population quantiles has also become more and more important in modern survey studies. For income variables, for example, the median income or other quantiles could be as important as the mean income. In children with cystic fibrosis, the 10th percentiles of height and weight are important clinical boundaries between healthy and possibly nutritionally compromised patients (Kosorok, 1999). Let $I(y_i \le t)$ be the indicator of $y_i \le t$ for any fixed value $t$. Using property (1) with $y_i$ replaced by $I(y_i \le t)$, we obtain an approximately unbiased estimator $\sum_{i \in S} w_i I(y_i \le t)/\hat{N}$ of the population cumulative distribution of $y_i$ at $t$, which further leads to an approximately unbiased estimator of any quantile of the distribution of $y_i$.
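As an illustration of the weighted distribution estimator and the quantile estimator it leads to, here is a small sketch. The quantile is taken as the smallest observed value $t$ at which the estimated distribution function reaches the level $p$; this convention, the function names and the toy data are assumptions made for the example, since the text does not fix a particular quantile definition.

```python
def weighted_cdf(y, w, t):
    """Estimate the population CDF at t: sum of w_i * I(y_i <= t) / N_hat."""
    n_hat = sum(w)  # N_hat = sum of the survey weights
    return sum(wi for wi, yi in zip(w, y) if yi <= t) / n_hat


def weighted_quantile(y, w, p):
    """Smallest observed value t whose estimated CDF is at least p."""
    n_hat = sum(w)
    cumulative = 0.0
    for yi, wi in sorted(zip(y, w)):
        cumulative += wi
        if cumulative / n_hat >= p:
            return yi
    return max(y)  # guards against rounding when p is close to 1


# Toy sample (hypothetical values and weights).
y = [5.0, 1.0, 3.0, 2.0]
w = [1.0, 2.0, 3.0, 4.0]
print(weighted_cdf(y, w, 2.0))       # prints 0.6
print(weighted_quantile(y, w, 0.5))  # prints 2.0
```

Dividing by the sum of the weights rather than by a known $N$ keeps the estimated distribution function equal to 1 at the largest observed value.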

When $y_i$ has nonresponse, however, the previously discussed estimators cannot be used. Imputation is a popular technique to handle nonresponse. It fills in a value for every nonrespondent $y_j$, such that an unbiased or approximately unbiased estimator can be obtained using the formula for the situation of no nonresponse with imputed values treated as observed values. That is, if $\hat{y}_j$ is an imputed value for nonrespondent $y_j$, then our estimator of the population total $Y$ is

$$\hat{Y}_I = \sum_{i \in R} w_i y_i + \sum_{j \in \bar{R}} w_j \hat{y}_j, \tag{2}$$

where $R$ and $\bar{R}$ are the sets of respondents and nonrespondents, respectively, in the sample $S = R \cup \bar{R}$.

Under (A1), we consider NNI within each imputation class and independently across imputation classes. For a multivariate $x_i$, if $\beta_k$ in (A1) is known, we can apply a single-index NNI by defining the distance between $x_i$ and $x_j$ as $|\beta_k^\top x_i - \beta_k^\top x_j|$, which avoids the curse of dimensionality in multivariate NNI. As $\beta_k$ is generally unknown, we can first estimate $\beta_k$ by $\hat{\beta}_k$ using a nonparametric method such as the sliced inverse regression (SIR) proposed by Li and Duan (1991) or the sliced average variance estimation (SAVE) proposed by Cook and Weisberg (1991), and then apply single-index NNI using $|\hat{\beta}_k^\top x_i - \hat{\beta}_k^\top x_j|$ as the distance between $x_i$ and $x_j$, i.e., a nonrespondent $y_j$ in imputation class $k$ is imputed by $\hat{y}_j = y_l$ with $l$ satisfying

$$|\hat{\beta}_k^\top x_l - \hat{\beta}_k^\top x_j| = \min_{i \in R \cap P_k} |\hat{\beta}_k^\top x_i - \hat{\beta}_k^\top x_j|. \tag{3}$$

After imputation, the population total $Y$ is estimated by $\hat{Y}_I$ in (2) with $\hat{y}_j$ defined by (3). The population cumulative distribution of $y_i$ at any $t$ is estimated by

$$\hat{F}_I(t) = \frac{1}{\hat{N}} \Big\{ \sum_{i \in R} w_i I(y_i \le t) + \sum_{j \in \bar{R}} w_j I(\hat{y}_j \le t) \Big\},$$

regardless of whether $N$ is known or unknown (dividing by $\hat{N}$ ensures that the estimate tends to 1 as $t \to \infty$).
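The imputation rule (3) and the total estimator (2) can be sketched for a single imputation class as follows. The direction `beta_hat` is taken as given; estimating it by SIR or SAVE is outside the scope of this sketch. The function name `single_index_nni_total`, the `None` convention for nonrespondents and the toy data are hypothetical.

```python
def single_index_nni_total(x, y, w, beta_hat):
    """Impute missing y-values (None) by single-index NNI within one
    imputation class and return (imputed y-values, estimated total)."""
    # Scalar index beta_hat' x_i for every sampled unit.
    index = [sum(b * v for b, v in zip(beta_hat, row)) for row in x]
    respondents = [i for i, yi in enumerate(y) if yi is not None]
    if not respondents:
        raise ValueError("at least one respondent is required")
    imputed = list(y)
    for j, yj in enumerate(y):
        if yj is None:
            # Rule (3): the donor minimises the distance between indices.
            donor = min(respondents, key=lambda i: abs(index[i] - index[j]))
            imputed[j] = y[donor]
    # Formula (2): imputed values are treated as if they were observed.
    total = sum(wi * yi for wi, yi in zip(w, imputed))
    return imputed, total


# Toy class (hypothetical): units 1 and 3 are nonrespondents.
x = [(1.0, 0.0), (0.0, 2.0), (3.0, 1.0), (2.0, 2.0)]
y = [1.0, None, 4.5, None]
w = [2.0, 2.0, 2.0, 2.0]
imputed, total = single_index_nni_total(x, y, w, beta_hat=(1.0, 1.0))
print(imputed, total)  # prints [1.0, 1.0, 4.5, 4.5] 22.0
```

Whatever the dimension of the covariate, the donor search only compares scalars, which is how the single-index distance sidesteps the dimensionality problem.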

To consider asymptotic properties of estimators based on single-index NNI, we assume that the finite population $P$ is a member of a sequence of finite populations indexed by $\nu$. All limiting processes in this paper are understood to be as $\nu \to \infty$. We need the following assumptions in addition to (A1).

  (A2) The size of $P_k$ and the sample size of $S \cap P_k$ increase to infinity as $\nu \to \infty$, while the number of sub-populations, $K$, is fixed.

  (A3) There is a fixed constant $c > 0$ (not depending on $\nu$) such that $\max_{i \in P} n w_i / N \le c$ and $(n/N^2)\, E_s\big(\sum_{i \in S} w_i - N\big)^2 \le c$.

Recall that $N$ is the size of $P$ and $n$ is the sample size. The first condition in (A3) ensures that none of the weights $w_i$ is disproportionately large (see Krewski & Rao, 1981). The second condition in (A3) means that the sampling variance of $\sum_{i \in S} w_i / N$ is at most of the order $n^{-1}$. These conditions are typically satisfied; for example, they hold under stratified simple random sampling designs.

Let $a_i$ be the response indicator, i.e., $a_i = 1$ if $y_i$ is observed and $a_i = 0$ if $y_i$ is a nonrespondent.

  (A4) Within each $P_k$, the $(x_i, y_i, a_i)$'s are i.i.d. from a superpopulation with $E(y_i^8) < \infty$, the $(x_i, y_i, a_i)$'s from different imputation classes are independent, and sampling is independent of the superpopulation.

  (A5) Within each $P_k$, under the superpopulation, $P(a_i = 1 \mid x_i, y_i, k) = P(a_i = 1 \mid x_i, k) > 0$, which is continuous in $x_i$.

  (A6) Within each $P_k$, the conditional distribution of $x_i$ given $a_i$ has a bounded and continuous Lebesgue density and $\mu_k(\cdot)$ in (A1) is a differentiable function.

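  (A7) Within each $P_k$, $q_{k,i}(\gamma) = P\big( |\gamma^\top x - \gamma^\top x_i| = \min_{j \in R_k} |\gamma^\top x - \gamma^\top x_j| \,\big|\, X_k, R_k, S_k \big)$ is differentiable with respect to $\gamma$, where $P$ is with respect to $x$ under the superpopulation, $S_k = S \cap P_k$, $R_k = R \cap P_k$, and $X_k = \{x_i : i \in R_k\}$.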

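  (A8) For each $k$, $n^{1/2}(\hat{\beta}_k - \beta_k) = n^{-1/2} \sum_{i \in S \cap P_k} \varphi(x_i, y_i, a_i) + o_p(1)$, where $\varphi$ is a function satisfying $E\{\varphi(x_i, y_i, a_i)\} = 0$ and $E\{\varphi(x_i, y_i, a_i)\}^2 < \infty$, and $o_p(1)$ denotes a term converging to 0 in probability.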

Because of (A4), NNI is carried out within each $S \cap P_k$. (A5) assumes that, within an imputation class, the nonresponse mechanism is covariate-dependent (Little, 1995) or unconfounded (Lee, Rancourt, & Särndal, 1994), an assumption made for the validity of many other popular imputation methods. This is actually the main reason for constructing imputation classes, in addition to the exchangeability of the $(x_i, y_i)$'s. Although the $(x_i, y_i, a_i)$'s within an imputation class are i.i.d., the nonresponse mechanism is still not completely at random, since $P(a_i = 1 \mid x_i, k)$ depends on the covariate $x_i$. Finally, (A8) is satisfied if $\hat{\beta}_k$ is obtained using SIR (Li & Duan, 1991) or SAVE (Cook & Weisberg, 1991).

The following is our main theoretical result.

Theorem

Assume (A1)–(A8). Let $\hat{Y}_I$ be defined by (2) with imputed $\hat{y}_j$ based on single-index NNI. Then $\sqrt{n}\,(\hat{Y}_I/N - Y/N)/\sigma \to_d N(0,1)$ for some $\sigma > 0$, where $\to_d$ denotes convergence in distribution unconditionally with respect to the superpopulation and sampling.

Similar results can be obtained for $\hat{F}_I(t)$ with any $t$ and for quantiles related to $\hat{F}_I$.

Proof of Theorem

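The proof follows the same argument as in Shao and Wang (2008). Since variables are independent across imputation classes and imputation is carried out within each imputation class, it suffices to show the result within each imputation class or, equivalently, the result when $K = 1$. We now drop the subscript $k$ in this proof. Let $S$, $R$ and $X$ be defined as before with subscript $k$ dropped. Then, for a nonrespondent $j \in \bar{R}$, $E(\hat{y}_j \mid X, R, S) = \sum_{i \in R} q_i(\hat{\beta}) y_i$, where $q_i(\hat{\beta})$ is the probability that $i \in R$ is selected as the nearest neighbour of a nonrespondent and $q_i(\beta)$ is defined in (A7) with subscript $k$ dropped. Define $\hat{\mu}_I = \hat{Y}_I/N$, $\mu = Y/N$, $\mu_1 = E(y_i \mid a_i = 1)$, $\mu_0 = E(y_i \mid a_i = 0)$, $p = P(a_i = 1)$, $\bar{w}_i = w_i/N$, $\hat{e}_j = \hat{y}_j - \sum_{i \in R} q_i(\hat{\beta}) y_i$, and

$$\begin{aligned}
Q_1 &= \sum_{j \in \bar{R}} \bar{w}_j \hat{e}_j, \\
\hat{Q}_2 &= \sum_{i \in R} \bar{w}_i [y_i - \mu(\hat{\beta}^\top x_i)] + (1-p) \sum_{i \in R} q_i(\hat{\beta}) [y_i - \mu(\hat{\beta}^\top x_i)], \\
\hat{Q}_3 &= \sum_{i \in R} \bar{w}_i [\mu(\hat{\beta}^\top x_i) - \mu_1], \\
\hat{Q}_4 &= \Big\{ \sum_{j \in \bar{R}} \bar{w}_j - (1-p) \Big\} \sum_{i \in R} q_i(\hat{\beta}) [y_i - \mu(\hat{\beta}^\top x_i)] + \sum_{j \in \bar{R}} \bar{w}_j \Big[ \sum_{i \in R} q_i(\hat{\beta}) \mu(\hat{\beta}^\top x_i) - \mu_0 \Big], \\
Q_5 &= (\mu_1 - \mu_0) \sum_{i \in S} \bar{w}_i (a_i - p), \\
Q_6 &= \mu \Big( \sum_{i \in S} \bar{w}_i - 1 \Big).
\end{aligned}$$

Also, for $l = 2, 3, 4$, define $Q_l$ to be $\hat{Q}_l$ with $\hat{\beta}$ replaced by $\beta$. Then

$$\hat{\mu}_I - \mu = Q_1 + \hat{Q}_2 + \hat{Q}_3 + \hat{Q}_4 + Q_5 + Q_6 = Q_1 + Q_2 + Q_3 + Q_4 + Q_5 + Q_6 + (\hat{Q}_2 - Q_2) + (\hat{Q}_3 - Q_3) + (\hat{Q}_4 - Q_4).$$

It is shown in Shao and Wang (2008) that each $n^{1/2} Q_l$ is an approximately linear function of random variables converging in distribution to a normal distribution with mean 0. Under (A6)–(A8) and Taylor expansions, we can show that each $\hat{Q}_l - Q_l$, $l = 2, 3, 4$, can be approximated by a linear function of random variables converging in distribution to a normal distribution with mean 0. Hence, the result follows by repeatedly applying Lemma 1 in Schenker and Welsh (1988).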
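As a bookkeeping check on the decomposition of $\hat{\mu}_I - \mu$ used in the proof, the six terms can be summed directly. The shorthand $\bar{W}_1 = \sum_{i \in R} \bar{w}_i$, $\bar{W}_0 = \sum_{j \in \bar{R}} \bar{w}_j$, $D = \sum_{i \in R} q_i(\hat{\beta}) [y_i - \mu(\hat{\beta}^\top x_i)]$ and $M = \sum_{i \in R} q_i(\hat{\beta}) \mu(\hat{\beta}^\top x_i)$ is introduced here only for this check and is not notation from the paper. The sketch reads the first factor of $\hat{Q}_4$ as $\bar{W}_0 - (1-p)$ and assumes that $\mu$ satisfies $\mu = p\mu_1 + (1-p)\mu_0$, as it does for the superpopulation mean.

```latex
\begin{align*}
Q_1 &= \sum_{j \in \bar{R}} \bar{w}_j \hat{y}_j - \bar{W}_0 (D + M), \\
\hat{Q}_2 &= \sum_{i \in R} \bar{w}_i y_i
             - \sum_{i \in R} \bar{w}_i \mu(\hat{\beta}^\top x_i) + (1-p) D, \\
\hat{Q}_3 &= \sum_{i \in R} \bar{w}_i \mu(\hat{\beta}^\top x_i) - \bar{W}_1 \mu_1, \\
\hat{Q}_4 &= \{\bar{W}_0 - (1-p)\} D + \bar{W}_0 (M - \mu_0), \\
Q_1 + \hat{Q}_2 + \hat{Q}_3 + \hat{Q}_4
          &= \hat{\mu}_I - \bar{W}_1 \mu_1 - \bar{W}_0 \mu_0, \\
Q_5 + Q_6 &= (\mu_1 - \mu_0)\{\bar{W}_1 - p(\bar{W}_1 + \bar{W}_0)\}
             + \mu(\bar{W}_1 + \bar{W}_0 - 1), \\
Q_1 + \hat{Q}_2 + \hat{Q}_3 + \hat{Q}_4 + Q_5 + Q_6
          &= \hat{\mu}_I - \mu
             + (\bar{W}_1 + \bar{W}_0)\{\mu - p\mu_1 - (1-p)\mu_0\}
           = \hat{\mu}_I - \mu .
\end{align*}
```

The terms in $D$ and $M$ cancel between $Q_1$, $\hat{Q}_2$ and $\hat{Q}_4$, and the last equality uses the assumed relation between $\mu$, $\mu_1$ and $\mu_0$.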

3. Simulation results

A simulation study is performed to examine the finite sample performance of $\hat{\mu}_I = \hat{Y}_I/N$ with $\hat{Y}_I$ defined in (2) and $w_i = N/n$. With a sample of size $n = 200$ or $500$, data $(x_1, y_1, a_1), \ldots, (x_n, y_n, a_n)$ are generated i.i.d. as follows. First, a three-dimensional covariate vector $x_i$ is generated from the multivariate normal distribution with mean vector $(1, 1, 1)$ and covariance matrix

$$\begin{pmatrix} 1 & 0.5 & 0.25 \\ 0.5 & 1 & 0.5 \\ 0.25 & 0.5 & 1 \end{pmatrix}.$$

Conditioned on $x_i$, $y_i$ is generated according to a linear model, $y_i = \beta^\top x_i + \varepsilon_i$, or a nonlinear model, $y_i = 0.5(\beta^\top x_i)^2 + \varepsilon_i$, where $\beta = (1, 1, 1)^\top$ and $\varepsilon_i$ is generated from one of the following three distributions:

  1. normal distribution N(0,4),

  2. mixture normal distribution 0.4N(0,1)+0.6N(0,9),

  3. heteroscedastic normal distribution $N(0, x_{i1}^2 + 1)$, where $x_{i1}$ is the first component of $x_i$.

Conditioned on $x_i$, the response indicator $a_i$ is generated from the Bernoulli distribution with probability $\pi(x_i) = 1/[1 + \exp(-0.4 - 0.1\beta^\top x_i)]$, where the coefficients in $\pi(x_i)$ are chosen so that the unconditional rates of missing data are between 20% and 40%. For each $i$, $x_i$ is observed and $y_i$ is observed if and only if $a_i = 1$.
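The data-generating process described above can be sketched with the Python standard library alone. Several points are assumptions of this sketch rather than statements from the paper: the second argument of each normal distribution is read as a variance, the response probability is taken as $1/[1 + \exp(-0.4 - 0.1\beta^\top x_i)]$ (the sign choice that gives a missing rate of roughly one third, inside the stated 20% to 40% range), and the names `cholesky`, `draw_error` and `generate_sample` are hypothetical.

```python
import math
import random

MEAN = (1.0, 1.0, 1.0)
COV = [[1.0, 0.5, 0.25], [0.5, 1.0, 0.5], [0.25, 0.5, 1.0]]
BETA = (1.0, 1.0, 1.0)


def cholesky(a):
    """Lower-triangular L with L L' = a, for a small positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def draw_error(kind, x, rng):
    """Draw one error term; variances follow the three listed cases."""
    if kind == "normal":      # N(0, 4): standard deviation 2
        return rng.gauss(0.0, 2.0)
    if kind == "mixture":     # 0.4 N(0, 1) + 0.6 N(0, 9)
        return rng.gauss(0.0, 1.0 if rng.random() < 0.4 else 3.0)
    if kind == "hetero":      # N(0, x_{i1}^2 + 1)
        return rng.gauss(0.0, math.sqrt(x[0] ** 2 + 1.0))
    raise ValueError(f"unknown error distribution: {kind}")


def generate_sample(n, model="linear", error="normal", seed=0):
    """Return a list of (x_i, y_i, a_i); y_i counts as observed iff a_i == 1."""
    rng = random.Random(seed)
    low = cholesky(COV)
    data = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(3)]
        x = tuple(
            MEAN[i] + sum(low[i][k] * z[k] for k in range(i + 1))
            for i in range(3)
        )
        index = sum(b * v for b, v in zip(BETA, x))  # beta' x_i
        mean_y = index if model == "linear" else 0.5 * index ** 2
        y = mean_y + draw_error(error, x, rng)
        # Assumed sign convention for the response probability pi(x_i).
        prob_response = 1.0 / (1.0 + math.exp(-0.4 - 0.1 * index))
        a = 1 if rng.random() < prob_response else 0
        data.append((x, y, a))
    return data


sample = generate_sample(500, model="linear", error="normal", seed=2019)
missing_rate = 1.0 - sum(a for _, _, a in sample) / len(sample)
print(f"missing rate in this draw: {missing_rate:.2f}")
```

The full response $y_i$ is kept for every unit so that an estimator that ignores nonresponse can be computed from the same draw as a benchmark.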

For simplicity, we consider $K = 1$ in (A1) and $N = n$. Then, $\hat{\mu}_I = \hat{Y}_I/n$ is considered as an estimator of the super-population mean $\mu = E(y_i)$, which is $\mu = 3$ under the linear model and $\mu = 7.25$ under the nonlinear model. To apply single-index NNI in (2), SAVE (Cook & Weisberg, 1991) is used to obtain the estimator $\hat{\beta}$.

To evaluate the performance, we add two oracle estimators, in addition to $\hat{\mu}_I$. The first oracle estimator is $\hat{\mu} = \sum_{i=1}^{n} y_i / n$, the sample mean without nonresponse, assuming we observe all $y_i$'s. The second oracle estimator is $\tilde{\mu}_I$, which is the same as $\hat{\mu}_I$ except that the true $\beta$, instead of $\hat{\beta}$, is used in finding the nearest neighbour.

Table 1 provides the simulation bias and standard deviation (SD) of $\hat{\mu}$, $\tilde{\mu}_I$ and $\hat{\mu}_I$ based on 1000 runs. It can be seen from Table 1 that all biases are negligible. In terms of the SD, $\hat{\mu}_I$ based on single-index NNI is just slightly worse than the oracle estimator $\tilde{\mu}_I$ using the true $\beta$ instead of $\hat{\beta}$.
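For readers reproducing such a study, the two summary measures can be computed from the replicate estimates as in the sketch below. The divisor (number of runs minus one) used for the SD is an assumption, since the text does not state it, and the replicate values shown are made-up illustrative numbers, not results from the paper.

```python
import statistics


def simulation_bias_and_sd(estimates, true_value):
    """Return (bias, SD) of Monte Carlo estimates of a known target."""
    bias = statistics.fmean(estimates) - true_value
    sd = statistics.stdev(estimates)  # divisor runs - 1 (assumed convention)
    return bias, sd


# Hypothetical replicate estimates of mu = 3 (illustrative numbers only).
runs = [2.5, 3.5, 3.0, 3.0]
bias, sd = simulation_bias_and_sd(runs, 3.0)
print(bias, round(sd, 4))  # prints 0.0 0.4082
```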

Table 1. Simulation bias and standard deviation (SD) in estimating μ (1000 runs).

The empirical results are consistent with our theoretical findings.

Acknowledgements

We would like to thank two referees for their comments and suggestions.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

This work was partially supported by the National Natural Science Foundation of China grants 11831008 and 11871287, the U.S. National Science Foundation grants DMS-1612873 and DMS-1914411, the Natural Science Foundation of Tianjin (18JCYBJC41100) and the Fundamental Research Funds for the Central Universities.

Notes on contributors

Jun Shao

Dr Jun Shao holds a PhD in statistics from the University of Wisconsin-Madison. He is a professor of statistics at the University of Wisconsin-Madison and East China Normal University. His research interests include variable selection and inference with high dimensional data, sample surveys, and missing data problems.

Lei Wang

Dr Lei Wang holds a PhD in statistics from East China Normal University. He is an assistant professor of statistics at Nankai University. His research interests include empirical likelihood and missing data problems.

References

  • Chen, J., & Shao, J. (2000). Nearest neighbor imputation for survey data. Journal of Official Statistics, 16, 113–132.
  • Chen, J., & Shao, J. (2001). Jackknife variance estimation for nearest-neighbor imputation. Journal of the American Statistical Association, 96, 260–269. doi: 10.1198/016214501750332839
  • Cook, R. D., & Weisberg, S. (1991). Discussion of ‘Sliced inverse regression for dimension reduction’. Journal of the American Statistical Association, 86, 28–33.
  • Farber, J. E., & Griffin, R. (1998). A comparison of alternative estimation methodologies for Census 2000. Proceedings of the section on survey research methods (pp. 629–634). American Statistical Association.
  • Fay, R. E. (1999). Theory and application of nearest neighbor imputation in Census 2000. Proceedings of the section on survey research methods (pp. 112–121). American Statistical Association.
  • Kalton, G., & Kasprzyk, D. (1986). The treatment of missing data. Survey Methodology, 12, 1–16.
  • Kosorok, M. R. (1999). Two-sample quantile tests under general conditions. Biometrika, 86, 909–921. doi: 10.1093/biomet/86.4.909
  • Krewski, D., & Rao, J. N. K. (1981). Inference from stratified samples: Properties of the linearization, jackknife and balanced repeated replication methods. The Annals of Statistics, 9, 1010–1019. doi: 10.1214/aos/1176345580
  • Lee, H., Rancourt, E., & Särndal, C. E. (1994). Experiments with variance estimation from survey data with imputed values. Journal of Official Statistics, 10, 231–243.
  • Li, K. C., & Duan, N. (1991). Regression analysis under link violation. The Annals of Statistics, 17, 1009–1052. doi: 10.1214/aos/1176347254
  • Little, R. J. (1995). Modeling the dropout mechanism in repeated-measures studies. Journal of the American Statistical Association, 90, 1112–1121. doi: 10.1080/01621459.1995.10476615
  • Montaquila, J. M., & Ponikowski, C. H. (1993). Comparison of methods for imputing missing responses in an establishment survey. Proceedings of the section on survey research methods (pp. 446–451). American Statistical Association.
  • Rancourt, E. (1999). Estimation with nearest neighbor imputation at Statistics Canada. Proceedings of the section on survey research methods (pp. 131–138). American Statistical Association.
  • Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.
  • Schenker, N., & Welsh, A. H. (1988). Asymptotic results for multiple imputation. The Annals of Statistics, 16, 1550–1566. doi: 10.1214/aos/1176351053
  • Sedransk, J. (1985). The objective and practice of imputation. Proceedings of the first annual research conference (pp. 445–452). Washington, DC: Bureau of the Census.
  • Shao, J., & Wang, H. (2008). Confidence intervals based on survey data with nearest neighbor imputation. Statistica Sinica, 18, 281–297.
  • Valliant, R. (1993). Poststratification and conditional variance estimation. Journal of the American Statistical Association, 88, 89–96.
