ABSTRACT
A popular imputation method used to compensate for item nonresponse in sample surveys is the nearest neighbour imputation (NNI) method, which uses a covariate to define neighbours. When the covariate is multivariate, however, NNI suffers from the well-known curse of dimensionality and gives unstable results. As a remedy, we propose a single-index NNI when the conditional mean of the response given the covariates follows a single index model. For estimating the population mean or quantiles, we establish the consistency and asymptotic normality of the single-index NNI estimators. Some limited simulation results are presented to examine the finite-sample performance of the proposed estimator of the population mean.
1. Introduction
Let $\mathcal{P}$ be a finite population containing N units indexed by i, $y_i$ be a univariate outcome or response of interest from unit $i \in \mathcal{P}$, $x_i$ be a covariate vector associated with $y_i$, and let $\mathcal{S}$ be a sample of size n taken from $\mathcal{P}$ according to some sampling design. We consider the situation where $x_i$ is always observed if $i \in \mathcal{S}$ but $y_i$ is subject to nonresponse, i.e., $y_i$ is observed if and only if $i \in \mathcal{R}$, where $\mathcal{R} \subset \mathcal{S}$ is the set of respondents. In sample surveys, imputation is commonly applied to compensate for nonresponse (Kalton & Kasprzyk, 1986; Rubin, 1987; Sedransk, 1985). The nearest neighbour imputation (NNI) method imputes a missing $y_j$ by $y_i$, where i is the nearest neighbour of j in the sense that $d(x_i, x_j) = \min_{l \in \mathcal{R}} d(x_l, x_j)$ and $d(x_i, x_j)$ is a distance between $x_i$ and $x_j$, e.g., the Euclidean distance. It is a popular method in many survey agencies and has a long history of applications in surveys such as the Census 2000 and the Current Population Survey conducted by the U.S. Census Bureau (Farber & Griffin, 1998; Fay, 1999), the Job Openings and Labor Turnover Survey and the Employee Benefits Survey conducted by the U.S. Bureau of Labor Statistics (Montaquila & Ponikowski, 1993), and the Unified Enterprise Survey, the Survey of Household Spending, and the Financial Farm Survey conducted by Statistics Canada (Rancourt, 1999).
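As an illustration of the NNI rule just described, the following is a minimal Python sketch using the Euclidean distance. The function name `nearest_neighbour_impute` and the toy data are hypothetical and serve only to make the rule concrete; ties are broken by taking the first nearest respondent, which is an assumption of this sketch.

```python
import numpy as np


def nearest_neighbour_impute(x, y, observed):
    """Impute each missing y-value with the y-value of the nearest respondent.

    x        : (n, p) array of covariates, observed for every sampled unit
    y        : (n,) array of outcomes; entries where `observed` is False are ignored
    observed : (n,) boolean array, True for respondents
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float).reshape(len(y), -1)
    observed = np.asarray(observed, dtype=bool)

    donors = np.flatnonzero(observed)        # respondents
    recipients = np.flatnonzero(~observed)   # nonrespondents
    y_imputed = y.copy()
    for j in recipients:
        # Euclidean distance from nonrespondent j to every respondent
        dist = np.linalg.norm(x[donors] - x[j], axis=1)
        y_imputed[j] = y[donors[np.argmin(dist)]]
    return y_imputed


# Toy data (hypothetical): two respondents and two nonrespondents.
x = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 1.2], [0.1, -0.1]])
y = np.array([10.0, 20.0, np.nan, np.nan])
observed = np.array([True, True, False, False])
y_completed = nearest_neighbour_impute(x, y, observed)
```

Every imputed value is an actually occurring y-value from a respondent, which is the first of the features discussed next.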
The NNI method has some nice features. First, imputed values are actually occurring y-values, not constructed values; they may not be perfect substitutes, but are unlikely to be nonsensical values. Second, the NNI method may be more efficient than imputation not using x-values, such as mean imputation or random imputation, when x provides useful auxiliary information. Third, the NNI method does not assume a parametric regression model between y and x and, hence, it is more robust against model violations than ratio or regression imputation based on a linear regression model. Finally, under some conditions NNI estimators (i.e., estimators calculated using standard formulas and treating nearest neighbour imputed values as observed data) are asymptotically valid not only for moments of y but also for the distribution and quantiles of y, which is an advantage over other non-random imputation methods (such as mean, ratio or regression imputation) that lead to valid moment estimators only.
For a univariate covariate $x_i$, some asymptotic properties of NNI are established in Chen and Shao (2000, 2001) and Shao and Wang (2008). When $x_i$ is multivariate, however, NNI runs into the curse of dimensionality problem. The purpose of this paper is to propose a single-index NNI method for multivariate $x_i$ and derive its asymptotic properties, under the following single index model assumption:
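(A1) The population $\mathcal{P}$ can be partitioned into K sub-populations, $\mathcal{P}_1, \ldots, \mathcal{P}_K$, such that for i within each $\mathcal{P}_k$, $(y_i, x_i)$'s are independent and identically distributed (i.i.d.) from a superpopulation with $E(y_i \mid x_i) = \psi_k(\beta_k^\top x_i)$, where $\beta_k^\top$ is the transpose of an unknown parameter vector $\beta_k$ with the same dimension as $x_i$ and $\psi_k$ is an unspecified function, $k = 1, \ldots, K$.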
Imputation for nonrespondents is typically done within each $\mathcal{P}_k$ and, hence, $\mathcal{P}_k$'s are often referred to as imputation classes. They are usually constructed using a categorical variable whose values are observed for all sampled units; for example, under stratified sampling, strata or unions of strata are often used as imputation classes. Each imputation class should contain a large number of sampled units. When there are many strata of small sizes, imputation classes are often obtained through poststratification (Valliant, 1993) and/or combining small strata. The superpopulation assumption on $(y_i, x_i)$ within each imputation class ensures exchangeability of units within each $\mathcal{P}_k$. The single index model assumption is a semiparametric assumption, since the function $\psi_k$ in (A1) is unspecified.
Details of the proposed method are presented in Section 2, where we also show that estimators based on single-index NNI are consistent and asymptotically normal under some limiting process as the sample size n increases to infinity. To complement the theory, some simulation results are presented in Section 3 to examine the finite sample performance of proposed estimators.
2. Method and theory
We consider one-stage sampling without clusters. Let $w_i$ be the survey weight for unit $i \in \mathcal{S}$, which is equal to the inverse of the probability that unit i is selected, a known quantity from the sampling design. When there is no nonresponse, a simple and popular estimator of the unknown population total $Y = \sum_{i \in \mathcal{P}} y_i$ is the Horvitz–Thompson estimator $\hat{Y} = \sum_{i \in \mathcal{S}} w_i y_i$, which has the unbiasedness property
$$E_s(\hat{Y}) = Y, \qquad (1)$$
where $E_s$ is the expectation with respect to sampling. If the total number of units in $\mathcal{P}$, N, is known, then the population mean Y/N is estimated by $\hat{Y}/N$. If N is unknown, then Y/N is estimated by $\hat{Y}/\hat{N}$, where $\hat{N} = \sum_{i \in \mathcal{S}} w_i$ satisfies $E_s(\hat{N}) = N$.
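The Horvitz–Thompson total and the two versions of the mean estimator can be sketched in a few lines of Python. The function names and the toy sample (a simple random sample of 4 units from a population of 20, so every weight equals 5) are hypothetical.

```python
import numpy as np


def horvitz_thompson_total(y, w):
    """Weighted sum of the sampled y-values, with w the inverse selection probabilities."""
    return float(np.sum(np.asarray(w, dtype=float) * np.asarray(y, dtype=float)))


def estimated_mean(y, w, N=None):
    """Estimate the population mean.

    If the population size N is known, divide the estimated total by N;
    otherwise divide by the sum of the weights, which estimates N.
    """
    total = horvitz_thompson_total(y, w)
    denominator = float(np.sum(w)) if N is None else float(N)
    return total / denominator


# Toy example (hypothetical): n = 4 units drawn from N = 20, so each weight is 20 / 4 = 5.
y_sample = np.array([2.0, 4.0, 6.0, 8.0])
w_sample = np.full(4, 5.0)
total_hat = horvitz_thompson_total(y_sample, w_sample)
mean_known_N = estimated_mean(y_sample, w_sample, N=20)
mean_unknown_N = estimated_mean(y_sample, w_sample)
```

In this toy case the weights sum exactly to the population size, so the two mean estimates coincide; in general they differ.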
The most important population parameter in a survey study concerning a variable y is the population mean. Estimation of population quantiles has also become increasingly important in modern survey studies. For income variables, for example, the median income or other quantiles could be as important as the mean income. In children with cystic fibrosis, the 10th percentiles of height and weight are important clinical boundaries between healthy and possibly nutritionally compromised patients (Kosorok, 1999). Let $I(y_i \le t)$ be the indicator of $y_i \le t$ for any fixed value t. Using property (1) with $y_i$ replaced by $I(y_i \le t)$, we obtain an approximately unbiased estimator of the population cumulative distribution of $y_i$ at t, which further leads to an approximately unbiased estimator of any quantile of the distribution of $y_i$.
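A weighted distribution function and the quantile derived from it might be computed as in the sketch below. Two choices here are assumptions of the sketch rather than statements from the text above: the estimate is normalised by the sum of the weights, and the quantile is taken as the smallest observed value at which the estimated distribution function reaches the target level. The function names and data are hypothetical.

```python
import numpy as np


def weighted_cdf(y, w, t):
    """Weighted proportion of units with y <= t, normalised by the sum of the weights."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    return float(np.sum(w * (y <= t)) / np.sum(w))


def weighted_quantile(y, w, q):
    """Smallest observed y-value whose weighted CDF is at least q (0 < q <= 1)."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    order = np.argsort(y)
    y_sorted = y[order]
    cumulative = np.cumsum(w[order]) / np.sum(w)
    idx = np.searchsorted(cumulative, q, side="left")
    return float(y_sorted[min(idx, len(y_sorted) - 1)])


# Toy data (hypothetical) with unequal weights.
y_vals = np.array([1.0, 2.0, 3.0, 4.0])
w_vals = np.array([1.0, 1.0, 2.0, 4.0])
cdf_at_2 = weighted_cdf(y_vals, w_vals, 2.0)       # (1 + 1) / 8
median_hat = weighted_quantile(y_vals, w_vals, 0.5)
```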
When $y_i$ has nonresponse, however, the previously discussed estimators cannot be used. Imputation is a popular technique to handle nonresponse. It fills in a value for every nonrespondent, such that an unbiased or approximately unbiased estimator can be obtained using the formula for the situation of no nonresponse with imputed values treated as observed values. That is, if $\tilde{y}_i$ is an imputed value for nonrespondent i, then our estimator of the population total Y is
$$\hat{Y}_I = \sum_{i \in \mathcal{R}} w_i y_i + \sum_{i \in \bar{\mathcal{R}}} w_i \tilde{y}_i, \qquad (2)$$
where $\mathcal{R}$ and $\bar{\mathcal{R}}$ are the sets of respondents and nonrespondents, respectively, in the sample $\mathcal{S}$.
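Under (A1), we consider NNI within each imputation class and independently across imputation classes. For a multivariate $x_i$, if $\beta_k$ in (A1) is known, we can apply a single-index NNI by defining the distance between $x_i$ and $x_j$ as $|\beta_k^\top x_i - \beta_k^\top x_j|$, to avoid the curse of dimensionality issue in multivariate NNI. As $\beta_k$ is generally unknown, we can first estimate $\beta_k$ by $\hat{\beta}_k$ using a nonparametric method such as the sliced inverse regression (SIR) proposed by Li and Duan (1991) or the sliced average variance estimation (SAVE) proposed by Cook and Weisberg (1991), and then apply single-index NNI using $|\hat{\beta}_k^\top x_i - \hat{\beta}_k^\top x_j|$ as the distance between $x_i$ and $x_j$, i.e., a nonrespondent j in imputation class k is imputed by $\tilde{y}_j = y_l$ with l being a respondent in imputation class k satisfying
$$|\hat{\beta}_k^\top x_l - \hat{\beta}_k^\top x_j| = \min_i |\hat{\beta}_k^\top x_i - \hat{\beta}_k^\top x_j|, \qquad (3)$$
where the minimum is over all respondents i in imputation class k. After imputation, the population total Y is estimated by the estimator in (2) with $\tilde{y}_j$ defined by (3). The population cumulative distribution of $y_i$ at any t is estimated by
$$\hat{F}(t) = \sum_{i \in \mathcal{S}} w_i I(\tilde{y}_i \le t) \Big/ \sum_{i \in \mathcal{S}} w_i,$$
with $\tilde{y}_i = y_i$ for respondents, regardless of whether N is known or unknown (to ensure that the estimate is 1 when $t = \infty$).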
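The imputation rule in (3) and the resulting total estimator in (2) can be sketched as follows for a single imputation class. The sketch takes the direction vector as given (in practice it would be an estimate obtained beforehand); the function names `single_index_nni` and `imputed_total`, the tie-breaking rule (first nearest respondent), and the toy data are all hypothetical.

```python
import numpy as np


def single_index_nni(x, y, observed, direction):
    """Single-index NNI within one imputation class.

    Each nonrespondent j receives the y-value of the respondent l whose
    projected covariate direction @ x_l is closest to direction @ x_j.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = np.asarray(observed, dtype=bool)

    index = x @ np.asarray(direction, dtype=float)   # one-dimensional index
    donors = np.flatnonzero(observed)
    y_imputed = y.copy()
    for j in np.flatnonzero(~observed):
        nearest = donors[np.argmin(np.abs(index[donors] - index[j]))]
        y_imputed[j] = y[nearest]
    return y_imputed


def imputed_total(y_imputed, w):
    """Weighted total with imputed values treated as observed values."""
    return float(np.sum(np.asarray(w, dtype=float) * y_imputed))


# Toy data (hypothetical): the index values are 0, 2, 2 and 0.4.
x_toy = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [0.0, 0.4]])
y_toy = np.array([1.0, 5.0, np.nan, np.nan])
observed_toy = np.array([True, True, False, False])
direction_toy = np.array([1.0, 1.0])

y_filled = single_index_nni(x_toy, y_toy, observed_toy, direction_toy)
total_imputed = imputed_total(y_filled, np.full(4, 2.0))
```

With several imputation classes, the same function would be applied separately within each class, each with its own direction vector. Because only distances between index values matter, rescaling the direction or flipping its sign leaves the imputed values unchanged.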
To consider asymptotic properties of estimators based on single-index NNI, we assume that the finite population $\mathcal{P}$ is a member of a sequence of finite populations indexed by ν. All limiting processes in this paper are understood to be as $\nu \to \infty$. We need the following assumptions in addition to (A1).
(A2) The size of each $\mathcal{P}_k$ and the sample size within each $\mathcal{P}_k$ increase to infinity as $\nu \to \infty$, while the number of sub-populations, K, is fixed.
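(A3) There is a fixed constant c>0 (not depending on ν) such that $\max_{i \in \mathcal{P}} n w_i / N \le c$ and $n \, \mathrm{Var}_s(\hat{Y}/N) \le c$, where $\mathrm{Var}_s$ is the variance with respect to sampling.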
Recall that N is the size of $\mathcal{P}$ and n is the sample size. The first condition in (A3) ensures that none of the weights $w_i$'s is disproportionately large (see Krewski & Rao, 1981). The second condition in (A3) means that the sampling variance of $\hat{Y}/N$ is at most of the order $n^{-1}$. These conditions are typically satisfied, e.g., they are satisfied under stratified simple random sampling designs.
Let $\delta_i$ be the response indicator, i.e., $\delta_i = 1$ if $y_i$ is observed and $\delta_i = 0$ if unit i is a nonrespondent.
(A4) Within each $\mathcal{P}_k$, $(y_i, x_i, \delta_i)$'s are i.i.d. from a superpopulation with , $(y_i, x_i, \delta_i)$'s from different imputation classes are independent, and sampling is independent of the superpopulation.
(A5) Within each $\mathcal{P}_k$, under the superpopulation, $P(\delta_i = 1 \mid y_i, x_i) = P(\delta_i = 1 \mid x_i)$, which is continuous in $x_i$.
(A6) Within each $\mathcal{P}_k$, the conditional distribution of given has a bounded and continuous Lebesgue density and the unspecified function in (A1) is a differentiable function.
(A7) Within each $\mathcal{P}_k$, is differentiable with respect to γ, where P is with respect to x under superpopulation, , , and .
(A8) For each k, , where φ is a function satisfying and , and denotes a term converging to 0 in probability.
Because of (A4), NNI is carried out within each $\mathcal{P}_k$. (A5) assumes that, within an imputation class, the nonresponse mechanism is covariate-dependent (Little, 1995) or unconfounded (Lee, Rancourt, & Särndal, 1994), an assumption made for the validity of many other popular imputation methods. This actually is the main reason to construct imputation classes, in addition to the exchangeability of $(y_i, x_i)$'s. Although the units within an imputation class are i.i.d., the nonresponse mechanism is still not completely at random, since the response indicator depends on the covariate $x_i$. Finally, (A8) is satisfied if the estimator of the index parameter in (A1) is obtained using SIR (Li & Duan, 1991) or SAVE (Cook & Weisberg, 1991).
The following is our main theoretical result.
Theorem
Assume (A1)–(A8). Let $\hat{Y}_I$ be defined by (2) with $\tilde{y}_i$ imputed based on single-index NNI. Then
$$\sqrt{n}\,(\hat{Y}_I - Y)/N \to_d \mathcal{N}(0, \sigma^2)$$
for some $\sigma^2 > 0$, where $\to_d$ is convergence in distribution unconditionally with respect to the superpopulation and sampling.
Similar results can be obtained for the estimated cumulative distribution at any t and for the corresponding quantile estimators.
Proof of Theorem.
The proof follows the same argument as in Shao and Wang (2008). Since variables are independent across imputation classes and imputation is carried out within each imputation class, it suffices to show the result within each imputation class or, equivalently, the result when K = 1. We now drop the subscript k in this proof. Let , and be defined as before with subscript k dropped. Then where is the probability that is selected as the nearest neighbour of a nonrespondent and is defined in (A7) with subscript k dropped. Define , , , , , , , , , , , and . Also, for l = 2, 3, 4, define to be with replaced by β. Then For each , it is shown in Shao and Wang (2008) that each is an approximately linear function of random variables converging in distribution to a normal distribution with mean 0. Under (A6)–(A8) and Taylor expansions, we can show that each , l = 2, 3, 4, can be approximated by a linear function of random variables converging in distribution to a normal distribution with mean 0. Hence, the result follows by repeatedly applying Lemma 1 in Schenker and Welsh (1988).
3. Simulation results
A simulation study is performed to examine the finite sample performance of $\hat{Y}_I/N$ with $\hat{Y}_I$ defined in (2) and . With a sample of size n = 200 or 500, data are i.i.d. generated as follows. First, a three-dimensional covariate vector $x_i$ is generated from the multivariate normal distribution with mean vector and covariance matrix Conditioned on $x_i$, $y_i$ is generated according to a linear model: , or a nonlinear model: , where and is generated from one of the following three distributions:
normal distribution ,
mixture normal distribution ,
heteroscedastic normal distribution , where is the first component of .
Conditioned on , the response indicator is generated from the Bernoulli distribution with probability where the coefficients in are chosen so that the unconditional rates of missing data are between and . For each i, is observed and is observed if and only if .
For simplicity, we consider K = 1 in (A1) and N = n. Then, $\hat{Y}_I/N$ is considered as an estimator of the super-population mean $\mu = E(y_i)$, which is under linear model and under nonlinear model. To apply single-index NNI in (2), SAVE (Cook & Weisberg, 1991) is used to obtain the estimator $\hat{\beta}$.
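For readers unfamiliar with SAVE, the following Python sketch shows a generic textbook form of the method for extracting a single direction. It is not the implementation used in this study: the number of slices, the use of equal-count slices, and the synthetic data in the example are all assumptions of the sketch, and `save_direction` is a hypothetical name. The sign of the returned vector is arbitrary, which does not matter for nearest-neighbour matching on the absolute difference of index values.

```python
import numpy as np


def save_direction(x, y, n_slices=5):
    """Estimate one index direction by sliced average variance estimation (SAVE).

    x : (n, p) covariate matrix, y : (n,) response. Returns a unit-length vector.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = x.shape

    # Standardise the covariates: z = (x - mean) @ Sigma^{-1/2}.
    cov = np.cov(x, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    z = (x - x.mean(axis=0)) @ inv_sqrt

    # Slice the sample on the ordered response and accumulate
    # M = sum_h p_h (I - Cov(z | slice h))^2.
    m = np.zeros((p, p))
    for idx in np.array_split(np.argsort(y), n_slices):
        d = np.eye(p) - np.cov(z[idx], rowvar=False)
        m += (len(idx) / n) * d @ d

    # The leading eigenvector of M, mapped back to the original x-scale.
    _, vecs = np.linalg.eigh(m)          # eigenvalues in ascending order
    beta = inv_sqrt @ vecs[:, -1]
    return beta / np.linalg.norm(beta)


# Synthetic check (hypothetical settings): y depends on x only through one index.
rng = np.random.default_rng(0)
x_sim = rng.standard_normal((2000, 3))
beta_true = np.array([1.0, 1.0, 1.0]) / np.sqrt(3.0)
y_sim = (x_sim @ beta_true) ** 2 + 0.1 * rng.standard_normal(2000)
beta_hat = save_direction(x_sim, y_sim)
alignment = abs(float(beta_hat @ beta_true))   # |cosine| between estimate and truth
```

An alignment close to 1 indicates that the estimated direction is nearly parallel to the true one. With nonresponse, the direction would have to be estimated from units with observed responses; how this is done is not specified here and is left open in the sketch.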
To evaluate the performance, we add two oracle estimators, in addition to the proposed estimator. The first oracle estimator is the sample mean without nonresponse, assuming we observe all $y_i$'s. The second oracle estimator is the same as the proposed estimator except that the true β, instead of its estimate, is used in finding the nearest neighbour.
Table 1 provides the simulation bias and standard deviation (SD) of the three estimators based on 1000 runs. It can be seen from Table 1 that all biases are negligible. In terms of the SD, the estimator based on single-index NNI is just slightly worse than the oracle estimator using the true β instead of its estimate.
Table 1. Simulation bias and standard deviation (SD) in estimating μ (1000 runs).
The empirical results are consistent with our theoretical findings.
Acknowledgements
We would like to thank two referees for their comments and suggestions.
Disclosure statement
No potential conflict of interest was reported by the authors.
Notes on contributors
Jun Shao
Dr Jun Shao holds a PhD in statistics from the University of Wisconsin-Madison. He is a professor of statistics at the University of Wisconsin-Madison and East China Normal University. His research interests include variable selection and inference with high dimensional data, sample surveys, and missing data problems.
Lei Wang
Dr Lei Wang holds a PhD in statistics from East China Normal University. He is an assistant professor of statistics at Nankai University. His research interests include empirical likelihood and missing data problems.
References
- Chen, J., & Shao, J. (2000). Nearest neighbor imputation for survey data. Journal of Official Statistics, 16, 113–132.
- Chen, J., & Shao, J. (2001). Jackknife variance estimation for nearest-neighbor imputation. Journal of the American Statistical Association, 96, 260–269. doi: 10.1198/016214501750332839
- Cook, R. D., & Weisberg, S. (1991). Discussion of ‘Sliced inverse regression for dimension reduction’. Journal of the American Statistical Association, 86, 28–33.
- Farber, J. E., & Griffin, R. (1998). A comparison of alternative estimation methodologies for Census 2000. Proceedings of the section on survey research methods (pp. 629–634). American Statistical Association.
- Fay, R. E. (1999). Theory and application of nearest neighbor imputation in Census 2000. Proceedings of the section on survey research methods (pp. 112–121). American Statistical Association.
- Kalton, G., & Kasprzyk, D. (1986). The treatment of missing data. Survey Methodology, 12, 1–16.
- Kosorok, M. R. (1999). Two-sample quantile tests under general conditions. Biometrika, 86, 909–921. doi: 10.1093/biomet/86.4.909
- Krewski, D., & Rao, J. N. K. (1981). Inference from stratified samples: Properties of the linearization, jackknife and balanced repeated replication methods. The Annals of Statistics, 9, 1010–1019. doi: 10.1214/aos/1176345580
- Lee, H., Rancourt, E., & Särndal, C. E. (1994). Experiments with variance estimation from survey data with imputed values. Journal of Official Statistics, 10, 231–243.
- Li, K. C., & Duan, N. (1991). Regression analysis under link violation. The Annals of Statistics, 17, 1009–1052. doi: 10.1214/aos/1176347254
- Little, R. J. (1995). Modeling the dropout mechanism in repeated-measures studies. Journal of the American Statistical Association, 90, 1112–1121. doi: 10.1080/01621459.1995.10476615
- Montaquila, J. M., & Ponikowski, C. H. (1993). Comparison of methods for imputing missing responses in an establishment survey. Proceedings of the section on survey research methods (pp. 446–451). American Statistical Association.
- Rancourt, E. (1999). Estimation with nearest neighbor imputation at Statistics Canada. Proceedings of the section on survey research methods (pp. 131–138). American Statistical Association.
- Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.
- Schenker, N., & Welsh, A. H. (1988). Asymptotic results for multiple imputation. The Annals of Statistics, 16, 1550–1566. doi: 10.1214/aos/1176351053
- Sedransk, J. (1985). The objective and practice of imputation. Proceedings of the first annual research conference (pp. 445–452). Washington, DC: Bureau of the Census.
- Shao, J., & Wang, H. (2008). Confidence intervals based on survey data with nearest neighbor imputation. Statistica Sinica, 18, 281–297.
- Valliant, R. (1993). Poststratification and conditional variance estimation. Journal of the American Statistical Association, 88, 89–96.