ABSTRACT
We examine the conditions under which descriptive inference can be based directly on the observed distribution in a non-probability sample, under both the super-population and quasi-randomisation modelling approaches. Review of existing estimation methods reveals that the traditional formulation of these conditions may be inadequate due to potential issues of under-coverage or heterogeneous mean beyond the assumed model. We formulate unifying conditions that are applicable to both types of modelling approaches. The difficulties of empirically validating the required conditions are discussed, as well as valid inference approaches using supplementary probability sampling. The key message is that probability sampling may still be necessary in some situations, in order to ensure the validity of descriptive inference, but it can be much less resource-demanding given the presence of a big non-probability sample.
1. Introduction
There is a resurgence of interest in the use of non-probability samples. See, for example, Baker et al. (Citation2013) and Elliott and Valliant (Citation2017) for two recent reviews. Such data may arise in situations where probability sampling is either infeasible or too costly. The observations may be obtained from so-called big-data sources, such as payment transaction data via a specific platform, or cellphone call data from a major service provider. These big-data non-probability samples can be much larger in size than the more familiar non-probability samples collected from web panel surveys, quota sampling, etc.
Following Rubin (Citation1976) and Little (Citation1982), Smith (Citation1983) considers the so-called super-population (SP) approach to inference from non-probability samples. Under this approach, a prediction model is constructed for the outcome variable of interest, often conditional on some chosen covariates. In particular, Smith (Citation1983) observes an important distinction between analytic and descriptive inference. In analytic inference, the targets are model parameters of a theoretical nature; such parameters can never be observed directly no matter how large the sample is. The targets of descriptive inference, by contrast, are statistics of a given finite population, such that in principle they can be directly observed given a perfect census of the population.
Moreover, Smith (Citation1983) focuses on validity conditions, under which the non-probability sample observation mechanism can be ignored, in the sense that inference can be based on the observed distributions directly, such as the conditional distribution of the outcome variable given the covariates in the sample. The two key validity conditions under the SP approach can be roughly stated as follows: (i) the prediction model is correctly specified for the population units, (ii) the non-probability sample selection mechanism is non-informative, in the sense that the relevant distribution under the population model can be observed in the non-probability sample directly. Similar validity conditions for the SP approach apply in other situations, such as purposive sampling (Royall, Citation1970), missing data problems (Rubin, Citation1976).
In this paper, we concentrate on descriptive inference methods that depend on validity conditions in the sense of Smith (Citation1983). Of course, inference is also possible without such validity conditions. For instance, not-missing-at-random models (Rubin, Citation1976) can be used to deal with informative missing data, where the unobserved full-sample outcome distribution is not the same as that among the respondent subsample. Or, the sample likelihood of Pfeffermann, Krieger, and Rinott (Citation1998) can be applied to survey data under informative sampling, where the distribution that holds in the population cannot be directly observed in the sample. See also Pfeffermann (Citation2017) for several other situations where this approach may be relevant. We do not consider such approaches here, as they require explicitly modelling the informative observation mechanism of sample selection or measurement.
As reviewed by Elliott and Valliant (Citation2017), there exists another quasi-randomisation (QR) approach to non-probability samples. Under the QR approach, one hypothesises a randomisation model of the non-probability sample inclusion indicator, but treats the outcomes of interest as unknown constants in the population. Though it is clearly inspired by the randomisation approach based on probability sampling, the QR approach is also a model-based approach, based on a model of the sample inclusion indicator instead of a prediction model of the outcome variable under the SP approach. A key motivation is that the correct inclusion probability can be used for any outcome of interest, just like when it is known under probability sampling, whereas the SP approach by nature must be specified differently for different outcome variables. In the context of survey sampling, the QR approach was introduced to deal with nonresponse, where response to survey is modelled as the second phase of selection, in addition to the first phase of sample selection according to a probability sampling design (Oh & Scheuren, Citation1983).
According to Elliott and Valliant (Citation2017), two key validity conditions are required for the QR approach. (I) The non-probability sample does have a probability sampling mechanism, even though it is unknown. In particular, one assumes that this hypothesised sample inclusion probability is strictly positive for all the population units, so that the only difference to probability sampling is that the inclusion probability is unknown. (II) There exists a set of covariates that 'fully govern the sampling mechanism'. In other words, the sample inclusion probability is a function of these covariates.
Thus there are two model-based approaches to inference from non-probability sample. Under the SP approach, one models the outcome variable conditional on the realised sample inclusion indicators, whereas under the QR approach, one models the sample inclusion indicators, but treats the outcomes as unknown constants. Although one may envisage the outcomes as the realised values of random variables, a fully specified model of the outcome variable will not be required under the QR approach, given suitable validity conditions. Similarly, although one acknowledges that the sample selection mechanism may be critical to the SP approach, a fully specified model of the inclusion indicator will not be required under the SP approach, given suitable validity conditions.
It is possible to construct estimators that combine both the models of outcome and sample inclusion indicator, in a manner such that the estimator is consistent as long as one of the two models holds. In recent years, it has become common to refer to this estimation approach as 'doubly robust' (Kang & Schafer, Citation2007; Kim & Haziza, Citation2014; Robins, Rotnitzky, & Zhao, Citation1994). Notice that the traditional generalised regression estimator in survey sampling is doubly robust in the same sense, except that there the randomisation mechanism is actually known. Nevertheless, it is a fact that in the debate between model-based and design-based inference from probability sampling, either side questioned the 'robustness' of the other.
The rest of the paper is organised as follows. In Section 2, we review the estimation methods from non-probability samples that do require validity conditions. Although these have been roughly stated above, a closer examination under both modelling perspectives reveals nuances across the different estimators. Moreover, we shall highlight the potential challenges of under-coverage and heterogeneous means beyond the assumed model. The traditional formulation of validity conditions is inadequate in both regards. We outline a set of unified validity conditions in Section 3, which are formulated non-parametrically and encompass both the modelling approaches. Post-stratification and calibration estimators are considered in light of these conditions. However, as will be discussed, a key difficulty in practice is that the validity conditions may be impossible to verify empirically based only on the data used for the estimation. Finally, we briefly outline in Section 4 some valid approaches given supplementary probability sampling of the outcome of interest, followed by a brief summary in Section 5.
The key message is that probability sampling may still be necessary in some situations, in order to ensure the validity of descriptive inference, but it can be much less resource-demanding given the presence of a big non-probability sample. In fact, the bigger the non-probability sample, the better it is.
2. Review of existing approaches
Denote by U the population of known size N. Let each population unit be associated with an outcome of interest, denoted by $y_i$, for $i \in U$. Denote by B the observed non-probability sample of size $n_B$. A common assumption to all the estimators we discuss below is that B does not contain any out-of-scope units, such that $B \subset U$, and there are no duplicated units in B. Let $\delta_i = 1$ if $i \in B$, and 0 if $i \notin B$. Let $y_i$ be observed for all the units in B, and let $\delta = \{\delta_i : i \in U\}$. To fix the idea, let $Y = \sum_{i \in U} y_i$ be the population total that is the target of descriptive inference. Let $x_i$ denote the associated covariate vector in cases where any relevant covariates are available in the sample B. Let $X = \sum_{i \in U} x_i$ be the population totals and let $\bar{X} = X/N$. Given $\{x_i : i \in B\}$, one can have two situations depending on whether the totals X are known or not. In the case they are unknown, it may still be possible that there exists a second probability sample S, where $S \subset U$, in which $x_i$ is observed, so that X can be estimated based on the sample S.
2.1. B-sample expansion estimator
Consider first the most basic situation where only $y_i$, $i \in B$, is observed, and no relevant covariates are available at all. Let $\bar{y}_B = \sum_{i \in B} y_i / n_B$ be the B-sample mean. The B-sample expansion estimator of Y is given by

$$\hat{Y}_B = N \bar{y}_B . \qquad (1)$$

Under the SP approach, let $\mu_i = E(y_i \mid \delta)$ be the conditional expectation of $y_i$ given $\delta$, for any $i \in U$, where both $y_i$ and $\delta_i$ are treated as random variables. Provided the conditional expectation is the same as the unconditional expectation given either $\delta_i = 1$ or $\delta_i = 0$, for any $i \in U$, denoted by

$$E(y_i \mid \delta) = E(y_i) = \mu , \qquad (2)$$

we have $E(\hat{Y}_B - Y \mid \delta) = 0$, such that the B-sample expansion estimator is prediction unbiased for Y. We shall refer to (2) as the SP assumption, which is a validity condition for the B-sample expansion estimator under the SP approach.

Under the QR approach, where $y_i$ is treated as a fixed constant, let $p = \Pr(\delta_i = 1; y_i)$ be the inclusion probability of any population unit that is associated with the value $y_i$. The notation ';' is used here instead of '$\mid$' because, strictly speaking, $\Pr(\delta_i = 1; y_i)$ is not a conditional probability now that $y_i$ is not conceived as the realised value of a random variable under the QR approach. Now, provided the inclusion probability is the same for any $y_i$,

$$\Pr(\delta_i = 1; y_i) \equiv p , \qquad (3)$$

we have that $\hat{Y}_p = \sum_{i \in B} y_i / p$ is unbiased for Y, since

$$E(\hat{Y}_p) = \frac{1}{p} \sum_{i \in U} E(\delta_i)\, y_i = \sum_{i \in U} y_i = Y .$$

In reality, p is unknown. It is natural to estimate it by $\hat{p} = n_B/N$ under (3), which yields (1) as the resulting plug-in estimator. It follows that the QR assumption (3) is the key validity condition, which ensures that the B-sample expansion estimator is consistent for Y, as $N \to \infty$ and $n_B/N \to p > 0$ asymptotically.

In summary, the B-sample expansion estimator (1) can be motivated under both the SP and QR approaches, given validity conditions (2) and (3), respectively.
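For concreteness, the following simulation sketch (not part of the original development; the population, sample sizes and probabilities are all invented for illustration) contrasts the B-sample expansion estimator under non-informative selection, where the QR assumption (3) holds, with outcome-dependent selection, where it fails:

```python
import numpy as np

rng = np.random.default_rng(7)
N = 100_000
y = rng.normal(50, 10, size=N)   # outcomes, treated as fixed constants (QR view)
Y = y.sum()                      # target population total

# Non-informative selection: constant inclusion probability p, as in (3)
p = 0.05
delta = rng.random(N) < p
Y_hat = N * y[delta].mean()      # B-sample expansion estimator (1)

# Informative selection: inclusion probability increases with y, violating (3)
p_inf = np.clip(0.05 + 0.002 * (y - y.mean()), 0.001, 0.999)
delta_inf = rng.random(N) < p_inf
Y_hat_inf = N * y[delta_inf].mean()

print(abs(Y_hat / Y - 1), abs(Y_hat_inf / Y - 1))
```

The first relative error is of the order of the sampling noise, whereas the second retains a systematic bias however large N becomes, which is the point of condition (3).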
2.2. B-sample calibration estimator
Suppose relevant covariates $x_i$ are available in the sample B. The population totals X may be either known or unknown. In the latter case, suppose they can be estimated from a second probability sample S. The B-sample calibration estimator of Y is given by

$$\hat{Y}_w = \sum_{i \in B} w_i y_i , \qquad \text{where } \sum_{i \in B} w_i x_i = X \text{ or } \hat{X} , \qquad (4)$$

where $\hat{X}$ is some consistent S-sample estimator, as the S-sample size increases, and the weights $w_i$ are calibrated in a way depending on the availability of X.

To actually compute the estimator (4), one needs to choose a set of initial weights, denoted by $d_i$, and a distance function such as $\sum_{i \in B} (w_i - d_i)^2 / d_i$ between the initial and calibrated weights (Deville & Särndal, Citation1992). In the case of

$$d_i = 1/\pi_i , \qquad (5)$$

where $\pi_i = \Pr(\delta_i = 1)$ is the true B-sample inclusion probability, for $i \in B$, the calibration estimator is consistent, as $N \to \infty$ and $n_B \to \infty$, given mild regularity conditions in addition. However, insofar as one cannot manage to set the initial weights (5), the calibration estimator is unmotivated from the QR perspective.

Next, under the SP approach, suppose the SP assumption given by

$$E(y_i \mid x_i) = x_i^{\top} \beta , \qquad (6)$$

which relates the conditional expectation of $y_i$ linearly to the given $x_i$, and

$$E(y_i \mid x_i, \delta) = E(y_i \mid x_i) , \qquad (7)$$

by which the B-sample selection is non-informative given $x_i$. We have then $E(\hat{Y}_w - Y \mid \delta) = 0$ given known X, regardless of the initial weights $d_i$. Otherwise, this expectation would tend to 0, provided $\hat{X}$ is an asymptotically unbiased estimator of X, under some suitable asymptotic setting. It follows that the assumptions (6) and (7) are the key validity conditions for the B-sample calibration estimator under the SP approach.

The estimator (4) becomes the B-sample post-stratification estimator in the special case where $x_i$ is the post-stratum dummy index. For the QR approach, one can set $w_i$ to be the inverse post-stratum B-sample fraction, which is equivalent to introducing the QR assumption (3) in each post-stratum separately. This QR assumption provides then a validity condition for the B-sample post-stratification estimator under the QR approach. For the SP approach, the two assumptions (6) and (7) remain formally the same.
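As a hedged numerical sketch (the population and all parameter values below are invented), linear calibration weights minimising a distance of the above kind subject to the benchmark constraints have a closed form; with post-stratum dummies as covariates they reduce to the post-stratification weights $N_t/n_t$:

```python
import numpy as np

def linear_calibration_weights(d, x, X):
    """Calibrate initial weights d (length n) against benchmarks X, so that
    sum_i w_i x_i = X while minimising sum_i (w_i - d_i)^2 / d_i
    (linear calibration in the sense of Deville & Sarndal, 1992)."""
    T = (d[:, None] * x).T @ x           # sum_i d_i x_i x_i^T
    lam = np.linalg.solve(T, X - d @ x)  # Lagrange multipliers
    return d * (1.0 + x @ lam)

rng = np.random.default_rng(3)
N = 50_000
strata = rng.integers(0, 3, size=N)            # post-stratum labels 0, 1, 2
y = 20 + 5 * strata + rng.normal(0, 2, size=N)
prob = np.where(strata == 2, 0.08, 0.02)       # selective B-sample: stratum 2 over-represented
B = rng.random(N) < prob
x = np.eye(3)[strata]                          # post-stratum dummies
X = x.sum(axis=0)                              # known post-stratum sizes N_t
d = np.full(B.sum(), N / B.sum())              # equal initial weights
w = linear_calibration_weights(d, x[B], X)     # reduces to N_t / n_t within each stratum
Y_hat = w @ y[B]                               # calibration estimator (4)
print(abs(Y_hat / y.sum() - 1))
```

Here selection depends on the covariate only, so (6)-(7) hold with the dummy covariates and the calibration estimator is close to Y, even though the unweighted B-sample mean is badly biased.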
2.3. B-sample inverse propensity weighting
Suppose relevant covariates $x_i$ are available in the sample B. The B-sample inverse propensity weighting (IPW) estimator is constructed under the QR approach. Suppose

$$\Pr(\delta_i = 1; x_i) = \pi(x_i; \eta) > 0 , \qquad (8)$$

i.e., the B-sample inclusion probability is completely determined given $x_i$, in the strictly positive parametric form $\pi(x_i; \eta)$, which may as well be referred to as the QR assumption. Provided $x_i$ is known for all the units in the population, η can be estimated, say, by a population estimating equation

$$\sum_{i \in U} \{\delta_i - \pi(x_i; \eta)\}\, x_i = 0 ,$$

where $\pi_i = \pi(x_i; \eta)$. Otherwise, suppose $x_i$ is observed in a second probability sample S, one can use the pseudo population estimating equation

$$\sum_{i \in S} d_i \{\delta_i - \pi(x_i; \eta)\}\, x_i = 0$$

(Kim & Wang, Citation2018), where $d_i$ is the sampling weight, for $i \in S$, or some S-sampling design-consistent adjustment of it. This requires that one is able to observe $\delta_i$ for each unit i in S, in other words the two samples S and B can be matched, which is an important assumption in terms of application. To ensure that the estimand of η is the same in both of these two estimating equations, i.e., whether $i \in U$ or just $i \in S$, one needs to assume that S-sampling from U is non-informative for $\delta_i$, so that

$$\Pr(\delta_i = 1; x_i, i \in S) = \Pr(\delta_i = 1; x_i) . \qquad (9)$$

Notice that, given non-informativeness (9), we have $E(\delta_i; x_i, i \in S) = \pi(x_i; \eta)$ for all $i \in S$, such that one can also use the unweighted S-sample estimating equation, which is given by

$$\sum_{i \in S} \{\delta_i - \pi(x_i; \eta)\}\, x_i = 0 ,$$

instead of the pseudo population estimating equation. Having obtained the parameter estimate $\hat{\eta}$, one obtains $\hat{\pi}_i = \pi(x_i; \hat{\eta})$ and the B-sample IPW estimator

$$\hat{Y}_{IPW} = \sum_{i \in B} y_i / \hat{\pi}_i , \qquad (10)$$

which is consistent for Y under mild regularity conditions, if $\hat{\eta}$ is consistent for η under some suitable asymptotic setting. It follows that the QR assumption (8) is its key validity condition, whereas the non-informativeness assumption (9) is needed in addition when $x_i$ is only available in a probability sample S instead of the population.
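The mechanics can be sketched as follows for the census case where $x_i$ is known throughout the population (a hedged illustration; the logistic propensity, its parameter values and the score-equation form are choices made for this example, not prescriptions from the paper):

```python
import numpy as np

rng = np.random.default_rng(11)
N = 100_000
x = rng.normal(0, 1, size=N)
D = np.column_stack([np.ones(N), x])           # design matrix (intercept, x)
eta_true = np.array([-3.0, 0.8])
pi = 1 / (1 + np.exp(-D @ eta_true))           # QR assumption (8): logistic propensity
delta = rng.random(N) < pi                     # B-sample inclusion indicators
y = 10 + 4 * x + rng.normal(0, 1, size=N)      # outcome correlated with x

# Newton-Raphson for the population score equation
# sum_{i in U} {delta_i - pi(x_i; eta)} (1, x_i)^T = 0
eta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-D @ eta))
    score = D.T @ (delta - p)
    info = (D * (p * (1 - p))[:, None]).T @ D  # Fisher information
    eta = eta + np.linalg.solve(info, score)

pi_hat = 1 / (1 + np.exp(-D @ eta))
Y_ipw = np.sum(y[delta] / pi_hat[delta])       # IPW estimator (10)
print(abs(Y_ipw / y.sum() - 1))
```

Since (8) holds by construction, the estimated propensities recover η and the IPW estimator is close to Y despite the highly unequal inclusion probabilities.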
2.4. Another B-sample IPW estimator
Elliott and Valliant (Citation2017) discuss another IPW estimator of the form (10), where $\hat{\pi}_i$ is obtained with the help of a second so-called reference probability sample S, based on modelling

$$\Pr(z_i = 1; x_i, i \in B \cup S) , \qquad (11)$$

where $z_i = 1$ if $i \in B$ and 0 if $i \in S$, and to fix the idea one may suppose $B \cap S = \emptyset$. First, the QR assumption (8) is retained. The definition of $\hat{\pi}_i$ by (11) can then be motivated as follows:

$$\Pr(z_i = 1; x_i, i \in B \cup S) = \frac{\pi(x_i; \eta)}{\pi(x_i; \eta) + \pi_S(x_i)} ,$$

provided the S-sample inclusion probability $\pi_S(x_i)$ is also fully determined by $x_i$ in the sense of (8). Thus the validity condition for the IPW estimator (10) based on (11) is that the QR assumption (8) holds for both the samples, given the same $x_i$.
We make two observations. First, despite the superficial resemblance to the propensity scoring method of Rosenbaum and Rubin (Citation1983), the above argument for $\hat{\pi}_i$ is not the same. As Rosenbaum and Rubin (Citation1983) state clearly before their first enumerated equation, 'in this paper, the N units in the study are viewed as a simple random sample from some population', where N is the size of the combined sample of treatment and non-treatment. The analogy to this combined sample is $B \cup S$ here. However, it is generally untenable that $B \cup S$ can be treated as a simple random sample from the population. Second, for any given probability sample S, it is possible to identify the variables that determine the designed inclusion probability, denoted by $x_i^S$, for $i \in U$. There arises thus a question, 'what if $x_i^S$ differs considerably from $x_i$?' Moreover, one may have more than one probability sample in which $x_i$ is observed. There arises then a question, 'which reference sample should one use?'
2.5. Sample matching estimator
Rivers (Citation2007) applies the SP approach in situations where a second probability sample S is available. Yang and Kim (Citation2018) study mass imputation methods, which include the matching estimator of Rivers (Citation2007) as a special case. The sample matching (SM) estimator is given by

$$\hat{Y}_{SM} = \sum_{i \in S} d_i \hat{y}_i , \qquad (12)$$

where $\hat{y}_i = y_{j(i)}$, for $j(i) = \arg\min_{j \in B} \|x_j - x_i\|$ based on a chosen metric $\|\cdot\|$. That is, $\hat{y}_i$ is the nearest-neighbour (NN) imputation value from the B-sample for $i \in S$.

To focus on the idea, assume for the moment exact matching is the case, where $x_{j(i)} = x_i$ for all $i \in S$ and $j(i) \in B$. We have then $E(\hat{y}_i \mid x_i, \delta) = E(y_{j(i)} \mid x_{j(i)}, \delta)$, which is the same as $E(y_i \mid x_i, \delta)$ as if the unit i were in B. Given the non-informativeness assumption (7) for the B-sample, which Yang and Kim (Citation2018) call the 'ignorability' assumption, we have

$$E(\hat{Y}_{SM} - Y) = 0 .$$

With respect to both the population model and the design of S, the SM estimator (12) is prediction unbiased for Y. Notice that in the case of S = U, the SM estimator is just an NN-imputation method. Whether S = U or not, the NN-imputed SM estimator is likely to be less efficient than a prediction-imputed SM estimator

$$\sum_{i \in S} d_i\, x_i^{\top} \hat{\beta} ,$$

whenever a correct parametric specification of the conditional mean (via β) is possible. The simulation results of Yang and Kim (Citation2018) show that NN-imputation is less efficient than imputation based on semi-parametric generalised additive models.
Now, it is not difficult to see that the consistency of the SM estimator (12) can be established, given asymptotic exact matching instead, i.e.,

$$\|x_{j(i)} - x_i\| \overset{P}{\longrightarrow} 0 , \qquad (13)$$

for any $i \in S$, as $N \to \infty$ and $n_B \to \infty$. Yang and Kim (Citation2018) make the assumption of 'common support' to the same effect. To ensure that $E(y_i \mid x_i)$ does not change abruptly as the value $x_i$ varies, Yang and Kim (Citation2018) assume that $E(y_i \mid x_i)$ is continuously differentiable. Or, one may adopt the SP assumption below:

$$|E(y_i \mid x_i) - E(y_j \mid x_j)| \leq C \|x_i - x_j\| \qquad (14)$$

(Chen & Shao, Citation2000, Theorem 1). It follows that the assumptions (7), (13) and (14) are the key validity conditions for the consistency of the SM estimator (12).
We make two observations. First, an attractive feature of the NN-imputation is that the imputed sample S looks more realistic and natural than, say, by the regression prediction imputation. However, unless the S-sampling is non-informative, the NN-imputed S-sample will not resemble the true S-sample that could have been observed, since

$$E(\hat{y}_i \mid x_i, i \in S) = E(y_i \mid x_i) \neq E(y_i \mid x_i, i \in S) ,$$

where the inequality is the case unless S-sampling is non-informative in the sense of (7). Second, for any other covariate $z_i$, including when $z_i$ contains the S-sample design variables, we have

$$E(\hat{y}_i \mid z_i, x_i, i \in S) \neq E(y_i \mid z_i, x_i, i \in S) ,$$

unless $y_i$ and $z_i$ are conditionally independent of each other given $x_i$. This is a general problem for statistical matching of variables associated with distinct units, i.e., $z_i$ associated with $x_i$ for some $i \in S$, and $\hat{y}_i = y_{j(i)}$ associated with the same value $x_i$ but for a different unit. The following example illustrates both remarks above.
Example
Let $y_i$ be independent of $x_i$, for any $i \in U$. Then, the SP assumption (14) holds trivially, as long as the marginal expectation $E(y_i) = \mu$ exists. Next, suppose B is a simple random sample, so that the non-informative assumption (7) holds, and $E(\hat{y}_i) = \mu$ regardless of the exact matching assumption. Suppose stratified simple random S-sampling with two strata of different sampling fractions, so that the S-sample inclusion probability is not a constant. Then, the S-sampling is informative (given $x_i$) as long as the population stratum means are different, since

$$E\Big( \frac{1}{n_S} \sum_{i \in S} \hat{y}_i \Big) = \mu \neq E(\bar{y}_S) ,$$

where $\bar{y}_S$ is the true S-sample mean that is unknown, since $y_i$ is not observed in S. It follows that the NN-imputed S-sample $\{\hat{y}_i : i \in S\}$ would look like a sample generated by simple random sampling, rather than the actual stratified sampling. Moreover, the SM-estimators of the stratum means, corresponding to the strata $h = 1, 2$, respectively, will be biased for the population stratum means.
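A minimal sketch of NN-imputation sample matching in one covariate dimension may help fix ideas (everything below is invented for illustration; B-selection is non-informative given x and S is a simple random sample, so the validity conditions above hold and no bias of the kind in the example arises):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 30_000
x = rng.uniform(0, 1, size=N)
y = 2 + np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=N)

B = rng.random(N) < 0.10                        # non-informative B-sample given x
S = rng.choice(N, size=500, replace=False)      # simple random S-sample
d = N / 500                                     # common S-sample design weight

# 1-d nearest-neighbour donor search via a sorted B-sample
xb, yb = x[B], y[B]
order = np.argsort(xb)
pos = np.clip(np.searchsorted(xb[order], x[S]), 1, len(xb) - 1)
left, right = order[pos - 1], order[pos]
take_left = np.abs(xb[left] - x[S]) <= np.abs(xb[right] - x[S])
y_hat = yb[np.where(take_left, left, right)]    # NN imputation values from B

Y_sm = d * y_hat.sum()                          # sample matching estimator (12)
print(abs(Y_sm / y.sum() - 1))
```

With a dense B-sample the matching discrepancies shrink, which is the asymptotic exact matching condition (13) at work.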
3. More generally on validity conditions
Non-informative selection in the form of (7) or (9) is a critical condition for all the methods in Section 2 which make use of an auxiliary variable $x_i$. Two kinds of possible violation of these assumptions are worth noting.
First, Kim and Rao (Citation2018) point out an important issue that has not received sufficient attention in these methods, namely B-sample under-coverage is the case if some population units have in fact zero chance of being included in it. Under the SP approach, extrapolation of the conditional distribution of $y_i$ given $x_i$ in the B-sample to these population units can only be based on subjective beliefs but not empirical evidence. The QR approach is equally affected, since randomisation inference would have been invalidated even if $\pi_i$ were known for all the B-sample units, let alone when it is unknown and needs to be estimated. To address the issue, Kim and Rao (Citation2018) consider a two-phase SM estimator. Let the S-sample be partitioned into $S_1$ and $S_2$, such that $S_1$ consists of the units covered by the B-sample selection mechanism and $S_2$ of the units outside its support. First, estimate this unobserved partition via the B-sample support: each S-sample unit that is unsupported in the B-sample ε-neighbourhood is assigned to $\hat{S}_2$. Let us suppose this partition estimator is consistent in the following sense:

$$\Pr(\hat{S}_1 = S_1, \hat{S}_2 = S_2) \to 1 ,$$

as $N \to \infty$ and $n_B \to \infty$. Next, the two-phase SM estimator is given as

$$\hat{Y}_{SM2} = \sum_{i \in \hat{S}_1} w_i \hat{y}_i , \qquad \text{where } \sum_{i \in \hat{S}_1} w_i x_i = \sum_{i \in S} d_i x_i .$$

In other words, the under-coverage is dealt with by the calibration of the weights $w_i$. This can be motivated, provided the conditional mean $E(y_i \mid x_i)$ can be linearly related to $x_i$, and the relationship is the same for the units with $\pi_i = 0$, i.e., the under-coverage is non-informative for the SP linear model.
Second, insofar as one requires either an assumption of SP (7) or QR (9), there is always the possibility of heterogeneous mean, beyond what is controlled by the chosen $x_i$. Let $U_x = \{i \in U : x_i = x\}$ be of the size $N_x$. Under the SP approach, which models the mean $\mu_i = E(y_i)$ of unit i by $\mu(x_i)$, heterogeneous mean is the case if $\mu_i \neq \mu(x_i)$, despite

$$\frac{1}{N_x} \sum_{i \in U_x} \mu_i = \mu(x) , \qquad (15)$$

and $\mu(x)$ is statistically correct in that the $\mu_i$'s average to $\mu(x)$ over all the units in $U_x$. Under the QR approach, heterogeneous mean is the case if $\pi_i \neq \pi(x_i)$, despite

$$\frac{1}{N_x} \sum_{i \in U_x} \pi_i = \pi(x) . \qquad (16)$$

Let us illustrate the concept of heterogeneous mean with a simple example.
Example
Let $y_i \in \{0, 1\}$, such that $\mu_i = \Pr(y_i = 1)$, for all $i \in U$. Let $U = U_1 \cup U_2$ be a partition. Let $U_1$ be of size $N_1$ and with mean $\mu_i = \mu_1$, for all $i \in U_1$; let $U_2$ be of size $N_2$ and with mean $\mu_i = \mu_2$, for all $i \in U_2$. Suppose $\mu_1 \neq \mu_2$. Let $\mu = (N_1 \mu_1 + N_2 \mu_2)/N$. Then, $\mu_i \neq \mu$ for any $i \in U$, but we still have $\sum_{i \in U} \mu_i / N = \mu$, satisfying (15).
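A numeric version of this example, with invented group means and inclusion probabilities, shows the asymmetry developed below: averaging correctly in the sense of (15)-(16) protects the SP prediction but not the IPW estimator when the inclusion probabilities co-vary with the group means:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000
half = N // 2
mu_i = np.r_[np.full(half, 0.2), np.full(half, 0.6)]  # two latent groups, no covariate
mu = mu_i.mean()                                      # 0.4: the mu_i's average to mu, as in (15)
y = (rng.random(N) < mu_i).astype(float)              # Bernoulli outcomes

# QR side: pi_i varies with the latent group but averages to pi, as in (16)
pi_i = np.r_[np.full(half, 0.02), np.full(half, 0.08)]
pi = pi_i.mean()                                      # 0.05
delta = rng.random(N) < pi_i
Y_ipw = y[delta].sum() / pi                           # IPW with the 'statistically correct' pi
print(Y_ipw / y.sum())   # clearly above 1: high-mean units are over-sampled
```

The prediction $N\mu$ remains unbiased for Y by construction, whereas the IPW estimate overshoots because $\mathrm{Cov}(\pi_i, \mu_i) > 0$ within the population.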
Heterogeneous mean affects the SP and QR approaches differently. Given (15), assuming $\mu_i \equiv \mu(x_i)$ for $i \in U_x$ is prediction unbiased, despite heterogeneous mean, since

$$E\Big( \sum_{i \in U_x} y_i - N_x \mu(x) \Big) = \sum_{i \in U_x} \mu_i - N_x \mu(x) = 0 .$$

Meanwhile, given (16), assuming $\pi_i \equiv \pi(x_i)$ for $i \in U_x$ yields

$$E\Big( \sum_{i \in B \cap U_x} \frac{y_i}{\pi(x)} \Big) = \frac{1}{\pi(x)} \sum_{i \in U_x} \pi_i y_i \neq \sum_{i \in U_x} y_i ,$$

in which case the IPW estimator under the QR approach may be biased, despite the model of $\pi_i$ being statistically correct in the sense of (16).
The discussion above suggests that the formulation of validity conditions in Section 2 is inadequate in the presence of under-coverage and mean heterogeneity. Below we first reformulate the validity conditions, which cover both the SP and QR approaches, despite the presence of under-coverage and mean heterogeneity. We elaborate and illustrate these conditions for the post-stratification and calibration estimators. Finally, we discuss the difficulties of verifying these validity conditions empirically.
3.1. Non-parametric asymptotic (NPA) non-informativeness
We start by noticing that the B-sample mean equals the population mean, denoted by $\bar{Y} = Y/N$, provided

$$\bar{y}_B - \bar{Y} = \frac{\mathrm{Cov}_N(\delta, y)}{E_N(\delta)} = 0 ,$$

where $E_N$ and $\mathrm{Cov}_N$ denote, respectively, expectation and covariance with respect to the empirical distribution function that places point mass 1/N on each population unit. This provides an empirical formulation of the non-informativeness of the B-sample observation mechanism with respect to the outcome of interest. Similar expressions have appeared in various discussions of the potential sample mean bias due to the observation mechanism, such as unequal probability sampling (Rao, Citation1966), survey nonresponse (Bethlehem, Citation1988) or big data (Meng, Citation2018). It motivates the following non-parametric asymptotic (NPA) non-informativeness assumption in the absence of any covariates:

$$\lim_{N \to \infty} \mathrm{Cov}_N(\delta, y) = 0 \quad \text{and} \quad \lim_{N \to \infty} E_N(\delta) > 0 . \qquad (17)$$

The NPA assumption (17) encompasses both the SP and QR approaches. For the SP approach, taking the conditional expectation of the $y_i$'s conditional on the $\delta_i$'s yields

$$E(\bar{y}_B - \bar{Y} \mid \delta) = \frac{\mathrm{Cov}_N(\delta, \mu)}{E_N(\delta)} \to 0 ,$$

given NPA non-informative B-selection, where the denominator $E_N(\delta) = n_B/N$ remains bounded away from zero given non-negligible B-selection in addition. Under this condition, the B-sample expansion estimator (1) is asymptotically prediction unbiased from the SP perspective. For the QR approach, taking the expectation of the $\delta_i$'s with the $y_i$'s being constants yields

$$E(\bar{y}_B - \bar{Y}) \approx \frac{\mathrm{Cov}_N(\pi, y)}{E_N(\pi)} \to 0 .$$

In particular, the NPA assumption (17) allows for $\pi_i = 0$ or $\pi_i = 1$ for some units, so that the B-sample expansion estimator (1) remains consistent from the QR perspective, even in the presence of under-coverage of the units with $\pi_i = 0$ or non-representative units with $\pi_i = 1$.
Example
Let $U = U_1 \cup U_2$ be a partition. Let $U_1$ be of size $N_1$, where $\pi_i = 0$ for $i \in U_1$; let $U_2$ be of size $N_2$, where $\pi_i = p > 0$ for $i \in U_2$. Despite under-coverage of $U_1$, the first NPA condition implies $\lim_{N \to \infty} (\bar{Y}_2 - \bar{Y}) = 0$, where $\bar{Y}_2$ is the mean over $U_2$, given the second condition $\lim_{N \to \infty} N_2/N > 0$.
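The finite-population identity behind (17) is exact for any realised sample, which the following sketch verifies on an invented population with deliberately informative selection:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 10_000
y = rng.gamma(2.0, 3.0, size=N)
# An informative selection mechanism: higher probability for large outcomes
delta = (rng.random(N) < np.where(y > y.mean(), 0.15, 0.10)).astype(float)

E_N = lambda v: v.mean()                       # expectation under point mass 1/N
Cov_N = lambda a, b: (a * b).mean() - a.mean() * b.mean()

lhs = y[delta == 1].mean() - y.mean()          # B-sample mean minus population mean
rhs = Cov_N(delta, y) / E_N(delta)             # the identity behind the NPA assumption (17)
print(lhs, rhs)                                # the two agree exactly
```

The gap between the B-sample mean and the population mean is thus entirely governed by $\mathrm{Cov}_N(\delta, y)$, which is what the NPA assumption requires to vanish asymptotically.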
3.2. Post-stratification estimator
Consider post-stratification by $x_i \in \{1, \ldots, T\}$, for $i \in U$. Provided the assumption (17) holds within each post-stratum, the B-sample post-stratification estimator is asymptotically unbiased from both the SP and QR perspectives. Below we consider the QR approach. The SP approach is a special case of the calibration estimator discussed in Section 3.3.

Consider first the hypothetical estimator with known $\pi_t$:

$$\hat{Y}_\pi = \sum_{t=1}^{T} \frac{1}{\pi_t} \sum_{i \in B_t} y_i , \qquad B_t = \{i \in B : x_i = t\} .$$

To fix the idea for variance estimation, suppose independent Bernoulli distribution of $\delta_i$ with probability $\pi_t$, where $x_i = t$. The variance of $\hat{Y}_\pi$ is then given by

$$V(\hat{Y}_\pi) = \sum_{t} \frac{1 - \pi_t}{\pi_t} \sum_{i \in U_t} y_i^2 = \sum_{t} \frac{1}{\pi_t} \sum_{i \in U_t} y_i^2 - \sum_{t} \sum_{i \in U_t} y_i^2 .$$

An unbiased estimator of the first term of the variance, denoted by $\hat{V}_1$, is given by

$$\hat{V}_1 = \sum_{t} \frac{1}{\pi_t^2} \sum_{i \in B_t} y_i^2 ,$$

where $U_t = \{i \in U : x_i = t\}$. An unbiased estimator of the second term, denoted by $\hat{V}_2$, is given by

$$\hat{V}_2 = \sum_{t} \frac{1}{\pi_t} \sum_{i \in B_t} y_i^2 , \qquad E(\hat{V}_2) = \sum_{t} \sum_{i \in U_t} y_i^2 ,$$

where the second equality follows given the additional QR assumption, i.e., $\pi_i = \pi_t$ for $i \in U_t$. Putting $\hat{V}_1$ and $\hat{V}_2$ together, we obtain

$$\hat{V}(\hat{Y}_\pi) = \hat{V}_1 - \hat{V}_2 = \sum_{t} \frac{1 - \pi_t}{\pi_t^2} \sum_{i \in B_t} y_i^2 .$$

Now, the post-stratification estimator, denoted by $\hat{Y}_{PS}$, is obtained from $\hat{Y}_\pi$ on replacing $\pi_t$ by $\hat{\pi}_t = n_t/N_t$, where $n_t$ is the observed size of $B_t$ and $N_t$ is the known post-stratum population size. Expanding $\hat{Y}_{PS}$ around $\pi_t$ (i.e., linearisation) would yield an asymptotically valid estimator of the unconditional variance of $\hat{Y}_{PS}$.
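Under the independent Bernoulli selection model assumed above, the behaviour of the known-$\pi_t$ estimator and its variance estimator can be checked by a quick Monte Carlo sketch (the strata, probabilities and outcome model are invented here):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 30_000
t = rng.integers(0, 3, size=N)                 # post-stratum labels
y = 5 + 3 * t + rng.normal(0, 1, size=N)
pi_t = np.array([0.02, 0.05, 0.10])            # known stratum inclusion probabilities
Y = y.sum()

reps = 1_000
est = np.empty(reps)
vhat = np.empty(reps)
for r in range(reps):
    delta = rng.random(N) < pi_t[t]            # independent Bernoulli selection
    p = pi_t[t[delta]]
    est[r] = np.sum(y[delta] / p)              # hypothetical estimator with known pi_t
    vhat[r] = np.sum((1 - p) / p**2 * y[delta]**2)

print(np.mean(est) / Y, np.mean(vhat) / np.var(est))
```

Both printed ratios should be close to 1, i.e., the estimator is unbiased for Y and the variance estimator is unbiased for its Monte Carlo variance, up to simulation error.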
3.3. Calibration estimator
The post-stratification estimator is infeasible in cases when the B-sample contains empty cells, or when the population sizes are not all known. Let $z_i = z(x_i)$ be a vector of many-to-one mappings of $x_i$, such that the population total $Z = \sum_{i \in U} z_i$ is known, and the sample total $\sum_{i \in B} z_i$ has only non-zero components.

As discussed for the calibration estimator in Section 2, generally one is not able to set the initial weight to be the inverse of the B-sample inclusion probability in practice. Suppose one simply starts with equal initial weights $d_i = N/n_B$ for all $i \in B$. The linear calibration estimator (Deville & Särndal, Citation1992) is given by

$$\hat{Y}_z = \sum_{i \in B} w_i y_i ,$$

where the weights $w_i$ minimise the distance to $d_i$ as measured by $\sum_{i \in B} (w_i - d_i)^2 / d_i$, subjected to the constraints $\sum_{i \in B} w_i z_i = Z$. It follows that $w_i = w_t$, for $x_i = t$, since the only thing that matters to the calibration constraints is the sum of the weights over the units sharing the same value $z_i$, now that $z_i$ is constant for $i \in B_t$ and, given whatever these sums, the term $\sum_{i \in B_t} (w_i - d_i)^2 / d_i$ is minimised at equal weights $w_i = w_t$ for $i \in B_t$.
As the first validity condition for $\hat{Y}_z$, suppose there exists a vector $\beta$, such that

$$\lim_{N \to \infty} \big( \bar{\mu}_t - z_t^{\top} \beta \big) = 0 , \qquad (18)$$

for each t-value, as $N \to \infty$, where $\bar{\mu}_t = \sum_{i \in U_t} \mu_i / N_t$, and $N_t$ is the population size of $U_t$. The condition (18) is analogous to the SP assumption (6), where the covariate $x_i$ is replaced by $z_i$ here. Moreover, it relaxes the model (6) of the conditional mean, allowing for potential heterogeneous mean similar to (15). Now that $w_i = w_t$, we have

$$E(\hat{Y}_z \mid \delta) = \sum_{t} w_t \sum_{i \in B_t} \mu_i .$$

Given (18), $E(\hat{Y}_z - Y \mid \delta)/N \to 0$ as $N \to \infty$. Moreover, we have $(\hat{Y}_z - Y)/N \to 0$ as $N \to \infty$, given

$$\lim_{N \to \infty} \mathrm{Cov}_{N_t}(\delta, y) = 0 \quad \text{and} \quad \lim_{N \to \infty} E_{N_t}(\delta) > 0 , \qquad (19)$$

for any given t, where $E_{N_t}$ and $\mathrm{Cov}_{N_t}$ are defined with respect to the empirical distribution over $U_t$, which is an adaption of the NPA non-informativeness assumption (17) to the present setting. It follows that the two assumptions (18) and (19) are the key validity conditions for the calibration estimator to be consistent for Y.
For variance estimation, suppose again independent Bernoulli distribution of $\delta_i$ with probability $\pi_t$, where $x_i = t$. An approximate variance estimator for the calibration estimator $\hat{Y}_z$ can then be given as

$$\hat{V}(\hat{Y}_z) = \sum_{t} \frac{1 - \hat{\pi}_t}{\hat{\pi}_t^2} \sum_{i \in B_t} e_i^2 ,$$

where $e_i = y_i - z_i^{\top} \hat{\beta}$ are the calibration residuals, and $\hat{\pi}_t$ is an estimate of the post-stratum inclusion probability.
3.4. Validation of non-informative B-sample selection
Of the validity conditions discussed above, the critical assumption is non-informative B-sample selection, which can be stated in various forms. For instance, given the non-informativeness assumption (17), an additional assumption like (18) can in principle be validated empirically. However, the non-informativeness condition may not hold exactly, and it is generally impossible to verify based only on the data used for the estimation. Below we discuss the issue in more detail.

Consider first the propensity model under the QR approach. Suppose $x_i$ is known for all $i \in U$, to avoid additional complications otherwise; the census score equation is

$$\sum_{i \in U} \{\delta_i - \pi(x_i; \eta)\}\, x_i = 0 ,$$

which is always satisfied by $\hat{\pi}(x) = n_x / N_x$, i.e., the saturated model, where $n_x$ and $N_x$ are the B-sample and population counts of the units with $x_i = x$. Insofar as a non-saturated model of $\pi(x_i; \eta)$ does not fit perfectly to the data, one can always attribute its cause to the non-saturated functional form of $\pi(\cdot\,; \eta)$, instead of rejecting the assumption that the set of covariates $x_i$ 'fully govern the sampling mechanism'. In this sense, the validity of the latter assumption cannot be refuted empirically.
Next, for the SP approach, where both $y_i$ and $\delta_i$ are treated as random, assume the B-sample inclusion probability $\pi_i$ depends on $x_i$, where $x_i$ is known for $i \in U$ to avoid extra complications. For goodness-of-fit checks, let $z_i$ be a known covariate, which is distinct from $x_i$. We have

$$E(\hat{Z}_x - Z_x \mid \delta) = \frac{N_x \mathrm{Cov}_{N_x}(\delta, z)}{E_{N_x}(\delta)} ,$$

where $Z_x = \sum_{i \in U_x} z_i$ and $\hat{Z}_x$ is its B-sample expansion estimate. The two observed checks are

$$\hat{N}_x = N_x \quad \text{and} \quad \hat{Z}_x = Z_x , \quad \text{for all } x .$$

Setting $\hat{\pi}(x) = n_x/N_x$, which fits the assumption $\pi_i = \pi(x_i)$, both the two checks are satisfied given $\hat{Z}_x = Z_x$, i.e., the B-sample expansion estimate of $Z_x$ is perfect for all x. This would suggest that the NPA assumption (17) holds for $z_i$ given $x_i$, and may be considered to support the plausibility of the NPA assumption (17) for $y_i$ given $x_i$, provided $z_i$ is known to be correlated with $y_i$, but not otherwise. However, in situations where such a covariate $z_i$ is available, it seems natural that it should be used in the estimation of Y to start with. The two checks amount then to the case of the combined covariate $(x_i, z_i)$ and are satisfied trivially by setting $\hat{\pi}(x, z) = n_{x,z}/N_{x,z}$. Thus one is faced with a dilemma, where building the best model for estimation would at the same time reduce the ability to verify it.
4. Using additional probability sample of outcomes
So far we have only considered situations where the outcome values of interest are observed only in the non-probability sample B. Obviously, the situation changes completely given, in addition, a probability sample of outcomes. Below we briefly discuss two different approaches to inference in the absence of any relevant covariates. The ideas remain the same in situations with additional covariates.
The first approach aims at consistent estimation combining the two samples, as discussed, for example, in Tam and Kim (Citation2018a, Citation2018b), where the probability sample is taken from the whole population and overlaps with the B-sample. These authors also discussed additional issues such as measurement errors or nonresponse. Here we discuss the situation where the probability sample is taken from the B-sample complement population. Given the non-probability sample observations {y_i : i ∈ B}, one may treat B as fixed, and select a second supplementary sample from the rest of the population C = U ∖ B, denoted by S. Given the S-sample observations of the outcome, denoted by {y_i : i ∈ S}, it is straightforward to obtain a test for H0 : ȳ_B = Ȳ_C vs. H1 : ȳ_B ≠ Ȳ_C, given as
t = (ȳ_B − Ŷ_C) / √V̂,
where Ŷ_C is an S-sample estimator of the population mean outside of the B-sample, i.e., of Ȳ_C = Σ_{i ∈ C} y_i / (N − N_B), and V̂ is the associated variance estimator. If H0 is not rejected, then there is the possibility of using ȳ_B as an estimate on its own, without regular concurrent surveys in future. This would achieve the greatest cost savings. To this end, one may consider S as a particular form of audit sampling, whose aim is to validate the big-data estimate ȳ_B and to provide a meaningful accuracy measure that can accommodate its potential bias. Zhang (Citation2019) develops an approach to audit sampling inference for big data statistics.
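The audit test can be sketched as follows, assuming for simplicity that S is a simple random sample without replacement from the complement of B, so that the S-sample mean and the usual SRS variance estimator with finite-population correction apply; all data and sample sizes are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical population and a non-probability B-sample (here generated
# without selection on y, so the null hypothesis should typically hold).
N = 50_000
y = rng.normal(10, 2, size=N)
in_B = rng.random(N) < 0.1             # B-sample membership, treated as fixed
y_B = y[in_B]
y_rest = y[~in_B]                      # complement of the B-sample

# Supplementary probability sample S: SRS without replacement from the complement.
n_S = 500
y_S = y_rest[rng.choice(y_rest.size, size=n_S, replace=False)]

# S-sample estimate of the mean outside B, with the SRS variance estimator
# (including the finite-population correction).
mean_out = y_S.mean()
fpc = 1 - n_S / y_rest.size
var_out = fpc * y_S.var(ddof=1) / n_S

# Test statistic for H0: the B-sample mean equals the mean outside B;
# since B is treated as fixed, only the S-sampling variance enters.
t = (y_B.mean() - mean_out) / np.sqrt(var_out)
print(f"t = {t:.2f}")                  # compare with a standard normal
```

Under selective B-sampling with respect to y, the statistic would tend to be large in absolute value, flagging that the B-sample mean cannot stand in for the population mean.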
Let W_B = N_B / N. A consistent estimator of the population mean Ȳ using both samples is given by
Ŷ = W_B ȳ_B + (1 − W_B) Ŷ_C, where Ŷ_C = (N − N_B)⁻¹ Σ_{i ∈ S} y_i / π_i,
π_i is the S-sample inclusion probability, and the validity of Ŷ now derives from probability sampling of S, regardless of how the B-sample is generated. The relative efficiency (RE) against the setting without the B-sample can be given by
RE = V(Ŷ*) / V(Ŷ),
where Ŷ* is based on a hypothetical probability sample S* from the whole population U, which has the same sample size and the same sampling design as S. One may refer to this as the split-population approach to inference, which is an age-old idea for combining survey sampling with administrative data. The efficiency gain would be substantial provided the B-sample is large. In fact, the larger the B-sample, the greater is the efficiency gain.
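A minimal sketch of the split-population estimator, again assuming simple random sampling from the complement of B (so the Horvitz-Thompson mean over the complement reduces to the S-sample mean); the population and sample sizes are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(11)

# Hypothetical population; B-sample fixed, S an SRS from the complement.
N = 50_000
y = rng.normal(10, 2, size=N)
in_B = rng.random(N) < 0.2
y_B, y_rest = y[in_B], y[~in_B]
N_B, N_rest = y_B.size, y_rest.size

n_S = 400
y_S = y_rest[rng.choice(N_rest, size=n_S, replace=False)]

# Split-population estimator: census mean over B, combined with an
# S-sample estimate of the mean over the complement. Under SRS every
# complement unit has inclusion probability n_S / N_rest, so the
# Horvitz-Thompson mean is just the S-sample mean.
W_B = N_B / N
est = W_B * y_B.mean() + (1 - W_B) * y_S.mean()
print(f"split-population estimate: {est:.3f}  (true mean {y.mean():.3f})")
```

Only the complement term carries sampling variance, and its weight (1 − W_B) shrinks as the B-sample grows, which is the source of the efficiency gain noted above.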
Under the second approach to inference, consider a composite estimator given by
ȳ_γ = γ ȳ_B + (1 − γ) Ŷ_C,
where γ is the composition weight, for W_B ≤ γ ≤ 1, and Ŷ_C is the S-sample estimator of the mean Ȳ_C outside the B-sample. Notice that when γ = W_B, the composite estimator is just the split-population estimator Ŷ above, which is consistent for Ȳ. As γ increases from W_B towards one, one risks introducing greater bias, insofar as ȳ_B ≠ Ȳ_C. However, the composite estimator may yield a smaller mean squared error (MSE) of estimation, provided the MSE is the adopted criterion. One is then essentially trading the increasing bias (γ − W_B)(ȳ_B − Ȳ_C) against the decreasing standard error (1 − γ)√V(Ŷ_C), as γ increases. The composition weight that achieves the minimum MSE is given by
γ* = (W_B D² + V) / (D² + V), where D = ȳ_B − Ȳ_C and V = V(Ŷ_C).
Estimating D by ȳ_B − Ŷ_C and V by V̂ in application, one can use
γ̂ = (W_B D̂² + V̂) / (D̂² + V̂).
The validity of the composite approach derives from probability sampling of S, regardless of how the B-sample is generated. Again, the bigger the B-sample, the better it is.
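A sketch of the composite estimator with a plug-in minimum-MSE composition weight, obtained by minimising the estimated bias-squared-plus-variance (γ − W_B)²D² + (1 − γ)²V over γ; the simulated population, the logistic selection mechanism, and the variable names are all assumptions of this illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical population with a selective (biased) B-sample:
# inclusion probability increases with the outcome y.
N = 50_000
y = rng.normal(10, 2, size=N)
in_B = rng.random(N) < 1 / (1 + np.exp(-(y - 10)))
y_B, y_rest = y[in_B], y[~in_B]
W_B = y_B.size / N

# Supplementary SRS from the complement of B.
n_S = 300
y_S = y_rest[rng.choice(y_rest.size, size=n_S, replace=False)]
mean_out = y_S.mean()
v = (1 - n_S / y_rest.size) * y_S.var(ddof=1) / n_S

# Plug-in minimum-MSE composition weight: minimise the estimated
# (gamma - W_B)^2 * d^2 + (1 - gamma)^2 * v, with d the estimated
# difference between the B-sample mean and the mean outside B.
d = y_B.mean() - mean_out
gamma = (W_B * d**2 + v) / (d**2 + v)

# Composite estimate: gamma * B-mean + (1 - gamma) * S-estimate.
composite = gamma * y_B.mean() + (1 - gamma) * mean_out
print(f"gamma = {gamma:.3f}, composite estimate = {composite:.3f}")
```

With a strongly selective B-sample the estimated d² dominates v, so the plug-in weight stays close to W_B and the composite estimate stays close to the consistent split-population estimate.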
5. Summary
All the estimators from non-probability sample observations reviewed in Section 2 are model based, whether the modelling is carried out under the SP or QR approach. Two features regarding the model covariate x_i, for i ∈ U, are worth recapitulating:
- compared to the situation with known x_i, making use of an additional probability sample entails a loss of efficiency, as can be expected;
- the availability of an additional probability sample without the outcome variable is not a principal advantage, since it does not simplify the validity conditions compared to the situation where x_i is known, but it does resolve the practical difficulty when x_i is unavailable yet some functions of x_i are needed for descriptive inference.
The situation changes completely given, in addition, a probability sample of outcomes. The probability sample then enables valid descriptive inference in combination with the non-probability sample. Depending on the circumstances, the probability sample can be selected either from the whole population, or just from the part of the population outside the non-probability sample.
Finally, in situations where the non-probability sample is large, the cost savings will be the greatest if it can replace regular survey sampling altogether. Use of an additional probability audit sample is needed to validate the non-probability sample estimate, given the possible failure of its underlying model assumptions, and to provide a meaningful accuracy measure that can accommodate its potential bias.
Disclosure statement
No potential conflict of interest was reported by the author.
ORCID
Li-Chun Zhang http://orcid.org/0000-0002-3944-9484
Additional information
Notes on contributors
Li-Chun Zhang
Li-Chun Zhang is Professor of Social Statistics at University of Southampton, Senior Researcher at Statistics Norway, and Professor of Official Statistics at University of Oslo. His research interests include finite population sampling design and coordination, graph sampling, sample survey estimation, non-response, measurement errors, small area estimation, index number calculations, editing and imputation, register-based statistics, analysis of integrated data, statistical matching, record linkage, population size estimation.
References
- Baker, R., Brick, J. M., Bates, N. A., Battaglia, M., Couper, M. P., Dever, J. A., …Tourangeau, R. (2013). Report of the AAPOR task force on non-probability sampling (Technical Report). Deerfield, IL: American Association for Public Opinion Research.
- Bethlehem, J. (1988). Reduction of nonresponse bias through regression estimation. Journal of Official Statistics, 4, 251–260.
- Chen, J. H., & Shao, J. (2000). Nearest neighbor imputation for survey data. Journal of Official Statistics, 16, 113–131.
- Deville, J.-C., & Särndal, C.-E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association, 87, 376–382. doi: 10.1080/01621459.1992.10475217
- Elliott, M. R., & Valliant, R. (2017). Inference for nonprobability samples. Statistical Science, 32, 249–264. doi: 10.1214/16-STS598
- Kang, J. D. Y., & Schafer, J. L. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data (with discussion). Statistical Science, 22, 523–539. doi: 10.1214/07-STS227
- Kim, J. K., & Haziza, D. (2014). Doubly robust inference with missing data in survey sampling. Statistica Sinica, 24, 375–394.
- Kim, J.-K., & Rao, J. N. K. (2018). Data integration for big data analysis in finite population inference. Talk presented at SSC2018. Montreal.
- Kim, J.-K., & Wang, Z. (2018). Sampling techniques for big data analysis in finite population inference. Retrieved from arXiv:1801.09728v1
- Little, R. J. A. (1982). Models for nonresponse in sample surveys. Journal of the American Statistical Association, 77, 237–250. doi: 10.1080/01621459.1982.10477792
- Meng, X. L. (2018). Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and 2016 US presidential election. The Annals of Applied Statistics, 12, 685–726. doi: 10.1214/18-AOAS1161SF
- Oh, H. L., & Scheuren, F. J. (1983). Weighting adjustments for unit non-response. In W. G. Madow, I. Olkin & D. B. Rubin (Eds.), Incomplete data in sample surveys (Vol. 2): Theory and bibliographies (pp. 143–184). New York: Academic Press.
- Pfeffermann, D. (2017). Bayes-based non-Bayesian inference on finite populations from non-representative samples. Calcutta Statistical Association Bulletin, 69, 1–29. doi:10.1177/0008068317696546
- Pfeffermann, D., Krieger, A. M., & Rinott, Y. (1998). Parametric distributions of complex survey data under informative probability sampling. Statistica Sinica, 8, 1087–1114.
- Rao, J. N. K. (1966). Alternative estimators in PPS sampling for multiple characteristics. Sankhya, 28, 47–60.
- Rivers, D. (2007). Sampling for web surveys. Proceedings of the survey research methods section. American Statistical Association.
- Robins, J. M., Rotnitzky, A., & Zhao, L. P. (1994). Estimation of regression coefficient when some regressors are not always observed. Journal of the American Statistical Association, 89, 846–866. doi: 10.1080/01621459.1994.10476818
- Rosenbaum, P., & Rubin, D. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55. doi: 10.1093/biomet/70.1.41
- Royall, R. M. (1970). On finite population sampling theory under certain linear regression models. Biometrika, 57, 377–387. doi: 10.1093/biomet/57.2.377
- Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592. doi: 10.1093/biomet/63.3.581
- Smith, T. M. F. (1983). On the validity of inferences from non-random sample. Journal of the Royal Statistical Society, Series A, 146, 394–403. doi: 10.2307/2981454
- Tam, S.-M., & Kim, J.-K. (2018a). Big data ethics and selection-bias: An official statistician's perspective. Statistical Journal of the IAOS, 34, 577–588. doi:10.3233/SJI-170395
- Tam, S.-M., & Kim, J.-K. (2018b). Mining big data for finite population inference. Talk presented at BigSurv18. Barcelona.
- Yang, S., & Kim, J.-K. (2018). Integration of survey data and big observational data for finite population inference using mass imputation. Retrieved from https://arxiv.org/abs/1807.02817v1
- Zhang, L.-C. (2019). Proxy expenditure weights for consumer price index: Audit sampling inference for big data statistics. Retrieved from https://arxiv.org/abs/1906.11208