On valid descriptive inference from non-probability sample

Pages 103-113 | Received 05 Oct 2018, Accepted 07 Sep 2019, Published online: 13 Sep 2019

ABSTRACT

We examine the conditions under which descriptive inference can be based directly on the observed distribution in a non-probability sample, under both the super-population and quasi-randomisation modelling approaches. Review of existing estimation methods reveals that the traditional formulation of these conditions may be inadequate due to potential issues of under-coverage or heterogeneous mean beyond the assumed model. We formulate unifying conditions that are applicable to both types of modelling approaches. The difficulties of empirically validating the required conditions are discussed, as well as valid inference approaches using supplementary probability sampling. The key message is that probability sampling may still be necessary in some situations, in order to ensure the validity of descriptive inference, but it can be much less resource-demanding given the presence of a big non-probability sample.

1. Introduction

There is a resurgence of interest in the use of non-probability samples. See, for example, Baker et al. (2013) and Elliott and Valliant (2017) for two recent reviews. Such data may arise in situations where probability sampling is either infeasible or too costly. The observations may be obtained from so-called big-data sources, such as payment transaction data from a specific platform or cellphone call data from a major service provider. These big-data non-probability samples can be much larger than the more familiar non-probability samples collected from web panel surveys, quota sampling, etc.

Following Rubin (1976) and Little (1982), Smith (1983) considers the so-called super-population (SP) approach to inference from a non-probability sample. Under this approach, a prediction model is constructed for the outcome variable of interest, often conditional on some chosen covariates. In particular, Smith (1983) observes an important distinction between analytic and descriptive inference. In analytic inference, the targets are model parameters of a theoretical nature; such parameters can never be observed directly, no matter how large the sample is. By contrast, the targets of descriptive inference are statistics of a given finite population, which in principle can be observed directly given a perfect census of the population.

Moreover, Smith (1983) focuses on validity conditions under which the non-probability sample observation mechanism can be ignored, in the sense that inference can be based directly on the observed distributions, such as the conditional distribution of the outcome variable given the covariates in the sample. The two key validity conditions under the SP approach can be roughly stated as follows: (i) the prediction model is correctly specified for the population units; (ii) the non-probability sample selection mechanism is non-informative, in the sense that the relevant distribution under the population model can be observed directly in the non-probability sample. Similar validity conditions for the SP approach apply in other situations, such as purposive sampling (Royall, 1970) and missing data problems (Rubin, 1976).

In this paper, we concentrate on descriptive inference methods that depend on validity conditions in the sense of Smith (1983). Of course, inference is also possible without such validity conditions. For instance, not missing-at-random models (Rubin, 1976) can be used to deal with informative missing data, where the unobserved full-sample outcome distribution is not the same as that among the respondent subsample. Or, the sample likelihood of Pfeffermann, Krieger, and Rinott (1998) can be applied to survey data under informative sampling, where the distribution that holds in the population cannot be directly observed in the sample. See also Pfeffermann (2017) for several other situations where this approach may be relevant. We do not consider such approaches here, which require explicitly modelling the informative observation mechanism of sample selection or measurement.

As reviewed by Elliott and Valliant (2017), there exists another approach to non-probability samples, the quasi-randomisation (QR) approach. Under the QR approach, one hypothesises a randomisation model of the non-probability sample inclusion indicator, but treats the outcomes of interest as unknown constants in the population. Though it is clearly inspired by the randomisation approach based on probability sampling, the QR approach is also a model-based approach, based on a model of the sample inclusion indicator instead of a prediction model of the outcome variable as under the SP approach. A key motivation is that the correct inclusion probability can be used for any outcome of interest, just as when it is known under probability sampling, whereas the SP model by nature must be specified differently for different outcome variables. In the context of survey sampling, the QR approach was introduced to deal with nonresponse, where response to the survey is modelled as the second phase of selection, in addition to the first phase of sample selection according to a probability sampling design (Oh & Scheuren, 1983).

According to Elliott and Valliant (2017), two key validity conditions are required for the QR approach. (I) The non-probability sample does have a probability sampling mechanism, even though it is unknown. In particular, one assumes that this hypothesised sample inclusion probability is strictly positive for all the population units, so that the only difference from probability sampling is that the inclusion probability is unknown. (II) There exists a set of covariates that ‘fully govern the sampling mechanism’. In other words, the sample inclusion probability is a function of these covariates.

Thus there are two model-based approaches to inference from non-probability sample. Under the SP approach, one models the outcome variable conditional on the realised sample inclusion indicators, whereas under the QR approach, one models the sample inclusion indicators, but treats the outcomes as unknown constants. Although one may envisage the outcomes as the realised values of random variables, a fully specified model of the outcome variable will not be required under the QR approach, given suitable validity conditions. Similarly, although one acknowledges that the sample selection mechanism may be critical to the SP approach, a fully specified model of the inclusion indicator will not be required under the SP approach, given suitable validity conditions.

It is possible to construct estimators that combine the models of both the outcome and the sample inclusion indicator, in such a manner that the estimator is consistent as long as one of the two models holds. In recent years, it has become common to refer to this estimation approach as ‘doubly robust’ (Kang & Schafer, 2007; Kim & Haziza, 2014; Robins, Rotnitzky, & Zhao, 1994). Notice that the traditional generalised regression estimator in survey sampling is doubly robust in the same sense, except that there the randomisation mechanism is actually known. Nevertheless, in the debate between model-based and design-based inference from probability sampling, each side questioned the ‘robustness’ of the other.

The rest of the paper is organised as follows. In Section 2, we review the estimation methods for non-probability samples that do require validity conditions. Although these have been roughly stated above, a closer examination under both modelling perspectives reveals nuances across the different estimators. Moreover, we shall highlight the potential challenges of under-coverage and heterogeneous means beyond the assumed model. The traditional formulation of validity conditions is inadequate in both regards. We outline a set of unified validity conditions in Section 3, which are formulated non-parametrically and encompass both modelling approaches. Post-stratification and calibration estimators are considered in light of these conditions. However, as will be discussed, a key difficulty in practice is that the validity conditions may be impossible to verify empirically based only on the data used for the estimation. Finally, we briefly outline in Section 4 some valid approaches given a supplementary probability sample of the outcome of interest, followed by a brief summary in Section 5.

The key message is that probability sampling may still be necessary in some situations, in order to ensure the validity of descriptive inference, but it can be much less resource-demanding given the presence of a big non-probability sample. In fact, the bigger the non-probability sample, the better it is.

2. Review of existing approaches

Denote by $U$ the population of known size $N$. Let each population unit be associated with an outcome of interest, denoted by $y_i$, for $i \in U$. Denote by $B$ the observed non-probability sample of size $n_B$. An assumption common to all the estimators we discuss below is that $B$ does not contain any out-of-scope units, such that $B \subseteq U$, and that there are no duplicated units in $B$. Let $\delta_i = 1$ if $i \in B$, and $\delta_i = 0$ if $i \in U \setminus B$. Let $y_i$ be observed for all the units in $B$, and let $y_B = \{y_i; i \in B\}$. To fix the idea, let $Y = \sum_{i \in U} y_i$ be the population total that is the target of descriptive inference. Let $x_B = \{x_i; i \in B\}$ in cases where any relevant covariates $x_i$ are available in the sample $B$. Let $X = \sum_{i \in U} x_i$ be the population totals and let $\bar{X} = X/N$. Given $x_B$, there are two situations, depending on whether $(X, \bar{X})$ are known or not. If they are unknown, there may still exist a second probability sample $S$, for $S \subset U$, in which $x_i$ is observed, so that $(X, \bar{X})$ can be estimated based on the sample $S$.

2.1. B-sample expansion estimator

Consider first the most basic situation where only $y_B$ is observed, and no relevant covariates are available at all. Let $\bar{y}_B = \sum_{i \in B} y_i / n_B$ be the B-sample mean. The B-sample expansion estimator of $Y$ is given by

$$\hat{Y} = N \bar{y}_B. \tag{1}$$

Under the SP approach, let $\mu_i = E(y_i \mid \delta_i, i \in U)$ be the conditional expectation of $y_i$ given $\delta_i$, for any $i \in U$, where both $\delta_i$ and $y_i$ are treated as random variables. Provided the conditional expectation is the same as the unconditional expectation given either $\delta_i = 1$ or $\delta_i = 0$, for any $i \in U$, denoted by

$$\mu = \mu(\delta_i = 1) = E(y_i \mid i \in B) = E(y_i; i \in U), \tag{2}$$

we have $E(\bar{y}_B - Y/N \mid B) = \sum_{i \in B} \mu / n_B - \mu = 0$, such that the B-sample expansion estimator is prediction unbiased for $Y$. We shall refer to (2) as the SP assumption, which is a validity condition for the B-sample expansion estimator under the SP approach.

Under the QR approach, where $y_i$ is treated as a fixed constant, let $p_i = \Pr(\delta_i = 1; y_i, i \in U)$ be the inclusion probability of any population unit that is associated with the value $y_i$. The notation ‘;’ is used here instead of ‘|’ because, strictly speaking, $p_i$ is not a conditional probability, now that $y_i$ is not conceived as the realised value of a random variable under the QR approach. Now, provided the inclusion probability is the same for any $i \in U$,

$$p_i = p, \tag{3}$$

the estimator $\tilde{Y} = \sum_{i \in B} y_i / p$ is unbiased for $Y$, since

$$E\Big(\sum_{i \in B} y_i / p\Big) = \sum_{i \in U} E(\delta_i; y_i, i \in U)\, y_i / p = \sum_{i \in U} p\, y_i / p = Y.$$

In reality, $p$ is unknown. It is natural to estimate it by $\hat{p} = n_B / N$ under (3), which yields (1) as the resulting plug-in estimator. It follows that the QR assumption (3) is the key validity condition, which ensures that the B-sample expansion estimator is consistent for $Y$, asymptotically as $N \to \infty$ and $n_B / N = O_p(1)$.

In summary, the B-sample expansion estimator (1) can be motivated under both the SP and QR approaches, given validity conditions (2) and (3), respectively.
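To make the QR argument above concrete, the following minimal sketch computes the expansion estimator (1) and checks, by exact enumeration, that $\sum_{i \in B} y_i / p$ has expectation $Y$ under a common inclusion probability. It assumes independent Bernoulli inclusion, which is only one possible mechanism satisfying (3); the toy population and the function names are hypothetical.

```python
from itertools import product

def expansion_estimate(y_B, N):
    """B-sample expansion estimator (1): N times the B-sample mean."""
    return N * sum(y_B) / len(y_B)

def expected_scaled_total(y, p):
    """Exact expectation of sum_{i in B} y_i / p under independent
    Bernoulli(p) inclusion (an assumed mechanism), obtained by
    enumerating all 2^N inclusion patterns."""
    total = 0.0
    for delta in product([0, 1], repeat=len(y)):
        prob = 1.0
        for d in delta:
            prob *= p if d else 1.0 - p
        total += prob * sum(yi for yi, d in zip(y, delta) if d) / p
    return total

y = [3.0, 5.0, 8.0, 10.0]   # hypothetical toy population, N = 4
Y = sum(y)

print(expansion_estimate([3.0, 8.0], N=len(y)))   # 4 * 5.5 = 22.0
print(abs(expected_scaled_total(y, p=0.5) - Y) < 1e-9)
```

The enumeration is only feasible for very small $N$; it is used here because it gives the expectation exactly rather than by simulation.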

2.2. B-sample calibration estimator

Suppose relevant covariates $x_B$ are available in the sample $B$. The population totals $X$ may be either known or unknown. In the latter case, suppose they can be estimated from a second probability sample $S$. The B-sample calibration estimator of $Y$ is given by

$$\hat{Y} = \sum_{i \in B} w_i y_i \quad \text{where} \quad \sum_{i \in B} w_i x_i = \begin{cases} X & \text{if } X \text{ is known}, \\ \hat{X}(S) & \text{if } X \text{ is unknown}, \end{cases} \tag{4}$$

where $\hat{X}(S)$ is some S-sample estimator that is consistent as the S-sample size increases, and the weights $w_B = \{w_i; i \in B\}$ are calibrated in a way that depends on the availability of $X$.

To actually compute the estimator (4), one needs to choose a set of initial weights, denoted by $a_B = \{a_i; i \in B\}$, and a distance function such as $\sum_{i \in B} (w_i - a_i)^2 / a_i$ between the initial and calibrated weights (Deville & Särndal, 1992). In the case of

$$a_i = 1/p_i, \tag{5}$$

where $p_i$ is the true B-sample inclusion probability, for $p_i > 0$, the calibration estimator is consistent as $N \to \infty$ and $n_B / N = O_p(1)$, given mild additional regularity conditions. However, insofar as one cannot manage to set the initial weights (5), the calibration estimator is unmotivated from the QR perspective.

Next, under the SP approach, suppose the SPx assumption given by

$$E(y_i \mid x_i, i \in U) = \mu(x_i) = x_i^\top \beta, \tag{6}$$

which relates the conditional expectation of $y_i$ linearly to the given $x_i$, holds together with

$$E(y_i \mid x_i, i \in U) = E(y_i \mid x_i, i \in B), \tag{7}$$

by which the B-sample selection is non-informative given $x_i$. We then have

$$E\Big(\sum_{i \in B} w_i y_i - Y \,\Big|\, x_U\Big) = E\Big(\sum_{i \in B} w_i x_i^\top \beta - X^\top \beta\Big) = 0$$

given $\sum_{i \in B} w_i x_i = X$, regardless of the initial weights $a_B$. Otherwise, this expectation would tend to 0, provided $\hat{X}(S)$ is an asymptotically unbiased estimator of $X$, under some suitable asymptotic setting. It follows that the assumptions (6) and (7) are the key validity conditions for the B-sample calibration estimator under the SP approach.
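The role of the initial weights under the SP approach can be illustrated with a small sketch of linear calibration for a single scalar covariate. It assumes the chi-square type distance $\sum_{i \in B} (w_i - a_i)^2 / a_i$, for which the Lagrangian solution is $w_i = a_i (1 + \lambda x_i)$; the data, the known total `X` and the function name are hypothetical, and the outcomes are placed exactly on a line through the origin so that the linear mean model holds without error.

```python
def linear_calibration_weights(x_B, a_B, X):
    """Weights minimising sum (w_i - a_i)^2 / a_i subject to
    sum w_i x_i = X, for a scalar covariate: w_i = a_i * (1 + lam * x_i)."""
    gap = X - sum(a * x for a, x in zip(a_B, x_B))
    lam = gap / sum(a * x * x for a, x in zip(a_B, x_B))
    return [a * (1.0 + lam * x) for a, x in zip(a_B, x_B)]

x_B = [1.0, 2.0, 3.0]          # hypothetical B-sample covariate values
y_B = [2.0 * x for x in x_B]   # outcomes exactly on the line y = 2x
X = 15.0                       # hypothetical known population total of x

estimates = []
for a_B in ([2.0, 2.0, 2.0], [1.0, 5.0, 1.0]):   # two arbitrary initial weight sets
    w = linear_calibration_weights(x_B, a_B, X)
    estimates.append(sum(wi * yi for wi, yi in zip(w, y_B)))

# Both calibrated estimates equal 2 * X up to rounding error.
print([round(e, 10) for e in estimates])
```

Because the calibration constraint fixes $\sum_{i \in B} w_i x_i$, any outcome that is exactly linear in $x_i$ is reproduced whatever the initial weights are; with model error, the same argument holds in expectation.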

The estimator (4) becomes the B-sample post-stratification estimator in the special case where $x_i$ is the post-stratum dummy index. For the QR approach, one can set $a_i$ to be the inverse post-stratum B-sample fraction, which is equivalent to introducing the QR assumption (3) in each post-stratum separately. This QRx assumption then provides a validity condition for the B-sample post-stratification estimator under the QR approach. For the SP approach, the two assumptions (6) and (7) remain formally the same.

2.3. B-sample inverse propensity weighting

Suppose relevant covariates $x_B$ are available in the sample $B$. The B-sample inverse propensity weighting (IPW) estimator is constructed under the QR approach. Suppose

$$p_i = p(x_i; \eta) > 0, \tag{8}$$

i.e., the B-sample inclusion probability is completely determined given $x_i$, in the strictly positive parametric form $p(x_i; \eta)$, which may as well be referred to as the QRx assumption. Provided $x_i$ is known for all the units in the population, $\eta$ can be estimated, say, by a population estimating equation $\sum_{i \in U} H(\delta_i; \eta) = 0$, where $E[H(\delta_i; \eta)] = 0$. Otherwise, supposing $x_S$ is observed in a second probability sample $S$, one can use the pseudo population estimating equation $\sum_{i \in S} d_i H(\delta_i; \eta) = 0$ (Kim & Wang, 2018), where $d_i$ is the sampling weight, for $i \in S$, or some S-sampling design-consistent adjustment of it. This requires that one is able to observe $\delta_i$ for each unit $i$ in $S$; in other words, the two samples $S$ and $B$ can be matched, which is an important assumption in terms of application. To ensure that $H(\delta_i; \eta)$ is the same in both of these estimating equations, i.e., whether $i \in S$ or just $i \in U$, one needs to assume that S-sampling from $U$ is non-informative for $\delta_i$, so that

$$\Pr(\delta_i = 1 \mid x_i, i \in S) = \Pr(\delta_i = 1 \mid x_i, i \in U). \tag{9}$$

Notice that, given non-informativeness (9), we have $E[H(\delta_i; \eta)] = 0$ for all $i \in S$, such that one can also use the unweighted S-sample estimating equation, $\sum_{i \in S} H(\delta_i; \eta) = 0$, instead of the pseudo population estimating equation. Having obtained the parameter estimate $\hat{\eta}$, one obtains $\hat{p}_i = p(x_i; \hat{\eta})$ and the B-sample IPW estimator

$$\hat{Y} = \sum_{i \in B} y_i / \hat{p}_i, \tag{10}$$

which is consistent for $Y$ under mild regularity conditions, if $\hat{\eta}$ is consistent for $\eta$ under some suitable asymptotic setting. It follows that the QRx assumption (8) is its key validity condition, whereas the non-informativeness assumption (9) is needed in addition when $x_i$ is only available in a probability sample $S$ instead of the population.
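As an illustration of the case where $x_i$ is known for the whole population, the sketch below fits a propensity model by solving a population estimating equation and then forms the IPW estimator (10). The logistic form of $p(x_i; \eta)$, the choice of $H$ as the logistic score $(\delta_i - p_i)(1, x_i)^\top$, the Newton-Raphson solver and the toy data are all assumptions made for the example, not part of the method as described above.

```python
import math

def fit_logistic_propensity(x_U, delta_U, iterations=25):
    """Solve sum_U (delta_i - p_i) * (1, x_i) = 0 by Newton-Raphson,
    with p_i = 1 / (1 + exp(-(eta0 + eta1 * x_i))) (assumed logistic form)."""
    eta0, eta1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, d in zip(x_U, delta_U):
            p = 1.0 / (1.0 + math.exp(-(eta0 + eta1 * x)))
            w = p * (1.0 - p)
            g0 += d - p               # estimating function, intercept part
            g1 += (d - p) * x         # estimating function, slope part
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system with the information matrix.
        eta0 += (h11 * g0 - h01 * g1) / det
        eta1 += (h00 * g1 - h01 * g0) / det
    return eta0, eta1

def propensity(x, eta):
    return 1.0 / (1.0 + math.exp(-(eta[0] + eta[1] * x)))

# Hypothetical toy population with x known for every unit.
x_U = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
delta_U = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1]
y_U = [1.0, 2.0, 1.0, 2.0, 3.0, 4.0, 3.0, 4.0, 5.0, 6.0, 5.0, 6.0]

eta_hat = fit_logistic_propensity(x_U, delta_U)
# IPW estimator (10): only units with delta_i = 1 contribute.
Y_ipw = sum(y / propensity(x, eta_hat)
            for x, d, y in zip(x_U, delta_U, y_U) if d == 1)
print(eta_hat, Y_ipw, sum(y_U))
```

In this toy data set the observed B-sample fractions are 1/4, 1/2 and 3/4 at $x = 0, 1, 2$, which happen to be exactly linear on the logit scale, so the fitted propensities reproduce them. A single realised sample does not, of course, reproduce the population total.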

2.4. Another B-sample IPW estimator

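Elliott and Valliant (2017) discuss another IPW estimator of the form (10), where $p_i$ is obtained with the help of a second, so-called reference probability sample $S$, and is given by

$$p_i \propto \Pr(S_i = 1 \mid x_i, i \in U)\, \frac{\Pr(\delta_i = 1 \mid x_i, i \in B \cup S)}{\Pr(S_i = 1 \mid x_i, i \in B \cup S)}, \tag{11}$$

where $S_i = 1$ if $i \in S$ and $S_i = 0$ if $i \in U \setminus S$, and to fix the idea one may suppose $S \cap B = \emptyset$. First, the QRx assumption (8) is retained. The definition of $p_i$ by (11) can then be motivated as follows:

$$\frac{\Pr(\delta_i = 1 \mid x_i, i \in U)}{\Pr(S_i = 1 \mid x_i, i \in U)} \propto \frac{\Pr(x_i \mid \delta_i = 1, i \in U)}{\Pr(x_i \mid S_i = 1, i \in U)} = \frac{\Pr(x_i \mid \delta_i = 1, i \in B \cup S)}{\Pr(x_i \mid S_i = 1, i \in B \cup S)} \propto \frac{\Pr(\delta_i = 1 \mid x_i, i \in B \cup S)}{\Pr(S_i = 1 \mid x_i, i \in B \cup S)},$$

where the first proportionality holds up to the factor $\Pr(\delta_i = 1 \mid i \in U) / \Pr(S_i = 1 \mid i \in U)$ and the second up to a factor determined by $\Pr(\delta_i = 1 \mid i \in B \cup S) / \Pr(S_i = 1 \mid i \in B \cup S)$, neither of which depends on $x_i$, provided the S-sample inclusion probability is also fully determined by $x_i$ in the sense of (8). Thus the validity condition for the IPW estimator (10) based on (11) is that the QRx assumption (8) holds for both samples, given the same $x_i$.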

We make two observations. First, despite the superficial resemblance to the propensity scoring method of Rosenbaum and Rubin (1983), the above argument for $p_i$ is not the same. As Rosenbaum and Rubin (1983) state clearly before their first enumerated equation, ‘in this paper, the N units in the study are viewed as a simple random sample from some population’, where N is the size of the combined sample of treatment and non-treatment. The analogue of this combined sample here is $B \cup S$. However, it is generally untenable that $B \cup S$ can be treated as a simple random sample from the population. Second, for any given probability sample $S$, it is possible to identify the variables that determine the designed inclusion probability, denoted by $\pi_i = \pi(z_i)$, for $i \in U$. The question then arises: ‘what if $\pi(z_i)$ differs considerably from $p(x_i; \hat{\eta})$?’ Moreover, one may have more than one probability sample in which $x_i$ is observed, which raises a further question: ‘which reference sample should one use?’

2.5. Sample matching estimator

Rivers (2007) applies the SP approach in situations where a second probability sample $S$ is available. Yang and Kim (2018) study mass imputation methods, which include the matching estimator of Rivers (2007) as a special case. The sample matching (SM) estimator is given by

$$\hat{Y} = \sum_{i \in S} d_i \hat{y}_i, \tag{12}$$

where $\hat{y}_i = y_{k_i}$, for $k_i = \arg\min_{j \in B} \|x_i - x_j\|$ based on a chosen metric $\|\cdot\|$. That is, $y_{k_i}$ is the nearest-neighbour (NN) imputation value from the B-sample for $i \in S$.

To focus on the idea, assume for the moment that exact matching is the case, where $x_{k_i} = x_i = x$ for all $i \in S$ and $k_i \in B$. We then have $E(\hat{y}_i \mid x_i = x) = E(y_{k_i} \mid x_{k_i} = x, k_i \in B)$, which is the same as $E(y_i \mid x_i = x, i \in B)$, as if the unit $i$ were in $B$. Given the non-informativeness assumption (7) for the B-sample, which Yang and Kim (2018) call the ‘ignorability’ assumption, we have

$$E\Big[\sum_{i \in S} d_i E(\hat{y}_i \mid x_i)\Big] = E\Big[\sum_{i \in S} d_i E(y_i \mid x_i, i \in B)\Big] = E\Big[\sum_{i \in S} d_i E(y_i \mid x_i, i \in U)\Big] = \sum_{i \in U} E(y_i \mid x_i, i \in U) = E(Y \mid x_U).$$

With respect to both the population model and the design of $S$, the SM estimator (12) is prediction unbiased for $Y$. Notice that in the case of $S = U$, the SM estimator is just an NN-imputation method. Whether $S = U$ or not, the NN-imputed SM estimator is likely to be less efficient than a prediction-imputed SM estimator $\hat{Y} = \sum_{i \in S} d_i E(y_i \mid x_i; \hat{\beta}(B))$ whenever a correct parametric specification of the conditional mean (via $\beta$) is possible. The simulation results of Yang and Kim (2018) show that NN-imputation is less efficient than imputation based on semi-parametric generalised additive models.

Now, it is not difficult to see that the consistency of the SM estimator (12) can be established given asymptotic exact matching instead, i.e.,

$$\|x_i - x_{k_i}\| \to 0 \quad \text{in probability}, \tag{13}$$

for any $i \in S$, as $N \to \infty$ and $n_B / N = O_p(1)$. Yang and Kim (2018) make the assumption of ‘common support’ to the same effect. To ensure that $E(y_i \mid x_i, i \in U)$ does not change abruptly as the value $x_i$ varies, Yang and Kim (2018) assume that $E(y_i \mid x_i, i \in U)$ is continuously differentiable. Or, one may adopt the SPx assumption below:

$$E(y_i \mid x_i, i \in U) - E(y_j \mid x_j, j \in U) = O(\|x_i - x_j\|) \quad \text{as } N \to \infty \tag{14}$$

(Chen & Shao, 2000, Theorem 1). It follows that the assumptions (7), (13) and (14) are the key validity conditions for the consistency of the SM estimator (12).
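The nearest-neighbour construction in (12) can be sketched in a few lines. The example assumes a scalar covariate with absolute difference as the metric; the two toy samples, the weights and the function name are hypothetical.

```python
def sample_matching_total(B, S):
    """Sample matching estimator (12) for a scalar covariate.

    B: list of (x, y) pairs from the non-probability sample.
    S: list of (x, d) pairs from the probability sample, d = sampling weight.
    Each S-unit receives the y-value of its nearest B-unit in |x_i - x_j|.
    """
    total = 0.0
    for x_i, d_i in S:
        _, y_nn = min(B, key=lambda unit: abs(unit[0] - x_i))
        total += d_i * y_nn
    return total

B = [(0.1, 1.0), (0.5, 2.0), (0.9, 3.0)]   # hypothetical (x, y) in B
S = [(0.2, 10.0), (0.6, 10.0), (0.8, 10.0)]  # hypothetical (x, d) in S

print(sample_matching_total(B, S))   # 10*1.0 + 10*2.0 + 10*3.0 = 60.0
```

Ties in distance are broken here by the order of the units in `B`, which is an arbitrary choice of this sketch.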

We make two observations. First, an attractive feature of NN-imputation is that the imputed sample $S$ looks more realistic and natural than one obtained by, say, regression prediction imputation. However, unless the S-sampling is non-informative, the NN-imputed S-sample will not resemble the true S-sample that could have been observed, since

$$E(\hat{y}_i \mid x_i, i \in S) = E(y_i \mid x_i, i \in U) \neq E(y_i \mid x_i, i \in S),$$

where the inequality is the case unless S-sampling is non-informative in the sense of (7). Second, for any other covariate $z_i \neq x_i$, including when $z_i$ contains the S-sample design variables, we have

$$E(\hat{y}_i \mid z_i, x_i, i \in S) = E(y_i \mid x_i, i \in U) \neq E(y_i \mid z_i, x_i, i \in U)$$

unless $y_i$ and $z_i$ are conditionally independent of each other given $x_i$. This is a general problem for statistical matching of variables associated with distinct units, i.e., $y_i$ associated with $x_i$ for some $i \in B$ and $z_i$ associated with the same value $x_i$ but for some different unit in $S$. The following example illustrates both remarks above.

Example

Let $y_i$ be independent of $x_i \sim \text{Unif}(0, 1)$, for any $i \in U$. Then, the SPx assumption (14) holds trivially, as long as the marginal expectation $E(y_i)$ exists. Next, suppose $B$ is a simple random sample, so that the non-informativeness assumption (7) holds, and $E(\hat{y}_i \mid x_i, i \in S) = E(y_i \mid i \in U)$ regardless of the exact matching assumption. Suppose stratified simple random S-sampling with two strata of different sampling fractions, so that the S-sample inclusion probability is not a constant. Then, the S-sampling is informative (given $x_i$) as long as the population stratum means are different, since $E(\bar{y}_S \mid x_S, S) = E(\bar{y}_S \mid S) \neq E(\bar{Y}) = E(\bar{Y} \mid x_U)$, where $\bar{y}_S$ is the true S-sample mean, which is unknown since $y_i$ is not observed in $S$. It follows that the NN-imputed S-sample $\{\hat{y}_i; i \in S\}$ would look like a sample generated by simple random sampling, rather than by the actual stratified sampling. Moreover, the SM estimator of the stratum means, corresponding to, say, $z_i = 1, 2$, respectively, will be biased for the population stratum means.

3. More generally on validity conditions

Non-informative selection in the form of (7) or (9) is a critical condition for all the methods in Section 2 that make use of the auxiliary variable $x_i$. Two kinds of possible violation of these assumptions are worth noting.

First, Kim and Rao (2018) point out an important issue that has not received sufficient attention in these methods, namely that B-sample under-coverage is the case if some population units have in fact zero chance of being included in it. Under the SP approach, extrapolation of the conditional distribution of $y_i$ in the B-sample to these population units can only be based on subjective beliefs but not empirical evidence. The QR approach is equally affected, since randomisation inference would have been invalidated even if $p_i$ were known for all the B-sample units, let alone when it is unknown and needs to be estimated. To address the issue, Kim and Rao (2018) consider a two-phase SM estimator. Let the S-sample be partitioned into $S_1$ and $S_0$, such that $S_1 = \{i; p_i > 0\}$ and $S_0 = \{i; p_i = 0\}$. First, estimate this unobserved partition via the B-sample support:

$$\hat{S}_1 = \Big\{i; \min_{j \in B} \|x_i - x_j\| < \epsilon\Big\}.$$

Each S-sample unit that is unsupported in the B-sample $\epsilon$-neighbourhood is assigned to $\hat{S}_0$. Let us suppose this partition estimator is consistent in the following sense: $|\hat{S}_1 \cap S_1| / |\hat{S}_1 \cup S_1| \to 1$ in probability, as $N \to \infty$ and $\epsilon \to 0$. Next, the two-phase SM estimator is given as

$$\hat{Y} = \sum_{i \in \hat{S}_1} d_i w_{2i} \hat{y}_i, \quad \text{where} \quad \sum_{i \in \hat{S}_1} d_i w_{2i} x_i = \sum_{i \in S} d_i x_i.$$

In other words, the under-coverage is dealt with by the calibration of the weights $w_{2i}$. This can be motivated provided the conditional mean $E(y_i \mid x_i, p_i = 0)$ can be linearly related to $x_i$, and the relationship is the same for the units with $p_i > 0$, i.e., the under-coverage is non-informative for the SP linear model.

Second, insofar as one requires either an assumption of SPx (7) or QRx (9), there is always the possibility of heterogeneous mean, beyond what is controlled by the chosen $x_i$. Let $U_x = \{i; x_i = x, i \in U\}$ be of size $N_x$. Under the SP approach, which models the mean $\mu_i$ of unit $i$ by $\mu(x_i)$, heterogeneous mean is the case if $\mu_i \neq \mu(x_i)$, despite

$$\mu(x) = \sum_{i \in U_x} \mu_i / N_x, \tag{15}$$

so that $\mu(x_i)$ is statistically correct in that the $\mu_i$'s average to $\mu(x)$ over all the units in $U_x$. Under the QR approach, heterogeneous mean is the case if $p_i \neq p(x_i)$, despite

$$p(x) = \sum_{i \in U_x} p_i / N_x. \tag{16}$$

Let us illustrate the concept of heterogeneous mean with a simple example.

Example

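Let $x \equiv 1$, such that $\mu(x_i) = \mu$, for all $i \in U$. Let $U = U_1 \cup U_0$ be a partition. Let $U_1$ be of size $N_1$ and with mean $\mu_i = \mu(1)$, for all $i \in U_1$; let $U_0$ be of size $N_0$ and with mean $\mu_i = \mu(0)$, for all $i \in U_0$. Suppose $\mu(1) \neq \mu(0)$. Let $\mu = \mu(1) N_1 / N + \mu(0) N_0 / N$. Then, $\mu_i \neq \mu$ for any $i \in U$, but we still have $\sum_{i \in U} \mu_i / N = \mu$, satisfying (15).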

Heterogeneous mean affects the SP and QR approaches differently. Given (15), assuming $\mu_i = \mu(x)$ for $i \in U_x$ is prediction unbiased, despite heterogeneous mean, since

$$\sum_{i \in U_x} [E(y_i \mid \delta_i) - \mu(x)] = \sum_{i \in U_x} [\mu_i - \mu(x)] = 0.$$

Meanwhile, given (16), assuming $p_i = p(x)$ for $i \in U_x$ yields

$$E\Big[\sum_{i \in U_x} \frac{\delta_i y_i}{p(x)} - \sum_{i \in U_x} y_i\Big] = p(x)^{-1} \sum_{i \in U_x} (p_i - p(x))\, y_i \neq 0,$$

in which case the IPW estimator under the QR approach may be biased, even though the model of $p_i$ is statistically correct in the sense of (16).
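The bias expression above is easy to evaluate numerically within a single cell $U_x$. The sketch below uses only $E(\delta_i) = p_i$ and linearity of expectation; the toy values of $y_i$ and $p_i$ are hypothetical and chosen so that the inclusion probabilities average to $p(x) = 0.5$ as in (16).

```python
def ipw_bias_within_cell(y, p):
    """Exact bias of sum_i delta_i y_i / p(x) for sum_i y_i within one cell,
    where p(x) is the cell average of the true inclusion probabilities p_i."""
    p_x = sum(p) / len(p)
    expected = sum(p_i * y_i for p_i, y_i in zip(p, y)) / p_x   # E(delta_i) = p_i
    return expected - sum(y)

def bias_formula(y, p):
    """The same bias written as p(x)^{-1} * sum_i (p_i - p(x)) y_i."""
    p_x = sum(p) / len(p)
    return sum((p_i - p_x) * y_i for p_i, y_i in zip(p, y)) / p_x

y = [1.0, 1.0, 5.0, 5.0]            # hypothetical outcomes in one cell U_x
p_hetero = [0.2, 0.2, 0.8, 0.8]     # heterogeneous p_i, averaging to 0.5
p_homog = [0.5, 0.5, 0.5, 0.5]      # p_i = p(x) for every unit

print(ipw_bias_within_cell(y, p_hetero), bias_formula(y, p_hetero))
print(ipw_bias_within_cell(y, p_homog))
```

With these values the units with larger $y_i$ also have larger $p_i$, so the bias is positive (about 4.8 against a cell total of 12), whereas it vanishes when $p_i = p(x)$ throughout the cell.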

The discussion above suggests that the formulation of validity conditions in Section 2 is inadequate in the presence of under-coverage and mean heterogeneity. Below we first reformulate the validity conditions, which cover both the SP and QR approaches, despite the presence of under-coverage and mean heterogeneity. We elaborate and illustrate these conditions for the post-stratification and calibration estimators. Finally, we discuss the difficulties of verifying these validity conditions empirically.

3.1. Non-parametric asymptotic (NPA) non-informativeness

We start by noticing that the B-sample mean equals the population mean, denoted by $\bar{y}_B = \bar{Y}$, provided

$$\mathrm{Cov}_N(\delta_i, y_i) = \frac{1}{N} \sum_{i \in U} \delta_i y_i - \Big(\frac{1}{N} \sum_{i \in U} \delta_i\Big) \Big(\frac{1}{N} \sum_{i \in U} y_i\Big) = 0, \qquad E_N(\delta_i) = \sum_{i \in U} \delta_i / N > 0,$$

where $E_N$ and $\mathrm{Cov}_N$ denote, respectively, expectation and covariance with respect to the empirical distribution function that places point mass $1/N$ on each population unit. This provides an empirical formulation of the non-informativeness of the B-sample observation mechanism with respect to the outcome of interest. Similar expressions have appeared in various discussions of the potential sample mean bias due to the observation mechanism, such as unequal probability sampling (Rao, 1966), survey nonresponse (Bethlehem, 1988) or big data (Meng, 2018). It motivates the following non-parametric asymptotic (NPA) non-informativeness assumption in the absence of any covariates:

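$$\begin{aligned} &\lim_{N \to \infty} \mathrm{Cov}_N(\delta_i, y_i) = 0, && \text{i.e., non-informative B-selection,} \\ &\lim_{N \to \infty} E_N(\delta_i) = p > 0, && \text{i.e., non-negligible B-selection.} \end{aligned} \tag{17}$$

The NPA assumption (17) encompasses both the SP and QR approaches. For the SP approach, taking the conditional expectation of the $y_i$'s given the $\delta_i$'s yields

$$E(\mathrm{Cov}_N(\delta_i, y_i) \mid \delta_U) = \frac{1}{N} \sum_{i \in U} \delta_i \mu_i - \Big(\frac{1}{N} \sum_{i \in U} \delta_i\Big) \Big(\frac{1}{N} \sum_{i \in U} \mu_i\Big) \to 0$$

given NPA non-informative B-selection, where $\sum_{i \in U} \delta_i / N > 0$ given non-negligible B-selection in addition. Under this condition, the B-sample expansion estimator (1) is asymptotically prediction unbiased from the SP perspective. For the QR approach, taking the expectation of the $\delta_i$'s with the $y_i$'s being constants yields

$$E(\mathrm{Cov}_N(\delta_i, y_i); y_U) = \frac{1}{N} \sum_{i \in U} p_i y_i - \Big(\frac{1}{N} \sum_{i \in U} p_i\Big) \Big(\frac{1}{N} \sum_{i \in U} y_i\Big) \to 0, \qquad E(E_N(\delta_i)) = \sum_{i \in U} p_i / N \to p > 0.$$

In particular, the NPA assumption (17) allows for $0 \le p_i \le 1$, so that the B-sample expansion estimator (1) remains consistent from the QR perspective, even in the presence of under-coverage of the units with $p_i = 0$ or non-representative units with $p_i = 1$.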

Example

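Let $U = U_1 \cup U_0$ be a partition. Let $U_1$ be of size $N_1$, where $p_i \equiv 1$ for $i \in U_1$; let $U_0$ be of size $N_0$, where $p_i \equiv 0$ for $i \in U_0$. Despite under-coverage of $B \subseteq U_1$, the first NPA condition implies $\bar{y}_B - \bar{Y} \to 0$, given the second condition $N_1 / N \to p > 0$.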
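The empirical covariance formulation in this subsection rests on a finite-population identity that follows directly from the definitions: $\bar{y}_B - \bar{Y} = \mathrm{Cov}_N(\delta_i, y_i) / E_N(\delta_i)$ whenever $E_N(\delta_i) > 0$. The sketch below checks it on two hypothetical inclusion patterns for the same toy population; the function names are illustrative.

```python
def mean(values):
    return sum(values) / len(values)

def cov_N(delta, y):
    """Empirical covariance of (delta_i, y_i) with point mass 1/N per unit."""
    return mean([d * yi for d, yi in zip(delta, y)]) - mean(delta) * mean(y)

def sample_mean_error(delta, y):
    """B-sample mean minus population mean."""
    y_B = [yi for d, yi in zip(delta, y) if d == 1]
    return mean(y_B) - mean(y)

y = [2.0, 4.0, 6.0, 8.0, 10.0]     # hypothetical population, mean 6
informative = [1, 0, 1, 0, 0]      # B picks up low values: sample mean 4
balanced = [0, 1, 0, 1, 0]         # B-sample mean 6, zero covariance

for delta in (informative, balanced):
    lhs = sample_mean_error(delta, y)
    rhs = cov_N(delta, y) / mean(delta)
    print(lhs, abs(lhs - rhs) < 1e-12)
```

The second pattern shows that a zero empirical covariance gives $\bar{y}_B = \bar{Y}$ even though three of the five units are not in $B$.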

3.2. Post-stratification estimator

Consider post-stratification by $x_i$, for $i \in U$. Provided the assumption (17) holds within each post-stratum, the B-sample post-stratification estimator is asymptotically unbiased from both the SP and QR perspectives. Below we consider the QR approach. The SP approach is a special case of the calibration estimator discussed in Section 3.3.

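Consider first the hypothetical estimator with known $p_x = \sum_{i \in U_x} p_i / N_x$:

$$\tilde{Y} = \sum_x \sum_{i \in U_x} \delta_i y_i / p_x.$$

To fix the idea for variance estimation, suppose an independent Bernoulli distribution of $\delta_i$ with probability $p_i$, where $0 \le p_i \le 1$. The variance of $\tilde{Y}$ is then given by

$$V(\tilde{Y}) = \sum_x \sum_{i \in U_x} p_i y_i^2 / p_x^2 - \sum_x \sum_{i \in U_x} p_i^2 y_i^2 / p_x^2.$$

An unbiased estimator of the first term of the variance, denoted by $\tau_1$, is given by

$$\hat{\tau}_1 = \sum_x \sum_{i \in U_x} \delta_i y_i^2 / p_x^2 = \sum_x p_x^{-2} \sum_{i \in B_x} y_i^2,$$

where $B_x = B \cap U_x$. An unbiased estimator of the second term, denoted by $\tau_2$, is given by

$$\hat{\tau}_2 = \sum_x p_x^{-2} \sum_{i \in U_x} \delta_i p_i y_i^2 = \sum_x p_x^{-1} \sum_{i \in U_x} \delta_i y_i^2 = \sum_x p_x^{-1} \sum_{i \in B_x} y_i^2,$$

where the second equality follows given the additional QRx assumption, i.e., $p_i = p_x$ for $i \in U_x$. Putting $\hat{\tau}_1$ and $\hat{\tau}_2$ together, we obtain

$$\hat{V}(\tilde{Y}) = \sum_x (p_x^{-1} - 1)\, p_x^{-1} \sum_{i \in B_x} y_i^2.$$

Now, the post-stratification estimator, denoted by $\hat{Y}$, is obtained from $\tilde{Y}$ on replacing $p_x$ by $\hat{p}_x = n_{xB} / N_x$, where $n_{xB}$ is the observed size of $B_x$ and $N_x$ is the known post-stratum population size. Expanding $\hat{p}_x$ around $p_x$ (i.e., linearisation) would yield an asymptotically valid estimator of the unconditional variance of $\hat{Y}$.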
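The unbiasedness of $\hat{V}(\tilde{Y})$ for known $p_x$ can be checked exactly on a small population by enumerating every inclusion pattern. The sketch below assumes independent Bernoulli inclusion with $p_i = p_x$ within each post-stratum, as in the derivation above; the population, the stratum labels and the function names are hypothetical.

```python
from itertools import product

def variance_estimate(delta, y, cell, p_cell):
    """V-hat = sum_x (1/p_x - 1) (1/p_x) sum_{i in B_x} y_i^2, with p_x known."""
    return sum((1.0 / p_cell[c] - 1.0) / p_cell[c] * yi * yi
               for d, yi, c in zip(delta, y, cell) if d == 1)

def true_variance(y, cell, p_cell):
    """V(Y-tilde) = sum_i p_i (1 - p_i) y_i^2 / p_x^2 with p_i = p_x in each cell."""
    return sum((1.0 - p_cell[c]) / p_cell[c] * yi * yi
               for yi, c in zip(y, cell))

def expected_variance_estimate(y, cell, p_cell):
    """Exact expectation of V-hat over all 2^N Bernoulli inclusion patterns."""
    total = 0.0
    for delta in product([0, 1], repeat=len(y)):
        prob = 1.0
        for d, c in zip(delta, cell):
            prob *= p_cell[c] if d else 1.0 - p_cell[c]
        total += prob * variance_estimate(delta, y, cell, p_cell)
    return total

y = [1.0, 2.0, 3.0, 4.0, 5.0]        # hypothetical population values
cell = ["a", "a", "b", "b", "b"]     # post-stratum of each unit
p_cell = {"a": 0.5, "b": 0.25}       # known inclusion probability per stratum

print(true_variance(y, cell, p_cell))
print(expected_variance_estimate(y, cell, p_cell))
```

The two printed values agree up to rounding error. The check covers only the hypothetical estimator with known $p_x$, not the linearised variance of the post-stratification estimator with estimated $\hat{p}_x$.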

3.3. Calibration estimator

The post-stratification estimator is infeasible in cases where the B-sample contains empty cells, or where the population sizes $N_x$ are not all known. Let $t_i = (t_{1i}, t_{2i}, \ldots, t_{Ki})^\top = (t_1(x_i), t_2(x_i), \ldots, t_K(x_i))^\top = t(x_i)$ be a vector of many-to-one mappings of $x_i$, such that the population total $T = \sum_{i \in U} t_i$ is known, and the sample total $t = \sum_{i \in B} t_i$ has only non-zero components.

As discussed for the calibration estimator in Section 2, generally one is not able in practice to set the initial weight to be the inverse of the B-sample inclusion probability. Suppose one simply starts with equal initial weights $a_i = N / n_B$ for all $i \in B$. The linear calibration estimator (Deville & Särndal, 1992) is given by $\hat{Y} = \sum_{i \in B} w_i y_i$, where the weights $\{w_i; i \in B\}$ minimise the distance to $\{a_i; i \in B\}$ as measured by

$$\sum_{i \in B} (w_i - N/n_B)^2 = \sum_t \Big[\sum_{i \in B_t} w_i^2 - 2 (N/n_B) \sum_{i \in B_t} w_i + n_{tB} (N/n_B)^2\Big]$$

subject to the constraints $\sum_{i \in B} w_i t_i = T$, where $B_t = \{i; t_i = t, i \in B\}$ and $n_{tB} > 0$ is the size of $B_t$. It follows that $w_i = w_t$ for $i \in B_t$, since the only thing that matters to the calibration constraints is $\sum_{i \in B_t} w_i$, now that $t_i = t$ for $i \in B_t$, and, given whatever $\sum_{i \in B_t} w_i$, the term $\sum_{i \in B_t} w_i^2$ is minimised at $w_i = w_t$ for $i \in B_t$.

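As the first validity condition for $\hat{Y}$, suppose there exists a vector $\beta_{K \times 1}$, such that

$$\sum_{i \in U_t} \epsilon_i / N_t \to 0 \tag{18}$$

for each $t$-value, as $N \to \infty$, where $\epsilon_i = y_i - t_i^\top \beta$, and $N_t$ is the population size of $U_t = \{i; t_i = t, i \in U\}$. The condition (18) is analogous to the SPx assumption (6), where the covariate $x_i$ is replaced by $t_i$ here. Moreover, it relaxes the model (6) of the conditional mean, allowing for potential heterogeneous mean similar to (15). Since $\sum_{i \in B} w_i t_i = T$, we have

$$\hat{Y} - Y = \sum_{i \in B} w_i (t_i^\top \beta + \epsilon_i) - \sum_{i \in U} (t_i^\top \beta + \epsilon_i) = \sum_{i \in B} w_i \epsilon_i - \sum_{i \in U} \epsilon_i.$$

Given (18), $\sum_{i \in U} \epsilon_i / N \to 0$ as $N \to \infty$. Moreover, we have

$$\frac{1}{N} \sum_{i \in B} w_i \epsilon_i = \sum_t \frac{w_t}{N} \sum_{i \in U_t} \delta_i \epsilon_i = \sum_t \frac{w_t N_t}{N} \Big[\mathrm{Cov}_{N_t}(\delta_i, \epsilon_i) + \Big(\frac{1}{N_t} \sum_{i \in U_t} \delta_i\Big) \Big(\frac{1}{N_t} \sum_{i \in U_t} \epsilon_i\Big)\Big] \to 0$$

as $N \to \infty$, given

$$\mathrm{Cov}_{N_t}(\delta_i, \epsilon_i) \to 0, \qquad E_{N_t}(\delta_i) = \sum_{i \in U_t} \delta_i / N_t \to p_t > 0 \tag{19}$$

for any given $t$, which is an adaptation of the NPA non-informativeness assumption (17) to the present setting. It follows that the two assumptions (18) and (19) are the key validity conditions for the calibration estimator to be consistent for $Y$.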

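For variance estimation, suppose again an independent Bernoulli distribution of $\delta_i$ with probability $p_i$, where $0 \le p_i \le 1$. An approximate variance estimator for the calibration estimator $\hat{Y}$ can then be given as

$$\hat{V}(\hat{Y}) = \sum_t (\hat{p}_t^{-1} - 1)\, \hat{p}_t^{-1} \sum_{i \in B_t} (y_i - t_i^\top \hat{\beta})^2,$$

where $\hat{p}_t = n_{tB} / N_t$, and $\hat{\beta} = \big(\sum_{i \in B} w_i t_i t_i^\top\big)^{-1} \big(\sum_{i \in B} w_i t_i y_i\big)$.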

3.4. Validation of non-informative B-sample selection

Of the validity conditions discussed above, the critical assumption is non-informative B-sample selection, which can be stated in various forms. For instance, given the non-informativeness assumption (17), an additional assumption like (18) can in principle be validated empirically. However, the non-informativeness condition may not hold exactly, and it is generally impossible to verify it based only on the data used for the estimation. Below we discuss the issue in more detail.

Consider first the propensity model $p_i = p(x_i; \eta)$ under the QR approach. Suppose $x_U$ is known, to avoid additional complications otherwise. The census score equation is

$$\sum_x \frac{\partial p(x; \eta)}{\partial \eta} \Big[ \frac{n_{xB}}{p(x; \eta)} - \frac{N_x - n_{xB}}{1 - p(x; \eta)} \Big] = 0 ,$$

which is always satisfied by $p(x; \hat{\eta}) = n_{xB} / N_x$, i.e., the saturated model. Insofar as a non-saturated model of $p(x_i; \eta)$ does not fit perfectly to the data, one can always attribute its cause to the non-saturated functional form of $p(x_i; \eta)$, instead of rejecting the assumption that the set of covariates $x_i$ ‘fully govern the sampling mechanism’. In this sense, the validity of the latter assumption cannot be refuted empirically.
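The point can be illustrated numerically. In the sketch below, the bracketed term of the score equation vanishes in every cell under the saturated fit $n_{xB}/N_x$, whereas a constant-propensity model (used here only as one example of a non-saturated form) solves its own score equation while misfitting each individual cell. The cell counts are hypothetical, and $0 < n_{xB} < N_x$ is assumed so that no division by zero occurs.

```python
def score_bracket(n_xB, N_x, p):
    """Bracketed term of the census score equation for one covariate cell x."""
    return n_xB / p - (N_x - n_xB) / (1.0 - p)


# Hypothetical cells: x -> (n_xB, N_x)
cells = {"x1": (30, 120), "x2": (55, 80), "x3": (5, 400)}

# Saturated model: one propensity per cell, p_x = n_xB / N_x
saturated = {x: score_bracket(n, N, n / N) for x, (n, N) in cells.items()}

# Constant-propensity model: p = n_B / N for every cell, d p / d eta = 1
p_pooled = sum(n for n, _ in cells.values()) / sum(N for _, N in cells.values())
pooled = {x: score_bracket(n, N, p_pooled) for x, (n, N) in cells.items()}
pooled_score = sum(pooled.values())
```

The non-zero entries of `pooled` show lack of fit of the constant functional form, but they say nothing about whether the cell variable itself governs the selection mechanism.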

Next, for the SP approach, where both $\delta_i$ and $y_i$ are treated as random, assume the B-sample inclusion probability $p_i$ depends on $x_i$, where $x_i$ is known for $i \in U$ to avoid extra complications. For goodness-of-fit checks, let $z_i$ be a known covariate that is distinct from $x_i$. We have

$$E(z_B) = \sum_{i \in U} p_i z_i = \sum_x p(x; \eta) \sum_{i \in U_x} z_i = \sum_x p(x; \eta) N_x \bar{Z}_x , \qquad Z = E\Big[ \sum_{i \in U} \delta_i z_i / p_i \Big] = E\Big[ \sum_x n_{xB} \bar{z}_{xB} / p(x; \eta) \Big] ,$$

where $\bar{Z}_x = \sum_{i \in U_x} z_i / N_x$ and $\bar{z}_{xB} = \sum_{i \in B_x} z_i / n_{xB}$. The two observed checks are

$$z_B = \sum_x n_{xB} \bar{z}_{xB} = \sum_x \hat{p}_x N_x \bar{Z}_x \quad \text{and} \quad Z = \sum_x n_{xB} \bar{z}_{xB} / \hat{p}_x ,$$

which, if $z_i \equiv 1$, reduce to $\sum_{i \in U} \hat{p}_i = n_B$ and $\sum_{i \in B} 1/\hat{p}_i = N$. Setting $\hat{p}_x = n_{xB} / N_x$, which fits the assumption $p_i = p(x_i; \eta)$, both checks are satisfied given $\bar{Z}_x = \bar{z}_{xB}$, i.e., the B-sample expansion estimate of $Z_x$ is perfect for all $x$. This would suggest that the NPA assumption (17) holds for $z_i$ given $x_i$, and it may be considered to support the plausibility of the NPA assumption (17) for $y_i$ given $x_i$, provided $z_i$ is known to be correlated with $y_i$, but not otherwise. However, in situations where such a covariate $z_i$ is available, it seems natural that it should be used in the estimation of $Y$ to start with. The two checks then amount to the case of $z_i \equiv 1$ and are satisfied trivially by setting $\hat{p}_x = n_{xB} / N_x$. Thus one is faced with a dilemma: building the best model for estimation would at the same time reduce the ability to verify it.

4. Using additional probability sample of outcomes

So far we have considered only situations where the outcome values of interest are observed only in the non-probability sample B. Obviously, the situation changes completely given, in addition, a probability sample of outcomes. Below we briefly discuss two different approaches to inference in the absence of any relevant covariates. The ideas remain the same in situations with additional covariates.

The first approach aims at consistent estimation combining the two samples, as discussed, for example, in Tam and Kim (2018a, 2018b), where the probability sample is taken from the whole population and overlaps with the B-sample. These authors also discussed additional issues such as measurement errors or nonresponse. Here we discuss the situation where the probability sample is taken from the B-sample complement population. Given the non-probability sample observations $y_B$, one may treat $(B, y_B)$ as fixed, and select a second supplementary sample from the rest of the population, denoted by $S \subset U \setminus B$. Given the S-sample observations of the outcome, denoted by $y_S$, it is straightforward to obtain a test for $H_0: \bar{Y} = \bar{y}_B$ vs. $H_1: \bar{Y} \neq \bar{y}_B$, given as

$$D = (\bar{y}_B - \hat{\bar{Y}}_{B^c})^2 / \hat{V}(\hat{\bar{Y}}_{B^c}) \sim \chi^2_1 ,$$

where $\hat{\bar{Y}}_{B^c}$ is an S-sample estimator of the population mean outside of the B-sample, i.e., $\bar{Y}_{B^c} = \sum_{i \in U \setminus B} y_i / (N - n_B)$, and $\hat{V}(\hat{\bar{Y}}_{B^c})$ is the associated variance estimator. If $H_0$ is not rejected, then there is the possibility of using $\bar{y}_B$ as an estimate on its own, without regular concurrent surveys in future. This would achieve the greatest cost savings. To this end, one may consider $S$ as a particular form of audit sampling, whose aim is to validate the big-data estimate $\bar{y}_B$ and to provide a meaningful accuracy measure that can accommodate its potential bias. Zhang (2019) develops an approach to audit sampling inference for big data statistics.
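A minimal sketch of the test statistic $D$ follows. The text leaves the design of $S$ open; purely for illustration, the sketch assumes that $S$ is a simple random sample without replacement from $U \setminus B$, so that the sample mean and the usual variance estimator with finite population correction apply. The function name and the data are hypothetical. The p-value uses the fact that a $\chi^2_1$ variable is the square of a standard normal one.

```python
import math


def audit_test(ybar_B, y_S, N, n_B):
    """Test H0: Ybar = ybar_B from an audit sample S.

    Assumption (not from the article): S is a simple random sample
    without replacement from the complement population U minus B.
    """
    n_S = len(y_S)
    N_c = N - n_B                            # size of the complement population
    ybar_S = sum(y_S) / n_S                  # estimates the mean outside B
    s2 = sum((y - ybar_S) ** 2 for y in y_S) / (n_S - 1)
    v_hat = (1.0 - n_S / N_c) * s2 / n_S     # SRS variance estimator with fpc
    D = (ybar_B - ybar_S) ** 2 / v_hat
    p_value = math.erfc(math.sqrt(D / 2.0))  # P(chi-square with 1 df > D)
    return D, p_value


# Hypothetical data: N = 10000, n_B = 9000, audit sample of five units
y_S = [1.0, 2.0, 3.0, 4.0, 5.0]
D_same, p_same = audit_test(3.0, y_S, N=10000, n_B=9000)  # ybar_B equals the S mean
D_off, p_off = audit_test(5.0, y_S, N=10000, n_B=9000)    # ybar_B is two units away
```

With these numbers the second call gives a value of $D$ above 3.84, the 5% critical value of $\chi^2_1$, so $H_0$ would be rejected at that level; the first call gives $D = 0$.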

Let $W_B = n_B / N$. A consistent estimator of $\bar{Y}$ using both samples is given by

$$\hat{\bar{Y}}_S = W_B \bar{y}_B + (1 - W_B) \bar{y}_w \quad \text{and} \quad \bar{y}_w = \frac{\sum_{i \in S} y_i / \pi_i}{\sum_{i \in S} 1 / \pi_i} ,$$

where $\pi_i$ is the S-sample inclusion probability, and the validity of $\hat{\bar{Y}}_S$ now derives from probability sampling of $S$, regardless of how the B-sample is generated. The relative efficiency (RE) against the setting without the B-sample can be given by $\mathrm{RE} = [(1 - W_B)^2 V(\bar{y}_w)] / V(\hat{\bar{Y}})$, where $\hat{\bar{Y}}$ is the estimator based on a hypothetical probability sample from the whole population $U$, which has the same sample size and the same sampling design as $S$. One may refer to this as the split-population approach to inference, which is an age-old idea for combining survey sampling with administrative data. The efficiency gain would be substantial provided the B-sample is large. In fact, the larger the B-sample, the greater is the efficiency gain.
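The split-population estimator and the RE expression can be sketched as follows. The function names and all numbers are illustrative assumptions; in particular, the two variances passed to `relative_efficiency` are taken as given rather than estimated.

```python
def weighted_mean(y_S, pi_S):
    """y_w = sum(y_i / pi_i) / sum(1 / pi_i) over the S-sample."""
    return sum(y / p for y, p in zip(y_S, pi_S)) / sum(1.0 / p for p in pi_S)


def split_population_mean(ybar_B, n_B, N, y_S, pi_S):
    """W_B * ybar_B + (1 - W_B) * y_w, with W_B = n_B / N."""
    W_B = n_B / N
    return W_B * ybar_B + (1.0 - W_B) * weighted_mean(y_S, pi_S)


def relative_efficiency(W_B, v_w, v_full):
    """RE = (1 - W_B)^2 V(y_w) / V(full-population estimator)."""
    return (1.0 - W_B) ** 2 * v_w / v_full


# Hypothetical numbers: the B-sample covers 80% of the population
est = split_population_mean(ybar_B=10.0, n_B=800, N=1000,
                            y_S=[4.0, 6.0], pi_S=[0.01, 0.01])
re = relative_efficiency(W_B=0.8, v_w=1.0, v_full=1.0)
```

If the two variances happened to be equal, a B-sample covering 80% of the population would reduce the variance to about 4% of that of the stand-alone probability sample, which illustrates why the gain grows with the size of the B-sample.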

Under the second approach to inference, consider a composite estimator given by $\hat{\bar{Y}}_C = \gamma \bar{y}_B + (1 - \gamma) \bar{y}_w$, where $\gamma$ is the composition weight, for $W_B \le \gamma \le 1$. Notice that when $\gamma = W_B$, the composite estimator is just the split-population estimator $\hat{\bar{Y}}_S$ above, which is consistent for $\bar{Y}$. As $\gamma$ increases from $W_B$ towards one, one risks introducing greater bias, insofar as $\bar{y}_B \neq \bar{Y}$. However, the composite estimator may yield a smaller mean squared error (MSE) of estimation, provided this is desirable. One is then essentially trading the increasing bias $(\gamma - W_B)(\bar{y}_B - \bar{Y}_{B^c})$ against the decreasing standard error $(1 - \gamma) \mathrm{SE}(\bar{y}_w)$, as $\gamma$ increases. The composite estimator that achieves the minimum MSE is given by

$$\gamma = \frac{V(\bar{y}_w) + W_B (\bar{y}_B - \bar{Y}_{B^c})^2}{V(\bar{y}_w) + (\bar{y}_B - \bar{Y}_{B^c})^2} .$$

Estimating $\bar{Y}_{B^c}$ by $\bar{y}_w$ in application, one can use

$$\hat{\gamma} = \min\big( W_B + (1 - W_B) \hat{V}(\bar{y}_w) / (\bar{y}_B - \bar{y}_w)^2 , \; 1 \big) .$$

The validity of the composite approach derives from probability sampling of $S$, regardless of how the B-sample is generated. Again, the bigger the B-sample, the better it is.
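The estimated composition weight $\hat{\gamma}$ and the resulting composite estimate can be sketched as below. The function name and inputs are illustrative assumptions, and the treatment of the case $\bar{y}_B = \bar{y}_w$ (where the ratio is unbounded and the cap at one applies) is a choice made here rather than something stated in the text.

```python
def composite_mean(ybar_B, ybar_w, v_w, W_B):
    """Return (gamma_hat, composite estimate).

    gamma_hat = min(W_B + (1 - W_B) * v_w / (ybar_B - ybar_w)^2, 1)
    """
    diff2 = (ybar_B - ybar_w) ** 2
    if diff2 == 0.0:
        gamma = 1.0  # the ratio is unbounded, so the cap at one applies
    else:
        gamma = min(W_B + (1.0 - W_B) * v_w / diff2, 1.0)
    return gamma, gamma * ybar_B + (1.0 - gamma) * ybar_w


# Hypothetical inputs: the two means differ by 2, W_B = 0.5
g_precise, est_precise = composite_mean(10.0, 8.0, v_w=1.0, W_B=0.5)
g_noisy, est_noisy = composite_mean(10.0, 8.0, v_w=10.0, W_B=0.5)
```

When $\hat{V}(\bar{y}_w)$ is small relative to the squared difference, $\hat{\gamma}$ stays close to $W_B$; when the probability sample is too noisy to detect a difference of that size, $\hat{\gamma}$ is capped at one and the estimate reduces to $\bar{y}_B$.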

5. Summary

All the estimators from non-probability sample observations reviewed in Section 2 are model based, whether the modelling is carried out under the SP or QR approach. Two features regarding the model covariate $x_i$, for $i \in U$, are worth recapitulating:

  • compared to the situation with known $x_U$, making use of an additional probability sample $x_S$ entails a loss of efficiency, as can be expected;

  • the availability of an additional probability sample without the outcome variable is not a principal advantage, since it does not simplify the validity conditions compared to the situation where $x_U$ is known, but it does resolve the practical difficulty when $x_U$ is unavailable yet some functions of $x_U$ are needed for descriptive inference.

The situation changes completely given, in addition, a probability sample of outcomes. The probability sample then enables valid descriptive inference in combination with the non-probability sample. Depending on the circumstances, the probability sample can be selected either from the whole population or just from the rest of the population outside the non-probability sample.

Finally, in situations where the non-probability sample is large, the cost savings will be the greatest if it can replace regular survey sampling altogether. Use of an additional probability audit sample is needed to validate the non-probability sample estimate, in spite of possible failure of its underlying model assumptions, and to provide a meaningful accuracy measure that can accommodate its potential bias.

Disclosure statement

No potential conflict of interest was reported by the author.

Additional information

Notes on contributors

Li-Chun Zhang

Li-Chun Zhang is Professor of Social Statistics at University of Southampton, Senior Researcher at Statistics Norway, and Professor of Official Statistics at University of Oslo. His research interests include finite population sampling design and coordination, graph sampling, sample survey estimation, non-response, measurement errors, small area estimation, index number calculations, editing and imputation, register-based statistics, analysis of integrated data, statistical matching, record linkage, population size estimation.

References

  • Baker, R., Brick, J. M., Bates, N. A., Battaglia, M., Couper, M. P., Dever, J. A., …Tourangeau, R. (2013). Report of the AAPOR task force on non-probability sampling (Technical Report). Deerfield, IL: American Association for Public Opinion Research.
  • Bethlehem, J. (1988). Reduction of nonresponse bias through regression estimation. Journal of Official Statistics, 4, 251–260.
  • Chen, J. H., & Shao, J. (2000). Nearest neighbor imputation for survey data. Journal of Official Statistics, 16, 113–131.
  • Deville, J.-C., & Särndal, C.-E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association, 87, 376–382. doi: 10.1080/01621459.1992.10475217
  • Elliott, M. R., & Valliant, R. (2017). Inference for nonprobability samples. Statistical Science, 32, 249–264. doi: 10.1214/16-STS598
  • Kang, J. D. Y., & Schafer, J. L. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data (with discussion). Statistical Science, 22, 523–539. doi: 10.1214/07-STS227
  • Kim, J. K., & Haziza, D. (2014). Doubly robust inference with missing data in survey sampling. Statistica Sinica, 24, 375–394.
  • Kim, J.-K., & Rao, J. N. K. (2018). Data integration for big data analysis in finite population inference. Talk presented at SSC2018. Montreal.
  • Kim, J.-K., & Wang, Z. (2018). Sampling techniques for big data analysis in finite population inference. Retrieved from arXiv:1801.09728v1
  • Little, R. J. A. (1982). Models for nonresponse in sample surveys. Journal of the American Statistical Association, 77, 237–250. doi: 10.1080/01621459.1982.10477792
  • Meng, X. L. (2018). Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and 2016 US presidential election. The Annals of Applied Statistics, 12, 685–726. doi: 10.1214/18-AOAS1161SF
  • Oh, H. L., & Scheuren, F. J. (1983). Weighting adjustments for unit non-response. In W. G. Madow, I. Olkin & D. B. Rubin (Eds.), Incomplete data in sample surveys (Vol. 2): Theory and bibliographies (pp. 143–184). New York: Academic Press.
  • Pfeffermann, D. (2017). Bayes-based non-Bayesian inference on finite populations from non-representative samples. Calcutta Statistical Association Bulletin, 69, 1–29. doi: 10.1177/0008068317696546
  • Pfeffermann, D., Krieger, A. M., & Rinott, Y. (1998). Parametric distributions of complex survey data under informative probability sampling. Statistica Sinica, 8, 1087–1114.
  • Rao, J. N. K. (1966). Alternative estimators in PPS sampling for multiple characteristics. Sankhya, 28, 47–60.
  • Rivers, D. (2007). Sampling for web surveys. Proceedings of the survey research methods section. American Statistical Association.
  • Robins, J. M., Rotnitzky, A., & Zhao, L. P. (1994). Estimation of regression coefficient when some regressors are not always observed. Journal of the American Statistical Association, 89, 846–866. doi: 10.1080/01621459.1994.10476818
  • Rosenbaum, P., & Rubin, D. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55. doi: 10.1093/biomet/70.1.41
  • Royall, R. M. (1970). On finite population sampling theory under certain linear regression models. Biometrika, 57, 377–387. doi: 10.1093/biomet/57.2.377
  • Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592. doi: 10.1093/biomet/63.3.581
  • Smith, T. M. F. (1983). On the validity of inferences from non-random sample. Journal of the Royal Statistical Society, Series A, 146, 394–403. doi: 10.2307/2981454
  • Tam, S.-M., & Kim, J.-K. (2018a). Big data ethics and selection-bias: An official statistician's perspective. Statistical Journal of the IAOS, 34, 577–588. doi: 10.3233/SJI-170395
  • Tam, S.-M., & Kim, J.-K. (2018b). Mining big data for finite population inference. Talk presented at BigSurv18. Barcelona.
  • Yang, S., & Kim, J.-K. (2018). Integration of survey data and big observational data for finite population inference using mass imputation. Retrieved from https://arxiv.org/abs/1807.02817v1
  • Zhang, L.-C. (2019). Proxy expenditure weights for consumer price index: Audit sampling inference for big data statistics. Retrieved from https://arxiv.org/abs/1906.11208
