![MathJax Logo](/templates/jsp/_style2/_tandf/pb2/images/math-jax.gif)
ABSTRACT
Standard statistical methods without taking proper account of the complexity of a survey design can lead to erroneous inferences when applied to survey data due to unequal selection probabilities, clustering, and other design features. In particular, the Type I error rates of hypotheses tests using standard methods can be much larger than the nominal significance level. Methods incorporating design features in testing hypotheses have been proposed, including Wald tests and quasi-score tests that involve estimated covariance matrices of parameter estimates. In this article, we present a unified approach to hypothesis testing without requiring estimated covariance matrices or design effects, by constructing bootstrap approximations to quasi-likelihood ratio statistics and quasi-score statistics and establishing its asymptotic validity. The proposed method can be easily implemented without a specific software designed for complex survey sampling. We also consider hypothesis testing for categorical data and present a bootstrap procedure for testing simple goodness of fit and independence in a two-way table. In simulation studies, the Type I error rates of the proposed approach are much closer to their nominal significance level compared with the naive likelihood ratio test and quasi-score test. An application to an educational survey under a logistic regression model is also presented. Supplementary materials for this article are available online.
1 Introduction
Testing statistical hypotheses is one of the fundamental problems in statistics. In the parametric model approach, hypothesis testing can be implemented using Wald test, likelihood ratio test, or score test. In each case, a test statistic is calculated and then compared with the -quantile of the reference distribution, which is the limiting distribution of the test statistic under the null hypothesis, where α is the nominal significance level. The limiting distribution is often a Chi-squared distribution (Shao Citation2003).
However, statistical inference with survey data involves additional steps incorporating the sampling design features. Korn and Graubard (Citation1999) and Chambers and Skinner (Citation2003) provided comprehensive overviews of the methods for analyzing survey data from complex sampling designs. In hypothesis testing with survey data, the limiting distribution of the test statistic is generally not a Chi-squared distribution. Rather, it can be expressed as a weighted sum of several independent random variables from , which is a Chi-square distribution with one degree of freedom, and the weights depend on unknown parameters. To handle such problems, one may consider some corrections to the test statistics to obtain a Chi-square limiting distribution approximately. Such an approach usually involves computing “design effects” associated with test statistics (Lumley and Scott Citation2014).
In this article, we use a different approach to computing the limiting distribution using a bootstrap approximation. Beaumont and Bocci (Citation2009) investigated a general weighted bootstrap method for hypothesis testing under complex sampling designs in the context of Wald tests for linear regression analysis. In this article, we generalize the bootstrap testing idea of Beaumont and Bocci (Citation2009) and present a unified bootstrap weight approach to obtain the limiting distribution of test statistics under complex sampling designs, including stratified multi-stage sampling with clusters selected with replacement.
The proposed method is a bootstrap weight method (Mashreghi, Haziza, and Léger Citation2016) under complex survey sampling. Although Praestgaard and Wellner (Citation1993) and Chatterjee and Bose (Citation2005) rigorously investigate bootstrap weight methods under regularity conditions, both may provide erroneous inferences under complex survey sampling. Praestgaard and Wellner (Citation1993) established central limit theorems for the weighted bootstrap empirical process, but they assumed that the sample is independently and identically distributed (iid). Although it is common to assume that the finite population is iid, elements in a sample may not be due to complex sampling designs; see Pfeffermann (Citation1993) for details. In addition, Praestgaard and Wellner (Citation1993) only focused on mean estimation, but we are interested in hypothesis testing for parameters that solve a certain estimating equation. Chatterjee and Bose (Citation2005) proposed a bootstrap weight method for estimating functions, and their theoretical results are valid even for correlated random variables. However, the bootstrap weight method of Chatterjee and Bose (Citation2005) is not applicable to informative sampling designs, since survey weights are not taken into consideration; see Pfeffermann (Citation1993) for details.
There exist bootstrap weight methods under complex survey sampling, but the proposed bootstrap method differs from them in the following aspects. Rao, Wu, and Yue (Citation1992) proposed a bootstrap weight method to estimate the variance of a function of population total estimators under stratified random sampling, but did not investigate the theoretical properties of their bootstrap method. Chipperfield and Preston (Citation2007) proposed a without replacement scaled bootstrap to achieve the same goal as Rao, Wu, and Yue (Citation1992) under stratified random sampling, but their method is only applicable when the parameter of interest is a smooth function of population totals. Moreover, neither method is applicable to other complex survey sampling. Bertail and Combris (Citation1997) proposed a bootstrap weight method to make inferences for population means under general sampling designs, and the corresponding theoretical properties were investigated. However, their results are not applicable to parameters obtained by solving an estimating equation. Although Beaumont and Patak (Citation2012) developed a general bootstrap weight method for parameter solving an estimating equation in general sampling designs, they focused only on variance estimation without a rigorous discussion about hypothesis testing. Antal and Tillé (Citation2011) proposed a one-one bootstrap method for variance estimation under general sampling designs. Similar to the one-one bootstrap method, Antal and Tillé (Citation2014) developed a doubled half-bootstrap to estimate the variance of the estimator for population means. However, neither Antal and Tillé (Citation2011) nor Antal and Tillé (Citation2014) considered parameters solving an estimating equation. Moreover, existing work did not establish limiting distributions for the corresponding bootstrap estimator, especially for the one solving an estimating equation. Thus, up to our knowledge, there is no theoretical guarantee for existing methods when used for hypothesis testing.
Since the use of bootstrap for hypothesis testing is not fully investigated in the literature, our goal is to fill this important gap under general sampling designs. To do this, we first establish a bootstrap central limit theorem under regularity conditions. The sampling design is allowed to be informative in the sense that the sampling mechanism depends on the response variables being sampled (Pfeffermann Citation1993). Once the bootstrap central limit theorem is established, the proposed bootstrap method can be applied to the quasi-likelihood-ratio test and the quasi-score test. As long as the bootstrap central limit theorem holds for the sampling design, the bootstrap replicates of the test statistics can be used to approximate their distributions under the null hypothesis. Besides, this article also has a practical contribution to survey sampling, since bootstrap weights are commonly provided in agencies, including Statistics Canada.
The article is organized as follows. In Section 2, the basic setup is introduced. The proposed bootstrap method is presented in Section 3. In Section 4, the quasi-likelihood ratio test and the quasi-score test are introduced. Test for categorical survey data is briefly discussed in Section 5. The results of two simulation studies are presented in Section 5. An application to data from an educational survey under a logistic regression model is presented in Section 6. Concluding remarks are made in Section 7.
2 Basic Setup
Suppose that a finite population is a random sample of size N from a super-population model (Isaki and Fuller Citation1982; Pfeffermann Citation1993; Sverchkov and Pfeffermann Citation2004; Rubin-Bleuer and Schiopu-Kratina Citation2005), where yi is the response of interest, and
is the corresponding auxiliary vector of length d, and
is the vector of design variables such that the first-order inclusion probability for unit i can be obtained by
, where
is a design parameter determined by
. In other literature, the inclusion probability is expressed as
; see Sverchkov and Pfeffermann (Citation2004) for details.
We are interested in hypothesis testing for a parameter , where
is the parameter space,
is the random variable associated with elements of
, the expectation is taken with respect to the super-population model, and
is a pre-defined objective function. For example, if the finite population is a random sample generated from a super-population model with a density function
and if
is the parameter of interest, then
. For regression problems,
can be the parameter in the conditional expectation of the response of interest given the auxiliary vectors.
From the finite population , a probability sample
of size n is selected by a probability sampling design with the first-order inclusion probability πi, where A is the set of sample indexes. The design variables
and the design parameter
are not available at the sample level. From the sample, we can compute a pseudo M-estimator of
by
where
is the survey-weighted objective function and
is the sampling weight for
. Note that
is design-unbiased for
in the sense that
, where the conditional expectation conditioning on
is with respect to the sampling mechanism. To be rigorous, a subscript N should also be used to index the sample, the sampling weights, the estimator and other quantities, but it is suppressed in the sequel for simplicity.
Remark 1.
When discussing a finite population in the preceding paragraph, we implicitly assume that elements of are iid. Such an assumption is widely adopted in survey sampling; see Corollary 1.3.2.1 and Section 2.2.1 of Fuller (Citation2009) for details. Besides, under a linear regression model,
(1)
(1) with
, the finite population
satisfies the iid assumption; see Särndal, Swensson, and Wretman (Citation1989, sec. 5.1), Breidt and Opsomer (Citation2000), Särndal, Swensson, and Wretman (Citation2003, sec. 6.4), Kim et al. (Citation2006, sec. 5), Toth and Eltinge (Citation2011), Wu and Thompson (Citation2020, sec. 5.1) and Han and Wellner (Citation2021) for more general setups about the super-population model. If we are interested in the regression parameter
in (1), an objective function is
.
Although we implicitly make an iid assumption for the finite population, elements in the sample A may not be iid due to complex survey designs. That is, the informative sampling no longer preserves the iid structure in the sample; see Pfeffermann (Citation1993) for details. Under certain sampling designs, we can relax the iid assumption, and the super-population model can involve cluster random effects (Rubin-Bleuer and Schiopu-Kratina Citation2005); refer to Section S4 of supplementary material for an extension to stratified multi-stage sampling with cluster random effects. We focus on element sampling mainly in this article.
Since the parameter of interest is the superpopulation parameter, the final sample can be viewed as a realization of the two-phase sampling design, where the finite population is the first-phase sample and the final sample is the second-phase sample. Chen and Rao (Citation2007) and Rubin-Bleuer and Schiopu-Kratina (Citation2005) present rigorous asymptotic setups for two-phase sampling.
Often, the pseudo M-estimator can be obtained by solving
(2)
(2) where
is the tangent function of
. To discuss the asymptotic properties of the pseudo M-estimator, we assume the following regularity conditions for a sequence of finite populations and samples (Isaki and Fuller Citation1982).
C1. The parameter space Θ is an open and convex set containing as an interior point. For
is concave and twice-continuously differentiable with respect to
, and
is uniquely maximized at
.
C2. For in probability with respect to the super-population model as
, where
is the design variance of
conditional on the finite population
. In addition, there exists a nonnegative design variance estimator
, such that
almost surely.
C3. There exists a compact set containing
as an interior point, such that
in probability with respect to the super-population model as
, where
is the design variance of
conditional on the finite population,
is the operator norm of an m × m matrix
associated with the Euclidean vector norm
, and
is a positive-definitive nonstochastic matrix.
C4. A central limit theorem holds for the survey-weighted score function in (2). Specifically,
in distribution as
with respect to the super-population model and the sampling mechanism.
C5. The weak convergence of is uniform in
, where
. Specifically,
in probability with respect to the super-population model and the sampling mechanism as
, where
, and
is invertible.
Conditions C1–C2 are sufficient conditions for the weak consistency of . Conditions C3–C5 are used to establish the asymptotic normality of
. In Conditions C3–C4, we consider the case where
is asymptotically negligible compared with
, and it is often the case when
as
. That is, we can safely ignore the variability for generating the finite population and focus on the design variance of
given the finite population. This assumption is a building block of the proposed bootstrap method in the next section, and a similar assumption is also proposed by Beaumont and Bocci (Citation2009); see Assumption 5 and its discussion of Beaumont and Bocci (Citation2009) for details.
As mentioned by Han and Wellner (Citation2021), “establishing a finite-dimensional CLT … can be a nontrivial problem for general sampling designs,” and it is beyond our scope to prove Condition C4 under general sampling designs. In fact, there is no guarantee that the central limit theorem holds for general sampling designs. Instead, we simply consider the set of sampling designs such that the central limit theorem holds; see Hájek (Citation1964), Rosén (Citation1972a), Rosén (Citation1972b), Berger (Citation1998), Rubin-Bleuer and Schiopu-Kratina (Citation2005), sections 1.3.3–1.3.4 of Fuller (Citation2009), Chauvet (Citation2015), and Han and Wellner (Citation2021) for details. For simplicity, we use the same notation, “”, to denote the matrix norm and the vector norm. Detailed comments on these regularity conditions are presented in Section S1 of supplementary material. For the proposed conditions, we implicitly assume that
as
. The following theorem establishes the asymptotic normality of
. Its proof can be found in Section S2 of supplementary material.
Theorem 1.
Under Conditions C1–C5, we have
(3)
(3) in distribution as
with respect to the joint distribution of the super-population model and the sampling mechanism, where
Now, by the second-order Taylor expansion, we obtain
(4)
(4) where
Define
(5)
(5)
By (4), we obtain
Thus, by Theorem 1, we have
(6)
(6) in distribution as
, and the reference distribution is the joint distribution of the super-population model and the sampling mechanism, where
are the eigenvalues of
, and
are p independent random variables from the standard normal distribution. Result (6) was established by Rao and Scott (Citation1984) and Lumley and Scott (Citation2014), and it can be regarded as a version of the Wilks’ theorem for survey sampling. For p = 1, we can use
as the test statistic with
distribution as the limiting distribution under the null hypothesis, where
, and
and
are the counterparts of
and
, respectively. Unless the sampling design is simple random sampling and the sampling fraction is negligible, the limiting distribution does not reduce to the standard Chi-squared distribution with p degrees of freedom.
3 Bootstrap Calibration
We propose using a bootstrap weight method to approximate the limiting distribution in (6) and other test statistics under complex survey sampling. Such a bootstrap calibration is very attractive in practice since there is no need to derive the analytic form of the limiting distribution of the test statistic. That is, standard software that takes account of the weights can be used to implement bootstrap hypothesis testing from data files reporting weights and associated sets of bootstrap weights, as done in Statistics Canada and some other agencies. Given the sample A, the proposed bootstrap method is as follows:
Step 1 Generate a rescaling vector
of size
from a bootstrap distribution with
, and the covariance of
is determined by the sampling design, where
is the cardinality of set A,
is the expectation with respect to the bootstrap distribution conditional on the sample A, and
is a vector of 1 with length n. The bootstrap weights are chosen in such a way that
and
, where
is the bootstrap variance conditional on the sample A, and
is a nonnegative design variance estimator of
conditional on the finite population
.
Step 2 Obtain a bootstrap replicate of
, denoted as
, by solving
.
Different distributions may be used to generate under different sampling designs, but the generation of
only depends on the inclusion probabilities and not on the specific form of the estimating function for parameter estimation. Thus, once the bootstrap weights are generated, they can be used for various inference problems based on the sample A. However, the commonly used linearization technique is based on the specific form of the estimating function, so different linearization is required for different hypothesis testing problems. Thus, compared with the linearization technique, the proposed bootstrap method is more appealing to the practitioners.
We now present additional regularity conditions to validate the proposed bootstrap method.
C6. For , the bootstrap weights are nonnegative and satisfy
and
conditional on the sample A, where
satisfies
in probability with respect to the super-population model and the sampling mechanism as
.
C7. A central limit theorem holds for the bootstrap weighted score function uniformly over
. Specifically,
(7)
(7) in distribution uniformly over
, and the reference distribution is the bootstrap sampling distribution conditional on the sample A.
C8. in probability with respect to the bootstrap sampling distribution conditional on the sample A as
, where
.
Condition C6 guarantees that the scaled bootstrap variance of converges to
in Condition C3, and they are also used to show the bootstrap central limit theorem in Condition C7. Condition C8 is needed to establish the asymptotic normality of the bootstrap estimator
. To check Condition C7, we may use a bootstrap version of the Lyapounov condition. As discussed in Condition C4, we do not claim that Condition C7 holds under general sampling designs, but we can validate both under certain sampling designs. For example, in Section S4 of the supplementary material, we use a Lyapounov-type condition to prove the two central limit theorems under stratified multi-stage sampling. Detailed comments on these regularity conditions are presented in Section S1 of the supplementary material. The high-level conditions are also verifiable under other commonly used sampling designs, including Poisson sampling.
Theorem 2.
Let the assumptions for Theorem 1 hold. Then, under Conditions C6–C8, we have
(8)
(8) in distribution as
, and the reference distribution on the left side of (8) is the bootstrap sampling distribution conditional on the sample A, where
is defined in Theorem 1.
Theorem 2 is proved under general sampling designs; see Section S3 of the supplementary material for details. Theorem 2 establishes the bootstrap central limit theorem for , and the limiting distribution coincides with the one in (3) for
. Thus, the proposed bootstrap method can be used to approximate the sampling distribution of
.
For estimating the mean of a finite population, Chao and Lo (Citation1985) first proved a bootstrap central limit theorem under simple random sampling without replacement, and Bickel and Freedman (Citation1984) proved the bootstrap central limit theorem under stratified random sampling without replacement. We now present stratified multi-stage sampling with clusters selected with replacement as an example to validate the bootstrap central limit theorem in Theorem 2. Stratified multi-stage sampling with clusters selected with replacement is extensively used in socio-economic surveys when the first-stage sampling fractions within strata are negligible, as is often the case with large-scale sample surveys.
Example 1.
Under stratified multi-stage sampling, a popular approach is to assume that clusters are selected with replacement from the hth stratum, with the selection probabilities phi satisfying
for
(Rao and Wu Citation1988), where Nh is the number of clusters in the hth stratum. As mentioned in Section 2, the selection probabilities can be random with respect to the super-population model. Within each sampled cluster, denote
as a design-unbiased estimator of the cluster mean under sampling at the second and subsequent stages. For simplicity, we still use
for the case when the (hi)th cluster is fully observed. Then, the weighted score function is
, where
, Mhi is the size of the (hi)th cluster,
, and a(hi) is the index of the ith selected cluster in the hth stratum.
The bootstrap replicate of can be obtained in the following way. For the hth stratum, generate
by a multinomial distribution with
trials and a success probability vector
of length nh. Then, the bootstrap weighted score function is
, where
,
, and
; Rao, Wu, and Yue (Citation1992) proposed a similar rescaling bootstrap method under stratified multi-stage sampling, but they did not investigate the corresponding theoretical properties with respect to hypothesis testing of super-population model parameters. By Lemma S7 in Section S4 of supplementary material, we conclude that
and
, where
is the projection matrix to the linear space spanned by
,
is a vector consisting of nh 1’s,
is an
identity matrix, and
In Section S4 of supplementary material, we show that the bootstrap central limit theorem in Theorem 2 holds with .
Under stratified simple random sampling without replacement (Bickel and Freedman Citation1984), a bootstrap method is proposed in Section S7 of the supplementary material, and it is essentially the same as the one considered by Rao and Wu (Citation1988) when estimating the population mean; see Rao and Wu (Citation1988) and Beaumont and Patak (Citation2012) for details.
To establish a bootstrap version of Wilks’ Theorem in (6), we use
as the bootstrap version of in (5). Under the assumptions for Theorem 2, we can show that
(9)
(9) in distribution as
, and the reference distribution is the bootstrap sampling distribution conditional on the sample A, where
is the limiting distribution (6) of the quasi-likelihood ratio test statistic
in (5) under the null hypothesis.
By (9), we conclude that the corresponding test statistic generated by the proposed bootstrap method has the same asymptotic distribution as in (5). Thus, we can use the bootstrap distribution of to approximate the sampling distribution of
.
Remark 2. A
s mentioned in Section 1, most existing bootstrap methods were proposed for variance estimation under survey sampling. There indeed exist some papers discussing bootstrap confidence intervals under survey sampling, but these methods do not apply to general hypothesis testing problems. For example, Rao, Wu, and Yue (Citation1992) proposed a bootstrap-t confidence interval for the parameter of interest, but their bootstrap confidence interval does not apply when the parameter of interest is multi-dimensional. In addition, their bootstrap method is only valid under the stratified simple random sampling. Beaumont and Patak (Citation2012) and Bertail and Combris (Citation1997) discussed bootstrap confidence intervals in their simulation studies, but it is not clear how the corresponding intervals are constructed. Also, neither Bertail and Combris (Citation1997) nor Beaumont and Patak (Citation2012) investigated their bootstrap confidence intervals theoretically. Theorem 2, on the other hand, is a building block to show that the proposed bootstrap method can be used for hypothesis testing under general sampling designs. Although we adopt almost the same technique to prove Theorems 1–2, we are the first to establish that the bootstrap estimator has the same asymptotic distribution as the original
. Then, we can show that the corresponding bootstrap statistic
shares the same asymptotic distribution as the original one
under the null hypothesis, so the proposed bootstrap method can be used for hypothesis testing without deriving the corresponding variance estimator as in Lumley and Scott (Citation2014). In addition, the proposed bootstrap method can be widely applied in commonly used sampling designs; see Sections S4–S5 of the supplementary material for details.
Remark 3.
In many practical situations, the design weights are further modified to incorporate population-level constraints (as in calibration weighting) or to handle unit nonresponse (as in nonresponse weighting adjustment). In this case, the bootstrap design weights can be modified in the same way to obtain the bootstrap final weights. See Bessonneau et al. (Citation2021) for a detailed illustration of constructing bootstrap final weights in household surveys. When the test statistic is computed using the final weights rather than the design weights, the bootstrap final weights can be applied to the same test statistics to obtain the bootstrap distribution of the test statistic.
4 Quasi-Likelihood-Ratio Test
Without loss of generality, we denote as the true parameter
in this section. Let
be the parameter space under the null hypothesis, so the null hypothesis can be written as
. In this section, we consider
for some known vector
, where
. Thus, we have
Let
be the profile pseudo M-estimator of
under
, which is obtained by maximizing
with respect to
.
The quasi-likelihood ratio test statistic for testing is defined as
(10)
(10) where
. Under simple random sampling with replacement,
in (10) is asymptotically distributed as
with
and
, and the reference distribution is the joint distribution of the super-population model and the sampling mechanism, where
is the dimension of Θ. The following lemma, proved by Lumley and Scott (Citation2014), presents the limiting distribution of the quasi-likelihood ratio test statistic in (10) for general sampling designs.
Lemma 1.
Under and the assumptions for Theorem 2,
(11)
(11) in distribution as
, and the reference distribution is the joint distribution of the super-population model and the sampling mechanism, where
are the eigenvalues of
,
are q independent random variables from the standard normal distribution,
is the asymptotic variance of
in (3),
and
for
.
Lumley and Scott (Citation2014) proposed to estimate the limiting distribution in (11) using a design-based estimator of .
We consider an alternative test using a novel application of the bootstrap method discussed in Section 3. To do this, a bootstrap version of the quasi-likelihood ratio test statistic in (10) is obtained as
(12)
(12) where
, and
is the second component of
. Under the conditions of Theorem 2, we can show that the limiting distribution for
is the same as that in (11). Thus, the bootstrap distribution using
in (12) can be used to obtain the reference distribution of the test statistic
in (10) under
.
Remark 4.
In addition to the quasi-likelihood ratio test, the proposed bootstrap method is also applicable to quasi-score tests (Rao, Scott, and Skinner Citation1998). To develop a quasi-score test for , we first decompose the survey-weighted score function as
(13)
(13) with
for j = 1, 2, where
is a known function with an unknown parameter
, and
is a known function of μi. Note that quasi-score tests clearly specify the model without requiring a correctly specified density function, so they are more important in practice. Let
be the solution to
where
is defined in (13). The quasi-score test statistic for
is
(14)
(14)
where
(15)
(15)
and
The bootstrap replicates of can be constructed by replacing the sampling weights wi and
, respectively, by the bootstrap weights
and the bootstrap estimator that solves
subject to
. Then, the bootstrap version of the quasi-score test statistic
is
(16)
(16) where
, and
is computed similarly to (15) using bootstrap weights.
5 Test for Categorical Survey Data
5.1 Goodness-of-Fit Test
Suppose that a finite population U is partitioned into K categories with . Define
to be the population proportion of the kth category, where Nk is the size of
. Thus, we implicitly assume that the finite population is an iid realization of the super-population model following a multinomial distribution with K categories.
Suppose that we are interested in the simple goodness-of-fit test for
, where
is a pre-specified vector satisfying
. From the sample A, we compute
as an estimator of pk, where
is a design-unbiased estimator of Nk,
, and
. The estimated proportions can be obtained by
, where
, with xik = 1 if the ith element belongs to the kth category and 0 otherwise,
.
Then, the Pearson Chi-squared goodness-of-fit test statistic for H0 is
where . Under regularity conditions, according to Rao and Scott (Citation1981), we have
(17)
(17) in distribution as
under H0, and the reference distribution is the joint distribution of the super-population model and the sampling mechanism, where
are the eigenvalues of the design effect matrix
is the asymptotic variance of
, and
are K – 1 independent random variables from the standard normal distribution. Under simple random sampling with replacement, the limiting distribution in (17) is a Chi-squared distribution with K – 1 degrees of freedom since
.
We now apply the proposed bootstrap method to approximate the limiting distribution in (17), without computing the estimated covariance matrix . To describe the proposed method, let
be the estimator of the population proportion
based on the bootstrap weights
. The proposed bootstrap statistic of
is
We use in place of
in the bootstrap test statistics. Under the conditions of Theorem 2, we can show that the limiting distribution for
is the same as that in (17).
5.2 Test of Independence in a Two-Way Table
The proposed bootstrap can also be used to test independence in a two-way count table. Let be the population proportion for cell (i, j) with margins
and
for
and
, where R and C are the numbers of rows and columns, and
is the set of population counts with margins
and
. Let
be a design-unbiased estimator of Nij and
. By assuming a multinomial distribution for the super-population model, the Chi-squared statistic for testing independence
for all i and j is
(18)
(18)
The limiting distribution of is a mixture of
distributions (Rao and Scott Citation1981).
To develop bootstrap replicates of the test statistics under H0, let be the bootstrap cell proportion computed using the bootstrap weights,
, and
. The proposed bootstrap version of
is
(19)
(19)
Under H0, the terms in the numerator of are identical to
. That is, the bootstrap test statistic is calculated by simply replacing
and
with
and
, respectively. It can be shown that the limiting distribution of
conditional on the sample is the same as that of
.
6 Simulation Study
6.1 Single-Stage Sampling
In this section, we test the performance of the proposed bootstrap method under the PPS (probability proportional to size) sampling with replacement. A finite population of size N is generated as follows. For ,
and
, where
is a uniform distribution on the interval (a, b),
, and
. To make the sampling design informative, the PPS sampling with replacement is carried out to generate a sample of size n with selection probabilities
and
, where Ii = 1 if
and 0 otherwise; see Section S6 of the supplementary material for details about the probability proportional to size sampling and the bootstrap weights. The sampling design is informative (Pfeffermann Citation1993), since the selection probabilities are partially determined by the residuals
. A similar model setup was also used by Verret, Rao, and Hidiroglou (Citation2015) in the context of small area estimation. We consider two scenarios:
and
.
We are interested in testing with the nominal significance level of
and consider three different values for
. The following testing methods are compared:
Naive likelihood ratio method with
in (10) and
as test statistic and reference distribution, respectively.
Naive quasi-score method with test statistic (14) and
as the reference distribution.
Lumley and Scott (Citation2014) method. The test statistic is
, where
is defined in (15), and
is obtained from
. The reference distribution is
, where
is an F distribution with parameters ν1 and ν2, k is the degrees of freedom of the variance estimator based on the sampling design, and k is obtained by subtracting the number of parameters from the effective sample size associated with the sampling design.
Bootstrap quasi-likelihood-ratio method with
in (10) being the test statistic, and the reference distribution is approximated by the empirical distribution of
in (12).
Bootstrap quasi-score method with
in (14) being the test statistic, and the reference distribution is approximated by the empirical distribution of the bootstrap test statistics,
When calculating the likelihood-ratio test statistics, we assume a normal distribution with a constant but unknown variance for the conditional distribution yi given xi. For each scenario, we generate 1000 Monte Carlo samples, and we consider M = 500 and M = 1000 iterations for both bootstrap methods. summarizes the simulation results. For both scenarios, the naive likelihood-ratio method and the naive quasi-score method have a significantly inflated Type I error rate when the null hypothesis is true. The corresponding design effects (Wu and Thompson Citation2020) are more than 2 for estimating θ2 under both scenarios; see Section S8.1 of the supplementary material for details. Thus, the distributions of and the quasi-score test statistic (14) are more “dispersed” than
under the null hypothesis, and due to the lack of design-effect corrections, the two naive methods are not theoretically valid. The Type I error rates of the Lumley and Scott (Citation2014) method and the two proposed bootstrap testing methods are similar under H0, regardless of the population and sample sizes. The Type I error rates of the three methods come closer to the nominal truth as the population and sample sizes increase. Furthermore, the powers of the two bootstrap testing methods are reasonable compared to the Lumley-Scott method. The naive methods have a slightly larger power at the expense of elevated Type I error rates. In addition, we get similar test results regardless of the number of bootstrap repetitions.
Table 1 Test power for the hypothesis test based on 1 000 Monte Carlo simulations, the true parameter is
, and the nominal significance level is 0.05.
Besides, Section S9 of the supplementary material reports results of an additional simulation study for a stratified two-stage sampling design.
6.2 Test of Independence
In this section, we consider test of independence in a 3 × 3 table of counts to check the performance of the proposed bootstrap testing method. For a finite population ,
is generated by a multinomial distribution with one trial and a success probability vector
, where
, and pij is the success probability for the cell in the ith row and jth column for
. That is,
is a dummy variable consisting of eight 0’s and a 1. For the success probability vector
, we consider three cases:
Case I:
.
Case II:
.
Case III:
.
Case I satisfies independence for the two-way table of counts, but Cases II–III do not. The level of nonindependence can be expressed by , and γ = 0 corresponds to independence. The values of γ are 0, 0.017 and 0.125 for Cases I–III, respectively.
For each , we generate an auxiliary variable
, where
for
. Probability proportional to size sampling with replacement is used to generate a sample of size n with selection probability proportional to xi. We consider two scenarios for the population and sample sizes:
and
.
We are interested in testing independence in the two-way table with nominal significance level. For each sample, we compare the following test methods:
Naive Pearson method based on
in (18) with
being the reference distribution.
The second-order Rao-Scott correction method using the Satterthwaite approximation (Thomas and Rao Citation1987). This method is widely used in analyzing two-way contingency table and is implemented by the command svychisq in the R package survey (Lumley Citation2020).
Bootstrap Pearson method using
, and its distribution is approximated by that of
in (19).
For each scenario, we generate 1000 Monte Carlo samples, and we consider M = 500 and M = 1000 iterations for both bootstrap methods. summarizes the simulation results. For Case I when the null hypothesis is true, the Type I error rate of the naive Pearson method is much larger than the nominal significance level 0.05 for different sample sizes, since the asymptotic distribution of the test statistic in (18) is more “dispersed” than a
distribution under the null hypothesis; see Section S8.2 of the supplementary material for details. The performance of the Rao-Scott method with second-order correction is similar to the bootstrap testing method in terms of Type I error and power. The power of the bootstrap testing method increases with the value of γ. Similar to the previous simulation, the power of the naive method is larger at the expense of elevated Type I error rates. In addition, we get similar test results regardless of the number of bootstrap repetitions.
Table 2 Power of the test procedures for independence based on 1000 Monte Carlo simulation samples, Case I corresponds to the independence scenario, and the nominal significance level is 0.05.
7 Application
We present an analysis of the 2011 Private Education Expenditure Survey (PEES) in South Korea using the proposed bootstrap testing methods. The purpose of this survey is to study the relationship between private education expenditure and academic performance of students before entering college.
A stratified two-stage sampling design was used for the 2011 PEES, and the strata consist of 16 first-tier administrative divisions, including most provinces and metropolitan cities in South Korea. For each stratum, PPS sampling with replacement was conducted in the first stage, and the primary sampling unit was the school. The students were randomly selected from the sample school in the second stage. There are about 1000 sample schools and 45,000 students involved in this survey.
For the student i in the sample A, let yi be the academic performance assessed by the teacher, and it takes a value from 1 through 3 corresponding to low, middle and high academic performance, respectively. Associated with yi, let be the vector of covariates of interest. As discussed by Kim, Park, and Lee (Citation2017), we consider the following covariates: after-school education, hours taking lessons provided by the school after regular classes in a month; father’s education, 1 for college or higher and 0 otherwise; gender, 1 for female and 0 for male; household income per month; mother’s education, 1 for college or higher and 0 otherwise; private education, hours taking private lessons in a month.
In this section, we study the academic performance of students in middle school and high school separately and are interested in studying the conditional probability of achieving high academic performance. Specifically, consider the following logistic model
(20)
(20) where
for
, Y = 1 if high academic performance is achieved and 0 otherwise,
is a vector of six covariates, and
. We are interested in testing the null hypotheses
for
with
nominal significance level. Since the Wald method is widely used in practice, the naive likelihood ratio method, naive quasi-score method, bootstrap quasi-likelihood ratio method, bootstrap quasi-score method with 1000 iterations and a two-sided Wald test are compared. The p-values for the two naive methods and the Wald test are obtained using reference distributions for simple random sampling. Specifically, the reference distribution for the two naive methods is
, and that of the Wald test is a normal distribution with estimated variance (Fuller Citation2009, sec. 1.2.8) for
using the sandwich formula. The p-values for the proposed bootstrap testing methods are obtained by bootstrap empirical distributions of the corresponding test statistics.
The results of our analysis are summarized in . The two bootstrap testing methods and the Wald test perform approximately the same in terms of p-values. The p-values of the two naive methods are approximately the same, but they differ from those of the bootstrap testing methods, especially for “after school education” at the middle school level and most covariates at the high school level, as the naive methods may not properly reflect the intra-cluster correlation due to cluster sampling.
Table 3 Estimates (Est) and the p-values (Unit: ) for testing
, where
, in (20) for the middle school and high school.
Based on the two bootstrap testing methods in , we have the following conclusions under nominal significance level. Controlling other covariates, the probability that female students achieve high academic performance is significantly higher than that of male students in middle school, but the gender effect is not significant in high school. Hours spent on private education and after-school education can significantly increase the probability of achieving high academic performance in both middle school and high school. The household income has a significantly positive influence on their child’s academic performance. However, the level of education of the father and mother has a significant influence during the middle school period only.
8 Concluding Remarks
Many statistical agencies provide microdata files containing survey weights and several sets of associated replication weights, in particular bootstrap weights. Standard statistical packages often permit the use of survey-weighted test statistics, and we have shown how to approximate their distributions under the null hypothesis by their bootstrap analogues computed from the bootstrap weights supplied in the data file. Using bootstrap weights, we can easily apply hypothesis testing without using specialized software designed for complex survey data, as noted by Beaumont and Bocci (Citation2009). The proposed bootstrap method is applied to quasi-likelihood ratio tests and quasi-score tests.
Our theories depend on establishing bootstrap central limit theorems under specified sampling designs, including stratified multi-stage sampling with clusters selected with replacement. Under some sampling designs, the central limiting theorem does not necessarily hold and the proposed method is not applicable. In the case of sampling designs that allow for the central limit theorem, the bootstrap central limit theorems can be established under some additional assumptions. Under the assumption that the bootstrap central limit theorem holds in the sampling design, the proposed hypothesis testing methods are applicable. The hypotheses we have discussed are special cases of the one considered in Section 4.2 of Jiang (Citation2019), so the theoretical properties of the proposed bootstrap method can also be investigated using the general method developed in Section 4.2 of Jiang (Citation2019). The proposed method can be applied to the case of categorical data by developing bootstrap procedures for testing simple goodness of fit and independence in a two-way table. We plan to extend our bootstrap method for categorical data to test hypotheses from multi-way tables of weighted counts or proportions, using a log-linear model approach proposed by Rao and Scott (Citation1984). Moreover, the bootstrap method can also be applied to nonresponse and weighting adjustments for variance estimation under regularity conditions, and extending the proposed bootstrap method for weight adjustments for hypothesis testing is also a promising project.
Supplementary Materials
Supplementary Material contains the following: comments on the regularity conditions (S1), proof of Theorem 1 (S2), proof of Theorem 2 (S3), verification of the proposed bootstrap method under stratified multi-stage sampling (S4), bootstrap method for Poisson sampling (S5) bootstrap method for probability proportional to size sampling with replacement (S6), stratified simple random sampling (S7), design effect (S8) and an additional simulation study: stratified two-stage sampling (S9).
Supplemental Material
Download Zip (276 KB)Acknowledgments
We are grateful to the associate editor and three anonymous referees for their constructive comments, which have greatly improved the article.
Additional information
Funding
References
- Antal, E., and Tillé, Y. (2011), “A Direct Bootstrap Method for Complex Sampling Designs from a Finite Population,” Journal of the American Statistical Association, 106, 534–543. DOI: 10.1198/jasa.2011.tm09767.
- Antal, E., and Tillé, Y. (2014), “A New Resampling Method for Sampling Designs Without Replacement: The Doubled Half Bootstrap,” Computational Statistics, 29, 1345–1363.
- Beaumont, J.-F., and Bocci, C. (2009), “A Practical Bootstrap Method for Testing Hypotheses from Survey Data,” Survey Methodology, 35, 25–35.
- Beaumont, J.-F., and Patak, Z. (2012), “On the Generalized Bootstrap for Sample Surveys with Special Attention to Poisson Sampling,” International Statistical Review, 80, 127–148. DOI: 10.1111/j.1751-5823.2011.00166.x.
- Berger, Y. G. (1998), “Rate of convergence to normal distribution for the Horvitz-Thompson estimator,” Journal of Statistical Planning and Inference, 67, 209–226. DOI: 10.1016/S0378-3758(97)00107-9.
- Bertail, P., and Combris, P. (1997), “Bootstrap Généralisé d’un Sondage,” Annales d’Economie et de Statistique, 46, 49–83. DOI: 10.2307/20076068.
- Bessonneau, P., Brilhaut, G., Chauvet, G., and Garcia, C. (2021), “With-Replacement Bootstrap Variance Estimation for Household Surveys: Principles, Examples and Implementation,” Survey Methodology, 47, 313–347.
- Bickel, P. J., and Freedman, D. A. (1984), “Asymptotic Normality and the Bootstrap in Stratified Sampling,” Annals of Statistics, 12, 470–482.
- Breidt, F. J., and Opsomer, J. D. (2000), “Local Polynomial Regression Estimators in Survey Sampling,” Annals of Statistics, 28, 1026–1053.
- Chambers, R., and Skinner, C. J. (2003), Analysis of Survey Data, Chichester: Wiley.
- Chao, M.-T., and Lo, S.-H. (1985), “A Bootstrap Method for Finite Population,” Sankhya: Series A, 47, 399–405.
- Chatterjee, S., and Bose, A. (2005), “Generalized Bootstrap for Estimating Equations,” Annals of Statistics, 33, 414–436.
- Chauvet, G. (2015), “Coupling Methods for Multistage Sampling,” Annals of Statistics, 43, 2484–2506.
- Chen, J., and Rao, J. (2007), “Asymptotic Normality Under Two-Phase Sampling Designs,” Statistica Sinica, 17, 1047–1064.
- Chipperfield, J., and Preston, J. (2007), “Efficient Bootstrap for Business Surveys,” Survey Methodology, 33, 167–172.
- Fuller, W. A. (2009), Sampling Statistics, New Jersey: Wiley.
- Hájek, J. (1964), “Asymptotic Theory of Rejective Sampling with Varying Probabilities from a Finite Population,” Annals of Mathematical Statistics, 35, 1491–1523.
- Han, Q., and Wellner, J. A. (2021), “Complex Sampling Designs: Uniform Limit Theorems and Applications,” Annals of Statistics, 49, 459–485.
- Isaki, C. T., and Fuller, W. A. (1982), “Survey Design under the Regression Superpopulation Model,” Journal of the American Statistical Association, 77, 89–96. DOI: 10.1080/01621459.1982.10477770.
- Jiang, J. (2019), Robust Mixed Model Analysis, Singapore: World Scientific.
- Kim, J. K., Michael Brick, J., Fuller, W. A., and Kalton, G. (2006), “On the Bias of the Multiple-Imputation Variance Estimator in Survey Sampling,” Journal of the Royal Statistical Society, Series B, 68, 509–521. DOI: 10.1111/j.1467-9868.2006.00546.x.
- Kim, J. K., Park, S., and Lee, Y. (2017), “Statistical Inference using Generalized Linear Mixed Models under Informative Cluster Sampling,” Canadian Journal of Statistics, 45, 479–497. DOI: 10.1002/cjs.11339.
- Korn, E. L., and Graubard, B. I. (1999), Analysis of Health Surveys, New York: Wiley.
- Lumley, T. (2020), Analysis of Complex Survey Samples, R package version 3.37.
- Lumley, T., and Scott, A. J. (2014), “Tests for Regression Models Fitted to Survey Data,” Australian & New Zealand Journal of Statistics, 56, 1–14. DOI: 10.1111/anzs.12065.
- Mashreghi, Z., Haziza, D., and Léger, C. (2016), “A Survey of Bootstrap Methods in Finite Population Sampling,” Statistics Surveys, 10, 1–52. DOI: 10.1214/16-SS113.
- Pfeffermann, D. (1993), “The Role of Sampling Weights when Modeling Survey Data,” International Statistical Review, 61, 317–337. DOI: 10.2307/1403631.
- Praestgaard, J., and Wellner, J. A. (1993), “Exchangeably Weighted Bootstraps of the General Empirical Process,” Annals of Probability, 21, 2053–2086.
- Rao, J. N. K., and Scott, A. J. (1981), “The Analysis of Categorical Data from Complex Sample Surveys: Chi-Squared Tests for Goodness of Fit and Independence in Two-Way Tables,” Journal of the American Statistical Association, 76, 221–230. DOI: 10.1080/01621459.1981.10477633.
- Rao, J. N. K., and Scott, A. J. (1984), “On Chi-Squared Tests for Multiway Contingency Tables with Cell Proportions Estimated from Survey Data,” Annals of Statistics, 12, 46–60.
- Rao, J. N. K., Scott, A. J., and Skinner, C. J. (1998), “Quasi-Score Tests with Survey Data,” Statistica Sinica, 8, 1059–1070.
- Rao, J. N. K., and Wu, C. F. J. (1988), “Resampling Inference with Complex Survey Data,” Journal of the American Statistical Association, 83, 231–241. DOI: 10.1080/01621459.1988.10478591.
- Rao, J. N. K., Wu, C. F. J., and Yue, K. (1992), “Some Recent Work on Resampling Methods for Complex Surveys,” Survey Methodology, 18, 209–217.
- Rosén, B. (1972a), “Asymptotic Theory for Successive Sampling with Varying Probabilities without Replacement, I,” Annals of Mathematical Statistics, 43, 373–397.
- Rosén, B. (1972b), “Asymptotic Theory for Successive Sampling with Varying Probabilities without Replacement, II,” The Annals of Mathematical Statistics, 43, 748–776.
- Rubin-Bleuer, S., and Schiopu-Kratina, I. (2005), “On the Two-Phase Framework for Joint Model and Design-based Inference,” Annals of Statistics, 33, 2789–2810.
- Särndal, C.-E., Swensson, B., and Wretman, J. (2003), Model Assisted Survey Sampling, New York: Springer.
- Särndal, C.-E., Swensson, B., and Wretman, J. H. (1989), “The Weighted Residual Technique for Estimating the Variance of the General Regression Estimator of the Finite Population Total,” Biometrika, 76, 527–537. DOI: 10.1093/biomet/76.3.527.
- Shao, J. (2003), Mathematical Statistics, New York: Springer.
- Sverchkov, M., and Pfeffermann, D. (2004), “Prediction of Finite Population Totals based on the Sample Distribution,” Survey Methodology, 30, 79–92.
- Thomas, D. R., and Rao, J. N. K. (1987), “Small-Sample Comparisons of Level and Power for Simple Goodness-of-Fit Statistics under Cluster Sampling,” Journal of the American Statistical Association, 82, 630–636. DOI: 10.1080/01621459.1987.10478476.
- Toth, D., and Eltinge, J. L. (2011), “Building Consistent Regression Trees From Complex Sample Data,” Journal of the American Statistical Association, 106, 1626–1636. DOI: 10.1198/jasa.2011.tm10383.
- Verret, F., Rao, J. N. K., and Hidiroglou, M. A. (2015), “Model-based Small Area Estimation Under Informative Sampling,” Survey Methodology, 41, 333–347.
- Wu, C., and Thompson, M. E. (2020), Sampling Theory and Practice, Gewerbestrasse: Springer.