Full article: Empirical likelihood inference and goodness-of-fit test for logistic regression model under two-phase case-control sampling

Formulae display: $MathJax Logo$ ?Mathematical formulae have been encoded as MathML and are displayed in this HTML version using MathJax in order to improve their display. Uncheck the box to turn MathJax off. This feature requires Javascript. Click on a formula to zoom.

ABSTRACT

Due to cost-effectiveness and high efficiency, two-phase case-control sampling has been widely used in epidemiology studies. We develop a semi-parametric empirical likelihood approach to two-phase case-control data under the logistic regression model. We show that the maximum empirical likelihood estimator has an asymptotically normal distribution, and the empirical likelihood ratio follows an asymptotically central chi-square distribution. We find that the maximum empirical likelihood estimator is equal to Breslow and Holubkov (1997)'s maximum likelihood estimator. Even so, the limiting distribution of the likelihood ratio, likelihood-ratio-based interval, and test are all new. Furthermore, we construct new Kolmogorov–Smirnov type goodness-of-fit tests to test the validation of the underlying logistic regression model. Our simulation results and a real application show that the likelihood-ratio-based interval and test have certain merits over the Wald-type counterparts and that the proposed goodness-of-fit test is valid.

KEYWORDS:

1. Introduction

Case-control studies have been conducted extensively in epidemiology and medical research, especially for rare diseases like cancer, as an easy and quick way of comparing treatments, or investigating the causes of disease. In a case-control study, the possible covariate information associated with diseases is collected for diseased and non-diseased individuals, and logistic regression models are usually used to model the relationship between the binary disease status and the covariate (Breslow & Day, Citation1980; Farewell, Citation1979; Prentice & Pyke, Citation1979). However, epidemiological and medical studies often require the collection of information on a large number of variables, including lifestyle, occupation, and environmental conditions. Certain variables are collected easily such as disease status and age in the social security system, but certain variables require a lot of cost, such as lifestyle. The difficulty can be overcome by employing a two-phase, two-stage or double sampling (Breslow et al., Citation2009; Neyman, Citation1938; Schaid et al., Citation2013; Thomas et al., Citation2013), where in the first phase, a relatively large number of individuals are sampled from a target population and information on variables that are easier to measure is collected. Together with a case-control sampling in the first phase, this leads to two-phase case-control sampling (Walker, Citation1982; White, Citation1982). At the first phase, cases and controls are sampled from a general population, and some stratifying information, such as a crude measure of exposure (e.g. disease or non-disease) is obtained. At the second phase, subsamples are selected within strata defined by disease status and by other stratifying information, and more refined information on exposure and other covariates is obtained for the subsample.

There have been many estimation methods dealing with logistic regression analysis of two-phase sampling data by making careful use of the information in the two sampling phases. Popular methods include the pseudo likelihood method (Breslow & Cain, Citation1988; Schill et al., Citation1993), the inverse probability weighted estimating method (Flanders & Greenland, Citation1991; Saegusa & Wellner, Citation2013), which takes the underlying sampling plan into account, and the maximum likelihood estimation method (Breslow & Holubkov, Citation1997; Lawless et al., Citation1999; Qin et al., Citation2015; Scott & Wild, Citation1997; Zhou et al., Citation2011). Among them, the methods developed by Lawless et al. (Citation1999), Zhou et al. (Citation2011), and Saegusa and Wellner (Citation2013) are designed for general two-phase prospective data. As disclosed by Prentice and Pyke (Citation1979), given usual case-control data, inferences for the odds ratio parameters based on prospective and retrospective likelihoods are equivalent. Therefore statistical methods designed for prospective data can usually be employed to make inference for retrospectively collected case-control data by taking them as if they were prospectively collected. Breslow and Holubkov (Citation1997, p. 452) proved that this equivalence still holds for two-phase case-control data.

The existing works suggested to construct confidence intervals or conduct hypothesis testing for the odds ratio parameters based on the asymptotic normality of their proposed point estimators. This necessitates a consistent estimator for the accompanying asymptotic variance, which may have a very complicated form. In contrast, likelihood-ratio confidence intervals or tests are generally preferable as they are free of variance estimation. In addition, all the aforementioned methods for two-phase case-control data analysis are built on the logistic regression model assumption, whose mis-specification may have a detrimental or invalid effect on the subsequent analysis. Therefore it is necessary to check whether the logistic model is valid or not for two-phase case-control data.

Logistic regression models are commonly used in analysing binary data which arise in studying the relationship between diseases and environment or genetic characteristics. See for example, Breslow and Day (Citation1980), Farewell (Citation1979), and Prentice and Pyke (Citation1979). Anderson (Citation1979) noticed that the prospective logistic regression model is equivalent to the two sample exponential tilting model where one sample is corresponding to the covariate distribution for non-diseased group and the other one is the diseased group. With the standard case-control data, Qin and Zhang (Citation1997) and Qin and Zhang (Citation2005) proposed a Kolmogorov-Smirnov type statistic to test the validation of the logistic link function, where they implicitly used the empirical likelihood method (Owen, Citation1988) to estimate the underlying parameters. Qin (Citation1998) found that the retrospective likelihood is equivalent to the empirical likelihood based on a density ratio model (Anderson, Citation1979, Citation1972). See Chen and Liu (Citation2013), Cai et al. (Citation2017), Diao et al. (Citation2012), and Luo and Tsai (Citation2012) for more about the density ratio model based on empirical likelihood.

Since the seminal work by Owen (Citation1988), the empirical likelihood has become a popular nonparametric tool in statistical and economic literatures (Kitamura, Citation2006; Owen, Citation2001), as it has many nice properties paralleling to parametric likelihood. For example, empirical likelihood confidence regions are Bartlett correctable (DiCiccio et al., Citation1991; Liu & Chen, Citation2010), range preserving, transformation respecting, and do not require estimation of variance. It has the advantage of allowing the data to ‘speak for itself’ more and is more robust to slight mis-specification than parametric likelihood. Empirical likelihood has also found many applications in surveying sampling problems. Chen and Qin (Citation1993) showed that empirical likelihood can effectively use auxiliary information. Chen et al. (Citation2002) developed an elegant algorithm to implement the empirical likelihood method for finite-sample sampling problems. Zhao and Wu (Citation2019) and Wu and Thompson (Citation2020) provided a comprehensive review on the developments of empirical likelihood methods for complex surveys.

In this paper, we extend the empirical likelihood method for standard case-control data to two-phase case-control data. We find that the proposed maximum empirical likelihood point estimator is equal to Breslow and Holubkov (Citation1997)'s maximum likelihood estimator under the logistic regression model. Unlike Breslow and Holubkov, (Citation1997), we show that the empirical likelihood ratio statistic follows a limiting central chi-square distribution with a known degree of freedom. A remarkable advantage of the empirical likelihood-ratio test over the existing asymptotic-normality-based tests is that it is free of variance estimation. This makes it more convenient to construct empirical likelihood confidence regions or empirical likelihood-ratio tests, to test, for example, whether a covariate has a significant effect on disease occurrence. Furthermore, we construct new Kolmogorov-Smirnov type goodness-of-fit tests to test the validation of the logistic regression model.

The rest of this paper is organised as follows. Section 2 introduces the data and model assumptions, presents empirical likelihood, goodness-of-fit tests based on the density ratio model, and investigates their large-sample properties. Simulation results and a real-data analysis are provided in Sections 3 and 4, respectively. Finally, Section 5 contains some discussions. For clarity, all technical proofs are relegated to the Supplementary Material.

2. Semi-parametric likelihood

2.1. Data and model

Let D denote the disease status with D = 1 for a disease (case) and 0 for a non-disease (control), and X and Z denote two covariate vectors, where Z takes only finitely many values: $z_{1}, z_{2}, \dots, z_{J}$ . Suppose the probability of disease occurrence in the population of interest follows a linear logistic regression model (1) $p r (D = 1 | X = x, Z = z) = \frac{\exp (α^{*} + x^{⊤} β + γ z)}{1 + \exp (α^{*} + x^{⊤} β + γ z)},$ (1) where $⊤$ stands for transpose. Here and in what follows, we use $p r$ to denote probability density function with respect to the counting measure for a discrete random variable, and the Lebesgue measure for an absolutely continuous random variable. We collect data from the population under study by two-phase case-control sampling. At the first phase, fixed number $n_{0}$ of controls and $n_{1}$ of cases (sample 1) are drawn, independent of X and Z. Let $N_{i j}$ denote the observed number of individuals with ( $D = i, Z = z_{j}$ ) for $i = 0, 1$ and $j = 1, 2, \dots, J$ . Then for each disease category, $(N_{i 1}, N_{i 2}, \dots, N_{i J})$ is a random vector following a multinomial distribution with parameter $n_{i}$ and $p r (z_{j} | D = i)$ ( $j = 1, 2, \dots, J$ ). At the second phase, within each Z-stratum of cases and controls in sample 1, a subsample is randomly selected, and exact or complete covariate measurements of the subsample are obtained (sample 2). Let $M_{i j}$ 's denote the (random or non-random) sample sizes of the $2 \times J$ validation strata. We observe a random sample ${X_{i j k} : k = 1, 2, \dots, M_{i j}}$ for each pair $(i, j)$ with i = 0, 1 and $j = 1, 2, \dots, J$ . Conditioning on ${M_{i j}}$ and ${N_{i j}}$ , all $X_{i j k}$ are assumed to be independent and $X_{i j 1}, X_{i j 2}, \dots, X_{i j M_{i j}}$ are assumed to be identically distributed with probability density $p r (x | D = i, Z = z_{j})$ .

Given ${M_{i j}}$ and ${N_{i j}}$ , the likelihood based on sample 2 is uninformative with respect to β. Therefore, the likelihood based on the two-phase case-control data is proportional to (2) $\prod_{i = 0}^{1} \prod_{j = 1}^{J} [{p r (z_{j} | D = i)}^{N_{i j}} \prod_{k = 1}^{M_{i j}} p r (x_{i j k} | D = i, z_{j})],$ (2) which is exactly the likelihood in Equation (Equation2(2) $\prod_{i = 0}^{1} \prod_{j = 1}^{J} [{p r (z_{j} | D = i)}^{N_{i j}} \prod_{k = 1}^{M_{i j}} p r (x_{i j k} | D = i, z_{j})],$ (2) ) of Breslow and Holubkov (Citation1997) for discrete covariate variables. This is the foundation of our empirical likelihood method. To proceed, we need to investigate $p r (z_{j} | D = i)$ , and $p r (x_{i j k} | D = i, z_{j})$ in Equation (Equation2(2) $\prod_{i = 0}^{1} \prod_{j = 1}^{J} [{p r (z_{j} | D = i)}^{N_{i j}} \prod_{k = 1}^{M_{i j}} p r (x_{i j k} | D = i, z_{j})],$ (2) ) and study the relationship between them.

Let $π = p r (D = 1)$ . By Bayes's formula, we have $\begin{aligned} p r (x, z | D = 1) & = \frac{1}{π} p r (D = 1 | x, z) p r (x, z), \\ p r (x, z | D = 0) & = \frac{1}{1 - π} p r (D = 0 | x, z) p r (x, z) . \end{aligned}$ It follows that $\begin{aligned} \frac{p r (x, z | D = 1)}{p r (x, z | D = 0)} & = \frac{1 - π}{π} \frac{p r (D = 1 | x, z)}{p r (D = 0 | x, z)} \\ = \frac{1 - π}{π} \exp (α^{*} + x^{⊤} β + γ z), \end{aligned}$ where we have used Equation (Equation1(1) $p r (D = 1 | X = x, Z = z) = \frac{\exp (α^{*} + x^{⊤} β + γ z)}{1 + \exp (α^{*} + x^{⊤} β + γ z)},$ (1) ). This implies that (3) $\begin{aligned} p r (x, z | D = 1) \\ = \frac{1 - π}{π} \exp (α^{*} + x^{⊤} β + γ z) \cdot p r (x, z | D = 0) \\ = \exp (α + x^{⊤} β + γ z) \cdot p r (x | D = 0, z) \\ \times p r (z | D = 0), | d = 1 \end{aligned}$ (3) where $α = α^{*} + \log {(1 - π) / π}$ .

Thus we immediately have (4) $\begin{aligned} p r (z_{j} | D = 1) \\ = \int p r (x, z_{j} | D = 1) d x \\ = \int \exp (α + x^{⊤} β + γ z_{j}) \\ \times p r (x | D = 0, z_{j}) d x \cdot p r (z_{j} | D = 0) \\ = \exp (α + γ z_{j} - η_{j}) \cdot p r (z_{j} | D = 0), | d = 1 \end{aligned}$ (4) where $η_{j} = - \ln {\int \exp (x^{⊤} β) \cdot p r (x | D = 0, z_{j}) d x} .$ Combining Equations (Equation3(3) $\begin{aligned} p r (x, z | D = 1) \\ = \frac{1 - π}{π} \exp (α^{*} + x^{⊤} β + γ z) \cdot p r (x, z | D = 0) \\ = \exp (α + x^{⊤} β + γ z) \cdot p r (x | D = 0, z) \\ \times p r (z | D = 0), | d = 1 \end{aligned}$ (3) ) and (Equation4(4) $\begin{aligned} p r (z_{j} | D = 1) \\ = \int p r (x, z_{j} | D = 1) d x \\ = \int \exp (α + x^{⊤} β + γ z_{j}) \\ \times p r (x | D = 0, z_{j}) d x \cdot p r (z_{j} | D = 0) \\ = \exp (α + γ z_{j} - η_{j}) \cdot p r (z_{j} | D = 0), | d = 1 \end{aligned}$ (4) ) gives (5) $\begin{aligned} | z d & = 1 p r (x | D = 1, z_{j}) = \frac{p r (x, z_{j} | D = 1)}{p r (z_{j} | D = 1)} \\ = \exp (η_{j} + x^{⊤} β) \cdot p r (x | D = 0, z_{j}) . \end{aligned}$ (5) In summary, we have expressed $p r (z_{j} | D = i)$ and $p r (x_{i j k} | D = i, z_{j})$ in Equation (Equation2(2) $\prod_{i = 0}^{1} \prod_{j = 1}^{J} [{p r (z_{j} | D = i)}^{N_{i j}} \prod_{k = 1}^{M_{i j}} p r (x_{i j k} | D = i, z_{j})],$ (2) ) as functions of $p r (x | D = 0, z_{j})$ and some finite dimensional parameters.

2.2. Semi-parametric empirical likelihood

Putting the expressions in Equations (Equation4(4) $\begin{aligned} p r (z_{j} | D = 1) \\ = \int p r (x, z_{j} | D = 1) d x \\ = \int \exp (α + x^{⊤} β + γ z_{j}) \\ \times p r (x | D = 0, z_{j}) d x \cdot p r (z_{j} | D = 0) \\ = \exp (α + γ z_{j} - η_{j}) \cdot p r (z_{j} | D = 0), | d = 1 \end{aligned}$ (4) ) and (Equation5(5) $\begin{aligned} | z d & = 1 p r (x | D = 1, z_{j}) = \frac{p r (x, z_{j} | D = 1)}{p r (z_{j} | D = 1)} \\ = \exp (η_{j} + x^{⊤} β) \cdot p r (x | D = 0, z_{j}) . \end{aligned}$ (5) ) into Equation (Equation2(2) $\prod_{i = 0}^{1} \prod_{j = 1}^{J} [{p r (z_{j} | D = i)}^{N_{i j}} \prod_{k = 1}^{M_{i j}} p r (x_{i j k} | D = i, z_{j})],$ (2) ) and taking logarithm, we have a log-likelihood

$\begin{aligned} \sum_{i = 0}^{1} \sum_{j = 1}^{J} [N_{i j} \log {p r (z_{j} | D = i)} + \sum_{k = 1}^{M_{i j}} \log {p r (x_{i j k} | D = i, z_{j})}] \\ = \sum_{j = 1}^{J} [N_{\cdot j} \log {p r (z_{j} | D = 0)} + N_{1 j} (α + γ z_{j} - η_{j})] \\ + \sum_{j = 1}^{J} [\sum_{i = 0}^{1} \sum_{k = 1}^{M_{i j}} \log {p r (x_{i j k} | D = 0, z_{j})} + \sum_{k = 1}^{M_{1 j}} (η_{j} + x_{1 j k}^{⊤} β)}], \end{aligned}$

where $N_{\cdot j} = N_{0 j} + N_{1 j}$ . As Z takes only finite values, let $q_{j} = p r (z_{j} | D = 0)$ . In the principle of empirical likelihood, we model the conditional distributions of X given $D = 0$ and $Z = z_{j}$ by discrete distributions with support on the observations. Let $p_{i j k} = p r (x_{i j k} | D = 0, Z = z_{j})$ . According to Equations (Equation4(4) $\begin{aligned} p r (z_{j} | D = 1) \\ = \int p r (x, z_{j} | D = 1) d x \\ = \int \exp (α + x^{⊤} β + γ z_{j}) \\ \times p r (x | D = 0, z_{j}) d x \cdot p r (z_{j} | D = 0) \\ = \exp (α + γ z_{j} - η_{j}) \cdot p r (z_{j} | D = 0), | d = 1 \end{aligned}$ (4) ) and (Equation5(5) $\begin{aligned} | z d & = 1 p r (x | D = 1, z_{j}) = \frac{p r (x, z_{j} | D = 1)}{p r (z_{j} | D = 1)} \\ = \exp (η_{j} + x^{⊤} β) \cdot p r (x | D = 0, z_{j}) . \end{aligned}$ (5) ), the feasible $q_{j}$ and $p_{i j k}$ satisfy (6) ${\begin{cases} q_{j} \geq 0, p_{i j k} \geq 0, \\ \sum_{j = 1}^{J} q_{j} = \sum_{j = 1}^{J} q_{j} \exp (α + γ z_{j} - η_{j}) = 1, \\ \sum_{i = 0}^{1} \sum_{k = 1}^{M_{i j}} p_{i j k} = \sum_{i = 0}^{1} \sum_{k = 1}^{M_{i j}} p_{i j k} \exp (η_{j} + x_{i j k}^{⊤} β) = 1. \end{cases}$ (6) With these preparations, the proposed semi-parametric empirical log-likelihood is $\begin{aligned} \tilde{ℓ} & = \sum_{j = 1}^{J} {N_{\cdot j} \log (q_{j}) + N_{1 j} (α + γ z_{j} - η_{j})} \\ + \sum_{j = 1}^{J} {\sum_{i = 0}^{1} \sum_{k = 1}^{M_{i j}} \log (p_{i j k}) + \sum_{k = 1}^{M_{1 j}} (η_{j} + x_{1 j k}^{⊤} β)} . \end{aligned}$ We denote $N_{\cdot \cdot} = \sum_{j = 1}^{J} N_{\cdot j} = n$ and $θ = (β^{⊤}, γ, η^{⊤})^{⊤}$ with $η = (η_{1}, \dots, η_{J})^{⊤}$ . Inferences for $(θ, α)$ are more conveniently made based on their profile likelihood, which is the maximum of $\tilde{ℓ}$ with respect to $q_{j}$ and $p_{i j k}$ under the constraints in Equation (Equation6(6) ${\begin{cases} q_{j} \geq 0, p_{i j k} \geq 0, \\ \sum_{j = 1}^{J} q_{j} = \sum_{j = 1}^{J} q_{j} \exp (α + γ z_{j} - η_{j}) = 1, \\ \sum_{i = 0}^{1} \sum_{k = 1}^{M_{i j}} p_{i j k} = \sum_{i = 0}^{1} \sum_{k = 1}^{M_{i j}} p_{i j k} \exp (η_{j} + x_{i j k}^{⊤} β) = 1. \end{cases}$ (6) ). By the Lagrange multiplier method, the maximum is achieved at (7) ${\begin{cases} q_{j} = \frac{N_{\cdot j}}{N_{\cdot \cdot}} \frac{1}{1 + λ {\exp (α + γ z_{j} - η_{j}) - 1}}, \\ j = 1, 2, \dots, J, \\ p_{i j k} = \frac{1}{M_{\cdot j}} \frac{1}{1 + λ_{j} {\exp (η_{j} + x_{i j k}^{⊤} β) - 1}}, \\ i = 0, 1, k = 1, 2, \dots, M_{i j}, \end{cases}$ (7) where $M_{\cdot j} = M_{0 j} + M_{1 j}$ , $λ = λ (θ, α)$ , and $λ_{j} = λ_{j} (θ, α)$ are the solutions to ${\begin{cases} \sum_{j = 1}^{J} \frac{N_{\cdot j}}{N_{\cdot \cdot}} \frac{\exp (α + γ z_{j} - η_{j}) - 1}{1 + λ {\exp (α + γ z_{j} - η_{j}) - 1}} = 0, \\ \sum_{i = 0}^{1} \sum_{k = 1}^{M_{i j}} \frac{\exp (η_{j} + x_{i j k}^{⊤} β) - 1}{1 + λ_{j} {\exp (η_{j} + x_{i j k}^{⊤} β) - 1}} = 0, \\ j = 1, 2, \dots, J . \end{cases}$ This immediately leads to the profile empirical log-likelihood function of $(θ, α)$ , $\begin{aligned} ℓ (θ, α) \\ = C^{*} - \sum_{j = 1}^{J} N_{\cdot j} \log [1 + λ {\exp (α + γ z_{j} - η_{j}) - 1}] \\ + \sum_{j = 1}^{J} N_{1 j} (α + γ z_{j} - η_{j}) \\ - \sum_{j = 1}^{J} \sum_{i = 0}^{1} \sum_{k = 1}^{M_{i j}} \log [1 + λ_{j} {\exp (η_{j} + x_{i j k}^{⊤} β) - 1}] \\ + \sum_{j = 1}^{J} \sum_{k = 1}^{M_{1 j}} (η_{j} + x_{1 j k}^{⊤} β), \end{aligned}$ where $C^{*} = \sum_{j = 1}^{J} {N_{\cdot j} \log (N_{\cdot j} / N_{\cdot \cdot}) + \sum_{i = 0}^{1} M_{i j} \log (1 / M_{\cdot j})}$ does not depend on $(θ, α)$ .

We propose to estimate $(θ, α)$ by the maximum likelihood estimator (MLE), $(\hat{θ}, \hat{α}) = \arg max_{θ, α} ℓ (θ, α) .$ The parameter α is merely a normalised parameter and generally of no importance. We may further profile α out and make inference for θ through $ℓ (θ) = max_{α} ℓ (θ, α)$ . Here, we abuse the notation $ℓ (\cdot)$ , whose meaning is clear from the context. It is seen that the MLE of θ based on $ℓ (θ)$ is still $\hat{θ}$ .

Lemma 2.1

Our maximum empirical likelihood estimator for $(β, γ)$ is identical to the maximum likelihood estimator of Breslow and Holubkov (Citation1997) when their method takes the stratum variable into account.

Lemma 2.1 indicates that numerically our point estimator is equal to Breslow and Holubkov (Citation1997)'s MLE. Even so, the density ratio model based on empirical likelihood framework for two-phase sampling data is new in the literature, and as far as we know, likelihood-ratio-based inferences under this setting have not been investigated yet in the literature. Breslow and Holubkov (Citation1997)'s statistical inferences (interval estimation and hypothesis testing) for the unknown parameters were based on the asymptotic normality of the MLE.

2.3. Asymptotics

In this section, we establish the limiting distributions of the MLE $\hat{θ}$ and the semi-parametric likelihood-ratio statistic. First of all, we introduce some conditions on the sampling plan and the population under study.

Condition (C1)

There exist positive constants $λ_{0}^{*}, c^{*}, c_{j}^{*}, λ_{j 0}$ ( $j = 1, 2, \dots, J$ ) such that $c \equiv M_{\cdot \cdot} / n = c^{*} + o_{p} (1)$ , $λ_{0} \equiv n_{1} / n = λ_{0}^{*} + o (1)$ , $c_{j} \equiv M_{\cdot j} / n = c_{j}^{*} + o_{p} (1)$ , and $λ_{j 0} \equiv M_{1 j} / M_{\cdot j} = λ_{j 0}^{*} + o_{p} (1)$ as n tends to infinity.

Condition (C1) guarantees that the sizes of the subsamples in all the strata are comparable, and they are also comparable with the case and control samples in the first phase. Otherwise, the strata with negligible sample sizes is simply discarded from our theoretical analysis.

Condition (C2)

The integral $\int \exp (x^{⊤} β) f_{j} (x) d x$ is finite for β in a neighbourhood of $β_{0}$ and a neighbourhood of 0 for all $j = 1, 2, \dots, J$ , where $f_{j} (x) = p r (x | D = 0, Z = z_{j})$ .

Under Condition (C2), the moment generating functions of $p r (x | D = 0, Z = z_{j})$ , namely $\int \exp (x^{⊤} β) \times p r (x | D = 0, Z = z_{j}) d x$ , are well defined in a neighbourhood of the origin. Therefore in the stratum with D = 0 and $Z = z_{j}$ , the covariate X has all finite-order moments. If model (Equation1(1) $p r (D = 1 | X = x, Z = z) = \frac{\exp (α^{*} + x^{⊤} β + γ z)}{1 + \exp (α^{*} + x^{⊤} β + γ z)},$ (1) ) is true, it follows from Equation (Equation5(5) $\begin{aligned} | z d & = 1 p r (x | D = 1, z_{j}) = \frac{p r (x, z_{j} | D = 1)}{p r (z_{j} | D = 1)} \\ = \exp (η_{j} + x^{⊤} β) \cdot p r (x | D = 0, z_{j}) . \end{aligned}$ (5) ) that $\begin{aligned} \int \exp (x^{⊤} β) p r (x | D = 1, Z = z_{j}) d x \\ = \int \exp (η_{j} + x^{⊤} β_{0}) \exp (x^{⊤} β) \\ \times p r (x | D = 0, Z = z_{j}) d x . \end{aligned}$ The finiteness of $\int \exp (x^{⊤} β) p r (x | D = 0, Z = z_{j}) d x$ in a neighbourhood of $β_{0}$ implies that the moment generating functions of $p r (x | D = 1, Z = z_{j})$ are also finite in a neighbourhood of the origin. Consequently, conditioning on D = 0 and $Z = z_{j}$ , the covariate X also has all finite-order moments.

The expression of the asymptotic variance of $\hat{θ}$ is rather complicated. For ease of presentation, some notations are needed. We use $θ_{0} = (β_{0}^{⊤}, γ_{0}, η_{0}^{⊤})^{⊤}$ with $η_{0} = (η_{10}, \dots, η_{J 0})$ to denote the true value of θ under model (Equation1(1) $p r (D = 1 | X = x, Z = z) = \frac{\exp (α^{*} + x^{⊤} β + γ z)}{1 + \exp (α^{*} + x^{⊤} β + γ z)},$ (1) ), and define $a_{j} = \exp (α_{0} + γ_{0} z_{j} - η_{j 0})$ and $A_{j} = 1 + λ_{0} (a_{j} - 1)$ for $1 \leq j \leq J$ . Write $B^{\otimes 2} = B B^{⊤}$ for any vector of matrix B. Let $\begin{aligned} D_{j} & = \frac{1}{λ_{j 0}} \int \frac{1 - \exp (η_{j 0} + x^{⊤} β_{0})}{1 + λ_{j 0} {\exp (η_{j 0} + x^{⊤} β_{0}) - 1}} f_{j} (x) d x, \\ D_{j 0} & = \int \frac{\exp (η_{j 0} + x^{⊤} β_{0})}{1 + λ_{j 0} {\exp (η_{j 0} + x^{⊤} β_{0}) - 1}} f_{j} (x) d x, \\ D_{j 1} & = \int \frac{\exp (η_{j 0} + x^{⊤} β_{0})}{1 + λ_{j 0} {\exp (η_{j 0} + x^{⊤} β_{0}) - 1}} x f_{j} (x) d x, \\ D_{j 2} & = \int \frac{\exp (η_{j 0} + x^{⊤} β_{0})}{1 + λ_{j 0} {\exp (η_{j 0} + x^{⊤} β_{0}) - 1}} x^{\otimes 2} f_{j} (x) d x, \end{aligned}$ and $\begin{aligned} E_{0} & = \sum_{j = 1}^{J} \frac{q_{j} a_{j}}{A_{j}}, E_{1} = \sum_{j = 1}^{J} \frac{q_{j} a_{j}}{A_{j}} z_{j}, \\ E_{2} & = \sum_{j = 1}^{J} \frac{q_{j} a_{j}}{A_{j}} z_{j}^{\otimes 2}, \end{aligned}$ where $q_{j} = p r (z_{j} | D = 0)$ . The following matrix plays an important role in the asymptotic normality of $\hat{θ}$ : $V_{n} = (V_{i j}),$ where $\begin{aligned} V_{1, 1} & = λ_{0} (1 - λ_{0}) (E_{2} - \frac{E_{1}^{\otimes 2}}{E_{0}}), V_{1, 2} = 0, \\ V_{2, 2} & = \sum_{j = 1}^{J} c c_{j} {λ_{j 0} (1 - λ_{j 0}) D_{j 2} + \frac{D_{j 1}^{\otimes 2}}{D_{j}}}, \\ V_{2 + j, 2 + j} & = c c_{j} \frac{D_{j 0}}{D_{j}} + q_{j} \frac{λ_{0} (1 - λ_{0}) a_{j}}{A_{j}} \\ - \frac{λ_{0} (1 - λ_{0})}{E_{0}} \frac{q_{j}^{2} a_{j}^{2}}{A_{j}^{2}}, \\ V_{2 + j, 2 + k} & = - \frac{q_{j} a_{j}}{A_{j}} \frac{q_{k} a_{k}}{A_{k}} \frac{λ_{0} (1 - λ_{0})}{E_{0}}, 1 \leq j \neq k \leq J, \\ V_{1, 2 + j} & = - \frac{q_{j} a_{j}}{A_{j}} λ_{0} (1 - λ_{0}) (z_{j} - \frac{E_{1}}{E_{0}}), \\ V_{2, 2 + j} & = c c_{j} \frac{D_{j 1}}{D_{j}} . \end{aligned}$ It is worth noting that the quantities $λ_{0}, λ_{j 0}, c$ , and $c_{j}$ all depend on n.

Theorem 2.1

Suppose that model (Equation1(1) $p r (D = 1 | X = x, Z = z) = \frac{\exp (α^{*} + x^{⊤} β + γ z)}{1 + \exp (α^{*} + x^{⊤} β + γ z)},$ (1) ) and Conditions (C1) and (C2) all hold. As $n \to \infty$ , if $V_{n}$ converges to a nonsingular matrix $V_{*}$ , then (I) $\sqrt{n} (\hat{θ} - θ_{0})$ converges in distribution to a multivariate normal distribution with mean zero and variance-covariance matrix $V_{*}$ ; (II) the semi-parametric likelihood ratio $2 {l (\hat{θ}) - l (θ_{0})}$ has a limiting $χ_{K}^{2}$ distribution, where K is the dimension of θ.

The proof of Theorem 2.1 also implies that the likelihood ratio statistic for any subvector of θ also follows an asymptotic chi-square distribution with a known degree of freedom. A key and interesting problem in case-control study is to check whether some or all of the covariates X and Z have significant effects on the disease occurrence D, or construct confidence intervals for their coefficients. Under the framework of the proposed semi-parametric empirical likelihood, we propose to construct confidence intervals and hypothesis tests based on the likelihood ratio function calibrated by its limiting chi-square distribution. Theorem 2.1 theoretically guarantees that for θ or any of its subvectors, our likelihood-ratio confidence intervals have asymptotically correct coverage probabilities, and our likelihood ratio tests have asymptotically correct type I errors.

2.4. Goodness-of-fit test

The previous nice properties of the proposed semi-parametric likelihood method are achieved under the assumption of model (Equation1(1) $p r (D = 1 | X = x, Z = z) = \frac{\exp (α^{*} + x^{⊤} β + γ z)}{1 + \exp (α^{*} + x^{⊤} β + γ z)},$ (1) ). Possible misspecifications of this model pose a validation risk on the semi-parametric likelihood-based inference. Therefore, it is necessary for checking the logistic regression model. We achieve this purpose by checking models (Equation4(4) $\begin{aligned} p r (z_{j} | D = 1) \\ = \int p r (x, z_{j} | D = 1) d x \\ = \int \exp (α + x^{⊤} β + γ z_{j}) \\ \times p r (x | D = 0, z_{j}) d x \cdot p r (z_{j} | D = 0) \\ = \exp (α + γ z_{j} - η_{j}) \cdot p r (z_{j} | D = 0), | d = 1 \end{aligned}$ (4) ) and (Equation5(5) $\begin{aligned} | z d & = 1 p r (x | D = 1, z_{j}) = \frac{p r (x, z_{j} | D = 1)}{p r (z_{j} | D = 1)} \\ = \exp (η_{j} + x^{⊤} β) \cdot p r (x | D = 0, z_{j}) . \end{aligned}$ (5) ) simultaneously, since roughly speaking the validation of the assumption of model (Equation1(1) $p r (D = 1 | X = x, Z = z) = \frac{\exp (α^{*} + x^{⊤} β + γ z)}{1 + \exp (α^{*} + x^{⊤} β + γ z)},$ (1) ) is equivalent to that of both models (Equation4(4) $\begin{aligned} p r (z_{j} | D = 1) \\ = \int p r (x, z_{j} | D = 1) d x \\ = \int \exp (α + x^{⊤} β + γ z_{j}) \\ \times p r (x | D = 0, z_{j}) d x \cdot p r (z_{j} | D = 0) \\ = \exp (α + γ z_{j} - η_{j}) \cdot p r (z_{j} | D = 0), | d = 1 \end{aligned}$ (4) ) and (Equation5(5) $\begin{aligned} | z d & = 1 p r (x | D = 1, z_{j}) = \frac{p r (x, z_{j} | D = 1)}{p r (z_{j} | D = 1)} \\ = \exp (η_{j} + x^{⊤} β) \cdot p r (x | D = 0, z_{j}) . \end{aligned}$ (5) ).

Similar to Qin and Zhang (Citation1997), we construct Kolmogrov–Smirnov type tests for models (Equation4(4) $\begin{aligned} p r (z_{j} | D = 1) \\ = \int p r (x, z_{j} | D = 1) d x \\ = \int \exp (α + x^{⊤} β + γ z_{j}) \\ \times p r (x | D = 0, z_{j}) d x \cdot p r (z_{j} | D = 0) \\ = \exp (α + γ z_{j} - η_{j}) \cdot p r (z_{j} | D = 0), | d = 1 \end{aligned}$ (4) ) and (Equation5(5) $\begin{aligned} | z d & = 1 p r (x | D = 1, z_{j}) = \frac{p r (x, z_{j} | D = 1)}{p r (z_{j} | D = 1)} \\ = \exp (η_{j} + x^{⊤} β) \cdot p r (x | D = 0, z_{j}) . \end{aligned}$ (5) ). Let ${\hat{q}}_{j}$ and ${\hat{p}}_{i j k}$ be the fitted probability weights obtained by plugging the MLE $(\hat{θ}, \hat{α})$ in Equation (Equation7(7) ${\begin{cases} q_{j} = \frac{N_{\cdot j}}{N_{\cdot \cdot}} \frac{1}{1 + λ {\exp (α + γ z_{j} - η_{j}) - 1}}, \\ j = 1, 2, \dots, J, \\ p_{i j k} = \frac{1}{M_{\cdot j}} \frac{1}{1 + λ_{j} {\exp (η_{j} + x_{i j k}^{⊤} β) - 1}}, \\ i = 0, 1, k = 1, 2, \dots, M_{i j}, \end{cases}$ (7) ) and ${\tilde{q}}_{i j} = N_{i j} / N_{i \cdot}$ . A measure of the departure from the assumption of model (Equation4(4) $\begin{aligned} p r (z_{j} | D = 1) \\ = \int p r (x, z_{j} | D = 1) d x \\ = \int \exp (α + x^{⊤} β + γ z_{j}) \\ \times p r (x | D = 0, z_{j}) d x \cdot p r (z_{j} | D = 0) \\ = \exp (α + γ z_{j} - η_{j}) \cdot p r (z_{j} | D = 0), | d = 1 \end{aligned}$ (4) ) is $Δ_{1} = max_{j} (| Δ_{10 j} | + | Δ_{11 j} |),$ where (8) $\begin{aligned} Δ_{10 j} & = n^{1 / 2} ({\tilde{q}}_{0 j} - {\hat{q}}_{j}), \\ Δ_{11 j} & = n^{1 / 2} {{\tilde{q}}_{1 j} - {\hat{q}}_{j} \exp (\hat{α} + z_{j}^{⊤} \hat{γ} - {\hat{η}}_{j})} . \end{aligned}$ (8)

Theorem 2.2

Under the same conditions as Theorem 2.1, $Δ_{10 j}$ and $Δ_{11 j}$ have a jointly normal limiting distribution with mean zero.

We consider testing the goodness-of-fit of model (Equation5(5) $\begin{aligned} | z d & = 1 p r (x | D = 1, z_{j}) = \frac{p r (x, z_{j} | D = 1)}{p r (z_{j} | D = 1)} \\ = \exp (η_{j} + x^{⊤} β) \cdot p r (x | D = 0, z_{j}) . \end{aligned}$ (5) ). We are using $F_{i j} (x)$ to denote the distribution function of $X_{i j k}$ given D = i and $Z = z_{j}$ , i = 0, 1 and $j = 1, 2, \dots, J$ . With the fitted values ${\hat{p}}_{i j k}$ , the empirical likelihood (EL) estimators of $F_{i j} (x)$ are $\begin{aligned} {\hat{F}}_{0 j} (x) & = \sum_{i = 0}^{1} \sum_{k = 1}^{M_{i j}} {\hat{p}}_{i j k} I (x_{i j k} \leq x), \\ {\hat{F}}_{1 j} (x) & = \sum_{i = 0}^{1} \sum_{k = 1}^{M_{i j}} {\hat{p}}_{i j k} \exp ({\hat{η}}_{j} + x_{i j k}^{⊤} \hat{β}) I (x_{i j k} \leq x), \end{aligned}$ where the inequality $x_{i j k} \leq x$ holds element-wise. Let ${\tilde{F}}_{i j} (x)$ denote the empirical distribution corresponding to the subsample ${x_{i j 1}, x_{i j 2}, \dots, x_{i j M_{i j}}}$ for fixed i and j. A measure of the departure from the assumption of model (Equation5(5) $\begin{aligned} | z d & = 1 p r (x | D = 1, z_{j}) = \frac{p r (x, z_{j} | D = 1)}{p r (z_{j} | D = 1)} \\ = \exp (η_{j} + x^{⊤} β) \cdot p r (x | D = 0, z_{j}) . \end{aligned}$ (5) ) is $Δ_{2 j} = sup_{x} | Δ_{20 j} (x) | + sup_{x} | Δ_{21 j} (x) |,$ where (9) $\begin{aligned} Δ_{20 j} (x) & = n^{1 / 2} {{\hat{F}}_{0 j} (x) - {\tilde{F}}_{0 j} (x)}, \\ Δ_{21 j} (x) & = n^{1 / 2} {{\hat{F}}_{1 j} (x) - {\tilde{F}}_{1 j} (x)} . \end{aligned}$ (9)

Theorem 2.3

Under the same conditions as Theorem 2.1, $Δ_{20 j} (x)$ and $Δ_{21 j} (x)$ all converge weakly to Brown motions with mean zero.

An overall measure of departures from the assumption of models (Equation4(4) $\begin{aligned} p r (z_{j} | D = 1) \\ = \int p r (x, z_{j} | D = 1) d x \\ = \int \exp (α + x^{⊤} β + γ z_{j}) \\ \times p r (x | D = 0, z_{j}) d x \cdot p r (z_{j} | D = 0) \\ = \exp (α + γ z_{j} - η_{j}) \cdot p r (z_{j} | D = 0), | d = 1 \end{aligned}$ (4) ) and (Equation5(5) $\begin{aligned} | z d & = 1 p r (x | D = 1, z_{j}) = \frac{p r (x, z_{j} | D = 1)}{p r (z_{j} | D = 1)} \\ = \exp (η_{j} + x^{⊤} β) \cdot p r (x | D = 0, z_{j}) . \end{aligned}$ (5) ) is (10) $\begin{aligned} Δ & = max_{j} (| Δ_{10 j} | + | Δ_{11 j} | + sup_{x} | Δ_{20 j} (x) | \\ + sup_{x} | Δ_{21 j} (x) |), \end{aligned}$ (10) where $Δ_{10 j}$ , $Δ_{11 j}$ , $Δ_{20 j}$ , and $Δ_{21 j}$ are all defined in Equations (Equation8(8) $\begin{aligned} Δ_{10 j} & = n^{1 / 2} ({\tilde{q}}_{0 j} - {\hat{q}}_{j}), \\ Δ_{11 j} & = n^{1 / 2} {{\tilde{q}}_{1 j} - {\hat{q}}_{j} \exp (\hat{α} + z_{j}^{⊤} \hat{γ} - {\hat{η}}_{j})} . \end{aligned}$ (8) ) and (Equation9(9) $\begin{aligned} Δ_{20 j} (x) & = n^{1 / 2} {{\hat{F}}_{0 j} (x) - {\tilde{F}}_{0 j} (x)}, \\ Δ_{21 j} (x) & = n^{1 / 2} {{\hat{F}}_{1 j} (x) - {\tilde{F}}_{1 j} (x)} . \end{aligned}$ (9) ). The limiting distribution of Δ is too complicated to be conveniently used in practice. We use a bootstrap procedure (Efron, Citation1979) to approximate it or the p-value.

Before introducing the bootstrap procedure, we briefly review the original two-phase case-control data. The data in the first phase consists of $n_{0}$ controls (D = 0) and $n_{1}$ cases (D = 1). There are $N_{i j}$ data in the strata with D = i and $Z = z_{j}$ , for $i = 0, 1$ and $j = 1, 2, \dots, J$ . Fix i = 0 or 1, the vector $(N_{i 1}, N_{i 2}, \dots, N_{i J})$ follows a multinomial distribution with parameters $n_{i}$ and ${p r (z_{j} | D = i) : j = 1, 2, \dots, J}$ . For each pair $(i, j)$ , the data in the second phase are $M_{i j}$ independent and identically distributed observations $X_{i j 1}, X_{i j 2}, \dots, X_{i j M_{i j}}$ from $p r (x | D = i, Z = z_{j})$ . If $M_{i j}$ are non-random integers, they are pre-specified. Otherwise, they are often determined by sampling proportions, say $r_{i j} \in (0, 1)$ , and then $M_{i j} = N_{i j} r_{i j}$ . We use $q_{i j}$ and $F_{i j} (x)$ to denote $p r (Z = z_{j} | D = i)$ and $p r (x | D = i, Z = z_{j})$ , respectively. And we have shown that $q_{1 j} = q_{0 j} \exp (α_{0} + γ_{0} z_{j} - η_{0})$ under model (Equation4(4) $\begin{aligned} p r (z_{j} | D = 1) \\ = \int p r (x, z_{j} | D = 1) d x \\ = \int \exp (α + x^{⊤} β + γ z_{j}) \\ \times p r (x | D = 0, z_{j}) d x \cdot p r (z_{j} | D = 0) \\ = \exp (α + γ z_{j} - η_{j}) \cdot p r (z_{j} | D = 0), | d = 1 \end{aligned}$ (4) ), and $F_{1 j} (x) = \int_{- \infty}^{x} \exp (η_{0} + x^{⊤} β_{0}) d F_{0 j} (x)$ under model (Equation5(5) $\begin{aligned} | z d & = 1 p r (x | D = 1, z_{j}) = \frac{p r (x, z_{j} | D = 1)}{p r (z_{j} | D = 1)} \\ = \exp (η_{j} + x^{⊤} β) \cdot p r (x | D = 0, z_{j}) . \end{aligned}$ (5) ).

Our bootstrap procedure to approximate the p-value of the goodness-of-fit test for model (Equation1(1) $p r (D = 1 | X = x, Z = z) = \frac{\exp (α^{*} + x^{⊤} β + γ z)}{1 + \exp (α^{*} + x^{⊤} β + γ z)},$ (1) ) based on Δ is as follows.

Based on the original two-phase case-control data, do the following steps:
1. Calculate the EL estimator $(\hat{θ}, \hat{α}) = \arg max_{θ, α} ℓ (θ, α)$
2. Calculate the EL estimators ${\hat{q}}_{0 j} = {\hat{q}}_{j}$ , ${\hat{q}}_{1 j} = {\hat{q}}_{j} \exp (\hat{α} + z_{j}^{⊤} \hat{γ} - {\hat{η}}_{j})}$ , ${\hat{F}}_{0 j} (x)$ , and ${\hat{F}}_{1 j} (x) = \int_{- \infty}^{x} \exp (\hat{η} + t^{⊤} \hat{β}) d {\hat{F}}_{0 j} (t)$ under model (Equation1(1) $p r (D = 1 | X = x, Z = z) = \frac{\exp (α^{*} + x^{⊤} β + γ z)}{1 + \exp (α^{*} + x^{⊤} β + γ z)},$ (1) ).
3. Calculate the test statistic Δ defined in Equation (Equation10(10) $\begin{aligned} Δ & = max_{j} (| Δ_{10 j} | + | Δ_{11 j} | + sup_{x} | Δ_{20 j} (x) | \\ + sup_{x} | Δ_{21 j} (x) |), \end{aligned}$ (10) ).
Generate a bootstrap sample based on the original sample.
1. Generate $n_{0}$ z-values from the discrete distribution defined by $P (Z = z_{j}) = {\hat{q}}_{0 j}$ , and $n_{1}$ z-values from the discrete distribution defined by $P (Z = z_{j}) = {\hat{q}}_{1 j}$ . We suppose that there are $N_{i j}^{*}$ observations with D = i and $Z = z_{j}$ . Then the bootstrap sample in the first phase is $S_{1}^{*} = {N_{i j}^{*}}$ .
2. At the strata with D = i and $Z = z_{j}$ , draw $M_{i j}$ (the non-random case) or $M_{i j} = N_{i j}^{*} r_{i j}$ (the random case) observations from ${\hat{F}}_{i j} (x)$ . The resulting observations constitute a bootstrap sample in the second phase, i.e. $S_{2}^{*} = {(x_{i j 1}^{*}, \dots, x_{i j M_{i j}}^{*}) : i = 0, 1; j = 1, \dots, J}$ .
Calculate the test statistic by replacing the original sample with the bootstrap sample, which consists of $S_{1}^{*}$ and $S_{2}^{*}$ . We denote the resulting test statistic by $Δ^{*}$ .
Repeat steps 2 and 3 B times (e.g. B = 200). We denote the resulting B test statistics by $Δ_{1}^{*}, \dots, Δ_{B}^{*}$ . The bootstrap p-value of the test based on Δ is $(1 / B) \sum_{i = 1}^{B} I (Δ_{i}^{*} \geq Δ)$ .

3. Simulation study

By simulations, we compare our method with Breslow and Holubkov (Citation1997)'s method from two aspects: interval estimation and hypothesis testing. We also investigate the finite-sample performance of the proposed goodness-of-fit test for model (Equation1(1) $p r (D = 1 | X = x, Z = z) = \frac{\exp (α^{*} + x^{⊤} β + γ z)}{1 + \exp (α^{*} + x^{⊤} β + γ z)},$ (1) ).

3.1. Interval estimation and hypothesis testing about θ

Our proposed interval estimators and hypothesis tests about θ are both based on the semi-parametric empirical likelihood-ratio function. In contrast, the counterparts of Breslow and Holubkov (Citation1997) are both based on Wald's method and their maximum likelihood estimators, which are implemented with the R package missreg3 in R version 2.15.3. We use ELR and Wald to denote our and their methods, respectively.

We generate two-phase case-control data from model (Equation1(1) $p r (D = 1 | X = x, Z = z) = \frac{\exp (α^{*} + x^{⊤} β + γ z)}{1 + \exp (α^{*} + x^{⊤} β + γ z)},$ (1) ), with $x = (x_{1}, x_{2}, x_{3})$ following the trivariate standard normal distribution, and $z = [s]$ , where $s \sim U (0, 2)$ and $[s]$ denotes the nearest integer to s. The true parameter value of $α_{*}$ is set to $- 4$ in all cases; and the rest model parameters $(γ, β_{1}, β_{2}, β_{3})$ are to be determined. We set $n_{0} = n_{1} = 200$ , $n_{0} = n_{1} = 100$ , and the number of simulations to 2000.

We first compare the ELR and Wald methods from interval estimation. We construct confidence intervals for γ, $(γ, β_{1})$ , $(γ, β_{1}, β_{2})$ , and $(γ, β_{1}, β_{2}, β_{3})$ at confidence levels 90%, 95%, and 99%. Two groups of parameter values are considered: $(γ, β_{1}, β_{2}, β_{3}) = (0, 0, 1, 2)$ and $(1, 0, 1, 2)$ . The simulated coverage probabilities of the ELR and Wald intervals are tabulated in Tables and . It is seen that both intervals have very accurate coverage probabilities. The ELR interval has slight under-coverage whereas the Wald interval has over-coverage when the total sample size is as small as $n_{0} = n_{1} = 100$ . The under-coverage of the ELR interval is at most 2%; however, the over-coverage of the Wald interval can be as large as 4.5%, in the case of $(γ, β_{1}, β_{2})$ at the 90% confidence level. Both of their coverage accuracies improve as the sample size increases to 200.

Table 1. Coverage probabilities (%) of the Wald and ELR confidence intervals with $n_{0} = n_{1} = 100$ .

Display Table

Table 2. Coverage probabilities (%) of the Wald and ELR confidence intervals with $n_{0} = n_{1} = 200$ .

Display Table

Next, we study the finite-sample performance of the ELR and Wald tests. We consider the hypothesis testing problems in Examples 3.1 and 3.2, respectively, which are concerned with the same four parameter combinations. The effect of the strata variable Z is set to zero in the null hypotheses of Example 3.1 and nonzero in those of Example 3.2, respectively.

Example 3.1

Consider four groups of hypotheses ( $k = 0, 1, \dots, 4$ ):

(A1)	$H_{0} : γ = 0 \leftrightarrow H_{1} : γ = 0.3 \times k$ , where $(β_{1}, β_{2}, β_{3}) = (- 1, 1, 2)$ ;
(A2)	$H_{0} : (γ, β_{1}) = (0, 0) \leftrightarrow H_{1} : (γ, β_{1}) = (0.35, - 0.1) \times k$ , where $(β_{2}, β_{3}) = (1, 2)$ ;
(A3)	$H_{0} : (γ, β_{1}, β_{2}) = (0, 0, 0) \leftrightarrow H_{1} : (γ, β_{1}, β_{2}) = (0.35, - 0.1, - 0.1) \times k$ , where $β_{3} = 2$ ;
(A4)	$H_{0} : (γ, β_{1}, β_{2}, β_{3}) = (0, 0, 0, 0) \leftrightarrow H_{1} : (γ, β_{1}, β_{2}) = (0.15, - 0.1, - 0.1, - 0.1) \times k$ .

Example 3.2

Consider four groups of hypotheses ( $k = 0, 1, \dots, 4$ ):

(B1)	$H_{0} : γ = 1 \leftrightarrow H_{1} : γ = 1 + 0.3 \times k$ , where $(β_{1}, β_{2}, β_{3}) = (- 1, 1, 2)$ ;
(B2)	$H_{0} : (γ, β_{1}) = (1, 1) \leftrightarrow H_{1} : (γ, β_{1}) = (1, 1) + (0.25, - 0.1) \times k$ , where $(β_{2}, β_{3}) = (1, 2)$ ;
(B3)	$H_{0} : (γ, β_{1}, β_{2}) = (1, 1, 1) \leftrightarrow H_{1} : (γ, β_{1}, β_{2}) = (1, 1, 1) + (0.35, - 0.05, - 0.05) \times k$ , where $β_{3} = 2$ ;
(B4)	$H_{0} : (γ, β_{1}, β_{2}, β_{3}) = (1, 1, 1, 1) \leftrightarrow H_{1} : (γ, β_{1}, β_{2}, β_{3}) = (1, 1, 1, 1) + (0.1, - 0.1, - 0.1, - 0.1) \times k$ .

The simulated rejection rates of the ELR and Wald tests in all the cases are displayed in Figures and . The rejection rates are simulated type I errors when k = 0 and are simulated powers when k>0. It is seen that both tests have well controlled type I errors. The ELR test is uniformly more powerful than the Wald test in all cases except case (B4), where the two tests have very close powers. The power gain of the ELR test over the Wald test is as large as 10%, for example, in case (B1).

Figure 1. Rejection probabilities of the ELR (dashed line) and Wald (solid line) tests in Scenarios (A1) and (A2) (First row), and (A3) and (A4) (Second row) of Example 3.1.

Figure 2. Rejection probabilities of the ELR (dashed line) and Wald (solid line) tests in Scenarios (B1) and (B2) (First row), and (B3) and (B4) (Second row) of Example 3.2.

Overall, compared with the Wald method, the ELR interval has comparable accuracy while the ELR tests are often more powerful. It is worth mentioning that a clear advantage of the ELR test over the Wald test is that it does not need a variance estimation. The price is numerical optimisation, which is not an issue in the current era of high-speed computer.

3.2. Goodness-of-fit test for logistic regression model

In this subsection, we investigate the finite-sample performance of the proposed goodness-of-fit test for the logistic regression model (Equation1(1) $p r (D = 1 | X = x, Z = z) = \frac{\exp (α^{*} + x^{⊤} β + γ z)}{1 + \exp (α^{*} + x^{⊤} β + γ z)},$ (1) ); its null distribution is approximated by the bootstrap procedure in Section 2.4. No competitors are taken into account because to the best of our knowledge this problem has never been studied in the context of two-phase case-control data.

We set $n_{0} = n_{1} = 500$ and the number of simulations to 1000. We also suppose that Z takes three different values with $z_{1} = 0.15$ , $z_{2} = 0.2$ , and $z_{3} = 0.3$ . Given D, we generate Z from the distribution function $q_{i j} = p r (Z = z_{j} | D = i)$ such that $q_{01} = \exp (- z_{1} γ)$ , $q_{02} = \exp (- z_{2} γ)$ , $q_{03} = 1 - \exp (- z_{1} γ) - \exp (- z_{2} γ)$ , $q_{11} = \exp (α - η_{1})$ , $q_{12} = \exp (α - η_{2})$ , and $q_{13} = \exp (α - η_{3} + z_{3} γ) {1 - \exp (- z_{1} γ) - \exp (- z_{2} γ)}$ . We consider two types of x: univariate and bivariate, corresponding to Examples 3.3 and 3.4, respectively.

Example 3.3

Suppose that X is univariate, and its conditional distributions given D and Z are $X | (D = 0, Z = z_{1}) \sim N (0, 1)$ , $X | (D = 0, Z = z_{2}) \sim N (1, 1)$ , $X | (D = 0, Z = z_{3}) \sim N (1 / 2, 1)$ , $X | (D = 1, Z = z_{1}) \sim N (β, σ_{1}^{2})$ , $X | (D = 1, Z = z_{2}) \sim N (β + 1, σ_{1}^{2})$ , and $X | (D = 1, X = z_{3}) \sim N (β + 1 / 2, σ_{1}^{2})$ . It is verified that $η_{1} = - 0.5 β^{2}$ , $η_{2} = - 0.5 β^{2} - β$ , and $η_{3} = - 0.5 β^{2} - 0.5 β$ . We set $β = 0.2$ , $γ = 6$ , and $σ_{1}^{2} = 1 + k \times 0.4$ for $k = 0, \dots, 4$ .

When k = 0 or $σ_{1}^{2} = 1$ , it is verified that $p r (x | D = 1, z_{1} = 0.15) / p r (x | D = 0, z_{1} = 0.15) = \exp (η_{1} + x β)$ , $p r (x | D = 1, z_{2} = 0.2) / p r (x | D = 0, z_{2} = 0.2) = \exp (η_{2} + x β)$ , and $p r (x | D = 1, z_{3} = 0.3) / p r (x | D = 0, z_{2} = 0.3) = \exp (η_{3} + x β)$ . In other words, model (Equation1(1) $p r (D = 1 | X = x, Z = z) = \frac{\exp (α^{*} + x^{⊤} β + γ z)}{1 + \exp (α^{*} + x^{⊤} β + γ z)},$ (1) ) is true when $σ_{1}^{2} = 1$ . The rejection proportion of the proposed test is a simulated type I error. When $k \neq 0$ or $σ_{1}^{2} \neq 1$ , the linear logistic model assumption is not true any longer, and the rejection proportions are simulated powers of the proposed test.

Example 3.4

Let $1 = (1, 1)^{⊤}$ , $I_{2} = d i a g (1)$ , and $c = (1 / 2, 1 / 2)^{⊤}$ . Suppose that X is bivariate and given D and Z, its conditional distributions are $X | (D = 0, Z = z_{1}) \sim N_{2} (0 \times 1, I_{2})$ , $X | (D = 0, Z = z_{2}) \sim N_{2} (1, I_{2})$ , $X | (D = 0, Z = z_{3}) \sim N_{2} (0.5 \times 1, I_{2})$ , $X | (D = 1, Z = z_{1}) \sim N_{2} (β, σ_{2}^{2} \times I_{2})$ , $X | (D = 1, Z = z_{2}) \sim N_{2} (β + 1, σ_{2}^{2} \times I_{2})$ , and $X | (D = 1, X = z_{3}) \sim N_{2} (β + 0.5 \times 1, σ_{2}^{2} \times I_{2})$ . It follows that $η_{1} = - 0.5 β^{⊤} β$ , $η_{2} = - 0.5 β^{⊤} \cdot (β + 1)$ , and $η_{3} = - 0.5 β^{⊤} (β + c)$ . We set $β = (- 1.2, 1.2)^{⊤}$ , $γ = 6$ , and $σ_{2}^{2} = 1 + k$ for $k = 0, \dots, 4$ .

Similar to Example 3.3, model (Equation1(1) $p r (D = 1 | X = x, Z = z) = \frac{\exp (α^{*} + x^{⊤} β + γ z)}{1 + \exp (α^{*} + x^{⊤} β + γ z)},$ (1) ) is true in Example 3.4 when k = 0 or $σ_{2}^{2} = 1$ and is violated otherwise. Table presents the simulated rejection rates of the proposed goodness-of-fit test at the 5% significance level for Examples 3.3 and 3.4. Again the results corresponding to k = 0 are type I errors and others are powers. The proposed test has type I errors 6.1% and 5.1%, both of which are well controlled under the 5% significance level. As k increases from 1 to 4, the linear logistic model is violated more and more severely, and the proposed test has increasing powers to nearly 100%. This provides evidence for the consistency and validation of the proposed test for testing goodness-of-fit of model (Equation1(1) $p r (D = 1 | X = x, Z = z) = \frac{\exp (α^{*} + x^{⊤} β + γ z)}{1 + \exp (α^{*} + x^{⊤} β + γ z)},$ (1) ).

Table 3. Rejection probabilities (%) of model checking of Examples 3.3 and 3.4.

Display Table

4. Real-data analysis

In this section, we illustrate the proposed semi-parametric empirical likelihood method by analysing a simulated two-phase data set constructed by the actual National Wilms Tumor Study Group (NWTSG). See D'Angio et al. (Citation1989), Breslow and Chatterjee (Citation1999), and Green et al. (Citation1998) for more description about the data. A problem of interest is to investigate whether there exists an association between treatment outcomes and tumour histology in 4028 children, who were diagnosed with the embryonal cancer of the kidney, known as Wilms tumour. The outcome variable of interest in this study is the relapse, which takes 1 (the patient's condition has deteriorated) and 0 (the patient's condition has improved) values. The covariates of interest include institution histology or IH (0 if favourable, 1 unfavourable), central histology or CH (0 if favourable, 1 unfavourable), stage ((1,0,0) if stage-II, (0,1,0) if stage-III, and (0,0,1) if stage-IV), and age (in months).

Two types of histology measurements are available in the study. First, according to histology, the classification of tumours as favourable and unfavourable is based on the pathologist at the hospital where the children were admitted. As the study data came from many different hospitals, institutional histology may be prone to errors due to the subjective judgments of different pathologists. Thus, the NWTGS re-evaluated histology using a central pathologist recruited for the entire study which was called central histology, the second measurement for histology available in the study. The covariate IH has no prognostic value once the account has been taken into central histology. Even so, we take it as a stratum variable in two-phase case-control sampling. The histologic diagnosis results are tabulated in Table .

Table 4. IH and outcome for Wilms tumour.

Download CSV Display Table

We take the data in Table as if they were prospective with a prevalence of 14.1%. In the first phase, we randomly choose $N_{1}$ cases from relapsed population and $N_{0}$ controls from non-relapsed population, where $N_{1} = 500$ and $N_{0} = 500$ . Then we classified the $N_{1}$ cases and $N_{0}$ controls according to their IH condition. In the second phase, we randomly choose $M_{i j} = N_{i j} / 3$ $(i = 0, 1; j = 1, 2)$ observations in each subpopulation.

Before applying model (Equation1(1) $p r (D = 1 | X = x, Z = z) = \frac{\exp (α^{*} + x^{⊤} β + γ z)}{1 + \exp (α^{*} + x^{⊤} β + γ z)},$ (1) ) and our empirical likelihood method to analyse the data, we use the proposed goodness-of-fit test to check the validation of the linear logistic regression model. The resulting p-value of 0.71 provides no evidence against this model at 5% significance level.

Wilms tumours have no clear cause, but there are some potential factors that affect the risk. Besides stage, age may be an important prognostic factor for Wilms tumour. Figure displays the histograms of age at all the four combinations of outcome and stratum variables. We see that the four condition distributions of age are quite different from each other, which intuitively implies that age is probably associated with the outcome. Formally the ELR (p-value of 0.0014) and Wald (p-value of 0.0020) tests both provide strong evidence for the association of age and outcome at the 5% significance level, although the evidence from the ELR test is stronger. We also test whether stage and IH are associated with the outcome. For stage, the ELR test gives a positive answer (p-value of 0.0498) while the Wald test gives a negative answer (p-value of 0.0534). Both tests give negative answers (their p-values are 0.2082 and 0.1871, respectively) for IH and give positive answers (their p-values are 0 and 0.001, respectively) for CH.

Figure 3. Histograms of Age in different combinations of outcome (Row 1, relapses or cases; Row 2, non-relapses or controls) and the stratum variable (IH; Column 1, unfavourable; Column 2, favourable).

The point and interval estimators of the coefficients of all the covariates are presented in Table . The coefficient of age, 0.1316, is positive, indicating that older people are more likely to relapse than younger people. The coefficient of CH is significantly nonzero while no evidence supports that of IH is nonzero. This result is consistent with that IH has no prognostic value once account has been taken into central histology. The interval estimators confirm the above hypothesis testing results: namely age and CH are important factors to relapse.

Table 5. Results of regression parameter estimates in Wilms tumour study.

Download CSV Display Table

5. Discussion

This paper develops an empirical likelihood approach to two-phase case-control data. We require that the covariate Z takes finite different values because it is regarded as a stratification variable. Two issues about the covariate or strata are worth discussing. First, if Z is continuous, we need to transform it to a discrete variable U taking finite different values. Then we may proceed with U in place of Z, although there may be information loss in doing so. Alternatively, we may take U as a stratification variable, and take the raw variable Z as a subvector of X in model (Equation1(1) $p r (D = 1 | X = x, Z = z) = \frac{\exp (α^{*} + x^{⊤} β + γ z)}{1 + \exp (α^{*} + x^{⊤} β + γ z)},$ (1) ), so that the effect of Z on the disease can still be studied. Second, the performance of the proposed method depends on the approximation accuracy of its large-sample properties and may be undermined if some sample sizes of the $2 J$ strata are too small. A large number of strata may make some strata have very small sizes. To avoid this problem, we recommend stratifying the data such that there are at least 5 observations in each strata and the number of strata in each disease status is small, say 5. Strata with too small sizes should be merged to be large strata so that the asymptotic normality of the proposed estimator and the limiting chisquare distribution of the proposed likelihood ratio have acceptable accuracy.

Acknowledgments

The authors thank the Editor, the Associate Editor, and the two anonymous referees for helpful comments and suggestions that have led to significant improvements in the paper.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

The research was supported by the National Natural Science Foundation of China [grant number 11771144], the State Key Program of National Natural Science Foundation of China [grant number 71931004], [grant number 32030063], the development fund for Shanghai talents, and the 111 project (B14019).

Notes on contributors

Zhen Sheng

Zhen Sheng is a PhD candidate in School of Statistics, Faculty of Economic and Management, East China Normal University, China. She received her master degree in Statistics in 2018 from Qufu Normal University, China. Her research interests include case-control study, Genome-wide association study, and experimental design.

Yukun Liu

Yukun Liu is a Professor in School of Statistics, Faculty of Economic and Management, East China Normal University, China. He received his PhD in Statistics in 2009 from Nankai University, China. His research is focused on empirical likelihood and its applications to case-control data, capture-recapture data, selection biased data, and finite mixture models.

Jing Qin

Jing Qin is a Mathematical Statistician in the Biostatistics Research Branch of the National Institute of Allergy and Infectious Diseases (NIAID), USA. He received his PhD in Statistics from the University of Waterloo in 1992. His research interests include empirical likelihood, case-control study, length bias sampling, survival analysis, missing data, causal inference, and survey sampling.

References

Anderson, J. A. (1972). Separate sample logistic discrimination. Biometrika, 59(1), 19–35. https://doi.org/10.1093/biomet/59.1.19
Web of Science ®Google Scholar
Anderson, J. A. (1979). Multivariate logistic compounds. Biometrika, 66, 17–26. https://doi.org/10.2307/2335237
Web of Science ®Google Scholar
Breslow, N. E., & Cain, K. C. (1988). Logistic regression for two-stage case-control data. Biometrika, 75(1), 11–20. https://doi.org/10.1093/biomet/75.1.11
Web of Science ®Google Scholar
Breslow, N. E., & Chatterjee, N. (1999). Design and analysis of two-phase studies with binary outcome applied to Wilms tumour prognosis. Applied Statistics, 48(4), 457–468. https://doi.org/10.1111/1467-9876.00165
Google Scholar
Breslow, N., & Day, N. E. (1980). Statistical methods in cancer research. Volume 1. The analysis of case-control studies. IARC Scientific Publications.
Google Scholar
Breslow, N. E., & Holubkov, R. (1997). Maximum likelihood estimation of logistic regression parameters under two-phase, outcome-dependent sampling. Journal of the Royal Statistical Society Series B, 59(2), 447–461. https://doi.org/10.1111/rssb.1997.59.issue-2
Google Scholar
Breslow, N. E., Lumley, T., Ballantyne, C. M., Chambless, L. E., & Kulich, M. (2009). Improved Horvitz-Thompson estimation of model parameters from two-phase stratified samples: Applications in epidemiology. Statistics in Biosciences, 1(1), 32–49. https://doi.org/10.1007/s12561-009-9001-6
Google Scholar
Cai, S., Chen, J., & Zidek, J. V. (2017). Hypothesis testing in the presence of multiple samples under density ratio models. Statistica Sinica, 27, 761–783. https://doi.org/10.5705/ss.2014.168
Web of Science ®Google Scholar
Chen, J., & Liu, Y. (2013). Quantile and quantile-function estimations under density ratio model. Annals of Statistics, 41(3), 1669–1692. https://doi.org/10.1214/13-AOS1129
Web of Science ®Google Scholar
Chen, J., & Qin, J. (1993). Empirical likelihood estimation for finite populations and the effective usage of auxiliary information. Biometrika, 80(1), 107–116. https://doi.org/10.1093/biomet/80.1.107
Web of Science ®Google Scholar
Chen, J., Sitter, R. R., & Wu, C. (2002). Using empirical likelihood methods to obtain range restricted weights in regression estimators for surveys. Biometrika, 89(1), 230–237. https://doi.org/10.1093/biomet/89.1.230
Web of Science ®Google Scholar
D'Angio, G. J., Breslow, N., Beckwith, J. B., Evans, A., Baum, H., Fernbach, D., Hrabovsky, E., Jones, B., & Kelalis, P. (1989). Treatment of Wilms' tumour. Results of the third national Wilms' tumor study. Cancer, 64(2), 349–360. https://doi.org/10.1002/(ISSN)1097-0142
PubMed Web of Science ®Google Scholar
Diao, G., Ning, J., & Qin, J. (2012). Maximum likelihood estimation for semiparametric density ratio model. The International Journal of Biostatistics, 8(1). https://doi.org/10.1515/1557-4679.1372
Google Scholar
DiCiccio, T., Hall, P., & Romano, J. (1991). Empirical likelihood is Bartlett-correctable. The Annals of Statistics, 19(2), 1053–1061. https://doi.org/10.1214/aos/1176348137
Web of Science ®Google Scholar
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1–26. https://doi.org/10.1214/aos/1176344552
Web of Science ®Google Scholar
Farewell, V. (1979). Some results on the estimation of logistic models based on retrospective data. Biometrika, 66(1), 27–32. https://doi.org/10.1093/biomet/66.1.27
Web of Science ®Google Scholar
Flanders, W. D., & Greenland, S. (1991). Analytic methods for two-stage case-control studies and other stratified designs. Statistics in Medicine, 10(5), 739–747. https://doi.org/10.1002/(ISSN)1097-0258
Web of Science ®Google Scholar
Green, D. M., Breslow, N. E., Beckwith, J. B., Finklestein, J. Z., Grundy, P. E., P. R. Thomas, Kim, T., Shochat, S. J., Haase, G. M., Ritchey, M. L., Kelalis, P. P., & D'Angio, G. J. (1998). Comparison between single-dose and divided-dose administration of dactinomycin and doxorubicin for patients with Wilms' tumor: A report from the national Wilms' tumor study group. Journal of Clinical Oncology: Official Journal of the American Society of Clinical Oncology, 16(1), 237–245. https://doi.org/10.1200/JCO.1998.16.1.237
PubMed Web of Science ®Google Scholar
Kitamura, Y. (2006). Empirical likelihood methods in econometrics: theory and practice. Discussion Paper 1569. Cowles Foundation.
Google Scholar
Lawless, J. F., Kalbfleisch, J. D., & Wild, C. J. (1999). Semiparametric methods for response-selective and missing data problems in regression. Journal of the Royal Statistical Society Series B, 61(2), 413–438. https://doi.org/10.1111/rssb.1999.61.issue-2
Google Scholar
Liu, Y., & Chen, J. (2010). Adjusted empirical likelihood with high-order precision. The Annals of Statistics, 38(3), 1341–1362. https://doi.org/10.1214/09-AOS750
Web of Science ®Google Scholar
Luo, X., & Tsai, W. Y. (2012). A proportional likelihood ratio model. Biometrika, 99(1), 211–222. https://doi.org/10.1093/biomet/asr060
Web of Science ®Google Scholar
Neyman, J. (1938). Contribution to the theory of sampling human populations. Journal of the American Statistical Association, 33(201), 101–116. https://doi.org/10.1080/01621459.1938.10503378
Google Scholar
Owen, A. B. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika, 75(2), 237–249. https://doi.org/10.1093/biomet/75.2.237
Web of Science ®Google Scholar
Owen, A. B. (2001). Empirical likelihood. Chapman and Hall.
Google Scholar
Prentice, R. L., & Pyke, R. (1979). Logistic disease incidence models and case-control studies. Biometrika, 66(3), 403–411. https://doi.org/10.1093/biomet/66.3.403
Web of Science ®Google Scholar
Qin, J. (1998). Inferences for case-control and semiparametric two-sample density ratio models. Biometrika, 85(3), 619–630. https://doi.org/10.1093/biomet/85.3.619
Web of Science ®Google Scholar
Qin, J., & Zhang, B. (1997). A goodness-of-fit test for logistic regression models based on case-control data. Biometrika, 84(3), 609–618. https://doi.org/10.1093/biomet/84.3.609
Web of Science ®Google Scholar
Qin, J., & Zhang, B. (2005). Density estimation under a two-sample semiparametric model. Journal of Nonparametric Statistics, 17(6), 665–683. https://doi.org/10.1080/10485250500039346
Web of Science ®Google Scholar
Qin, J., Zhang, H., Li, P., Albanes, D., & Yu, K. (2015). Using covariate-specific disease prevalence information to increase the power of case-control studies. Biometrika, 102(1), 169–180. https://doi.org/10.1093/biomet/asu048
Web of Science ®Google Scholar
Saegusa, T., & Wellner, J. A. (2013). Weighted likelihood estimation under two-phase sampling. Annals of Statistics, 41(1), 269–295. https://doi.org/10.1214/12-AOS1073
PubMed Web of Science ®Google Scholar
Schaid, D. J., Jenkins, G. D., Ingle, J. N., & Weinshilboum, R. M. (2013). Two-phase designs to follow-up genome-wide association signals with DNA resequencing studies. Genetic Epidemiology, 37(3), 229–238. https://doi.org/10.1002/gepi.2013.37.issue-3
Web of Science ®Google Scholar
Schill, W., Jöckel, K. H., Drescher, K., & Timm, J. (1993). Logistic analysis in case-control studies under validation sampling. Biometrika, 80(2), 339–352. https://doi.org/10.1093/biomet/80.2.339
Web of Science ®Google Scholar
Scott, A. J., & Wild, C. J. (1997). Fitting regression models to case-control data by maximum likelihood. Biometrika, 84(1), 57–71. https://doi.org/10.1093/biomet/84.1.57
Web of Science ®Google Scholar
Thomas, D. C., Yang, Z., & Yang, F. (2013). Two-phase and family-based designs for next-generation sequencing studies. Frontiers in Genetics, 4, 276. https://doi.org/10.3389/fgene.2013.00276
Google Scholar
Walker, A. M. (1982). Anamorphic analysis: sampling and estimation for covariate effects when both exposure and disease are known. Biometrics, 38(4), 1025–1032. https://doi.org/10.2307/2529883
Web of Science ®Google Scholar
White, J. E. (1982). A two stage design for the study of the relationship between a rare exposure and a rare disease. American Journal of Epidemiology, 115(1), 119–128. https://doi.org/10.1093/oxfordjournals.aje.a113266
PubMed Web of Science ®Google Scholar
Wu, C., & Thompson, M. E. (2020). Sampling theory and practice. Springer.
Google Scholar
Zhao, P., & Wu, C. (2019). Some theoretical and practical aspects of empirical likelihood methods for complex surveys. International Statistical Review, 87(1), S239–S256. https://doi.org/10.1111/insr.v87.S1
Google Scholar
Zhou, H., Song, R., Wu, Y., & Qin, J. (2011). Statistical inference for a two-stage outcome-dependent sampling design with a continuous outcome. Biometrics, 67(1), 194–202. https://doi.org/10.1111/j.1541-0420.2010.01446.x
PubMed Web of Science ®Google Scholar

Empirical likelihood inference and goodness-of-fit test for logistic regression model under two-phase case-control sampling

ABSTRACT

1. Introduction