
Statistical Theory and Related Fields Volume 3, 2019 - Issue 2


Multivariate small area estimation under nonignorable nonresponse*
* The opinions expressed in this paper are those of the authors and do not necessarily represent the policies of the U.S. Bureau of Labor Statistics and the Israel Central Bureau of Statistics.

Danny Pfeffermann, National Statistician and CBS Director, Jerusalem, Israel; Department of Statistics, Hebrew University, Jerusalem, Israel; Southampton Statistical Sciences Research Institute (S3RI), University of Southampton, Southampton, UK

Michael Sverchkov (corresponding author), Bureau of Labor Statistics, Washington, DC, USA

Pages 213-223 | Received 01 Jan 2019, Accepted 02 Oct 2019, Published online: 22 Oct 2019

https://doi.org/10.1080/24754269.2019.1676683


ABSTRACT

We consider multivariate small area estimation under nonignorable, not missing at random (NMAR) nonresponse. We assume a response model that accounts for the different patterns of the observed outcomes, (which values are observed and which ones are missing), and estimate the response probabilities by application of the Missing Information Principle (MIP). By this principle, we first derive the likelihood score equations for the case where the missing outcomes are actually observed, and then integrate out the unobserved outcomes from the score equations with respect to the distribution holding for the missing data. The latter distribution is defined by the distribution fitted to the observed data for the respondents and the response model. The integrated score equations are then solved with respect to the unknown parameters indexing the response model. Once the response probabilities have been estimated, we impute the missing outcomes from their appropriate distribution, yielding a complete data set with no missing values, which is used for predicting the target area means. A parametric bootstrap procedure is developed for assessing the mean squared errors (MSE) of the resulting predictors. We illustrate the approach by a small simulation study.

KEYWORDS:

Distribution of missing data
imputation under nonignorable nonresponse
missing information principle
MSE estimation
NMAR nonresponse

1. Introduction, models and assumptions

Let $\{y_{i j}, x_{i j}; i = 1, \dots, M, j = 1, \dots, N_{i}\}$ represent the data in a finite population of $N$ units belonging to $M$ areas, with $N_{i}$ units in area $i$, $\sum_{i = 1}^{M} N_{i} = N$, where $y_{i j} = (y_{i j, 1}, \dots, y_{i j, K})^{'}$ is the vector of outcome values for unit $j$ in area $i$ and $x_{i j} = (x_{i j, 1}, \dots, x_{i j, L})^{'}$ is a vector of $L$ corresponding covariates. Note that the use of a single vector $x_{i j}$ for the covariates accommodates situations where in fact different covariates, possibly of different dimension, apply to different observations. We assume that the covariates are known for every unit in the population, from a recent census or some administrative files. Suppose that the outcome values follow the generic two-level population model: (1) $\begin{aligned} & y_{i j} | x_{i j}, u_{i}^{U} \overset{ind}{\sim} f (y_{i j} | x_{i j}, u_{i}^{U}), \quad i = 1, \dots, M, \; j = 1, \dots, N_{i}, \\ & u_{i}^{U} \overset{ind}{\sim} f (u_{i}^{U}); \quad E (u_{i}^{U}) = 0 = (0, \dots, 0)^{'}, \; V (u_{i}^{U}) = Σ^{U}, \end{aligned}$ where $u_{i}^{U} = (u_{i, 1}^{U}, \dots, u_{i, K}^{U})^{'}$ is a $K$-dimensional latent random effect.

In the present article we assume that a noninformative sample has been drawn from the above population, but the observed data is incomplete because of not missing at random (NMAR) nonresponse. By noninformative sampling we mean that the sampling probabilities are not related to the outcome variable of interest after conditioning on the model covariates, such that the conditional distribution of the outcome variable in the sample, given the covariates, is the same as the corresponding distribution in the population from which the sample is taken.

In practice, the observed data in a sample are almost never complete due to non-response. The extent of the non-response may differ from unit to unit within an area, with some units providing all the requested information, while others only providing part of it, with different units answering different questions. And to make matters worse, the non-response is NMAR, that is, the probability of some target component of a unit being missing may depend, at least in part, on the missing target value, as well as the other target values for that unit, whether observed or missing. See e.g., Equation (10) for a simple example. As a consequence, approaches that ignore the non-response and just use the complete responses or those that model the non-response only as functions of the observed covariates may yield biased small area predictors. See the simulation study in Section 5.

As a practical example, consider the Household Expenditure Survey (HES) carried out by Israel's Central Bureau of Statistics. The survey collects information on socio-demographic characteristics, as well as information on income and expenditure. The sample consists of households selected with equal probabilities by a two-stage sampling design. Three important questions asked in this survey (and in other similar surveys across the world) relate to the salary in each of the three months preceding the month of the interview. Table 1 presents the distribution of the observed response patterns of the three variables in the 2017 survey, with “1” defining response and “0” nonresponse. The first position to the left defines the response regarding the salary in the month preceding the interview, the middle position defines the response regarding the salary 2 months before the interview, and the third position defines the response regarding the salary 3 months before the interview.

Table 1. Response patterns on 3 salary variables in Israel's HES, 2017.


Pfeffermann and Sikov (Citation2011) found that the response to salary questions is informative, but they did not consider SAE and restricted attention to a single target variable. For further discussion and illustrations of NMAR nonresponse and related concepts, see Rubin (Citation1976), Little (Citation1982), Little and Rubin (Citation2002), Pfeffermann and Sikov (Citation2011), and references therein.

Returning to the present article, the target is to impute the missing data and use the observed and missing data for estimating the small area means, or other summary measures of interest. It may come as a surprise, but we are not familiar with published articles considering small area estimation under NMAR nonresponse, except for Sverchkov and Pfeffermann (Citation2018), which treats the case of univariate outcomes. The present paper extends the methodology developed in that article. See Pfeffermann and Sikov (Citation2011) and Riddles, Kim, and Im (Citation2016) for reviews and many references addressing the problem of NMAR nonresponse when fitting models to survey data, but with no attention to SAE applications.

Define the response indicator $R_{i j, k} = 1 (0)$ if $y_{i j, k}$ is observed (unobserved), and let $R_{i j} = (R_{i j, 1}, \dots, R_{i j, K})^{'}$ .

Assumption 1.1:

(1a) The response occurs independently between the units,

(1b) $\Pr [R_{i j} = r | (y_{i^{*} j^{*}}, x_{i^{*} j^{*}}, u_{i^{*}}^{U}), i^{*} = 1, \dots, M, j^{*} = 1, \dots, N_{i}] = \Pr [R_{i j} = r | y_{i j}, x_{i j}]$ .

As noted in Sverchkov and Pfeffermann (Citation2018), Assumption 1b is very reasonable. In particular, it states that the probability to respond to the target variable $y_{i j}$ does not depend on the corresponding random effect given $y_{i j}$ , $Pr [R_{i j} = r | y_{i j}, u_{i}^{U}, x_{i j}] = Pr [R_{i j} = r | y_{i j}, x_{i j}]$ . Furthermore, it guarantees the identification of the response model. See Remark 2.2 in Section 2 for further discussion.

Note that under (1) and Assumption 1.1, (2) $\begin{aligned} & f [y_{i j} | x_{i j}, u_{i}^{U}, R_{i j}, \{(y_{i^{*} j^{*}}, x_{i^{*} j^{*}}, R_{i^{*} j^{*}}, u_{i^{*}}^{U}), i^{*} = 1, \dots, M, j^{*} = 1, \dots, N_{i}; (i^{*}, j^{*}) \neq (i, j)\}] \\ & = f (y_{i j} | x_{i j}, u_{i}^{U}, R_{i j}) . \end{aligned}$ We assume a parametric form for the completely observed outcomes, (3) $\begin{aligned} & y_{i j} | x_{i j}, u_{i}, R_{i j} = 1 = (1, \dots, 1)^{'} \sim f_{R} (y_{i j} | x_{i j}, u_{i}; θ_{1}) = f (y_{i j} | x_{i j}, u_{i}, R_{i j} = 1; θ_{1}); \\ & u_{i} = u_{i}^{U} - E (u_{i}^{U} | R_{i j} = 1) \sim f_{R} (u_{i}; θ_{2}) = f (u_{i} | R_{i j} = 1; θ_{2}), \quad E_{R} (u_{i}; θ_{2}) = 0. \end{aligned}$ Note that in general, $u_{i}^{U}$ and $u_{i}$ are different if the nonresponse is NMAR.

Assumption 1.2:

The subset $\{(i, j) : R_{i j} = 1\}$ is not empty for every sampled area, such that the parameters $θ = (θ_{1}, θ_{2})$ can be estimated by restricting to the fully observed data (units with no missing data), using classical small area estimation (SAE) procedures.

Remark 1.1:

Assumption 1.2 is for convenience; it is sufficient for our present approach to have fully observed data in only a sufficient number of areas to allow efficient estimation of the parameters $θ = (θ_{1}, θ_{2})$ . Additionally, for a general response model under which the response to any given component of the multivariate target variable $y$ may depend on the component itself as well as on the other components, with possibly different coefficients for each component (see for example Equation (10) in Section 5), we also require a sufficient number of observations for each response pattern $R_{i j}$ , thus allowing efficient estimation of the response model for each component.

Denote by $\hat{θ} = ({\hat{θ}}_{1}, {\hat{θ}}_{2})$ the estimate of $θ$ obtained that way. For known $θ$ , the best predictor of the random effect $u_{i}$ given the completely observed data, $O_{C} = \{(y_{i j} : R_{i j} = 1), x_{i j}, i = 1, \dots, M, j = 1, \dots, N_{i}\}$ , is $E (u_{i} | O_{C}; θ)$ . We predict ${\hat{u}}_{i} = E (u_{i} | O_{C}; θ = \hat{θ})$ .

Our proposed procedure to deal with the multivariate informative (NMAR) nonresponse consists of the following steps:

1. Fit a parametric model for the completely observed outcomes, (Equation (3)).

2. Fit an appropriate parametric model for the response probabilities, which may depend on the outcome and the covariates (Assumption 1b), indexed by the unknown vector parameter $γ$ ; $p_{r} (y_{i j}, x_{i j}; γ) = Pr [R_{i j} = r | y_{i j}, x_{i j}; γ]$ , with $p_{r} (y_{i j}, x_{i j}; γ)$ differentiable with respect to $γ$ . See Section 2 for details.

3. Impute the missing outcomes from their appropriate distribution with the unknown parameters $(θ_{1}, θ_{2}, γ)$ replaced by their sample estimates, and then use the ‘complete’ sample data (observed and imputed values), to predict the small area means or other area measures of interest. See Section 3 for the imputation equations under the model. Since we assume noninformative sampling such that if there was no nonresponse, the sample data would follow the same model as in the population, in what follows we do not distinguish between the population and sample data and consider the population data as our sample. The results of the present study can easily be generalised to the case where first a sample is selected from the finite population by some non-informative or informative sampling scheme, and then nonresponse occurs. In this case one can use the estimated distribution (3) and the estimated response model for imputation of the missing sample data as defined in the present article. Once the missing sample data are imputed, the small area means of interest can be estimated using the approach of Pfeffermann and Sverchkov (Citation2007).

In the next section, we apply the MIP for estimating the response model parameters and discuss some related questions. In Section 3, we develop the imputation equations for the missing data, which, when combined with the observed data, permit simple estimation of the small area means or other area parameters of interest. In Section 4, we propose a parametric bootstrap procedure for estimating the Root Prediction MSE of the resulting predictors. We illustrate our approach with a small simulation study in Section 5 and conclude with a summary of the main outcomes in Section 6.

2. Estimation of response model parameters

If the missing outcome values were actually observed, the vector parameter $γ$ , indexing the response probabilities model, could be estimated by solving the likelihood equations: (4) $\sum_{r = (0, \dots, 0)^{'}}^{(1, \dots, 1)^{'}} \sum_{(i, j) : R_{i j} = r} \frac{\partial \log p_{r} (y_{i j}, x_{i j}; γ)}{\partial γ} = 0,$ where the external summation is over all the $K$-dimensional vectors with 0,1 elements.
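To make the score equations concrete, consider a hedged illustration that is not part of the article: a univariate outcome ( $K = 1$ ) with a hypothetical logistic response model $p_{1} (y_{i j}, x_{i j}; γ) = \exp (γ^{'} z_{i j}) / [1 + \exp (γ^{'} z_{i j})]$ , where $z_{i j} = (1, x_{i j}, y_{i j})^{'}$ is an assumed design vector and $p_{0} = 1 - p_{1}$ . Under this assumption, $\begin{aligned} \frac{\partial \log p_{1} (y_{i j}, x_{i j}; γ)}{\partial γ} & = [1 - p_{1} (y_{i j}, x_{i j}; γ)] z_{i j}, \\ \frac{\partial \log p_{0} (y_{i j}, x_{i j}; γ)}{\partial γ} & = - p_{1} (y_{i j}, x_{i j}; γ) z_{i j}, \end{aligned}$ so that Equation (4) reduces to the familiar logistic-regression score $\sum_{i, j} [R_{i j} - p_{1} (y_{i j}, x_{i j}; γ)] z_{i j} = 0$ . The equation involves $y_{i j}$ for the nonrespondents as well, which is why it cannot be solved directly from the observed data.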

In practice, the missing data are unobserved for $R_{i j} \neq 1$ and hence the likelihood equations (4) are not operational. However, one may apply in this case the missing information principle (MIP; Cepillini, Siniscialco, & Smith, Citation1955; Orchard & Woodbury, Citation1972). See, in particular, Sverchkov (Citation2008), Sverchkov and Pfeffermann (Citation2018), and Riddles et al. (Citation2016) for recent applications of the principle to handle univariate NMAR nonresponse.

Missing Information Principle: Let $O = \{(y_{i j, k} : R_{i j, k} = 1), x_{i j}, i = 1, \dots, M, j = 1, \dots, N_{i}\}$ denote all the observed data. Since no observations are available for elements $(i j, k) : R_{i j, k} = 0$ , solve instead the equations obtained by replacing the left-hand side of (4) by its best predictor given the observed data: (5) $\begin{aligned} & E \Big(\sum_{r = (0, \dots, 0)^{'}}^{(1, \dots, 1)^{'}} \sum_{(i, j) : R_{i j} = r} \frac{\partial \log p_{r} (y_{i j}, x_{i j}; γ)}{\partial γ} \Big| O\Big) \\ & = E \Big[\sum_{r = (0, \dots, 0)^{'}}^{(1, \dots, 1)^{'}} \sum_{(i, j) : R_{i j} = r} E \Big(\frac{\partial \log p_{r} (y_{i j}, x_{i j}; γ)}{\partial γ} \Big| O, u_{i}, R_{i j} = r\Big) \Big| O\Big] = 0. \end{aligned}$ The expectation $E ((\partial \log p_{r} (y_{i j}, x_{i j}; γ) / \partial γ) | O, u_{i}, R_{i j} = r)$ can be approximated and solved as follows: Let $α$ denote the set of indexes with observed values $y_{i j, k}$ and $β$ denote the complement of $α$ , i.e., $y_{i j, α} = \{y_{i j, k}; r_{k} = 1\}$ , $y_{i j, β} = \{y_{i j, k}; r_{k} = 0\}$ . Denote $R_{i j, α} = (R_{i j, k} : k \in α)$ , $R_{i j, β} = (R_{i j, k} : k \in β)$ , and let $1_{α}, 1_{β}$ be the vectors of ones of the respective dimensions.
By Assumption (1b), (6) $\begin{aligned} & E \Big(\frac{\partial \log p_{r} (y_{i j}, x_{i j}; γ)}{\partial γ} \Big| O, u_{i}, R_{i j} = r\Big) \\ & = \int \frac{\partial \log p_{r} (y_{i j}, x_{i j}; γ)}{\partial γ} f (y_{i j, β} | y_{i j, α}, x_{i j}, u_{i}, R_{i j} = r) \, d y_{i j, β} \\ & = \int \frac{\partial \log p_{r} (y_{i j}, x_{i j}; γ)}{\partial γ} \, \frac{\{[\Pr (R_{i j, β} = 1_{β} | x_{i j}, u_{i}, R_{i j, α} = 1_{α}, y_{i j})]^{- 1} - 1\} f (y_{i j, β} | y_{i j, α}, x_{i j}, u_{i}, R_{i j} = 1)}{\int [\Pr (R_{i j, β} = 1_{β} | x_{i j}, u_{i}, R_{i j, α} = 1_{α}, y_{i j})]^{- 1} f (y_{i j, β} | y_{i j, α}, x_{i j}, u_{i}, R_{i j} = 1) \, d y_{i j, β} - 1} \, d y_{i j, β}; \\ & \Pr (R_{i j, β} = 1_{β} | x_{i j}, u_{i}, R_{i j, α} = 1_{α}, y_{i j}) = \frac{p_{r} (y_{i j}, x_{i j}; γ)}{\int p_{r} (y_{i j}, x_{i j}; γ) f (y_{i j, β} | y_{i j, α}, x_{i j}, u_{i}, R_{i j} = 1) \, d y_{i j, β}} . \end{aligned}$ Finally, solve (5) with respect to $γ$ by substituting $f (y_{i j, β} | y_{i j, α}, x_{i j}, u_{i}, R_{i j} = 1; {\hat{θ}}_{1}) = f_{R} (y_{i j} | x_{i j}, u_{i}; {\hat{θ}}_{1}) / \int f_{R} (y_{i j} | x_{i j}, u_{i}; {\hat{θ}}_{1}) \, d y_{i j, β}$ for $f (y_{i j, β} | y_{i j, α}, x_{i j}, u_{i}, R_{i j} = 1)$ , replacing $u_{i}$ by ${\hat{u}}_{i}$ and dropping the external expectation. See Sverchkov and Pfeffermann (Citation2018) for a similar approximation in the univariate case.

The last equality (product) in (6) extends to the multivariate case the following fundamental relationship between the sample and sample-complement distributions, derived in Sverchkov and Pfeffermann (Citation2004) for the univariate case: (7) $f (y_{i j} | x_{i j}, u_{i}, R_{i j} = 0) = \frac{[p_{r}^{- 1} (y_{i j}, x_{i j}) - 1] f (y_{i j} | x_{i j}, u_{i}, R_{i j} = 1)}{E \{[p_{r}^{- 1} (y_{i j}, x_{i j}) - 1] | x_{i j}, u_{i}, R_{i j} = 1\}} .$ Equation (7) and its multivariate extension in Equation (6) form the basis for our proposed approach. It states that the distribution of an unobserved (missing) value $y_{i j}$ is defined mathematically by the distribution of $y_{i j}$ if it was observed, and the response model. Notice that under NMAR nonresponse, the distribution of $y_{i j}$ given that the unit responded is different from the distribution of $y_{i j}$ given that the unit did not respond, and also different from the population distribution of $y_{i j}$ , before nonresponse takes place. The proof of the multivariate extension applied in (6) follows the same simple steps of the proof of (7) in Sverchkov and Pfeffermann (Citation2004), utilising Bayes theorem. See also Sverchkov (Citation2008) and Riddles et al. (Citation2016).
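Relationship (7) can be checked numerically for a discrete outcome. The following is a minimal Python sketch, not taken from the article: the function name `nonrespondent_pmf` and all numbers are hypothetical, and the random effect and covariates are suppressed for brevity. It computes the nonrespondents' distribution from the respondents' distribution and the response probabilities, and compares it with the distribution obtained directly from an assumed population model by Bayes' theorem.

```python
def nonrespondent_pmf(resp_pmf, resp_prob):
    """Relationship (7) for a discrete outcome.

    resp_pmf[y]  : Pr(y | R = 1), the respondents' distribution
    resp_prob[y] : Pr(R = 1 | y), the response probability
    Returns a dict with Pr(y | R = 0).
    """
    # Numerator of (7): [1 / p_r(y) - 1] * f(y | R = 1) for every value y.
    weights = {y: (1.0 / resp_prob[y] - 1.0) * resp_pmf[y] for y in resp_pmf}
    # Denominator of (7): E{[1 / p_r(y) - 1] | R = 1}, the sum of the weights.
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}


# Hypothetical population: Pr(y = 1) = 0.4, response more likely when y = 1.
pop_pmf = {0: 0.6, 1: 0.4}
resp_prob = {0: 0.5, 1: 0.8}

# Respondents' distribution implied by the population model (Bayes' theorem).
p_resp = sum(pop_pmf[y] * resp_prob[y] for y in pop_pmf)
resp_pmf = {y: pop_pmf[y] * resp_prob[y] / p_resp for y in pop_pmf}

# Nonrespondents' distribution computed directly from the population model.
direct = {y: pop_pmf[y] * (1.0 - resp_prob[y]) / (1.0 - p_resp) for y in pop_pmf}

# The same distribution recovered from the respondents' side only, via (7).
via_eq7 = nonrespondent_pmf(resp_pmf, resp_prob)
```

In this toy setting `via_eq7` and `direct` agree, which illustrates that the respondents' distribution and the response model together determine the distribution of the missing values without knowledge of the population model.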

In the Appendix, we illustrate the construction of Equation (6) under the mixed logistic model for the outcome variable.

Remark 2.1:

The dimension of the set of equations in (5) is equal to the dimension of $γ$ indexing the response model and hence it is impossible to estimate the parameters $γ$ and the parameters $θ = (θ_{1}, θ_{2})$ of the outcome model defined by (3), by solely solving this set.

Remark 2.2:

A fundamental question regarding the use of the MIP equations (5) is the existence of a unique solution, or more generally, the identifiability of the response model. For the univariate case, Riddles et al. (Citation2016) deal with NMAR nonresponse in the general context of sample surveys by following an approach proposed by Sverchkov (Citation2008), which is similar to our present approach. Riddles et al. (Citation2016) established the following fundamental condition for the response model identifiability: the covariates $x$ can be decomposed as $x = (x_{1}, x_{2})$ , with $\dim (x_{2}) \geq 1$ , such that $\Pr (R_{i j} = 1 | y_{i j}, x_{i j}) = \Pr (R_{i j} = 1 | y_{i j}, x_{1 i j})$ . In other words, the covariates in $x_{2}$ that appear in the outcome model do not affect the response probabilities, given the outcome and the other covariates. Covariates with this property may or may not exist in a general set up, but interestingly enough, SAE models actually contain such a variable, namely, the random effects. The random effects play a fundamental role in SAE models so the outcome clearly depends on them, but it is reasonable to assume that the response probabilities do not depend on the random effects, given the outcome value (which depends on the random effects). In practice, the random effects are unobservable but we estimate them and then solve the equations (5) by conditioning on the estimated effects. So, it is actually the estimated random effects that play the role of the covariates $x_{2}$ . In practice, other covariates that are predictive of the outcome but not of the response might exist as well.

3. Imputation of the missing data

Once the parameters $θ$ and $γ$ are estimated, the estimates can be substituted (together with ${\hat{u}}_{i}$ ) into the model holding for the missing data, using the relationship used in (6), yielding the following estimated distribution. Let $y_{i j, β} = \{y_{i j, k}; r_{k} = 0\}$ define, as before, the unobserved data. (8) $\begin{aligned} & f (y_{i j, β} | y_{i j, α}, x_{i j}, {\hat{u}}_{i}, R_{i j} = r; \hat{γ}, \hat{θ}) \\ & = \frac{\Big[\Big(\frac{p_{r} (y_{i j}, x_{i j}; \hat{γ})}{\int p_{r} (y_{i j}, x_{i j}; \hat{γ}) f (y_{i j, β} | y_{i j, α}, x_{i j}, {\hat{u}}_{i}, R_{i j} = 1) \, d y_{i j, β}}\Big)^{- 1} - 1\Big] \frac{f_{R} (y_{i j} | x_{i j}, {\hat{u}}_{i}; {\hat{θ}}_{1})}{\int f_{R} (y_{i j} | x_{i j}, {\hat{u}}_{i}; {\hat{θ}}_{1}) \, d y_{i j, β}}}{\int \Big(\frac{p_{r} (y_{i j}, x_{i j}; \hat{γ})}{\int p_{r} (y_{i j}, x_{i j}; \hat{γ}) f (y_{i j, β} | y_{i j, α}, x_{i j}, {\hat{u}}_{i}, R_{i j} = 1) \, d y_{i j, β}}\Big)^{- 1} \frac{f_{R} (y_{i j} | x_{i j}, {\hat{u}}_{i}; {\hat{θ}}_{1})}{\int f_{R} (y_{i j} | x_{i j}, {\hat{u}}_{i}; {\hat{θ}}_{1}) \, d y_{i j, β}} \, d y_{i j, β} - 1} . \end{aligned}$ Note again that the distribution $f_{R} (y_{i j} | x_{i j}, {\hat{u}}_{i}; {\hat{θ}}_{1})$ is of the observed data and can thus be estimated from the data using standard SAE model fitting procedures.

Imputation of the missing data can be carried out by drawing at random from the distribution (8). One may draw a single observation or multiple observations.

Once the missing observations are imputed, prediction of the true population mean of the outcome variable or other measures of interest is carried out by application of standard procedures. See the empirical study in Section 5.

Remark 2.3:

By Assumption 1.1, the response occurs independently between units.

4. Estimation of prediction MSE

As in any other statistical inference problem, one has to assess the error of the resulting predictors. In SAE applications under the frequentist paradigm, it is common to estimate the Root Prediction Mean Squared Error (RPMSE). It is quite obvious that no analytic expression of the RPMSE can be derived, given the complexity of the prediction procedure, and we therefore propose a bootstrap procedure. As before, we assume for convenience no sampling, such that the sample consists of all the population units. See Remark 4.1 below. The proposed bootstrap procedure consists of the following steps:

B0. Impute the missing values as developed in Section 3. Consider the pseudo-population of complete responses as the ‘true’ population and calculate the corresponding true-pseudo area means.
B1. For each unit $(i, j)$ with complete observation $y_{i j}^{c}$ generated in Step B0, draw observed outcomes with probabilities $p_{r} (y_{i j}^{c}, x_{i j}; \hat{γ})$ .
B2. Apply all estimation and imputation procedures described in Sections 2 and 3 to the observed sample obtained in Step B1. Estimate all the area means.
B3. Repeat Steps B1 and B2 independently B times (B large) and compute for each area $m$ and each component $k$ the bootstrap RPMSE, (9) $RPMSE_{m, k} = \frac{1}{B} \sum_{b = 1}^{B} ({\hat{\bar{Y}}}_{m, k, b} - {\bar{Y}}_{m, k}^{B 0})^{2}; \quad m = 1, \dots, M,$ where ${\hat{\bar{Y}}}_{m, k, b}$ is the predictor obtained from bootstrap sample $b$ , $b = 1, \dots, B$ , for the mean of the k-th component of the outcome variable in area $m$ and ${\bar{Y}}_{m, k}^{B 0}$ is the corresponding pseudo mean in area $m$ as obtained in Step B0.
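Steps B1 to B3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `bootstrap_mse` and `respondent_mean` are hypothetical names, the function handles a single outcome component, and `respondent_mean` is a naive placeholder standing in for the full estimation and imputation procedure of Sections 2 and 3. The toy data and response probabilities are invented for the example.

```python
import numpy as np


def bootstrap_mse(y_complete, area, response_prob, predict_area_means,
                  n_boot=200, seed=0):
    """Bootstrap Steps B1-B3 for one outcome component.

    y_complete         : (N,) pseudo-population after imputation (Step B0)
    area               : (N,) integer area labels 0, ..., M-1
    response_prob      : callable, y -> (N,) estimated response probabilities
    predict_area_means : callable, (y, responded, area) -> (M,) predictions;
                         hypothetical stand-in for Sections 2 and 3, which
                         must only use the entries of y where responded is True
    """
    rng = np.random.default_rng(seed)
    n_areas = int(area.max()) + 1
    counts = np.bincount(area, minlength=n_areas)
    # Step B0: pseudo-true area means of the completed population.
    pseudo_means = np.bincount(area, weights=y_complete, minlength=n_areas) / counts
    sq_err = np.zeros(n_areas)
    for _ in range(n_boot):
        # Step B1: redraw the response indicators under the estimated model.
        responded = rng.random(y_complete.size) < response_prob(y_complete)
        # Step B2: re-estimate and predict from the bootstrap sample.
        pred = predict_area_means(y_complete, responded, area)
        sq_err += (pred - pseudo_means) ** 2
    # Step B3: average squared deviation from the pseudo means.
    return sq_err / n_boot


def respondent_mean(y, responded, area):
    """Naive placeholder predictor: mean of the respondents in each area."""
    n_areas = int(area.max()) + 1
    n_resp = np.bincount(area, weights=responded.astype(float), minlength=n_areas)
    total = np.bincount(area, weights=np.where(responded, y, 0.0), minlength=n_areas)
    return total / np.maximum(n_resp, 1.0)


# Toy data: 3 areas of 40 units with a binary outcome (invented values).
rng = np.random.default_rng(1)
area = np.repeat(np.arange(3), 40)
y = rng.integers(0, 2, size=area.size).astype(float)
mse = bootstrap_mse(y, area, lambda v: 0.3 + 0.5 * v, respondent_mean, n_boot=50)
```

The function returns the average squared deviation as written in the bootstrap formula of Step B3; taking the square root gives the quantity on the scale of the area means.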

Remark 4.1:

The bootstrap procedure outlined above is partly design-based in the sense that we consider a single pseudo population and the models are used only for estimating the response probabilities and the model holding for the completely observed data. The procedure can easily be extended in two ways. First, we may generate a new pseudo population for each bootstrap sample, thus accounting also for the variability induced by the random generation of the population values. Second, we may extend the procedure to the case where a sample is selected from the population and nonresponse occurs in the sample, by first obtaining complete sample observations as in Step B0 and then generating a pseudo population using the procedure of Sverchkov and Pfeffermann (Citation2004). Thereafter, a sample is drawn from the pseudo population with the same sampling design that was used for drawing the original sample. The other steps follow Steps B1-B3 above (with or without accounting for the generation of the pseudo population, i.e., by generating only one pseudo population or generating a new population each time).

5. Simulation study

In this section we describe the results of a simulation experiment when applying the procedures proposed in Sections 2, 3 and 4 (assuming no sampling and a single pseudo population).

The experiment consists of the following steps:

S1. Generation of population values: generate for each area $i$ , $i = 1, \dots, 300$ , and for each unit $j$ , $j = 1, \dots, 50$ , binary covariate values $x_{i j}$ with $\Pr (x_{i j} = 1) = \Pr (x_{i j} = 0) = 0.5$ , random effects $u_{i} = (u_{i, 1}, u_{i, 2})^{'} \sim N (0, I)$ , $i = 1, \dots, 300$ , and corresponding independent outcome values from the mixed logistic model, (9) $\begin{aligned} p_{y_{1}} (x_{i j}, u_{i}) & = \Pr (y_{i j, 1} = 1 | x_{i j}, u_{i}) = \frac{\exp (- 0.1 - x_{i j} + u_{i, 1})}{1 + \exp (- 0.1 - x_{i j} + u_{i, 1})}, \\ p_{y_{2}} (x_{i j}, u_{i}) & = \Pr (y_{i j, 2} = 1 | x_{i j}, u_{i}) = \frac{\exp (0.9 + u_{i, 2})}{1 + \exp (0.9 + u_{i, 2})} . \end{aligned}$

Remark 5.1:

The random effects are generated independently but they are not assumed to be independent in the estimation process.
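Step S1 can be sketched in Python with NumPy as follows. This is an illustrative reconstruction under the settings stated in S1, not the authors' code; the function name `generate_population` and the seed are hypothetical.

```python
import numpy as np


def generate_population(n_areas=300, n_units=50, seed=0):
    """Generate covariates, random effects and bivariate binary outcomes (S1)."""
    rng = np.random.default_rng(seed)
    # Binary covariate with Pr(x = 1) = Pr(x = 0) = 0.5.
    x = rng.integers(0, 2, size=(n_areas, n_units))
    # Independent bivariate standard normal random effects, u_i ~ N(0, I).
    u = rng.standard_normal(size=(n_areas, 2))
    # Linear predictors of the mixed logistic model for the two outcomes.
    eta1 = -0.1 - x + u[:, [0]]
    eta2 = 0.9 + u[:, [1]] + np.zeros(x.shape)
    # exp(eta) / (1 + exp(eta)) written in the equivalent form 1 / (1 + exp(-eta)).
    p1 = 1.0 / (1.0 + np.exp(-eta1))
    p2 = 1.0 / (1.0 + np.exp(-eta2))
    # Independent Bernoulli draws for each unit and component.
    y1 = (rng.random(x.shape) < p1).astype(int)
    y2 = (rng.random(x.shape) < p2).astype(int)
    y = np.stack([y1, y2], axis=-1)  # shape (n_areas, n_units, 2)
    return x, u, y


x, u, y = generate_population()
```

The arrays are indexed by area and unit, so area means of the true outcomes are obtained with `y.mean(axis=1)`.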

S2. Response mechanism: compute response probabilities for unit $j$ in area $i$ as: (10) $\begin{aligned} p_{r} (y_{i j}, x_{i j}; γ) & = \begin{cases} C (x_{i j}, y_{i j}) \exp (γ_{0} + γ_{1} x_{i j} + γ_{2} y_{i j, 1} + γ_{3} y_{i j, 2}), & \text{if } r = (1, 1)^{'} \\ C (x_{i j}, y_{i j}) \exp (γ_{4} + γ_{5} x_{i j} + γ_{6} y_{i j, 1} + γ_{7} y_{i j, 2}), & \text{if } r = (1, 0)^{'} \\ C (x_{i j}, y_{i j}) \exp (γ_{8} + γ_{9} x_{i j} + γ_{10} y_{i j, 1} + γ_{11} y_{i j, 2}), & \text{if } r = (0, 1)^{'} \\ C (x_{i j}, y_{i j}), & \text{if } r = (0, 0)^{'} \end{cases}; \\ C (x_{i j}, y_{i j}) & = [1 + \exp (γ_{0} + γ_{1} x_{i j} + γ_{2} y_{i j, 1} + γ_{3} y_{i j, 2}) + \exp (γ_{4} + γ_{5} x_{i j} + γ_{6} y_{i j, 1} + γ_{7} y_{i j, 2}) \\ & \quad + \exp (γ_{8} + γ_{9} x_{i j} + γ_{10} y_{i j, 1} + γ_{11} y_{i j, 2})]^{- 1}; \end{aligned}$ $γ_{0} = 0, γ_{1} = - 0.5, γ_{2} = 3, γ_{3} = - 3, γ_{4} = 0, γ_{5} = - 0.5, γ_{6} = 2, γ_{7} = - 2, γ_{8} = 0, γ_{9} = - 0.5, γ_{10} = 1, γ_{11} = - 1 .$ Clearly, the nonresponse is NMAR since the response probabilities depend on the outcomes. Notice that the response for $y_{i j, 1}, y_{i j, 2}$ is generated independently between units.

Remark 5.2:

We generated a single (finite) population and hence, a single set of response probabilities.

S3. Generating responses: generate responses from the (single) population generated in S1, with response probabilities defined in S2 (Equation (10)).
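Steps S2 and S3 can be sketched in Python as follows, using the coefficient values listed with Equation (10). This is an illustrative sketch rather than the authors' code; the names `GAMMA`, `PATTERNS`, `response_probs` and `draw_patterns` are hypothetical.

```python
import numpy as np

# Rows hold (intercept, x, y1, y2) coefficients for r = (1,1), (1,0), (0,1).
GAMMA = np.array([[0.0, -0.5, 3.0, -3.0],
                  [0.0, -0.5, 2.0, -2.0],
                  [0.0, -0.5, 1.0, -1.0]])
# Response patterns in the order used for the probability columns.
PATTERNS = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])


def response_probs(x, y1, y2, gamma=GAMMA):
    """Probabilities of the four response patterns under Equation (10)."""
    z = np.stack([np.ones(np.shape(x)), x, y1, y2], axis=-1)  # (..., 4)
    expo = np.exp(z @ gamma.T)                                # (..., 3)
    c = 1.0 / (1.0 + expo.sum(axis=-1, keepdims=True))        # C(x, y)
    return np.concatenate([c * expo, c], axis=-1)             # (..., 4)


def draw_patterns(probs, rng):
    """Draw one response pattern per unit by inverting the cumulative sums."""
    cum = probs.cumsum(axis=-1)
    v = rng.random(probs.shape[:-1] + (1,))
    idx = np.minimum((v > cum).sum(axis=-1), len(PATTERNS) - 1)
    return PATTERNS[idx]                                      # (..., 2)


# Two illustrative units: (x, y1, y2) = (0, 1, 0) and (1, 0, 1).
x = np.array([0, 1])
y1 = np.array([1, 0])
y2 = np.array([0, 1])
probs = response_probs(x, y1, y2)
patterns = draw_patterns(probs, np.random.default_rng(0))
```

Each row of `probs` sums to one by construction, and a row of `patterns` equal to `[1, 0]` means that the first component is observed and the second is missing.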

S4. Fitting respondents’ model: estimate ${\hat{p}}_{y_{1}} (x_{i j}, {\hat{u}}_{i}) = \widehat{\Pr} (y_{i j, 1} = 1 | x_{i j}, {\hat{u}}_{i}, R_{i j} = 1)$ , ${\hat{p}}_{y_{2}} (x_{i j}, {\hat{u}}_{i}) = \widehat{\Pr} (y_{i j, 2} = 1 | x_{i j}, {\hat{u}}_{i}, R_{i j} = 1)$ by fitting the mixed logistic model (9), using PROC NLMIXED in SAS with default options. Notice that the model (9) is not the true respondents’ model under the response model (10), because of the NMAR nonresponse.

S5. Estimation of response probabilities: assume the parametric response model (10), compute the expectations in (6) under the estimated models ${\hat{p}}_{y_{1}}^{} (x_{i j}, {\hat{u}}_{i})$ , ${\hat{p}}_{y_{2}}^{} (x_{i j}, {\hat{u}}_{i})$ in Step S4 and estimate $γ$ , using the procedure described in Section 2. See Sverchkov and Pfeffermann (Citation2018) for numerical details.

S6. Imputation of missing data: impute the unobserved data from the distribution of the missing data defined in Section 3, which in the present case reduces to: $f (y_{i j, β} | y_{i j, α}, x_{i j}, u_{i}, R_{i j} = r) = \frac{\{[\Pr (R_{i j, β} = 1_{β} | x_{i j}, u_{i}, R_{i j, α} = 1_{α}, y_{i j})]^{- 1} - 1\} f (y_{i j, β} | y_{i j, α}, x_{i j}, u_{i}, R_{i j} = 1)}{\int [\Pr (R_{i j, β} = 1_{β} | x_{i j}, u_{i}, R_{i j, α} = 1_{α}, y_{i j})]^{- 1} f (y_{i j, β} | y_{i j, α}, x_{i j}, u_{i}, R_{i j} = 1) \, d y_{i j, β} - 1} .$

Remark 5.3:

We imputed a single value for each missing value but one may impute several values, using a multiple imputation approach.

Repeat Steps S3–S6 independently 500 times.

Predictors considered: compute the following predictors for each area in each simulation.

1. ${\hat{\bar{Y}}}_{i, 1}^{ign} = N_{i}^{- 1} \{\sum_{j : R_{i j, 1} = 1} y_{i j, 1} + \sum_{j : R_{i j, 1} = 0} {\hat{p}}_{y_{1}} (x_{i j}, {\hat{u}}_{i})\}$ , ${\hat{\bar{Y}}}_{i, 2}^{ign} = N_{i}^{- 1} \{\sum_{j : R_{i j, 2} = 1} y_{i j, 2} + \sum_{j : R_{i j, 2} = 0} {\hat{p}}_{y_{2}} (x_{i j}, {\hat{u}}_{i})\}$ . The predictors ${\hat{\bar{Y}}}_{i, 1}^{ign}, {\hat{\bar{Y}}}_{i, 2}^{ign}$ ignore the response process and ‘assume’ that the population distribution holds also for the observed outcomes.

2. ${\hat{\bar{Y}}}_{i, 1}^{new} = N_{i}^{- 1} \sum_{j = 1}^{N_{i}} y_{i j, 1}^{imp}$ , ${\hat{\bar{Y}}}_{i, 2}^{new} = N_{i}^{- 1} \sum_{j = 1}^{N_{i}} y_{i j, 2}^{imp}$ , where $y_{i j, k}^{imp} = y_{i j, k}$ if $y_{i j, k}$ is observed, and $y_{i j, k}^{imp}$ is the imputed value from Step S6 if $y_{i j, k}$ is missing ( $k = 1, 2$ ).

The estimators ${\hat{\bar{Y}}}_{i, 1}^{new}, {\hat{\bar{Y}}}_{i, 2}^{new}$ are our proposed estimators, accounting for the multivariate NMAR nonresponse.
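The two predictors can be computed for a single area and a single outcome component as in the following Python sketch. The function names `predictor_ign` and `predictor_new` and the toy numbers are hypothetical; the fitted probabilities and imputed values are assumed to come from Steps S4 and S6.

```python
import numpy as np


def predictor_ign(y, responded, p_hat):
    """Observed outcomes where available, fitted respondents'-model
    probabilities for the missing ones, averaged over the area."""
    return np.where(responded, y, p_hat).mean()


def predictor_new(y, responded, y_imputed):
    """Observed outcomes where available, imputed values from Step S6
    for the missing ones, averaged over the area."""
    return np.where(responded, y, y_imputed).mean()


# Toy area with four units; the last two did not respond (invented values).
y = np.array([1.0, 0.0, 1.0, 0.0])          # entries for nonrespondents are ignored
responded = np.array([True, True, False, False])
p_hat = np.array([0.5, 0.5, 0.25, 0.75])    # fitted probabilities (hypothetical)
y_imputed = np.array([0.0, 0.0, 1.0, 1.0])  # imputed values (hypothetical)

y_bar_ign = predictor_ign(y, responded, p_hat)      # (1 + 0 + 0.25 + 0.75) / 4 = 0.5
y_bar_new = predictor_new(y, responded, y_imputed)  # (1 + 0 + 1 + 1) / 4 = 0.75
```

The only difference between the two functions is what replaces a missing value: a fitted probability from the respondents' model, or a draw from the estimated distribution of the missing data.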

Statistics considered for assessment of the predictors’ performance

Denote by ${\bar{Y}}_{i, k, r}$ the true mean of area $i$ on the r-th simulation (for the first or second coordinate, k = 1 or 2), and let ${\hat{\bar{Y}}}_{i, k, r}$ represent the first or second predictor defined above, $r = 1, \dots, 500$ . $\begin{aligned} Bias_{i, k} & = \frac{1}{500} \sum_{r = 1}^{500} ({\hat{\bar{Y}}}_{i, k, r} - {\bar{Y}}_{i, k, r}); \\ RPMSE_{i, k} & = \frac{1}{500} \sum_{r = 1}^{500} ({\hat{\bar{Y}}}_{i, k, r} - {\bar{Y}}_{i, k, r})^{2}; \\ RelBias_{i, k} & = \frac{Bias_{i, k}}{\sqrt{V_{i, k}}}; \\ V_{i, k} & = \frac{1}{500} \sum_{r = 1}^{500} \Big({\hat{\bar{Y}}}_{i, k, r} - \frac{1}{500} \sum_{r^{'} = 1}^{500} {\hat{\bar{Y}}}_{i, k, r^{'}}\Big)^{2}; \\ RelRPMSE_{i, k} & = \frac{\sqrt{RPMSE_{i, k}}}{\frac{1}{500} \sum_{r = 1}^{500} {\bar{Y}}_{i, k, r}} . \end{aligned}$ We calculated for each area the average (over the 500 simulations) of the number of complete responses and ordered the areas by these averages (the smallest mean number of complete responses is 2.3, the largest is 28.1).
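The performance statistics defined above can be computed from arrays of predictions and true means as in the following Python sketch. The function name `performance` and the toy numbers are hypothetical; the formulas follow the definitions in the text, including the convention that $RPMSE_{i, k}$ denotes the average squared error and the square root is taken only in $RelRPMSE_{i, k}$ .

```python
import numpy as np


def performance(pred, truth):
    """pred, truth: arrays of shape (n_sim, n_areas) for one component k."""
    err = pred - truth
    bias = err.mean(axis=0)
    # Average squared error, as RPMSE is defined in the text (no square root).
    rpmse = (err ** 2).mean(axis=0)
    # Variance of the predictor over simulations, with divisor n_sim.
    var = pred.var(axis=0)
    rel_bias = bias / np.sqrt(var)
    rel_rpmse = np.sqrt(rpmse) / truth.mean(axis=0)
    return {"Bias": bias, "RPMSE": rpmse, "V": var,
            "RelBias": rel_bias, "RelRPMSE": rel_rpmse}


# Toy example: two simulations, one area (invented values).
pred = np.array([[0.5], [0.7]])
truth = np.array([[0.4], [0.4]])
stats = performance(pred, truth)
```

With the toy values, the errors are 0.1 and 0.3, so the bias is 0.2, the average squared error is 0.05 and the variance of the predictor is 0.01, giving a relative bias of 2.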

S7. Estimation of the Root Prediction MSE (RPMSE): compute bootstrap estimates of RPMSE following the steps B0–B3 in Section 4.

In the following four figures we show the results for $RelBias_{i, k}$ and $RelRPMSE_{i, k}$ , $k = 1, 2$ , for each area, with the areas ordered as above, starting with the area with the smallest number of complete responses.

Figures 1 and 2 show that the proposed method reduces very significantly the bias due to NMAR nonresponse. As expected, the bias of both sets of predictors decreases as the number of complete responses increases, but our proposed predictors are seen to be much less biased.

Figure 1. $Rel B i a s_{i, 1}$ of ${\hat{\bar{Y}}}_{i, 1}^{i g n}$ (‘o’) and ${\hat{\bar{Y}}}_{i, 1}^{n e w}$ (‘+’).


Figure 2. $Rel B i a s_{i, 2}$ of ${\hat{\bar{Y}}}_{i, 2}^{i g n}$ (‘o’) and ${\hat{\bar{Y}}}_{i, 2}^{n e w}$ (‘+’).


The reduction in RelRPMSE from accounting for the NMAR nonresponse in Figure 3 is not big, which is explained by the fact that the bias of the predictors that ignore the nonresponse is not very high in this case. Notice in this respect that the average number of missing values of $y_{i j, 1}$ over the 500 simulations is 5531.5, compared to an average of 6014.6 missing values of $y_{i j, 2}$ . Nonetheless, when averaging the $R e l R P M S E_{i, 1}$ over all the areas we find that $\begin{aligned} Average[R e l R P M S E_{i, 1} ({\hat{\bar{Y}}}_{i, 1}^{i g n})] \\ = \frac{1}{300} \sum_{i = 1}^{300} R e l R P M S E_{i, 1} ({\hat{\bar{Y}}}_{i, 1}^{i g n}) = 0.51, \\ Average[R e l R P M S E_{i, 1} ({\hat{\bar{Y}}}_{i, 1}^{n e w})] = 0.44. \end{aligned}$ When estimating ${\bar{Y}}_{i, 2}$ in Figure 4, the reduction in the $R e l R P M S E$ by use of the proposed procedure is much more drastic, particularly in the areas with small numbers of complete responses, due to the large bias when ignoring the NMAR nonresponse. $\begin{aligned} Average[R e l R P M S E_{i, 2} ({\hat{\bar{Y}}}_{i, 2}^{i g n})] = 0.93, \\ Average[R e l R P M S E_{i, 2} ({\hat{\bar{Y}}}_{i, 2}^{n e w})] = 0.28. \end{aligned}$ Next, we study the sensitivity of the proposed approach to misspecification of the response model.
For this, we repeated the same simulation study, but computed the response probabilities as (11) $\begin{aligned} p_{r} (y_{i j}, x_{i j}, γ) & = \begin{cases} C (x_{i j}, y_{i j}) \exp [γ_{0} + γ_{1} x_{i j} (γ_{2} y_{i j, 1} + γ_{3} y_{i j, 2})], & \text{if } r = (1, 1)^{'} \\ C (x_{i j}, y_{i j}) \exp [γ_{4} + γ_{5} x_{i j} (γ_{6} y_{i j, 1} + γ_{7} y_{i j, 2})], & \text{if } r = (1, 0)^{'} \\ C (x_{i j}, y_{i j}) \exp [γ_{8} + γ_{9} x_{i j} (γ_{10} y_{i j, 1} + γ_{11} y_{i j, 2})], & \text{if } r = (0, 1)^{'} \\ C (x_{i j}, y_{i j}), & \text{if } r = (0, 0)^{'} \end{cases} \\ C (x_{i j}, y_{i j}) & = \{1 + \exp [γ_{0} + γ_{1} x_{i j} (γ_{2} y_{i j, 1} + γ_{3} y_{i j, 2})] \\ & \quad + \exp [γ_{4} + γ_{5} x_{i j} (γ_{6} y_{i j, 1} + γ_{7} y_{i j, 2})] \\ & \quad + \exp [γ_{8} + γ_{9} x_{i j} (γ_{10} y_{i j, 1} + γ_{11} y_{i j, 2})]\}^{- 1}, \end{aligned}$ with the same coefficients as in (10).
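
The functional form of Equation (11) can be sketched as follows. The function name and the coefficient values are hypothetical; in particular, the coefficients below are not those of Equation (10), which are not reproduced here. The sketch only shows that the four pattern probabilities share the normalising constant $C (x_{i j}, y_{i j})$ and therefore sum to one.

```python
import numpy as np

def response_probs(x, y1, y2, gamma):
    """Probabilities of the four response patterns under the form of Equation (11).

    gamma: sequence of 12 coefficients (gamma_0 .. gamma_11).
    Returns a dict keyed by the response pattern r = (r1, r2).
    """
    g = np.asarray(gamma, dtype=float)
    # Linear predictors for patterns (1,1), (1,0), (0,1); pattern (0,0) is the baseline.
    eta = np.array([
        g[0] + g[1] * x * (g[2] * y1 + g[3] * y2),
        g[4] + g[5] * x * (g[6] * y1 + g[7] * y2),
        g[8] + g[9] * x * (g[10] * y1 + g[11] * y2),
    ])
    c = 1.0 / (1.0 + np.exp(eta).sum())  # normalising constant C(x, y)
    p = c * np.exp(eta)
    return {(1, 1): p[0], (1, 0): p[1], (0, 1): p[2], (0, 0): c}

# Hypothetical coefficients, NOT the values used in the paper.
gamma_demo = [0.5, 1.0, 0.3, -0.2, -0.5, 1.0, 0.4, 0.1, -0.5, 1.0, -0.1, 0.6]
probs = response_probs(x=1.2, y1=1, y2=0, gamma=gamma_demo)
total = sum(probs.values())
```

Because the outcomes enter the linear predictors, the probability of each pattern depends on the possibly unobserved $y_{i j}$ , which is what makes the nonresponse NMAR.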

Figure 3. $R e l R P M S E_{i, 1}$ of ${\hat{\bar{Y}}}_{i, 1}^{i g n}$ (‘o’) and ${\hat{\bar{Y}}}_{i, 1}^{n e w}$ (‘+’).

Figure 4. $R e l R P M S E_{i, 2}$ of ${\hat{\bar{Y}}}_{i, 2}^{i g n}$ (‘o’) and ${\hat{\bar{Y}}}_{i, 2}^{n e w}$ (‘+’).

With these response probabilities, the number of complete responses in an area, averaged over the 500 simulations, lies in the range [9.3, 18.3].

When estimating the response model parameters in Step S5 of the simulation, we still use the model (10) as the working model, so that the model for the response is misspecified, and so is the model estimated for the missing data. (As mentioned before, the model estimated for the completely observed outcomes is also not correct.)

Table 2 compares the true response probabilities (Equation (11)) with the average of the estimated response probabilities over the 500 simulations under the misspecified response model (Equation (10)). Notice that, except in a few cases, the averages of the estimated response probabilities under the misspecified model are close to the true response probabilities. This already illustrates the limited sensitivity of our proposed approach to misspecification of the response model, although for any given sample the differences between the true response probabilities and their estimates are occasionally larger.

Table 2. True response probabilities, $p_{r, (i, j)}$ , and average of estimated response probabilities, $A {\hat{p}}_{r, (i, j)}$ under misspecified response model, for different response patterns $r_{i j}; (i = 0, 1, j = 0, 1)$ .


As seen in Figures 5 and 6, with the misspecified response model (and the misspecified model for the completely observed data), there are no big biases even when ignoring the NMAR nonresponse. Nonetheless, even in this case, when averaging over all the areas, $Average [| R e l B i a s ({\hat{\bar{Y}}}_{i, 1}^{i g n}) |] = 1.09$ , $Average [| R e l B i a s ({\hat{\bar{Y}}}_{i, 1}^{n e w}) |] = 0.50$ , $Average [| R e l B i a s ({\hat{\bar{Y}}}_{i, 2}^{i g n}) |] = 1.17$ , $Average [| R e l B i a s ({\hat{\bar{Y}}}_{i, 2}^{n e w}) |] = 0.47$ .

Figure 5. $Rel B i a s_{i, 1}$ of ${\hat{\bar{Y}}}_{i, 1}^{i g n}$ (‘o’) and ${\hat{\bar{Y}}}_{i, 1}^{n e w}$ (‘+’), response model misspecified.


Figure 6. $Rel B i a s_{i, 2}$ of ${\hat{\bar{Y}}}_{i, 2}^{i g n}$ (‘o’) and ${\hat{\bar{Y}}}_{i, 2}^{n e w}$ (‘+’), response model misspecified.


Next we compare the RelRPMSEs of the two estimators.

Figures 7 and 8 show a reduction in the RelRPMSEs when accounting for the NMAR nonresponse in the areas with a small number of complete responses. When averaging over all the areas, $\begin{aligned} Average[R e l R P M S E_{i, 1} ({\hat{\bar{Y}}}_{i, 1}^{i g n})] = 0.41, \\ Average[R e l R P M S E_{i, 1} ({\hat{\bar{Y}}}_{i, 1}^{n e w})] = 0.34; \\ Average[R e l R P M S E_{i, 2} ({\hat{\bar{Y}}}_{i, 2}^{i g n})] = 0.17, \\ Average[R e l R P M S E_{i, 2} ({\hat{\bar{Y}}}_{i, 2}^{n e w})] = 0.14. \end{aligned}$ We conclude that even under the misspecified models, our approach generally yields predictors with smaller RelRPMSEs than when ignoring the NMAR nonresponse (Figures 7 and 8). Clearly, the predictors obtained under this approach have larger variances than when ignoring the NMAR nonresponse, due to all the complex computations involved, so that the large differences in the bias do not always translate into correspondingly large differences in the RelRPMSEs.
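
To see why a large bias reduction need not produce an equally large reduction in RelRPMSE, it may help to write out the standard decomposition of the mean squared prediction error in the notation of this section. The symbols $e_{i, k, r}$ and ${\bar{e}}_{i, k}$ are introduced here for illustration only and do not appear elsewhere in the paper.

```latex
\begin{aligned}
e_{i,k,r} &= \hat{\bar{Y}}_{i,k,r} - \bar{Y}_{i,k,r}, \qquad
\bar{e}_{i,k} = \frac{1}{500}\sum_{r=1}^{500} e_{i,k,r} = Bias_{i,k}, \\
RPMSE_{i,k} &= \frac{1}{500}\sum_{r=1}^{500} e_{i,k,r}^{2}
 = Bias_{i,k}^{2} + \frac{1}{500}\sum_{r=1}^{500} \left(e_{i,k,r} - \bar{e}_{i,k}\right)^{2}.
\end{aligned}
```

The second term is the empirical variance of the prediction errors; it coincides with $V_{i, k}$ only when the true area mean is the same in every simulation. A predictor with a smaller bias but a larger error variance can therefore have an $R P M S E_{i, k}$ close to that of a more biased predictor.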

Figure 7. $R e l R P M S E_{i, 1}$ of ${\hat{\bar{Y}}}_{i, 1}^{i g n}$ (‘o’) and ${\hat{\bar{Y}}}_{i, 1}^{n e w}$ (‘+’), response model misspecified.

Figure 8. $R e l R P M S E_{i, 2}$ of ${\hat{\bar{Y}}}_{i, 2}^{i g n}$ (‘o’) and ${\hat{\bar{Y}}}_{i, 2}^{n e w}$ (‘+’), response model misspecified.

Finally, we report the results of RelRPMSE estimation. Due to time limitations, the results so far are based on only 100 parent samples and 50 bootstrap samples for each parent sample. Figures 9 and 10 compare the ‘true’ (empirical) RelRPMSEs over the 100 parent samples with the means of the corresponding bootstrap estimates.

Figure 9. $R e l R P M S E_{i, 1}$ of ${\hat{\bar{Y}}}_{i, 1}^{n e w}$ (‘+’), and bootstrap estimates (‘o’).

Figure 10. $R e l R P M S E_{i, 2}$ of ${\hat{\bar{Y}}}_{i, 2}^{n e w}$ (‘+’), and bootstrap estimates (‘o’).

The results in Figures 9 and 10 show good performance of the bootstrap estimators for most areas, and we believe that with more parent samples and bootstrap samples the results will look even better. Even with the current runs, when averaging over all the areas, $\begin{aligned} Average[R e l R P M S E_{i, 1}] = 0.38, \\ Average Bootstrap[R e l R P M S E_{i, 1}] = 0.35, \\ Average[R e l R P M S E_{i, 2}] = 0.41, \\ Average Bootstrap[R e l R P M S E_{i, 2}] = 0.41, \end{aligned}$ illustrating the approximate unbiasedness of the bootstrap estimators when averaging over all the areas.

We also compared the empirical RelRPMSEs with the bootstrap estimates for the case of the misspecified response model and obtained similar results. To save space, we do not show the corresponding figures.

6. Summary

In this paper we propose a general approach for multivariate SAE under NMAR nonresponse within the selected areas. The approach consists of fitting a model for the observed data and using this model for estimating a postulated multivariate response model by application of the missing information principle. Once the response model is estimated, we derive the model holding for the missing data, which is used for imputing the missing data, thus obtaining a complete file of sample data that is used for estimating the unknown small area parameters. A bootstrap procedure is proposed for estimating the root prediction mean squared errors of the small area predictors, which consists of generating a pseudo population with similar behaviour to the behaviour of the true underlying population, and selecting many samples from the pseudo population and many sets of responses for each sample.

A simulation study shows good performance of our approach in terms of point and RPMSE estimation. The simulation study also illustrates certain robustness to misspecification of the response model. The empirical study in this paper considers the case where the models that are fitted for the responding units and the response probabilities are logistic, but the theoretical derivations assume general models for the observed data and the response mechanism. Thus, we encourage researchers of SAE to apply the procedure to simulated and real data sets, with possibly different models assumed for the observed data and the response probabilities.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Notes on contributors

Danny Pfeffermann

Danny Pfeffermann is currently the National Statistician and General Director of the Central Bureau of Statistics of Israel. His main research areas are analytic inference from complex sample surveys and in particular, informative samples with non-ignorable nonresponse, small area estimation, seasonal adjustment and trend estimation and recently, accounting for mode effects and proxy surveys.

Michael Sverchkov

Michael Sverchkov is a Research Mathematical Statistician at the US Bureau of Labor Statistics. His main research areas are analytic inference from complex sample surveys and in particular, informative samples with non-ignorable nonresponse, small area estimation, seasonal adjustment.

Notes

* The opinions expressed in this paper are of the authors and do not necessarily represent the policies of the U.S. Bureau of Labor Statistics and the Israel Central Bureau of Statistics.

References

Ceppellini, R., Siniscalco, M., & Smith, C. A. B. (1955). The estimation of gene frequencies in a random mating population. Annals of Human Genetics, 20, 97–115. doi: 10.1111/j.1469-1809.1955.tb01360.x
Little, R. J. A. (1982). Models for nonresponse in sample surveys. Journal of the American Statistical Association, 77, 237–250. doi: 10.1080/01621459.1982.10477792
Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). New York: Wiley.
Orchard, T., & Woodbury, M. A. (1972). A missing information principle: Theory and application. Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability, 1, 697–715.
Pfeffermann, D., & Sikov, N. (2011). Imputation and estimation under nonignorable nonresponse in household surveys with missing covariate information. Journal of Official Statistics, 27, 181–209.
Pfeffermann, D., & Sverchkov, M. (2007). Small-area estimation under informative probability sampling of areas and within selected areas. Journal of the American Statistical Association, 102, 1427–1439. doi: 10.1198/016214507000001094
Riddles, K. M., Kim, J. K., & Im, J. (2016). A propensity-score adjustment method for nonignorable nonresponse. Journal of Survey Statistics and Methodology, 4, 215–245. doi: 10.1093/jssam/smv047
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–590. doi: 10.1093/biomet/63.3.581
Sverchkov, M. (2008). A new approach to estimation of response probabilities when missing data are not missing at random. Joint Statistical Meetings, Proceedings of the Section on Survey Research Methods (pp. 867–874).
Sverchkov, M., & Pfeffermann, D. (2004). Prediction of finite population totals based on the sample distribution. Survey Methodology, 30, 79–92.
Sverchkov, M., & Pfeffermann, D. (2018). Small area estimation under informative sampling and not missing at random nonresponse. Journal of the Royal Statistical Society, Series A, 181, 981–1008. doi: 10.1111/rssa.12362

Appendix. Illustration of the use of Equation (6) for Estimation of the response probabilities:

Mixed logistic model for the outcome variables with a single covariate. Consider bivariate variables $y_{i j} = (y_{i j, 1}, y_{i j, 2})$ , and suppose that the model fitted to the observed data of the respondents is the mixed generalised logistic model, (A1) $\begin{aligned} p_{y_{1}} (x_{i j}, u_{i}) & = Pr (y_{i j, 1} = 1 | x_{i j}, u_{i}, R_{i j} = 1) \\ & = \exp (α_{1} + β_{1} x_{i j} + u_{i, 1}) [1 + \exp (α_{1} + β_{1} x_{i j} + u_{i, 1})]^{- 1}, \\ p_{y_{2}} (x_{i j}, u_{i}) & = Pr (y_{i j, 2} = 1 | x_{i j}, u_{i}, R_{i j} = 1) \\ & = \exp (α_{2} + β_{2} x_{i j} + u_{i, 2}) [1 + \exp (α_{2} + β_{2} x_{i j} + u_{i, 2})]^{- 1}, \\ u_{i} & = (u_{i, 1}, u_{i, 2})^{'} \sim N (0, Σ) . \end{aligned}$

Suppose a generic response model, $p_{r} (y_{i j}, x_{i j}; γ) = Pr [R_{i j} = r | y_{i j}, x_{i j}; γ]$ .

We assume that $y_{i j, 1}$ and $y_{i j, 2}$ are independent given $x_{i j}, u_{i}, R_{i j} = 1$ , and that $Pr [R_{i j} = r | y_{i j}, x_{i j}, u_{i}; γ] = Pr [R_{i j} = r | y_{i j}, x_{i j}; γ]$ .

Then, for example, for $r = (0, 1)^{'}$ , the components of (6) can be written as, $\begin{aligned} \int \frac{\partial \log p_{r} (y_{i j}, x_{i j}; γ)}{\partial γ} \\ \times {[Pr (R_{i j, β} = 1_{β} | x_{i j}, u_{i}, R_{i j, α} = 1_{α}, y_{i j})]^{- 1} - 1} \\ \times f (y_{i j, β} | y_{i j, α}, x_{i j}, u_{i}, R_{i j} = 1) d y_{i j, β} \\ = \frac{\partial \log p_{r} [(1, y_{i j, 2})^{'}, x_{i j}; γ]}{\partial γ} \\ \times {[\frac{p_{r} [(1, y_{i j, 2})^{'}, x_{i j}; γ]}{\begin{matrix} p_{r} [(1, y_{i j, 2})^{'}, x_{i j}; γ] p_{y_{1}} (x_{i j}, u_{i}) \\ + p_{r} [(0, y_{i j, 2})^{'}, x_{i j}; γ] [1 - p_{y_{1}} (x_{i j}, u_{i})] \end{matrix}}]^{- 1} - 1} \\ \times p_{y_{1}} (x_{i j}, u_{i}) + \frac{\partial \log p_{r} [(0, y_{i j, 2})^{'}, x_{i j}; γ]}{\partial γ} \\ \times {[\frac{p_{r} [(0, y_{i j, 2})^{'}, x_{i j}; γ]}{\begin{matrix} p_{r} [(1, y_{i j, 2})^{'}, x_{i j}; γ] p_{y_{1}} (x_{i j}, u_{i}) \\ + p_{r} [(0, y_{i j, 2})^{'}, x_{i j}; γ] [1 - p_{y_{1}} (x_{i j}, u_{i})] \end{matrix}}]^{- 1} - 1} \\ \times [1 - p_{y_{1}} (x_{i j}, u_{i})], \\ \int {{[Pr (R_{i j, β} = 1_{β} | x_{i j}, u_{i}, R_{i j, α} = 1_{α}, y_{i j})]}^{- 1} - 1} \\ \times f (y_{i j, β} | y_{i j, α}, x_{i j}, u_{i}, R_{i j} = 1) d y_{i j, β} \\ = {[\frac{p_{r} [(1, y_{i j, 2})^{'}, x_{i j}; γ]}{\begin{matrix} p_{r} [(1, y_{i j, 2})^{'}, x_{i j}; γ] p_{y_{1}} (x_{i j}, u_{i}) \\ + p_{r} [(0, y_{i j, 2})^{'}, x_{i j}; γ] [1 - p_{y_{1}} (x_{i j}, u_{i})] \end{matrix}}]^{- 1} - 1} \\ \times p_{y_{1}} (x_{i j}, u_{i}) \\ + {[\frac{p_{r} [(0, y_{i j, 2})^{'}, x_{i j}; γ]}{\begin{matrix} p_{r} [(1, y_{i j, 2})^{'}, x_{i j}; γ] p_{y_{1}} (x_{i j}, u_{i}) \\ + p_{r} [(0, y_{i j, 2})^{'}, x_{i j}; γ] [1 - p_{y_{1}} (x_{i j}, u_{i})] \end{matrix}}]^{- 1} - 1} \\ \times [1 - p_{y_{1}} (x_{i j}, u_{i})] \end{aligned}$

Similar expressions are obtained for $r = (1, 0)^{'}$ and $r = (0, 0)^{'}$ .
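
The outcome model (A1) used in these expressions can be sketched numerically as follows. All parameter values, the covariate value and the function name are hypothetical and chosen only for illustration; the sketch evaluates $p_{y_{1}}$ and $p_{y_{2}}$ for one unit and uses the conditional independence assumption stated above to form a joint probability.

```python
import numpy as np

def outcome_probs(x, u, alpha, beta):
    """Pr(y_k = 1 | x, u, R = 1) for k = 1, 2 under the logistic form of (A1)."""
    eta = (np.asarray(alpha, dtype=float)
           + np.asarray(beta, dtype=float) * x
           + np.asarray(u, dtype=float))
    return 1.0 / (1.0 + np.exp(-eta))  # equals exp(eta) / (1 + exp(eta))

rng = np.random.default_rng(0)

# Hypothetical parameter values, not taken from the paper.
alpha = np.array([-0.5, 0.2])
beta = np.array([1.0, -0.8])
sigma = np.array([[0.30, 0.10],
                  [0.10, 0.20]])

# Bivariate area random effect u_i ~ N(0, Sigma).
u_i = rng.multivariate_normal(mean=np.zeros(2), cov=sigma)
p_y1, p_y2 = outcome_probs(x=0.7, u=u_i, alpha=alpha, beta=beta)

# Given (x, u, R = 1) the two outcomes are assumed independent, so the
# joint probability of (y1, y2) = (1, 0) is the product of the marginals.
p_10 = p_y1 * (1.0 - p_y2)
```

For $r = (0, 1)^{'}$ the missing coordinate is $y_{i j, 1}$ , so $p_{y_{1}}$ and $1 - p_{y_{1}}$ are the weights that multiply the two terms in the sums above.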
