ABSTRACT
We consider general statistical models defined by moment equations when data are missing at random. Using inverse probability weighting, such a model is shown to be equivalent to a model for the observed variables only, augmented by a moment condition defined by the missing mechanism. Our framework covers a large class of parametric and semiparametric models where we allow for missing responses, missing covariates and any combination of them. The equivalence result is stated under minimal technical conditions and sheds new light on various aspects of interest in the missing data literature, such as the efficiency bounds, the construction of efficient estimators, the restricted estimators and imputation.
1. Introduction
Models defined by moment and conditional moment equations are widely used in statistics, biostatistics and econometrics; see, for instance, Ai and Chen (2003, 2012), Domínguez and Lobato (2004), and the references therein. Here, we investigate general moment or conditional moment equation models with missing data. The main idea we propose is that, under a missing at random assumption, the initial model with missing data is equivalent to an inverse probability weighting moment equations model for the complete observations, augmented by a moment condition defined by the missing mechanism. The equivalence, a generalisation of the GMM equivalence result of Graham (2011), is stated in terms of sets of probability measures. It has numerous implications and provides valuable insight, for instance on efficiency bound calculations and the construction of efficient estimators.
In the framework of missing data, the assumption of missing at random (MAR) is presumably the most used when trying to describe an ignorable missingness mechanism. However, this concept, first introduced by Rubin (1976), does not have the same meaning for everyone. For simplicity, let the full observations be i.i.d. replications of a vector L and let R be a random vector whose components take the value 1 if we observe the corresponding component of L and 0 otherwise. For Rubin (1976) (see also, for example, Little & Rubin, 2002; Robins & Gill, 1997), MAR means that missingness depends only on the observed components of L:
(1)
This concept was generalised to coarsening at random (CAR) by Heitjan and Rubin (1991) (see also, for example, van der Laan and Robins (2003)):
the conditional law of the censoring variable C given the full data L is the same as its conditional law given an always observable transformation
of the full data L and the censoring variable C. In the context of regression-like models, the MAR assumption is usually stated in a different and more restrictive way. A strongly ignorable selection mechanism (also called conditional independence, or selection on observables, etc.) means that, assuming some components of L are always observed,
(2)
This assumption was originally introduced by Rosenbaum and Rubin (1983) in the framework of randomised clinical trials, which corresponds in our simple example, with L = (Y, Z, X), to the case where, for example, X is always observed and one and only one of Y and Z is observed. This means that the selection vector R takes the form R = (D, 1 − D, 1), where Y is observed iff D = 1 and Z is observed iff D = 0. In this situation, MAR means
or, equivalently,
(3)
Meanwhile, a strongly ignorable missingness mechanism is written as
or, equivalently,
(4)
Clearly, condition (4) implies condition (3), but the reverse is not true in general. In the present work we consider the case of i.i.d. replications of a vector containing missing components for which the same subvector is missing for the incomplete replicates. In this case the MAR assumption (1) and the strongly ignorable MAR assumption (2) coincide (and are equivalent to CAR), as is also the case, for example, in Cheng (1994), Tsiatis (2007) and Graham (2011), among others.
Other MAR-related assumptions appear in the literature. For instance, when the response Y is missing, while X and Z are observed, Wei, Ma, and Carroll (2012) consider an assumption that is stronger than the MAR assumption (2) commonly used for regression models. Another assumption on the missingness mechanism is introduced in Wooldridge (2007):
and
is a random variable such that W and Z are observed whenever S = 1, and
Wooldridge's assumption is more general than the MAR condition (2), where Z is supposed to be always observed. Indeed, Wooldridge (2007) does not suppose that W and/or Z are missing when S = 0.
The paper is organised as follows. The main equivalence result is stated in Section 2. In Section 3, we revisit some examples considered in the literature in the MAR setup: estimating mean functionals in parametric and nonparametric regressions, and quantile regression with missing responses and/or covariates. For these examples, our equivalence result suggests new ways of calculating efficiency bounds and constructing efficient estimators, using for instance GMM, empirical likelihood approaches, the SMD approach of Ai and Chen (2007), or the kernel-based method of Lavergne and Patilea (2013). In Section 4 we reinterpret some classes of so-called restricted estimators; see, for instance, Tsiatis (2007) and Tan (2011). Finally, in Section 5 we use our general result to discuss the common belief that (multiple) imputation is necessary in order to capture all the information from the partially observed data.
2. Equivalent moment model
The following statement is a version of Theorems 1 and 2 in Hristache and Patilea (2017). The proof is very similar and hence is omitted. In the following, vectors are column matrices and, for any matrix A, A⊤ denotes its transpose.
Theorem 2.1
Let and
be two models defined for random vectors
as follows:
(5)
and
(6)
where
is an unknown (possibly infinite dimensional) parameter,
, for
, is a collection of known measurable functions, and π is an unknown measurable function such that
almost surely.
The models and
are equivalent if restricted to the laws of
; more precisely,
,
such that
and
have the same distribution.
Remarks
The parameter γ in model
could include parameters of interest and nuisance parameters.
The function π,
usually called the propensity score, could be considered completely unknown and modelled nonparametrically, or modelled parametrically. With an estimate of π at hand,
obtained from the second equation in the model
, one could use existing moment equation approaches for the estimation of the parameters in the first equation of
. See our Example 3.1.
The link between this theorem and models where data are missing at random is made by considering that the vector U is observed if and only if D = 1. The theorem then basically says that, at the observational level, that is, for the laws of the observed vector
, the two models
and
are equivalent. As a consequence, inference for the law of
in the model
, a moment conditions model under an assumption of data missing at random, could be done based on the model
, which is defined using only the observed part
of the vector
. In particular, efficiency bound calculations and efficient estimator constructions could be done in the model
, which in many cases could be much easier.
The underlying condition ‘DU is always observed’ includes the usual case
but it is more general. When D = 1, one observes the value of U. Meanwhile, when D = 0,
U could be observed or not, since whatever the value of U is, DU = 0.
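To make this remark concrete, here is a small self-contained Monte Carlo check (our own illustrative design, not part of the original development): under MAR, the IPW moment built from the observed variables (D, DU, W) reproduces the full-data moment, which is the essence of the equivalence at the observational level.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Illustrative design: W is always observed, U is observed only when D = 1,
# and D is MAR given W with a known propensity score pi(W) = P(D = 1 | W).
W = rng.normal(size=n)
U = W + rng.normal(size=n)

def pi(w):
    return 1.0 / (1.0 + np.exp(-(0.5 + w)))

D = rng.binomial(1, pi(W))

g = U * W                            # an arbitrary moment function g(U, W)

full_data_moment = g.mean()          # infeasible: uses every U
ipw_moment = (D / pi(W) * g).mean()  # feasible: U enters only through D*U

# Under MAR, E[D / pi(W) * g(U, W)] = E[g(U, W)], so both averages agree.
print(full_data_moment, ipw_moment)
```

Here E[UW] = 1 by construction; the IPW average uses U only on the complete cases (D = 1), yet targets the same moment.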
3. Some examples revisited
In this section we present two examples of models already studied in the literature for which our approach gives new insights and sometimes allows for simpler methods of obtaining efficiency bounds and asymptotically efficient estimators. The guiding principle is to use Theorem 2.1 to put the model of interest, in the presence of a MAR mechanism, in an equivalent form
(7)
where the two sets of equations are orthogonal, meaning that
The equivalent model (7) has a sequential moment structure that allows one to compute the efficiency bound; see Ai and Chen (2012). Moreover, the finite dimensional parameter of interest θ can be efficiently estimated from the first equations, with the (possibly infinite dimensional) nuisance parameter α known or suitably estimated from the last equations. A similar statement on the efficient estimation of θ, in the particular case of a finite dimensional α and without conditioning on X and (X, Z), can be found in Theorem 2.2, point 8, of Prokhorov and Schmidt (2009).
3.1. Mean functionals with data missing at random
Consider the problem of estimating the mean of functionals of the variables in a parametric regression model with missing responses:
(8)
The parameter of interest here is
, where
is some given square-integrable function; see Müller (2009). Hristache and Patilea (2017) considered the same framework and focused on the case where
does not depend on X. Here we investigate the general case where
it could also depend on X. Some usual examples are the mean of the response variable (
), the second-order moment of the response (
), the cross-product of the response and the covariate vector (
). (Here,
is the vectorisation operator that transforms a matrix into a column vector by stacking its columns.) For simplicity, we take Y real-valued in the remainder of this section.
The regression function has a known (parametric) form, X is always observed, Y is observed only when D = 1, and a MAR assumption holds:
. With
, the model can be written, at the observational level, in the following equivalent form:
(9)
The last two equations being orthogonal, since
it is also equivalent to the model defined by the following system of orthogonal equations, where
stands for the conditional variance
:
(10)
Solving for θ, we get
where
and
Let
be an estimator of α obtained in the model. With the variance
and the functions
and
estimated nonparametrically, the plug-in estimator
would be efficient. Since the first equation in system (10) is orthogonalised with respect to the last one, for the propensity score
, one could use a parametric model without affecting the efficiency bound.
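As a sketch of this last point (an illustrative simulation of our own, with h(x, y) = y and a correctly specified logistic model for the propensity score), the plug-in IPW estimator of θ = E[Y] can be computed as follows; the nonparametric ingredients of the fully efficient estimator are deliberately left out.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Illustrative parametric regression with missing responses:
# Y = 1 + 2X + eps, X always observed, Y observed iff D = 1 (MAR given X).
X = rng.normal(size=n)
Y = 1.0 + 2.0 * X + rng.normal(size=n)
true_pi = 1.0 / (1.0 + np.exp(-(1.0 + X)))   # P(D = 1 | X), logistic
D = rng.binomial(1, true_pi)

# Fit the propensity score by logistic regression (Newton iterations).
Z = np.column_stack([np.ones(n), X])
beta = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-Z @ beta))
    beta += np.linalg.solve((Z * (p * (1 - p))[:, None]).T @ Z, Z.T @ (D - p))
pi_hat = 1.0 / (1.0 + np.exp(-Z @ beta))

# IPW plug-in estimator of theta = E[h(X, Y)] with h(x, y) = y.
theta_ipw = np.mean(D / pi_hat * Y)
print(theta_ipw)   # close to E[Y] = 1
```

The incomplete observations still contribute: they enter the logistic fit of the propensity score through the pairs (D, X).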
3.2. Quantile regression with data missing at random
A particular setting of quantile regression with missing data at random is considered in Wei et al. (2012). For τ ∈ (0, 1), the conditional quantile
of the always observed response Y given the regressor vectors Z (always observed) and X (observed iff D = 1) is assumed to be linear,
(11)
and the missingness mechanism is defined by the strong missing at random condition
(12)
Taking in (6) U = X, V = Y, W = Z,
,
, where the family of functions
spans
, the model defined by (11) and (12) can be written in the following equivalent form:
(13)
The two sets of equations being already orthogonal (with respect to the σ-field
), in this situation we can efficiently estimate the parameter
from the complete data only, that is, from the model defined by (11), keeping for the statistical analysis only the observations for which all the components of the vector
are observed. The gain in efficiency observed in the simulation experiment of Wei et al. (2012) for their multiple-imputation-improved estimator comes, in our opinion, from the supplementary parametric assumption on the form of the conditional density of X given Z (see their Assumption 4).
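For concreteness, the following hypothetical simulation (our own design, not the one of Wei et al., 2012) estimates the quantile coefficients from the complete cases through an inverse-probability-weighted check-loss, with a selection probability depending only on the always observed (Y, Z) as in (12).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, tau = 20_000, 0.5

# Illustrative design: at tau = 0.5 the conditional quantile of Y
# given (X, Z) is 1 + 2Z + 3X (the errors have median zero).
Z = rng.normal(size=n)
X = rng.normal(size=n)
Y = 1.0 + 2.0 * Z + 3.0 * X + rng.normal(size=n)

# X is missing when D = 0; selection depends only on the observed (Y, Z),
# as in condition (12), with a propensity bounded away from zero.
pi = 0.3 + 0.6 / (1.0 + np.exp(-(Y - Z)))
D = rng.binomial(1, pi)

# In the simulation X exists for every unit, but only rows with D = 1
# (positive weight) ever enter the loss below.
design = np.column_stack([np.ones(n), Z, X])
w = D / pi

def wcheck(theta):
    # IPW check-loss: rho_tau(r) = r * (tau - 1{r < 0})
    r = Y - design @ theta
    return np.sum(w * r * (tau - (r < 0)))

# Weighted least squares gives a good starting value; then we minimise
# the (nonsmooth) weighted check-loss directly.
theta0 = np.linalg.solve(design.T @ (design * w[:, None]),
                         design.T @ (w * Y))
res = minimize(wcheck, theta0, method="Nelder-Mead",
               options={"maxiter": 4000, "xatol": 1e-6, "fatol": 1e-8})
print(res.x)   # close to (1, 2, 3)
```

Only the complete cases receive positive weight, so the fit uses no substitute values for the missing X.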
A more general linear quantile regression model defined by (11) with missing data at random is considered in Chen, Wan, and Zhou (2014). With their notation, we have
(14)
for the full data model. They also denote by X the always observed components of the vector
and by
the components of the same vector that are observed iff the binary variable D takes the value 1 and use the ‘standard’ missing at random assumption
. This fits our framework by taking U = X, V = 1,
and
where the family of functions
spans
. The equivalent moment equations model, at the observational level, can be written as
(15)
The information bound for this model is given in Hristache and Patilea (2016). It cannot be calculated explicitly, except in some special cases, which include missing responses as before or the case where X and/or Z are discrete. It is different from the information bound given in Chen, Hong, and Tarozzi (2008), which corresponds to a model defined by an unconditional quantile moment and a MAR assumption and could be represented equivalently in the form
(16)
Models (15) and (16) are quite different, and so are the corresponding efficiency bounds, so that no estimation procedure given in Chen et al. (2014) could be efficient in their linear quantile regression model (14) with missing data at random.
4. Restricted estimators for quantile regressions and general conditional moment models with data missing at random
The model defined by the regression-like equation
and the MAR selection mechanism
is equivalent, at the observational level, to the following model defined by conditional moment equations:
This framework includes many situations. For instance, taking
we obtain the case in which some regressors (conditioning variables) X are missing, while with
we cover the case of missing responses. Splitting Y into an observed subvector
and a not always observed subvector
, with
this corresponds to the case where both some responses and some covariates are missing. In all these examples, U is the vector of not always observed components of the data vector.
For the model
denoting by
the true law of
, the tangent space is
For the model
the tangent space is
The tangent space
of
is (see Hristache & Patilea, 2016)
We obtain the efficient score
by projecting the score
on
,
which gives the following solution:
where
Remark
is also the efficient score in the model
or in the model defined by the moment condition
As shown in Hristache and Patilea (2016),
satisfies an equation of the form
with T a contraction operator. The solution of this equation is unique, but in order to obtain it one needs to use nonparametric estimators at each step of the iterative procedure. An alternative approach would be to consider finite dimensional subspaces
and
when calculating the ‘efficient score’, leading to an approximately efficient score. We obtain in this way what is known in the literature as restricted estimators. We can write:
finite dimensional
s.t.
Compare to
Similarly for
:
finite dimensional
s.t.
An optimal class 1 restricted estimator (see Tan, 2011; Tsiatis, 2007) is the solution of the approximate efficient score equation
where
and
are given by
In fact,
is the efficient score in the following moment equations model:
This allows for a new, simple and intuitive interpretation of the optimal class 1 restricted estimators as efficient estimators in a larger model, obtained from the initial one by using appropriate ‘instruments’ to transform the conditional moment equations into a (growing) number of unconditional moment conditions. Another advantage of this new perspective is access to the most commonly used methods of obtaining efficient estimators in moment equations models, such as GMM, SMD (see Lavergne & Patilea, 2013) or empirical likelihood estimators.
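The following toy example (our own, with a conditional mean model, responses missing at random, a known propensity score, and the instrument space spanned by (1, X, X²)) sketches this interpretation: the restricted estimator is just a two-step GMM estimator in the unconditional moment model generated by the instruments.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30_000

# Hypothetical mean-regression model E[Y - theta1 - theta2*X | X] = 0,
# with Y missing at random given the always observed X.
X = rng.normal(size=n)
Y = 1.0 + 2.0 * X + rng.normal(size=n)
pi = 1.0 / (1.0 + np.exp(-(1.0 + 0.5 * X)))   # propensity treated as known
D = rng.binomial(1, pi)

A = np.column_stack([np.ones(n), X, X**2])    # finite-dimensional instruments
Xmat = np.column_stack([np.ones(n), X])
w = D / pi

# Sample moments g(theta) = A' diag(w) (Y - Xmat theta) / n, linear in theta.
G = A.T @ (Xmat * w[:, None]) / n
h = A.T @ (w * Y) / n

# Step 1: identity weighting matrix.
theta1 = np.linalg.solve(G.T @ G, G.T @ h)

# Step 2: optimal weighting from the estimated covariance of the moments.
u = (w * (Y - Xmat @ theta1))[:, None] * A
S = u.T @ u / n
W = np.linalg.inv(S)
theta2 = np.linalg.solve(G.T @ W @ G, G.T @ W @ h)
print(theta2)   # close to (1, 2)
```

Enlarging the instrument space brings such a GMM estimator closer to the efficient one, which is the idea behind the approximately efficient restricted estimators.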
Similar procedures can be used for class 2 restricted estimators, based on
and class 3 restricted estimators (Tan, 2011), based on
4.1. Simulation study
The approach to restricted estimators is illustrated in a setting already considered by Chen, Wan, and Zhou (2015); see their Example 1, scenario . With the notation of the previous section, the data are generated from the following model:
(17)
where
and
follows a centred bivariate normal distribution with unit variances and correlation equal to 0.5. The parameter of interest here is the vector coefficient
of the conditional quantile of Y given X and V:
(18)
with
,
,
, where
is the τth quantile of ϵ,
. Herein, we only report the case
. The variables Y and V are always observed, while X is observed if and only if D = 1, where D is a Bernoulli random variable such that
, with
. The model for the fully observed data is defined by the regression-like equation
where
. Under the MAR selection mechanism
it is equivalent, at the observational level, to the following model defined by conditional moment equations:
The restricted estimators considered are obtained by the generalised method of moments in the following models
,
, which contain the model
:
where:
,
,
,
,
,
(no missing data, 1, X and V as instruments);
,
,
,
,
,
(no missing data,
,
and
as instruments);
,
,
,
,
(true propensity score,
,
and
as instruments);
, with
,
and
estimated from a logistic regression,
,
,
,
(propensity score estimated by a logistic regression on Y and V,
,
and
as instruments for the IPW quantile equation);
,
,
,
,
,
,
,
,
, (
,
and
as instruments for the IPW quantile equation, 1, Y and V as instruments for the propensity score equation);
,
,
,
,
,
,
,
,
, (
,
and
as instruments for the IPW quantile equation, 1, Y −V and
as instruments for the propensity score equation).
The estimates of the MSE obtained from 1000 replications, with sample size
, are given in Table 1.
Table 1. Estimates of the MSE over 1000 replicates.
Note that none of the GMM estimators in models could be efficient in the initial model
, but only approximately efficient, if the instruments are suitably chosen, which could be a delicate point in practice. Here we observe that the instruments
, involved in the first equations of the model
perform better than the instruments
used in
. We observe a similar phenomenon for the propensity score equations when looking at the columns
,
and
. The case in
corresponds to common practice when one trusts the logistic regression for the propensity score. The cases in
and
correspond to our approach based on instruments, with more effective instruments in the latter case. The non-orthogonality of the quantile model equations and the propensity score equations could explain the better results in
. A joint estimation of the two sets of equations with effective instruments could improve over the common practice. Next, let us notice that the models
and
are similar: we use the same instruments for the first equation in
. Moreover, in
we use the true propensity score, while in
we use an estimated propensity score obtained from a model that is somehow close to the true propensity score. As the two equations in the model
are not orthogonal, estimating the propensity score could improve the asymptotic variance of the estimators of
. This is related to the so-called puzzling phenomenon noticed by Prokhorov and Schmidt (2009). Here, even if the propensity score model is slightly wrong, there is still a gain in MSE. Let us also note the surprisingly good results for the model
in which
is approximated by a quadratic function of Y −V. Using the same instruments for the conditional quantile equations, the estimation with missing data is even better than in the case with full data (compare results for model
to those for model
). This could be explained by the fact that in model
we do not use the optimal instruments, which should be proportional to the conditional density of the error term at the origin. The weighting introduced by the propensity score seems, in some sense, to compensate for the non-optimal instruments. This suggests further possible improvements based on other choices of instrumental variables in order to approach efficiency.
5. Is imputation really informative?
Multiple imputation is a widely used method to generate substitute values when data are missing. However, under the MAR assumption, the value of multiple imputation in the context of conditional moment restriction models is at least questionable, as discussed in the following.
Suppose that W is always observed and consider the MAR assumption
(19)
Then, any substitute observation generated from the law of
is adequate to replace a missing U, where the law of
should be such that
(Here,
denotes the conditional law of
given
.) Since, in general, the law
is unknown, one can estimate it, parametrically or nonparametrically, and generate substitute observations from this estimate. This is so-called parametric or nonparametric imputation. See, for instance, Wang and Chen (2009), Wei et al. (2012), and Chen and Van Keilegom (2013) for some nonparametric imputation applications.
The equivalence established by Theorem 2.1 for models defined by moment restrictions implies that all the information on the parameter θ in the initial model under the MAR assumption (19) is contained in the model defined by the equations (6). Let us point out that the last equation of the model (6) includes the information contained in the incomplete observations. Indeed, to estimate the propensity score π, parametrically or nonparametrically, one uses all the observations of W. This remark opens new perspectives for defining estimators of θ without using substitute observations. Moreover, it sheds new light on a common justification used in the literature, namely that imputation is necessary in order to capture the information contained in the partially observed data.
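A minimal illustration of this remark (an assumed logistic propensity model of our own): the propensity score is identified and fitted from all the observations of W, and the incomplete cases (D = 0) are indispensable for this fit.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000

# Hypothetical setup: W is always observed and P(D = 1 | W) follows a
# logistic model with coefficients (0.5, 1.0).
W = rng.normal(size=n)
pi = 1.0 / (1.0 + np.exp(-(0.5 + 1.0 * W)))
D = rng.binomial(1, pi)

# The second equation of model (6) is fitted from every observation of W;
# without the incomplete cases (D = 0) the logistic likelihood would be
# degenerate and the propensity score unidentified.
Z = np.column_stack([np.ones(n), W])
beta = np.zeros(2)
for _ in range(25):                  # Newton iterations for the logistic MLE
    p = 1.0 / (1.0 + np.exp(-Z @ beta))
    beta += np.linalg.solve((Z * (p * (1 - p))[:, None]).T @ Z, Z.T @ (D - p))
print(beta)   # close to (0.5, 1.0)
```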
6. Conclusions
We consider a statistical model defined by an arbitrary number of moment equations. Our framework includes a large class of models defined through conditional and/or unconditional moments. Next, we assume that some variables are missing at random. In this setup, we present a model equivalence result: the initial statistical model together with the MAR mechanism is equivalent to a moment equations model. Using the equivalent model could greatly simplify estimation and inference in missing data problems. We discuss several consequences for widely used models, including quantile regressions.
Disclosure statement
No potential conflict of interest was reported by the authors.
Additional information
Notes on contributors
Marian Hristache
Marian Hristache is Associate Professor, Ecole Nationale de la Statistique et de l'Analyse de l'Information (Ensai), Rennes, France (E-mail: [email protected]).
Valentin Patilea
Valentin Patilea is Professor, Ecole Nationale de la Statistique et de l'Analyse de l'Information (Ensai), and Center for Research in Economics and Statistics (CREST), Rennes, France (E-mail: [email protected]).
References
- Ai, C., & Chen, X. (2003). Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica, 71, 1795–1843. doi: 10.1111/1468-0262.00470
- Ai, C., & Chen, X. (2007). Estimation of possibly misspecified semiparametric conditional moment restriction models with different conditioning variables. Journal of Econometrics, 141, 5–43. doi: 10.1016/j.jeconom.2007.01.013
- Ai, C., & Chen, X. (2012). The semiparametric efficiency bound for models of sequential moment restrictions containing unknown functions. Journal of Econometrics, 170, 442–457. Thirtieth Anniversary of Generalized Method of Moments. doi: 10.1016/j.jeconom.2012.05.015
- Chen, X., Hong, H., & Tarozzi, A. (2008). Semiparametric efficiency in GMM models with auxiliary data. The Annals of Statistics, 36, 808–843. doi: 10.1214/009053607000000947
- Chen, S. X., & Van Keilegom, I. (2013). Estimation in semiparametric models with missing data. Annals of the Institute of Statistical Mathematics, 65, 785–805. doi: 10.1007/s10463-012-0393-6
- Chen, X., Wan, A. T. K., & Zhou, Y. (2014). Efficient quantile regression analysis with missing observations. Journal of the American Statistical Association, 110(510), 723–741. doi: 10.1080/01621459.2014.928219
- Chen, X., Wan, A. T. K., & Zhou, Y. (2015). Efficient quantile regression analysis with missing observations. Journal of the American Statistical Association, 110, 723–741. doi: 10.1080/01621459.2014.928219
- Cheng, P. E. (1994). Nonparametric estimation of mean functionals with data missing at random. Journal of the American Statistical Association, 89, 81–87. doi: 10.1080/01621459.1994.10476448
- Domínguez, M. A., & Lobato, I. N. (2004). Consistent estimation of models defined by conditional moment restrictions. Econometrica, 72, 1601–1615. doi: 10.1111/j.1468-0262.2004.00545.x
- Graham, B. S. (2011). Efficiency bounds for missing data models with semiparametric restrictions. Econometrica, 79, 437–452. doi: 10.3982/ECTA7379
- Heitjan, D. F., & Rubin, D. B. (1991). Ignorability and coarse data. The Annals of Statistics, 19, 2244–2253. doi: 10.1214/aos/1176348396
- Hristache, M., & Patilea, V. (2016). Semiparametric efficiency bounds for conditional moment restriction models with different conditioning variables. Econometric Theory, 32, 917–946. doi: 10.1017/S0266466615000080
- Hristache, M., & Patilea, V. (2017). Conditional moment models with data missing at random. Biometrika, 104, 735–742. doi: 10.1093/biomet/asx025
- Lavergne, P., & Patilea, V. (2013). Smooth minimum distance estimation and testing with conditional estimating equations: uniform in bandwidth theory. Journal of Econometrics, 177, 47–59. doi: 10.1016/j.jeconom.2013.05.006
- Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). Hoboken, NJ: John Wiley & Sons.
- Müller, U. U. (2009). Estimating linear functionals in nonlinear regression with responses missing at random. The Annals of Statistics, 37, 2245–2277. doi: 10.1214/08-AOS642
- Prokhorov, A., & Schmidt, P. (2009). GMM redundancy results for general missing data problems. Journal of Econometrics, 151, 47–55. doi: 10.1016/j.jeconom.2009.03.010
- Robins, J. M., & Gill, R. D. (1997). Non-response models for the analysis of non-monotone ignorable missing data. Statistics in Medicine, 16, 39–56. doi: 10.1002/(SICI)1097-0258(19970115)16:1<39::AID-SIM535>3.0.CO;2-D
- Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55. doi: 10.1093/biomet/70.1.41
- Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592. doi: 10.1093/biomet/63.3.581
- Tan, Z. (2011). Efficient restricted estimators for conditional mean models with missing data. Biometrika, 98, 663–684. doi: 10.1093/biomet/asr007
- Tsiatis, A. (2007). Semiparametric theory and missing data. New York: Springer-Verlag.
- van der Laan, M. J., & Robins, J. M. (2003). Unified methods for censored longitudinal data and causality. New York: Springer-Verlag.
- Wang, D., & Chen, S. X. (2009). Empirical likelihood for estimating equations with missing values. The Annals of Statistics, 37, 490–517. doi: 10.1214/07-AOS585
- Wei, Y., Ma, Y., & Carroll, R. J. (2012). Multiple imputation in quantile regression. Biometrika, 99, 423–438. doi: 10.1093/biomet/ass007
- Wooldridge, J. M. (2007). Inverse probability weighted estimation for general missing data problems. Journal of Econometrics, 141, 1281–1301. doi: 10.1016/j.jeconom.2007.02.002