Abstract
We propose a constrained generalised method of moments (CGMM) for enhancing the efficiency of estimators in meta-analysis in which some studies do not measure all covariates associated with the response or outcome. Under some assumptions, we show that the proposed CGMM estimators have good asymptotic properties. We also demonstrate the effectiveness of the proposed method through simulation studies with fixed sample sizes.
1. Introduction
Because of the availability of multiple datasets, not just summary statistics, from different studies in modern applications, meta-analysis has become an important tool for gaining efficiency in estimating a common structural parameter vector of interest from all studies by appropriately using multiple datasets (DerSimonian & Laird, 1986; Hartung, Knapp, & Sinha, 2008; Higgins & Thompson, 2002; Higgins, Thompson, Deeks, & Altman, 2003; Schmidt & Hunter, 2014). There exists a rich literature on how to form optimal calibration equations for improving the efficiency of parameter estimates within various classes of unbiased estimators (Chen & Chen, 2000; Deville & Sarndal, 1992; Lumley, Shaw, & Dai, 2011; Robins, Rotnitzky, & Zhao, 1994; Slud & DeMissie, 2011; Wu, 2003; Wu & Sitter, 2001). The methodology for 'model-based' maximum likelihood estimation has also been studied previously in some special cases of this problem (Chatterjee, Chen, Maas, & Carroll, 2016). A number of researchers have proposed semiparametric maximum likelihood methods for various types of regression models, while accounting for complex sampling designs (Breslow & Holubkov, 1997; Lawless, Wild, & Kalbfleisch, 1999; Qin, Zhang, Li, Albanes, & Yu, 2015; Rao & Molina, 2015; Scott & Wild, 1997).
One issue that has to be addressed with multiple studies is that some studies may not measure all covariates, although all studies record the same response (Chatterjee et al., 2016). Specifically, a past study may have measured only q of the p + q covariates measured in the current study. Although the unobserved covariate values in the past study can be treated as missing covariate values, a better statistical procedure may be derived because, in each study, a covariate is either observed or missing entirely; this pattern is referred to as systematically missing covariates.
To illustrate the idea, let us consider the special case of two studies. Let Y be a response or outcome of interest, let U and X be p- and q-dimensional vectors of associated covariates measured in study 1, and let X be the covariate vector measured in study 2. We focus on the situation where whether U is observed does not affect the conditional means, i.e.,
(1) E(Y | U, X, δ = k) = E(Y | U, X) and E(U | X, δ = k) = E(U | X)
for study k = 1, 2, where δ indicates the study from which a unit is sampled. In the missing data literature, the 'missingness' of U with property (1) is referred to as missing at random, but not missing completely at random.
Suppose that we are interested in the parameters in the conditional mean E(Y | U, X), which can be called structural parameters. From the first equation in (1), estimation can actually be done using data from study 1 alone. However, we want to make use of data from study 2, which is the purpose of meta-analysis. The second equation, which will be referred to as the bridge equation, may enable us to obtain estimators based on data from all studies that are more efficient than those using only data from study 1.
In this article, we assume that the conditional means in (1) follow linear models for both the observation and bridge equations. Although more complicated models may be encountered in applications, the discussion with linear models is a good start on this problem. In Section 2, we propose a constrained generalised method of moments for estimation in the case of two studies. Asymptotic distributions of the proposed estimators are established, with which we illustrate when asymptotically more efficient estimators can be obtained. Simulation studies support our asymptotic results and illustrate the magnitude of the efficiency gain. Our method can be extended to the case of more than two studies. As an example of such an extension, in Section 3 we consider the situation of three studies, supplemented with simulation results. The last section contains some technical details.
2. Results for two independent studies
In this section, we consider two studies, indicated by k = 1, 2, with independent datasets. Following Section 1, we use Y, X, and U to denote the response of interest, the covariate vector measured in both studies, and the covariate vector measured only in study 1, respectively.
2.1. Constrained generalised method of moments
For illustration, we first consider a univariate U. Assume (1) and linear models for the two independent studies as follows:
(2) Y = βU + γ^T X + ε1 in study 1,
(3) Y = λ^T X + ε2 in study 2,
(4) U = b^T X + ε3,
where ε1, ε2, and ε3 are independent with mean 0 and variances σ1^2, σ2^2, and σ3^2, respectively, β, γ, b, and λ are parameters with appropriate dimensions, and the superscript T denotes vector transpose. Models (2)–(4) assume that the structural parameters β, γ, and b are the same for all studies, while the distributions of the ε's can vary with studies, exhibiting the heteroscedasticity of the data among different studies.
We are mainly interested in estimating β and γ in (2). Instead of using data from study 1 only, we try to make use of data from study 2 to gain estimation efficiency. Condition (4) is for bridging the data in the two studies, so that efficiency can be gained from the additional data in study 2. It is not necessary; see our discussion in Section 4.
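To see why a linear bridge equation links the study-2 regression to the study-1 parameters, write β and γ for the study-1 coefficients of U and X, b for the bridge-equation slope, and λ for the study-2 slope of X (illustrative notation for the reconstructed linear models). Iterated expectations give:

```latex
\[
E(Y \mid X)
  = E\{E(Y \mid U, X) \mid X\}
  = \beta\, E(U \mid X) + \gamma^{\top} X
  = (\gamma + \beta b)^{\top} X ,
\]
% so the study-2 slope satisfies the cross-study restriction
\[
\lambda = \gamma + \beta b .
\]
```

This restriction is what allows study-2 data, which never record U, to improve the estimation of γ and b, while (as shown in Section 2.2) providing no gain for β.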
Assume that we have two independent random samples with sizes n1 and n2 from studies 1 and 2, respectively, and denote the total sample size from all studies by n = n1 + n2. From (2)–(4), we construct estimating equations E{ψ(θ)} = 0 and a constraint C(θ) = λ − γ − βb = 0, where 0 is the vector of zeros, θ = (β, γ^T, b^T, λ^T)^T is the parameter vector, and ψ(θ) is a column vector with elements of
(Y − βU − γ^T X)(U, X^T)^T I(δ = 1), (U − b^T X) X I(δ = 1), (Y − λ^T X) X I(δ = 2),
where I(A) is the indicator function of the event A.
Let Z_i, i = 1, …, n, denote the observed data from the two samples, where the Z_i with δ_i = k are identically distributed as the variables with δ = k, k = 1, 2, and let ψ_i(θ) denote ψ(θ) evaluated at Z_i. The two step constrained generalised method of moments (CGMM) is applied as follows.

1. Compute the first step estimator by minimising ψ̄(θ)^T ψ̄(θ), where ψ̄(θ) = n^{-1} Σ_{i=1}^{n} ψ_i(θ), over θ with constraint C(θ) = 0.

2. Compute the weight matrix Ŵ = {n^{-1} Σ_{i=1}^{n} ψ_i(θ̂^(1)) ψ_i(θ̂^(1))^T}^{-1}, where θ̂^(1) is the first step estimator.

3. Compute the two step CGMM estimator θ̂ by minimising ψ̄(θ)^T Ŵ ψ̄(θ) over θ with constraint C(θ) = 0.
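The three steps above can be sketched numerically. The following is a minimal illustration, not the paper's implementation: the no-intercept scalar-U parametrisation, the constraint λ = γ + βb, the sample sizes, and all true parameter values are choices made for this example.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Hypothetical setup: study 1 observes (Y, U, X); study 2 observes (Y, X).
n1, n2 = 200, 400
n = n1 + n2
beta, gamma, b = 2.0, 0.5, 0.8        # assumed true structural parameters

X1 = rng.normal(size=n1)
U1 = b * X1 + rng.normal(size=n1)     # bridge equation, as in (4)
Y1 = beta * U1 + gamma * X1 + rng.normal(size=n1)
X2 = rng.normal(size=n2)
U2 = b * X2 + rng.normal(size=n2)     # U exists in study 2 but is unobserved
Y2 = beta * U2 + gamma * X2 + rng.normal(size=n2)

def psi(theta):
    """Per-observation moment vectors over the pooled sample; study
    indicators determine which block of moments an observation feeds."""
    be, ga, bb, la = theta
    out = np.zeros((n, 4))
    e1 = Y1 - be * U1 - ga * X1        # study-1 outcome residual
    e2 = U1 - bb * X1                  # study-1 bridge residual
    out[:n1, 0] = e1 * U1
    out[:n1, 1] = e1 * X1
    out[:n1, 2] = e2 * X1
    out[n1:, 3] = (Y2 - la * X2) * X2  # study-2 outcome residual
    return out

def objective(theta, W):
    g = psi(theta).mean(axis=0)        # sample moment vector
    return g @ W @ g

# Cross-study restriction: lambda = gamma + beta * b.
cons = {"type": "eq", "fun": lambda t: t[3] - t[1] - t[0] * t[2]}
theta0 = np.array([1.0, 0.0, 0.0, 0.0])

# Step 1: identity weight matrix.
step1 = minimize(objective, theta0, args=(np.eye(4),),
                 method="SLSQP", constraints=[cons])

# Steps 2-3: weight = inverse sample covariance of the moment functions.
P = psi(step1.x)
W = np.linalg.inv(P.T @ P / n)
cgmm = minimize(objective, step1.x, args=(W,),
                method="SLSQP", constraints=[cons])
print(np.round(cgmm.x, 3))            # estimates of (beta, gamma, b, lambda)
```

Because all four moment functions are linear in θ, each criterion is convex and the only nonlinearity enters through the bilinear constraint, which SLSQP handles directly.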
We now extend our idea to a multivariate U that is observed in study 1 but not in study 2. Let U be p-dimensional and U_j be its jth component. Then, the previous procedure can still be applied with U, β, b, ε3, σ3^2, C(θ), and θ replaced by their multivariate counterparts: β becomes a p-dimensional vector, b is replaced by a q × p matrix B, ε3 by a p-dimensional error vector, σ3^2 by a covariance matrix, and the constraint becomes C(θ) = λ − γ − Bβ = 0.
2.2. Asymptotic properties
The general theory for the generalised method of moments (GMM) is given in Hansen (1982). The CGMM proposed in Section 2.1 adds a constraint to the GMM. Engle and McFadden (1994) considered the CGMM for the purpose of testing hypotheses. We now establish an asymptotic result in a similar manner. For simplicity, we consider only a univariate U.
Let θ0 denote the true value of the parameter vector θ. For the CGMM estimator θ̂ defined in Section 2.1 with the constraint C(θ) = 0, we have the following result.
Theorem 2.1

Assume that models (2)–(4) hold; that θ0 is the unique root of E{ψ(θ)} = 0; that both n1 and n2 diverge to ∞ with n1/n → h ∈ (0, 1); and that the matrices Σ = E{ψ(θ0)ψ(θ0)^T}, Γ = E{∂ψ(θ0)/∂θ^T}, and C = ∂C(θ0)/∂θ^T = (−b, −I_q, −βI_q, I_q) are all of full rank, where I_q is the identity matrix of order q. Then,
(5) √n (θ̂ − θ0) →d N(0, V),
where V = Ω − ΩC^T(CΩC^T)^{-1}CΩ with Ω = (Γ^T Σ^{-1} Γ)^{-1}, and →d denotes convergence in distribution as n → ∞.
If we do not use the constraint C(θ) = 0, then the unconstrained GMM (UGMM) estimator in our problem described in Section 2.1 is the vector of the least squares estimators of β, γ, and b based on data in study 1 only, together with the least squares estimator of λ based on data in study 2 only. Let θ̃ be the UGMM estimator. Then
(6) √n (θ̃ − θ0) →d N(0, Ω), where Ω = (Γ^T Σ^{-1} Γ)^{-1},
which can be derived in the same manner as deriving (5) but with the constraint dropped.
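Concretely, the UGMM estimator here is nothing more than least squares run separately within each study. A minimal sketch, in which the simulated data and true parameter values are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 200, 400
beta, gamma, b = 2.0, 0.5, 0.8        # assumed true values for illustration

# Study 1 observes (Y, U, X); study 2 observes (Y, X) only.
X1 = rng.normal(size=n1)
U1 = b * X1 + rng.normal(size=n1)
Y1 = beta * U1 + gamma * X1 + rng.normal(size=n1)
X2 = rng.normal(size=n2)
Y2 = beta * (b * X2 + rng.normal(size=n2)) + gamma * X2 + rng.normal(size=n2)

# UGMM: ordinary least squares within each study, no cross-study constraint.
Z1 = np.column_stack([U1, X1])
beta_hat, gamma_hat = np.linalg.lstsq(Z1, Y1, rcond=None)[0]
b_hat = np.linalg.lstsq(X1[:, None], U1, rcond=None)[0][0]
lam_hat = np.linalg.lstsq(X2[:, None], Y2, rcond=None)[0][0]

# Without the constraint, lam_hat need not equal gamma_hat + beta_hat * b_hat;
# the CGMM forces this equality and thereby shares information across studies.
print(lam_hat, gamma_hat + beta_hat * b_hat)
```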
Is the CGMM estimator θ̂ asymptotically more efficient than the UGMM estimator θ̃ as a result of utilising both datasets? It follows from results (5) and (6) that a component of θ̂ is asymptotically more efficient than the corresponding component of θ̃ if and only if the corresponding diagonal element of the matrix ΩC^T(CΩC^T)^{-1}CΩ, the difference between the two asymptotic covariance matrices, is positive.

To find the magnitude of the efficiency gains from using the CGMM, we need to address the issue that the limit h of the sample size ratio may differ from 1/2, and we need to derive the asymptotic covariance matrices in (5) and (6) more explicitly.
Note that the first 1+2q components of θ̂, denoted by θ̂(1), estimate (β, γ^T, b^T)^T based on data from study 1 with size n1, whereas the last q components of θ̂, denoted by θ̂(2), estimate λ based on data from study 2 with size n2. From the technical details in Section 5, we obtain from (5) that
(7)
where the rescaling matrix in (7) is diagonal, with its first 1+2q diagonal elements corresponding to study 1 and its last q diagonal elements corresponding to study 2. For the special case where δ and X in (1) are independent (so that the missingness of U is completely at random), it is further shown in Section 5 that
(8)
where 0 denotes a column or row vector of zeros or a matrix of zeros with an appropriate dimension, and that
(9)
where the matrix appearing in (9) is given by
(10)
Similarly, if θ̃(1) and θ̃(2) denote the UGMM estimators of (β, γ^T, b^T)^T and λ, then
(11)
We define the asymptotic relative efficiency gain from using the CGMM estimator θ̂_j, the jth component of θ̂, with respect to the unconstrained GMM estimator θ̃_j, the jth component of θ̃, to be the relative reduction in asymptotic variance. From (7) and (11), these gains can be derived as follows. First, the gain for β is zero, i.e., there is no gain in estimating β; intuitively, this is because the dataset in study 2 does not have information on U. Second, for estimating the q components of γ, the gain for the tth component involves the tth diagonal element of the matrix in (10) and the tth component of b. Third, the gain for estimating the q components of b is a decreasing function of h. Hence the CGMM estimators of the components of γ and b are increasingly more efficient as h decreases, i.e., as n2/n increases, which means that more information can be borrowed from study 2. Finally, the gain for estimating the q components of λ increases as h increases.
2.3. Simulation study
Two simulation studies are carried out to check the empirical performance of the CGMM and UGMM estimators with finite fixed sample sizes. In the first simulation, we consider univariate U and X, i.e., p = q = 1. The covariate X is generated from the standard normal distribution. The covariate U and the response Y are generated according to (2)–(4) with ε1, ε2, and ε3 independently distributed as standard normal.
Based on 2000 simulations, Table 1 gives the simulation variances of the estimators of the univariate parameters β, γ, b, and λ for both CGMM and UGMM. All simulation biases are less than 0.006 and thus are not reported. True values of the parameters and the different sample sizes are included in Table 1.
Table 1. Simulation variances of CGMM and UGMM estimators (p=q=1).
A few conclusions can be drawn from Table 1.

When n1 = n2, there is almost no improvement in estimating β, but there are substantial gains in estimating the other three parameters, which supports the asymptotic results discussed in Section 2.2. The vector of asymptotic relative efficiency gains derived in Section 2.2 is very close to the simulated relative gains.

When n1 < n2, more information from study 2 can be borrowed to estimate the parameters in study 1. We observe larger gains in estimating γ and b, but a smaller gain in estimating λ. The vector of asymptotic relative efficiency gains derived in Section 2.2 is again very close to the simulated relative gains.

When n1 > n2, we observe smaller gains in estimating γ and b, but a larger gain in estimating λ. Again, the vector of asymptotic relative efficiency gains derived in Section 2.2 is very close to the simulated relative gains.
Our second simulation considers a q = 2 dimensional X, while U is still univariate. Data are generated according to (2)–(4) with X being two-dimensional normal with zero marginal means, unit marginal variances, and a correlation ρ.
Note that γ, b, and λ are all 2-dimensional. Based on 2000 simulations, Table 2 gives the simulation variances of the estimators of β and of the components of γ and λ for both CGMM and UGMM; results for the variances of the estimators of the components of b are omitted. Again, all simulation biases are less than 0.004 and thus are not reported. True values of the parameters, the sample sizes, and the values ρ = 0, 0.3, and 0.6 considered are included in Table 2.
Table 2. Simulation variances of CGMM and UGMM estimators (p=1, q=2).
Table 2 shows results similar to those in Table 1. In estimating the parameters other than β, the simulation relative gain of CGMM over UGMM ranges from 10% to 20%, while there is no gain in estimating β. Increasing ρ, the correlation between the two components of X, increases the relative efficiency gain, but not substantially.
3. Results for three independent studies
The method and results in Section 2 can be extended to various situations where the number of independent studies is more than two and different covariates are observed in different studies. In this section we consider the case of three studies, in which the response Y and the covariates U, X, and V are observed according to study-specific availability patterns, with sample sizes n1, n2, and n3 in the three studies. The total sample size from all studies is n = n1 + n2 + n3.
3.1. CGMM
Similar to (1) and (2)–(4), we assume that the conditional means do not depend on the study from which a unit is sampled,
(12)
(13)
(14)
and that the following linear models hold:
(15)
(16)
(17)
(18)
where the coefficient matrices in (15)–(18) have the appropriate dimensions.
Assume that the samples are independent and identically distributed within each study and independent among studies, and that the ε's are independent with mean zero. By assumptions (12), (14), (15), (17), and the expression in (18), we have the following constraint conditions:
(19)
By assumptions (12), (13), (15), (16), and the expression in (18), we have the following constraint conditions:
(20)
(20)
Models (15)–(18) assume that the structural parameters are the same for all studies, while the distributions of the ε's can vary with studies. We are mainly interested in estimating the parameters in (15). Instead of using data from study 1 only, we try to make use of data from studies 2 and 3 to gain estimation efficiency. Condition (18) is needed for bridging the data among the three studies; without this condition, it is hard to gain any efficiency by using the additional data from studies 2 and 3.
For a matrix M, denote by vec(M) the row vector containing all rows of M. From (15)–(20), we construct estimating equations and a constraint in the same fashion as in Section 2.1, with the ψ vector now built from blocks selected by the indicators of the three studies. Let Z_i, i = 1, …, n, be independent samples, where the observations with δ_i = k are identically distributed as the underlying variables with δ = k, k = 1, 2, 3. The two step CGMM is applied as follows.

1. Compute the first step estimator by minimising the unweighted GMM criterion over θ with the constraint.

2. Compute the weight matrix from the first step estimator, as in Section 2.1.

3. Compute the two step CGMM estimator by minimising the weighted GMM criterion over θ with the constraint.
Asymptotic properties of the CGMM estimator can be established similarly to Theorem 2.1.
3.2. Simulation study
In this section, we consider univariate U and V, i.e., p = l = 1, so that the corresponding coefficient vectors and matrices reduce to scalars. Two simulation studies are carried out to check the empirical performance of the CGMM and UGMM estimators with finite fixed sample sizes. In the first simulation, we also consider univariate X, i.e., q = 1, so that the remaining coefficient vectors reduce to scalars. The covariates X and V are independently generated from the standard normal distribution. The covariate U and the response Y are generated according to (15)–(18) with the four error terms independently distributed as standard normal.
Based on 2000 simulations, Table 3 gives the simulation variances of the estimators of three of the parameters for both CGMM and UGMM. Results for the variances of the estimators of the other parameters are omitted. All simulation biases are less than 0.007 and thus are not reported. True values of the parameters and the different sample sizes are included in Table 3.
Table 3. Simulation variances of CGMM and UGMM estimators (p=l=q=1).
It can be seen that the messages provided by Table 3 are very similar to those from Table 1. When the three sample sizes are equal, the simulation relative efficiency gains of CGMM over UGMM follow the pattern predicted by the theory. When n3 is relatively large, more information from study 3 can be borrowed to estimate the parameters in study 1, and when n2 is relatively large, more information from study 2 can be borrowed to estimate the parameters in study 1, with correspondingly larger simulation relative efficiency gains.
Our second simulation considers a q = 2 dimensional X, while U and V are still univariate. Data are generated according to (15)–(18) with X being two-dimensional normal with zero marginal means, unit marginal variances, and a correlation ρ.
Based on 2000 simulations, Table 4 gives the simulation variances of the estimators of four of the parameters for both CGMM and UGMM. Results for the variances of the estimators of the other parameters are omitted. Again, all simulation biases are less than 0.004 and thus are not reported. True values of the parameters, the sample sizes, and the values ρ = 0, 0.3, and 0.6 considered are included in Table 4. The results show a substantial improvement of CGMM over UGMM, and the effect of ρ is not substantial.
Table 4. Simulation variances of CGMM and UGMM estimators (p=l=1, q=2).
4. Discussion
We have proposed a CGMM estimator for using information from datasets in different studies. An asymptotic theorem is established in the case of two studies to characterise when the CGMM estimator is more efficient than the UGMM estimator, which uses data from one study only. Our simulation studies show that the CGMM estimators can achieve major efficiency gains over the UGMM estimators in cases with two or three studies.
The results for three studies are similar to those for two studies, although the CGMM procedure becomes more complicated; this remains true as further studies are added. The improvement of the CGMM over the UGMM (which essentially uses only within-study data) increases with the number of studies, since more datasets are involved. However, the derivation of the CGMM may become unwieldy when there are many studies and datasets.
We have considered linear models for the data in both the observation and bridge patterns. This is not necessary, and the approach can be extended. For example, assumptions (3) and (4) may be replaced by a more general assumption on the relevant conditional distribution, either parametric or nonparametric. More research is needed to extend the framework and to explore methods that can handle more general model assumptions.
5. Technical details
Proof of Theorem 2.1
Note that ψ̄(θ0) is a sample average of i.i.d. random vectors with mean zero and finite covariance matrix Σ. The Lindeberg–Lévy central limit theorem then implies
(21)
Define the Lagrangian of the constrained minimisation problem, in which η is a column vector of undetermined Lagrange multipliers; these are non-zero when the constraints are binding. The first-order conditions for the solution of the constrained optimisation problem are
(22)
Since θ̂ is a consistent estimator of θ0, Taylor expansions of ψ̄ and of the constraint C around θ0 are valid up to terms of order o_p(1), where o_p(1) denotes a sequence of random vectors converging to zero in probability. Substituting these expansions into the first-order conditions in (22) yields
(23)
Applying the formula for the inverse of a partitioned matrix (Lu & Shiou, 2002) to (23), together with the fact that C(θ0) = 0, yields
(24)
Then, result (5) follows from (21) and (24).
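The constrained first-order conditions and the partitioned-matrix inversion used in (22)–(24) can be checked numerically on a toy equality-constrained quadratic criterion; the matrices below are random stand-ins for the GMM criterion and the linearised constraint, not quantities from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
k, c = 5, 2                           # parameters and constraints (toy sizes)

# Toy criterion: minimise (theta - m)' A (theta - m) subject to R theta = r.
M = rng.normal(size=(k, k))
A = M @ M.T + k * np.eye(k)           # positive definite Hessian
m = rng.normal(size=k)
R = rng.normal(size=(c, k))
r = rng.normal(size=c)

# Stationarity of the Lagrangian L = (theta-m)'A(theta-m) + eta'(R theta - r)
# gives a partitioned linear system of the same shape as (23):
K = np.block([[2 * A, R.T],
              [R, np.zeros((c, c))]])
rhs = np.concatenate([2 * A @ m, r])
sol = np.linalg.solve(K, rhs)
theta_hat, eta_hat = sol[:k], sol[k:]

# The constrained minimiser satisfies both first-order conditions:
grad = 2 * A @ (theta_hat - m) + R.T @ eta_hat   # should be (numerically) zero
print(np.max(np.abs(grad)), np.max(np.abs(R @ theta_hat - r)))
```

Inverting the partitioned matrix K in closed form, as in Lu and Shiou (2002), is what produces the explicit expression for the constrained estimator in (24).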
Proofs of (7)–(10). Let D_n be the diagonal matrix whose first 1+2q diagonal elements rescale by the study-1 sample size n1 and whose last q diagonal elements rescale by the study-2 sample size n2. Rescaling (5) by D_n then implies result (7).
To complete the proof, we now give derivations of (8)–(10). Write Σ in block form according to the partition of θ into the study-1 components (β, γ^T, b^T)^T and the study-2 component λ. The cross-study blocks of Σ vanish because the indicator functions of the two studies are mutually exclusive. Within the study-1 block, the covariance between the outcome-model moments and the bridge-model moments is zero under the assumptions that δ and X are independent, that ε1 is independent of (U, X), and that ε3 has mean zero and variance σ3^2 and is independent of X; the last step uses (4). Similarly, by the independence of δ and X, the remaining off-diagonal blocks are zero. Thus, Σ is block diagonal.
By the definitions of ψ and θ, the partial derivatives corresponding to blocks other than the diagonal blocks in Γ = E{∂ψ(θ0)/∂θ^T} are zero. By the model assumptions and the independence of δ and X, the diagonal blocks can be computed explicitly. Combining these results, we obtain
(25)
By (25) and the definitions given above, we obtain the explicit form of the asymptotic covariance matrix in (8).
The explicit forms in (9) and (10) then follow from (25) and the definitions given above.
Disclosure statement
No potential conflict of interest was reported by the authors.
Notes on contributors
Menghao Xu
Menghao Xu is a doctoral candidate in the College of Statistics, East China Normal University. His main research interests are variable selection, missing data, and survival analysis.
Jun Shao
Jun Shao is a professor in the Department of Statistics, University of Wisconsin-Madison, and in the College of Statistics, East China Normal University. His research covers a wide range of fields, such as the jackknife, bootstrap, and other resampling methods; variable selection and inference with high-dimensional data; sample surveys (variance estimation, imputation for nonrespondents); missing data (nonignorable missingness, dropout, semi-parametric methods); longitudinal data analysis with missing data and/or measurement error; and medical statistics (clinical trials, personalised medicine, bioequivalence). He is the author of Mathematical Statistics, a widely used graduate textbook covering topics in statistical theory essential for graduate students preparing for Ph.D. work in statistics.
References
- Breslow, N. E., & Holubkov, R. (1997). Maximum likelihood estimation of logistic regression parameters under two-phase, outcome-dependent sampling. Journal of the Royal Statistical Society, Series B, 59(2), 447–461. doi: 10.1111/1467-9868.00078
- Chatterjee, N., Chen, Y.-H., Maas, P., & Carroll, R. J. (2016). Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources. Journal of the American Statistical Association, 111, 107–117. doi: 10.1080/01621459.2015.1123157
- Chen, Y.-H., & Chen, H. (2000). A unified approach to regression analysis under double sampling design. Journal of the Royal Statistical Society, Series B, 62, 449–460. doi: 10.1111/1467-9868.00243
- Deville, J. C., & Sarndal, C. E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association, 87, 376–382. doi: 10.1080/01621459.1992.10475217
- Engle, R. F., & McFadden, D. L. (1994). Handbook of econometrics (Vol. 4). Amsterdam: Elsevier Science, North Holland.
- Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica, 50, 1029–1054. doi: 10.2307/1912775
- Hartung, J., Knapp, G., & Sinha, K. B. (2008). Statistical meta-analysis with applications. New York, NY: Wiley.
- Higgins, J. P. T., & Thompson, S. G. (2002). Quantifying heterogeneity in a meta-analysis. Statistics in Medicine, 21, 1539–1558. doi: 10.1002/sim.1186
- Higgins, J. P. T., Thompson, S. G., Deeks, J. J., & Altman, D. G. (2003). Measuring inconsistency in meta-analyses. British Medical Journal, 327, 557–560. doi: 10.1136/bmj.327.7414.557
- Lawless, J. F., Wild, C. J., & Kalbfleisch, J. D. (1999). Semiparametric methods for response-selective and missing data problems in regression. Journal of the Royal Statistical Society, Series B, 61, 413–438. doi: 10.1111/1467-9868.00185
- Lu, T. T., & Shiou, S. H. (2002). Inverses of 2×2 block matrices. Computers and Mathematics with Applications, 43, 119–129. doi: 10.1016/S0898-1221(01)00278-4
- Lumley, T., Shaw, P. A., & Dai, J. Y. (2011). Connections between survey calibration estimators and semiparametric models for incomplete data. International Statistical Review, 79, 200–220. doi: 10.1111/j.1751-5823.2011.00138.x
- Qin, J., Zhang, H., Li, P., Albanes, D., & Yu, K. (2015). Using covariate-specific disease prevalence information to increase the power of case- control studies. Biometrika, 102, 169–180. doi: 10.1093/biomet/asu048
- Rao, J. N. K., & Molina, I. (2015). Small area estimation. New York, NY: Wiley.
- Robins, J. M., Rotnitzky, A., & Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89, 846–866. doi: 10.1080/01621459.1994.10476818
- Schmidt, F. L., & Hunter, J. E. (2014). Methods of meta-analysis: Correcting error and bias in research findings. Newbury Park, CA: Sage Publications.
- Scott, A. J., & Wild, C. J. (1997). Fitting regression models to case-control data by maximum likelihood. Biometrika, 84, 57–71. doi: 10.1093/biomet/84.1.57
- DerSimonian, R., & Laird, N. (1986). Meta-analysis in clinical trials. Controlled Clinical Trials, 7, 177–188. doi: 10.1016/0197-2456(86)90046-2
- Slud, E., & DeMissie, D. (2011). Validity of regression meta-analyses versus pooled analyses of mixed linear models. Mathematics in Engineering, Science and Aerospace, 2, 251–265.
- Wu, C. (2003). Optimal calibration estimators in survey sampling. Biometrika, 90, 937–951. doi: 10.1093/biomet/90.4.937
- Wu, C., & Sitter, R. R. (2001). A model-calibration approach to using complete auxiliary information from survey data. Journal of the American Statistical Association, 96, 185–193. doi: 10.1198/016214501750333054