ABSTRACT
In many settings, multiple data collections and analyses on the same topic are summarised separately through statistical estimators of parameters and variances, and yet there are scientific reasons for sharing some statistical parameters across these different studies. This paper summarises what is known from large-sample theory about when estimators of a common structural parameter from several independent samples can be combined functionally, or more specifically linearly, to obtain an asymptotically efficient estimator from the combined sample. The main idea is that such combination can be done when the separate-sample nuisance parameters, if any exist, vary freely and independently of one another. The issues are illustrated using data from a multi-centre lung cancer clinical trial. Examples are presented to show that separate estimators cannot always be combined in this way, and that the functionally combined separate estimators may have low or zero efficiency compared with the unified analysis that could be performed by pooling the datasets.
1. Introduction
In many fields of social and biomedical science, multiple studies estimating the same parameter are conducted and summarised separately. The parameter of interest is often a measure of effectiveness, that is, of difference in response between groups with and without a certain treatment. Effectiveness may be quantified by the positive effect of a biomedical treatment or of an intervention in a social-science context such as education or criminal justice. The most common version of the problem, addressed by meta-analysis (Hartung, Knapp, & Sinha, 2008), arises when many independent and uncoordinated studies are not individually large enough to make a definitive statement about positivity of the single (usually scalar) parameter θ which has scientific or policy interest. Another setting in which analyses are combined is the less common one of large similar studies (say, done in different geographic regions) with a shared parameter, where there may or may not be shared nuisance parameters. Whether large or small, few or many, the separate studies might differ in the precise characteristics of the population investigated, the criteria of inclusion in each study, or in the measurements collected, either because of the choice of auxiliary variables or because of the definitions and research methods used.
The primary objective here, as in the numerous examples of meta-analysis in Hartung et al. (2008), is to combine the results of many moderate-sample studies to distill a consensus parameter estimate, or a corresponding test of significance of specific component parameter(s) representing treatment effectiveness. Since it is not always possible to gain simultaneous access to the raw unit-level data of previously published studies, researchers seeking a definitive test or estimator of treatment effectiveness often attempt to combine the study results through a function of their separate summary statistics rather than through a re-analysis of all pooled study subject-level data based on a unified model.
The active field of statistical meta-analysis is occupied with methods of simultaneously modelling disparate studies, or the estimated parameters from those studies, in such a way that they can be combined (Hartung et al., 2008). Much of this effort has gone into models that account for differences in study methodology through random effects. Random-effect, empirical-Bayes, and hierarchical Bayes methods have all proved useful in this effort to combine study results. Efforts to ‘borrow strength’ across distinct experimental entities are ubiquitous in the random-effect and Bayes community (Carlin & Louis, 2008; Efron, 1996), but arise also under the heading of Small Area Estimation in the survey world (Rao & Molina, 2015). The validity of analyses depending on models to combine different experiments with shared parameters and random effects can always be questioned, but sometimes such analyses turn out to be surprisingly robust (Slud & DeMissie, 2011). We discuss and illustrate some of the relevant modelling issues in Section 2 below.
This paper studies the problem of combining studies using a frequentist large-sample approach based on the standard tools of Fisher information and large-sample asymptotics. We begin by giving notations, statements and explanations of relevant results.
Suppose that k large independent samples of independent vectors of data $X_{j1},\dots,X_{jn_j}$ are observed, for $j=1,\dots,k$, with the goal of estimating the common parameter $\beta \in \mathbb{R}^p$ efficiently. The densities $f_j(\cdot\,;\beta,\lambda_j)$ of individual data-vectors $X_{ji}$ are assumed known except for the parameters $(\beta,\lambda_j)$, where the nuisance parameters $\lambda_j$ may not be present for all j, but k and the parameter dimensions are fixed while the sample-sizes $n_j$ all tend to ∞. Assume that the separate samples are summarised only through estimators $\hat\beta_j$ of β, along with estimators $\hat V_j/n_j$ of their variances, and that it is desired to estimate β as efficiently as possible from these statistics. The vectors $\hat\beta_j$ might then be treated as independent data with approximate means β and with known variance matrices $V_j/n_j$. We refer to $\hat\beta_j$ as separate-sample estimators and to estimators within a unified model of the k-sample data considered together as combined-sample estimators.
When β is scalar, the best unbiased linear-combination estimator with respect to mean-squared error is well known to have weights proportional to the inverse variances of the $\hat\beta_j$. This estimator has often been viewed in meta-analysis (Hartung et al., 2008, Chapter 4) as arising from the model

(1)   $\hat\beta_j \;=\; \beta + \epsilon_j, \qquad \epsilon_j \ \text{independent} \ N(0,\, V_j/n_j),$

where the variances $V_j/n_j$ are treated as known. This model is treated as the source of the meta-analytic estimator (3) below also in the p-vector setting.
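To make the weighting concrete, the following minimal sketch (not from the paper; the study estimates and variances are invented for illustration) computes the inverse-variance weighted estimator implied by model (1):

```python
# Inverse-variance weighted combination of independent study estimates
# (the classical fixed-effect meta-analytic estimator for scalar beta).
# All numbers are illustrative, not taken from the paper.

def combine_scalar(estimates, variances):
    """Return the combined estimate and its variance, with weights 1/V_j."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    beta_hat = sum(w * b for w, b in zip(weights, estimates)) / total
    return beta_hat, 1.0 / total

beta_hat, var_hat = combine_scalar([0.3, 0.5, 0.4], [0.04, 0.09, 0.01])
print(beta_hat, var_hat)
```

Note that the combined variance, the reciprocal of the summed weights, is never larger than the smallest separate-study variance.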
Olkin and Sampson (1998) considered a balanced ANOVA model

(2)

which fits into our setting, where the components of the error vectors $\epsilon_{ji}$ are uncorrelated and all have mean 0 and (known or unknown) variance $\sigma^2$, and $\hat\beta_j$ are the least-squares estimators of the β vectors in terms of the jth-sample data $X_{ji}$. Olkin and Sampson (1998) prove – in a setting extending (2) to unbalanced ANOVA – that the best linear estimator (3) below in model (1) is identical to the least squares estimator of β within the combined unit-level model (2). When the errors are normal, the Gauss-Markov Theorem implies that the linear-combination estimator (3) is Maximum Likelihood and therefore efficient within (2).
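The Olkin-Sampson equivalence can be checked numerically in its simplest special case, a common mean estimated in each of several samples with a shared error variance (a sketch with simulated data, not the authors' ANOVA setting in full generality):

```python
# Check, in the common-mean special case, that the inverse-variance weighted
# combination of per-sample least-squares estimators (sample means, with
# variance sigma^2/n_j) coincides with the least-squares estimator from the
# pooled unit-level data. Data are simulated purely for illustration.
import random

random.seed(0)
sigma2 = 4.0
samples = [[random.gauss(1.5, sigma2 ** 0.5) for _ in range(n)]
           for n in (30, 50, 80)]

means = [sum(s) / len(s) for s in samples]          # separate-sample estimators
variances = [sigma2 / len(s) for s in samples]      # their exact variances

weights = [1.0 / v for v in variances]
combined = sum(w * m for w, m in zip(weights, means)) / sum(weights)

pooled = sum(sum(s) for s in samples) / sum(len(s) for s in samples)
print(abs(combined - pooled))  # agrees up to floating-point error
```

The agreement is exact in this case because the weights are proportional to the sample sizes, so the shared variance cancels.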
Two recent articles, Lin and Zeng (2010) and Liu, Liu, and Xie (2015), have treated in different ways the problem of efficiently combining separate-study estimators of a shared parameter when there are nuisance parameters. Lin and Zeng (2010) discuss the parametric combination of separate-study estimators when only the separate-study estimators $\hat\beta_j$ and variance estimators $\hat V_j$ are available. They prove the result given as (V) below, in the case where the same parameter β is estimated by maximum likelihood in each study and all nuisance parameters vary freely and unconstrained between studies. They do not cover the case where nuisance parameters are infinite-dimensional, or where the separate studies may estimate only projections of β, a situation that occurs naturally in meta-analysis, as illustrated in the example of Section 2 below. These extensions are covered in the present paper, in Section 3 and Appendices 1 and 3. The second issue, that separate studies may estimate only functions of a common parameter, is treated fully by Liu et al. (2015) in a more general setting where nuisance parameters may constrain each other across studies. The paper of Liu et al. (2015), using the idea of confidence distributions, establishes under general regularity conditions the form of the optimal combination of the separate-study estimators of all of the parameters, not just the shared structural parameters but the nuisance parameters too. That paper is therefore less directly relevant to meta-analysis in practice, since it is very uncommon for investigators in separate studies to report the estimated nuisance parameters as well as joint variance estimates of all structural and nuisance parameters, as was noted by Lin and Zeng (2010, first paragraph of Section 2.2).
In the rest of this paper, we allow the possibility that in the jth sample the estimand may be not the full p-vector parameter β but rather a projection $P_j\beta$ of it to a known subspace of the p-dimensional parameter space, with ‘structural zeroes’ in place of the other coordinates. In case the range of the projection $P_j$ has dimension $p_j \le p$, the jth sample estimator $\hat\beta_j$ is assumed to be either an efficient estimator of $P_j\beta$ or one derived by solving an estimating equation, with asymptotic variance $V_j/n_j$, where $V_j$ is consistently estimated by $\hat V_j$. Both $V_j$ and $\hat V_j$ are assumed to have range-spaces the same as that of $P_j$, and respective generalised-inverses $V_j^-$ and $\hat V_j^-$ which are inverses of $V_j$ and $\hat V_j$ on the range of $P_j$ and are the 0 operator on the orthogonal-complement space. Appendix 1 proves, under general large-sample regularity conditions, that if $\hat\beta_j$ are asymptotically efficient estimators of $P_j\beta$ from the separate samples, then the overall p-vector estimator

(3)   $\tilde\beta \;=\; \Big(\sum_{j=1}^k n_j\,\hat V_j^-\Big)^{-}\,\sum_{j=1}^k n_j\,\hat V_j^-\,\hat\beta_j$

is an efficient estimator of β from the combined sample, i.e., has minimal asymptotic variance, if there exists a regular efficient estimator depending (smoothly) on the summary statistics $(\hat\beta_j,\hat V_j)_{j=1,\dots,k}$ alone. Alternatively, this efficiency is proved to hold in Section 3 under the restriction that the nuisance parameters $\lambda_j$ range freely, without any constraints connecting them for different j. In a further reformulation in Section 3, it is shown that when the nuisance parameters $\lambda_j$ range freely without any constraints connecting them for different j, and the estimators $\hat\beta_j$ are obtained by solving M-estimating equations

(4)   $\sum_{i=1}^{n_j} \psi_j(X_{ji},\,\beta,\,\lambda_j) \;=\; 0,$

the estimator $\tilde\beta$ is efficient in the sense of having the same asymptotic variance as the best estimating-equation estimator linearly combining the estimating functions $\psi_j$ with matrix coefficients. Through examples in Section 4, we show that when the parameters $\lambda_j$ are related by constraints, the estimator (3) is generally not efficient and may have efficiency 0, which means intuitively that its large-sample variance has larger order of magnitude than that of the best possible estimator based on the combined sample.
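The generalised-inverse convention in (3) amounts to letting a study contribute no information about the coordinates of β it does not estimate. A small self-contained sketch (illustrative values, not from the paper; p = 2, with study 2 estimating only the first coordinate):

```python
# Sketch of the p-vector combination (3) when one study estimates only a
# projection of beta: the generalised inverse of that study's variance
# inverts it on its range and is 0 on the orthogonal complement, so the
# study contributes no information about the missing coordinate.
# All matrices and estimates are invented for illustration.

def inv2(m):
    """Inverse of a nonsingular 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mat_vec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

# Study 1 estimates both coordinates of beta; study 2 estimates only beta[0].
beta1, V1 = [0.40, 0.10], [[0.050, 0.010], [0.010, 0.080]]
beta2_first, v2 = 0.35, 0.030       # second coordinate is a structural zero

W1 = inv2(V1)                        # ordinary inverse (full rank)
W2 = [[1.0 / v2, 0.0], [0.0, 0.0]]   # generalised inverse of diag(v2, 0)

A = [[W1[i][j] + W2[i][j] for j in range(2)] for i in range(2)]
b = [x + y for x, y in zip(mat_vec(W1, beta1),
                           mat_vec(W2, [beta2_first, 0.0]))]
beta_comb = mat_vec(inv2(A), b)      # A is nonsingular here, so inv = ginv
print(beta_comb)
```

The first coordinate of the result is pulled between the two studies' estimates, while the second is driven by study 1 alone (apart from the correlation carried by the off-diagonal entries of $V_1$).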
Before going on to theoretical developments, we illustrate these ideas in the next section using data from a real clinical trial.
2. Data example from a multi-centre clinical trial
Large randomised clinical trials are often conducted simultaneously at different medical centres. They are generally governed by the same clinical protocols — including formal entry criteria, randomisation methods, baseline and cross-sectional measurements to be collected, and study endpoints — but can differ in many of the same ways as completely separate studies: the patient populations from which study participants are recruited, slight differences in the way entry criteria and medical procedures are applied by medical personnel at the different centres, and the patient management strategies of individual physicians associated with the different centres. However, since the study design is shared by all the centres, such large clinical trials can be excellent test-beds for methods purporting to estimate shared parameters by combining analyses done in separate smaller studies. We describe the separate and combined analyses for just such a study, the Eastern Cooperative Oncology Group (ECOG) EST 1582 clinical trial of two different combination chemotherapies for treatment of small cell lung cancer, which has previously been analysed by Gray (1994) and studied by meta-analysis in Slud and DeMissie (2011). The standard therapy in this trial was CAV, a combination of cyclophosphamide, adriamycin and vincristine, and the experimental treatment regimen (CAV-HEM) alternated cycles of CAV with hexamethylmelamine, etoposide and methotrexate. Allocation to these two treatment arms was randomised and equal.
The covariates in the study were binary indicators: Trt for experimental treatment, bone for bone metastasis, liver for liver metastasis, wtloss for weight-loss prior to study entry, and a measure Perf of performance status at baseline. These covariates entered significantly into an overall Bayesian proportional-hazards analysis by Gray (1994), who found that both a coefficient for Trt and one for an interaction term Trt-by-bone were significant. Here, as in Slud and DeMissie (2011), building on the earlier MS thesis work of DeMissie (2009), we analyse the same data separately by centre, taking the Trt-by-bone interaction into account.
The data for subject i in study-centre j consist of $Y_{ji}$, equal to the logarithm of survival time (or log of time until censoring, for the 10 out of 570 patients who were lost to follow-up at a time before death); $\delta_{ji}$, indicating whether $Y_{ji}$ is a failure time rather than a censoring time; and $Z_{ji}$, the vector of covariates consisting of the constant 1, Trt, bone, liver, Perf, wtloss in Model 1, augmented in Model 2 by Trt*bonCtr, where bonCtr is a recoded covariate obtained by subtracting from bone its study-centre mean. The main Trt effect in Model 2 is still the Trt coefficient, due to the centering in bonCtr. The 18 study-centre clusters were obtained from the 26 original hospitals, as in DeMissie (2009), after merging some smaller ones by clustering for similarity of covariate means, so that the smallest number of individuals in any centre became 17. The underlying models we consider here are

(5a)

or

(5b)

where $Y_{ji}$, $Z_{ji}$ are as defined above, $b_j$ are independent random cluster-effects included only in model (5b), and $\epsilon_{ji}$ are independent random errors distributed as extreme-value, i.e., as the logarithm of a unit exponential variable. These models are called Model 1 (a or b) when the parameter β common to all study-centres is the scalar coefficient of Trt. Model 2 (a or b) differs only by including the Trt*bonCtr covariate, with β the 2-vector of coefficients of Trt and Trt*bonCtr.
Models 1a and 2a are first fitted for each study-centre, and their separately estimated β coefficients and standard errors, obtained from the R function survreg, are exhibited in Table 1. (Note that in each of Centres 2 and 7, the Trt*bone values are all 0, so that in these centres Trt*bonCtr is an affine function of Trt and its coefficient is a structural 0.) We then construct the estimators (3) in these two models, scalar for Model 1a and a 2-vector for Model 2a: these are the estimators that would be produced in a meta-analysis if Model 1a and 2a estimation results (including the estimated variance matrices for the coefficient estimates in 2a) were separately published from studies at distinct centres. For Model 1a we compute the meta-analytic estimate (3) of β with its standard error; for Model 2a, the meta-analytic coefficient estimates have respective SE's 0.051 and 0.117. In all of these analyses, the Trt coefficient is highly significant, indicating that the experimental treatment prolonged life, as was found by Gray (1994) and DeMissie (2009), but the separate centres' models seem to yield conflicting information.
Table 1. Estimated parameters and SE's from two models, with coefficients $\beta_1$ for Trt and $\beta_2$ for Trt*bonCtr, in separate study centres. ‘*’ denotes structural 0.
Models 1a and 2a are fixed-effects models for many small-sample datasets. Fitting them separately in each centre reflects an assumed lack of connection across centres j for the coefficients of covariates not involving treatment. In Model 1a, the standard errors of the treatment coefficient are roughly 0.2 in each centre, and the standardised coefficient varies widely across centres, significant (two-sided) in only 4 centres. In Model 2a, the standardised coefficient for Trt is significant in 8 centres, and that for Trt*bonCtr ranges up to 2.8, with only 2 significant. A unified Model 1a, assumed to hold in all centres with common β, yields an estimate with SE = 0.069, and the unified Model 2a with shared 2-vector β yields an estimate with SE = 0.068. However, since the treatment-coefficient estimates vary considerably across centres in separately fitted models, the unified model should accommodate the differences through a random-effects model like (5b), with independent random Trt effects across clusters. Random treatment effects were previously considered by both Gray (1994) and DeMissie (2009). The unified model (5b), fitted by SAS proc nlmixed to allow random Trt effects by centre, results in the following estimates and standard errors:

The unified models 1b and 2b agree in their estimate of the Trt coefficient and differ only slightly from the unified models 1a and 2a, but the unified analysis disagrees markedly with the meta-analysis. The unified estimate of the Trt*bonCtr coefficient appears significant both in the unified model and in the meta-analysis. The Trt*bonCtr coefficient can be seen to be very noisily fitted in the individual clusters, and perhaps should also be treated with a random effect in the unified model.
The important point of this example is that a properly specified combined-sample analysis can be expected, under the kind of mixed-effect linear model described here, to give substantially the same results as a meta-analysis under a model with adequately detailed interactions and random effects (Slud & DeMissie, 2011). In this setting, as in most real meta-analyses, the separate samples (here, the analyses at individual centres) are too small to identify interactions such as treatment-by-covariate interactions which are clearly significant in unified-model analysis. The operation of meta-analysis, which functionally combines the separate-sample coefficient estimates, shows through goodness-of-fit assessments the necessity of including interaction terms (such as treatment-by-covariate) and random effects where separate-sample coefficients vary considerably.
3. Results from large-sample theory
Throughout the rest of this paper, we assume standard regularity and nondegeneracy conditions about parameters (as in Bickel & Doksum, 2007, Theorem 6.2.2, and Van der Vaart, 1998, Theorem 5.39), which for finite-dimensional $\lambda_j$ imply the following. First, joint maximum likelihood (ML) estimators $(\hat\beta_j,\hat\lambda_j)$ for $(\beta,\lambda_j)$ exist and are consistent and locally uniquely determined as solutions of the score or likelihood equations in the jth sample, and

(6)   $\sqrt{n_j}\,\big((\hat\beta_j,\hat\lambda_j) - (\beta,\lambda_j)\big) \;\xrightarrow{\ d\ }\; N\big(0,\; \mathcal{I}_j^{-1}\big).$

The variance $\mathcal{I}_j^{-1}$ is the inverse of the per-observation Fisher information matrix $\mathcal{I}_j$ for the jth sample. The upper-left $p\times p$ block of $\mathcal{I}_j^{-1}$ is denoted by $V_j$, and is the smallest possible asymptotic variance (in the sense of the matrix ordering in which $K \le L$ if and only if L−K is nonnegative definite) within the class of all ‘regular’ estimators (Bickel, Klaassen, Ritov, & Wellner, 1998, pp. 17–21, or Van der Vaart, 1998, p. 115), which includes (again subject to regularity conditions) all estimators defined as solutions of estimating equations

(7)   $\sum_{i=1}^{n_j} \psi_j(X_{ji},\,\beta,\,\lambda_j) \;=\; 0,$

where the functions $\psi_j$ are known and nonrandom vector-valued estimating-function summands whose form depends on study j but not on subject-index i. Any efficient regular estimator of β defined from the jth sample, i.e., one for which the asymptotic variance is no larger than $V_j$, differs from the ML estimator of β by a remainder of order smaller than $n_j^{-1/2}$ in probability. (This assertion follows from the Hájek-LeCam convolution theorem, Van der Vaart, 1998, p. 115.) This notion of efficiency can be extended also to the case where the nuisance parameters $\lambda_j$ are infinite-dimensional, via a notion of ‘first-order optimality’ defined as semiparametric or asymptotic efficiency of regular estimators (Van der Vaart, 1998, Section 25.3). Although the information bounds $V_j$ do depend on the nuisance parameters $\lambda_j$, we suppress that dependence to keep the notation as simple as possible.
Suppose that the values of an efficient estimator $\hat\beta_j$ and a consistent estimator $\hat V_j$ of $V_j$ are reported from the separate analysis of the jth sample in order to obtain confidence intervals for components or linear combinations of the coordinates of β. The main question of interest in this paper is: under what circumstances will there exist a (smooth) function g of the summary statistics $(\hat\beta_j,\hat V_j)_{j=1,\dots,k}$ such that $g(\hat\beta_1,\hat V_1,\dots,\hat\beta_k,\hat V_k)$ is efficient? In the case where the separate-sample estimators $\hat\beta_j$ are assumed efficient, this means that the asymptotic variance of the combined estimator is the inverse of the per-observation Fisher information for β based on the combined sample of size $n = n_1 + \dots + n_k$. In the case where the estimators $\hat\beta_j$ are obtained from estimating equations as in (4), efficiency means that the combined estimator has asymptotic variance no larger than that of the best combined-sample estimating-equation estimator obtained from a matrix-weighted linear combination of the estimating functions $\psi_j$.

In the next part of this Section, leading up to paragraphs (I)–(VI), we develop notions of single- and combined-sample Fisher information about the parameter β. Recall that the per-observation Fisher information about a parameter θ (here, β or a vector φ combining all of the free parameters in $(\beta,\lambda_1,\dots,\lambda_k)$) is a matrix expectation defined as

$\mathcal{I}(\theta) \;=\; n^{-1}\, E\big[\, S_\theta\, S_\theta^{tr} \,\big],$

where $S_\theta$ is a score statistic obtained as the column of partial derivatives of the log-likelihood of the data-sample of size n with respect to the parameters. (Here and below, we also use the convenient notation $v^{\otimes 2} = v\,v^{tr}$ for any vector v.) In this section, we express statistical properties of estimators in terms of the linear algebra of information matrices.
To begin, we review concepts and develop formulas related to separate-sample Fisher information about β. Denote the score statistic for a single observation in the jth sample with respect to β by $S_{j,\beta}$ and with respect to $\lambda_j$ by $S_{j,\lambda}$. The information matrices per observation for the separate-sample parameters are

(8)   $\mathcal{I}_j \;=\; E\begin{pmatrix} S_{j,\beta}^{\otimes 2} & S_{j,\beta}\,S_{j,\lambda}^{tr} \\ S_{j,\lambda}\,S_{j,\beta}^{tr} & S_{j,\lambda}^{\otimes 2} \end{pmatrix} \;=\; \begin{pmatrix} \mathcal{I}_{j,\beta\beta} & \mathcal{I}_{j,\beta\lambda} \\ \mathcal{I}_{j,\lambda\beta} & \mathcal{I}_{j,\lambda\lambda} \end{pmatrix}.$

In the block-decomposition on the right-hand side of (8), $\mathcal{I}_{j,\beta\beta}$ is $p\times p$. The information about β in the jth sample is defined as the inverse of the upper-left $p\times p$ block of $\mathcal{I}_j^{-1}$ and is given by $\mathcal{I}_{j,\beta\beta} - \mathcal{I}_{j,\beta\lambda}\,\mathcal{I}_{j,\lambda\lambda}^{-1}\,\mathcal{I}_{j,\lambda\beta}$. This linear-algebra fact about inverses of block-decomposed matrices can be found in virtually every book about regression (cf. Draper & Smith, 1981, Appendix 2A), and can be interpreted as the expression of the residual variance of $S_{j,\beta}$ after regression on $S_{j,\lambda}$. This interpretation is developed in terms of associated Linear Regression theory by Draper and Smith (1981, Section 2.6), or more abstractly in terms of projections by Rao (1973, Sections 4.a.1–2 and 4.a.6).
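The block-inverse fact is easy to verify numerically. The following sketch (with arbitrary illustrative entries, scalar β and scalar λ) checks that the inverse of the upper-left entry of $\mathcal{I}^{-1}$ equals $\mathcal{I}_{\beta\beta} - \mathcal{I}_{\beta\lambda}\,\mathcal{I}_{\lambda\lambda}^{-1}\,\mathcal{I}_{\lambda\beta}$:

```python
# Numeric check of the cited linear-algebra fact for a 2x2 information
# matrix with scalar beta and scalar nuisance lambda: the reciprocal of the
# upper-left entry of I^{-1} equals I_bb - I_bl * I_ll^{-1} * I_lb.
# The matrix entries are arbitrary illustrative values.
I_bb, I_bl, I_ll = 5.0, 1.2, 3.0

# effective (residual) information about beta after "regressing out" lambda
I_eff = I_bb - I_bl * I_bl / I_ll

# invert the full 2x2 information matrix and take its upper-left entry
det = I_bb * I_ll - I_bl * I_bl
var_beta = I_ll / det          # (I^{-1})[0, 0]
print(I_eff, 1.0 / var_beta)   # the two agree
```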
Since we allow the possibility that some of the coordinates of the parameter vectors $(\beta,\lambda_j)$ are shared or constrained across the different samples, let φ denote a parameter vector, of dimension d, consisting of all free parameters among $(\beta,\lambda_1,\dots,\lambda_k)$, so that all of the vectors $(\beta,\lambda_j)$ are smooth functions $h_j(\varphi)$ of φ. Then we can write all of the densities $f_j(x;\beta,\lambda_j) = f_j(x;\,h_j(\varphi))$, where $f_j$ is smooth in all components of its parameter arguments. Denote the score statistic for the sample j with respect to the parameter vector φ as $S_{j,\varphi}$. Then, in terms of the Jacobian matrix $D_j = \partial h_j/\partial\varphi^{tr}$ of the vector-valued function $h_j$ with respect to φ,

$S_{j,\varphi} \;=\; D_j^{tr}\begin{pmatrix} S_{j,\beta} \\ S_{j,\lambda} \end{pmatrix}.$

By independence of the samples, and therefore of the score-statistics $S_{j,\varphi}$ for different j, the per-observation Fisher information for the combined parameter φ in the combined sample is the symmetric $d\times d$ matrix

(9)   $\mathcal{I}(\varphi) \;=\; \sum_{j=1}^k c_j\, D_j^{tr}\,\mathcal{I}_j\, D_j,$

where

(10)   $c_j \;=\; n_j/n, \qquad n \;=\; n_1 + \dots + n_k.$

The corresponding complete-sample information about β is then expressed either as

(11)   $\mathcal{I}_{\beta\beta}(\varphi) \;-\; \mathcal{I}_{\beta\eta}(\varphi)\,\mathcal{I}_{\eta\eta}(\varphi)^{-1}\,\mathcal{I}_{\eta\beta}(\varphi),$

where η denotes the coordinates of φ other than β, in case the $\mathcal{I}_{\eta\eta}(\varphi)$ block is invertible, or more generally, via the regression-residual interpretation analogous to that of (8) above, as

(12)   $\inf_K\; E\big[\,(S_\beta - K\,S_\eta)^{\otimes 2}\,\big],$

where K in the infimum is an arbitrary $p\times(d-p)$ matrix, and inf is understood in the sense of nonnegative-definite matrix ordering.
We now present a series of propositions relating combined-sample information and variance to those of the separate samples.
(I) Kagan and Rao (2003, Lemma 2) establish, for finite-dimensional parameters, the superadditivity of information

(13)   $\mathcal{I}_\beta^{\mathrm{comb}} \;\ge\; \sum_{j=1}^k c_j\,\mathcal{I}_\beta^{(j)},$

where $\mathcal{I}_\beta^{\mathrm{comb}}$ is the combined-sample per-observation information about β and $\mathcal{I}_\beta^{(j)} = V_j^{-1}$ are the separate-sample informations. Since inverse information is equal to the smallest attainable asymptotic variance for (‘regular’) estimators under the conditions assumed here, the inequality (13) can be interpreted to say that the best possible combined-sample variance is at most the asymptotic variance of the right-hand side of (3).
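Superadditivity, and the possibility of strict inequality, can be seen numerically when a scalar nuisance parameter is shared between two samples. The following sketch (arbitrary illustrative information matrices, equal sample sizes so the $c_j$ weights drop out) compares the combined effective information about β with the sum of the separate effective informations:

```python
# Illustration of superadditivity (13) when a scalar nuisance parameter
# lambda is *shared* between two samples: the combined effective information
# about beta can strictly exceed the sum of the separate effective
# informations. Information-matrix entries are arbitrary illustrative values.

def effective_info(I_bb, I_bl, I_ll):
    """Effective information about beta: I_bb - I_bl * I_ll^{-1} * I_lb."""
    return I_bb - I_bl * I_bl / I_ll

# per-observation information matrices (I_bb, I_bl, I_ll) for (beta, lambda)
I1 = (5.0, 2.0, 4.0)
I2 = (3.0, -1.0, 2.0)

separate = effective_info(*I1) + effective_info(*I2)
combined = effective_info(I1[0] + I2[0], I1[1] + I2[1], I1[2] + I2[2])
print(separate, combined)  # combined strictly exceeds the sum
```

The strict gap arises because the two samples' score components for λ partially cancel when pooled; if either cross-information entry were 0 (the adaptive case of (IV) below), the two quantities would agree.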
(II) The last statement can be recast, as in Janicki (2009, Theorem 6.1.2), to say that additivity of information, or equality in (13), holds if and only if the estimator $\tilde\beta$ in (3) has asymptotic variance $(\mathcal{I}_\beta^{\mathrm{comb}})^{-1}/n$. In a setting without nuisance parameters $\lambda_j$, Janicki (2009, Theorem 6.1.2) shows, for estimators $\hat\beta_j$ defined by estimating equations, that an optimal combined estimator is obtained either as the weighted linear combination (3) of estimators, or as the solution of the linear combination of jth-sample estimating equations

(14)   $\sum_{j=1}^k B_j\, \sum_{i=1}^{n_j} \psi_j(X_{ji},\,\beta) \;=\; 0,$

where $\psi_j$ are any estimating functions such that $E[\psi_j(X_{j1},\beta)] = 0$ and $E[\partial\psi_j(X_{j1},\beta)/\partial\beta^{tr}]$ is nonsingular, and the $p\times p$ matrices $B_j$ are defined by

$B_j \;=\; \big(E[\partial\psi_j(X_{j1},\beta)/\partial\beta^{tr}]\big)^{tr}\, \big(E[\psi_j(X_{j1},\beta)^{\otimes 2}]\big)^{-1}.$

In these expressions, functions are evaluated at the same β governing the data $X_{ji}$.
(III) For an efficient regular combined-sample estimator of the form $g(\hat\beta_1,\hat V_1,\dots,\hat\beta_k,\hat V_k)$ with g continuously differentiable, Taylor linearisation of g in terms of its arguments and their Jacobians (the ‘Delta Method’) shows that

(15)   $\sqrt{n}\,\big(g(\hat\beta_1,\hat V_1,\dots,\hat\beta_k,\hat V_k) - \beta\big) \;=\; \sum_{j=1}^k G_j\,\sqrt{n_j}\,(\hat\beta_j - \beta) \;+\; o_P(1)$

for suitable nonrandom matrices $G_j$; i.e., when the $n_j$ are all of order n, the combined estimator can be represented as a linear combination of the normalised centred ML estimators $\sqrt{n_j}(\hat\beta_j - \beta)$, up to a remainder converging to 0 in probability. Since (3) (with variances $V_j$ now all assumed to exist) is shown in Appendix 1 to be the unique optimal matrix-linear-combination estimator in the sense of minimal asymptotic variance, up to $o_P(n^{-1/2})$ remainders, it follows that the efficient combined estimator is equal to the linear combination (3) plus a remainder of smaller order than $n^{-1/2}$ in probability.
(IV) There is one more case where the additivity of information (equality in (13)) is obvious, namely the case where all of the cross-information blocks $\mathcal{I}_{j,\beta\lambda}$ are 0. This case is called adaptive because, by (11) or (12), the jth sample information about β is exactly the same as if $\lambda_j$ were known in advance. But then the combined-sample cross-information block between β and the nuisance coordinates is also 0, by (9) and (10), while (12) with K = 0 implies that the combined-sample information about β is $\sum_{j=1}^k c_j\,\mathcal{I}_{j,\beta\beta}$. Then equality must hold in (13).
(V) We now come to the main result of this paper, which says that when the nuisance parameters in the separate samples are distinct and unrelated, (13) becomes an equality and an efficient estimator of β of the form (3) exists. In the multi-sample context described above, under standard regularity conditions, if all of the nuisance parameters $\lambda_j$ vary freely, unconstrained by one another or by β, so that $\varphi = (\beta,\lambda_1,\dots,\lambda_k)$ and $d = p + \sum_{j=1}^k \dim(\lambda_j)$, then equality holds in (13). The restriction to unconstrained nuisance parameters is essential in this statement, as will be shown by example in Section 4.

There are a few different proofs of (V) in different cases. When the separate-sample estimators of the same shared parameter vector β are Maximum Likelihood estimators in the presence of the nuisance parameters $\lambda_j$, Lin and Zeng (2010) observe that the maximum profile likelihood estimator for β (i.e., the maximiser of the log-likelihood partially maximised over nuisance parameters) in each of the separate studies and in the combined study is efficient, and that the log profile likelihood for the combined study is simply the sum of the log profile likelihoods of the separate studies. Thus the combined-study information is the sum of the separate-study informations. That is the complete proof in the MLE case. Since any separate-study efficient regular estimators $\hat\beta_j$ of β are asymptotically equivalent in probability to the corresponding MLEs or profile-likelihood maximisers, according to the parametric Hájek-LeCam convolution theorem, this proof establishes the same result for the combination of any efficient regular separate-study estimators. Further technicalities would have been needed to carry this proof idea forward to the case of freely varying infinite-dimensional nuisance parameters $\lambda_j$. Lin and Zeng (2010) did not do that, but our Appendix 3 does. In another direction of generalisation, when the separate-sample estimators $\hat\beta_j$ estimate not β but projections $P_j\beta$ (with common null-space 0), the profile-likelihood proof sketched above yields the same result when coupled with the verification in Appendix 1 that the combined-sample MLE has the asymptotic variance attained by (3).
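The profile-likelihood argument in (V) can be summarised in one display (a sketch, with $\ell_j$ denoting the jth-study log-likelihood): because the $\lambda_j$ are unconstrained, the inner maximisation separates over studies,

```latex
\ell_j^{\mathrm{prof}}(\beta) \;=\; \max_{\lambda_j}\, \ell_j(\beta,\lambda_j),
\qquad
\ell_{\mathrm{comb}}^{\mathrm{prof}}(\beta)
 \;=\; \max_{\lambda_1,\dots,\lambda_k}\; \sum_{j=1}^k \ell_j(\beta,\lambda_j)
 \;=\; \sum_{j=1}^k \ell_j^{\mathrm{prof}}(\beta),
```

and differentiating twice in β then gives additivity of the profile informations, which is exactly equality in (13).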
(VI) The same result as in (V), the optimality of estimator (3), also holds when the separate-study estimators $\hat\beta_j$ are obtained through the solution of M-estimating equations (4), when the notion of optimality is suitably clarified. Here the estimating functions $\psi_j$, with values of the same dimension as the unknown parameters $(\beta,\lambda_j)$, are assumed to satisfy standard smoothness with respect to parameters and other regularity conditions, including that $E[\psi_j(X_{j1},\beta,\lambda_j)] = 0$, where expectations are taken and the integrand functions evaluated at the same true parameters $(\beta,\lambda_j)$, and where the matrices $E[\partial\psi_j(X_{j1},\beta,\lambda_j)/\partial(\beta,\lambda_j)^{tr}]$ are nonsingular. The solution of (4) is locally unique, in the vicinity of the true $(\beta,\lambda_j)$, with probability converging to 1 for large n, by general estimating-equation theory (Van der Vaart, 1998, Chapter 5, or Janicki, 2009, Chapter 2), and the estimators $(\hat\beta_j,\hat\lambda_j)$ are regular and asymptotically linear, with asymptotic distributions

$\sqrt{n_j}\,\big((\hat\beta_j,\hat\lambda_j) - (\beta,\lambda_j)\big) \;\xrightarrow{\ d\ }\; N(0,\,\Sigma_j),$

where $\Sigma_j$ is the usual sandwich-form asymptotic variance of the M-estimator. Beyond this point, we maintain the same notations as throughout Section 3, namely that $V_j$ is the upper-left block of $\Sigma_j$ and $\mathcal{I}_\beta^{(j)} = V_j^{-1}$, although these inverse variance matrices are no longer Fisher information.

From now on, assume as in paragraph (V) that the parameters $\lambda_j$ range freely and are not constrained by one another, so that the combined-sample unknown parameters are $\varphi = (\beta,\lambda_1,\dots,\lambda_k)$. The first step in extending (V) is to extend the optimality result in (II) by viewing the separate-study estimating equations all as estimating equations for the combined parameters φ. Let

$\Psi_j(x,\varphi) \;=\; C_j\,\psi_j(x,\,\beta,\,\lambda_j),$

where $C_j$ is the $d\times(p+\dim(\lambda_j))$ matrix with the block decomposition containing 0's everywhere except for the identity matrix in the upper-left $p\times p$ block and another $\dim(\lambda_j)\times\dim(\lambda_j)$ identity matrix in the submatrix with consecutive row-indices from $p + \sum_{l<j}\dim(\lambda_l) + 1$ through $p + \sum_{l\le j}\dim(\lambda_l)$ and consecutive column-indices from p+1 through $p + \dim(\lambda_j)$.
Now consider the class of all combined-sample estimating equations defined by

(16)   $\sum_{j=1}^k \sum_{i=1}^{n_j} W_j(\varphi)\,\Psi_j(X_{ji},\,\varphi) \;=\; 0,$

where the $d\times d$ weight matrices $W_j(\varphi)$ are continuously differentiable functions of their arguments and satisfy the condition

(17)

The same arguments as in standard estimating-equation theory show that, for large n, the solution of (16) is, with probability approaching 1, locally unique for φ in a non-shrinking neighbourhood of the true values, and defines $\sqrt{n}$-consistent asymptotically normal estimators of φ with asymptotic variance given through

(18)

Then a slight reworking of the proof of Janicki (2009, Theorem 6.1.2), using a generalised-inverse identity, shows that the smallest asymptotic variance over the combined estimating equations (16), in the sense of positive-definite ordering of matrices, is achieved at a particular choice of the weight matrices $W_j$, and we fix this choice for $W_j$ in equations (16)–(18) from now on.
Denote by $\hat\varphi$ the solution of the estimating equation (16) with the optimal weights, with equations (17) and (18) giving

(19)

Now we return to our objective of comparing the combined-sample information-analogue for β, which is the inverse of the combined-sample asymptotic variance of its optimal estimating-equation estimator, with the information-analogue arising from equation (3). The asymptotic variance matrix for $\hat\varphi$ is given through (18), so the combined-sample information-analogue for β is obtained from its upper-left block. To find it as in equations (9) and (11), we block-decompose the matrix in (19), where the upper-left block is $p\times p$ and the lower-right is $(d-p)\times(d-p)$.
The upper-left block of the matrix in (19) is obtained, by definition of the matrices $C_j$, as the weighted sum of the upper-left blocks of the per-sample matrices, that is, of the blocks corresponding to β. Again by definition of the $C_j$, the off-diagonal block linking β to the nuisance coordinates is obtained from (19) sample by sample, while the nuisance-parameter block is block-diagonal when decomposed in successive blocks of sizes $\dim(\lambda_1),\dots,\dim(\lambda_k)$, with jth diagonal block depending only on the jth sample, for $j = 1,\dots,k$. Therefore,

(20)

Since the inverse asymptotic variance of the weighted linear combination (3) is $\sum_{j=1}^k c_j\,V_j^{-1}$, we have proved the desired result: the asymptotic variance of the inverse-variance weighted linear combination (3) is the same as the asymptotic variance of the estimator derived from the optimal weighted estimating function (16) of this section.
Remark on the Proof of (VI). The proof relies on the M-estimating-function property of both in achieving the specific formula in (17) for , because that is the setting in which the nuisance-parameter block of the combined-sample information-analogue is block-diagonal. The application of (VI) is to meta-analyses where only the structural-parameter estimates and their estimated variances are reported in the separate studies. The fairly dramatic conclusion, however, is that in the setting of M-estimating functions with unconstrained 's, even if all estimates and variances for the nuisance parameters were also available for a combined analysis, the asymptotic variances of the combined estimates of the structural parameters would be no better.
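The inverse-variance weighted combination of separate-sample estimates and variances discussed above can be sketched numerically. The Python fragment below is an illustration only, not the authors' code: the function name `combine_estimates` and the two studies' numbers are hypothetical. It forms the combined estimator (Σ_j V_j⁻¹)⁻¹ Σ_j V_j⁻¹ β̂_j of equation (3) together with its asymptotic variance (Σ_j V_j⁻¹)⁻¹.

```python
import numpy as np

def combine_estimates(estimates, variances):
    """Inverse-variance weighted combination of separate-sample estimators.

    estimates: list of p-vectors beta_hat_j; variances: list of p x p
    asymptotic variance matrices V_j.  Returns the combined estimate
    (sum_j V_j^{-1})^{-1} sum_j V_j^{-1} beta_hat_j and its variance.
    """
    precisions = [np.linalg.inv(V) for V in variances]
    total_precision = sum(precisions)        # information adds across samples
    V_comb = np.linalg.inv(total_precision)  # combined asymptotic variance
    beta_comb = V_comb @ sum(P @ b for P, b in zip(precisions, estimates))
    return beta_comb, V_comb

# Two hypothetical studies reporting the same 2-dimensional structural parameter:
b1, V1 = np.array([1.0, 0.5]), np.array([[0.04, 0.01], [0.01, 0.09]])
b2, V2 = np.array([1.2, 0.4]), np.array([[0.02, 0.00], [0.00, 0.16]])
beta, V = combine_estimates([b1, b2], [V1, V2])
```

The combined precision is the sum of the separate precisions, so the combined variance is no larger (in the positive-definite ordering) than either study's variance alone.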
4. Examples where information additivity fails
It remains to show that the asymptotic optimality in the combined sample of weighted linear combinations of separate-sample estimators cannot persist in general when the nuisance parameters in the two samples are coupled, i.e., partially shared or related through common constraints. The matrix weights in the combined estimator (3) are optimal in the sense of minimising variance, and they lead to a combined estimator with asymptotic information . So when the lower bound on information is not attained at , and by (13) is smaller with respect to the positive-definite ordering, linear combinations cannot be optimal. An example of a coupled parameterisation is the two-sample location-scale model where is finite-dimensional and, for a common location parameter β and a positive unknown scalar σ, (21)

Thus in this example , and we replace by λ in the notation below. In the case of finite-dimensional λ, a straightforward calculation shows that the per-observation combined-sample Fisher information matrix has the form (22) with the expressions for given in terms of in Appendix 2.
The goal of this example is to show that in general the equality in (13) does not hold, by showing that the reciprocal of the upper-left element of is not equal to the sum of the reciprocals of the upper-left elements of the inverses, respectively, of (23) and of (24). These two displayed information matrices are the sample-1 and sample-2 total-information matrices divided by the total sample size n.
We supply below a numerical comparison of the single-sample and combined-sample information bounds for estimating β. Before doing so, we indicate a class of densities within this framework where separate-sample informations do add to give the combined-sample information.
Suppose, in the location-scale two-sample setting just described, that each density is symmetric (i.e., an even function) and that its derivative with respect to λ is also even. Then it is easy to check that is odd, and inspection of the integral formulas in Appendix 2 shows that and . In that case, estimation of β is adaptive as in (IV) (just as efficient without as with knowledge of the nuisance parameters), with separate-sample per-observation information numbers for β respectively and , and combined-sample information . The t distributions form a special case of this example. This shows that, within the two-sample location-scale setting, we must look to skewed density classes to find examples where strict inequality holds in (13).
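The vanishing of the location-scale cross-information for symmetric densities, which is what makes estimation of β adaptive, can be checked numerically in a simple special case. The Python sketch below is an illustration under assumed choices (the standard normal density and an arbitrary integration grid), not the authors' R computations.

```python
import numpy as np

# For a symmetric density f, the location score is odd and the scale score is
# even, so the cross-information E[l_beta * l_sigma] vanishes.  Check this for
# the standard normal: location score x, scale score x**2 - 1.
x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]
phi = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

score_loc = x             # d/d beta of log phi(x - beta), evaluated at beta = 0
score_scale = x**2 - 1    # d/d sigma of log[phi(x/sigma)/sigma] at sigma = 1

I_cross = np.sum(score_loc * score_scale * phi) * dx  # should be 0 (odd integrand)
I_loc = np.sum(score_loc**2 * phi) * dx               # per-observation info for beta
```

Here `I_cross` vanishes up to quadrature error, and `I_loc` equals the full-information value 1, so no information about the location is lost to the unknown scale.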
So we turn to two-sample location-family examples (21) in which is chosen to be asymmetric. One such density family is the skew-normal of Azzalini (1985): In this family, in the special case where the true value of λ is 0, integration shows that the combined-sample per-observation information matrix is which is nonsingular if and only if , and in that case the combined-sample information for β is . However, this example is remarkable in that the single-sample information matrices are and are both singular, and the single-sample information numbers for β in the presence of nuisance parameters tend to 0 as λ decreases to 0.
Non-zero values of the parameter λ further illustrate the phenomenon of non-additivity of separate-sample information bounds for estimating the common location parameter β when the nuisance parameters in the two samples are related. Table 2 shows several numerically calculated values (using the function integrate in R; R Development Core Team, 2017) of the separate-sample and combined information for estimating a location parameter, all divided by the total sample size n. The table refers to the skew-normal two-sample location-scale problem, where λ is the skew parameter, σ the scale parameter, and c=0.5 the proportion of observations in sample 1. The point of these examples is that the last two rows do not sum to the combined-sample information , although this is nearly true when . In the case previously discussed, with λ = 0, both of the separate-sample information entries were 0, while the combined-sample information was .
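The singularity of the single-sample information matrices at λ = 0 can also be verified numerically. The Python sketch below is an illustration (the integration grid is arbitrary): it computes the 2 x 2 per-observation information matrix for (location, skewness) in the skew-normal family at the true values β = 0, λ = 0, where the two scores are proportional, so the determinant is essentially 0.

```python
import numpy as np

# Skew-normal density f(x) = 2 * phi(x) * Phi(lambda * x); at the true values
# beta = 0, lambda = 0 it reduces to the standard normal phi, and the scores
# for beta and lambda are proportional, making the information matrix singular.
x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]
phi = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

score_beta = x                      # d/d beta of log f at (beta, lambda) = (0, 0)
score_lam = np.sqrt(2 / np.pi) * x  # d/d lambda of log f at lambda = 0
scores = np.vstack([score_beta, score_lam])

# Per-observation information matrix E[score score^T], by numerical integration:
I = (scores * phi) @ scores.T * dx
det_I = np.linalg.det(I)  # essentially 0: the two scores are proportional
```

The diagonal entries come out as 1 and 2/π, the off-diagonal as √(2/π), so the determinant vanishes, matching the singularity noted in the text.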
Table 2. Information on β in combined and separate skew-normal samples. In all cases, c=0.5.
Examples like these have practical consequences. In paragraph (V) above, suppose that γ is a subvector of the parameter β, where . Even if additivity of information (i.e., equality in (13)) holds for two-sample inference about β, it does not necessarily hold for inference about γ based on the same data. As an example, consider the same two-sample skew-normal location-scale problem just presented, but now let the common parameter of interest be , let the sample-1 nuisance parameter be null (i.e., absent), and replace by . Apart from the change in notation, let the single-sample densities be the same as before. Then the nuisance parameter varies separately and freely over the two samples, and paragraph (V) implies that information additivity does hold for . But we have seen in the discussion and in Table 2 above that inequality (13) is strict for the sub-vector β.
5. Discussion: meta-analysis and shared parameters
It is common in statistical applications to model separately collected datasets with shared parameters. Meta-analysis is one approach, within biomedical or social science, to pooling information across separate centres or data collections. Parameters like treatment effects, the ones important for practical decisions, are most often shared across sub-samples, but common nuisance parameters may also arise naturally when cross-classifying variables are adjusted away using models. We have seen in this paper that there is no obstacle to the efficient weighted linear combination of separate-sample ML or M-estimators when nuisance parameters are either absent or vary freely and without constraint across samples, but that otherwise efficient functional combination may be impossible. The clear message for statistical practice is that, whenever possible, separate studies which might be combined should report the estimator, along with its estimated variance, of a parameter vector which includes both β and whichever components of might be shared across the models for the separate samples.
Acknowledgments
The authors gratefully acknowledge the Eastern Cooperative Oncology Group as the source for the ECOG EST 1582 dataset, and the suggestion of a referee to expand our treatment of (V) to estimating equations.
Disclosure statement
This paper is released to inform interested parties of research and to encourage discussion. Any views expressed are the authors' and not necessarily those of the U.S. Census Bureau.
Additional information
Notes on contributors
Eric Slud
Eric Slud is Professor in the Statistics Program within the Mathematics Department of the University of Maryland, College Park, and is Area Chief for Mathematical Statistics in the Center for Statistical Research and Methodology of the US Census Bureau.
Ilia Vonta
Ilia Vonta is an Associate Professor at the Department of Mathematics of the School of Applied Mathematical and Physical Sciences of the National Technical University of Athens, Athens, Greece and an Associate Professor-Tutor of the Hellenic Open University.
Abram Kagan
Abram Kagan is Professor in the Statistics Program within the Mathematics Department of the University of Maryland, College Park.
References
- Azzalini, A. (1985). A class of distributions which includes the normal ones. Scandinavian Journal of Statistics, 12, 171–178.
- Bickel, P., & Doksum, K. (2007). Mathematical statistics (2nd ed., updated printing, Vol. I). Upper Saddle River, NJ: Pearson Prentice Hall.
- Bickel, P., Klaassen, C., Ritov, Y., & Wellner, J. (1998). Efficient and adaptive estimation for semiparametric models. Berlin: Springer.
- Carlin, B., & Louis, T. (2008). Bayesian methods for data analysis (3rd ed.). Boca Raton, FL: Chapman & Hall/CRC.
- DeMissie, M. (2009). Investigating center effects in a multi-center clinical trial study using a parametric proportional hazards meta-analysis model (MS Thesis). Univ. of Maryland Statistics Program, DRUM Digital Repository. Retrieved from https://drum.lib.umd.edu/handle/1903/9588
- Draper, N., & Smith, H. (1981). Applied regression analysis (2nd ed.). New York: Wiley.
- Efron, B. (1996). Empirical Bayes methods for combining likelihoods. Journal of the American Statistical Association, 91, 538–550. doi: 10.1080/01621459.1996.10476919
- Gray, R. (1994). A Bayesian analysis of institutional effects in a multicenter cancer clinical trial. Biometrics, 50, 244–253. doi: 10.2307/2533216
- Hartung, J., Knapp, G., & Sinha, B. (2008). Statistical meta-analysis with applications. Hoboken, NJ: Wiley.
- Janicki, R. (2009). Statistical inference based on estimating functions in exact and misspecified models (Ph.D. thesis). Univ. of Maryland Statistics Program, DRUM Digital Repository. Retrieved from https://drum.lib.umd.edu/handle/1903/9690
- Kagan, A., & Rao, C. R. (2003). Some properties and applications of the efficient Fisher score. Journal of Statistical Planning & Inference, 116, 343–352. doi: 10.1016/S0378-3758(02)00351-8
- Lin, D. Y., & Zeng, D. (2010). On the relative efficiency of using summary statistics versus individual-level data in meta-analysis. Biometrika, 97, 321–332. doi: 10.1093/biomet/asq006
- Liu, D., Liu, R., & Xie, M. (2015). Multivariate meta-analysis of heterogeneous studies using only summary statistics: Efficiency and robustness. Journal of the American Statistical Association, 110, 326–340. doi: 10.1080/01621459.2014.899235
- Olkin, I., & Sampson, A. (1998). Comparison of meta-analysis versus analysis of variance of individual patient data. Biometrics, 54, 347–352. doi: 10.2307/2534018
- R Development Core Team (2017). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org, ISBN 3-900051-07-0.
- Rao, C. R. (1973). Linear statistical inference (2nd ed.). New York: Wiley.
- Rao, J. N. K., & Molina, I. (2015). Small area estimation (2nd ed.). Hoboken, NJ: Wiley.
- Slud, E., & DeMissie, M. (2011). Validity of regression meta-analyses versus pooled analyses of mixed-effect linear models. Mathematics in Engineering, Science and Aerospace, 2(4), 251–266.
- Tsiatis, A. (2006). Semiparametric theory and missing data. New York: Springer.
- Van der Vaart, A. (1998). Asymptotic statistics. Cambridge: Cambridge University Press.
Appendices
Appendix 1. Optimality of the linear combination in equation (3)
We are interested in regular estimators of based on separate-sample efficient estimators of the projected parameters , where are projections onto known subspaces of . The most important and well-studied case is the one in which the same parameter is estimated in all samples, and , but it turns out to be almost as easy to consider the general case where determines β, or equivalently, where . If is of dimension less than p, then the projection is understood as having ‘structural zeroes’ in place of the parameter components , such as would occur if the β entries were regression coefficients for the (independent, randomly generated) rows of a design matrix , and if in the jth sample the rows of were structurally constrained to lie in the range space of , so that is almost surely the zero matrix.
is almost surely the zero matrix. Further assume (without loss of generality) that the estimators
are constrained to fall in the range space of
, so that
, and that the asymptotic variance matrix
of
exists and is consistently estimated by
. Without loss of generality
, and
is an invertible linear operator from
to itself, with inverse which we denote by
. Then
is a generalised inverse of
(Rao, Citation1973, p. 24), which means that
More specifically, we have by definition of
,
(A1) Similar notations
and generalised-inverse properties also hold for the estimated variance matrices
. The projection operators
need not all have the same range, but we assume that they have no common nontrivial null-space.
We restrict attention to estimators of the linear form , as in paragraph (III) of Section 3, where are matrices that depend continuously on and the variance-estimators , and there is no loss of generality in imposing the structural-zero constraint . It follows from regularity that for large n the matrices must satisfy the constraint (A2) where denotes the identity matrix. This holds because regularity (Bickel et al., 1998, pp. 17–21, or Van der Vaart, 1998, p. 115) of the estimators and of implies that and each converge in distribution to the same respective limits whenever β is replaced by any element in a neighbourhood of extent about β.
We now prove that among estimators satisfying (A2) and , the smallest asymptotic variance matrix with respect to the positive-definite ordering is attained only when is asymptotically equivalent to (differs by from) (A3)
If denotes the large-sample in-probability limit of , for , then and the constraint (A2) immediately implies that . Then the asymptotic variance matrix of the estimator is, by independence of the samples , the limit of (A4)

The equality in (A4) follows immediately from the constraint , noting by (A1) and (A3) that which is equal to . The unique variance-minimising property of follows from the fact that the last matrix on line (A4) is nonnegative definite and is only when all . This proves the desired result.
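A quick numerical sanity check of this optimality argument is possible in a two-sample case. The Python sketch below uses hypothetical 2 x 2 variance matrices; it is an illustration, not the paper's derivation. Among weight matrices satisfying the analogue of the constraint (A2), C1 + C2 = I, the inverse-variance weights analogous to (A3) minimise the combined asymptotic variance in the positive-definite ordering.

```python
import numpy as np

# Hypothetical separate-sample asymptotic variance matrices:
V1 = np.array([[0.05, 0.01], [0.01, 0.08]])
V2 = np.array([[0.03, 0.00], [0.00, 0.20]])

def combo_variance(C1):
    C2 = np.eye(2) - C1                     # enforce the constraint C1 + C2 = I
    return C1 @ V1 @ C1.T + C2 @ V2 @ C2.T  # independence of the two samples

W = np.linalg.inv(np.linalg.inv(V1) + np.linalg.inv(V2))
C1_opt = W @ np.linalg.inv(V1)              # inverse-variance weight matrix
V_opt = combo_variance(C1_opt)              # equals W itself

rng = np.random.default_rng(0)
gaps = []
for _ in range(20):
    D = 0.1 * rng.standard_normal((2, 2))      # perturb the optimal weights
    diff = combo_variance(C1_opt + D) - V_opt  # nonnegative definite in theory
    gaps.append(np.linalg.eigvalsh(diff).min())
```

Every perturbation of the optimal weights increases the variance matrix, mirroring the fact that the excess-variance term in (A4) is nonnegative definite and vanishes only at the optimal weights.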
Appendix 2. Information matrices in Section 4
Section 4 presents a two-sample location-scale problem with respective densities and with sample sizes and with . The combined-sample information matrix (22) for the parameters is given as since is 0. Then straightforward integration yields the entries of as where denote derivatives of with respect to the first argument x, and all functions and derivatives are evaluated at and integrated with respect to the x variable over .

Similarly one calculates directly the entries of the symmetric matrix as From these integral expressions we derive the entries of (22) as along with the separate-sample information expressions (23) and (24).
Appendix 3. Infinite-dimensional nuisance parameters
This Appendix extends to infinite-dimensional nuisance parameters the result of paragraph (V) of Section 3 on the efficient functional combination of separate-sample MLEs into an efficient combined-sample estimator of a structural (finite-dimensional) parameter β shared across samples. The result states that such an efficient combination exists when the separate-sample nuisance parameters vary freely and independently of one another.
We maintain the notation of Section 3. The nuisance-parameter spaces may now be infinite-dimensional, . Assume that for all dimensional vectors and any dimensional vectors with at most finitely many components nonzero, there exist such that . We also assume standard regularity and nondegeneracy conditions on finite-dimensional submodels with parameters (as in Bickel & Doksum, 2007, Theorem 6.2.2, and Van der Vaart, 1998, Theorem 5.39). In particular, for each finite-dimensional affine subset (a set of the form where is a vector space) and , these conditions imply that the joint ML estimators of exist, are consistent and locally uniquely determined as solutions of the score or likelihood equations in the jth sample, and are asymptotically normal in the following sense. Let and This is a vector space whose dimension we have assumed to be finite. Let denote an orthonormal basis of , and define the operator by the rule Then is a vector-space isomorphism, and (C1)

The variance is the inverse of the Fisher information matrix for the jth sample and depends on the finite-dimensional parameter space (the submodel). The upper-left block of is denoted by . The infimum in the sense of the positive-definite matrix ordering ( if and only if L−K is nonnegative definite) can be shown to exist (Bickel et al., 1998, pp. 17–21, or Van der Vaart, 1998, p. 115).
As discussed in Section 3, another expression for is where ranges over all vectors in the dimensional space , the matrix K is arbitrary, we adopt the notation for any vector , and we denote by the gradient operator. The infimum in the displayed expression is actually achieved, and leads to the expression where now ranges over all (sufficiently small) vectors in differing from 0 in only m coordinates, for some finite m, and K ranges over all real matrices.
Any efficient regular estimator defined from , i.e., one for which the asymptotic variance of is no larger than , differs from any other such estimator by a remainder of order smaller than in probability (Hájek-LeCam convolution theorem; Van der Vaart, 1998, p. 115). This notion of semiparametric efficiency applies under general conditions to the case discussed here, in which the nuisance parameters are infinite-dimensional. Semiparametric efficient regular estimators are discussed in Van der Vaart (1998, Section 25.3). Although the information bounds depend on the nuisance parameters , we suppress that dependence to keep the notation as simple as possible.
We next define the notion of combined-sample Fisher information about β, based on all the data . Allowing the possibility that some of the coordinates of the parameter vectors are shared or constrained across different samples , let denote a maximal parameter vector consisting of all free parameters among , so that all of the vectors are well-behaved functions of . Then all of the densities can be written , where is smooth in all components of its parameter arguments. By independence of the samples , and by analogy with the displayed formula above for the separate-sample per-observation Fisher information matrices , there is a variationally defined combined-sample information matrix for β, (C2) where, for some finite , the vector ranges over all subvectors of the same dimension as which differ from in at most m places; K ranges over all real matrices; and the infimum is in the sense of the nonnegative-definite matrix ordering. By these variational considerations, the inequality (13) on superadditivity of information, presented in (I) of Section 3 in the finite-dimensional setting, continues to hold when is allowed to be infinite-dimensional.
We now come to the main result of this Appendix: when the nuisance parameters in the separate samples are unrelated, (13) becomes an equality. The proof arguments are in the spirit of Bickel et al. (1998), Van der Vaart (1998, Chapter 25), and Tsiatis (2006). It is a general phenomenon that independent samples with freely varying nuisance parameters allow a combined optimal estimator to be obtained by simple linear combination of the separate-sample optimal estimators.
Proposition A.1
Under the setting and assumptions above, suppose also that the possibly infinite-dimensional nuisance parameters vary freely, unconstrained by β or by each other. Assume further that, as the sample size n increases, all of the ratios have limits . Then the combined-sample semiparametric per-observation information bound is .
Proof.
We define several Hilbert subspaces of the space of square-integrable random p-vectors defined measurably from the combined samples , where P is the probability measure corresponding to the true parameter values . First, for , define the closed linear spaces of (assumed square-integrable) random p-vectors with coordinates spanned by the directional derivatives evaluated at t=0, where and is any vector with at most finitely many entries nonzero such that for all sufficiently small t. Denote by the span of all the spaces . Define to be the closed linear span of the subset of these vectors of directional derivatives for which , and The inner products on the spaces of p-vectors are . Since all elements of have mean 0, the inner product is the sum of componentwise covariances.
Let denote the linear projection within onto the closed linear space . The separate-sample information bounds can in general (Bickel et al., 1998; Tsiatis, 2006; Van der Vaart, 1998) be interpreted as the variances of the ‘efficient influence functions’, the projections of the respective p-vector scores onto the orthogonal complement of within , denoted , or (equivalently, because each sample is independent and identically distributed) The Hilbert space is called the tangent space of the finite-dimensional submodels. Because all of the parameters are assumed to vary freely and without constraint, can be expressed as the direct sum Since the subspaces are mutually orthogonal, being formed from independent random variables for different , it follows that for each , It follows immediately that the projection of in orthogonal to , which is the efficient influence function for the combined-sample problem, is precisely the same as The norm-squared, or variance, of this projection is, by the mutual independence of , equal to a sum of variance-covariance matrices of projections: Thus is asymptotically equal to , completing the proof.