SHORT COMMUNICATIONS

Combining estimators of a common parameter across samples

Pages 158-171 | Received 21 Mar 2018, Accepted 28 Sep 2018, Published online: 24 Oct 2018

ABSTRACT

In many settings, multiple data collections and analyses on the same topic are summarised separately through statistical estimators of parameters and variances, and yet there are scientific reasons for sharing some statistical parameters across these different studies. This paper summarises what is known from large-sample theory about when estimators of a common structural parameter from several independent samples can be combined functionally, or more specifically linearly, to obtain an asymptotically efficient estimator from the combined sample. The main idea is that such combination can be done when the separate-sample nuisance parameters, if any exist, vary freely and independently of one another. The issues are illustrated using data from a multi-centre lung cancer clinical trial. Examples are presented to show that separate estimators cannot always be combined in this way, and that the functionally combined separate estimators may have low or even zero efficiency compared to the unified analysis that could be performed by pooling the datasets.

1. Introduction

In many fields of social and biomedical science, multiple studies estimating the same parameter are conducted and summarised separately. The parameter of interest is often a measure of effectiveness, that is, of difference in response between groups with and without a certain treatment. Effectiveness may be quantified by the positive effect of a biomedical treatment or of an intervention in a social-science context such as education or criminal justice. The most common version of the problem, addressed by Meta-Analysis (Hartung, Knapp, & Sinha, 2008), arises when many independent and uncoordinated studies are not individually large enough to make a definitive statement about positivity of the single (usually scalar) parameter θ which has scientific or policy interest. Another setting in which analyses are combined is the less common one of large similar studies (say, done in different geographic regions) with a shared parameter, where there may or may not be shared nuisance parameters. Whether large or small, few or many, the separate studies might differ in the precise characteristics of the population investigated, the criteria of inclusion in each study, or in the measurements collected, either because of the choice of auxiliary variables or because of the definitions and research methods used.

The primary objective here, as in the numerous examples of meta-analysis in Hartung et al. (2008), is to combine the results of many moderate-sample studies to distil a consensus parameter estimate, or a corresponding test of significance of specific component parameter(s) representing treatment effectiveness. Since it is not always possible to gain simultaneous access to the raw unit-level data of previously published studies, researchers seeking a definitive test or estimator of treatment effectiveness often attempt to combine the study results through a function of their separate summary statistics rather than through a re-analysis of all pooled subject-level data under a unified model.

The active field of statistical meta-analysis is occupied with methods of simultaneously modelling disparate studies or the estimated parameters from those studies in such a way that they can be combined (Hartung et al., 2008). Much of this effort has gone into models that account for differences in study methodology through random effects. Random-effect, empirical-Bayes, and hierarchical Bayes methods have all proved useful in this effort to combine study results. Efforts to ‘borrow strength’ across distinct experimental entities are ubiquitous in the random-effect and Bayes community (Carlin & Louis, 2008; Efron, 1996), but arise also under the heading of Small Area Estimation in the survey world (Rao & Molina, 2015). The validity of analyses depending on models to combine different experiments with shared parameters and random effects can always be questioned, but sometimes such analyses turn out to be surprisingly robust (Slud & DeMissie, 2011). We discuss and illustrate some of the relevant modelling issues in Section 2 below.

This paper studies the problem of combining studies using a frequentist large-sample approach based on the standard tools of Fisher information and large-sample asymptotics. We begin by giving notations, statements and explanations of relevant results.

Suppose that k large independent samples of independent data vectors are observed, the jth sample being of size n_j for j = 1, …, k, with the goal of estimating the common parameter β efficiently. The densities of individual data vectors are assumed known except for the parameters (β, λ_j), where the nuisance parameters λ_j may not be present for all j, but k and the parameter dimensions are fixed while the sample sizes n_j all tend to ∞. Assume that the separate samples are summarised only through estimators β̂_j of β, along with estimators V̂_j of their variances, and that it is desired to estimate β as efficiently as possible from these statistics. The vectors β̂_j might then be treated as independent data with approximate means β and with known variance matrices V_j. We refer to the β̂_j as separate-sample estimators and to estimators within a unified model of the k-sample data considered together as combined-sample estimators.

When β is scalar, the best unbiased linear-combination estimator with respect to mean-squared error is well known to have weights proportional to the inverse variances, β̂ = (Σ_j V̂_j⁻¹ β̂_j) / (Σ_j V̂_j⁻¹). This estimator has often been viewed in Meta-Analysis (Hartung et al., 2008, Chapter 4) as arising from the model (1), in which the β̂_j are independent with mean β and the variances V_j are treated as known. This model is treated as the source of the meta-analytic estimator (3) below also in the p-vector setting.
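With scalar per-study estimates and variances in hand, the inverse-variance combination takes only a few lines. A minimal sketch in Python; all numeric values are invented for illustration, not taken from the paper:

```python
import numpy as np

# Illustrative (made-up) per-study estimates of a scalar beta and their variances
beta_hat = np.array([0.52, 0.41, 0.60])
var_hat = np.array([0.04, 0.09, 0.02])

w = 1.0 / var_hat                          # inverse-variance weights
beta_comb = np.sum(w * beta_hat) / np.sum(w)
var_comb = 1.0 / np.sum(w)                 # variance of the combined estimator
```

The combined variance 1/Σ_j V̂_j⁻¹ is necessarily smaller than every individual V̂_j, which is the quantitative sense in which combining the studies gains precision.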

Olkin and Sampson (1998) considered a balanced ANOVA model (2) which fits into our setting, in which the components of the error p-vectors are uncorrelated and all have mean 0 and (known or unknown) variance σ², and the β̂_j are the least-squares estimators of the β vectors in terms of the jth-sample data. Olkin and Sampson (1998) prove – in a setting extending (2) to unbalanced ANOVA – that the best linear estimator (3) below in model (1) is identical to the least-squares estimator of β within the combined unit-level model (2). When the errors are normal, the Gauss–Markov Theorem implies that the linear-combination estimator (3) is Maximum Likelihood and therefore efficient within (2).

Two recent articles, Lin and Zeng (2010) and Liu, Liu, and Xie (2015), have treated in different ways the problem of efficiently combining separate-study estimators of a shared parameter when there are nuisance parameters. Lin and Zeng (2010) discuss the parametric combination of separate-study estimators when only the separate-study estimators and variance estimators are available. They prove the result given as (V) below, in the case where the same parameter β is estimated by maximum likelihood in each study and all nuisance parameters vary freely and unconstrained between studies. They do not cover the case where nuisance parameters are infinite-dimensional or where the separate studies may estimate only projections of β, a situation that occurs naturally in meta-analysis, as illustrated in the example of Section 2 below. These extensions are covered in the present paper, in Section 3 and Appendices 1 and 3. The second issue, that separate studies may estimate only functions of a common parameter, is treated fully by Liu et al. (2015) in a more general setting where nuisance parameters may constrain each other across studies. The paper of Liu et al. (2015), using the idea of confidence distributions, establishes under general regularity conditions the form of the optimal combination of the separate-study estimators of all of the parameters, not just the shared structural parameters but the nuisance parameters too. That paper is therefore less directly relevant to meta-analysis in practice, since it is very uncommon for investigators in separate studies to report the estimated nuisance parameters as well as joint variance estimates of all structural and nuisance parameters, as was noted by Lin and Zeng (2010, 1st paragraph of Section 2.2).

In the rest of this paper, we allow the possibility that in the jth sample, the estimand may be not the full p-vector parameter β but rather a projection P_j β to a known subspace of the p-dimensional parameter space, with ‘structural zeroes’ in place of the other coordinates. In case the range of the projection P_j has dimension p_j ≤ p, the jth-sample estimator β̂_j is assumed to be either an efficient estimator of P_j β or one derived by solving an estimating equation, with asymptotic variance V_j, where V_j is consistently estimated by V̂_j. Both V_j and V̂_j are assumed to have range-spaces the same as that of P_j, and respective generalised inverses V_j⁻ and V̂_j⁻ which are inverses of V_j, V̂_j on the range of P_j and are the 0 operator on the orthogonal-complement space. Appendix 1 proves, under general large-sample regularity conditions, that if the β̂_j are asymptotically efficient estimators of P_j β from the separate samples, then the overall p-vector estimator (3) β̂ = (Σ_j V̂_j⁻)⁻ Σ_j V̂_j⁻ β̂_j is an efficient estimator of β from the combined sample, i.e., has minimal asymptotic variance, if there exists a regular efficient estimator depending (smoothly) on the summary statistics (β̂_j, V̂_j), j = 1, …, k, alone. Alternatively, this efficiency is proved to hold in Section 3 under the restriction that the nuisance parameters λ_j range freely without any constraints connecting them for different j. In a further reformulation in Section 3, it is shown that when the nuisance parameters range freely without any constraints connecting them for different j, and the estimators β̂_j are obtained by solving M-estimating equations (4), the estimator (3) is efficient in the sense of having the same asymptotic variance as the best estimating-equation estimator linearly combining the estimating functions with matrix coefficients. Through examples in Section 4, we show that when the parameters λ_j are related by constraints, the estimator (3) is generally not efficient and may have efficiency 0, which means intuitively that its large-sample variance has larger order of magnitude than that of the best possible estimator based on the combined sample.
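In the full-rank case, where every study reports an estimate of the whole p-vector β together with a variance matrix, the generalised inverses in (3) reduce to ordinary inverses and the combination is a matrix-weighted average. A minimal numerical sketch (all values invented):

```python
import numpy as np

# Illustrative per-study estimates of a 2-vector beta and their variance matrices
beta_hats = [np.array([0.50, -0.10]),
             np.array([0.40,  0.00]),
             np.array([0.60, -0.20])]
V_hats = [np.diag([0.04, 0.09]),
          np.array([[0.05, 0.01],
                    [0.01, 0.08]]),
          np.diag([0.02, 0.03])]

# Matrix-weighted combination: beta = (sum_j V_j^-1)^-1 sum_j V_j^-1 beta_j
precisions = [np.linalg.inv(V) for V in V_hats]
total_prec = sum(precisions)
beta_comb = np.linalg.solve(total_prec,
                            sum(P @ b for P, b in zip(precisions, beta_hats)))
V_comb = np.linalg.inv(total_prec)     # variance matrix of the combined estimator
```

By the inverse ordering of positive-definite matrices, V_comb is no larger than any single V̂_j in the nonnegative-definite ordering, reflecting the precision gained by combining the studies.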

Before going on to theoretical developments, we illustrate these ideas in the next section using data from a real clinical trial.

2. Data example from a multi-centre clinical trial

Large randomised clinical trials are often conducted simultaneously at different medical centres. They are generally governed by the same clinical protocols — including formal entry criteria, randomisation methods, baseline and cross-sectional measurements to be collected, and study endpoints — but can differ in many of the same ways as completely separate studies: the patient populations from which study participants are recruited, slight differences in the way entry criteria and medical procedures are applied by medical personnel at the different centres, and the patient management strategies of individual physicians associated with the different centres. However, since the study design is shared by all the centres, such large clinical trials can be excellent test-beds for methods purporting to estimate shared parameters by combining analyses done in separate smaller studies. We describe the separate and combined analyses for just such a study, the Eastern Cooperative Oncology Group (ECOG) EST 1582 clinical trial of two different combination chemotherapies for treatment of small cell lung cancer, which has previously been analysed by Gray (1994) and studied by meta-analysis in Slud and DeMissie (2011). The standard therapy in this trial was CAV, a combination of cyclophosphamide, adriamycin and vincristine, and the experimental treatment regimen (CAV-HEM) alternated cycles of CAV with hexamethylmelamine, etoposide and methotrexate. Allocation to these two treatment arms was randomised and equal.

The covariates in the study were binary indicators: Trt for experimental treatment, bone for bone metastasis, liver for liver metastasis, wtloss for weight-loss prior to study entry, and a measure Perf of performance status at baseline. These covariates entered significantly into an overall Bayesian proportional-hazards analysis by Gray (1994), who found that both a coefficient for Trt and one for an interaction term Trt-by-bone were significant. Here, as in Slud and DeMissie (2011), building on the earlier MS thesis work of DeMissie (2009), we analyse the same data separately by centre taking the Trt-by-bone interaction into account.

The data for subject i in study-centre j consist of Y_ij, equal to the logarithm of survival time (or log of time until censoring, for the 10 out of 570 patients who were lost to follow-up at a time before death), an indicator δ_ij marking Y_ij as a failure time rather than a censoring time, and the vector Z_ij of covariates consisting of the constant 1, Trt, bone, liver, Perf, wtloss in Model 1, augmented in Model 2 by Trt*bonCtr, where bonCtr is a recoded covariate obtained by subtracting from bone its study-centre mean. The main Trt effect in Model 2 is still the Trt coefficient, due to the centering in bonCtr. The 18 study-centre clusters were obtained from the 26 original hospitals, as in DeMissie (2009), after merging some smaller ones by clustering for similarity of covariate means, so that the smallest number of individuals in any centre became 17. The underlying models we consider here are (5a) or (5b), where Y_ij and Z_ij are as defined above, the b_j are independent random cluster-effects included only in model (5b), and the ε_ij are independent random errors distributed as extreme-value, i.e., as the logarithm of a unit exponential variable. These models are called Model 1 (a or b) when the parameter β common to all study-centres is the scalar coefficient of Trt. Model 2 (a or b) differs only by including the Trt*bonCtr covariate, with β the 2-vector of coefficients of Trt and Trt*bonCtr.

Models 1a and 2a are first fitted for each study-centre, and their separately estimated β coefficients and standard errors, obtained from the R function survreg, are exhibited in Table 1. (Note that in each of Centres 2 and 7, the Trt*bone values are all 0, so that in these centres Trt*bonCtr is an affine function of Trt and its coefficient is a structural 0.) We then construct the estimators (3) in these two models, scalar for Model 1a and 2-vector for Model 2a: these are the estimators that would be produced in a meta-analysis if Model 1a and 2a estimation results (including estimated variance matrices for the estimates in 2a) were separately published from studies at distinct centres. For Model 1a, the meta-analytic estimates (3) of β and standard error are . For Model 2a, the meta-analytic coefficient estimates are , with respective SE's 0.051, 0.117. In all of these analyses, the Trt coefficient is highly significant, indicating that the experimental treatment prolonged life, as was found by Gray (1994) and DeMissie (2009), but the separate centres' models seem to yield conflicting information.

Table 1. Estimated parameters and SE's from two models, with coefficients for Trt and for Trt*bonCtr, in separate study centres. ‘*’ denotes structural 0.

Models 1a and 2a are fixed-effects models for many small-sample datasets. Fitting them separately in each centre reflects an assumed lack of connection across centres j for the coefficients of covariates not involving treatment. In Model 1a, the standard errors of the treatment coefficient are roughly 0.2 in each centre, and the standardised coefficient ranges across centres from to , significant (at , two-sided) in only 4 centres. In Model 2a, the standardised coefficient for Trt ranges from to , with 8 significant, and that for Trt*bonCtr ranges from to 2.8, with only 2 significant. A unified model 1a, assumed to hold in all centres with common β, yields with SE=0.069, and the unified model 2a with shared 2-vector β yields with SE=0.068. However, since the estimates vary considerably across centres in the separately fitted models, the unified model should accommodate the differences through a random-effects model like (5b), with independent random Trt effects across clusters. Random treatment effects were previously considered both by Gray (1994) and DeMissie (2009). The unified model (5b), fitted by SAS proc nlmixed to allow random Trt effects by centre, results in the following estimates and standard errors: The unified models 1b and 2b agree in their estimate of the Trt coefficient and differ only slightly from the unified models 1a and 2a, but the unified analysis differs markedly from the meta-analysis. The unified estimate of the Trt*bonCtr coefficient appears significant both in the unified model and in the meta-analysis. The Trt*bonCtr coefficient can be seen to be very noisily fitted in the individual clusters, and perhaps should also be treated with a random effect in the unified model.

The important point of this example is that a properly specified combined-sample analysis can be expected, under the kind of mixed-effect linear model described here, to give substantially the same results as a meta-analysis under a model with adequately detailed interactions and random effects (Slud & DeMissie, Citation2011). In this setting, as in most real meta-analyses, the separate samples (here, the analyses at individual centres) are too small to identify interactions such as treatment-by-covariate interactions which are clearly significant in unified-model analysis. The operation of meta-analysis, which functionally combines the separate-sample coefficient estimates, shows through goodness-of-fit assessments the necessity of including interaction terms (such as treatment-by-covariate) and random effects where separate-sample coefficients vary considerably.

3. Results from large-sample theory

Throughout the rest of this paper, we assume standard regularity and nondegeneracy conditions about parameters (as in Bickel & Doksum, 2007, Theorem 6.2.2, and Van der Vaart, 1998, Theorem 5.39) which for finite-dimensional λ_j imply the following. First, joint maximum likelihood (ML) estimators (β̂_j, λ̂_j) for (β, λ_j) exist and are consistent and locally uniquely determined as solutions of the score or likelihood equations in the jth sample, and (6) the normalised errors √n_j ((β̂_j, λ̂_j) − (β, λ_j)) converge in distribution to a mean-0 normal vector. The limiting variance is the inverse of the Fisher information matrix I_j for the jth sample. The upper-left p × p block of I_j⁻¹ is denoted by V_j, and is the smallest possible asymptotic variance (in the sense of matrix ordering: L ≥ K if and only if L − K is nonnegative definite) within the class of all ‘regular’ estimators (Bickel, Klaassen, Ritov, & Wellner, 1998, pp. 17–21, or Van der Vaart, 1998, p. 115), which includes (again subject to regularity conditions) all estimators defined as solutions of estimating equations (7), where the estimating functions are known and nonrandom vector-valued summands whose form depends on study j but not on subject-index i. Any efficient regular estimator of β defined from the jth sample, i.e., one whose asymptotic variance is no larger than V_j, differs from the ML estimator of β by a remainder of order smaller than n_j^(−1/2) in probability. (This assertion follows from the Hájek–LeCam convolution theorem, Van der Vaart, 1998, p. 115.) This notion of efficiency can be extended also to the case where the nuisance parameters λ_j are infinite-dimensional, a notion of ‘first-order optimality’ defined as semiparametric or asymptotic efficiency of regular estimators (Van der Vaart, 1998, Section 25.3). Although the information bounds do depend on the nuisance parameters λ_j, we suppress that dependence to keep the notation as simple as possible.

Suppose that the values of an efficient estimator β̂_j and a consistent estimator V̂_j of V_j are reported from the separate analysis of the jth sample in order to obtain confidence intervals for components or linear combinations of the coordinates of β. The main question of interest in this paper is: under what circumstances will there exist a (smooth) function g of the summary statistics (β̂_j, V̂_j), j = 1, …, k, such that g(β̂_1, V̂_1, …, β̂_k, V̂_k) is efficient? In the case where the separate-sample estimators are assumed efficient, this means that the asymptotic variance of the combined estimator is the inverse of the per-observation Fisher information for β based on the combined sample of size n = n_1 + ⋯ + n_k. In the case where the estimators β̂_j are obtained from estimating equations as in (4), efficiency means that the combined estimator has asymptotic variance no larger than that of the best combined-sample estimating-equation estimator obtained from a matrix-weighted linear combination of the estimating functions.

In the next part of this Section, leading up to paragraphs (I)–(VI), we develop notions of single- and combined-sample Fisher information about the parameter β. Recall that the per-observation Fisher information about a parameter θ (here, either (β, λ_j) or a vector combining all of the free parameters in (β, λ_1, …, λ_k)) is a matrix expectation defined as n⁻¹ times the expected outer product of the score statistic with itself, where the score statistic is the column of partial derivatives of the log-likelihood of the data-sample of size n with respect to the parameters. (Here and below, we also use the convenient notation v⊗2 = v v′ for any vector v.) In this section, we express statistical properties of estimators in terms of the linear algebra of information matrices.

To begin, we review concepts and develop formulas related to separate-sample Fisher information about β. Denote the score statistic for data in the jth sample with respect to β by ∇_β log f_j and with respect to λ_j by ∇_{λ_j} log f_j. The information matrices per observation for the separate-sample parameters (β, λ_j) are (8), written in block form with blocks I_j^{ββ}, I_j^{βλ}, I_j^{λβ}, I_j^{λλ}. In the block decomposition on the right-hand side of (8), I_j^{ββ} is p × p.

The information about β in the jth sample is defined as the inverse of the upper-left block of I_j⁻¹ and is given by the Schur complement I_j^{eff} = I_j^{ββ} − I_j^{βλ} (I_j^{λλ})⁻¹ I_j^{λβ}. This linear-algebra fact about inverses of block-decomposed matrices can be found in virtually every book about regression (cf. Draper & Smith, 1981, Appendix 2A), and can be interpreted as the expression of the residual variance of the β-score after regression on the λ_j-score. This interpretation is developed in terms of the associated Linear Regression theory by Draper and Smith (1981, Section 2.6), or more abstractly in terms of projections by Rao (1973, Sections 4.a.1–2 and 4.a.6).
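The block-inverse fact just cited can be checked numerically: the upper-left block of the inverse of a partitioned positive-definite matrix equals the inverse of the Schur complement of the lower-right block. A small sketch with a randomly generated matrix:

```python
import numpy as np

rng = np.random.default_rng(1)

# A random positive-definite "per-observation information" matrix for (beta, lambda)
A = rng.standard_normal((5, 5))
I_full = A @ A.T + 5.0 * np.eye(5)
p = 2                                   # dimension of the beta block

I_bb = I_full[:p, :p]                   # beta-beta block
I_bl = I_full[:p, p:]                   # beta-lambda cross block
I_ll = I_full[p:, p:]                   # lambda-lambda block

# Effective information about beta: Schur complement of the nuisance block
I_eff = I_bb - I_bl @ np.linalg.solve(I_ll, I_bl.T)

# The upper-left p x p block of I_full^{-1} is exactly the inverse of I_eff
V_beta = np.linalg.inv(I_full)[:p, :p]
assert np.allclose(V_beta, np.linalg.inv(I_eff))
```

The cross block I_bl quantifies the information lost to nuisance-parameter estimation: when it is 0, the Schur complement reduces to I_bb, the adaptive case discussed in (IV) below.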

Since we allow the possibility that some of the coordinates of the parameter vectors (β, λ_j) are shared or constrained across different samples j, let γ denote a parameter vector, of dimension d, consisting of all free parameters among (β, λ_1, …, λ_k), so that all vectors (β, λ_j) are smooth functions of γ. Then we can write all of the densities as functions of γ, smooth in all components of the parameter arguments. Denote the score statistic for the jth sample with respect to the parameter vector γ as ∇_γ log f_j. Then, in terms of the Jacobian matrix D_j of the (β, λ_j)-valued function of γ, the γ-score is D_j′ times the (β, λ_j)-score. By independence of the samples, and therefore of the score statistics for different j, the per-observation Fisher information for the combined parameter γ in the combined sample is the symmetric d × d matrix (9) Ī = Σ_j ρ_j D_j′ I_j D_j, where (10) ρ_j = lim n_j/n is the limiting sampling fraction of study j. The corresponding complete-sample information about β is then expressed either as (11) the Schur complement of the nuisance block of Ī, in case that block is invertible, or more generally via the regression-residual interpretation analogous to that of I_j^{eff} above, as (12) the infimum over matrices K of the expected outer product of the β-score minus K times the remaining-coordinate score, where inf is understood in the sense of nonnegative-definite matrix ordering.

We now present a series of propositions relating combined-sample information and variance to those of the separate samples.

(I) Kagan and Rao (2003, Lemma 2) establish for finite-dimensional λ_j the superadditivity of information (13) Ī^{eff} ≥ Σ_j ρ_j I_j^{eff}, where Ī^{eff} denotes the combined-sample per-observation information about β. Since inverse information is equal to the smallest attainable asymptotic variance for (‘regular’) estimators under the conditions assumed here, the inequality (13) can be interpreted to say that the best possible combined-sample variance is at most the asymptotic variance of the right-hand side of (3).

(II) The last statement can be recast, as in Janicki (2009, Theorem 6.1.2), to say that additivity of information, or equality in (13), holds if and only if the estimator (3) has asymptotic variance (Ī^{eff})⁻¹. In a setting without nuisance parameters λ_j, Janicki (2009, Theorem 6.1.2) shows, for β̂_j defined by estimating equations, that an optimal combined estimator is obtained either as the weighted linear combination (3) of the β̂_j, or as the solution of the matrix-weighted linear combination of jth-sample estimating equations (14), where the summand estimating functions may be any functions with mean 0 and nonsingular expected derivative matrix at the true parameter, and where the optimal weight matrices are defined in terms of those expected derivative matrices and the variances of the estimating functions. In these expressions, all functions are evaluated at the same β governing the data.

(III) For an efficient regular combined-sample estimator of the form g(β̂_1, V̂_1, …, β̂_k, V̂_k) with g continuously differentiable, Taylor linearisation of g in terms of its arguments and their Jacobians (the ‘Delta Method’) shows that (15) holds, i.e., when the n_j are all of order n, g can be represented as a matrix-linear combination of the normalised centred ML estimators √n_j (β̂_j − β), up to a remainder converging to 0 in probability. Since (3) (with variances now all assumed to exist) is shown in Appendix 1 to be the unique optimal matrix-linear-combination estimator in the sense of minimal asymptotic variance, up to remainders, it follows that g(β̂_1, V̂_1, …, β̂_k, V̂_k) is equal to the linear combination (3) plus a remainder of smaller order than n^(−1/2) in probability.

(IV) There is one more case where the additivity of information (equality in (13)) is obvious, namely the case where all cross-blocks I_j^{βλ} are 0. This case is called adaptive because, by (11) or (12), the jth-sample information about β is exactly the same as if λ_j were known in advance. But then I_j^{eff} = I_j^{ββ} for each j, while (12) with K = 0 implies Ī^{eff} ≤ Σ_j ρ_j I_j^{ββ}. Combined with (13), equality must hold in (13).

(V) We now come to the main result of this paper, which says that when the nuisance parameters in the separate samples are distinct and unrelated, then (13) becomes an equality and an efficient estimator of β of the form (3) exists. In the multi-sample context described above, under standard regularity conditions, if all of the λ_j vary freely, unconstrained by one another or by β, so that γ = (β, λ_1, …, λ_k) and d = p + Σ_j dim(λ_j), then equality holds in (13). The restriction to unconstrained nuisance parameters is essential in this statement, as will be shown by example in Section 4.

There are a few different proofs of (V) in different cases. When the separate-sample estimators of the same shared parameter vector β are Maximum Likelihood estimators in the presence of the nuisance parameters λ_j, Lin and Zeng (2010) observe that the maximum profile-likelihood estimator for β (i.e., the maximiser of the log-likelihood partially maximised over nuisance parameters) in each of the separate studies and in the combined study is efficient, and that the log profile likelihood for the combined study is simply the sum of the log profile likelihoods of the separate studies. Thus the combined-study information is the sum of the separate-study informations. That is the complete proof in the MLE case. Since any separate-study efficient regular estimators of β are asymptotically equivalent in probability to the corresponding MLEs or profile-likelihood maximisers, according to the parametric Hájek–LeCam convolution theorem, this proof establishes the same result for the combination of any efficient regular separate-study estimators. Further technicalities would have been needed to carry this proof idea forward to the case of freely varying infinite-dimensional nuisance parameters λ_j. Lin and Zeng (2010) did not do that, but our Appendix 3 does. In another direction of generalisation, when the separate-sample estimators estimate not β but projections P_j β (with common null-space 0), the profile-likelihood proof sketched above yields the same result when coupled with the verification in Appendix 1 that the combined-sample MLE has minimal asymptotic variance.
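The heart of the profile-likelihood argument is a one-line identity: when the λ_j are unconstrained, maximisation over the nuisance parameters factors across samples. A sketch in the notation above:

```latex
% Sketch of the profile-likelihood argument for (V), assuming the nuisance
% parameters \lambda_j vary freely across studies.
% Let \ell_j(\beta,\lambda_j) denote the j-th sample log-likelihood and
% p\ell_j(\beta) = \sup_{\lambda_j} \ell_j(\beta,\lambda_j) its profile.
\[
  p\ell(\beta)
  \;=\; \sup_{\lambda_1,\dots,\lambda_k} \sum_{j=1}^{k} \ell_j(\beta,\lambda_j)
  \;=\; \sum_{j=1}^{k} \sup_{\lambda_j} \ell_j(\beta,\lambda_j)
  \;=\; \sum_{j=1}^{k} p\ell_j(\beta).
\]
% The middle equality uses only the freeness of the \lambda_j: the supremum of a
% sum of functions of disjoint variables is the sum of the suprema. Differentiating
% twice and taking expectations, the combined profile information about \beta is
% the sum of the separate-sample profile informations, which is equality in (13).
```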

(VI) The same result as in (V) – the optimality of estimator (3) – also holds when the separate-study estimators are obtained through the solution of M-estimating equations (4), once the notion of optimality is suitably clarified. Here the estimating functions ψ_j are assumed to satisfy standard smoothness with respect to parameters and other regularity conditions, including that E[ψ_j] = 0, where expectations are taken and the integrand functions evaluated at the same true parameters (β, λ_j), and where the expected-derivative matrices A_j = E[−∇_{(β,λ_j)} ψ_j] are nonsingular. The solution of (4) is locally unique, in the vicinity of the true (β, λ_j), with probability converging to 1 for large n, by general estimating-equation theory (Van der Vaart, 1998, Chapter 5, or Janicki, 2009, Chapter 2), and the estimators are regular and asymptotically linear, with normal limiting distributions whose variance matrices Σ_j have the usual sandwich form A_j⁻¹ E[ψ_j ψ_j′] (A_j′)⁻¹. Beyond this point, we maintain the same notations as throughout Section 3, namely that V_j is the upper-left p × p block of Σ_j, although these inverse variance matrices are no longer Fisher information.

From now on, assume as in paragraph (V) that the parameters λ_j range freely and are not constrained by one another, so that the combined-sample unknown parameters are γ = (β, λ_1, …, λ_k). The first step in extending (V) is to extend the optimality result in (II) by viewing the separate-study estimating equations all as estimating equations for the combined parameters γ. Let ψ̃_j(x, γ) = E_j ψ_j(x, β, λ_j), where, with q_j denoting the dimension of λ_j, E_j is the matrix with the block decomposition containing 0's everywhere except for the p × p identity matrix in the upper-left block and another identity matrix in the submatrix with consecutive row-indices from p + q_1 + ⋯ + q_{j−1} + 1 through p + q_1 + ⋯ + q_j and consecutive column-indices from p+1 through p + q_j.

Now consider the class of all combined-sample estimating equations defined by (16), where the matrix-valued weight functions W_j are continuously differentiable functions of their arguments and satisfy the nondegeneracy condition (17). The same arguments as in standard estimating-equation theory show that for large n, the solution of (16) is, with probability approaching 1, locally unique for γ in a non-shrinking neighbourhood of the true values, and defines √n-consistent asymptotically normal estimators with asymptotic variance of sandwich form (18). Then a slight reworking of the proof of Janicki (2009, Theorem 6.1.2), using a generalised-inverse identity, shows that the combined estimating equation (16) with smallest asymptotic variance in the sense of positive-definite ordering of matrices is achieved at a particular optimal choice of the W_j, and we fix this choice for the W_j in equations (16)–(18) from now on.

Denote by γ̂ the solution of the optimally weighted estimating equation (16), with equations (17) and (18) giving its asymptotic variance as in (19). Now we return to our objective of comparing the combined-sample information-analogue for β, which is the inverse of the combined-sample asymptotic variance for the optimal estimating-equation estimator γ̂, with the information-analogue Σ_j ρ_j V_j⁻¹ from equation (3). The asymptotic variance matrix for γ̂ is the inverse of the matrix in (19), so to find the combined-sample information-analogue, as in equations (9) and (11), we block-decompose that matrix, with upper-left block p × p and lower-right block (q_1 + ⋯ + q_k) × (q_1 + ⋯ + q_k).

The upper-left block of the matrix in (19) is obtained, by definition of ψ̃_j, as the weighted sum of the upper-left blocks of the per-study matrices. Again by definition of ψ̃_j, the β-λ cross block is contributed sample by sample, while the nuisance block is block-diagonal when decomposed in successive blocks of sizes q_1, …, q_k, with jth diagonal block contributed by the jth study alone, for j = 1, …, k. Therefore, taking the Schur complement of the block-diagonal nuisance block, (20) the combined-sample information-analogue for β equals Σ_j ρ_j V_j⁻¹. Since the inverse asymptotic variance of the linear combination (3) is also Σ_j ρ_j V_j⁻¹, we have proved the desired result: the asymptotic variance of the inverse-variance weighted linear combination (3) is the same as the asymptotic variance of the estimator derived from the optimally weighted estimating function (16) of this section.
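The block-diagonal structure that drives (20) can be verified numerically: when the combined nuisance block is block-diagonal, the Schur complement of the combined matrix is exactly the sum of the per-study Schur complements. A sketch with randomly generated per-study information matrices (dimensions and all values invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
p, q, k = 2, 2, 3        # dims of beta, of each lambda_j, and number of studies

def rand_info(dim):
    A = rng.standard_normal((dim, dim))
    return A @ A.T + dim * np.eye(dim)   # random positive-definite matrix

def schur_beta(I, p):
    # Effective information about beta: Schur complement of the nuisance block
    return I[:p, :p] - I[:p, p:] @ np.linalg.solve(I[p:, p:], I[p:, :p])

# Per-study information matrices for (beta, lambda_j)
I_studies = [rand_info(p + q) for _ in range(k)]
sum_eff = sum(schur_beta(I, p) for I in I_studies)

# Combined-sample information for (beta, lambda_1, ..., lambda_k):
# the beta block adds across studies; the nuisance block is block-diagonal
dim = p + k * q
I_comb = np.zeros((dim, dim))
for j, I in enumerate(I_studies):
    r = p + j * q
    I_comb[:p, :p] += I[:p, :p]
    I_comb[:p, r:r + q] = I[:p, p:]
    I_comb[r:r + q, :p] = I[p:, :p]
    I_comb[r:r + q, r:r + q] = I[p:, p:]

comb_eff = schur_beta(I_comb, p)
assert np.allclose(comb_eff, sum_eff)   # additivity of information holds
```

With any coupling across the nuisance blocks (off-block-diagonal entries in the lower-right part of I_comb), this equality generally fails, which is the phenomenon illustrated in Section 4.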

Remark on the Proof of (VI). The proof relies on the M-estimating-function property of both because it yields the specific formula in (17) for and because that is the setting in which the nuisance-parameter block of the combined-sample information analogue is block-diagonal. The application of (VI) is to meta-analyses in which only the structural parameter estimates and their estimated variances are reported by the separate studies. The fairly dramatic conclusion, however, is that in the setting of M-estimating functions with unconstrained 's, even if all estimates and variances for the nuisance parameters were also available for combined analysis, the asymptotic variances of the combined estimates of the structural parameters would be no better.

4. Examples where information additivity fails

It remains to show that the asymptotic optimality in the combined sample of weighted linear combinations of separate-sample estimators cannot persist in general when the nuisance parameters in the two samples are coupled, i.e., partially shared or related through common constraints. The matrix weights in the combined estimator (3) are optimal in the sense of minimising variance, and they lead to a combined estimator with asymptotic information . So when the lower bound on information is not attained at — and by (13) is smaller with respect to the positive-definite ordering — linear combinations cannot be optimal. An example of a coupled parameterisation is the two-sample location-scale model in which is finite-dimensional and, for a common location parameter β and a positive unknown scalar σ, (21) Thus in this example , and we replace by λ in the notation below. In the case of finite-dimensional λ, a straightforward calculation shows that the per-observation combined-sample Fisher information matrix has the form (22) with the expressions for given in terms of in Appendix 2.

The goal of this example is to show that in general the equality in (13) does not hold, by showing that the reciprocal of the upper-left element of is not equal to the sum of the reciprocals of the upper-left elements of the inverses, respectively, of (23) and of (24) These two displayed information matrices are the sample-1 and sample-2 total-information matrices divided by the total sample size n.

We supply below a numerical comparison of the single-sample and combined-sample information bounds for estimating β. Before doing so, we indicate a class of densities within this framework where separate-sample informations do add to give the combined-sample information.

Suppose in the location-scale two-sample setting just described that each density is symmetric (i.e., an even function) and that its derivative with respect to λ is also even. Then it is easy to check that is odd, and inspection of the integral formulas in Appendix 2 shows that and . In that case, estimation of β is adaptive as in (IV) (just as efficient without as with knowledge of the nuisance parameters), with separate-sample per-observation information numbers for β respectively and , and combined-sample information . The t distributions form a special case of this example. This paragraph shows that, within the two-sample location-scale setting, we must look to skewed density classes for examples where strict inequality holds in (13).
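The vanishing of the information cross-terms for symmetric densities can be checked numerically. The sketch below is illustrative only: it takes the Student-t family with λ playing the role of the degrees of freedom, computes the scores by central finite differences, and verifies that the product of the odd location score with the even scale and shape scores integrates to zero over the real line.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Student-t density: even in x for every value of the shape parameter lam
# (degrees of freedom), and its lam-derivative is also even in x.
lam, eps = 5.0, 1e-5

def score_beta(x):
    # location score -f'(x)/f(x), via a central finite difference; odd in x
    return -(stats.t.logpdf(x + eps, lam) - stats.t.logpdf(x - eps, lam)) / (2 * eps)

def score_lam(x):
    # shape score d/dlam log f(x; lam), via a central finite difference; even in x
    return (stats.t.logpdf(x, lam + eps) - stats.t.logpdf(x, lam - eps)) / (2 * eps)

def score_sigma(x):
    # scale score at sigma = 1, namely -(1 + x f'(x)/f(x)); even in x
    return -(1.0 - x * score_beta(x))

# Cross-terms of the information matrix: odd * even integrands vanish.
cross_bl, _ = quad(lambda x: score_beta(x) * score_lam(x) * stats.t.pdf(x, lam), -30, 30)
cross_bs, _ = quad(lambda x: score_beta(x) * score_sigma(x) * stats.t.pdf(x, lam), -30, 30)

assert abs(cross_bl) < 1e-6 and abs(cross_bs) < 1e-6
```

The zero cross-terms are exactly what makes estimation of the location β adaptive in this symmetric case.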

So we turn to two-sample location-family examples (21) in which is chosen to be asymmetric. One such density family is the skew-normal due to Azzalini (1985). In this family, in the special case where the true value of , integration shows that the combined-sample per-observation information matrix is , which is nonsingular if and only if , and in that case the combined-sample information for β is . However, this example is remarkable in that the single-sample information matrices and are both singular, and the single-sample information numbers for β in the presence of nuisance parameters tend to 0 as λ decreases to 0.

Non-zero values of the parameter λ further illustrate the phenomenon of non-additivity of separate-sample information bounds for estimating the common location parameter β when the nuisance parameters in the two samples are related. Table 2 shows several numerically calculated values (using the function integrate in R; R Development Core Team, 2017) of separate-sample and combined information for estimating a location parameter, all divided by the total sample size n. The table refers to the skew-normal two-sample location-scale problem, where λ is the skew parameter, σ the scale parameter, and c=0.5 the proportion of observations in sample 1. The point of these examples is that the last two rows do not sum to the combined-sample information , although this is nearly true when . In the case previously discussed, with , both of the separate-sample information entries were 0, while the combined-sample information was .
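The singular behaviour of the skew-normal information at the symmetric point is easy to reproduce numerically. In the sketch below (illustrative only), the density is f(x; λ) = 2φ(x)Φ(λx); at λ = 0 the location score is x while the skewness score is x·φ(0)/Φ(0) = x·√(2/π), so the two scores are exactly proportional and the 2×2 per-observation information matrix for (β, λ) has determinant 0.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Scores of the skew-normal density f(x; lam) = 2 phi(x) Phi(lam x),
# evaluated at lam = 0, where the density reduces to the standard normal.
scores = [lambda x: x,                                  # location score at lam = 0
          lambda x: x * norm.pdf(0.0) / norm.cdf(0.0)]  # skewness score = x*sqrt(2/pi)

# Per-observation information matrix for (beta, lam) at lam = 0,
# computed by numerical integration against the standard normal density.
I = np.empty((2, 2))
for i in range(2):
    for j in range(2):
        I[i, j], _ = quad(lambda x: scores[i](x) * scores[j](x) * norm.pdf(x), -10, 10)

assert abs(I[0, 0] - 1.0) < 1e-6      # E[x^2] = 1 under the standard normal
assert abs(np.linalg.det(I)) < 1e-6   # proportional scores => singular matrix
```

The computed matrix is [[1, √(2/π)], [√(2/π), 2/π]] up to integration error, whose determinant is exactly 0, matching the singularity noted in the text.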

Table 2. Information on β in combined and separate skew-normal samples. In all cases, c=0.5.

Examples like these have practical consequences. In paragraph (V) above, suppose that is a subvector of the parameter β, where . Even if additivity of information (i.e., equality in (13)) holds for two-sample inference about β, it does not necessarily hold for inference about γ based on the same data. As an example, consider the same two-sample skew-normal location-scale problem just presented, but now let the common parameter of interest be , let the sample-1 nuisance parameter be null (i.e., absent), and replace by . Apart from the change in notation, let the single-sample densities be the same as before. Then the nuisance parameter varies separately and freely over the two samples, and paragraph (V) implies that information additivity does hold for . But we have seen in the discussion and Table 2 above that inequality (13) is strict for the sub-vector β.

5. Discussion: meta-analysis and shared parameters

It is common in statistical applications to model separately collected datasets with shared parameters. Meta-analysis is one approach, within biomedical or social science, to the pooling of information across separate centres or data-collections. Parameters like treatment effects — those that are important for practical decisions — are most often shared across sub-samples, but common nuisance parameters may also naturally arise when cross-classifying variables are adjusted away using models. We have seen in this paper that there is no obstacle to the efficient weighted linear combination of separate-sample ML or M-estimators when nuisance parameters are either absent or vary freely without constraints across samples, but that otherwise, efficient functional combination may be impossible. The clear message for statistical practice is that whenever possible, separate studies which might be combined should report the estimator along with the estimated variance of a parameter vector which includes both β and whichever components of might be shared across the models for separate samples.

Acknowledgments

The authors gratefully acknowledge the Eastern Cooperative Oncology Group as the source for the ECOG EST 1582 dataset, and the suggestion of a referee to expand our treatment of (V) to estimating equations.

Disclosure statement

This paper is released to inform interested parties of research and to encourage discussion. Any views expressed are the authors' and not necessarily those of the U.S. Census Bureau.

Additional information

Notes on contributors

Eric Slud

Eric Slud is Professor in the Statistics Program within the Mathematics Department of the University of Maryland, College Park, and is Area Chief for Mathematical Statistics in the Center for Statistical Research and Methodology of the US Census Bureau.

Ilia Vonta

Ilia Vonta is an Associate Professor at the Department of Mathematics of the School of Applied Mathematical and Physical Sciences of the National Technical University of Athens, Athens, Greece and an Associate Professor-Tutor of the Hellenic Open University.

Abram Kagan

Abram Kagan is Professor in the Statistics Program within the Mathematics Department of the University of Maryland, College Park.

References

  • Azzalini, A. (1985). A class of distributions which includes the normal ones. Scandinavian Journal of Statistics, 12, 171–178.
  • Bickel, P., & Doksum, K. (2007). Mathematical statistics (2nd ed., updated printing, Vol. I). Upper Saddle River, NJ: Pearson Prentice Hall.
  • Bickel, P., Klaassen, C., Ritov, Y., & Wellner, J. (1998). Efficient and adaptive estimation for semiparametric models. Berlin: Springer.
  • Carlin, B., & Louis, T. (2008). Bayesian methods for data analysis (3rd ed.). Boca Raton, FL: Chapman & Hall/CRC.
  • DeMissie, M. (2009). Investigating center effects in a multi-center clinical trial study using a parametric proportional hazards meta-analysis model (MS Thesis). Univ. of Maryland Statistics Program, DRUM Digital Repository. Retrieved from https://drum.lib.umd.edu/handle/1903/9588
  • Draper, N., & Smith, H. (1981). Applied regression analysis (2nd ed.). New York: Wiley.
  • Efron, B. (1996). Empirical Bayes methods for combining likelihoods. Journal of the American Statistical Association, 91, 538–550. doi: 10.1080/01621459.1996.10476919
  • Gray, R. (1994). A Bayesian analysis of institutional effects in a multicenter cancer clinical trial. Biometrics, 50, 244–253. doi: 10.2307/2533216
  • Hartung, J., Knapp, G., & Sinha, B. (2008). Statistical meta-analysis with applications. Hoboken, NJ: Wiley.
  • Janicki, R. (2009). Statistical inference based on estimating functions in exact and misspecified models (Ph.D. thesis). Univ. of Maryland Statistics Program, DRUM Digital Repository. Retrieved from https://drum.lib.umd.edu/handle/1903/9690
  • Kagan, A., & Rao, C. R. (2003). Some properties and applications of the efficient Fisher score. Journal of Statistical Planning & Inference, 116, 343–352. doi: 10.1016/S0378-3758(02)00351-8
  • Lin, D. Y., & Zeng, D. (2010). On the relative efficiency of using summary statistics versus individual-level data in meta-analysis. Biometrika, 97, 321–332. doi: 10.1093/biomet/asq006
  • Liu, D., Liu, R., & Xie, M. (2015). Multivariate meta-analysis of heterogeneous studies using only summary statistics: Efficiency and robustness. Journal of the American Statistical Association, 110, 326–340. doi: 10.1080/01621459.2014.899235
  • Olkin, I., & Sampson, A. (1998). Comparison of meta-analysis versus analysis of variance of individual patient data. Biometrics, 54, 347–352. doi: 10.2307/2534018
  • R Development Core Team (2017). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org, ISBN 3-900051-07-0.
  • Rao, C. R. (1973). Linear statistical inference (2nd ed.). New York: Wiley.
  • Rao, J. N. K., & Molina, I. (2015). Small area estimation (2nd ed.). Hoboken, NJ: Wiley.
  • Slud, E., & DeMissie, M. (2011). Validity of regression meta-analyses versus pooled analyses of mixed-effect linear models. Mathematics in Engineering, Science and Aerospace, 2(4), 251–266.
  • Tsiatis, A. (2006). Semiparametric theory and missing data. New York: Springer.
  • Van der Vaart, A. (1998). Asymptotic statistics. Cambridge: Cambridge University Press.

Appendices

Appendix 1. Optimality of the linear combination in equation (3)

We are interested in regular estimators of based on separate-sample efficient estimators of the projected parameters , where are projections onto known subspaces of . The most important and well-studied case is where the same parameter is estimated in all samples, and , but it turns out to be almost as easy to consider the general case where determines β, or equivalently, where . If is of dimension less than p, then the projection is understood as having ‘structural zeroes’ in place of the parameter components , such as would occur if the β entries were regression coefficients for the (independent, randomly generated) rows of a design matrix , and if in the jth sample the rows of were structurally constrained to be in the range space of , so that is almost surely the zero matrix. Further assume (without loss of generality) that the estimators are constrained to fall in the range space of , so that , and that the asymptotic variance matrix of exists and is consistently estimated by . Without loss of generality , and is an invertible linear operator from to itself, with inverse which we denote by . Then is a generalised inverse of (Rao, 1973, p. 24), which means that More specifically, we have by definition of , (A1) Similar notations and generalised-inverse properties also hold for the estimated variance matrices . The projection operators need not all have the same range, but we assume that they have no common nontrivial null-space.

We restrict to estimators of the linear form , as in paragraph (III) of Section 3, where are matrices that depend continuously on and the variance-estimators , and there is no loss of generality in imposing the structural-zero constraint . It follows from regularity that for large n, the matrices must satisfy the constraint (A2) where denotes the identity matrix. This holds because regularity (Bickel et al., 1998, pp. 17–21, or Van der Vaart, 1998, p. 115) of estimators and of implies that and each converge in distribution to the same respective limits whenever β is replaced by any element in a neighbourhood of extent about β.

We now prove that among estimators satisfying (A2) and , the smallest asymptotic variance matrix with respect to positive-definite ordering is attained only when is asymptotically equivalent to (differs by from) (A3)

If denotes the large-sample in-probability limit of , for , then and the constraint (A2) immediately implies that . Then the asymptotic variance matrix of the estimator is, by independence of the samples , the limit of (A4) The equality in (A4) follows immediately from the constraint , noting by (A1) and (A3) that which is equal to . The unique variance-minimising property of follows from the fact that the last matrix on line (A4) is nonnegative definite and is only when all . The desired result has been proved.
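The structural-zero setup of this Appendix can be sketched numerically. The example below is illustrative, with invented variance values: sample 2 estimates only the first coordinate of β = (β1, β2), so its projection P2 keeps only that coordinate, its generalised-inverse variance carries a zero row and column, and the combined precision is the sum of the projected precisions P_j V_j^- P_j.

```python
import numpy as np

# Sample 1 estimates all of beta = (beta1, beta2); sample 2 estimates
# only beta1, with a 'structural zero' in the second slot.
V1 = np.array([[1.0, 0.4], [0.4, 2.0]])   # sample-1 variance of (b1, b2)
v2 = 0.5                                   # sample-2 variance of b1 alone

P1 = np.eye(2)                             # sample-1 projection (full)
P2 = np.diag([1.0, 0.0])                   # sample-2 projection onto beta1
V2g = np.diag([1.0 / v2, 0.0])             # generalised inverse of P2 V2 P2

# Combined precision = sum over samples of P_j V_j^- P_j.
prec = P1 @ np.linalg.inv(V1) @ P1 + P2 @ V2g @ P2
V_comb = np.linalg.inv(prec)

# Sample 2 can only help: V1 - V_comb is nonnegative definite, and the
# variance for beta1 strictly improves.
assert np.all(np.linalg.eigvalsh(V1 - V_comb) >= -1e-12)
assert V_comb[0, 0] < V1[0, 0]
```

Note that the second coordinate also improves slightly whenever V1 has nonzero off-diagonal entries, because the combined estimator of β1 feeds back through the correlation.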

Appendix 2. Information matrices in Section 4

Section 4 presents a two-sample location-scale problem with respective densities and with sample sizes and with . The combined-sample information matrix (22) for the parameters is given as since is 0. Then straightforward integration yields the entries of as where denote derivatives of with respect to the first argument x, and all functions and derivatives are evaluated at and integrated with respect to the x variable on .

Similarly one calculates directly the entries of the symmetric matrix as From these integral expressions we derive the entries of (22) as along with the separate-sample information expressions (23) and (24).
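Integral entries of this kind can be evaluated numerically for any given density. As a sanity check of the score-product formulas (illustrative, not from the paper), the sketch below uses the standard normal in the location-scale family f((x − β)/σ)/σ at (β, σ) = (0, 1), where the scores are x and x² − 1 and the per-observation information matrix is the classical diag(1, 2).

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Scores of the normal location-scale family at (beta, sigma) = (0, 1):
#   location score: -f'(x)/f(x)        = x
#   scale score:    -(1 + x f'(x)/f(x)) = x^2 - 1
scores = [lambda x: x, lambda x: x * x - 1.0]

# Information entries as integrals of score products against the density.
I = np.empty((2, 2))
for i in range(2):
    for j in range(2):
        I[i, j], _ = quad(lambda x: scores[i](x) * scores[j](x) * norm.pdf(x), -10, 10)

# Classical answer: I = diag(1, 2); cross-term 0 by symmetry.
assert np.allclose(I, np.diag([1.0, 2.0]), atol=1e-6)
```

Replacing `norm` by an asymmetric density (such as the skew-normal of Section 4) makes the cross-terms nonzero, which is precisely the source of non-additivity in the examples above.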

Appendix 3. Infinite-dimensional nuisance parameters

This Appendix provides an extension, to infinite-dimensional nuisance parameters, of the result of paragraph (V) of Section 3 on the efficient functional combination of separate-sample MLEs to provide an efficient combined-sample estimator of a structural (finite-dimensional) parameter β shared across samples. The result states that such an efficient combination exists when the separate-sample nuisance parameters vary freely and independently of one another.

We maintain the notation of Section 3. The nuisance-parameter spaces may now be infinite-dimensional, . Assume that for all dimensional vectors and any dimensional vectors with at most finitely many components nonzero, there exist such that . We also assume standard regularity and nondegeneracy conditions on finite-dimensional submodels with parameters (as in Bickel & Doksum, 2007, Theorem 6.2.2, and Van der Vaart, 1998, Theorem 5.39). In particular, for each finite-dimensional affine subset (a set of the form where is a vector space) and , these assumed conditions imply that the joint ML estimators for exist and are consistent and locally uniquely determined as solutions of the score or likelihood equations in the jth sample, and are asymptotically normal in the following sense. Let and This is a vector space whose dimension we have assumed to be finite. Let denote an orthonormal basis of , and define the operator by the rule Then is a vector space isomorphism, and (C1) The variance is the inverse of the Fisher information matrix for the jth sample and depends on the finite-dimensional parameter space (the submodel). The upper-left block of is denoted by . The infimum in the sense of positive-definite matrix ordering ( if and only if L−K is nonnegative definite) can be shown to exist (Bickel et al., 1998, pp. 17–21, or Van der Vaart, 1998, p. 115).

As discussed in Section 3, another expression for is where ranges over all vectors in the dimensional space , the matrix K is arbitrary, and we adopt the notation for any vector , and denote by the gradient operator. The inf in the displayed expression is actually achieved, and leads to the expression where now ranges over all (sufficiently small) vectors in differing from 0 in only m coordinates, for some finite m, and K ranges over all real matrices.

Any efficient regular estimator defined from , i.e., one for which the asymptotic variance of is no larger than , differs from any other such estimator by a remainder of order smaller than in probability (Hájek–LeCam convolution theorem; Van der Vaart, 1998, p. 115). This notion of semiparametric efficiency applies under general conditions to the case discussed here where the nuisance parameters are infinite-dimensional. Semiparametric efficient regular estimators are discussed in Van der Vaart (1998, Section 25.3). Although the information bounds depend on the nuisance parameters , we suppress that dependence to keep the notation as simple as possible.

We next define the notion of combined-sample Fisher information about β, based on all data . Allowing the possibility that some of the coordinates of the parameter vectors are shared or constrained across different samples , let denote a maximal parameter vector consisting of all free parameters among , so that all vectors are well-behaved functions of . Then all of the densities can be written , where is smooth in all components of its parameter arguments. By independence of the samples , and by analogy with the displayed formula above for the separate-sample per-observation Fisher information matrices , there now exists a variationally defined combined-sample information matrix for β, (C2) where for some finite , the vector ranges over all subvectors of dimension the same as which differ from in at most m places; K ranges over all real matrices; and the inf is in the sense of nonnegative-definite matrix ordering. By these variational considerations, the inequality (13) on superadditivity of information, presented in (I) of Section 3 in the finite-dimensional setting, continues to hold in the setting allowing infinite-dimensional .

We now come to the main result of this Appendix: when the nuisance parameters in the separate samples are unrelated, (13) becomes an equality. The proof arguments are in the spirit of Bickel et al. (1998), Van der Vaart (1998, Chapter 25), and Tsiatis (2006). It is a general phenomenon that independent samples with freely varying nuisance parameters allow combined optimal estimation by simple linear combination of separate-sample optimal estimators.

Proposition A.1

Under the setting and assumptions above, suppose also that the possibly infinite-dimensional nuisance parameters vary freely, unconstrained by β or by each other. Assume further that as the sample-size n increases, all of the ratios have limits . Then the combined-sample semiparametric per-observation information bound is .

Proof.

We define several Hilbert subspaces of the space of p-vector square-integrable random variables defined measurably from the combined samples , where P is the probability measure corresponding to the true parameter values . First, for , define the closed linear spaces of (assumed square-integrable) random p-vectors with coordinates spanned by directional derivatives evaluated at t=0, where and is any vector with at most finitely many entries nonzero such that for all sufficiently small t. Denote by the span of all the spaces . Define to be the closed linear span of the subset of these vectors of directional derivatives for which , and Inner products on the spaces of p-vectors are: . Since all elements of have mean 0, the inner product is the sum of componentwise covariances.

Let denote the linear projection within onto the closed linear space . The separate-sample information bounds can generally (Bickel et al., 1998; Tsiatis, 2006; Van der Vaart, 1998) be interpreted as the variances of the ‘efficient influence functions’, the projection of the respective p-vector scores onto the orthogonal complement of within , denoted , or (equivalently, because each sample is independent identically distributed) The Hilbert space is called the tangent space of the finite-dimensional submodels. Because of the assumption that all of the parameters vary freely and unconstrained, can be expressed as the direct sum Since the subspaces are mutually orthogonal, being formed from independent random variables for different , for each ,

It follows immediately that the projection of in orthogonal to , which is the efficient influence function for the combined-sample data problem, is also precisely the same as The norm-squared or variance of this projection is, by mutual independence of , equal to a sum of variance-covariance matrices of projections: Thus is asymptotically equal to , completing the proof.
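A finite-dimensional analogue of Proposition A.1 can be verified directly. The sketch below (with randomly generated information matrices, not data from the paper) builds the combined information matrix over (β, λ₁, λ₂) for two samples with freely varying nuisance blocks: the nuisance block is block-diagonal, and the Schur complement in β — the effective information after profiling out the nuisance parameters — is the sample-fraction-weighted sum of the separate-sample Schur complements.

```python
import numpy as np

rng = np.random.default_rng(0)

def schur_beta(I, p):
    # Effective information for the first p coordinates (beta) after
    # profiling out the nuisance block: A - B D^{-1} B^T.
    A, B, D = I[:p, :p], I[:p, p:], I[p:, p:]
    return A - B @ np.linalg.solve(D, B.T)

p, q = 2, 3                      # dim(beta), dim(lambda_j)
c = np.array([0.3, 0.7])         # limiting sample fractions c_j

# Random positive-definite per-observation information matrices over (beta, lambda_j).
Is = []
for _ in range(2):
    M = rng.standard_normal((p + q, p + q))
    Is.append(M @ M.T + (p + q) * np.eye(p + q))

# Combined information over (beta, lambda_1, lambda_2): since the lambda_j
# vary freely and independently, the nuisance block is block-diagonal.
n = p + 2 * q
I_comb = np.zeros((n, n))
for j, Ij in enumerate(Is):
    sl = slice(p + j * q, p + (j + 1) * q)
    I_comb[:p, :p] += c[j] * Ij[:p, :p]
    I_comb[:p, sl] = c[j] * Ij[:p, p:]
    I_comb[sl, :p] = c[j] * Ij[p:, :p]
    I_comb[sl, sl] = c[j] * Ij[p:, p:]

# Additivity of effective information for beta.
lhs = schur_beta(I_comb, p)
rhs = sum(c[j] * schur_beta(Is[j], p) for j in range(2))
assert np.allclose(lhs, rhs)
```

The identity fails as soon as the nuisance block is not block-diagonal, which is exactly the coupled-parameter situation of Section 4.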
