
Journal of the American Statistical Association Volume 115, 2020 - Issue 531

Theory and Methods

Combining Multiple Observational Data Sources to Estimate Causal Effects

Shu Yang, Department of Statistics, North Carolina State University, Raleigh, NC (corresponding author)

http://orcid.org/0000-0001-7703-707X

Peng Ding, Department of Statistics, University of California, Berkeley, CA

Pages 1540-1554 | Received 14 Jul 2018, Accepted 15 Apr 2019, Published online: 11 Jun 2019

https://doi.org/10.1080/01621459.2019.1609973

Abstract

The era of big data has witnessed an increasing availability of multiple data sources for statistical analyses. We consider estimation of causal effects combining big main data with unmeasured confounders and smaller validation data with supplementary information on these confounders. Under the unconfoundedness assumption with completely observed confounders, the smaller validation data allow for constructing consistent estimators for causal effects, but the big main data can only give error-prone estimators in general. However, by leveraging the information in the big main data in a principled way, we can improve the estimation efficiencies yet preserve the consistencies of the initial estimators based solely on the validation data. Our framework applies to asymptotically normal estimators, including the commonly used regression imputation, weighting, and matching estimators, and does not require a correct specification of the model relating the unmeasured confounders to the observed variables. We also propose appropriate bootstrap procedures, which make our method straightforward to implement using software routines for existing estimators. Supplementary materials for this article are available online.
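The abstract does not spell out the form of the combined estimator, so the following is only a minimal sketch of one way the idea could be realized, under stated assumptions. It assumes the validation data are a simple random subsample of the main data, uses an inverse probability weighting estimator throughout, and combines estimators in a control-variate style: the validation-only estimate is adjusted by the difference between the error-prone estimates computed on the validation and main data, with the adjustment coefficient estimated by a nonparametric bootstrap. The function names (`fit_logistic`, `ipw_ate`, `estimates`, `combine`, `simulate`) and the simulated data-generating process are hypothetical illustrations; they are not taken from the article or from its R package.

```python
import numpy as np

def fit_logistic(Z, a, n_iter=25):
    """Logistic regression by Newton's method (tiny ridge for numerical stability)."""
    beta = np.zeros(Z.shape[1])
    ridge = 1e-8 * np.eye(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ beta))
        hess = (Z * (p * (1.0 - p))[:, None]).T @ Z + ridge
        beta = beta + np.linalg.solve(hess, Z.T @ (a - p))
    return beta

def ipw_ate(y, a, covs):
    """Normalized (Hajek) IPW estimate of the average treatment effect."""
    Z = np.column_stack([np.ones(len(y)), covs])
    p = 1.0 / (1.0 + np.exp(-Z @ fit_logistic(Z, a)))
    w1, w0 = a / p, (1.0 - a) / (1.0 - p)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

def estimates(y, a, x, u, in_val):
    """Return (validation-only estimate, error-prone difference)."""
    v = in_val
    # Adjusts for both x and u; u is used only on validation units.
    full = ipw_ate(y[v], a[v], np.column_stack([x[v], u[v]]))
    # Error-prone estimators adjust for x only, on validation and on main data.
    ep_val = ipw_ate(y[v], a[v], x[v])
    ep_main = ipw_ate(y, a, x)
    return np.array([full, ep_val - ep_main])

def combine(y, a, x, u, in_val, rng, n_boot=200):
    """Control-variate style combination with bootstrap (co)variances."""
    full, diff = estimates(y, a, x, u, in_val)
    n = len(y)
    boot = np.empty((n_boot, 2))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample units with replacement
        boot[b] = estimates(y[idx], a[idx], x[idx], u[idx], in_val[idx])
    cov = np.cov(boot, rowvar=False)
    gamma, v_diff = cov[0, 1], cov[1, 1]
    return {
        "validation_only": full,
        "combined": full - gamma / v_diff * diff,
        "var_validation_only": cov[0, 0],
        "var_combined": cov[0, 0] - gamma ** 2 / v_diff,
    }

def simulate(n_main, n_val, tau, rng):
    """Toy data: u confounds treatment and outcome (assumed design, for illustration)."""
    x = rng.normal(size=n_main)
    u = rng.normal(size=n_main)
    p = 1.0 / (1.0 + np.exp(-(0.5 * x + 0.5 * u)))
    a = rng.binomial(1, p).astype(float)
    y = tau * a + x + u + rng.normal(size=n_main)
    in_val = np.zeros(n_main, dtype=bool)
    in_val[rng.choice(n_main, size=n_val, replace=False)] = True
    return y, a, x, u, in_val

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = simulate(n_main=2000, n_val=400, tau=2.0, rng=rng)
    print(combine(*data, rng=rng, n_boot=100))
```

In this sketch the two error-prone estimators share the same limit because the validation units are a random subsample, so their difference is centered near zero and subtracting a multiple of it does not affect consistency. The estimated variance of the combined estimate is never larger than that of the validation-only estimate by construction.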

Keywords:

Calibration
Causal inference
Inverse probability weighting
Missing confounder
Two-phase sampling

Acknowledgments

We thank the editor, the associate editor, and four anonymous reviewers for suggestions which improved the article significantly. We are grateful to Professor Yi-Hau Chen for providing the data and offering help and advice in interpreting the data. Drs. Lo-Hua Yuan and Xinran Li offered helpful comments. Dr. Yang is partially supported by the National Science Foundation grant DMS 1811245, National Cancer Institute grant P01 CA142538, and Oak Ridge Associated Universities. Dr. Ding is partially supported by the National Science Foundation grant DMS 1713152.

Supplementary materials

The online supplementary material contains technical details and proofs. The R package “IntegrativeCI”, which implements the proposed estimators, is available at https://github.com/shuyang1987/IntegrativeCI.

Additional information

Funding

Directorate for Mathematical and Physical Sciences; Division of Mathematical Sciences


