Theory and Methods

Individual Data Protected Integrative Regression Analysis of High-Dimensional Heterogeneous Data

Pages 2105-2119 | Received 31 Jul 2019, Accepted 13 Mar 2021, Published online: 19 May 2021
 

Abstract

Evidence-based decision making often relies on meta-analyzing multiple studies, which enables more precise estimation and investigation of generalizability. Integrative analysis of multiple heterogeneous studies is, however, highly challenging in the ultra high-dimensional setting. The challenge is even more pronounced when the individual-level data cannot be shared across studies, a restriction known as the DataSHIELD constraint. Under sparse regression models that are assumed to be similar yet not identical across studies, we propose in this paper a novel integrative estimation procedure for data-Shielding High-dimensional Integrative Regression (SHIR). SHIR protects individual data through a summary-statistics-based integration procedure, accommodates between-study heterogeneity in both the covariate distribution and model parameters, and attains consistent variable selection. Theoretically, SHIR is statistically more efficient than the existing distributed approaches that integrate debiased LASSO estimators from the local sites. Furthermore, the estimation error incurred by aggregating derived data is negligible compared to the statistical minimax rate, and SHIR is shown to be asymptotically equivalent in estimation to the ideal estimator obtained by sharing all data. The finite-sample performance of our method is studied and compared with existing approaches via extensive simulation settings. We further illustrate the utility of SHIR to derive phenotyping algorithms for coronary artery disease using electronic health records data from multiple chronic disease cohorts.

Supplementary Material

In the Supplement, we provide some justifications for Conditions 1 and 6, present detailed proofs of Theorems 1–3, outline theoretical analyses of SHIR for various penalty functions, and present additional simulation results.

Notes

1 Commonly used summary statistics include the locally fitted regression coefficient and its Hessian matrix in the low-dimensional parametric regression models (see, e.g., Duan et al. 2019, 2020).
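
As a rough illustration of the summary statistics mentioned in Note 1, the sketch below shows how locally fitted coefficients and Hessian matrices can be combined without sharing individual-level rows. It is not the SHIR procedure: it assumes a low-dimensional linear model with a common coefficient vector across sites and no penalty, and every function name (local_summary, integrate, simulate_site) is hypothetical. In this simplified setting the integrated estimate coincides with the pooled least-squares fit, which is what the final comparison checks.

```python
import random

def gram(X):
    """Return X^T X as a list of lists."""
    p = len(X[0])
    return [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]

def xty(X, y):
    """Return X^T y as a list."""
    p = len(X[0])
    return [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def local_summary(X, y):
    """Runs inside one site: local OLS coefficient and Hessian X^T X.
    Only these derived quantities leave the site, never the rows of X or y."""
    H = gram(X)
    return solve(H, xty(X, y)), H

def integrate(summaries):
    """Central step (assumed form): solve (sum H_m) beta = sum H_m beta_m."""
    p = len(summaries[0][0])
    H_tot = [[0.0] * p for _ in range(p)]
    g_tot = [0.0] * p
    for beta, H in summaries:
        for i in range(p):
            g_tot[i] += sum(H[i][j] * beta[j] for j in range(p))
            for j in range(p):
                H_tot[i][j] += H[i][j]
    return solve(H_tot, g_tot)

def simulate_site(n, beta, rng):
    """Toy data: intercept plus standard normal covariates and noise."""
    X = [[1.0] + [rng.gauss(0, 1) for _ in beta[1:]] for _ in range(n)]
    y = [sum(b * x for b, x in zip(beta, r)) + rng.gauss(0, 1) for r in X]
    return X, y

rng = random.Random(0)
true_beta = [0.5, 1.0, -2.0]  # illustrative values, not from the paper
sites = [simulate_site(n, true_beta, rng) for n in (40, 60, 80)]

# Each site shares only (beta_m, H_m).
summaries = [local_summary(X, y) for X, y in sites]
beta_integrated = integrate(summaries)

# Reference fit that would require pooling all individual-level data.
X_all = [r for X, _ in sites for r in X]
y_all = [yi for _, y in sites for yi in y]
beta_pooled, _ = local_summary(X_all, y_all)

max_gap = max(abs(a - b) for a, b in zip(beta_integrated, beta_pooled))
print("max |integrated - pooled| =", max_gap)
```

The agreement is exact up to floating-point error here because H_m beta_m equals X_m^T y_m for a linear model. The high-dimensional, heterogeneous setting addressed in the paper requires the penalized and debiased constructions described in the article itself.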

Additional information

Funding

The research of Yin Xia was supported in part by NSFC Grants 12022103, 11771094, and 11690013. The research of Tianxi Cai and Molei Liu was partially supported by the Translational Data Science Center for a Learning Health System at Harvard Medical School and Harvard T.H. Chan School of Public Health.
