Theory and Methods

Individual Data Protected Integrative Regression Analysis of High-Dimensional Heterogeneous Data

Pages 2105-2119 | Received 31 Jul 2019, Accepted 13 Mar 2021, Published online: 19 May 2021

