1,940
Views
12
CrossRef citations to date
0
Altmetric
Theory and Methods

Distributed Estimation for Principal Component Analysis: An Enlarged Eigenspace Analysis

, , &
Pages 1775-1786 | Received 05 Apr 2020, Accepted 01 Feb 2021, Published online: 06 Apr 2021
 

Abstract

The growing size of modern datasets brings many challenges to the existing statistical estimation approaches, which calls for new distributed methodologies. This article studies distributed estimation for a fundamental statistical machine learning problem, principal component analysis (PCA). Despite the massive literature on top eigenvector estimation, much less is presented for the top-L-dim (L > 1) eigenspace estimation, especially in a distributed manner. We propose a novel multi-round algorithm for constructing top-L-dim eigenspace for distributed data. Our algorithm takes advantage of shift-and-invert preconditioning and convex optimization. Our estimator is communication-efficient and achieves a fast convergence rate. In contrast to the existing divide-and-conquer algorithm, our approach has no restriction on the number of machines. Theoretically, the traditional Davis–Kahan theorem requires the explicit eigengap assumption to estimate the top-L-dim eigenspace. To abandon this eigengap assumption, we consider a new route in our analysis: instead of exactly identifying the top-L-dim eigenspace, we show that our estimator is able to cover the targeted top-L-dim population eigenspace. Our distributed algorithm can be applied to a wide range of statistical problems based on PCA, such as principal component regression and single index model. Finally, we provide simulation studies to demonstrate the performance of the proposed distributed estimator.

Supplementary Materials

In the supplementary material, we first provide detailed proofs of all our theoretical results in Section 3. We then study two applications of our distributed PCA: principal component regression and single index model. We provide theoretical findings and conduct additional numerical experiments for these two applications.

Acknowledgments

The authors would like to thank the editor-in-chief, the anonymous associated editor, and two anonymous referees for many useful suggestions and feedback that greatly improve the article.

Funding

Xi Chen is supported by NSF grant IIS-1845444. Jason D. Lee is supported by NSF grant CCF-2002272. Yun Yang is supported by NSF grant DMS-1810831.

Reprints and Corporate Permissions

Please note: Selecting permissions does not provide access to the full text of the article, please see our help page How do I view content?

To request a reprint or corporate permissions for this article, please click on the relevant link below:

Academic Permissions

Please note: Selecting permissions does not provide access to the full text of the article, please see our help page How do I view content?

Obtain permissions instantly via Rightslink by clicking on the button below:

If you are unable to obtain permissions via Rightslink, please complete and submit this Permissions form. For more information, please visit our Permissions help page.