Research Article

Limitations of clustering with PCA and correlated noise

Received 08 Dec 2022, Accepted 05 Mar 2024, Published online: 05 May 2024
 

Abstract

It is now common to have a modest to large number of features on individuals with complex diseases. Unsupervised analyses, such as clustering with and without preprocessing by Principal Component Analysis (PCA), are widely used in practice to uncover subgroups in a sample. However, in many modern studies features are often highly correlated and noisy (e.g. SNPs, -omics, quantitative imaging markers, and electronic health record data). The practical performance of clustering approaches in these settings remains unclear. Through extensive simulations and empirical examples applying Gaussian Mixture Models and related clustering methods, we show that these approaches (including variants of k-means, VarSelLCM, HDClassifier, and Fisher-EM) can have very poor performance in many settings. We also show that the poor performance is often driven by an assumption, explicit or implicit in the clustering algorithm, that high-variance features are relevant while lower-variance features are irrelevant; we call this the variance as relevance assumption. We develop practical pre-processing approaches that improve analysis performance in some cases. This work offers practical guidance on the strengths and limitations of unsupervised clustering approaches in modern data analysis applications.
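The variance as relevance assumption can be illustrated with a small toy example. The sketch below is not the authors' simulation design: the data, the two-feature setup, the seed, and all function names are hypothetical choices made for illustration. It builds two subgroups that differ only in a low-variance feature, adds one high-variance noise feature, and compares a plain 1-D k-means (k = 2) on the first principal component with the same k-means on the informative feature alone. The first principal component is computed in closed form for the 2x2 covariance matrix, so only the Python standard library is needed.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the toy example is reproducible

# Hypothetical toy data: two true subgroups that differ only in one feature.
# The informative feature has low variance; the noise feature has high variance.
n_per_group = 200
labels = [0] * n_per_group + [1] * n_per_group
informative = [(-1.0 if g == 0 else 1.0) + rng.gauss(0.0, 0.3) for g in labels]
noise = [rng.gauss(0.0, 5.0) for _ in labels]


def center(xs):
    """Subtract the mean from every value."""
    m = sum(xs) / len(xs)
    return [x - m for x in xs]


def cov(xs, ys):
    """Sample covariance of two already-centered lists."""
    return sum(x * y for x, y in zip(xs, ys)) / (len(xs) - 1)


def two_means_1d(xs, iters=50):
    """Plain 1-D k-means with k = 2, initialised at the min and max."""
    c0, c1 = min(xs), max(xs)
    assign = [0] * len(xs)
    for _ in range(iters):
        assign = [0 if abs(x - c0) <= abs(x - c1) else 1 for x in xs]
        g0 = [x for x, a in zip(xs, assign) if a == 0]
        g1 = [x for x, a in zip(xs, assign) if a == 1]
        if g0:
            c0 = sum(g0) / len(g0)
        if g1:
            c1 = sum(g1) / len(g1)
    return assign


def accuracy(assign, truth):
    """Agreement with the true labels, allowing for swapped cluster names."""
    agree = sum(a == t for a, t in zip(assign, truth)) / len(truth)
    return max(agree, 1.0 - agree)


x0, x1 = center(informative), center(noise)

# Covariance matrix [[a, b], [b, c]]; its leading eigenvector lies at angle theta.
a, b, c = cov(x0, x0), cov(x0, x1), cov(x1, x1)
theta = 0.5 * math.atan2(2.0 * b, a - c)
pc1 = [math.cos(theta) * u + math.sin(theta) * v for u, v in zip(x0, x1)]

acc_pc1 = accuracy(two_means_1d(pc1), labels)
acc_informative = accuracy(two_means_1d(x0), labels)

print(f"variance: informative={a:.2f}, noise={c:.2f}")
print(f"accuracy clustering on PC1:                 {acc_pc1:.2f}")
print(f"accuracy clustering on informative feature: {acc_informative:.2f}")
```

In this setup the first principal component aligns almost entirely with the noise feature, so clustering on it should recover the subgroups little better than chance, while clustering on the low-variance informative feature should recover them almost perfectly. Standardising the features before PCA would change this particular toy outcome; the abstract's point is that in realistic correlated, noisy data such fixes help only in some cases.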

Disclosure statement

WL holds stock in SomaLogic. NEC owns stock in Illumina. LAM has NIH funding to support sarcoidosis research, has funding from the Foundation for Sarcoidosis Research (FSR), is a member of the FSR Scientific Advisory Board, and has served or will serve on advisory boards or as a consultant for aTYR Pharma Inc, Novartis Pharmaceutical, CSL Behring and Boehringer Ingelheim. JA, TEF and KK have no competing interests to declare.

Additional information

Funding

This work was supported by the National Institutes of Health under Grants R01 HL114587, R01 HL142049, R01 HL152735, and T32 HL007085. Data from the GRADS study was supported under Grants U01 HL112707, U01 HL112707, U01 HL112694, U01 HL112695, U01 HL112696, U01 HL112702, U01 HL112708, U01 HL112711, U01 HL112712.
