Research Article

Limitations of clustering with PCA and correlated noise

Received 08 Dec 2022, Accepted 05 Mar 2024, Published online: 05 May 2024

