ABSTRACT
Archetypal analysis and nonnegative matrix factorization (NMF) are staples in a statistician's toolbox for dimension reduction and exploratory data analysis. We describe a geometric approach to both NMF and archetypal analysis by interpreting both problems as finding extreme points of the data cloud. We also develop and analyze an efficient approach to finding extreme points in high dimensions. For modern massive datasets that are too large to fit on a single machine and must be stored in a distributed setting, our approach makes only a small number of passes over the data. In fact, it is possible to obtain the NMF or perform archetypal analysis with just two passes over the data.
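To make the extreme-point view concrete, the following minimal sketch relies on a standard fact: over a finite point set, a generic linear functional attains its maximum and minimum at vertices of the convex hull. It is an illustration under that assumption only and should not be read as the algorithm developed in this paper. The function name `extreme_points_by_projection`, the parameter `num_directions`, and the toy triangle data are all hypothetical.

```python
import random


def extreme_points_by_projection(points, num_directions=200, seed=0):
    """Return sorted indices of points that maximize or minimize
    at least one random linear functional (hypothetical helper)."""
    rng = random.Random(seed)
    dim = len(points[0])
    found = set()
    for _ in range(num_directions):
        # Draw a random Gaussian direction g.
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Project every point onto g.
        scores = [sum(gi * xi for gi, xi in zip(g, p)) for p in points]
        # The maximizer and minimizer of a generic linear functional
        # are vertices of the convex hull of the point set.
        found.add(max(range(len(points)), key=scores.__getitem__))
        found.add(min(range(len(points)), key=scores.__getitem__))
    return sorted(found)


# Toy data (assumed for illustration): three triangle vertices followed by
# 50 points strictly inside the triangle, built as convex combinations
# with strictly positive weights.
vertices = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
data_rng = random.Random(1)
interior = []
for _ in range(50):
    w = [data_rng.random() + 0.1 for _ in range(3)]
    total = sum(w)
    interior.append(
        tuple(sum(wi / total * v[k] for wi, v in zip(w, vertices)) for k in range(2))
    )

points = vertices + interior
extreme = extreme_points_by_projection(points)
```

In this toy setting the recovered indices are those of the three triangle vertices, which play the role of the archetypes (or, in the NMF reading, of the generating columns). One pass over the data per batch of directions suffices, since each projection needs only a running maximum and minimum.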
KEYWORDS:
Acknowledgments
The authors gratefully acknowledge Trevor Hastie, Jason Lee, Philip Pauerstein, Michael Saunders, Jonathan Taylor, Jennifer Tsai, and Lexing Ying for their insightful comments. Trevor Hastie suggested the group-lasso approach to selecting extreme points. A. Damle was supported by an NSF Graduate Research Fellowship DGE-1147470. Y. Sun was partially supported by NIH grant U01GM102098.