Abstract
Scientists often embed cells into a lower-dimensional space when studying single-cell RNA-seq data to improve downstream analyses such as developmental trajectory analysis, but the statistical properties of such nonlinear embedding methods are often not well understood. In this article, we develop the exponential-family SVD (eSVD), a nonlinear method that jointly embeds cells and genes with respect to a random dot product model using exponential-family distributions. Our estimator uses alternating minimization, which yields a computationally efficient method and enables us to prove identifiability conditions and consistency and to provide statistically principled tuning procedures. All these qualities help advance the single-cell embedding literature, and we provide extensive simulations to demonstrate that the eSVD is competitive with other embedding methods. We apply the eSVD, using Gaussian distributions whose standard deviations are proportional to their means, to analyze a single-cell dataset of oligodendrocytes in mouse brains. Using the eSVD-estimated embedding, we then investigate the developmental trajectories of the oligodendrocytes. While previous results are not able to distinguish the trajectories among the mature oligodendrocyte cell types, our diagnostics and results demonstrate that there are two major developmental trajectories that diverge at mature oligodendrocytes. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available online.
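To make the idea of alternating minimization over a dot product model concrete, the following is a minimal sketch and not the authors' implementation. It assumes a Poisson distribution (chosen here only for simplicity; the article itself uses a Gaussian whose standard deviation is proportional to its mean), treats the inner product of a cell embedding and a gene embedding as the natural parameter, and uses backtracking gradient steps as a stand-in for whatever solver the eSVD actually uses. The names `neg_log_lik`, `descend`, and `fit_embedding` are hypothetical.

```python
import numpy as np


def neg_log_lik(A, U, V):
    """Poisson negative log-likelihood (up to a constant), natural parameters U @ V.T."""
    theta = U @ V.T
    with np.errstate(over="ignore"):  # overly large trial steps may overflow to inf
        return float(np.sum(np.exp(theta) - A * theta))


def descend(A, U, V, update_U, step=1.0, shrink=0.5, max_halvings=30):
    """One backtracking gradient step on U (V held fixed) or on V (U held fixed)."""
    resid = np.exp(U @ V.T) - A                # gradient with respect to theta
    grad = resid @ V if update_U else resid.T @ U
    current = neg_log_lik(A, U, V)
    for _ in range(max_halvings):
        U_new = U - step * grad if update_U else U
        V_new = V if update_U else V - step * grad
        if neg_log_lik(A, U_new, V_new) < current:
            return U_new, V_new                # accept only a strict decrease
        step *= shrink
    return U, V                                # no improving step found


def fit_embedding(A, k, n_iter=50, seed=0):
    """Alternate between updating cell embeddings U and gene embeddings V."""
    rng = np.random.default_rng(seed)
    n, p = A.shape
    U = 0.1 * rng.standard_normal((n, k))
    V = 0.1 * rng.standard_normal((p, k))
    history = [neg_log_lik(A, U, V)]
    for _ in range(n_iter):
        U, V = descend(A, U, V, update_U=True)
        U, V = descend(A, U, V, update_U=False)
        history.append(neg_log_lik(A, U, V))
    return U, V, history


# Simulated counts (hypothetical data): 30 cells, 20 genes, 2 latent dimensions.
rng = np.random.default_rng(1)
U_true = rng.normal(0.0, 0.5, size=(30, 2))
V_true = rng.normal(0.0, 0.5, size=(20, 2))
A = rng.poisson(np.exp(U_true @ V_true.T))

U_hat, V_hat, history = fit_embedding(A, k=2)
```

Because each half-step is accepted only when it lowers the objective, the recorded `history` is non-increasing by construction. The rows of `U_hat` and `V_hat` play the role of the low-dimensional cell and gene embeddings, respectively; they are determined only up to an invertible transformation unless further identifiability constraints are imposed.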
Supplementary Materials
The following describes the sections of the supplementary materials. In Appendix A, we discuss the publicly available code and data for reproducibility. In Appendix B, we discuss additional details of the eSVD, including its initialization, tuning procedure, usage for various exponential-family distributions, and high-level comparisons to other methods. In Appendix C, we formally describe the analysis pipelines we used when analyzing the oligodendrocytes in Sections 2 and 7. In Appendices D and E, we formalize the statistical theory for estimating the matrix of natural parameters and specialize our theory to the curved Gaussian distribution, respectively, as alluded to in Section 5. In Appendix F, we describe additional simulation setups and results, extending those described in Section 6. In Appendix G, we describe in detail our modifications to Slingshot and our method for constructing the uncertainty tube. In Appendix H, we describe additional results from our analysis of the oligodendrocytes in Section 7, including results on highly informative genes and empirical results when using other methods to analyze this dataset, as well as a clustering analysis of a second single-cell dataset. Appendices I and J contain the proofs of all our theoretical results.