Abstract
Predicting patient survival times from the proteomic profile of bodily fluids, such as plasma and serum, has been of interest in biomedical research. In this article, we address this problem for patient serum using matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) data of non-small cell lung cancer (NSCLC) patients. Because the number of features in a mass spectrum is much larger than the study sample size, traditional linear regression modeling of survival times on a large number of proteomic features is not feasible. Hence, we consider latent factor and regularized/penalized methods for fitting such models in order to predict patient survival from the mass spectrometry features. Extensive numerical studies involving both simulated and real mass spectrometry data are used to compare four popular regression methods, namely, partial least squares (PLS), sparse partial least squares (SPLS), least absolute shrinkage and selection operator (LASSO), and elastic net regularization, on processed spectra. Right censoring is handled through a residual-based multiple imputation. The results, measured by mean squared error of fit and prediction, vary considerably with the method used, the tuning parameters of the method, and the features selected after preprocessing. Overall, more complex methods such as the elastic net and SPLS perform better, provided the operational parameters are chosen carefully via cross-validation. For survival time prediction, we recommend using the elastic net based on a selected set of features.
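For reference, the penalized criterion behind two of the four methods can be written out. This is a sketch in one common parametrization, not necessarily the one used in this article. The symbols are notation assumed here for illustration: $y_i$ is the (possibly imputed) survival response for patient $i$, $x_{ij}$ is the value of spectral feature $j$ for patient $i$, $\beta_0$ and $\beta$ are the coefficients, and $\lambda$ and $\alpha$ are the tuning parameters.

```latex
% Elastic net criterion (one common parametrization; notation assumed here).
% n patients, p spectral features, with p much larger than n.
(\hat{\beta}_0, \hat{\beta})
  = \arg\min_{\beta_0,\,\beta}
    \frac{1}{2n}\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2}
    + \lambda\Bigl(\alpha \sum_{j=1}^{p}\lvert\beta_j\rvert
    + \frac{1-\alpha}{2}\sum_{j=1}^{p}\beta_j^{2}\Bigr),
\qquad \lambda \ge 0,\quad 0 \le \alpha \le 1.
% alpha = 1 gives the LASSO penalty; alpha = 0 gives the ridge penalty.
```

In this form, setting $\alpha = 1$ recovers the LASSO, and the pair $(\lambda, \alpha)$ is an example of the operational parameters that would be chosen by cross-validation.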
[Supplementary materials are available for this article. Go to the publisher's online edition of Communications in Statistics—Simulation and Computation for the following free supplemental resource: a file containing tables and figures showing the mean squared error of fit in a simulated model, the estimated mean squared error of fit for the Milan NSCLC data, the median value of the optimum number of steps or number of components based on minimization of EMSEP, the mean squared error of prediction in a simulated model, observed versus fitted values in the Milan NSCLC data, and feature identification.]
Mathematics Subject Classification:
Acknowledgments
This work was supported by grants from the National Science Foundation (NSF-DMS-0805559 to Susmita Datta) and the National Institutes of Health (NIH-CA133844 to Susmita Datta). We thank David P. Carbone for kindly providing us with the Milan NSCLC data. We gratefully acknowledge Johannes Voortman and Thang V. Pham for graciously sharing the Netherlands NSCLC data with us. We thank the referees for their constructive comments.