Abstract
Gene expression profiling has played an important role in cancer risk classification and has shown promising results. Since gene expression profiling often involves determination of a set of top rank genes for analysis, it is important to evaluate how modeling performance varies with the number of selected top ranked genes incorporated in the model. We used a colon data set collected at Moffitt Cancer Center as an example of the study, and ranked genes based on the univariate Cox proportional hazards model. A set of top ranked genes was selected for evaluation. The selection was done by choosing the top k ranked genes for k = 1 to 12,500. An analysis indicated a considerable variation of classification outcomes when the number of top ranked genes was changed. We developed a predictive risk probability approach to accommodate this variation by identifying a range number of top ranked genes. For each number of top ranked genes, the procedure classifies each patient as having high risk (score = 1) or low risk (score = 0). The categorizations are then averaged, giving a risk score between 0 and 1, thus providing a ranking for the patient's need for further treatment. This approach was applied to the colon data set and demonstrated the strength of this approach by three criteria: First, a univariate Cox proportional hazards model showed a highly statistically significant level (log-rank χ 2 statistics = 110 with p -value <10 −16) for the predictive risk probability classification. Second, the survival tree model used the risk probability to partition patients into five risk groups showing a good separation of survival curves (log-rank χ 2 statistics = 215). In addition, utilization of the risk group status identified a small set of risk genes that may be practical for biological validation. Third, analysis of resampling the risk probability suggested the variation pattern of the log-rank χ 2 in the colon cancer data set was unlikely caused by chance.
ACKNOWLEDGMENTS
This work was supported by grants from the National Institutes of Health (2P30CA76292-08, R01CA112215-03, and R01CA098522-05). The authors acknowledge the use of the services provided by Research Computing, University of South Florida.
The views presented in this article are those of the authors and do not necessarily represent those of the U.S. Food and Drug Administration.
Notes
Censoring rate = 58%.