Abstract
We apply robust classification algorithms to high-dimensional genomic data to find biomarkers, by analyzing variable importance, that enable better diagnosis of disease, earlier intervention, or more effective assignment of therapies. The goal is to use variable importance ranking to isolate a set of important genes that can be used to classify life-threatening diseases with respect to prognosis or type, in order to maximize efficacy or minimize toxicity in personalized treatment of such diseases. A ranking method is proposed, several other methods for selecting a set of important genes to use as genomic biomarkers are presented, and the performance of the selection procedures in patient classification is evaluated by cross-validation. The various selection algorithms are applied to published high-dimensional genomic data sets using several well-known classification methods. For each data set, the set of genes selected on the basis of variable importance that performed best in classification is reported. The classification algorithm with the proposed ranking method is shown to be competitive with other selection methods for discovering genomic biomarkers underlying both adverse and efficacious outcomes, for improving individualized treatment of patients with life-threatening diseases.
ACKNOWLEDGMENTS
Hojin Moon's research was partially supported by the Scholarly and Creative Activities Committee (SCAC) Award from CSULB. Hongshik Ahn's research was partially supported by the Faculty Research Participation Program at the NCTR administered by the Oak Ridge Institute for Science and Education through an interagency agreement between USDOE and USFDA.
Notes
1 k∗ = 1 with CERPWFM, CERPMDI, RFMDA, RFMDI, SVMRFE, BW; k∗ = 3 with the t-test. Values in boldface indicate lymphoma data.
1 k∗ = 1 with SVMRFE; k∗ = 3 with CERPWFM, CERPMDI, RFMDA, RFMDI; k∗ = 5 with BW, the t-test.
Values in boldface indicate pediatric AML data.
Note: Since the genes selected by the t-test and by the BW ratio are the same, only the t-test is reported. The DLDA classification algorithm is used for illustration. PPV and NPV stand for positive and negative predictive values, respectively.
T = Set of genes selected by the t-test; C = set of genes selected by CERP; T ∩ C = common set of genes selected by the t-test and CERP; T ∪ C = combined set of genes selected by the t-test or CERP; (T − C) ∪ (C − T) = combined mutually exclusive set of genes selected by the t-test or CERP.
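The set notation above can be illustrated with a minimal sketch using Python's built-in set type. The gene identifiers below are hypothetical placeholders chosen for illustration; they are not genes reported in this study.

```python
# Hypothetical gene identifiers, for illustration only (not from the study).
t_test_genes = {"g1", "g2", "g3", "g4"}  # T: genes selected by the t-test
cerp_genes = {"g3", "g4", "g5"}          # C: genes selected by CERP

common = t_test_genes & cerp_genes       # T ∩ C: selected by both methods
combined = t_test_genes | cerp_genes     # T ∪ C: selected by either method
# (T − C) ∪ (C − T): selected by exactly one of the two methods
exclusive = (t_test_genes - cerp_genes) | (cerp_genes - t_test_genes)

print("T ∩ C:", sorted(common))
print("T ∪ C:", sorted(combined))
print("(T − C) ∪ (C − T):", sorted(exclusive))
```

The last set is the symmetric difference of T and C, so `t_test_genes ^ cerp_genes` would give the same result.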