Abstract
In this study, the authors propose a new feature selection scheme, incremental forward feature selection, which is inspired by incremental reduced support vector machines. In this method, a new feature is added to the currently selected feature subset if it brings in the most extra information. This information is measured by the distance between the new feature vector and the column space spanned by the current feature subset. The incremental forward feature selection scheme can exclude highly linearly correlated features, which provide redundant information and might degrade the efficiency of learning algorithms. The method is compared with the weight score approach and the 1-norm support vector machine on two well-known microarray gene expression data sets, the acute leukemia and colon cancer data sets. These two data sets have very few observations but a huge number of genes. The linear smooth support vector machine was applied to the feature subsets selected by each of the three schemes, and it obtained slightly better classification results on the subsets selected by the 1-norm support vector machine and incremental forward feature selection than on those selected by the weight score approach. Finally, the authors claim that the remaining genes still contain useful information. The previously selected features are iteratively removed from the data sets, and the feature selection and classification steps are repeated for four rounds. The results show that many distinct feature subsets can provide enough information for classification tasks in these two microarray gene expression data sets.
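The selection criterion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that features are the columns of a matrix `X`, that the distance to the column space is the Euclidean norm of the residual after orthogonal projection, and that selection stops after `k` features or once the largest distance falls below a tolerance `tol`. The function name, `k`, and `tol` are hypothetical and are not taken from the paper.

```python
import numpy as np

def incremental_forward_selection(X, k, tol=1e-8):
    """Greedy sketch: pick up to k columns of X, each time adding the column
    that lies farthest from the span of the columns already selected."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    selected = []
    # Orthonormal basis of the span of the selected columns (empty at first).
    Q = np.zeros((n_samples, 0))
    for _ in range(min(k, n_features)):
        # Residual of every column after projecting it onto span(Q).
        residual = X - Q @ (Q.T @ X)
        dist = np.linalg.norm(residual, axis=0)
        if selected:
            dist[selected] = -np.inf  # never pick the same column twice
        j = int(np.argmax(dist))
        if dist[j] <= tol:
            break  # every remaining column is numerically inside the span
        selected.append(j)
        # Gram-Schmidt step: extend the basis with the normalised residual.
        Q = np.hstack([Q, (residual[:, j] / dist[j])[:, None]])
    return selected
```

In this sketch a column that is a scalar multiple of an already selected column has zero distance to the span and is therefore never chosen, which is the sense in which highly linearly correlated features are excluded.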
Notes
Golub = Golub et al., 1999; Weston (2001) = Weston et al., 2001; Guyon = Guyon et al., 2002; Zhu = Zhu et al., 2004; N/A = result not available.
Weston (2001) = Weston et al., 2001; Guyon = Guyon et al., 2002; Weston (2003) = Weston et al., 2003.
Round 1 = select genes from the original data set; Round 2 = select genes from the remaining genes of Round 1; Round 3 = select genes from the remaining genes of Round 2; Round 4 = select genes from the remaining genes of Round 3.
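The round structure in the note above can be sketched as a loop that removes each round's selected genes before selecting again. This is only an illustration under stated assumptions: `select` is a hypothetical callable that takes a data matrix and returns column indices, and `select_in_rounds` and `n_rounds` are names introduced here, not taken from the paper. Any of the three selection schemes could in principle play the role of `select`.

```python
import numpy as np

def select_in_rounds(X, select, n_rounds=4):
    """Run `select` repeatedly, removing each round's picks from the pool.

    Returns one list of original column indices per round."""
    X = np.asarray(X)
    remaining = np.arange(X.shape[1])  # original indices still available
    rounds = []
    for _ in range(n_rounds):
        if remaining.size == 0:
            break  # nothing left to select from
        # `select` sees only the remaining columns and returns local indices.
        local = np.asarray(select(X[:, remaining]), dtype=int)
        rounds.append(remaining[local].tolist())  # map back to original indices
        remaining = np.delete(remaining, local)
    return rounds
```

Each round therefore works on strictly fewer genes than the previous one, so the subsets returned for different rounds are disjoint by construction.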