Abstract
This paper presents a new feature selection framework based on the L0-norm, in which the data are summarized by the moments of their class-conditional densities. The discontinuity of the L0-norm, however, makes the optimal solution difficult to find. We apply a suitable approximation of the L0-norm, together with a bound on the misclassification probability involving only the mean and covariance of the data, to derive a robust difference of convex functions (DC) programming formulation, which is then solved efficiently by the DC optimization algorithm. A kernelized version of the problem is also presented. Experimental results on both real and synthetic datasets show that the proposed formulations select fewer features than the traditional Minimax Probability Machine and L1-norm based methods.
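
The two ingredients named in the abstract can be sketched with standard formulas; the following is an illustrative sketch using the commonly used exponential surrogate for the L0-norm and the Chebyshev-type bound underlying the Minimax Probability Machine, not necessarily the exact forms adopted in this paper:

```latex
% Commonly used smooth surrogate for the L0-norm (Bradley & Mangasarian style),
% with approximation parameter \alpha > 0:
\|w\|_0 \;\approx\; \sum_{i=1}^{n} \bigl(1 - e^{-\alpha |w_i|}\bigr)

% Chebyshev-type bound on misclassification probability, using only the
% mean \mu and covariance \Sigma of a class: for a hyperplane w^{\top}x = b,
\inf_{x \sim (\mu,\Sigma)} \Pr\bigl(w^{\top}x \ge b\bigr) \;\ge\; \kappa^2/(1+\kappa^2),
\quad \text{where } \kappa = \frac{w^{\top}\mu - b}{\sqrt{w^{\top}\Sigma\, w}} .
```

Requiring this worst-case probability to be at least $\alpha$ yields the second-order cone constraint $w^{\top}\mu - b \ge \kappa(\alpha)\sqrt{w^{\top}\Sigma\, w}$ with $\kappa(\alpha) = \sqrt{\alpha/(1-\alpha)}$, which is the form such moment-based formulations typically optimize.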