Research Article

Clustering-based undersampling with random over sampling examples and support vector machine for imbalanced classification of breast cancer diagnosis


Abstract

To address the two-class imbalanced classification problem in breast cancer diagnosis, we propose a sample-selection-based hybrid model, RK-SVM, that combines Random Over Sampling Examples (ROSE), K-means, and a Support Vector Machine (SVM). ROSE is used to balance the dataset and thereby improve the diagnostic accuracy of the SVM. Clustering contributes a distinct sample-selection factor that encourages selecting samples near the class boundary; its purpose is to reduce the risk of removing useful samples and to improve the efficiency of sample selection. To test the new hybrid classifier, we apply it to breast cancer datasets and to three other datasets from the University of California Irvine (UCI) machine learning repository that are commonly used in class-imbalanced learning. Extensive experimental results show that the proposed hybrid method outperforms most of the competing algorithms in terms of G-mean and accuracy. The results also show that the method achieves superior performance on binary classification problems.
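The abstract reports G-mean alongside accuracy. As a minimal sketch, assuming the standard definition of G-mean as the geometric mean of sensitivity and specificity (the abstract does not spell out the formula), the following shows why accuracy alone can mislead on imbalanced data. The function names and the toy labels are hypothetical and are not taken from the paper's experiments.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    # Guard against a class being absent from y_true.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy labels: 90 majority-class (0) and 10 minority-class (1) samples.
y_true = [0] * 90 + [1] * 10

# A degenerate classifier that always predicts the majority class.
always_negative = [0] * 100

# A classifier that recovers 8 of 10 minority samples at the cost of 9 false alarms.
balanced = [0] * 81 + [1] * 9 + [1] * 8 + [0] * 2

for name, y_pred in [("always_negative", always_negative), ("balanced", balanced)]:
    print(name, round(accuracy(y_true, y_pred), 3), round(g_mean(y_true, y_pred), 3))
```

On these toy labels the always-negative classifier has slightly higher accuracy but a G-mean of zero, because it never detects the minority class. This is the usual motivation for reporting G-mean in class-imbalanced learning.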

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

This work is supported by the National Natural Science Foundation of China under Grant No. 51866015 and by the Shaanxi Technology Committee Industrial Public Relation Project (No. 2018GY-146).