Article: 2355424 | Received 22 Oct 2023, Accepted 23 Apr 2024, Published online: 20 May 2024

ABSTRACT

Imbalanced classification problems are of great practical significance, and many methods have been developed to deal with them, e.g. eXtreme Gradient Boosting (XGBoost), Logistic Regression (LR), Decision Trees (DT), and Support Vector Machines (SVM). Recently, a novel Generalization-Memorization Machine (GMM) was proposed to maintain good generalization ability with zero empirical risk for binary classification. This paper proposes a Weighted Generalization Memorization Machine (WGMM) for imbalanced classification. By improving the memory cost function and memory influence function of GMM, our WGMM also maintains zero empirical risk with good generalization ability for imbalanced classification learning. The new adaptive memory influence function in our WGMM describes each sample individually, without being affected by training samples from the other class. We conduct experiments on 31 datasets and compare the WGMM with several other classification methods. The results exhibit the effectiveness of the WGMM.

Introduction

Imbalanced classification problems often prioritize the minority class, which typically carries the more crucial information. Numerous real-world applications encounter imbalanced scenarios, including but not limited to network intrusion detection (Chen et al. Citation2022), cancer detection (Lilhore et al. Citation2022; Seo et al. Citation2022), mineral exploration (Xiong and Zuo Citation2020), illegal credit card transactions (Sudha and Akila Citation2021), bank fraud detection (Abdelhamid, Khaoula, and Atika Citation2014), and advertisement click-through rate prediction (Zhang, Fu, and Xiao Citation2017). Two main categories of methods are used to address imbalanced classification problems. The first category, known as data-level methods, focuses on transforming imbalanced data into balanced data; it includes oversampling methods (Chawla et al. Citation2002; He and Garcia Citation2009; He et al. Citation2008; Nguyen, Cooper, and Kamei Citation2011; Zheng Citation2020) and undersampling methods (Hart Citation1968; Sui, Wei, and Zhao Citation2015). The second category, algorithm-level methods, adjusts the weights of the majority and minority classes within models. Examples include cost-sensitive methods (Bach, Heckerman, and Horvitz Citation2006), kernel adaptation methods (Mathew et al. Citation2017), and hyperplane shifting methods (Datta and Das Citation2015), as well as Logistic Regression (LR) (Luo et al. Citation2019), Cost-Sensitive Decision Trees (Krawczyk, Woźniak, and Schaefer Citation2014), Cost-Sensitive Neural Networks (Krawczyk and Woźniak Citation2015) and the Support Vector Machine (SVM); among these, the SVM is particularly suited to imbalanced problems owing to its strong generalization ability.

The SVM, which minimizes the sum of empirical and expected risks, has been used extensively in practical applications, including face recognition (Chaabane et al. Citation2022), cancer detection (Alfian et al. Citation2022; Hussain et al. Citation2011; Seo et al. Citation2022), voice recognition (Harvianto et al. Citation2016) and handwritten character recognition (Hamdan and Sathesh Citation2021; Kibria et al. Citation2020; Kumar, Sharma, and Jindal Citation2011; Nasien, Haron, and Yuhaniz Citation2010). However, the classic SVM cannot always guarantee zero empirical risk, i.e. classifying all training samples correctly. To this end, Vapnik proposed a generalization-memorization kernel (Vapnik and Izmailov Citation2021) that classifies the training samples correctly while retaining good generalization ability. Subsequently, the generalization-memorization machine (GMM) (Wang and Shao Citation2022) was proposed to account for the mechanism of the generalization-memorization kernel. The GMM enhances memorization by incorporating a memory cost function and improves generalization through a memory influence function. Since the memory influence function is predefined uniformly on the training set, inessential samples may exert an outsized effect on prediction, especially for imbalanced problems.

In this paper, we propose a Weighted Generalization Memorization Machine (WGMM) to deal with imbalanced classification problems. Our WGMM employs distinct memory cost functions for the majority and minority classes while preserving zero empirical risk. Furthermore, our WGMM introduces a self-adaptive memory influence function to adapt to various imbalance problems.

The structure of this paper is as follows: In the next section, a brief overview of GMM and imbalanced classification methods is given. The third part establishes our WGMM model. The last two sections present numerical experiments and conclusions.

Related Works

Imbalanced Classification

There are two primary approaches to deal with imbalanced classification problems: data-level (DL) preprocessing methods (Chawla et al. Citation2002; Hart Citation1968; He and Garcia Citation2009; He et al. Citation2008; Nguyen, Cooper, and Kamei Citation2011; Sui, Wei, and Zhao Citation2015; Zheng Citation2020) and algorithm-level (AL) methods (Bach, Heckerman, and Horvitz Citation2006; Batuwita and Palade Citation2010; Datta and Das Citation2015; Imam, Ting, and Kamruzzaman Citation2006; Iranmehr, Masnadi-Shirazi, and Vasconcelos Citation2019; Mathew et al. Citation2017). Data-level preprocessing methods mitigate class imbalance by adding or removing samples so that the classes are balanced before model training; oversampling and undersampling are the usual strategies. Oversampling methods rebalance classes by replicating or generating samples in the minority class. For instance, Random Oversampling (ROS) (He and Garcia Citation2009) duplicates samples from the minority class, while the Synthetic Minority Over-sampling Technique (SMOTE) (Chawla et al. Citation2002) generates artificial samples by linearly interpolating between minority-class samples. Other SMOTE variants also exist, such as SVM-SMOTE (Nguyen, Cooper, and Kamei Citation2011), Borderline-SMOTE (Zheng Citation2020), Kmeans-SMOTE (Zheng Citation2020) and ADASYN (He et al. Citation2008). Undersampling methods balance the dataset by removing instances from the majority class. For instance, Random Undersampling (RUS) (Sui, Wei, and Zhao Citation2015) randomly removes instances from the majority class until a specified class balance is achieved.

Algorithm-level methods construct a particular classifier to handle imbalanced classification problems. Many approaches follow this strategy, such as Fuzzy Support Vector Machines for Class Imbalance Learning (FSVM-CIL) (Batuwita and Palade Citation2010), Cost-Sensitive Support Vector Machines (CSSVM) (Iranmehr, Masnadi-Shirazi, and Vasconcelos Citation2019) and z-SVM (Imam, Ting, and Kamruzzaman Citation2006). By introducing fuzzy membership values, FSVM-CIL prioritizes the minority class while still accounting for the majority class, which improves performance on imbalanced datasets. CSSVM assigns distinct misclassification costs to different classes, primarily to emphasize the minority class. z-SVM likewise adopts a cost-sensitive approach, assigning class-specific misclassification costs, which is especially valuable when a class is rare or holds greater importance.

Generalization-Memorization Machine

Recently, the generalization-memorization machine (GMM), an SVM with a new memory mechanism, has been proposed for classification in the n-dimensional real space $\mathbb{R}^n$. Suppose $T=\{(x_i,y_i)\,|\,i=1,2,\ldots,m\}$ is the training set, where $x_i\in\mathbb{R}^n$ is the input sample and $y_i\in\{+1,-1\}$ is the corresponding output. We organize the training set into $X$ and $Y$, where $X\in\mathbb{R}^{n\times m}$ is the sample matrix whose columns are the $x_i$, and $Y$ is the diagonal matrix of labels with diagonal elements $Y_{ii}=y_i$, $i=1,2,\ldots,m$.

GMM considers the optimization problem as

$$\min_{w,b,c,d}\ \frac{1}{2}\|w\|^2+\frac{\lambda}{2}\|c\|^2+C\sum_{i=1}^{m}d_i \quad \text{s.t.}\quad y_i\Big(\varphi(x_i)^{T}w+b+\sum_{j=1}^{m}y_j c_j\,\delta(x_i,x_j)\Big)\ge 1-d_i,\ \ d_i\ge 0,\ \ i=1,\ldots,m, \tag{1}$$

where $\|\cdot\|$ is the L2 norm, $\varphi(\cdot)$ is a mapping, $\lambda$ and $C$ are positive parameters, and $d_i$ is a slack variable. $c$ is a column vector composed of $c_j\ (j=1,\ldots,m)$, called the memory cost, and $\delta(x_i,x_j)$ is called the memory influence of $x_j$ on $x_i$.

The goal of this optimization problem is to find a hyperplane that correctly classifies the training set with the largest margin. The first term $\frac{1}{2}\|w\|^2$ is half the squared norm of $w$; since the margin is inversely proportional to $\|w\|$, minimizing this term maximizes the margin. The second term $\frac{\lambda}{2}\|c\|^2$ is the regularization term for the memory cost, where $\lambda$ controls the minimization of the sample memory costs; increasing $\lambda$ increases the penalty on the memory cost, thus placing more emphasis on classification accuracy. The term $C\sum_{i=1}^{m}d_i$ penalizes the re-memorization of training samples and thereby controls the size and complexity of the model. The constraint $y_i\big(\varphi(x_i)^{T}w+b+\sum_{j=1}^{m}y_j c_j\,\delta(x_i,x_j)\big)\ge 1-d_i$ requires each training sample to be correctly classified.

For a new sample $x$ outside the training set, it is classified into the positive class if $\varphi(x)^{T}w+b+\sum_{j=1}^{m}y_j c(x_j)\,\delta(x,x_j)>0$, and into the negative class otherwise.

It has been proved that the GMM can obtain zero empirical risk, and its classification performance in numerical experiments is considerably better than that of SVMs.

Weighted Generalization Memorization Machine (WGMM)

While retaining all training samples, one advantage of GMM is that it can achieve higher test accuracy than SVM. However, GMM remains susceptible to sample imbalance. To mitigate this, we enhance GMM by improving both the memory cost function and the memory influence function, resulting in our WGMM.

Model Formation

We introduce a weighted memory cost component into the original objective function so as to capture the distinct costs associated with different samples.

The training set consists of a positive sample set $T_1=\{(x_i,y_i)\,|\,i=1,2,\ldots,p\}$ and a negative sample set $T_2=\{(x_k,y_k)\,|\,k=1,2,\ldots,q\}$ with $p+q=m$. Without loss of generality, we take the positive class as the minority class of the imbalanced dataset: the positive sample matrix $X_1\in\mathbb{R}^{n\times p}$ represents the minority class and the negative sample matrix $X_2\in\mathbb{R}^{n\times q}$ represents the majority class, where the columns of $X_1$ and $X_2$ are the corresponding samples.

The optimization problem of WGMM can be expressed as follows:

$$\begin{aligned}
\min_{w,b,c,d}\ & \frac{1}{2}\|w\|^2+\frac{\lambda_1}{2}\|c_1\|^2+\frac{\lambda_2}{2}\|c_2\|^2+\lambda_3\sum_{i=1}^{p}d_i+\lambda_4\sum_{k=1}^{q}d_k\\
\text{s.t.}\ & y_i\Big(\varphi(x_i)^{T}w+b+\sum_{j=1}^{m}y_j c_j\,\delta(x_i,x_j)\Big)\ge 1-d_i,\\
& y_k\Big(\varphi(x_k)^{T}w+b+\sum_{j=1}^{m}y_j c_j\,\delta(x_k,x_j)\Big)\ge 1-d_k,\\
& d_i\ge 0,\ i=1,\ldots,p,\qquad d_k\ge 0,\ k=1,\ldots,q,
\end{aligned} \tag{2}$$

where $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ are positive parameters. $c_1$ and $c_2$ are the memory costs for remembering positive and negative samples, respectively: $c_1$ is a column vector with entries $c_i=c(x_i)\ (i=1,\ldots,p)$, the memory cost of the $i$th positive sample, and $c_2$ is a column vector with entries $c_k=c(x_k)\ (k=1,\ldots,q)$, the memory cost of the $k$th negative sample. $d_i=d(x_i)\ (i=1,\ldots,p)$ is the slack variable of the $i$th positive sample, and $d_k=d(x_k)\ (k=1,\ldots,q)$ is the slack variable of the $k$th negative sample.

The objective of problem (2) seeks a large margin with memory costs as low as possible, while also controlling the complexity of the model. In the constraints of (2), the memory cost is a variable that is split into $c_1$ and $c_2$, each multiplied by a different parameter. The constraints are defined separately for the positive and negative classes, allowing individual control over the memory costs of each class and thus reflecting the importance of each class. GMM requires the influence function $\delta(x_i,x_j)$ to be predefined; in the next subsection we introduce an adaptive influence function that adjusts to the samples. The retention of memory for all training samples is guaranteed by the constraints in (2).

The decision function of our WGMM is

$$g(x)=\begin{cases}\varphi(x)^{T}w+b+\displaystyle\sum_{j=1}^{m}y_j c_j\,\delta(x_j,x_i)+y_i d_i, & \text{if } x=x_i,\ x_i\in X,\\[2mm] \varphi(x)^{T}w+b+\displaystyle\sum_{j=1}^{m}y_j c_j\,\delta(x_j,x), & \text{otherwise.}\end{cases} \tag{3}$$

Formula (3) defines the decision function as a piecewise function. When the input test sample belongs to the training set, the model employs the first branch to make decisions: $\sum_{j=1}^{m}y_j c_j\,\delta(x_j,x_i)$ represents the combined influence of the memorized training samples on the prediction, and the function $\delta$ governs the memory capacity of the model. The term $y_i d_i$ is the re-memorization item for the sample, which enhances memory accuracy and guarantees a training accuracy of 1. Conversely, when the input sample is not part of the training set, $d_i$ has no effect on $x$, and the model uses the second branch as the testing function.
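To make the piecewise rule concrete, the following is a minimal sketch (not code from the paper) of how the decision function (3) could be evaluated once the model quantities are available; the function name `wgmm_decision`, the `kernel` and `delta_fn` callbacks, and the argument layout are our own illustrative choices.

```python
import numpy as np

def wgmm_decision(x, X1, X2, alpha, beta, b, c, d, y_train, delta_fn, kernel):
    """Evaluate the piecewise decision function g(x) of Equation (3) (sketch).

    X1, X2   : positive / negative training samples, one sample per row
    alpha    : dual variables of the positive class; beta: of the negative class
    b        : bias; c, d: memory costs and slack values for all m samples,
               ordered to match np.vstack([X1, X2]) and y_train
    delta_fn : memory influence function delta(x_j, x)
    kernel   : kernel(x, Z) returns the vector [K(x, z) for each row z of Z]
    """
    X = np.vstack([X1, X2])                       # all m training samples
    # phi(x)^T w with w = phi(X1) alpha - phi(X2) beta, via the kernel trick
    g = kernel(x, X1) @ alpha - kernel(x, X2) @ beta + b
    # memory term sum_j y_j c_j delta(x_j, x)
    g += sum(y_train[j] * c[j] * delta_fn(X[j], x) for j in range(len(X)))
    # if x coincides with a training sample x_i, add the re-memory term y_i d_i
    for i in range(len(X)):
        if np.allclose(x, X[i]):
            g += y_train[i] * d[i]
            break
    return np.sign(g)
```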

We now delve into the dual problem associated with (2). The original problem (2) can be written in matrix form

$$\begin{aligned}
\min_{w,b,c,d}\ & \frac{1}{2}\|w\|^2+\frac{\lambda_1}{2}\|c_1\|^2+\frac{\lambda_2}{2}\|c_2\|^2+\lambda_3 e_1^{T}d_1+\lambda_4 e_2^{T}d_2\\
\text{s.t.}\ & \varphi(X_1)^{T}w+b e_1+\Delta_{11}c_1-\Delta_{12}c_2\ge e_1-d_1,\\
& -\varphi(X_2)^{T}w-b e_2-\Delta_{21}c_1+\Delta_{22}c_2\ge e_2-d_2,\\
& d_1\ge 0,\quad d_2\ge 0,
\end{aligned} \tag{4}$$

where $d_1$ and $d_2$ are the slack vectors for the positive and negative samples: $d_1$ is a column vector composed of the $d_i$ and $d_2$ is a column vector composed of the $d_k$. $\Delta_{11}\in\mathbb{R}^{p\times p}$ has elements $\delta(x_i,x_j)\ (i,j=1,\ldots,p)$, and $\Delta_{12}\in\mathbb{R}^{p\times q}$, $\Delta_{21}\in\mathbb{R}^{q\times p}$, $\Delta_{22}\in\mathbb{R}^{q\times q}$ are defined analogously. $e_1$ and $e_2$ are vectors of ones of dimensions $p$ and $q$, respectively.
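As an illustration of how the block matrices in (4) might be assembled, here is a small sketch; `delta_fn` is a placeholder for whichever memory influence function is used (the RBF-style function of GMM or the adaptive one introduced in the next subsection), and all names are ours.

```python
import numpy as np

def influence_matrix(A, B, delta_fn):
    """Matrix with entries delta(a_i, b_j) for rows a_i of A and rows b_j of B."""
    return np.array([[delta_fn(a, b) for b in B] for a in A])

def build_blocks(X1, X2, delta_fn):
    """Assemble Delta_11, Delta_12, Delta_21, Delta_22 and the all-ones vectors of (4)."""
    D11 = influence_matrix(X1, X1, delta_fn)   # p x p
    D12 = influence_matrix(X1, X2, delta_fn)   # p x q
    D21 = influence_matrix(X2, X1, delta_fn)   # q x p
    D22 = influence_matrix(X2, X2, delta_fn)   # q x q
    e1 = np.ones(len(X1))                      # p-dimensional vector of ones
    e2 = np.ones(len(X2))                      # q-dimensional vector of ones
    return D11, D12, D21, D22, e1, e2
```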

To solve optimization problem (4), we apply Lagrangian duality to obtain the optimal solution of the primal problem. We construct the Lagrange function

$$\begin{aligned}
L=\ & \frac{1}{2}\|w\|^2+\frac{\lambda_1}{2}\|c_1\|^2+\frac{\lambda_2}{2}\|c_2\|^2+\lambda_3 e_1^{T}d_1+\lambda_4 e_2^{T}d_2\\
& -\alpha^{T}\big(\varphi(X_1)^{T}w+b e_1+\Delta_{11}c_1-\Delta_{12}c_2-e_1+d_1\big)\\
& +\beta^{T}\big(\varphi(X_2)^{T}w+b e_2+\Delta_{21}c_1-\Delta_{22}c_2+e_2-d_2\big)\\
& -\gamma_1^{T}d_1-\gamma_2^{T}d_2.
\end{aligned} \tag{5}$$

Lagrange multiplier vectors $\alpha,\gamma_1\in\mathbb{R}^{p}$ and $\beta,\gamma_2\in\mathbb{R}^{q}$ are introduced for the inequality constraints. Next, we present the Karush–Kuhn–Tucker (KKT) conditions (Fletcher Citation2000) for (4). The partial derivatives of (5) with respect to the variables $w$, $b$, $c_1$, $c_2$, $d_1$ and $d_2$ are

$$\frac{\partial L}{\partial w}=w-\varphi(X_1)\alpha+\varphi(X_2)\beta=0, \tag{6}$$
$$\frac{\partial L}{\partial b}=-e_1^{T}\alpha+e_2^{T}\beta=0, \tag{7}$$
$$\frac{\partial L}{\partial c_1}=\lambda_1 c_1-\Delta_{11}^{T}\alpha+\Delta_{21}^{T}\beta=0, \tag{8}$$
$$\frac{\partial L}{\partial c_2}=\lambda_2 c_2-\Delta_{22}^{T}\beta+\Delta_{12}^{T}\alpha=0, \tag{9}$$
$$\frac{\partial L}{\partial d_1}=\lambda_3 e_1-\alpha-\gamma_1=0, \tag{10}$$

and

$$\frac{\partial L}{\partial d_2}=\lambda_4 e_2-\beta-\gamma_2=0. \tag{11}$$

From (6)–(11) we obtain

$$w=\varphi(X_1)\alpha-\varphi(X_2)\beta, \tag{12}$$
$$c_1=\frac{1}{\lambda_1}\big(\Delta_{11}^{T}\alpha-\Delta_{21}^{T}\beta\big), \tag{13}$$
$$c_2=\frac{1}{\lambda_2}\big(\Delta_{22}^{T}\beta-\Delta_{12}^{T}\alpha\big), \tag{14}$$
$$\gamma_1=\lambda_3 e_1-\alpha\ge 0,\quad \alpha\le\lambda_3 e_1,\quad \alpha=\lambda_3 e_1-\gamma_1, \tag{15}$$

and

$$\gamma_2=\lambda_4 e_2-\beta\ge 0,\quad \beta\le\lambda_4 e_2,\quad \beta=\lambda_4 e_2-\gamma_2. \tag{16}$$

Substituting (12)–(16) into (4), we obtain the dual problem

$$\begin{aligned}
\min_{\alpha,\beta}\ & \frac{1}{2}\alpha^{T}\Big(K(X_1,X_1)+\frac{1}{\lambda_1}\Delta_{11}\Delta_{11}^{T}+\frac{1}{\lambda_2}\Delta_{12}\Delta_{12}^{T}\Big)\alpha
+\frac{1}{2}\beta^{T}\Big(K(X_2,X_2)+\frac{1}{\lambda_1}\Delta_{21}\Delta_{21}^{T}+\frac{1}{\lambda_2}\Delta_{22}\Delta_{22}^{T}\Big)\beta\\
& -\frac{1}{2}\alpha^{T}\Big(K(X_1,X_2)+\frac{1}{\lambda_1}\Delta_{11}\Delta_{21}^{T}+\frac{1}{\lambda_2}\Delta_{12}\Delta_{22}^{T}\Big)\beta
-\frac{1}{2}\beta^{T}\Big(K(X_2,X_1)+\frac{1}{\lambda_1}\Delta_{21}\Delta_{11}^{T}+\frac{1}{\lambda_2}\Delta_{22}\Delta_{12}^{T}\Big)\alpha
-e_1^{T}\alpha-e_2^{T}\beta\\
\text{s.t.}\ & e_1^{T}\alpha-e_2^{T}\beta=0,\quad 0\le\alpha\le\lambda_3 e_1,\quad 0\le\beta\le\lambda_4 e_2,
\end{aligned} \tag{17}$$

where $K(\cdot,\cdot)$ represents a generalized Gaussian kernel with parameter $\sigma$. Suppose that in the solution, $\alpha$ has $l$ non-zero components and $\beta$ has $s$ non-zero components. From the KKT conditions we can deduce

$$b_1=e_1-\varphi(X_1)^{T}w-\Delta_{11}c_1+\Delta_{12}c_2, \tag{18}$$

and

$$b_2=-e_2-\varphi(X_2)^{T}w-\Delta_{21}c_1+\Delta_{22}c_2, \tag{19}$$

where $b_1$ contains $l$ elements and $b_2$ contains $s$ elements. From Equations (18) and (19), we find

$$b=\frac{e_1^{T}b_1+e_2^{T}b_2}{l+s}. \tag{20}$$
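The dual (17) is a standard box-constrained quadratic program, so any QP solver can be used. Below is a minimal sketch of one way to assemble and solve it and then recover $c_1$, $c_2$ from (13)–(14) and $b$ from (18)–(20); the use of cvxopt, the threshold for "non-zero" components, and all function names are our own assumptions, not choices made in the paper (whose experiments were run in MATLAB).

```python
import numpy as np
from cvxopt import matrix, solvers   # cvxopt is an assumed choice of QP solver

def solve_wgmm_dual(K11, K12, K22, D11, D12, D21, D22,
                    lam1, lam2, lam3, lam4):
    """Solve the dual (17) and recover c1, c2 via (13)-(14) and b via (18)-(20).

    K11 = K(X1, X1), K12 = K(X1, X2), K22 = K(X2, X2); K21 = K12.T.
    The Delta blocks follow problem (4). All names are illustrative.
    """
    p, q = K11.shape[0], K22.shape[0]
    K21 = K12.T
    e1, e2 = np.ones(p), np.ones(q)

    # quadratic term of (17), with z = [alpha; beta]
    A = K11 + D11 @ D11.T / lam1 + D12 @ D12.T / lam2
    B = K22 + D21 @ D21.T / lam1 + D22 @ D22.T / lam2
    C = K12 + D11 @ D21.T / lam1 + D12 @ D22.T / lam2
    P = np.block([[A, -C], [-C.T, B]])
    P = 0.5 * (P + P.T)                      # symmetrise for the solver
    qvec = -np.concatenate([e1, e2])         # linear term: -e1^T alpha - e2^T beta

    # box constraints 0 <= alpha <= lam3 e1, 0 <= beta <= lam4 e2
    m = p + q
    G = np.vstack([-np.eye(m), np.eye(m)])
    h = np.concatenate([np.zeros(m), lam3 * e1, lam4 * e2])
    # equality constraint e1^T alpha - e2^T beta = 0
    Aeq = np.concatenate([e1, -e2]).reshape(1, m)

    sol = solvers.qp(matrix(P), matrix(qvec), matrix(G), matrix(h),
                     matrix(Aeq), matrix(np.zeros(1)))
    z = np.array(sol['x']).ravel()
    alpha, beta = z[:p], z[p:]

    # memory costs from (13)-(14)
    c1 = (D11.T @ alpha - D21.T @ beta) / lam1
    c2 = (D22.T @ beta - D12.T @ alpha) / lam2

    # bias from (18)-(20), averaged over the non-zero dual components
    fx1 = K11 @ alpha - K12 @ beta           # phi(X1)^T w via the kernel trick
    fx2 = K21 @ alpha - K22 @ beta           # phi(X2)^T w
    sv1, sv2 = alpha > 1e-8, beta > 1e-8
    b1 = 1.0 - fx1[sv1] - (D11 @ c1)[sv1] + (D12 @ c2)[sv1]
    b2 = -1.0 - fx2[sv2] - (D21 @ c1)[sv2] + (D22 @ c2)[sv2]
    b = (b1.sum() + b2.sum()) / (sv1.sum() + sv2.sum())
    return alpha, beta, c1, c2, b
```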

New Memory Influence Function

The datasets being memorized affect the generalization ability of the model. Therefore, this paper makes the following improvement: an adaptive memory influence function based on Euclidean distance is proposed so that the model adapts to different datasets; that is, a distinct memory influence function is introduced for each individual sample.

The memory influence function chosen within GMM is

$$\delta(x_i,x_j)=\exp\!\Big(-\frac{\|x_i-x_j\|^2}{2\sigma^2}\Big), \tag{21}$$

where $\sigma$ is a positive parameter that must be selected. However, this function cannot adapt to diverse datasets by itself, which may harm model performance. To address this limitation, we proceed as follows. For a Gaussian distribution, the probability of falling within $\mu\pm 3\sigma$ is 0.9974. This guides the definition of the influence range of each sample point: we take the sample as the neighborhood center and set the neighborhood radius to half of the Euclidean distance $d$ to its nearest heterogeneous point, confining the memory influence function to this neighborhood. More precisely, we set $\frac{d}{2}=3\sigma$, so that $\sigma=\frac{d}{6}$; substituting this into the Gaussian function, we obtain

$$\delta(x_i,x_j)=\exp\!\Big(-\frac{18\,\|x_i-x_j\|^2}{d^2}\Big). \tag{22}$$
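For completeness, the constant 18 in (22) follows directly from substituting $\sigma=d/6$ into (21):

$$\exp\!\Big(-\frac{\|x_i-x_j\|^2}{2\sigma^2}\Big)\Big|_{\sigma=d/6}
=\exp\!\Big(-\frac{\|x_i-x_j\|^2}{2(d/6)^2}\Big)
=\exp\!\Big(-\frac{36\,\|x_i-x_j\|^2}{2d^2}\Big)
=\exp\!\Big(-\frac{18\,\|x_i-x_j\|^2}{d^2}\Big).$$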

At this point, the influence function involves no tunable parameter; it is determined entirely by the distribution of the samples. Introducing an adaptive memory influence function rooted in Euclidean distance characterizes each sample individually, which enhances the ability of the model to accommodate various datasets.
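A minimal sketch of the adaptive influence function is given below. We read $d$ in (22) as the distance from the neighborhood-center sample (the first argument) to its nearest heterogeneous point; that indexing, as well as the function names, is our interpretation rather than notation from the paper.

```python
import numpy as np

def nearest_heterogeneous_distance(X, y):
    """For each sample, the Euclidean distance to its closest sample of the other class."""
    d = np.empty(len(X))
    for i in range(len(X)):
        other = X[y != y[i]]
        d[i] = np.min(np.linalg.norm(other - X[i], axis=1))
    return d

def adaptive_delta(x_center, x_other, d_center):
    """Adaptive memory influence of Equation (22); d_center is the distance from the
    neighborhood-center sample x_center to its nearest heterogeneous point."""
    return np.exp(-18.0 * np.sum((x_center - x_other) ** 2) / d_center ** 2)
```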

The comparative diagrams for the new influence function we developed and the influence function used in GMM are presented below:

Figure 1 illustrates the influence ranges of the influence functions of WGMM and GMM; the black circle centered on each sample is a schematic of that sample's influence range. Figure 1(a) depicts the novel influence function within WGMM and shows that different samples have influence ranges of different sizes. Figure 1(b) represents GMM, where the influence ranges of all samples are uniformly sized. During practical model classification, the influence ranges of different samples may overlap.

Figure 1. An illustrative example employing synthetic data to demonstrate the value range of the memory influence function in linear WGMM and linear GMM. (a) Classification performance of WGMM with $\lambda_1=4$, $\lambda_2=2.30$, $\lambda_3=11.31$ and $\lambda_4=4$, using the influence range defined in Equation (22) as the memory influence function. (b) Classification performance of GMM with $\lambda=11.31$ and $\mu=111.43$, using the influence range of the RBF kernel as the memory influence function.


Finally, we summarize the procedure of our WGMM with the new memory influence function in Algorithm 1.
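Algorithm 1 itself is not reproduced in the text above; the following sketch is our reading of the overall training flow for the nonlinear case, tying together the helper functions sketched in the previous subsections (`nearest_heterogeneous_distance`, `solve_wgmm_dual`). All names and the structure are illustrative, not the paper's own implementation.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Gaussian kernel matrix K(A, B)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def train_wgmm(X1, X2, lam1, lam2, lam3, lam4, sigma):
    """Sketch of the overall WGMM training procedure (nonlinear case)."""
    X = np.vstack([X1, X2])
    y = np.concatenate([np.ones(len(X1)), -np.ones(len(X2))])
    p = len(X1)
    # 1. per-sample radii and adaptive memory influence matrix, Equation (22)
    d = nearest_heterogeneous_distance(X, y)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    delta = np.exp(-18.0 * sq / d[:, None] ** 2)
    D11, D12 = delta[:p, :p], delta[:p, p:]
    D21, D22 = delta[p:, :p], delta[p:, p:]
    # 2. kernel blocks, then solve the dual problem (17) and recover c1, c2, b
    K11 = rbf_kernel(X1, X1, sigma)
    K12 = rbf_kernel(X1, X2, sigma)
    K22 = rbf_kernel(X2, X2, sigma)
    alpha, beta, c1, c2, b = solve_wgmm_dual(K11, K12, K22,
                                             D11, D12, D21, D22,
                                             lam1, lam2, lam3, lam4)
    # the decision function (3) can then be evaluated with these quantities
    return alpha, beta, c1, c2, b, d
```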

Experiments

In this section, we describe our experimental study. Section 4.1 lists the datasets used for the experiments, Section 4.2 provides the metrics used to evaluate all methods, and Sections 4.3 and 4.4 present the specific experimental setup and the analysis of the results, respectively.

Datasets

Our model is evaluated on 31 datasets obtained from (Rosales-Pérez, García, and Herrera Citation2022). Table 1 presents the dataset information, including the data dimension (n), data volume (m), the number of positive samples (p), the number of negative samples (q), and the imbalance ratio (IR) (Lu, Cheung, and Tang Citation2019), i.e. the ratio of the majority-class size to the minority-class size; the datasets are arranged in ascending order of IR. This allows us to observe the performance of WGMM as the degree of imbalance grows.

Table 1. Description of 31 datasets.

Performance Metrics

We employ the Geometric Mean (GM) (Luque et al. Citation2019) as the primary evaluation metric in this paper. The evaluation is based on the confusion matrix, as defined in (Caelen Citation2017), which yields the key quantities True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). Specifically, TP is the number of correctly classified positive samples, FN is the number of positive samples misclassified as negative, TN is the number of correctly classified negative samples, and FP is the number of negative samples misclassified as positive. Following (Luque et al. Citation2019), the sensitivity and specificity are

$$sen=\frac{TP}{TP+FN}, \tag{23}$$
$$spe=\frac{TN}{TN+FP}. \tag{24}$$

Then the GM is

$$GM=\sqrt{sen\cdot spe}. \tag{25}$$
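As a small illustration (our own helper function, with made-up counts), the GM of (25) can be computed directly from the confusion-matrix entries:

```python
def geometric_mean(tp, fn, tn, fp):
    """Geometric Mean of Equation (25) from confusion-matrix counts."""
    sen = tp / (tp + fn)   # sensitivity (23): accuracy on the positive (minority) class
    spe = tn / (tn + fp)   # specificity (24): accuracy on the negative (majority) class
    return (sen * spe) ** 0.5

# e.g. 40 of 50 positives and 900 of 950 negatives correct:
print(geometric_mean(tp=40, fn=10, tn=900, fp=50))   # ~0.871
```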

Reference Methods

Each dataset is divided using 10-fold cross-validation: the dataset is randomly separated into 10 disjoint subsets, and in each fold one subset serves as the test set while the remaining nine form the training set. First, 10-fold cross-validation is used to select the optimal parameters within the specified parameter range. These optimal parameters are then used in 20 further rounds of 10-fold cross-validation, with different random fold partitions in each round, giving a total of 200 tests for each dataset.

The comparison methods selected in this paper are LR (LaValley Citation2008), FSVM-CIL (Batuwita and Palade Citation2010), GMM (Wang and Shao Citation2022), SMOTE-SVM (Chawla et al. Citation2002) and RUS-SVM (He and Garcia Citation2009), which represent advances of recent years. Linear experiments with these methods were conducted on all 31 datasets, while 17 datasets were chosen for the nonlinear experiments. Hyperparameter selection is performed over fixed grids: the regularization coefficients are chosen from $\{2^{i}\,|\,i=-8,-6,\ldots,6\}$, and the kernel parameter in the nonlinear experiments is chosen from $\{2^{i}\,|\,i=-10,-8,\ldots,10\}$. All models were implemented on a personal computer with an Intel Core dual-core processor (4.2 GHz) and 32 GB of RAM using MATLAB 2017a, and the quadratic programming problems of all models were solved with the same algorithm and tolerance.
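The protocol can be summarized with the sketch below. It is only our reading of the setup: the experiments in the paper were run in MATLAB, whereas this sketch uses scikit-learn's StratifiedKFold, an exhaustive grid over all four $\lambda$ parameters, and a hypothetical `train_and_score` callback that trains WGMM on the training folds and returns the GM on the test fold.

```python
import itertools
import numpy as np
from sklearn.model_selection import StratifiedKFold

PARAM_GRID = [2.0 ** i for i in range(-8, 7, 2)]      # regularization range {2^i | i = -8, -6, ..., 6}
KERNEL_GRID = [2.0 ** i for i in range(-10, 11, 2)]   # kernel-parameter range for the nonlinear case

def cv_gm(train_and_score, X, y, params, n_splits=10, seed=None):
    """Mean GM over one round of (stratified) 10-fold cross-validation."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = [train_and_score(X[tr], y[tr], X[te], y[te], params)
              for tr, te in folds.split(X, y)]
    return float(np.mean(scores))

def evaluate(train_and_score, X, y):
    """Select parameters by 10-fold CV, then repeat 10-fold CV 20 times (200 tests)."""
    candidates = itertools.product(PARAM_GRID, repeat=4)   # lambda_1 .. lambda_4 (exhaustive, slow)
    best = max(candidates, key=lambda prm: cv_gm(train_and_score, X, y, prm))
    repeats = [cv_gm(train_and_score, X, y, best, seed=r) for r in range(20)]
    return best, float(np.mean(repeats)), float(np.std(repeats))
```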

Experiments Results and Discussion

In this section, we offer a comparative analysis of the performance of WGMM and the alternative methods. We conduct experiments on 31 datasets for the linear analysis and 17 datasets for the nonlinear study. Table 2 presents the results of the linear experiments for each approach, and Table 3 presents a selection of datasets illustrating the nonlinear results. Figure 2 illustrates the overall performance of each approach on all datasets, Figure 3 displays the optimal parameters $\lambda_1$ and $\lambda_2$ selected by the WGMM method during the experiments, Figure 4 portrays the model results with different parameters on an example dataset, and Figure 5 highlights the relationship between the parameters and the GM metric.

Figure 2. (a) depicts the GM distribution for 31 datasets across 5 linear algorithms. (b) illustrates the GM distribution for 17 datasets using 5 nonlinear algorithms.


Figure 3. The optimal reference ratio between $\lambda_1$ and $\lambda_2$ in formula (4). (a) Linear WGMM algorithm. (b) Nonlinear WGMM algorithm.


Figure 4. Results of dataset yeast-1-4-5-8, where the 1st and 4th dimensions are selected.


Figure 5. GM values of the WGMM method over the selected parameter range on datasets with different IR: (a) glass0, (b) vehicle2, (c) ecoli1, (d) ecoli3, (e) abalone9-18, (f) winequality-red-4, (g) yeast6, (h) poker-8-9 vs 6, (i) poker-8 vs 6.


Table 2. GM comparison of linear experiments.

Table 3. GM comparison of nonlinear experiments.

Table 2 displays the evaluation results of the linear models under the GM metric. WGMM clearly exhibits the best overall performance, particularly when the imbalance ratio (IR) is large, and our method is generally better than LR. GMM has a GM value of 0.00 on 9 datasets, meaning that the test accuracy on the minority class of these datasets is 0.00; GMM therefore performs poorly on imbalanced data. In terms of variance, WGMM falls within an acceptable range, while RUS-SVM excels, with a variance of 0.00 on most datasets.

Table 3 presents the evaluation of the nonlinear experiments under the GM metric. Notably, FSVM-CIL performs well on datasets with a small IR, where WGMM and FSVM-CIL show comparable performance. However, as the IR increases, WGMM emerges as the leading performer, with significantly better results than the other methods.

Comparing Tables 2 and 3, we find:

  1. Of the 17 datasets used in both settings, 11 exhibit a distinctive pattern: under the GM metric, our linear algorithm outperforms the nonlinear model, which can be attributed to the characteristics of the data.

  2. The datasets ecoli-0-3-4-7 vs 5-6, ecoli-0-1-4-7 vs 2-3-5-6, glass-0-1-6 vs 5, yeast-1-4-5-8 vs 7 and glass5 exhibited a GM value of 0.00 in the GMM linear experiments. While substantial improvements were observed in the nonlinear experiments, the results remained unsatisfactory.

Figure 2 consists of two boxplots. Figure 2(a) compares the GM values of the five methods in the linear experiments. WGMM exhibits higher median, first-quartile, and third-quartile values than the other methods, with shorter whiskers, indicating its superior performance. Conversely, the GMM method shows a lower box, longer whiskers, and lower outliers, clearly highlighting the subpar performance of unweighted methods on imbalanced data. Figure 2(b) compares the five methods in the nonlinear experiments. Overall, both FSVM-CIL and WGMM perform better: their boxes are short and skewed upward, signifying larger GM values. While the median and first quartile of FSVM-CIL are slightly greater than those of WGMM, the third quartile of WGMM is larger and its upper whisker is longer, suggesting that the predictive results of WGMM are notably larger.

In formula (4), $\lambda_1$ and $\lambda_2$ are the coefficients of the positive- and negative-class memory cost functions, respectively, with $\lambda_1$ referring to the positive class. Owing to the imbalance in sample sizes between the positive and negative classes, applying the same weight would bias the model toward the class with more samples; hence, it is essential to assign different weights, increasing the weight of the minority class to balance the performance of the model on both types of samples. Figure 3(a) shows the optimal reference comparison of $\lambda_1$ and $\lambda_2$ for each dataset in the linear experiments, and Figure 3(b) shows the corresponding comparison for the nonlinear experiments. Most of the datasets show $\lambda_1>\lambda_2$, implying that greater weight is assigned to the minority samples, thereby achieving data balance through different weights. In situations of class imbalance, the model thus attains optimal classification.

Figure 4 uses the 1st and 4th dimensions of the dataset yeast-1-4-5-8 vs 7 for training and testing. Figure 4(a) shows the result with the optimal parameters, namely $\lambda_1=2^{2}$, $\lambda_2=2^{4}$, $\lambda_3=2^{0}$ and $\lambda_4=2^{2}$. Figure 4(b) shows the result when the weights of the memory cost terms are equal, that is, $\lambda_1=\lambda_2=2^{4}$. It is evident that in Figure 4(b) the trained classifier is biased toward the majority (negative) class samples, which underscores the importance of weighting.

Figure 5 shows how the GM value changes with $\lambda_1$ and $\lambda_2$ in the linear experiments on different datasets. We can observe the following points.

  1. Different datasets correspond to different optimal parameters λ1 and λ2, which may be affected by the data structure and IR value.

  2. For many datasets, the GM value is larger when λ1>λ2, and with non-optimal parameters the GM value can drop to 0.00.

  3. The larger the IR, the fewer parameter groups have a GM value greater than 0.5, and the more parameter groups have a GM value of 0.00.

From the above results, it can be seen that WGMM is a competitive choice.

Conclusions

In this study, our WGMM sets separate weight parameters for the memory cost functions of the positive- and negative-class samples, and a new adaptive memory influence function is proposed. With this function, samples are described individually without increasing the weight of an entire class; each sample is not affected by training samples of the other class, so the model can adapt to different imbalance scenarios while ensuring zero empirical risk and retaining generalization ability.

We conduct a comprehensive evaluation of WGMM on 31 imbalanced datasets, benchmarking its performance against alternative methods. Experimental results demonstrate the efficacy of WGMM in tackling imbalanced classification problems and its robust generalization ability. Notably, in scenarios with a substantial IR, our model clearly stands out and excels across the evaluation metrics.

Regarding the different parameters set for the positive- and negative-class memory cost functions, the experiments show that the optimal parameter for the minority class is greater than or equal to the optimal parameter for the majority class. The adaptive memory influence function enables the model to adapt well to different data, and its effect is better than that of the other methods.

These findings affirm WGMM as an effective approach for addressing imbalanced classification challenges. Its superior performance, its adaptability across different imbalanced datasets, and its parameter-free memory influence function position WGMM as a potent tool for practical applications.

Future research avenues may involve further optimizing the performance of WGMM and exploring its applicability in diverse fields and tasks. In summary, this study provides novel insights and effective solutions for resolving imbalanced classification problems.

Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable suggestions.

Disclosure Statement

No potential conflict of interest was reported by the author(s).

Data Availability Statement

The data that support the findings of this study are available in [IEEE] at [doi: 10.1109/TCYB.2022.3163974], reference (Rosales-Pérez, García, and Herrera Citation2022). These data were derived from the following resources available in the public domain: [https://ieeexplore.ieee.org/document/9756639].

Supplementary Material

Supplemental data for this article can be accessed online at https://doi.org/10.1080/08839514.2024.2355424

Additional information

Funding

The work was supported by the National Natural Science Foundation of China [61966024, 62366035]; National Natural Science Foundation of China [62106112]; Natural Science Foundation of Inner Mongolia Autonomous Region [2023MS01006].

References

  • Abdelhamid, D., S. Khaoula, and O. Atika. 2014. Automatic bank fraud detection using support vector machines. In The International Conference on Computing Technology and Information Management, Dubai, UAE, January.
  • Alfian, G., M. Syafrudin, I. Fahrurrozi, N. L. Fitriyani, F. T. D. Atmaji, T. Widodo, N. Bahiyah, F. Benes, and J. Rhee. 2022. Predicting breast cancer from risk factors using SVM and extra-trees-based feature selection method. Computers 11 (9):136. doi:10.3390/computers11090136.
  • Bach, F. R., D. Heckerman, and E. Horvitz. 2006. Considering cost asymmetry in learning classifiers. Journal of Machine Learning Research 7:1713–20.
  • Batuwita, R., and V. Palade. 2010. FSVM-CIL: Fuzzy support vector machines for class imbalance learning. IEEE Transactions on Fuzzy Systems 18 (3):558–71. doi:10.1109/TFUZZ.2010.2042721.
  • Caelen, O. 2017. A Bayesian interpretation of the confusion matrix. Annals of Mathematics and Artificial Intelligence 81 (3–4):429–50. doi:10.1007/s10472-017-9564-8.
  • Chaabane, S. B., M. Hijji, R. Harrabi, and H. Seddik. 2022. Face recognition based on statistical features and SVM classifier. Multimedia Tools & Applications 81 (6):8767–84. doi:10.1007/s11042-021-11816-w.
  • Chawla, N. V., K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. 2002. SMOTE: Synthetic minority over-sampling technique. The Journal of Artificial Intelligence Research 16:321–57. doi:10.1613/jair.953.
  • Chen, C., X. Xu, G. Wang, and L. Yang. 2022. Network intrusion detection model based on neural network feature extraction and PSO-SVM. In 2022 7th International Conference on Intelligent Computing and Signal Processing (ICSP), Xi’an, China, April, 1462–65.
  • Datta, S., and S. Das. 2015. Near-Bayesian support vector machines for imbalanced data classification with equal or unequal misclassification costs. Neural Networks 70:39–52. doi:10.1016/j.neunet.2015.06.005.
  • Fletcher, R. 2000. Practical methods of optimization. New Jersey, USA: John Wiley & Sons.
  • Hamdan, Y. B., and A. Sathesh. 2021. Construction of statistical SVM based recognition model for handwritten character recognition. Journal of Information Technology and Digital World 3 (2):92–107. doi:10.36548/jitdw.2021.2.003.
  • Hart, P. 1968. The condensed nearest neighbor rule (corresp.). IEEE Transactions on Information Theory 14 (3):515–16. doi:10.1109/TIT.1968.1054155.
  • Harvianto, H., L. Ashianti, J. Jupiter, and S. Junaedi. 2016. Analysis and voice recognition in Indonesian language using MFCC and SVM method. ComTech: Computer, Mathematics and Engineering Applications 7 (2):131–39. doi:10.21512/comtech.v7i2.2252.
  • He, H., Y. Bai, E. A. Garcia, and S. Li. 2008. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks, Hong Kong, China, June, 1322–28.
  • He, H., and E. A. Garcia. 2009. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21 (9):1263–84. doi:10.1109/TKDE.2008.239.
  • Hussain, M., S. K. Wajid, A. Elzaart, and M. Berbar. 2011. A comparison of SVM kernel functions for breast cancer detection. In 2011 Eighth International Conference Computer Graphics, Imaging and Visualization, Singapore, August, 145–50.
  • Imam, T., K. M. Ting, and J. Kamruzzaman. 2006. z-SVM: An SVM for improved classification of imbalanced data. In AI 2006: Advances in Artificial Intelligence: 19th Australian Joint Conference on Artificial Intelligence, Proceedings, Hobart, Australia, Vol. 19, 264–73.
  • Iranmehr, A., H. Masnadi-Shirazi, and N. Vasconcelos. 2019. Cost-sensitive support vector machines. Neurocomputing 343:50–64. doi:10.1016/j.neucom.2018.11.099.
  • Kibria, M. R., A. Ahmed, Z. Firdawsi, and M. A. Yousuf. 2020. Bangla compound character recognition using support vector machine (SVM) on advanced feature sets. In 2020 IEEE Region 10 Symposium, Dhaka, Bangladesh, 965–68.
  • Krawczyk, B. and M. Woźniak. 2015. Cost-sensitive neural network with roc-based moving threshold for imbalanced classification. In 16th International Conference, Wroclaw, Poland, October 14–16, 2015, Proceedings 16, 45-52.
  • Krawczyk, B., M. Woźniak, and G. Schaefer. 2014. Cost-sensitive decision tree ensembles for effective imbalanced classification. Applied Soft Computing 14:554–62. doi:10.1016/j.asoc.2013.08.014.
  • Kumar, M., R. K. Sharma, and M. K. Jindal. 2011. SVM based offline handwritten Gurmukhi character recognition. SCAKD Proceedings 758:51–62.
  • LaValley, M. P. 2008. Logistic regression. Circulation 117 (18):2395–99. doi:10.1161/CIRCULATIONAHA.106.682658.
  • Lilhore, U. K., S. Simaiya, H. Pandey, V. Gautam, A. Garg, and P. Ghosh. 2022. Breast Cancer Detection in the IoT Cloud-based Healthcare Environment Using Fuzzy Cluster Segmentation and SVM Classifier. In Ambient Communications and Computer Systems. Lecture Notes in Networks and Systems, ed. Y. Hu, S. Tiwari, M. C. Trivedi, and K. K. Mishra. Vol. 356. Singapore: Springer.
  • Lu, Y., Y. M. Cheung, and Y. Y. Tang. 2019. Bayes imbalance impact index: A measure of class imbalanced data set for classification problem. IEEE Transactions on Neural Networks and Learning Systems 31 (9):3525–39. doi:10.1109/TNNLS.2019.2944962.
  • Luo, H., X. Pan, Q. Wang, S. Ye, and Y. Qian 2019. Logistic regression and random forest for effective imbalanced classification. In 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC), Milwaukee, WI, USA. 1:916–17.
  • Luque, A., A. Carrasco, A. Martin, and A. de Las Heras. 2019. The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recognition 91:216–31. doi:10.1016/j.patcog.2019.02.023.
  • Mathew, J., C. K. Pang, M. Luo, and W. H. Leong. 2017. Classification of imbalanced data by oversampling in kernel space of support vector machines. IEEE Transactions on Neural Networks and Learning Systems 29 (9):4065–76. doi:10.1109/TNNLS.2017.2751612.
  • Nasien, D., H. Haron, and S. S. Yuhaniz. 2010. Support Vector Machine (SVM) for English handwritten character recognition. In 2010 2nd International Conference on Computer Engineering and Applications, Bali Island, 1:249–52.
  • Nguyen, H. M., E. W. Cooper, and K. Kamei. 2011. Borderline over-sampling for imbalanced data classification. International Journal of Knowledge Engineering and Soft Data Paradigms 3 (1):4–21. doi:10.1504/IJKESDP.2011.039875.
  • Rosales-Pérez, A., S. García, and F. Herrera. 2022. Handling imbalanced classification problems with support vector machines via evolutionary bilevel optimization. IEEE Transactions on Cybernetics 53 (8):4735–47. doi:10.1109/TCYB.2022.3163974.
  • Seo, H., L. Brand, L. S. Barco, and H. Wang. 2022. Scaling multi-instance support vector machine to breast cancer detection on the BreaKHis dataset. Bioinformatics 38 (Supplement_1):92–100. doi:10.1093/bioinformatics/btac267.
  • Sudha, C., and D. Akila. 2021. Credit card fraud detection system based on operational & transaction features using svm and random forest classifiers. In 2021 2nd International Conference on Computation, Automation and Knowledge Management (ICCAKM), Dubai, United Arab Emirates, 133–38.
  • Sui, Y., Y. Wei, and D. Zhao. 2015. Computer-aided lung nodule recognition by SVM classifier based on combination of random undersampling and SMOTE. Computational & Mathematical Methods in Medicine 2015:1–13. doi:10.1155/2015/368674.
  • Vapnik, V., and R. Izmailov. 2021. Reinforced SVM method and memorization mechanisms. Pattern Recognition 119:108018. doi:10.1016/j.patcog.2021.108018.
  • Wang, Z., and Y. H. Shao. 2022. Generalization-Memorization Machines. arXiv preprint arXiv:2207.03976.
  • Xiong, Y., and R. Zuo. 2020. Recognizing multivariate geochemical anomalies for mineral exploration by combining deep learning and one-class support vector machine. Computers & Geosciences 140:104484. doi:10.1016/j.cageo.2020.104484.
  • Zhang, S., Q. Fu, and W. Xiao. 2017. Advertisement click-through rate prediction based on the weighted-ELM and adaboost algorithm. Scientific Programming 2017:2938369. doi:10.1155/2017/2938369.
  • Zheng, X. 2020. SMOTE variants for imbalanced binary classification: Heart disease prediction. Ann Arbor, Michigan, USA: Los Angeles ProQuest Dissertations Publishing. University of California.