Article: 2355424 | Received 22 Oct 2023, Accepted 23 Apr 2024, Published online: 20 May 2024

ABSTRACT

Imbalanced classification problems are of great practical significance, and many methods have been developed to deal with them, e.g. eXtreme Gradient Boosting (XGBoost), Logistic Regression (LR), Decision Trees (DT), and Support Vector Machines (SVM). Recently, a novel Generalization-Memorization Machine (GMM) was proposed to maintain good generalization ability with zero empirical risk for binary classification. This paper proposes a Weighted Generalization Memorization Machine (WGMM) for imbalanced classification. By improving the memory cost function and memory influence function of GMM, our WGMM also maintains zero empirical risk with good generalization ability for imbalanced classification learning. The new adaptive memory influence function in our WGMM describes each sample individually, without being affected by training samples from the other class. We conduct experiments on 31 datasets and compare the WGMM with several other classification methods. The results exhibit the effectiveness of the WGMM.

Introduction

Imbalanced classification problems often prioritize the minority class, which typically carries the more crucial information. Numerous real-world applications encounter imbalanced scenarios, including but not limited to network intrusion detection (Chen et al. Citation2022), cancer detection (Lilhore et al. Citation2022; Seo et al. Citation2022), mineral exploration (Xiong and Zuo Citation2020), illegal credit card transactions (Sudha and Akila Citation2021), bank fraud detection (Abdelhamid, Khaoula, and Atika Citation2014), and advertisement click-through rate prediction (Zhang, Fu, and Xiao Citation2017). Two main categories of methods are used to address imbalanced classification problems. The first category, known as data-level methods, focuses on transforming imbalanced data into balanced data; it includes oversampling methods (Chawla et al. Citation2002; He and Garcia Citation2009; He et al. Citation2008; Nguyen, Cooper, and Kamei Citation2011; Zheng Citation2020) and undersampling methods (Hart Citation1968; Sui, Wei, and Zhao Citation2015). The second category, algorithm-level methods, adjusts the weights of the majority and minority classes within models. Examples include cost-sensitive methods (Bach, Heckerman, and Horvitz Citation2006), kernel adaptation methods (Mathew et al. Citation2017), and hyperplane shifting methods (Datta and Das Citation2015), as well as Logistic Regression (LR) (Luo et al. Citation2019), Cost-Sensitive Decision Trees (Krawczyk, Woźniak, and Schaefer Citation2014), Cost-Sensitive Neural Networks (Krawczyk and Woźniak Citation2015) and the Support Vector Machine (SVM); among these, the SVM is particularly suited to imbalanced problems owing to its strong generalization ability.

The SVM, which minimizes the sum of empirical and expected risks, has been used extensively in practical applications, including face recognition (Chaabane et al. Citation2022), cancer detection (Alfian et al. Citation2022; Hussain et al. Citation2011; Seo et al. Citation2022), voice recognition (Harvianto et al. Citation2016) and handwritten character recognition (Hamdan and Sathesh Citation2021; Kibria et al. Citation2020; Kumar, Sharma, and Jindal Citation2011; Nasien, Haron, and Yuhaniz Citation2010). However, the classic SVM cannot always guarantee zero empirical risk, i.e. classifying all training samples correctly. To this end, Vapnik proposed a generalization-memorization kernel (Vapnik and Izmailov Citation2021) that classifies the training samples correctly while retaining good generalization ability. Subsequently, the generalization-memorization machine (GMM) (Wang and Shao Citation2022) was proposed to account for the mechanism of the generalization-memorization kernel. The GMM enhances memorization by incorporating a memory cost function and improves generalization through a memory influence function. Since the memory influence function is predefined uniformly on the training set, inessential samples may exert an outsized effect on prediction, especially for imbalanced problems.

In this paper, we propose a Weighted Generalization Memorization Machine (WGMM) to deal with imbalanced classification problems. Our WGMM employs distinct memory cost functions for the majority and minority classes while preserving zero empirical risk. Furthermore, our WGMM introduces a self-adaptive memory influence function to adapt to various imbalance problems.

The structure of this paper is as follows: In the next section, a brief overview of GMM and imbalanced classification methods is given. The third part establishes our WGMM model. The last two sections present numerical experiments and conclusions.

Related Works

Imbalanced Classification

There are two primary approaches to deal with imbalanced classification problems: data-level (DL) preprocessing methods (Chawla et al. Citation2002; Hart Citation1968; He and Garcia Citation2009; He et al. Citation2008; Nguyen, Cooper, and Kamei Citation2011; Sui, Wei, and Zhao Citation2015; Zheng Citation2020) and algorithm-level (AL) methods (Bach, Heckerman, and Horvitz Citation2006; Batuwita and Palade Citation2010; Datta and Das Citation2015; Imam, Ting, and Kamruzzaman Citation2006; Iranmehr, Masnadi-Shirazi, and Vasconcelos Citation2019; Mathew et al. Citation2017). Data-level preprocessing methods mitigate class imbalance by adding or removing samples so that the classes are balanced before model training; oversampling and undersampling are the usual strategies. Oversampling methods rebalance classes by replicating or generating samples in the minority class. For instance, Random Oversampling (ROS) (He and Garcia Citation2009) duplicates samples from the minority class, while the Synthetic Minority Over-sampling Technique (SMOTE) (Chawla et al. Citation2002) generates artificial samples by linearly interpolating between minority-class samples. Other SMOTE variants also exist, such as SVM-SMOTE (Nguyen, Cooper, and Kamei Citation2011), Borderline-SMOTE (Zheng Citation2020), Kmeans-SMOTE (Zheng Citation2020) and ADASYN (He et al. Citation2008). Undersampling methods balance the dataset by removing instances from the majority class. For instance, Random Undersampling (RUS) (Sui, Wei, and Zhao Citation2015) randomly removes instances from the majority class until a specified class balance is achieved.

Algorithm-level methods construct a particular classifier to handle imbalanced classification problems. Many approaches follow this strategy, such as Fuzzy Support Vector Machines for Class Imbalance Learning (FSVM-CIL) (Batuwita and Palade Citation2010), Cost-Sensitive Support Vector Machines (CSSVM) (Iranmehr, Masnadi-Shirazi, and Vasconcelos Citation2019) and z-SVM (Imam, Ting, and Kamruzzaman Citation2006). By introducing fuzzy membership values, FSVM-CIL prioritizes the minority class while still accounting for the majority class, which improves performance on imbalanced datasets. CSSVM assigns distinct misclassification costs to different classes, primarily to emphasize the minority class. z-SVM likewise adopts a cost-sensitive approach, assigning class-specific misclassification costs, which is especially valuable when a class is rare or holds greater importance.

Generalization-Memorization Machine

Recently, the generalization-memorization machine (GMM), an SVM with a new memory mechanism, has been proposed for classification in the n-dimensional real space $\mathbb{R}^n$. Suppose $T=\{(x_i,y_i)\,|\,i=1,2,\ldots,m\}$ is the training set, where $x_i\in\mathbb{R}^n$ is the input sample and $y_i\in\{+1,-1\}$ is the corresponding output. We organize the training set into $X$ and $Y$, where $X\in\mathbb{R}^{n\times m}$ is the sample matrix whose columns are the $x_i$, and $Y$ is the diagonal matrix of labels with diagonal elements $Y_{ii}=y_i$, $i=1,2,\ldots,m$.

GMM considers the optimization problem as

$$\min_{w,b,c,d}\ \frac{1}{2}\|w\|^2+\frac{\lambda}{2}\|c\|^2+C\sum_{i=1}^{m}d_i \quad \text{s.t.}\quad y_i\Big(\varphi(x_i)^{T}w+b+\sum_{j=1}^{m}y_j c_j\,\delta(x_i,x_j)\Big)\ge 1-d_i,\ \ d_i\ge 0,\ \ i=1,\ldots,m, \tag{1}$$

where $\|\cdot\|$ is the L2 norm, $\varphi(\cdot)$ is a mapping, $\lambda$ and $C$ are positive parameters, and $d_i$ is a slack variable. $c$ is a column vector composed of $c_j\ (j=1,\ldots,m)$, called the memory cost, and $\delta(x_i,x_j)$ is called the memory influence of $x_j$ on $x_i$.

The goal of this optimization problem is to find a hyperplane that correctly classifies the training set with the largest margin. The first term $\frac{1}{2}\|w\|^2$ is half the squared norm of $w$; since the margin is inversely proportional to $\|w\|$, minimizing this term maximizes the margin. The second term $\frac{\lambda}{2}\|c\|^2$ is the regularization term for the memory cost, where $\lambda$ controls the minimization of the sample memory costs; increasing $\lambda$ increases the penalty on the memory cost, thus placing more emphasis on classification accuracy. The term $C\sum_{i=1}^{m}d_i$ penalizes the re-memorization of training samples and thereby controls the size and complexity of the model. The constraint $y_i\big(\varphi(x_i)^{T}w+b+\sum_{j=1}^{m}y_j c_j\,\delta(x_i,x_j)\big)\ge 1-d_i$ requires each training sample to be correctly classified.

For a new sample $x$ outside the training set, it is classified into the positive class if $\varphi(x)^{T}w+b+\sum_{j=1}^{m}y_j c(x_j)\,\delta(x,x_j)>0$, and into the negative class otherwise.

It has been proved that the GMM can obtain zero empirical risk, and its classification performance in numerical experiments is considerably better than that of SVMs.

Weighted Generalization Memorization Machine (WGMM)

While retaining all training samples, one advantage of GMM is that it can achieve higher test accuracy than SVM. However, GMM remains susceptible to sample imbalance. To mitigate this, we enhance GMM by improving both the memory cost function and the memory influence function, resulting in our WGMM.

Model Formation

We introduce a weighted memory cost component into the original objective function so as to capture the distinct costs associated with different samples.

The training set consists of a positive sample set $T_1=\{(x_i,y_i)\,|\,i=1,2,\ldots,p\}$ and a negative sample set $T_2=\{(x_k,y_k)\,|\,k=1,2,\ldots,q\}$ with $p+q=m$. Without loss of generality, we take the positive class as the minority class of the imbalanced dataset: the positive sample matrix $X_1\in\mathbb{R}^{n\times p}$ represents the minority class and the negative sample matrix $X_2\in\mathbb{R}^{n\times q}$ represents the majority class, where the columns of $X_1$ and $X_2$ are the corresponding samples.

The optimization problem of WGMM can be expressed as follows:

$$\begin{aligned}
\min_{w,b,c,d}\ & \frac{1}{2}\|w\|^2+\frac{\lambda_1}{2}\|c_1\|^2+\frac{\lambda_2}{2}\|c_2\|^2+\lambda_3\sum_{i=1}^{p}d_i+\lambda_4\sum_{k=1}^{q}d_k\\
\text{s.t.}\ & y_i\Big(\varphi(x_i)^{T}w+b+\sum_{j=1}^{m}y_j c_j\,\delta(x_i,x_j)\Big)\ge 1-d_i,\\
& y_k\Big(\varphi(x_k)^{T}w+b+\sum_{j=1}^{m}y_j c_j\,\delta(x_k,x_j)\Big)\ge 1-d_k,\\
& d_i\ge 0,\ i=1,\ldots,p,\qquad d_k\ge 0,\ k=1,\ldots,q,
\end{aligned} \tag{2}$$

where $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ are positive parameters. $c_1$ and $c_2$ are the memory costs for remembering positive and negative samples, respectively: $c_1$ is a column vector with entries $c_i=c(x_i)\ (i=1,\ldots,p)$, the memory cost of the $i$th positive sample, and $c_2$ is a column vector with entries $c_k=c(x_k)\ (k=1,\ldots,q)$, the memory cost of the $k$th negative sample. $d_i=d(x_i)\ (i=1,\ldots,p)$ is the slack variable of the $i$th positive sample, and $d_k=d(x_k)\ (k=1,\ldots,q)$ is the slack variable of the $k$th negative sample.

The objective of problem (2) seeks a large margin with memory costs as low as possible, while also controlling the complexity of the model. In the constraints of (2), the memory cost is a variable that is split into $c_1$ and $c_2$, each multiplied by a different parameter. The constraints are defined separately for the positive and negative classes, allowing individual control over the memory costs of each class and thus reflecting the importance of each class. GMM requires the influence function $\delta(x_i,x_j)$ to be predefined; in the next subsection we introduce an adaptive influence function that adjusts to the samples. The retention of memory for all training samples is guaranteed by the constraints in (2).

The decision function of our WGMM is

$$g(x)=\begin{cases}\varphi(x)^{T}w+b+\displaystyle\sum_{j=1}^{m}y_j c_j\,\delta(x_j,x_i)+y_i d_i, & \text{if } x=x_i,\ x_i\in X,\\[2mm] \varphi(x)^{T}w+b+\displaystyle\sum_{j=1}^{m}y_j c_j\,\delta(x_j,x), & \text{otherwise.}\end{cases} \tag{3}$$

Formula (3) defines the decision function as a piecewise function. When the input test sample belongs to the training set, the model employs the first branch to make decisions: $\sum_{j=1}^{m}y_j c_j\,\delta(x_j,x_i)$ represents the combined influence of the memorized training samples on the prediction, and the function $\delta$ governs the memory capacity of the model. The term $y_i d_i$ is the re-memorization item for the sample, which enhances memory accuracy and guarantees a training accuracy of 1. Conversely, when the input sample is not part of the training set, $d_i$ has no effect on $x$, and the model uses the second branch as the testing function.
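To make the piecewise rule concrete, the following is a minimal sketch (not code from the paper) of how the decision function (3) could be evaluated once the model quantities are available; the function name `wgmm_decision`, the `kernel` and `delta_fn` callbacks, and the argument layout are our own illustrative choices.

```python
import numpy as np

def wgmm_decision(x, X1, X2, alpha, beta, b, c, d, y_train, delta_fn, kernel):
    """Evaluate the piecewise decision function g(x) of Equation (3) (sketch).

    X1, X2   : positive / negative training samples, one sample per row
    alpha    : dual variables of the positive class; beta: of the negative class
    b        : bias; c, d: memory costs and slack values for all m samples,
               ordered to match np.vstack([X1, X2]) and y_train
    delta_fn : memory influence function delta(x_j, x)
    kernel   : kernel(x, Z) returns the vector [K(x, z) for each row z of Z]
    """
    X = np.vstack([X1, X2])                       # all m training samples
    # phi(x)^T w with w = phi(X1) alpha - phi(X2) beta, via the kernel trick
    g = kernel(x, X1) @ alpha - kernel(x, X2) @ beta + b
    # memory term sum_j y_j c_j delta(x_j, x)
    g += sum(y_train[j] * c[j] * delta_fn(X[j], x) for j in range(len(X)))
    # if x coincides with a training sample x_i, add the re-memory term y_i d_i
    for i in range(len(X)):
        if np.allclose(x, X[i]):
            g += y_train[i] * d[i]
            break
    return np.sign(g)
```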

We now delve into the dual problem associated with (2). The original problem (2) can be written in matrix form

$$\begin{aligned}
\min_{w,b,c,d}\ & \frac{1}{2}\|w\|^2+\frac{\lambda_1}{2}\|c_1\|^2+\frac{\lambda_2}{2}\|c_2\|^2+\lambda_3 e_1^{T}d_1+\lambda_4 e_2^{T}d_2\\
\text{s.t.}\ & \varphi(X_1)^{T}w+b e_1+\Delta_{11}c_1-\Delta_{12}c_2\ge e_1-d_1,\\
& -\varphi(X_2)^{T}w-b e_2-\Delta_{21}c_1+\Delta_{22}c_2\ge e_2-d_2,\\
& d_1\ge 0,\quad d_2\ge 0,
\end{aligned} \tag{4}$$

where $d_1$ and $d_2$ are the slack vectors for the positive and negative samples: $d_1$ is a column vector composed of the $d_i$ and $d_2$ is a column vector composed of the $d_k$. $\Delta_{11}\in\mathbb{R}^{p\times p}$ has elements $\delta(x_i,x_j)\ (i,j=1,\ldots,p)$, and $\Delta_{12}\in\mathbb{R}^{p\times q}$, $\Delta_{21}\in\mathbb{R}^{q\times p}$, $\Delta_{22}\in\mathbb{R}^{q\times q}$ are defined analogously. $e_1$ and $e_2$ are vectors of ones of dimensions $p$ and $q$, respectively.
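As an illustration of how the block matrices in (4) might be assembled, here is a small sketch; `delta_fn` is a placeholder for whichever memory influence function is used (the RBF-style function of GMM or the adaptive one introduced in the next subsection), and all names are ours.

```python
import numpy as np

def influence_matrix(A, B, delta_fn):
    """Matrix with entries delta(a_i, b_j) for rows a_i of A and rows b_j of B."""
    return np.array([[delta_fn(a, b) for b in B] for a in A])

def build_blocks(X1, X2, delta_fn):
    """Assemble Delta_11, Delta_12, Delta_21, Delta_22 and the all-ones vectors of (4)."""
    D11 = influence_matrix(X1, X1, delta_fn)   # p x p
    D12 = influence_matrix(X1, X2, delta_fn)   # p x q
    D21 = influence_matrix(X2, X1, delta_fn)   # q x p
    D22 = influence_matrix(X2, X2, delta_fn)   # q x q
    e1 = np.ones(len(X1))                      # p-dimensional vector of ones
    e2 = np.ones(len(X2))                      # q-dimensional vector of ones
    return D11, D12, D21, D22, e1, e2
```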

To solve optimization problem (4), we apply Lagrangian duality to obtain the optimal solution of the primal problem. We construct the Lagrange function

$$\begin{aligned}
L=\ & \frac{1}{2}\|w\|^2+\frac{\lambda_1}{2}\|c_1\|^2+\frac{\lambda_2}{2}\|c_2\|^2+\lambda_3 e_1^{T}d_1+\lambda_4 e_2^{T}d_2\\
& -\alpha^{T}\big(\varphi(X_1)^{T}w+b e_1+\Delta_{11}c_1-\Delta_{12}c_2-e_1+d_1\big)\\
& +\beta^{T}\big(\varphi(X_2)^{T}w+b e_2+\Delta_{21}c_1-\Delta_{22}c_2+e_2-d_2\big)\\
& -\gamma_1^{T}d_1-\gamma_2^{T}d_2.
\end{aligned} \tag{5}$$

Lagrange multiplier vectors $\alpha,\gamma_1\in\mathbb{R}^{p}$ and $\beta,\gamma_2\in\mathbb{R}^{q}$ are introduced for the inequality constraints. Next, we present the Karush–Kuhn–Tucker (KKT) conditions (Fletcher Citation2000) for (4). The partial derivatives of (5) with respect to the variables $w$, $b$, $c_1$, $c_2$, $d_1$ and $d_2$ are

$$\frac{\partial L}{\partial w}=w-\varphi(X_1)\alpha+\varphi(X_2)\beta=0, \tag{6}$$
$$\frac{\partial L}{\partial b}=-e_1^{T}\alpha+e_2^{T}\beta=0, \tag{7}$$
$$\frac{\partial L}{\partial c_1}=\lambda_1 c_1-\Delta_{11}^{T}\alpha+\Delta_{21}^{T}\beta=0, \tag{8}$$
$$\frac{\partial L}{\partial c_2}=\lambda_2 c_2-\Delta_{22}^{T}\beta+\Delta_{12}^{T}\alpha=0, \tag{9}$$
$$\frac{\partial L}{\partial d_1}=\lambda_3 e_1-\alpha-\gamma_1=0, \tag{10}$$

and

$$\frac{\partial L}{\partial d_2}=\lambda_4 e_2-\beta-\gamma_2=0. \tag{11}$$

From (6)–(11) we obtain

$$w=\varphi(X_1)\alpha-\varphi(X_2)\beta, \tag{12}$$
$$c_1=\frac{1}{\lambda_1}\big(\Delta_{11}^{T}\alpha-\Delta_{21}^{T}\beta\big), \tag{13}$$
$$c_2=\frac{1}{\lambda_2}\big(\Delta_{22}^{T}\beta-\Delta_{12}^{T}\alpha\big), \tag{14}$$
$$\gamma_1=\lambda_3 e_1-\alpha\ge 0,\quad \alpha\le\lambda_3 e_1,\quad \alpha=\lambda_3 e_1-\gamma_1, \tag{15}$$

and

$$\gamma_2=\lambda_4 e_2-\beta\ge 0,\quad \beta\le\lambda_4 e_2,\quad \beta=\lambda_4 e_2-\gamma_2. \tag{16}$$

Substituting (12)–(16) into (4), we obtain the dual problem

$$\begin{aligned}
\min_{\alpha,\beta}\ & \frac{1}{2}\alpha^{T}\Big(K(X_1,X_1)+\frac{1}{\lambda_1}\Delta_{11}\Delta_{11}^{T}+\frac{1}{\lambda_2}\Delta_{12}\Delta_{12}^{T}\Big)\alpha
+\frac{1}{2}\beta^{T}\Big(K(X_2,X_2)+\frac{1}{\lambda_1}\Delta_{21}\Delta_{21}^{T}+\frac{1}{\lambda_2}\Delta_{22}\Delta_{22}^{T}\Big)\beta\\
& -\frac{1}{2}\alpha^{T}\Big(K(X_1,X_2)+\frac{1}{\lambda_1}\Delta_{11}\Delta_{21}^{T}+\frac{1}{\lambda_2}\Delta_{12}\Delta_{22}^{T}\Big)\beta
-\frac{1}{2}\beta^{T}\Big(K(X_2,X_1)+\frac{1}{\lambda_1}\Delta_{21}\Delta_{11}^{T}+\frac{1}{\lambda_2}\Delta_{22}\Delta_{12}^{T}\Big)\alpha
-e_1^{T}\alpha-e_2^{T}\beta\\
\text{s.t.}\ & e_1^{T}\alpha-e_2^{T}\beta=0,\quad 0\le\alpha\le\lambda_3 e_1,\quad 0\le\beta\le\lambda_4 e_2,
\end{aligned} \tag{17}$$

where $K(\cdot,\cdot)$ represents a generalized Gaussian kernel with parameter $\sigma$. Suppose that in the solution, $\alpha$ has $l$ non-zero components and $\beta$ has $s$ non-zero components. From the KKT conditions we can deduce

$$b_1=e_1-\varphi(X_1)^{T}w-\Delta_{11}c_1+\Delta_{12}c_2, \tag{18}$$

and

$$b_2=-e_2-\varphi(X_2)^{T}w-\Delta_{21}c_1+\Delta_{22}c_2, \tag{19}$$

where $b_1$ contains $l$ elements and $b_2$ contains $s$ elements. From Equations (18) and (19), we find

$$b=\frac{e_1^{T}b_1+e_2^{T}b_2}{l+s}. \tag{20}$$
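The dual (17) is a standard box-constrained quadratic program, so any QP solver can be used. Below is a minimal sketch of one way to assemble and solve it and then recover $c_1$, $c_2$ from (13)–(14) and $b$ from (18)–(20); the use of cvxopt, the threshold for "non-zero" components, and all function names are our own assumptions, not choices made in the paper (whose experiments were run in MATLAB).

```python
import numpy as np
from cvxopt import matrix, solvers   # cvxopt is an assumed choice of QP solver

def solve_wgmm_dual(K11, K12, K22, D11, D12, D21, D22,
                    lam1, lam2, lam3, lam4):
    """Solve the dual (17) and recover c1, c2 via (13)-(14) and b via (18)-(20).

    K11 = K(X1, X1), K12 = K(X1, X2), K22 = K(X2, X2); K21 = K12.T.
    The Delta blocks follow problem (4). All names are illustrative.
    """
    p, q = K11.shape[0], K22.shape[0]
    K21 = K12.T
    e1, e2 = np.ones(p), np.ones(q)

    # quadratic term of (17), with z = [alpha; beta]
    A = K11 + D11 @ D11.T / lam1 + D12 @ D12.T / lam2
    B = K22 + D21 @ D21.T / lam1 + D22 @ D22.T / lam2
    C = K12 + D11 @ D21.T / lam1 + D12 @ D22.T / lam2
    P = np.block([[A, -C], [-C.T, B]])
    P = 0.5 * (P + P.T)                      # symmetrise for the solver
    qvec = -np.concatenate([e1, e2])         # linear term: -e1^T alpha - e2^T beta

    # box constraints 0 <= alpha <= lam3 e1, 0 <= beta <= lam4 e2
    m = p + q
    G = np.vstack([-np.eye(m), np.eye(m)])
    h = np.concatenate([np.zeros(m), lam3 * e1, lam4 * e2])
    # equality constraint e1^T alpha - e2^T beta = 0
    Aeq = np.concatenate([e1, -e2]).reshape(1, m)

    sol = solvers.qp(matrix(P), matrix(qvec), matrix(G), matrix(h),
                     matrix(Aeq), matrix(np.zeros(1)))
    z = np.array(sol['x']).ravel()
    alpha, beta = z[:p], z[p:]

    # memory costs from (13)-(14)
    c1 = (D11.T @ alpha - D21.T @ beta) / lam1
    c2 = (D22.T @ beta - D12.T @ alpha) / lam2

    # bias from (18)-(20), averaged over the non-zero dual components
    fx1 = K11 @ alpha - K12 @ beta           # phi(X1)^T w via the kernel trick
    fx2 = K21 @ alpha - K22 @ beta           # phi(X2)^T w
    sv1, sv2 = alpha > 1e-8, beta > 1e-8
    b1 = 1.0 - fx1[sv1] - (D11 @ c1)[sv1] + (D12 @ c2)[sv1]
    b2 = -1.0 - fx2[sv2] - (D21 @ c1)[sv2] + (D22 @ c2)[sv2]
    b = (b1.sum() + b2.sum()) / (sv1.sum() + sv2.sum())
    return alpha, beta, c1, c2, b
```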

New Memory Influence Function

The datasets being memorized affect the generalization ability of the model. Therefore, this paper makes the following improvement: an adaptive memory influence function based on Euclidean distance is proposed so that the model adapts to different datasets; that is, a distinct memory influence function is introduced for each individual sample.

The memory influence function chosen within GMM is

$$\delta(x_i,x_j)=\exp\!\Big(-\frac{\|x_i-x_j\|^2}{2\sigma^2}\Big), \tag{21}$$

where $\sigma$ is a positive parameter that must be selected. However, this function cannot adapt to diverse datasets by itself, which may harm model performance. To address this limitation, we proceed as follows. For a Gaussian distribution, the probability of falling within $\mu\pm 3\sigma$ is 0.9974. This guides the definition of the influence range of each sample point: we take the sample as the neighborhood center and set the neighborhood radius to half of the Euclidean distance $d$ to its nearest heterogeneous point, confining the memory influence function to this neighborhood. More precisely, we set $\frac{d}{2}=3\sigma$, so that $\sigma=\frac{d}{6}$; substituting this into the Gaussian function, we obtain

$$\delta(x_i,x_j)=\exp\!\Big(-\frac{18\,\|x_i-x_j\|^2}{d^2}\Big). \tag{22}$$
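For completeness, the constant 18 in (22) follows directly from substituting $\sigma=d/6$ into (21):

$$\exp\!\Big(-\frac{\|x_i-x_j\|^2}{2\sigma^2}\Big)\Big|_{\sigma=d/6}
=\exp\!\Big(-\frac{\|x_i-x_j\|^2}{2(d/6)^2}\Big)
=\exp\!\Big(-\frac{36\,\|x_i-x_j\|^2}{2d^2}\Big)
=\exp\!\Big(-\frac{18\,\|x_i-x_j\|^2}{d^2}\Big).$$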

At this point, the influence function involves no tunable parameter; it is determined entirely by the distribution of the samples. Introducing an adaptive memory influence function rooted in Euclidean distance characterizes each sample individually, which enhances the ability of the model to accommodate various datasets.
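A minimal sketch of the adaptive influence function is given below. We read $d$ in (22) as the distance from the neighborhood-center sample (the first argument) to its nearest heterogeneous point; that indexing, as well as the function names, is our interpretation rather than notation from the paper.

```python
import numpy as np

def nearest_heterogeneous_distance(X, y):
    """For each sample, the Euclidean distance to its closest sample of the other class."""
    d = np.empty(len(X))
    for i in range(len(X)):
        other = X[y != y[i]]
        d[i] = np.min(np.linalg.norm(other - X[i], axis=1))
    return d

def adaptive_delta(x_center, x_other, d_center):
    """Adaptive memory influence of Equation (22); d_center is the distance from the
    neighborhood-center sample x_center to its nearest heterogeneous point."""
    return np.exp(-18.0 * np.sum((x_center - x_other) ** 2) / d_center ** 2)
```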

The comparative diagrams for the new influence function we developed and the influence function used in GMM are presented below:

Figure 1 illustrates the influence ranges of the influence functions of WGMM and GMM; the black circle centered on each sample is a schematic of that sample's influence range. Figure 1(a) depicts the novel influence function within WGMM and shows that different samples have influence ranges of different sizes. Figure 1(b) represents GMM, where the influence ranges of all samples are uniformly sized. During practical model classification, the influence ranges of different samples may overlap.

Figure 1. An illustrative example employing synthetic data to demonstrate the value range of the memory influence function in linear WGMM and linear GMM. (a) Classification performance of WGMM with $\lambda_1=4$, $\lambda_2=2.30$, $\lambda_3=11.31$ and $\lambda_4=4$, using the influence range defined in Equation (22) as the memory influence function. (b) Classification performance of GMM with $\lambda=11.31$ and $\mu=111.43$, using the influence range of the RBF kernel as the memory influence function.


Finally, we summarize the procedure of our WGMM with the new memory influence function in Algorithm 1.
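Algorithm 1 itself is not reproduced in the text above; the following sketch is our reading of the overall training flow for the nonlinear case, tying together the helper functions sketched in the previous subsections (`nearest_heterogeneous_distance`, `solve_wgmm_dual`). All names and the structure are illustrative, not the paper's own implementation.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Gaussian kernel matrix K(A, B)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def train_wgmm(X1, X2, lam1, lam2, lam3, lam4, sigma):
    """Sketch of the overall WGMM training procedure (nonlinear case)."""
    X = np.vstack([X1, X2])
    y = np.concatenate([np.ones(len(X1)), -np.ones(len(X2))])
    p = len(X1)
    # 1. per-sample radii and adaptive memory influence matrix, Equation (22)
    d = nearest_heterogeneous_distance(X, y)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    delta = np.exp(-18.0 * sq / d[:, None] ** 2)
    D11, D12 = delta[:p, :p], delta[:p, p:]
    D21, D22 = delta[p:, :p], delta[p:, p:]
    # 2. kernel blocks, then solve the dual problem (17) and recover c1, c2, b
    K11 = rbf_kernel(X1, X1, sigma)
    K12 = rbf_kernel(X1, X2, sigma)
    K22 = rbf_kernel(X2, X2, sigma)
    alpha, beta, c1, c2, b = solve_wgmm_dual(K11, K12, K22,
                                             D11, D12, D21, D22,
                                             lam1, lam2, lam3, lam4)
    # the decision function (3) can then be evaluated with these quantities
    return alpha, beta, c1, c2, b, d
```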

Experiments

In this section, we describe our experimental study. Section 4.1 lists the datasets used for the experiments, Section 4.2 provides the metrics used to evaluate all methods, and Sections 4.3 and 4.4 present the specific experimental setup and the analysis of the results, respectively.

Datasets

Our model is evaluated on 31 datasets obtained from (Rosales-Pérez, García, and Herrera Citation2022). Table 1 presents the dataset information, including the data dimension (n), data volume (m), the number of positive samples (p), the number of negative samples (q), and the imbalance ratio (IR) (Lu, Cheung, and Tang Citation2019), i.e. the ratio of the majority-class size to the minority-class size; the datasets are arranged in ascending order of IR. This allows us to observe the performance of WGMM as the degree of imbalance grows.

Table 1. Description of 31 datasets.

Performance Metrics

We employ the Geometric Mean (GM) (Luque et al. Citation2019) as the primary evaluation metric in this paper. The evaluation is based on the confusion matrix, as defined in (Caelen Citation2017), which yields the key quantities True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). Specifically, TP is the number of correctly classified positive samples, FN is the number of positive samples misclassified as negative, TN is the number of correctly classified negative samples, and FP is the number of negative samples misclassified as positive. Following (Luque et al. Citation2019), the sensitivity and specificity are

$$sen=\frac{TP}{TP+FN}, \tag{23}$$
$$spe=\frac{TN}{TN+FP}. \tag{24}$$

Then the GM is

$$GM=\sqrt{sen\cdot spe}. \tag{25}$$
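As a small illustration (our own helper function, with made-up counts), the GM of (25) can be computed directly from the confusion-matrix entries:

```python
def geometric_mean(tp, fn, tn, fp):
    """Geometric Mean of Equation (25) from confusion-matrix counts."""
    sen = tp / (tp + fn)   # sensitivity (23): accuracy on the positive (minority) class
    spe = tn / (tn + fp)   # specificity (24): accuracy on the negative (majority) class
    return (sen * spe) ** 0.5

# e.g. 40 of 50 positives and 900 of 950 negatives correct:
print(geometric_mean(tp=40, fn=10, tn=900, fp=50))   # ~0.871
```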

Reference Methods

Each dataset is divided using 10-fold cross-validation: the dataset is randomly separated into 10 disjoint subsets, and in each fold one subset serves as the test set while the remaining nine form the training set. First, 10-fold cross-validation is used to select the optimal parameters within the specified parameter range. These optimal parameters are then used in 20 further rounds of 10-fold cross-validation, with different random fold partitions in each round, giving a total of 200 tests for each dataset.

The comparison methods selected in this paper are LR (LaValley Citation2008), FSVM-CIL (Batuwita and Palade Citation2010), GMM (Wang and Shao Citation2022), SMOTE-SVM (Chawla et al. Citation2002) and RUS-SVM (He and Garcia Citation2009), which represent advances of recent years. Linear experiments with these methods were conducted on all 31 datasets, while 17 datasets were chosen for the nonlinear experiments. Hyperparameter selection is performed over fixed grids: the regularization coefficients are chosen from $\{2^{i}\,|\,i=-8,-6,\ldots,6\}$, and the kernel parameter in the nonlinear experiments is chosen from $\{2^{i}\,|\,i=-10,-8,\ldots,10\}$. All models were implemented on a personal computer with an Intel Core dual-core processor (4.2 GHz) and 32 GB of RAM using MATLAB 2017a, and the quadratic programming problems of all models were solved with the same algorithm and tolerance.
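The protocol can be summarized with the sketch below. It is only our reading of the setup: the experiments in the paper were run in MATLAB, whereas this sketch uses scikit-learn's StratifiedKFold, an exhaustive grid over all four $\lambda$ parameters, and a hypothetical `train_and_score` callback that trains WGMM on the training folds and returns the GM on the test fold.

```python
import itertools
import numpy as np
from sklearn.model_selection import StratifiedKFold

PARAM_GRID = [2.0 ** i for i in range(-8, 7, 2)]      # regularization range {2^i | i = -8, -6, ..., 6}
KERNEL_GRID = [2.0 ** i for i in range(-10, 11, 2)]   # kernel-parameter range for the nonlinear case

def cv_gm(train_and_score, X, y, params, n_splits=10, seed=None):
    """Mean GM over one round of (stratified) 10-fold cross-validation."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = [train_and_score(X[tr], y[tr], X[te], y[te], params)
              for tr, te in folds.split(X, y)]
    return float(np.mean(scores))

def evaluate(train_and_score, X, y):
    """Select parameters by 10-fold CV, then repeat 10-fold CV 20 times (200 tests)."""
    candidates = itertools.product(PARAM_GRID, repeat=4)   # lambda_1 .. lambda_4 (exhaustive, slow)
    best = max(candidates, key=lambda prm: cv_gm(train_and_score, X, y, prm))
    repeats = [cv_gm(train_and_score, X, y, best, seed=r) for r in range(20)]
    return best, float(np.mean(repeats)), float(np.std(repeats))
```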

Experiments Results and Discussion

In this section, we offer a comparative analysis of the performance of WGMM and the alternative methods. We conduct experiments on 31 datasets for the linear analysis and 17 datasets for the nonlinear study. Table 2 presents the results of the linear experiments for each approach, and Table 3 presents a selection of datasets illustrating the nonlinear results. Figure 2 illustrates the overall performance of each approach on all datasets, Figure 3 displays the optimal parameters $\lambda_1$ and $\lambda_2$ selected by the WGMM method during the experiments, Figure 4 portrays the model results with different parameters on an example dataset, and Figure 5 highlights the relationship between the parameters and the GM metric.

Figure 2. (a) depicts the GM distribution for 31 datasets across 5 linear algorithms. (b) illustrates the GM distribution for 17 datasets using 5 nonlinear algorithms.


Figure 3. The optimal reference ratio between $\lambda_1$ and $\lambda_2$ in formula (4). (a) Linear WGMM algorithm. (b) Nonlinear WGMM algorithm.


Figure 4. Results of dataset yeast-1-4-5-8, where the 1st and 4th dimensions are selected.


Figure 5. GM values of the WGMM method over the selected parameter range on datasets with different IR: (a) glass0, (b) vehicle2, (c) ecoli1, (d) ecoli3, (e) abalone9-18, (f) winequality-red-4, (g) yeast6, (h) poker-8-9 vs 6, (i) poker-8 vs 6.


Table 2. GM comparison of linear experiments.

Table 3. GM comparison of nonlinear experiments.

Table 2 displays the evaluation results of the linear models under the GM metric. WGMM clearly exhibits the best overall performance, particularly when the imbalance ratio (IR) is large, and our method is generally better than LR. GMM has a GM value of 0.00 on 9 datasets, meaning that the test accuracy on the minority class of these datasets is 0.00; GMM therefore performs poorly on imbalanced data. In terms of variance, WGMM falls within an acceptable range, while RUS-SVM excels, with a variance of 0.00 on most datasets.

Table 3 presents the evaluation of the nonlinear experiments under the GM metric. Notably, FSVM-CIL performs well on datasets with a small IR, where WGMM and FSVM-CIL show comparable performance. However, as the IR increases, WGMM emerges as the leading performer, with significantly better results than the other methods.

Comparing Tables 2 and 3, we find:

  1. Of the 17 datasets used in both settings, 11 exhibit a distinctive pattern: under the GM metric, our linear algorithm outperforms the nonlinear model, which can be attributed to the characteristics of the data.

  2. The datasets ecoli-0-3-4-7 vs 5-6, ecoli-0-1-4-7 vs 2-3-5-6, glass-0-1-6 vs 5, yeast-1-4-5-8 vs 7 and glass5 exhibited a GM value of 0.00 in the GMM linear experiments. While substantial improvements were observed in the nonlinear experiments, the results remained unsatisfactory.

Figure 2 consists of two boxplots. Figure 2(a) compares the GM values of the five methods in the linear experiments. WGMM exhibits higher median, first-quartile, and third-quartile values than the other methods, with shorter whiskers, indicating its superior performance. Conversely, the GMM method shows a lower box, longer whiskers, and lower outliers, clearly highlighting the subpar performance of unweighted methods on imbalanced data. Figure 2(b) compares the five methods in the nonlinear experiments. Overall, both FSVM-CIL and WGMM perform better: their boxes are short and skewed upward, signifying larger GM values. While the median and first quartile of FSVM-CIL are slightly greater than those of WGMM, the third quartile of WGMM is larger and its upper whisker is longer, suggesting that the predictive results of WGMM are notably larger.

In formula (4), $\lambda_1$ and $\lambda_2$ are the coefficients of the positive- and negative-class memory cost functions, respectively, with $\lambda_1$ referring to the positive class. Owing to the imbalance in sample sizes between the positive and negative classes, applying the same weight would bias the model toward the class with more samples; hence, it is essential to assign different weights, increasing the weight of the minority class to balance the performance of the model on both types of samples. Figure 3(a) shows the optimal reference comparison of $\lambda_1$ and $\lambda_2$ for each dataset in the linear experiments, and Figure 3(b) shows the corresponding comparison for the nonlinear experiments. Most of the datasets show $\lambda_1>\lambda_2$, implying that greater weight is assigned to the minority samples, thereby achieving data balance through different weights. In situations of class imbalance, the model thus attains optimal classification.

Figure 4 uses the 1st and 4th dimensions of the dataset yeast-1-4-5-8 vs 7 for training and testing. Figure 4(a) shows the result with the optimal parameters, namely $\lambda_1=2^{2}$, $\lambda_2=2^{4}$, $\lambda_3=2^{0}$ and $\lambda_4=2^{2}$. Figure 4(b) shows the result when the weights of the memory cost terms are equal, that is, $\lambda_1=\lambda_2=2^{4}$. It is evident that in Figure 4(b) the trained classifier is biased toward the majority (negative) class samples, which underscores the importance of weighting.

Figure 5 shows how the GM value changes with $\lambda_1$ and $\lambda_2$ in the linear experiments on different datasets. We can observe the following points.

  1. Different datasets correspond to different optimal parameters λ1 and λ2, which may be affected by the data structure and IR value.

  2. For many datasets, the GM value is larger when λ1>λ2, and with non-optimal parameters the GM value can drop to 0.00.

  3. The larger the IR, the fewer parameter groups have a GM value greater than 0.5, and the more parameter groups have a GM value of 0.00.

From the above results, it can be seen that WGMM is a competitive choice.

Conclusions

In this study, our WGMM sets separate weight parameters for the memory cost functions of the positive- and negative-class samples, and a new adaptive memory influence function is proposed. With this function, samples are described individually without increasing the weight of an entire class; each sample is not affected by training samples of the other class, so the model can adapt to different imbalance scenarios while ensuring zero empirical risk and retaining generalization ability.

We conduct a comprehensive evaluation of WGMM on 31 imbalanced datasets, benchmarking its performance against alternative methods. Experimental results demonstrate the efficacy of WGMM in tackling imbalanced classification problems and its robust generalization ability. Notably, in scenarios with a substantial IR, our model clearly stands out and excels across the evaluation metrics.

Regarding the different parameters set for the positive- and negative-class memory cost functions, the experiments show that the optimal parameter for the minority class is greater than or equal to the optimal parameter for the majority class. The adaptive memory influence function enables the model to adapt well to different data, and its effect is better than that of the other methods.

These findings affirm WGMM as an effective approach for addressing imbalanced classification challenges. Its superior performance, its adaptability across different imbalanced datasets, and its parameter-free memory influence function position WGMM as a potent tool for practical applications.

Future research avenues may involve further optimizing the performance of WGMM and exploring its applicability in diverse fields and tasks. In summary, this study provides novel insights and effective solutions for resolving imbalanced classification problems.

Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable suggestions.

Disclosure Statement

No potential conflict of interest was reported by the author(s).

Data Availability Statement

The data that support the findings of this study are available in [IEEE] at [doi: 10.1109/TCYB.2022.3163974], reference (Rosales-Pérez, García, and Herrera Citation2022). These data were derived from the following resources available in the public domain: [https://ieeexplore.ieee.org/document/9756639].

Supplementary Material

Supplemental data for this article can be accessed online at https://doi.org/10.1080/08839514.2024.2355424

Additional information

Funding

The work was supported by the National Natural Science Foundation of China [61966024, 62366035]; National Natural Science Foundation of China [62106112]; Natural Science Foundation of Inner Mongolia Autonomous Region [2023MS01006].

References

  • Abdelhamid, D., S. Khaoula, and O. Atika. 2014. Automatic bank fraud detection using support vector machines. In The International Conference on Computing Technology and Information Management, Dubai, UAE, January.
  • Alfian, G., M. Syafrudin, I. Fahrurrozi, N. L. Fitriyani, F. T. D. Atmaji, T. Widodo, N. Bahiyah, F. Benes, and J. Rhee. 2022. Predicting breast cancer from risk factors using SVM and extra-trees-based feature selection method. Computers 11 (9):136. doi:10.3390/computers11090136.
  • Bach, F. R., D. Heckerman, and E. Horvitz. 2006. Considering cost asymmetry in learning classifiers. Journal of Machine Learning Research 7:1713–20.
  • Batuwita, R., and V. Palade. 2010. FSVM-CIL: Fuzzy support vector machines for class imbalance learning. IEEE Transactions on Fuzzy Systems 18 (3):558–71. doi:10.1109/TFUZZ.2010.2042721.
  • Caelen, O. 2017. A Bayesian interpretation of the confusion matrix. Annals of Mathematics and Artificial Intelligence 81 (3–4):429–50. doi:10.1007/s10472-017-9564-8.
  • Chaabane, S. B., M. Hijji, R. Harrabi, and H. Seddik. 2022. Face recognition based on statistical features and SVM classifier. Multimedia Tools & Applications 81 (6):8767–84. doi:10.1007/s11042-021-11816-w.
  • Chawla, N. V., K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. 2002. SMOTE: Synthetic minority over-sampling technique. The Journal of Artificial Intelligence Research 16:321–57. doi:10.1613/jair.953.
  • Chen, C., X. Xu, G. Wang, and L. Yang. 2022. Network intrusion detection model based on neural network feature extraction and PSO-SVM. In 2022 7th International Conference on Intelligent Computing and Signal Processing (ICSP), Xi’an, China, April, 1462–65.
  • Datta, S., and S. Das. 2015. Near-Bayesian support vector machines for imbalanced data classification with equal or unequal misclassification costs. Neural Networks 70:39–52. doi:10.1016/j.neunet.2015.06.005.
  • Fletcher, R. 2000. Practical methods of optimization. New Jersey, USA: John Wiley & Sons.
  • Hamdan, Y. B., and A. Sathesh. 2021. Construction of statistical SVM based recognition model for handwritten character recognition. Journal of Information Technology and Digital World 3 (2):92–107. doi:10.36548/jitdw.2021.2.003.
  • Hart, P. 1968. The condensed nearest neighbor rule (corresp.). IEEE Transactions on Information Theory 14 (3):515–16. doi:10.1109/TIT.1968.1054155.
  • Harvianto, H., L. Ashianti, J. Jupiter, and S. Junaedi. 2016. Analysis and voice recognition in Indonesian language using MFCC and SVM method. ComTech: Computer, Mathematics and Engineering Applications 7 (2):131–39. doi:10.21512/comtech.v7i2.2252.
  • He, H., Y. Bai, E. A. Garcia, and S. Li. 2008. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks, Hong Kong, China, June, 1322–28.
  • He, H., and E. A. Garcia. 2009. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21 (9):1263–84. doi:10.1109/TKDE.2008.239.
  • Hussain, M., S. K. Wajid, A. Elzaart, and M. Berbar. 2011. A comparison of SVM kernel functions for breast cancer detection. In 2011 Eighth International Conference Computer Graphics, Imaging and Visualization, Singapore, August, 145–50.
  • Imam, T., K. M. Ting, and J. Kamruzzaman. 2006. z-SVM: An SVM for improved classification of imbalanced data. In AI 2006: Advances in Artificial Intelligence: 19th Australian Joint Conference on Artificial Intelligence, Proceedings, Hobart, Australia, Vol. 19, 264–73.
  • Iranmehr, A., H. Masnadi-Shirazi, and N. Vasconcelos. 2019. Cost-sensitive support vector machines. Neurocomputing 343:50–64. doi:10.1016/j.neucom.2018.11.099.
  • Kibria, M. R., A. Ahmed, Z. Firdawsi, and M. A. Yousuf. 2020. Bangla compound character recognition using support vector machine (SVM) on advanced feature sets. In 2020 IEEE Region 10 Symposium, Dhaka, Bangladesh, 965–68.
  • Krawczyk, B. and M. Woźniak. 2015. Cost-sensitive neural network with roc-based moving threshold for imbalanced classification. In 16th International Conference, Wroclaw, Poland, October 14–16, 2015, Proceedings 16, 45-52.
  • Krawczyk, B., M. Woźniak, and G. Schaefer. 2014. Cost-sensitive decision tree ensembles for effective imbalanced classification. Applied Soft Computing 14:554–62. doi:10.1016/j.asoc.2013.08.014.
  • Kumar, M., R. K. Sharma, and M. K. Jindal. 2011. SVM based offline handwritten Gurmukhi character recognition. SCAKD Proceedings 758:51–62.
  • LaValley, M. P. 2008. Logistic regression. Circulation 117 (18):2395–99. doi:10.1161/CIRCULATIONAHA.106.682658.
  • Lilhore, U. K., S. Simaiya, H. Pandey, V. Gautam, A. Garg, and P. Ghosh. 2022. Breast Cancer Detection in the IoT Cloud-based Healthcare Environment Using Fuzzy Cluster Segmentation and SVM Classifier. In Ambient Communications and Computer Systems. Lecture Notes in Networks and Systems, ed. Y. Hu, S. Tiwari, M. C. Trivedi, and K. K. Mishra. Vol. 356. Singapore: Springer.
  • Lu, Y., Y. M. Cheung, and Y. Y. Tang. 2019. Bayes imbalance impact index: A measure of class imbalanced data set for classification problem. IEEE Transactions on Neural Networks and Learning Systems 31 (9):3525–39. doi:10.1109/TNNLS.2019.2944962.
  • Luo, H., X. Pan, Q. Wang, S. Ye, and Y. Qian 2019. Logistic regression and random forest for effective imbalanced classification. In 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC), Milwaukee, WI, USA. 1:916–17.
  • Luque, A., A. Carrasco, A. Martin, and A. de Las Heras. 2019. The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recognition 91:216–31. doi:10.1016/j.patcog.2019.02.023.
  • Mathew, J., C. K. Pang, M. Luo, and W. H. Leong. 2017. Classification of imbalanced data by oversampling in kernel space of support vector machines. IEEE Transactions on Neural Networks and Learning Systems 29 (9):4065–76. doi:10.1109/TNNLS.2017.2751612.
  • Nasien, D., H. Haron, and S. S. Yuhaniz. 2010. Support Vector Machine (SVM) for English handwritten character recognition. In 2010 2nd International Conference on Computer Engineering and Applications, Bali Island, 1:249–52.
  • Nguyen, H. M., E. W. Cooper, and K. Kamei. 2011. Borderline over-sampling for imbalanced data classification. International Journal of Knowledge Engineering and Soft Data Paradigms 3 (1):4–21. doi:10.1504/IJKESDP.2011.039875.
  • Rosales-Pérez, A., S. García, and F. Herrera. 2022. Handling imbalanced classification problems with support vector machines via evolutionary bilevel optimization. IEEE Transactions on Cybernetics 53 (8):4735–47. doi:10.1109/TCYB.2022.3163974.
  • Seo, H., L. Brand, L. S. Barco, and H. Wang. 2022. Scaling multi-instance support vector machine to breast cancer detection on the BreaKHis dataset. Bioinformatics 38 (Supplement_1):92–100. doi:10.1093/bioinformatics/btac267.
  • Sudha, C., and D. Akila. 2021. Credit card fraud detection system based on operational & transaction features using svm and random forest classifiers. In 2021 2nd International Conference on Computation, Automation and Knowledge Management (ICCAKM), Dubai, United Arab Emirates, 133–38.
  • Sui, Y., Y. Wei, and D. Zhao. 2015. Computer-aided lung nodule recognition by SVM classifier based on combination of random undersampling and SMOTE. Computational & Mathematical Methods in Medicine 2015:1–13. doi:10.1155/2015/368674.
  • Vapnik, V., and R. Izmailov. 2021. Reinforced SVM method and memorization mechanisms. Pattern Recognition 119:108018. doi:10.1016/j.patcog.2021.108018.
  • Wang, Z., and Y. H. Shao. 2022. Generalization-Memorization Machines. arXiv preprint arXiv:2207.03976.
  • Xiong, Y., and R. Zuo. 2020. Recognizing multivariate geochemical anomalies for mineral exploration by combining deep learning and one-class support vector machine. Computers & Geosciences 140:104484. doi:10.1016/j.cageo.2020.104484.
  • Zhang, S., Q. Fu, and W. Xiao. 2017. Advertisement click-through rate prediction based on the weighted-ELM and adaboost algorithm. Scientific Programming 2017:2938369. doi:10.1155/2017/2938369.
  • Zheng, X. 2020. SMOTE variants for imbalanced binary classification: Heart disease prediction. Ann Arbor, Michigan, USA: Los Angeles ProQuest Dissertations Publishing. University of California.