Research Article

A novel bias-alleviated hybrid ensemble model based on over-sampling and post-processing for fair classification

Article: 2184310 | Received 12 Dec 2022, Accepted 21 Feb 2023, Published online: 17 Mar 2023

Abstract

With the rapid development of machine learning in the field of classification, classification fairness has become a research emphasis second only to prediction accuracy. However, the data bias and algorithmic discrimination that affect the fair classification of models have not been well resolved, and they may harm or benefit specific groups associated with sensitive attributes (e.g. age, race, and gender). To alleviate the unfairness of classification models, this study proposes a novel bias-alleviated hybrid ensemble model (BAHEM) based on over-sampling and post-processing. First, a new clustering-based over-sampling method is proposed to reduce the data bias caused by imbalance in the label and the sensitive attribute. Then, a stacking-based ensemble learning method is employed to improve the performance and robustness of the BAHEM. Finally, a new classification with alternating normalisation (CAN)-based post-processing method is proposed to further improve the fairness of the BAHEM while maintaining its accuracy. Three datasets with different sensitive attributes and four evaluation metrics were used to evaluate the prediction accuracy and fairness of the BAHEM. The experimental results verify the superior fairness of the BAHEM with little reduction in accuracy.

1. Introduction

With the extensive application of machine learning models for solving various classification problems, including credit scoring (Dastile et al., Citation2020), crime prediction (Kim et al., Citation2018), software performance prediction (Liu et al., Citation2021), and loan applications (Wang et al., Citation2020), social fairness has received increasing attention. Fair classification of models can be affected by data bias and algorithmic discrimination. The raw datasets used to train machine learning models may contain human biases, intended or unintended, such as biases related to gender, race, or age. Models trained using biased datasets will learn and reproduce the data bias and discriminate against certain groups. Therefore, a fair machine learning model is urgently required.

The fair classification problem was first defined by Kamiran and Calders (Citation2009); it mainly results from the unbalanced training of some sensitive attributes in machine learning models due to the imbalanced distribution of training data or data bias. For example, if gender is taken as a sensitive attribute and the male-female ratio in a dataset is 1:100, the female samples will dominate the training stage of the model, making the model's predictions for male samples inaccurate. In the field of credit scoring, traditional machine learning methods only consider the distribution of labels and tend to make full use of all available information to maximise prediction accuracy. If the distribution of sensitive attributes is imbalanced, the classification results of the model may be biased towards the majority class, leading to an unfair model. Therefore, a model that can handle the trade-off between accuracy and fairness is needed, and its validity should be verified by experiments on real datasets. The aim of research on the fair classification problem is to improve the fairness of models with little reduction in prediction accuracy by alleviating data bias and algorithmic discrimination.

On the one hand, real-world datasets are commonly imbalanced, both in the labels and in the sensitive attributes. Imbalanced labels may reduce the prediction accuracy of machine learning models, and imbalanced sensitive attributes may have a discriminative impact on unprivileged groups, affecting the classification fairness of models. Researchers have proposed several methods to balance datasets, including the synthetic minority over-sampling technique (SMOTE; Chawla et al., Citation2002), adaptive synthetic sampling (ADASYN; He et al., Citation2008), and balance cascade (Liu et al., Citation2008). However, existing sampling methods only consider the imbalance in the labels while ignoring the imbalance in the sensitive attributes. Therefore, a sampling method that considers imbalance in both the labels and the sensitive attributes is required.

On the other hand, to improve the performance and robustness of machine learning models, ensemble learning methods are widely adopted, including extreme gradient boosting (XGBoost; Chen & Guestrin, Citation2016), light gradient boosting machine (LightGBM; Ke et al., Citation2017), and gradient boosting decision tree (GBDT; Friedman, Citation2001). Although deep learning methods have been widely used to solve a variety of classification problems with excellent performance, such as gesture recognition (Qi et al., Citation2021) and sarcasm identification (Onan & Toçoğlu, Citation2021), it is difficult to analyse the fairness of deep learning models due to their black-box nature. Therefore, this study analyses the fairness of machine learning models and uses ensemble learning to enhance the fairness of models while maintaining their accuracy. In addition, classification with alternating normalisation (CAN) was adopted to readjust the predicted results so as to further improve the classification performance of machine learning models (Jia et al., Citation2021). However, these methods only focus on the performance of machine learning models while ignoring their classification fairness.

Therefore, the motivation of this study is to provide an ensemble model to alleviate the data bias and algorithmic discrimination for fair classification. The main contributions of this study are listed as follows:

  1. A novel bias-alleviated hybrid ensemble model (BAHEM) based on over-sampling and post-processing is proposed in this study to enhance the classification fairness and maintain the accuracy of ensemble models.

  2. A new clustering-based over-sampling method is proposed to balance the label and sensitive attribute automatically by generating new samples according to the data distributions. The clustering method can improve the sampling efficiency by separating the dataset into several subsets.

  3. A stacking-based ensemble learning method is employed to adaptively select and integrate competent base classifiers with higher average rankings of accuracy and fairness, which are output from the first layer of stacking. Hence, the performance and robustness of the proposed BAHEM are improved.

  4. A new CAN-based post-processing method is proposed to further improve the fairness and maintain the accuracy of the BAHEM by modifying the prediction results that have higher uncertainty and correspond to the majority of the sensitive attribute.

  5. Three datasets with different sensitive attributes and four evaluation metrics (two traditional performance metrics and two fairness metrics) are adopted to evaluate the classification performance and fairness of the BAHEM.

The remainder of this study is organised as follows. In Section 2, related work on data sampling methods, ensemble learning methods, and fair classification methods is reviewed. In Section 3, details of the proposed BAHEM are presented. The experimental settings, including the datasets, evaluation metrics, and parameter settings of the models, are presented in Section 4. In Section 5, the experimental investigation and the comparison of the performances of the BAHEM and other benchmark models are described. The conclusions and suggestions for future work are presented in Section 6.

2. Related work

The proposed BAHEM in this study primarily involves three aspects: data sampling methods, ensemble learning methods, and fair classification methods. The literature on these three aspects is reviewed in this section.

2.1. Data sampling methods

The problem of imbalanced datasets is one of the greatest challenges in training machine learning models, because models trained on imbalanced datasets may be biased in favour of the majority class (Thabtah et al., Citation2020). To reduce the negative influence of imbalanced datasets, two data sampling approaches are mainly used in current research: under-sampling and over-sampling. Under-sampling methods balance datasets by removing some of the majority samples to reduce the size of the majority class. For example, Onan (Citation2019) proposed a consensus-clustering-based under-sampling approach that combined five different clustering algorithms to balance the dataset. Devi et al. (Citation2019) analysed the effects of data imbalance in machine learning models and proposed a Tomek-link under-sampling algorithm to address it. Guzmán-Ponce et al. (Citation2021) proposed a two-stage under-sampling method that combines a clustering method for filtering the majority samples with a graph-based procedure for determining the appropriate imbalance ratio (IR) for each subset. Jiang et al. (Citation2022) proposed a boosting random forest with static under-sampling and ensemble methods to reduce the overlap between classes. However, under-sampling methods may lose potentially informative data when removing the majority samples.

In contrast to under-sampling methods, over-sampling methods balance datasets by generating minority samples. For instance, Tao et al. (Citation2019) proposed a real-value negative selection over-sampling method that can generate minority samples without reusing minority samples from the original dataset and avoids generating noise samples. Puntumapon et al. (Citation2016) proposed a clustering-based over-sampling method to reduce model overfitting and improve the generalisation of the generated minority samples. However, the over-sampling methods in these studies mainly consider how to improve the prediction accuracy of the model while ignoring its classification fairness.

Therefore, in this study, a new clustering-based over-sampling method is proposed that can automatically balance the label and sensitive attribute in the datasets according to their IR, thereby improving the adaptability to imbalanced datasets and the classification fairness of the model.

2.2. Ensemble learning methods

Ensemble learning methods, among the most effective ways to improve the performance of base classifiers, have been widely adopted and extended to solve various classification problems, such as text classification (Onan, Citation2018) and sentiment classification (Onan et al., Citation2017). Ensemble learning methods mainly include bootstrap aggregation (bagging; Breiman, Citation1996), boosting (Freund & Schapire, Citation1996), and stacking (Wolpert, Citation1992). Onan et al. (Citation2016) integrated statistical keyword extraction methods with bagging and boosting and verified the effectiveness of ensemble methods in the field of text classification. Among these methods, stacking has proven to be an efficient and flexible ensemble learning approach that integrates the prediction results of base classifiers to obtain final predictions with higher accuracy. For example, Xu et al. (Citation2020) combined k-means clustering and ensemble learning to forecast stock market prices. Potha et al. (Citation2021) proposed a sophisticated extrinsic random-based ensemble method to detect malware and demonstrated the effectiveness of ensemble learning methods. Xia et al. (Citation2021) proposed a weighted stacking ensemble with sparsity regularisation, which adjusts the weights of the base classifiers according to the label correlations in multi-label classification problems. Li and Li (Citation2022) improved the adaptive boosting (AdaBoost) algorithm with weight-adjustment factors to handle imbalanced data classification with minority samples. In our previous study, Zhang et al. (Citation2021) proposed a stacking ensemble method that combined outlier detection and sampling methods to boost the prediction accuracy and generalisation ability of the model.

However, existing stacking ensemble methods select and integrate the base classifiers with higher prediction accuracy while ignoring the classification fairness of the base classifiers, which may cause the obtained ensemble model to be biased toward certain groups. Therefore, in this study, a stacking-based ensemble learning method is employed to select competent base classifiers with higher accuracy and fairness to improve the classification fairness of the model.

2.3. Fair classification methods

The methods used to improve the classification fairness of the model can be separated into three categories: the pre-processing methods, the in-processing methods, and the post-processing methods.

Pre-processing methods aim to reduce data bias and ensure the classification fairness of the model by transforming the distribution of the datasets (d’Alessandro et al., Citation2017). Among the existing pre-processing methods, the most popular ones are instance sampling (Iosifidis et al., Citation2019), transformation (Calmon et al., Citation2017), and label swapping (Kamiran & Calders, Citation2012). For instance, Petrović et al. (Citation2022) developed a sample re-weighting method to reduce data bias and improve classification fairness by learning sample weighting functions using adversarial training algorithms.

In-processing methods modify the state-of-the-art machine learning algorithms by changing the constraints to enhance the classification fairness of the model (Mehrabi et al., Citation2021). For instance, Iosifidis and Ntoutsi (Citation2019) extended AdaBoost to a fairness-aware classifier that considers the classification fairness of each classifier while updating the sample weights. Zafar et al. (Citation2017) proposed a notion of fairness as the constraint in the objective function of machine learning models.

Post-processing methods modify prediction results to enhance the classification fairness of the model (Iosifidis et al., Citation2019). As an example, Fish et al. (Citation2016) designed a boosting classifier that improves classification fairness by modifying the decision boundary of the classifiers to protect unprivileged groups. Lohia et al. (Citation2019) proposed a fairness post-processing method that ranks samples using a bias reduction algorithm to enhance both group fairness and individual fairness.

However, the fair classification methods in these three categories improve the classification fairness of the model at the cost of considerable prediction accuracy, which may cause unexpected losses when the model is used for decision making. Therefore, inspired by the pre-processing and post-processing methods, a novel BAHEM based on over-sampling and post-processing is proposed in this study. It includes a new clustering-based over-sampling method as the pre-processing step and a new CAN-based post-processing method, which together alleviate data bias and improve the classification fairness of the model with little reduction in prediction accuracy.

3. Model

In this study, a novel BAHEM based on over-sampling and post-processing is proposed to ensure the fairness and alleviate the data bias of the classification model. Figure 1 shows the framework of the proposed BAHEM. Three methods (i.e. the clustering-based over-sampling method, the stacking-based ensemble learning method, and the CAN-based post-processing method) constitute the three stages of the BAHEM. First, the original dataset is separated into training data and testing data, and the training data is further separated into a training set and a validation set. Second, the training set is clustered into subsets, which are over-sampled to obtain a balanced training set. Third, the base classifiers trained on the balanced training set are selected according to the accuracy and fairness of each classifier and integrated into the stacked ensemble model, which predicts the testing data. Finally, the prediction results are modified based on the uncertainty score of each result, and the modified prediction results are taken as the final prediction results. The details of the process are presented in the following sub-sections.

Figure 1. Framework of the proposed BAHEM.


3.1. Clustering-based over-sampling method

Data imbalance is a common problem in real-world datasets, existing not only in the labels but also in the sensitive attributes. Machine learning models trained using imbalanced data will be biased against some of the classes: just as label imbalance affects the accuracy of the model, sensitive-attribute imbalance affects its fairness. A clustering algorithm gathers the samples into clusters such that samples within a cluster are as similar as possible and samples in different clusters are as different as possible, reducing the data imbalance within each cluster. Clustering algorithms have been widely used to address the imbalance problem of datasets (Onan, Citation2019; Xu et al., Citation2020). In this study, a new clustering-based over-sampling method that considers the bias in both the label and the sensitive attribute is proposed to balance them by comparing their respective IRs.

As exhibited in Figure 2, the feature weights of the training set are first adjusted. For example, F1 to Fn represent all features of the training set, and Fs represents the sensitive attribute. Fs is duplicated as Fs′ and appended to the training set, producing the adjusted training set. This increases the weight of the sensitive attribute Fs, causing the subsequent clustering algorithm to pay more attention to it. A clustering algorithm then separates the adjusted training set into several subsets (e.g. subset 1 and subset 2) to improve sampling efficiency. The IR, i.e. the ratio of the number of majority-class samples to the number of minority-class samples, is calculated for the label (IRL) and for the sensitive attribute (IRSA) in each subset, and the popular over-sampling method ADASYN is employed to balance whichever is more imbalanced: if IRSA is greater than or equal to IRL (e.g. in subset 1), ADASYN balances the sensitive attribute; if IRSA is less than IRL (e.g. in subset 2), ADASYN balances the label. Finally, after all subsets are over-sampled, they are merged into a balanced training set.
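The per-subset IR comparison can be sketched as follows. This is a minimal illustration, not the paper's implementation: the cluster assignments are taken as given, a naive random-duplication over-sampler stands in for ADASYN, and all function names are ours.

```python
import numpy as np
from collections import Counter

def imbalance_ratio(values):
    """IR: number of majority-class samples / number of minority-class samples."""
    counts = Counter(values)
    return max(counts.values()) / min(counts.values())

def oversample_minority(rows, col):
    """Randomly duplicate minority rows of column `col` until it is balanced
    (a naive stand-in for ADASYN in this sketch)."""
    rng = np.random.default_rng(0)
    counts = Counter(rows[:, col])
    minority = min(counts, key=counts.get)
    need = max(counts.values()) - counts[minority]
    pool = rows[rows[:, col] == minority]
    return np.vstack([rows, pool[rng.integers(0, len(pool), size=need)]])

def cluster_oversample(data, label_col, sa_col, clusters):
    """Within each cluster, balance whichever of the label (IRL) and the
    sensitive attribute (IRSA) has the larger imbalance ratio."""
    balanced = []
    for c in np.unique(clusters):
        sub = data[clusters == c]
        ir_l = imbalance_ratio(sub[:, label_col])
        ir_sa = imbalance_ratio(sub[:, sa_col])
        balanced.append(oversample_minority(sub, sa_col if ir_sa >= ir_l else label_col))
    return np.vstack(balanced)
```

In practice, the cluster labels would come from k-means on the weight-adjusted training set, and `oversample_minority` would be replaced by `imblearn.over_sampling.ADASYN`.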

Figure 2. Schematic diagram of clustering-based over-sampling method.


3.2. Stacking-based ensemble learning method

Base classifiers commonly have problems such as low accuracy and fairness, which can be alleviated through classifier integration. Therefore, a stacking-based ensemble learning method is employed in the proposed BAHEM to integrate multiple base classifiers with higher accuracy and fairness.

As shown in Figure 3, multiple base classifiers (Clf 1, Clf 2, … , Clf m) in the base classifier pool are trained using the balanced training set. Because accuracy and average odds difference (AOD; Bellamy et al., Citation2019) are the most commonly used indicators, they are evaluated when selecting the competent base classifiers. After the base classifiers are trained, the accuracy (ACC; Stehman, Citation1997) and AOD of each base classifier on the validation set are calculated and ranked. The top k competent classifiers are then selected according to the average ranking of ACC and AOD. The k selected competent base classifiers (Sclf 1, Sclf 2, … , Sclf k) are further permuted and combined into several ensemble classifiers, whose prediction results are used as new features to train the meta classifier. Because Xia et al. (Citation2018) proved the superior performance of logistic regression (LR) as a meta classifier in stacking methods, LR is adopted as the meta classifier. Finally, the stacked ensemble model is obtained.
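The average-ranking selection step can be sketched as follows (an illustrative snippet under our own naming; the paper does not specify how ties are broken, so ties here simply fall back to the sort order):

```python
def select_competent(scores, k):
    """Select the k base classifiers with the best average ranking of
    ACC (higher is better) and AOD (lower is better) on the validation set.

    scores: dict mapping classifier name -> (acc, aod).
    """
    names = list(scores)
    # Rank positions: 0 is best for each metric.
    acc_rank = {n: r for r, n in enumerate(sorted(names, key=lambda n: -scores[n][0]))}
    aod_rank = {n: r for r, n in enumerate(sorted(names, key=lambda n: scores[n][1]))}
    avg_rank = {n: (acc_rank[n] + aod_rank[n]) / 2 for n in names}
    return sorted(names, key=lambda n: avg_rank[n])[:k]
```

The selected names would then index into the trained classifier pool before the permutation-and-combination step.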

Figure 3. Schematic diagram of stacking-based ensemble learning method.


3.3. CAN-based post-processing method

Traditional machine learning models are trained using training data and predict testing data to evaluate their classification performance. Jia et al. (Citation2021) proved that classification performance can be improved by readjusting the prediction results of challenging samples, and proposed the CAN method. In a post-processing method, selecting and modifying the appropriate prediction results is the key to improving the accuracy and fairness of the model. Therefore, a new CAN-based post-processing method is proposed in this study. In contrast to the original CAN method, which only improves the accuracy of the model, the proposed method improves the classification fairness of the model while maintaining accuracy.

To evaluate the fairness of the model intuitively, the data can be divided into an unprivileged group and a privileged group according to a sensitive attribute; the differences in true positive rate and false positive rate between these two groups are calculated, and their average is taken as the fairness of the model. To improve fairness, these differences between the unprivileged and privileged groups should be as small as possible. The CAN-based post-processing method proposed in this study selects and modifies the prediction results with higher uncertainty to reduce these differences; hence, the fairness of the model is improved.

As depicted in Figure 4, the stacked ensemble model is first used to obtain the original prediction results by predicting the testing data. Then, the CAN method, which defines entropy as the uncertainty score, is adopted to calculate the uncertainty score of each prediction result. After all uncertainty scores are calculated, a threshold is given, and the prediction results with uncertainty scores higher than the threshold are selected. Because the prediction results of a model trained on an imbalanced dataset are biased towards the majority class, datasets are usually balanced by reducing the majority class or increasing the minority class. Similarly, when analysing the fairness of the model, the prediction results will be biased towards the group belonging to the majority of the sensitive attribute. To ensure that the modifications improve the fairness of the proposed BAHEM, only the prediction results corresponding to the majority of the sensitive attribute are selected for modification. For example, assume that the sensitive attribute and prediction results are both binary and that "1" is the majority of the sensitive attribute. If the threshold is set as 0.5, all prediction results with uncertainty scores higher than or equal to 0.5 are selected; among these, if the corresponding sensitive attribute is "1", the prediction result is flipped (i.e. "1" to "0", and "0" to "1"). Finally, the modified prediction results are taken as the final prediction results.
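The flipping rule can be sketched as follows, assuming binary labels and a binary sensitive attribute. This is an illustration of the described procedure with binary entropy as the uncertainty score; the function names are ours.

```python
import numpy as np

def binary_entropy(p):
    """Entropy of a binary prediction probability (the uncertainty score)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def can_postprocess(probs, sa, threshold):
    """Flip the hard predictions whose uncertainty score is at least
    `threshold` and whose sensitive attribute equals the majority value."""
    preds = (probs >= 0.5).astype(int)
    majority = int(np.bincount(sa).argmax())          # majority value of the sensitive attribute
    flip = (binary_entropy(probs) >= threshold) & (sa == majority)
    preds[flip] = 1 - preds[flip]                     # "1" -> "0", "0" -> "1"
    return preds
```

Confident predictions (probabilities near 0 or 1) have low entropy and are left untouched; only uncertain predictions in the majority group are flipped.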

Figure 4. Schematic diagram of CAN-based post-processing method.


4. Experimental settings

4.1. Datasets description

In this study, three standard datasets, namely Adult (Kohavi, Citation1996), Bank (Moro et al., Citation2014), and German (Asuncion & Newman, Citation2007), from the UC Irvine (UCI) machine learning repository are used to estimate the classification performance and fairness of the proposed BAHEM. Table 1 lists the details of the datasets, including the sample size, the numbers of positive and negative samples, the number of features, and the sensitive attributes. In addition, the code of this study is available on GitHub.

Table 1. Detailed information of the datasets.

4.2. Evaluation metrics

Four evaluation metrics, namely ACC, AOD, balanced accuracy (BA; Brodersen et al., Citation2010), and equal opportunity difference (EOD; Hardt et al., Citation2016), were adopted to evaluate the classification performance and fairness of the proposed BAHEM.

ACC is the basic evaluation metric used to indicate the overall performance of a model. The formula for ACC is given in Equation (1), where TP, TN, FP, and FN represent the numbers of true positives, true negatives, false positives, and false negatives, respectively. A higher ACC represents higher prediction accuracy.
\[ \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]

BA is an evaluation metric commonly adopted to evaluate the performance of a model trained on an imbalanced dataset. BA is calculated from the true positive rate (TPR) and true negative rate (TNR), as given in Equations (2)-(4). A higher BA represents higher prediction accuracy.
\[ \mathrm{BA} = \frac{TPR + TNR}{2} \tag{2} \]
\[ TPR = \frac{TP}{TP + FN} \tag{3} \]
\[ TNR = \frac{TN}{FP + TN} \tag{4} \]

AOD and EOD are used to evaluate the classification fairness of the model. AOD is calculated using Equation (5), where SA = 1 denotes the unprivileged group and SA = 0 the privileged group; TPR_{SA=1} and FPR_{SA=1} are the TPR and FPR of the unprivileged group, and TPR_{SA=0} and FPR_{SA=0} those of the privileged group. The formula for FPR is given in Equation (6), and EOD is defined in Equation (7). For a comprehensive comparison, the absolute value of EOD, i.e. |EOD|, is used in the following experimental comparison. A lower AOD or |EOD| represents higher classification fairness.
\[ \mathrm{AOD} = \frac{1}{2}\left[\left|TPR_{SA=1} - TPR_{SA=0}\right| + \left|FPR_{SA=1} - FPR_{SA=0}\right|\right] \tag{5} \]
\[ FPR = \frac{FP}{FP + TN} \tag{6} \]
\[ \mathrm{EOD} = TPR_{SA=1} - TPR_{SA=0} \tag{7} \]
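For concreteness, the four metrics can be computed directly from hard predictions as follows (a straightforward transcription of Equations (1)-(7); the variable names are ours):

```python
import numpy as np

def rates(y_true, y_pred):
    """Confusion-matrix counts for binary labels."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp, tn, fp, fn

def fairness_metrics(y_true, y_pred, sa):
    """ACC, BA, AOD, and EOD, with sa == 1 the unprivileged group."""
    tp, tn, fp, fn = rates(y_true, y_pred)
    acc = (tp + tn) / (tp + tn + fp + fn)
    tpr, tnr = tp / (tp + fn), tn / (fp + tn)
    ba = (tpr + tnr) / 2

    def group(g):
        tp, tn, fp, fn = rates(y_true[sa == g], y_pred[sa == g])
        return tp / (tp + fn), fp / (fp + tn)      # group TPR, FPR

    tpr1, fpr1 = group(1)                          # unprivileged
    tpr0, fpr0 = group(0)                          # privileged
    aod = 0.5 * (abs(tpr1 - tpr0) + abs(fpr1 - fpr0))
    eod = tpr1 - tpr0
    return acc, ba, aod, eod
```

Note that a lower AOD and a smaller |EOD| indicate a fairer model, while higher ACC and BA indicate a more accurate one.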

4.3. Parameter settings

The original dataset was randomly separated as follows: 20% was used as the testing data and 80% as the training data. Of the training data, 80% was used as the training set and the rest as the validation set. In the clustering-based over-sampling method, k-means was used as the clustering method with the number of clustering centres set to 2, and ADASYN was used as the over-sampling method; they were executed with the Python modules "sklearn" and "imblearn", respectively. In the stacking-based ensemble learning method, XGBoost, GBDT, AdaBoost, random forest (RF), support vector machine (SVM), LR, and LightGBM were used as the base classifiers. XGBoost and LightGBM were executed with the Python modules "xgboost" and "lightgbm", respectively; GBDT, AdaBoost, RF, SVM, and LR were executed with the Python module "sklearn".
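The nested 80/20 splits described above can be sketched with index arrays (an illustrative helper of our own; in practice `sklearn.model_selection.train_test_split` would simply be applied twice):

```python
import numpy as np

def split_indices(n, seed=0):
    """80/20 split into training data and testing data, then an 80/20 split
    of the training data into a training set and a validation set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = round(0.2 * n)
    test, rest = idx[:n_test], idx[n_test:]
    n_val = round(0.2 * len(rest))
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test
```

For a dataset of 100 samples this yields 64 training, 16 validation, and 20 testing indices.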

In the CAN-based post-processing method, inspired by Agrawal et al. (Citation2019), Kirar and Agrawal (Citation2019), and Kirar et al. (Citation2022), the ACC and AOD for each dataset under different thresholds of uncertainty scores are shown in Figure 5. Considering that the threshold is generally set to 0.5 by default in classification problems, this experiment compares threshold values around the default, namely 0.4, 0.5, and 0.6. Although the ACC rises with the threshold for all datasets, the AOD for "Adult-race," "Bank-age," "German-age," and "German-sex" performed best when the thresholds were set to 0.6, 0.5, 0.5, and 0.6, respectively; these values were therefore adopted through trial-run experiments. For a fair comparison, all parameters of the clustering methods, over-sampling methods, and base classifiers were kept at their defaults.

Figure 5. Performance comparison between different thresholds on each dataset.


5. Experimental analysis

In this study, three datasets with different sensitive attributes and four evaluation metrics were adopted to evaluate the classification performance and fairness of the proposed BAHEM. Each experiment was run 10 times, and each evaluation metric was calculated in every run to ensure the reliability of the results; the averages of the evaluation metrics were taken as the performance of the BAHEM. All experiments were conducted on the Microsoft Windows 10 operating system using Python 3.7.

5.1. Performance evaluation of baseline classifiers

To evaluate the classification performance and fairness of the BAHEM, the baseline results on the three datasets with different sensitive attributes were evaluated using the four evaluation metrics. The performance of the baseline classifiers is presented in Table 2. Because two different sensitive attributes are selected, the German dataset is used as two datasets. To clarify the sensitive attributes of the datasets in the following experimental analysis, the datasets with different sensitive attributes are renamed "Adult-race," "Bank-age," "German-age," and "German-sex," respectively.

Table 2. Performance evaluation of baseline classifiers.

5.2. Performance evaluation of clustering-based over-sampling method

To verify the effectiveness of the proposed clustering-based over-sampling method, the performances of the base classifiers adopting it are presented in Table 3. Results in bold indicate that the corresponding base classifier performs better after the clustering-based over-sampling method is employed. Although the ACC of most base classifiers decreased, the AOD, BA, and EOD improved, which demonstrates that the proposed clustering-based over-sampling method can effectively balance the label and the sensitive attribute in the dataset and thereby improve balanced accuracy and fairness.

Table 3. Performance evaluation of clustering-based over-sampling method.

5.3. Performance evaluation of stacking-based ensemble learning method and CAN-based post-processing method

In the stacking-based ensemble learning method, the competent base classifiers (e.g. GBDT, AdaBoost, and LightGBM for the Adult dataset, which uses race as the sensitive attribute) are selected from the base classifier pool for permutation and combination because of their outstanding ACC and AOD, as shown in Table 4.

Table 4. Competent base classifiers selected for different datasets.

To demonstrate the outperformance of the proposed BAHEM, the performance of the stacked ensemble model without the CAN-based post-processing method is compared with that of the BAHEM, which adopts it. Table 5 presents this comparison. The results of the BAHEM are highlighted in bold if an evaluation metric of the BAHEM is better than that of the stacked ensemble model. As shown in Table 5, although the ACC and BA of the BAHEM on some datasets (e.g. Bank-age and German-age) are slightly worse than those of the stacked ensemble model, the BAHEM shows a significant improvement in the fairness metrics (i.e. AOD and EOD). The experimental results indicate that the stacking-based ensemble learning method and the CAN-based post-processing method can effectively improve the classification fairness of the model without sacrificing too much prediction accuracy.

Table 5. Performance comparison between the stacked ensemble model and BAHEM.
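CAN (Jia et al., 2021) adjusts an uncertain prediction by stacking it with confident anchor predictions and alternately normalising columns toward the class prior and rows into probability distributions. The sketch below is a simplified illustration under assumptions (the uncertainty-selection criterion, the scaling exponent, and the iteration count are not taken from the paper):

```python
import numpy as np

def can_adjust(probs, uncertain_idx, prior, alpha=1.0, iters=3):
    """Simplified alternating-normalisation adjustment: re-balance one
    uncertain predicted distribution against the remaining (anchor)
    predictions so that column masses approach the class prior."""
    q = probs[uncertain_idx:uncertain_idx + 1]          # the uncertain row
    anchors = np.delete(probs, uncertain_idx, axis=0)   # confident rows
    M = np.vstack([anchors, q])
    for _ in range(iters):
        # column step: scale each class column toward its prior mass
        col = M.sum(axis=0)
        M = M * (prior / col) ** alpha
        # row step: renormalise each row into a probability distribution
        M = M / M.sum(axis=1, keepdims=True)
    return M[-1]  # adjusted distribution for the uncertain example
```

The intuition: if the anchors already over-predict one class relative to the prior, the column step shrinks that class's score, so the uncertain example is nudged toward the under-represented class.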

5.4. Performance comparison between base classifiers and the proposed BAHEM

To demonstrate that the proposed BAHEM can improve classification fairness with little reduction in prediction accuracy, the performance of the BAHEM is compared with that of the seven base classifiers on the four evaluation metrics. Histograms of the performance comparison are presented in Figure 6. As shown in Figure 6, although the ACC and BA of the BAHEM are slightly lower than those of the base classifiers on each dataset, the fairness metrics (i.e. AOD and EOD) of the BAHEM are improved, indicating that the proposed BAHEM can effectively improve fairness without sacrificing too much prediction accuracy.

Figure 6. Histograms of performance comparison between base classifiers and BAHEM.
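For reference, AOD and EOD compare true- and false-positive rates between the unprivileged and privileged groups; values near zero indicate fairer classification. A minimal sketch consistent with the definitions used in fairness toolkits such as AIF360 (Bellamy et al., 2019); the function names are illustrative:

```python
import numpy as np

def rates(y_true, y_pred):
    """True-positive and false-positive rate of binary predictions."""
    tpr = np.mean(y_pred[y_true == 1] == 1)
    fpr = np.mean(y_pred[y_true == 0] == 1)
    return tpr, fpr

def fairness_metrics(y_true, y_pred, s, priv=1):
    """AOD = mean of TPR and FPR gaps between groups; EOD = TPR gap."""
    tpr_p, fpr_p = rates(y_true[s == priv], y_pred[s == priv])
    tpr_u, fpr_u = rates(y_true[s != priv], y_pred[s != priv])
    aod = 0.5 * ((fpr_u - fpr_p) + (tpr_u - tpr_p))
    eod = tpr_u - tpr_p
    return aod, eod
```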

5.5. Performance comparison between benchmark models and the proposed BAHEM

To further prove the higher classification performance and fairness of the proposed BAHEM, it is compared with the benchmark models proposed by Kearns et al. (2019), Celis et al. (2019), Pleiss et al. (2017), Hardt et al. (2016), Rezaei et al. (2020), and Wang et al. (2021). The source code of each benchmark model is publicly available, and for a fair comparison all experiments were conducted under the same experimental settings. The comparison results are shown in Table 6. Evaluation metrics that were not adopted in a benchmark model are marked as “/” in the table. As the table shows, the ACC of the BAHEM is slightly lower than that of most benchmark models, but the BAHEM outperforms the other models in AOD and EOD on most datasets, demonstrating that the proposed BAHEM can improve fairness without sacrificing too much prediction accuracy.

Table 6. Performance comparison between benchmark models and the proposed BAHEM.

6. Conclusion and future work

In this study, a novel BAHEM based on over-sampling and post-processing is proposed to alleviate data bias and improve classification fairness without sacrificing too much prediction accuracy. The proposed BAHEM makes two main contributions. First, a new clustering-based over-sampling method is proposed, which generates subsets with clustering methods and automatically balances the label and sensitive attribute, improving the model's adaptability to imbalanced datasets and its classification fairness. Second, a new CAN-based post-processing method is proposed that selects prediction results with higher uncertainty and modifies them to further enhance the fairness of the BAHEM while maintaining the prediction accuracy. Three datasets (i.e. Adult, Bank, and German) with different sensitive attributes and four evaluation metrics (i.e. ACC, AOD, BA, and EOD) were used to evaluate the classification performance and fairness of the BAHEM. The experimental results show that the classification performance and fairness of the BAHEM outperform those of the other benchmark models.

However, the proposed BAHEM has some shortcomings in both its method and its practical application. In future work, the BAHEM can be further improved by building the ensemble from base classifiers that are processed using in-processing methods. For the sampling method, the threshold can be adjusted adaptively according to different sample distributions. For the post-processing method, more comprehensive indicators, in addition to uncertainty, can be considered when evaluating and modifying the prediction results to obtain higher classification performance and fairness. For practical application, it is necessary to further enhance the interpretability of the model and the robustness of its results.

Compliance with ethical standards

Conflicts of interest: The authors declare that there is no conflict of interest regarding the publication of this article.

Ethical standard: The authors state that this research complies with ethical standards. This research does not involve either human participants or animals.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The datasets analysed during the current study are available in the UCI repository: the German dataset at https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german, the Adult dataset at https://archive.ics.uci.edu/ml/machine-learning-databases/adult, and the Bank dataset at https://archive.ics.uci.edu/ml/machine-learning-databases/00222.

Additional information

Funding

This work has been supported by Fundamental Research Funds for the Provincial Universities of Zhejiang Institute of Economics and Trade (No. 19YQ19), National Natural Science Foundation of China (No. 51875503), Zhejiang Natural Science Foundation of China (No. LZ20E050001), and Zhejiang Key R & D Project of China (No. 2022C03166).

References

  • Agrawal, D. K., Kirar, B. S., & Pachori, R. B. (2019). Automated glaucoma detection using quasi-bivariate variational mode decomposition from fundus images. IET Image Processing, 13(13), 2401–2408. https://doi.org/10.1049/iet-ipr.2019.0036
  • Asuncion, A., & Newman, D. (2007). UCI machine learning repository. School of Information and Computer Science, University of California. http://www.ics.uci.edu/~mlearn/MLRepository.html
  • Bellamy, R. K. E., Mojsilovic, A., Nagar, S., Ramamurthy, K. N., Richards, J., Saha, D., Sattigeri, P., Singh, M., Varshney, K. R., Zhang, Y., Dey, K., Hind, M., Hoffman, S. C., Houde, S., Kannan, K., Lohia, P., Martino, J., & Mehta, S. (2019). AI fairness 360: An extensible toolkit for detecting and mitigating algorithmic bias. IBM Journal of Research and Development, 63(4/5), 4:1-4:15. https://doi.org/10.1147/JRD.2019.2942287
  • Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140. https://doi.org/10.1007/BF00058655
  • Brodersen, K. H., Ong, C. S., Stephan, K. E., & Buhmann, J. M. (2010). The balanced accuracy and its posterior distribution. In Proceedings of the 20th International Conference on Pattern Recognition, August 23–26, Istanbul, Turkey, pp. 3121–3124.
  • Calmon, F., Wei, D., Vinzamuri, B., Ramamurthy, K. N., & Varshney, K. R. (2017). Optimized pre-processing for discrimination prevention. Advances in Neural Information Processing Systems, 30, 3992–4001. https://proceedings.neurips.cc/paper/2017/hash/9a49a25d845a483fae4be7e341368e36-Abstract.html
  • Celis, L. E., Huang, L., Keswani, V., & Vishnoi, N. K. (2019). Classification with fairness constraints: A meta-algorithm with provable guarantees. In Proceedings of 2019 ACM Conference on Fairness, Accountability, and Transparency, January 29–31, Atlanta, GA, USA, pp. 319–328.
  • Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357. https://doi.org/10.1613/jair.953
  • Chen, T. Q., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 13–17, San Francisco, USA, pp. 785–794.
  • d’Alessandro, B., O’Neil, C., & LaGatta, T. (2017). Conscientious classification: A data scientist’s guide to discrimination-aware classification. Big Data, 5(2), 120–134. https://doi.org/10.1089/big.2016.0048
  • Dastile, X., Celik, T., & Potsane, M. (2020). Statistical and machine learning models in credit scoring: A systematic literature survey. Applied Soft Computing, 91, 106263. https://doi.org/10.1016/j.asoc.2020.106263
  • Devi, D., Biswas, S. K., & Purkayastha, B. (2019). Learning in presence of class imbalance and class overlapping by using one-class SVM and undersampling technique. Connection Science, 31(2), 105–142. https://doi.org/10.1080/09540091.2018.1560394
  • Fish, B., Kun, J., & Lelkes, Á. D. (2016). A confidence-based approach for balancing fairness and accuracy. In Proceedings of 2016 SIAM International Conference on Data Mining, June 5–7, Miami, Florida, USA, pp. 144–152.
  • Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Proceedings of the 13th International Conference on Machine Learning, July 3–6, Bari, Italy, pp. 148–156.
  • Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), 1189–1232. https://doi.org/10.1214/aos/1013203451
  • Guzmán-Ponce, A., Sánchez, J. S., Valdovinos, R. M., & Marcial-Romero, J. R. (2021). DBIG-US: A two-stage under-sampling algorithm to face the class imbalance problem. Expert Systems with Applications, 168, 114301. https://doi.org/10.1016/j.eswa.2020.114301
  • Hardt, M., Price, E., & Srebro, N. (2016). Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems, 29, 3315–3323. https://proceedings.neurips.cc/paper/2016/hash/9d2682367c3935defcb1f9e247a97c0d-Abstract.html
  • He, H. B., Bai, Y., Garcia, E. A., & Li, S. T. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), June 1–8, Hong Kong, China, pp. 1322–1328.
  • Iosifidis, V., Fetahu, B., & Ntoutsi, E. (2019). FAE: A fairness-aware ensemble framework. In Proceedings of 2019 IEEE International Conference on Big Data, December 9–12, Los Angeles, CA, USA, pp. 1375–1380.
  • Iosifidis, V., & Ntoutsi, E. (2019). Adafair: Cumulative fairness adaptive boosting. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, November 3–7, Beijing, China, pp. 781–790.
  • Jia, M. L., Reiter, A., Lim, S. N., Artzi, Y., & Cardie, C. (2021). When in doubt: improving classification performance with alternating normalization. arXiv preprint arXiv:2109.13449.
  • Jiang, M. X., Yang, Y. L., & Qiu, H. Q. (2022). Fuzzy entropy and fuzzy support-based boosting random forests for imbalanced data. Applied Intelligence, 52(4), 4126–4143. https://doi.org/10.1007/s10489-021-02620-y
  • Kamiran, F., & Calders, T. (2009). Classifying without discriminating. In Proceedings of 2009 2nd International Conference on Computer, Control and Communication, February 17–18, Karachi, Pakistan, pp. 1–6.
  • Kamiran, F., & Calders, T. (2012). Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1), 1–33. https://doi.org/10.1007/s10115-011-0463-8
  • Ke, G. L., Meng, Q., Finley, T., Wang, T. F., Chen, W., Ma, W. D., Ye, Q. W., & Liu, T. Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of Annual 2017 Conference on Neural Information Processing Systems, December 4–9, California, USA, pp. 3146–3154.
  • Kearns, M., Neel, S., Roth, A., & Wu, Z. S. (2019). An empirical study of rich subgroup fairness for machine learning. In Proceedings of 2019 ACM Conference on Fairness, Accountability, and Transparency, January 29–31, Atlanta, GA, USA, pp. 100–109.
  • Kim, S., Joshi, P., Kalsi, P. S., & Taheri, P. (2018). Crime analysis through machine learning. In Proceedings of 2018 IEEE 9th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), November 1–3, University of British Columbia, Vancouver, Canada, pp. 415–420.
  • Kirar, B. S., & Agrawal, D. K. (2019). Computer aided diagnosis of glaucoma using discrete and empirical wavelet transform from fundus images. IET Image Processing, 13(1), 73–82. https://doi.org/10.1049/iet-ipr.2018.5297
  • Kirar, B. S., Agrawal, D. K., & Kirar, S. (2022). Glaucoma detection using image channels and discrete wavelet transform. IETE Journal of Research, 68(6), 4421–4428. https://doi.org/10.1080/03772063.2020.1795934
  • Kohavi, R. (1996). Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In Proceedings of 1996 International Conference on Knowledge Discovery and Data Mining, August 2–4, Oregon, USA, pp. 202–207.
  • Li, X., & Li, K. W. (2022). Imbalanced data classification based on improved EIWAPSO-AdaBoost-C ensemble algorithm. Applied Intelligence, 52(6), 6477–6502. https://doi.org/10.1007/s10489-021-02708-5
  • Liu, W. H., Hu, E. W., Su, B. G., & Wang, J. (2021). Using machine learning techniques for DSP software performance prediction at source code level. Connection Science, 33(1), 26–41. https://doi.org/10.1080/09540091.2020.1762542
  • Liu, X. Y., Wu, J., & Zhou, Z. H. (2008). Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 39(2), 539–550. https://doi.org/10.1109/TSMCB.2008.2007853
  • Lohia, P. K., Ramamurthy, K. N., Bhide, M., Saha, D., Varshney, K. R., & Puri, R. (2019). Bias mitigation post-processing for individual and group fairness. In Proceedings of 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, May 12–17, Brighton, UK, pp. 2847–2851.
  • Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., & Galstyan, A. (2021). A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR), 54(6), 1–35. https://doi.org/10.1145/3457607
  • Moro, S., Cortez, P., & Rita, P. (2014). A data-driven approach to predict the success of bank telemarketing. Decision Support Systems, 62, 22–31. https://doi.org/10.1016/j.dss.2014.03.001
  • Onan, A. (2018). An ensemble scheme based on language function analysis and feature engineering for text genre classification. Journal of Information Science, 44(1), 28–47. https://doi.org/10.1177/0165551516677911
  • Onan, A. (2019). Consensus clustering-based undersampling approach to imbalanced learning. Scientific Programming, https://doi.org/10.1155/2019/5901087
  • Onan, A., Korukoğlu, S., & Bulut, H. (2016). Ensemble of keyword extraction methods and classifiers in text classification. Expert Systems with Applications, 57, 232–247. https://doi.org/10.1016/j.eswa.2016.03.045
  • Onan, A., Korukoğlu, S., & Bulut, H. (2017). A hybrid ensemble pruning approach based on consensus clustering and multi-objective evolutionary algorithm for sentiment classification. Information Processing & Management, 53(4), 814–833. https://doi.org/10.1016/j.ipm.2017.02.008
  • Onan, A., & Toçoğlu, M. A. (2021). A term weighted neural language model and stacked bidirectional LSTM based framework for sarcasm identification. IEEE Access, 9, 7701–7722. https://doi.org/10.1109/ACCESS.2021.3049734
  • Petrović, A., Nikolić, M., Radovanović, S., Delibašić, B., & Jovanović, M. (2022). FAIR: Fair adversarial instance re-weighting. Neurocomputing, 476, 14–37. https://doi.org/10.1016/j.neucom.2021.12.082
  • Pleiss, G., Raghavan, M., Wu, F., Kleinberg, J., & Weinberger, K. Q. (2017). On fairness and calibration. In Proceedings of 31st Conference on Neural Information Processing Systems, December 4–9, Long Beach, CA, USA, pp. 5684–5693.
  • Potha, N., Kouliaridis, V., & Kambourakis, G. (2021). An extrinsic random-based ensemble approach for android malware detection. Connection Science, 33(4), 1077–1093. https://doi.org/10.1080/09540091.2020.1853056
  • Puntumapon, K., Rakthamamon, T., & Waiyamai, K. (2016). Cluster-based minority over-sampling for imbalanced datasets. IEICE Transactions on Information and Systems, 99(12), 3101–3109. https://doi.org/10.1587/transinf.2016EDP7130
  • Qi, W., Ovur, S. E., Li, Z. J., Marzullo, A., & Song, R. (2021). Multi-sensor guided hand gesture recognition for a teleoperated robot using a recurrent neural network. IEEE Robotics and Automation Letters, 6(3), 6039–6045. https://doi.org/10.1109/LRA.2021.3089999
  • Rezaei, A., Fathony, R., Memarrast, O., & Ziebart, B. (2020). Fairness for robust log loss classification. In Proceedings of the 2020 AAAI Conference on Artificial Intelligence, February 7–12, New York, USA, pp. 5511–5518.
  • Stehman, S. V. (1997). Selecting and interpreting measures of thematic classification accuracy. Remote Sensing of Environment, 62(1), 77–89. https://doi.org/10.1016/S0034-4257(97)00083-7
  • Tao, X. M., Li, Q., Ren, C., Guo, W. J., Li, C. X., He, Q., Liu, R., & Zou, J. R. (2019). Real-value negative selection over-sampling for imbalanced data set learning. Expert Systems with Applications, 129, 118–134. https://doi.org/10.1016/j.eswa.2019.04.011
  • Thabtah, F., Hammoud, S., Kamalov, F., & Gonsalves, A. (2020). Data imbalance in classification: Experimental evaluation. Information Sciences, 513, 429–441. https://doi.org/10.1016/j.ins.2019.11.004
  • Wang, J. L., Liu, Y., & Levy, C. (2021). Fair classification with group-dependent label noise. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, March 3–10, New York, USA, pp. 526–536.
  • Wang, Y. L., Zhang, Y. H., Lu, Y., & Yu, X. R. (2020). A comparative assessment of credit risk model based on machine learning — a case study of bank loan data. Procedia Computer Science, 174, 141–149. https://doi.org/10.1016/j.procs.2020.06.069
  • Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5(2), 241–259. https://doi.org/10.1016/S0893-6080(05)80023-1
  • Xia, Y. F., Liu, C. Z., Da, B., & Xie, F. M. (2018). A novel heterogeneous ensemble credit scoring model based on bstacking approach. Expert Systems with Applications, 93, 182–199. https://doi.org/10.1016/j.eswa.2017.10.022
  • Xia, Y. L., Chen, K., & Yang, Y. (2021). Multi-label classification with weighted classifier selection and stacked ensemble. Information Sciences, 557, 421–442. https://doi.org/10.1016/j.ins.2020.06.017
  • Xu, Y., Yang, C. J., Peng, S. L., & Nojima, Y. (2020). A hybrid two-stage financial stock forecasting algorithm based on clustering and ensemble learning. Applied Intelligence, 50(11), 3852–3867. https://doi.org/10.1007/s10489-020-01766-5
  • Zafar, M. B., Valera, I., Gomez Rodriguez, M., & Gummadi, K. P. (2017). Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web Companion, April 3–7, Perth, Australia, pp. 1171–1180.
  • Zhang, W. Y., Yang, D. Q., & Zhang, S. (2021). A new hybrid ensemble model with voting-based outlier detection and balanced sampling for credit scoring. Expert Systems with Applications, 174, 114744. https://doi.org/10.1016/j.eswa.2021.114744