
On the Influence of Class Noise in Medical Data Classification: Treatment Using Noise Filtering Methods


ABSTRACT

Classification systems play an important role in medical decision support because they allow the data analysis process to be automated and accelerated. However, their quality depends on that of the training dataset upon which the classification models are built. The labeling of each training example is usually performed by domain experts or automatic systems. When class labels are wrongly assigned to examples, the training process and, therefore, the classification performance might be negatively affected. This problem is formally known as class label noise. One of the most used techniques to reduce the harmful consequences of mislabeled objects is noise filtering, which removes noisy examples from the training data. This article analyzes the usefulness of such methods in the context of medical data classification. The experiments carried out on several real-world datasets show the importance of noise filtering when class noise affects the data.

Introduction

Medical data analysis and decision making are acknowledged as important yet difficult tasks and typically are based on years of experience of an expert. In this framework, machine learning techniques have received increased attention from the medical community in the last decade (Kononenko Citation2001; Pombo, Araújo, and Viana Citation2014). They allow processing of massive and complex data, providing a fast and effective automatic diagnosis (Azar and Hassanien Citation2015; Krawczyk and Schaefer Citation2014). Even though the final decision always lies in the hands of physicians, such methods facilitate their work by providing a second opinion, which is especially valuable for physicians who are less experienced or who are experiencing decreased awareness due to fatigue or stress.

According to the specialized literature (Kononenko Citation2001), three main factors must be considered when designing a new medical decision support system:

  • Learning dataset. The first step consists of collecting a set of representative examples that will be the basis for the learning process. These objects must be described by using a set of attributes of high discriminative power and a class label for each example, which is usually provided by a domain expert or an automatic procedure.

  • Classification method. The second step uses a classification technique to extract the properties of the learning dataset by constructing a model. This model (called a classifier) must be able to generalize the knowledge embedded in the data to new examples.

  • Ease of use. The medical decision support system will be used by physicians who are not necessarily machine learning experts. Therefore, it should ideally be designed without requiring parameter configuration for each specific problem, and it should be sufficiently interpretable to shed additional light on a patient’s condition.

Even though all three components are important when designing a medical decision system, the first has particular relevance because it conditions the quality of the subsequent stages. For this reason, this article focuses on the first step, concerning the properties of the training dataset.

As previously commented, a domain expert or an automatic procedure is usually responsible for labeling each training example in the medical dataset. However, this process is commonly subject to errors (Malossini, Blanzieri, and Ng Citation2006; Wu and Zhu Citation2008). Errors affecting class labels are formally known in the literature as class label noise or simply class noise (Frénay and Verleysen Citation2014). Class noise could negatively affect the learning process and, therefore, the building time and classification accuracy of the created models (Sáez, Luengo, and Herrera Citation2013). For these reasons, two approaches have been proposed to deal with these inconveniences (Zhu and Wu Citation2004):

  1. Algorithm-level approaches (Cohen Citation1995; Quinlan Citation1993). These methods adapt existing algorithms to properly handle the noise or be less influenced by its presence.

  2. Data-level approaches (Brodley and Friedl Citation1999; Gamberger et al. Citation1999). These methods preprocess the data, trying to reduce the impact of noise before building classifiers.

Algorithm-level approaches depend on the concrete adaptation of each classification technique and, therefore, are not directly extensible to other learning algorithms. In contrast, data-level approaches are independent of the classifier used, which usually makes this type of technique the most popular choice.

Among data-level approaches, noise filters, which remove noisy examples from the training data, are commonly employed due to their benefits with respect to classification accuracy (Brodley and Friedl Citation1999; Gamberger, Lavrac, and Dzeroski Citation1996). However, these techniques do not always provide an improvement in performance because, in many cases, their success depends on the characteristics of the data (Sáez, Luengo, and Herrera Citation2013).

This article examines the usefulness of several such noise filtering methods when working with the particular properties of several real-world medical datasets, which are affected by varied quantities of class noise. The elimination of potentially noisy examples helps to build more robust classifiers over the data, which generally leads to better classification performance because the modeling of the knowledge represented by these difficult examples is avoided (Brodley and Friedl Citation1999). The set of noisy examples detected could optionally be presented to an expert in order to determine their nature, that is, whether these examples really contain errors or are exceptions to general classification rules that still contain valuable information. The filtered datasets will be used to create classifiers with three learning methods, each of which has a different behavior against noise: (1) the robust C4.5 algorithm (Quinlan Citation1993), (2) a Support Vector Machine (SVM; Vapnik Citation1998), considered accurate but noise sensitive in some cases, and (3) the nearest neighbor (NN) rule (Mclachlan Citation2004), which is considered very noise sensitive. We will analyze the performance of these algorithms with and without noise filtering based on a series of experiments that allow us to draw conclusions with regard to the danger of class noise in clinical decision support systems.

The rest of this article is organized as follows. “Medical Data Classification in Presence of Class Label Noise” introduces medical classification with noisy data. “Noise Filtering Methods” details the use of noise filters, paying attention to those considered in this work. In “Experimental Framework,” we describe the experimental framework, and in “Analysis of Results,” we analyze the results obtained. In the final section, we enumerate some concluding remarks.

Medical data classification in presence of class label noise

Noise may be present as errors in the source and input of data affecting the quality of the dataset (Wang, Storey, and Firth Citation1995). In classification, two types of noise are traditionally distinguished in the literature (Zhu and Wu Citation2004): attribute noise and class noise.

Attribute noise affects the attribute values of the examples in a dataset, whereas class noise occurs when examples are labeled with the wrong classes. Class noise is the most disruptive type of noise for classifier performance because mislabeled examples have a high impact when building classifiers (Sáez et al. Citation2013, Citation2014; Zhu and Wu Citation2004).

In medical data classification, class noise can proceed from several sources (Hickey Citation1996):

  • Human errors. Mistakes during the labeling process, more likely when working with complex data, might occur due to weariness, routine, quick examination of each case, or time pressure. Subjectivity can also produce class noise, for example, when there is variability in the labeling by several experts (Malossini, Blanzieri, and Ng Citation2006).

  • Machine errors. When a machine is responsible for providing automatic labels, the occurrence of design faults or momentary errors could lead to the presence of erroneous labels.

  • Digitalization and archiving errors. When creating a digital record of the examined cases, one might incorrectly input a class by simple mistake. The same situation occurs when using historical recordings.

Class noise can affect system performance not only by decreasing classification accuracy (Sáez et al. Citation2014), but also by increasing the complexity of the classifier built, in terms of its size and interpretability (Brodley and Friedl Citation1999). On this account, many works in the literature focus on its treatment (Brodley and Friedl Citation1999; Sánchez et al. Citation2003; Wilson Citation1972). Two main alternatives have been proposed to deal with noisy data:

  • Algorithm-level approaches. These are techniques characterized by being less influenced by noisy data. For example, C4.5 (Quinlan Citation1993) uses pruning strategies to reduce the chances that the trees are overfitting due to noise (Quinlan Citation1986).

  • Data-level approaches. The most well-known type of methods within this group is that of noise filters (Brodley and Friedl Citation1999; Khoshgoftaar and Rebours Citation2007). They identify noisy examples, which can be eliminated from the training data. Figure 1 shows an example of the benefits of learning classifiers in the presence of class noise when filtering techniques are applied.

Figure 1. Dataset with two examples with class noise. Figure 1a shows the exemplary decision boundary created when considering the noisy samples as proper ones. Figure 1b shows a decision boundary after the removal of the noisy examples during the filtering. The decision boundary obtained after the filtering is less complex and leads to a better generalization of these two classes.

Because mislabeled examples can negatively affect the final accuracy of medical decision support systems, there is a high risk in real-world applications because we are often dealing with human health and life. For this reason, the possibility of applying noise filters to the data in order to improve their quality should be always carefully studied. The next section briefly reviews previous works on noise filtering, paying special attention to those filters considered in the experimentation phase of this study.

Noise filtering methods

Noise filters, which remove noisy examples from the dataset, are one of the most used approaches when training data are affected by class noise (Brodley and Friedl Citation1999). The filtered data can then be used as an input to different classifiers (Khoshgoftaar and Rebours Citation2007)—hence, the computation time needed to prepare the data is required only once. The elimination of mislabeled examples provides some advantages in classifier performance (Gamberger et al. Citation1999), in contrast to the removal of instances with attribute noise (Zhu and Wu Citation2004). This is because examples with errors in some attribute values still contain valuable information in other attributes, which can help to build the classifier.

These methods are usually employed with classification methods that are sensitive to noisy data and require data preprocessing to address the problem. The separation of the noise detection and learning phases has the advantage of avoiding the use of noisy instances in the classifier building process (Gamberger, Lavrac, and Dzeroski Citation2000).

Even though several noise filtering schemes are proposed in the literature (Brodley and Friedl Citation1999; Gamberger et al. Citation1999; Sánchez et al. Citation2003; Wilson Citation1972), they are usually grouped into different categories according to their characteristics. For a complete review about noise filters, the reader may consult the work of Frénay and Verleysen (Citation2014).

Some filters are based on analyzing the neighborhood among examples (Sánchez et al. Citation2003; Wilson Citation1972), using to this end the k-NN classifier (Mclachlan Citation2004). There are noise filters that use the predictions from an ensemble of classifiers in order to get an improvement in noise detection against considering a specific classifier (Brodley and Friedl Citation1999). Other types of filtering methods propose an iterative removal of noisy examples (Khoshgoftaar and Rebours Citation2007) to increase the noise filtering accuracy.

The noise filters considered in this work apply different filtering procedures and are well-known representatives of the field. The parameter setup for all the noise filters is shown in Table 1. All of the parameters are the default ones recommended by the authors of such filters, although we have fixed n = 3 partitions for filters based on ensembles to unify all of them. For noise filters that compute distances between examples, the Heterogeneous Value Difference Metric (HVDM) (Wilson and Martinez Citation1997), which is valid for both nominal and numerical attributes, has been used. All these noise filters are briefly described as follows:

  1. Ensemble Filter (EF) (Brodley and Friedl Citation1999). This classifies the training data using an n-fold cross-validation with C4.5 (Quinlan Citation1993), NN (Mclachlan Citation2004), and LDA (le Cessie and van Houwelingen Citation1992). Finally, EF removes an example if it is misclassified by more than half of the classifiers.

  2. Edited Nearest Neighbor (ENN) (Wilson Citation1972). This algorithm removes those examples whose class does not agree with that of the majority of their k = 3 nearest neighbors.

  3. Iterative-Partitioning Filter (IPF) (Khoshgoftaar and Rebours Citation2007). IPF removes noisy examples in multiple iterations. In each iteration, the data are split into n folds and a C4.5 classifier is built over each of these subsets to evaluate all the examples. Then, the examples misclassified by the majority of the classifiers are removed and a new iteration is started.

  4. Multiedit (ME) (Devijver Citation1986). This splits the data into n folds. NN classifies the examples from part x using part (x + 1) mod n as the training set, and the misclassified examples are removed. This process is repeated until no examples are eliminated.

  5. Nearest Centroid Neighbor Edition (NCNE) (Sánchez et al. Citation2003). NCNE removes the examples misclassified by the k (k = 3) nearest centroid neighbors rule.

  6. Relative Neighborhood Graph (RNG) (Sánchez, Pla, and Ferri Citation1997). This builds an undirected proximity graph in which each vertex corresponds to an example and an edge connects two examples that satisfy a neighborhood relation. RNG discards those examples misclassified by their graph neighbors.

Table 1. Parameter setup for the noise filters.
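To illustrate how a neighborhood-based filter such as ENN operates, the following is a minimal sketch in Python. It uses plain Euclidean distance on numerical attributes, whereas the experiments in this work use the HVDM distance to also handle nominal attributes, so this is a simplification rather than the exact implementation evaluated here.

```python
import numpy as np

def edited_nearest_neighbors(X, y, k=3):
    """Edited Nearest Neighbor (ENN) filter: remove every example whose
    class label disagrees with the majority label of its k nearest
    neighbors (Euclidean distance; the paper uses HVDM for mixed data)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n = len(X)
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        # Distances from example i to all others (exclude itself).
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf
        neighbors = np.argsort(d)[:k]
        labels, counts = np.unique(y[neighbors], return_counts=True)
        majority = labels[np.argmax(counts)]
        if majority != y[i]:
            keep[i] = False  # flagged as potentially mislabeled
    return X[keep], y[keep]
```

For instance, a point labeled 1 that sits inside a cluster of class-0 examples would be removed, because its three nearest neighbors all vote for class 0.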

Experimental framework

This section presents the details of the experimental framework designed to check how the noise filters described in “Noise Filtering Methods” can alleviate the negative impact of mislabeled examples on the final accuracy. The first subsection here describes the real-world medical datasets used. Then, the following subsection shows how noise is introduced into these datasets. Finally, the methodology followed to analyze the results is described, along with the parameter setup for the classification algorithms.

Datasets

The experimentation considers 12 real-world medical datasets describing different types of applications of clinical decision support systems (including the detection of different types of cancer, diseases in the urinary system, kidney, heart and Parkinson’s disease, among others). Most of them are taken from the UCI repository (Lichman Citation2013), with the exception of the breast cancer (Krawczyk and Filipczuk Citation2014) and hypertension (Krawczyk and Woźniak Citation2015) datasets. These datasets are shown in Table 2, where #C refers to the number of classes, #E to the number of examples, and #A to the number of attributes (along with the number of real, integer, and nominal attributes). Examples containing missing values are removed from the datasets before their usage. The consideration of an experimental framework with datasets focused on different medical applications and having different characteristics in terms of the number of classes, examples, and attributes will enable us to extract more general conclusions from the analysis of the results obtained in the experimentation.

Table 2. Medical datasets considered in the experimentation.

A brief description of each dataset is given in the following:

  1. Pathological breast tissues (breast tissue). This dataset distinguishes between different types of breast tissues (carcinoma, fibroadenoma, mastopathy, glandular, connective, and adipose) based on physical characteristics measured from the skin.

  2. Cytological slide examination (breast cancer). This describes the examination of breast cancer cytology slides taken with fine needle biopsy. Several features describing morphological properties of the nuclei of the cells are used to categorize three types of cancer.

  3. Seminal quality (fertility). This consists of medical information (childhood diseases, traumas, surgical interventions, frequency of alcohol consumption, etc.) collected to estimate the seminal quality (normal or altered) of different individuals.

  4. Hypertension detection (hypertension). This is a multiclass dataset that deals with the diagnosis of first-order hypertension and five types of secondary-order hypertensions. Features are extracted from a set of medical records and examinations.

  5. Inflammations of urinary bladder (inflammations). These data describe potential patients, who are characterized by medical variables (temperature, occurrence of nausea, lumbar pain, and so on). The goal is to predict a possible disease of the urinary system.

  6. Chronic kidney disease (kidney disease). This deals with the early detection of kidney failure and was collected over nearly 2 months in a hospital.

  7. Mice protein expression (mice protein). This dataset contains expression levels of 77 proteins measured in the cerebral cortex of eight classes of control and Down syndrome mice exposed to context fear conditioning, a task used to assess associative learning.

  8. Voice disorder detection (Parkinson’s). This dataset is composed of several biomedical voice metrics. Each example corresponds to a voice recording from people with or without Parkinson’s disease. The aim is to discriminate healthy people from unhealthy people.

  9. Postoperative choice (postoperative). This dataset tries to determine where patients in a postoperative recovery area should be sent to next. The attributes correspond to body temperature metrics because hypothermia is a significant concern after surgery.

  10. Diabetic retinopathy (retinopathy). This dataset contains attributes extracted from the Messidor image set to predict whether an image contains signs of diabetic retinopathy. The attributes represent a detected lesion, an anatomical part, or an image-level descriptor.

  11. Heart disease categorization (statlog). This dataset contains attributes describing the diagnosing of cardiac single proton emission computed tomography (SPECT) images. Each example corresponds to a summarized SPECT image (patient), which is classified into two categories: normal and abnormal.

  12. Thoracic surgery survivals (thoracic surgery). These data were collected retrospectively at the Wrocław Thoracic Surgery Centre for patients who underwent major lung resections for primary lung cancer in the years 2007–2011. The task is to predict if a patient will survive for 1 year from the date of the surgery.

Class label noise introduction process

The aforementioned 12 datasets might already contain class noise (Garcia, de Carvalho, and Lorena Citation2015). That is, some of the examples may already have been mislabeled by the expert, and they will mislead classifier learning. However, the concrete quantity of such mislabeled examples is a priori unknown, and therefore, we cannot properly extract conclusions based on it.

For this reason, it is necessary to control the amount of class noise in each dataset by introducing the noise in a supervised manner. The uniform class noise scheme (Teng Citation1999) is used to this end: x% of the examples are selected and their class labels are replaced with a random one from the set of classes. The noise levels x = 0%, 5%, 10%, 20%, and 30% are considered. Thus, 48 new noisy datasets with class noise are created from the 12 base datasets (making a total of 60 datasets).

In order to create a noisy dataset from the original one, the noise is introduced into the training partitions as follows:

  1. A level of noise x% of class noise is introduced into a copy of the full original dataset.

  2. Both datasets, the original one and the noisy copy, are partitioned into five equivalent folds, with the same examples in the corresponding folds of each.

  3. The training partitions are built from the noisy copy, whereas the test partitions are formed from examples from the base dataset—the noise-free dataset.
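The injection scheme and the partitioning procedure above can be sketched as follows. This is a simplified sketch with NumPy; note that drawing the new label uniformly from the full set of classes means a selected example can keep its original label by chance.

```python
import numpy as np

def add_uniform_class_noise(y, noise_level, seed=0):
    """Uniform class noise scheme: select a noise_level fraction of the
    examples at random and replace their class labels with a label drawn
    at random from the full set of classes."""
    rng = np.random.default_rng(seed)
    y_noisy = np.array(y, copy=True)
    classes = np.unique(y)
    n_noisy = int(round(noise_level * len(y)))
    picked = rng.choice(len(y), size=n_noisy, replace=False)
    y_noisy[picked] = rng.choice(classes, size=n_noisy)
    return y_noisy

def noisy_train_clean_test_folds(y_clean, y_noisy, n_folds=5, seed=0):
    """Yield per-fold index splits where the training labels come from the
    noisy copy and the test labels come from the noise-free original."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y_clean)), n_folds)
    for f in range(n_folds):
        test_idx = folds[f]
        train_idx = np.concatenate([folds[g] for g in range(n_folds) if g != f])
        yield train_idx, y_noisy[train_idx], test_idx, y_clean[test_idx]
```

Evaluating on the clean labels while training on the noisy ones, as in step 3, ensures that the measured accuracy reflects the true concept rather than the injected errors.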

Methodology of analysis

The 60 noisy datasets will be preprocessed with the six noise filters (“Noise Filtering Methods”), resulting in 360 new preprocessed datasets. The effect of these filters will be analyzed, comparing the performance obtained for each dataset with three different classifiers: C4.5 (Quinlan Citation1993), SVM (Vapnik Citation1998), and NN (Mclachlan Citation2004). The accuracy estimation of the classifiers in a dataset is obtained by means of a stratified 5-fold cross-validation, averaging the test accuracy results. The parameter setup for the three classification algorithms used is presented in Table 3.

Table 3. Parameter setup for the classification algorithms.
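A minimal version of this evaluation loop is sketched below using scikit-learn, which is an assumption on our part: the original study uses C4.5, whereas scikit-learn's `DecisionTreeClassifier` implements CART, a closely related decision tree used here only as a stand-in, and the bundled breast cancer dataset is used purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def cv_accuracy(clf, X, y, n_folds=5, seed=0):
    """Average test accuracy over a stratified n-fold cross-validation."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    accs = []
    for train_idx, test_idx in skf.split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        accs.append(np.mean(clf.predict(X[test_idx]) == y[test_idx]))
    return float(np.mean(accs))

X, y = load_breast_cancer(return_X_y=True)
for name, clf in [("decision tree (C4.5 stand-in)", DecisionTreeClassifier(random_state=0)),
                  ("SVM", SVC()),
                  ("1-NN", KNeighborsClassifier(n_neighbors=1))]:
    print(f"{name}: {cv_accuracy(clf, X, y):.3f}")
```

In the full experimental design, a noise filter would additionally be applied to each training partition before `fit`, leaving the test partition untouched.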

Statistical comparisons among the results of the 60 datasets considered will also be performed. The results of the Friedman Aligned test (García et al. Citation2010) and the Finner procedure (Finner Citation1993) will be computed to check the differences among the three classification algorithms when no noise filters are applied. The same tests will be used to compare all six noise filtering methods, considering each classification method independently, in order to find which are the best filters when working with noisy medical data.

Analysis of results

This section is devoted to the analysis of the results obtained. The first three subsections present the performance results obtained by C4.5, SVM, and NN, using each one of the noise filtering methods. We then show the statistical comparisons performed to find the best classifiers and noise filters for the medical datasets that we are working with. The final subsection presents a possible hypothesis for explaining the findings presented in previous sections, along with the lessons learned after carrying out the experimental study.

Results of C4.5

The experimental results obtained with C4.5, with and without filtering with respect to different noise levels, are given in Table 4. These are briefly summarized hereafter:

Table 4. Accuracies (%) obtained by C4.5 with respect to a given class noise level and noise filtering method.

1. C4.5 without filtering:

  • For datasets without additional noise, C4.5 obtains a satisfactory accuracy for most of the medical datasets (9 out of 12 datasets have a performance higher than 75%).

  • However, in the presence of noise, a quick deterioration of the classification accuracy is observed. For example, with only a 5% noise level, breast tissue suffers a 5.67% accuracy drop and Parkinson’s a 4.62% drop.

  • For higher noise levels, C4.5 becomes highly inaccurate (mice protein suffers a 21.39% accuracy drop and breast cancer a 15.41% drop).

2. C4.5 with filtering:

  • For some of the datasets without additional noise (for example, retinopathy and breast tissue using EF), using some noise filters still leads to small improvements in accuracy. This can be explained by these datasets already being affected by some small degree of class noise of their own, which further emphasizes the usefulness of noise filtering in medical pattern classification.

  • There are two cases without additional noise (mice protein and Parkinson’s) in which none of the filters improve the performance of C4.5 without preprocessing.

  • When considering the presence of noise in the data, the use of noise filters generally improves the results of not preprocessing (with the exception of the mice protein dataset, in which not preprocessing is the best alternative up to a 10% noise level). It is interesting to note the results on the fertility dataset, in which the performance is usually not altered by the introduction of noise. This is because this dataset has a low number of examples (100) and most of them (88%) belong to one of the classes. These aspects imply that the amount of noise introduced is low (because it depends on the total number of examples) and it probably affects the most common class. Thus, the occurrence of these few, and possibly isolated, noisy examples can leave the learning of this class unaltered, and therefore, the classifier performance is usually only slightly affected.

  • Comparing the performance of the different noise filters, we may observe that the best approaches are, in general, EF, NCNE, and IPF, because they obtain the best results in many of the datasets and noise levels. One should note that the differences among these filters are, in most cases, relatively small.

Results of SVM

The performance results of SVM, which are presented in Table 5, are outlined as follows:

Table 5. Accuracies (%) obtained by SVM with respect to a given class noise level and noise filtering method.

1. SVM without filtering:

  • For datasets without additional noise, SVM obtains good accuracy results for most of the datasets.

  • Considering a very low noise level (5%), the accuracy is almost maintained (for example, in hypertension the drop is 0.59%) or even slightly improved.

  • For the highest noise level (30%), SVM is also able to maintain its initial accuracy better than C4.5. Even though in some datasets we observe a high drop in performance (breast tissue, with an 8.45% drop) or a low drop (retinopathy, 3.47%), these are usually lower than in the case of C4.5.

2. SVM with filtering:

  • Applying noise filters to some of the datasets without induced noise leads to small improvements in accuracy (for example, breast tissue gains 1.9% by using EF).

  • For most of the datasets without noise, the version without filtering is slightly better than versions that consider the use of filters.

  • When considering noise in the data, noise filters improve the accuracy of not preprocessing at some noise levels, generally the higher ones. Despite this fact, the results of SVM in the presence of noise are still good on some datasets (mice protein, retinopathy, and statlog) at low and intermediate noise levels.

  • The filters obtaining the best results are EF, NCNE, and IPF. There are some datasets in which all the noise filters obtain a good and similar performance, higher than that of SVM without filtering (thoracic surgery and kidney disease). In other datasets (inflammations and fertility), filtering barely changes the performance across all the noise levels.

Results of NN

Table 6 shows the experimental results of the NN classifier with respect to different noise levels. They are summarized in the following:

Table 6. Accuracies (%) obtained by NN with respect to a given class noise level and noise filtering method.

1. NN without filtering:

  • NN obtains a satisfactory accuracy for most of the medical datasets without additional noise (8 out of 12 datasets have a performance higher than 75%).

  • The presence of noise makes the performance of NN deteriorate. Some datasets are already affected at only a 5% noise level (breast cancer suffers a 4.45% accuracy drop and hypertension a 4.56% drop).

  • For higher noise levels, NN becomes more affected by mislabeled examples. For example, at a 30% noise level, the breast cancer dataset suffers a 21.19% accuracy drop and the mice protein dataset a 26.63% drop.

2. NN with filtering:

  • Without additional noise, using noise filters leads to improvements in accuracy on almost all the datasets (with the exception of breast tissue, mice protein, and Parkinson’s).

  • Considering the presence of noise in the data, the usage of filtering strategies improves the results of not preprocessing (with the exception of the Parkinson’s and breast tissue datasets up to a 5% noise level). The improvements are remarkable in some cases, for example, in postoperative (31.17%) and mice protein (23.37%).

  • The best filtering approaches are EF, NCNE, and IPF, which obtain the best results in many of the datasets and noise levels.

Statistical comparison among noise filters

Tables 7 and 8 present the statistical comparisons performed among classification methods (C4.5, SVM, and NN) and among noise filtering methods (considering each classification method independently), respectively. The ranks obtained by the Friedman Aligned procedure (Rank column), which represent the effectiveness associated with each algorithm, and the p-value related to the significance of the differences found by this test (pFA row) are shown. The pFin column shows the adjusted p-value computed by the Finner test.

Table 7. Statistical comparison among classification methods without filtering.

Table 8. Statistical comparison among noise filters and not preprocessing.

Table 7 shows that the average rank obtained by SVM is the best, followed by C4.5 and NN. The p-value of the Friedman Aligned test shows that the differences found among these methods are significant. Furthermore, the p-values obtained with the Finner test also show that the differences are always significant (lower than 0.1) when comparing SVM with the other two classification methods.

Table 8 focuses on the comparison among the noise filters and not preprocessing, considering each classifier independently. The average ranks of Friedman Aligned show that the best methods are EF, IPF, and NCNE (even though their order can change depending on the classifier considered). The p-values pFA show statistical differences in all the cases. The p-values obtained with the Finner procedure when comparing the noise filters are usually very low, except when these filters are compared with each other (and when EF is compared with not preprocessing using SVM).

From the results of these tables, it is possible to conclude that SVM is the best classifier for the medical data considered in this study. It behaves well even without filtering if the noise level is not too high. The best noise filters are EF, IPF, and NCNE, independently of the classifier used afterward.
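To illustrate the kind of test involved, the snippet below runs the classical Friedman test from SciPy over hypothetical per-dataset accuracies (the numbers are made up for illustration). Note that the study itself uses the Friedman Aligned ranks variant together with the Finner post hoc procedure, neither of which ships with SciPy, so this is only an approximation of the methodology.

```python
from scipy import stats

# Hypothetical accuracies of three classifiers over six datasets
# (rows are paired by dataset; values are illustrative only).
c45 = [0.78, 0.82, 0.70, 0.91, 0.66, 0.85]
svm = [0.81, 0.85, 0.74, 0.93, 0.69, 0.88]
nn  = [0.75, 0.80, 0.68, 0.90, 0.62, 0.83]

# Classical Friedman test: ranks the classifiers within each dataset and
# checks whether the mean ranks differ significantly across classifiers.
stat, p = stats.friedmanchisquare(c45, svm, nn)
print(f"Friedman statistic = {stat:.2f}, p-value = {p:.4f}")
```

Because SVM dominates on every hypothetical dataset here, the within-dataset rankings are identical and the test reports a significant difference; a post hoc procedure such as Finner's would then be applied to the pairwise comparisons.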

Noise filtering in medical data classification: Lessons learned

This section summarizes the main findings and lessons learned about the usage of noise filters in medical data classification, after carrying out the experimental study presented in the previous section.

1. Classifier choice and noisy medical data. Even though the most robust classifier used in this work was C4.5, SVM has generally obtained the best performance results (either with or without filtering). This fact clearly shows that it is necessary to test all the available classifiers on our medical data before choosing one, regardless of our initial assumptions about their robustness to noise. This is in accordance with the no free lunch theorem (Wolpert Citation2001), which states that there is no single universal classification method.

2. Medical data classification without using noise filters. Generally, all the classifiers tested (C4.5, SVM, and NN) provided good accuracy when used on the medical datasets without additional noise. This shows the potential usefulness of machine learning for these types of problems. However, when noise is introduced into the data, their accuracy usually decreases. Therefore, noise filtering should be considered as a preprocessing step when designing medical pattern classification systems, because it may provide important advantages in some cases.
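The uniform class noise scheme referred to above (flipping the labels of a random fraction of training examples) can be sketched as follows. The function `inject_class_noise` and its parameters are illustrative, not the exact implementation used in the study:

```python
import numpy as np

def inject_class_noise(y, noise_level, rng=None):
    """Flip the labels of a random fraction of examples to a different
    class, simulating uniform class noise (illustrative sketch)."""
    rng = np.random.default_rng(rng)
    y_noisy = np.asarray(y).copy()
    classes = np.unique(y_noisy)
    # Number of examples whose labels will be corrupted.
    n_noisy = int(round(noise_level * len(y_noisy)))
    idx = rng.choice(len(y_noisy), size=n_noisy, replace=False)
    for i in idx:
        # Choose a replacement label different from the current one.
        others = classes[classes != y_noisy[i]]
        y_noisy[i] = rng.choice(others)
    return y_noisy

y = np.array([0, 1] * 50)
y10 = inject_class_noise(y, 0.10, rng=42)
print("labels changed:", int((y10 != y).sum()))  # 10% of 100 examples -> 10
```

Evaluating classifiers on such controlled noise levels, as done in the experimental study, makes it possible to measure how quickly performance degrades as noise increases.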

3. Noise filters and noisy medical data. The application of noise filters generally improves classifier performance when data are severely affected by noise. Filters make classifiers more robust to noise, alleviating the influence of noisy samples. The best filters found in our experiments were EF, IPF, and NCNE. Therefore, they are the recommended methods to be used in medical pattern classification systems.

However, for some datasets and noise levels (usually the lowest ones), noise filters do not improve classifier performance compared with no preprocessing. Higher noise levels in the data allow noise filters to better show their potential. Furthermore, one must note that noise filters do not always improve performance (Sáez, Luengo, and Herrera Citation2013); their success depends on several aspects, such as the characteristics of the data treated.
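As an illustration of how a filter such as EF operates, the following sketch removes examples misclassified by a majority vote of diverse classifiers whose predictions are obtained via cross-validation, in the spirit of the Ensemble Filter (Brodley and Friedl 1999). The particular learners, fold count, and dataset are arbitrary choices for the example, not those used in the article:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

def ensemble_filter(X, y, n_folds=5):
    """Flag examples misclassified by a majority of diverse classifiers
    (in the spirit of the Ensemble Filter of Brodley and Friedl 1999)."""
    learners = [
        DecisionTreeClassifier(random_state=0),
        KNeighborsClassifier(n_neighbors=3),
        LogisticRegression(max_iter=5000),
    ]
    # Each learner predicts every example's label via cross-validation,
    # so no example is judged by a model that saw it during training.
    errors = np.stack([
        cross_val_predict(clf, X, y, cv=n_folds) != y
        for clf in learners
    ])
    # Majority vote: an example is suspected noisy if most learners fail on it.
    noisy = errors.sum(axis=0) > len(learners) / 2
    return X[~noisy], y[~noisy], noisy

X, y = load_breast_cancer(return_X_y=True)
X_clean, y_clean, noisy = ensemble_filter(X, y)
print(f"Removed {noisy.sum()} of {len(y)} examples as suspected noise")
```

A consensus variant (removing only examples misclassified by all learners) is more conservative and removes fewer clean examples at the cost of retaining more noise.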

Concluding remarks

In this article, we have examined the usefulness of several noise filters when working with real-world medical datasets affected by class noise. Class noise may result from human expert error or from erroneous data gathering. Even though labeling errors could be mitigated by double-checking the recorded data or by seeking consensus among several experts when labeling the examples, these solutions are usually time consuming and imply a higher cost. Therefore, labeling errors are generally difficult to prevent. Mislabeled examples can strongly influence the learning process, so their treatment is of high importance when building classifiers.

We have analyzed the performance of three classification algorithms (C4.5, SVM, and NN), with and without noise filtering, considering six different noise filters and 12 medical datasets with different noise levels.

The experimental results have shown that even small levels of class noise can significantly decrease the classification performance. Without filtering, SVM generally has obtained the best performance results. Noise filtering is especially crucial with the highest noise levels. The best filters found in our experiments are EF, IPF, and NCNE.

However, applying noise filters does not always improve performance over no preprocessing. For this reason, it is always recommended to test all the available classifiers and noise filters, because their success usually depends on the concrete characteristics of the medical data treated.

Funding

José A. Sáez was supported by EC under FP7, Coordination and Support Action, Grant Agreement Number 316097, ENGINE European Research Centre of Network Intelligence for Innovation Enhancement (http://engine.pwr.wroc.pl). Bartosz Krawczyk and Michał Woźniak were supported by the Polish National Science Centre under grant no. DEC-2013/09/B/ST6/02264.

References

  • Azar, A. T., and A. E. Hassanien. 2015. Dimensionality reduction of medical big data using neural-fuzzy classifier. Soft Computing 19:1115–1127.
  • Brodley, C. E., and M. A. Friedl. 1999. Identifying mislabeled training data. Journal of Artificial Intelligence Research 11:131–167.
  • Cohen, W. W. 1995. Fast effective rule induction. In Proceedings of the twelfth international conference on machine learning. Morgan Kaufmann.
  • Devijver, P. 1986. On the editing rate of the MULTIEDIT algorithm. Pattern Recognition Letters 4:9–12.
  • Finner, H. 1993. On a monotonicity problem in step-down multiple test procedures. Journal of the American Statistical Association 88:920–923.
  • Frénay, B., and M. Verleysen. 2014. Classification in the presence of label noise: A survey. IEEE Transactions on Neural Networks and Learning Systems 25:845–869.
  • Gamberger, D., R. Boskovic, N. Lavrac, and C. Groselj. 1999. Experiments with noise filtering in a medical domain. In: Proceedings of the sixteenth international conference on machine learning. San Francisco, CA: Morgan Kaufmann.
  • Gamberger, D., N. Lavrac, and S. Dzeroski. 1996. Noise elimination in inductive concept learning: A case study in medical diagnosis. In Proceedings of the 7th international workshop on algorithmic learning theory. Berlin Heidelberg: Springer.
  • Gamberger, D., N. Lavrac, and S. Dzeroski. 2000. Noise detection and elimination in data preprocessing: Experiments in medical domains. Applied Artificial Intelligence 14:205–223.
  • Garcia, L. P. F., A. C. P. L. F. de Carvalho, and A. C. Lorena. 2015. Effect of label noise in the complexity of classification problems. Neurocomputing 160:108–119.
  • García, S., A. Fernández, J. Luengo, and F. Herrera. 2010. Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Information Sciences 180:2044–2064.
  • Hickey, R. J. 1996. Noise modelling and evaluating learning from examples. Artificial Intelligence 82:157–179.
  • Khoshgoftaar, T. M., and P. Rebours. 2007. Improving software quality prediction by noise filtering techniques. Journal of Computer Science and Technology 22:387–396.
  • Kononenko, I. 2001. Machine learning for medical diagnosis: History, state of the art and perspective. Artificial Intelligence in Medicine 23:89–109.
  • Krawczyk, B., and P. Filipczuk. 2014. Cytological image analysis with firefly nuclei detection and hybrid one-class classification decomposition. Engineering Applications of Artificial Intelligence 31:126–135.
  • Krawczyk, B., and G. Schaefer. 2014. A hybrid classifier committee for analysing asymmetry features in breast thermograms. Applied Soft Computing 20:112–118.
  • Krawczyk, B., and M. Woźniak. 2015. Hypertension type classification using hierarchical ensemble of one-class classifiers for imbalanced data. In ICT innovations 2014, ed. A. M. Bogdanova and D. Gjorgjevikj, 341–349, Advances in Intelligent Systems and Computing 311, Switzerland: Springer International.
  • le Cessie, S., and J. van Houwelingen. 1992. Ridge estimators in logistic regression. Applied Statistics 41:191–201.
  • Lichman, M. 2013. UCI machine learning repository. http://archive.ics.uci.edu/ml
  • Malossini, A., E. Blanzieri, and R. T. Ng. 2006. Detecting potential labeling errors in microarrays by data perturbation. Bioinformatics 22:2114–2121.
  • Mclachlan, G. J. 2004. Discriminant analysis and statistical pattern recognition, Wiley Series in Probability and Statistics. Wiley-Interscience.
  • Pombo, N., P. Araújo, and J. Viana. 2014. Knowledge discovery in clinical decision support systems for pain management: A systematic review. Artificial Intelligence in Medicine 60:1–11.
  • Quinlan, J. R. 1986. Induction of decision trees. Machine Learning 1:81–106.
  • Quinlan, J. R. 1993. C4.5: Programs for machine learning. San Francisco, CA, USA: Morgan Kaufmann.
  • Sáez, J. A., M. Galar, J. Luengo, and F. Herrera. 2013. Tackling the problem of classification with noisy data using multiple classifier systems: Analysis of the performance and robustness. Information Sciences 247:1–20.
  • Sáez, J. A., M. Galar, J. Luengo, and F. Herrera. 2014. Analyzing the presence of noise in multi-class problems: Alleviating its influence with the one-vs-one decomposition. Knowledge and Information Systems 38:179–206.
  • Sáez, J. A., J. Luengo, and F. Herrera. 2013. Predicting noise filtering efficacy with data complexity measures for nearest neighbor classification. Pattern Recognition 46:355–364.
  • Sánchez, J., R. Barandela, A. Márques, R. Alejo, and J. Badenas. 2003. Analysis of new techniques to obtain quality training sets. Pattern Recognition Letters 24:1015–1022.
  • Sánchez, J., F. Pla, and F. Ferri. 1997. Prototype selection for the nearest neighbor rule through proximity graphs. Pattern Recognition Letters 18:507–513.
  • Teng, C. M. 1999. Correcting noisy data. In Proceedings of the sixteenth international conference on machine learning. San Francisco, CA, USA: Morgan Kaufmann.
  • Vapnik, V. 1998. Statistical learning theory. New York, NY, USA: Wiley.
  • Wang, R. Y., V. C. Storey, and C. P. Firth. 1995. A framework for analysis of data quality research. IEEE Transactions on Knowledge and Data Engineering 7:623–640.
  • Wilson, D. 1972. Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics 2:408–421.
  • Wilson, D. R., and T. R. Martinez. 1997. Improved heterogeneous distance functions. Journal of Artificial Intelligence Research 6:1–34.
  • Wolpert, D. 2001. The supervised learning no-free-lunch theorems. In Proceedings of the 6th online world conference on soft computing in industrial applications, Springer London, 25–42.
  • Wu, X., and X. Zhu. 2008. Mining with noise knowledge: Error-aware data mining. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans 38:917–932.
  • Zhu, X., and X. Wu. 2004. Class noise vs. attribute noise: A quantitative study. Artificial Intelligence Review 22:177–210.
