
Influence of Distance Measures on the Effectiveness of One-Class Classification Ensembles


Abstract

Because of the presence of only a single class during the learning procedure, one-class classification is considered one of the most challenging tasks in contemporary machine learning. Recently, ensemble approaches have been applied to this task, proving that they may lead to a significant improvement in the robustness of the recognition system. Most one-class classifiers base their decision on the distance from an object to the decision boundary, canonically expressed in the Euclidean measure. When combining such predictors, one must map the distance into a support function (probability). Additionally, we use the weighted one-class support vector machine, which utilizes a distance-based function for calculating the weights assigned to objects. Therefore, one may easily see that the measure used has a crucial impact on the quality of the base classifiers as well as on the classifier fusion. This study investigates the performance of seven different distance measures over a set of diverse benchmark datasets. Additionally, we analyze the correlation between the distance measure used and the selected fusion method. Experimental analysis allows us to shed some light on the strengths and weaknesses of the examined distance metrics applied to combining one-class classifiers.

Introduction

One-class classification (OCC) is known as learning in the absence of counterexamples; the primary objective of OCC is to train a classifier using only patterns drawn from the target class distribution. Its main goal is to detect an anomaly or a state other than the one characteristic of the target class (Tax and Duin Citation2005). It is assumed that only information about the target class is available. Therefore, no information about the potential nature of outlier objects is needed to derive the decision boundary.

This is a very useful approach in many real-life applications, for example, machine diagnosis: we know what the proper response from the device is, but the ways in which such a device may malfunction are numerous. Therefore, instead of the costly and time-consuming generation of negative examples, one may use OCC to obtain the classification boundary only on the basis of the available samples. Various terms have been used in the literature to refer to one-class learning approaches. The term single-class classification originates from Koch et al. (Citation1995), but outlier detection (Hodge and Austin Citation2004) and novelty detection (Bishop Citation1994) are also used to name this field of study. Another view of OCC is that it seeks to distinguish one specific class from a broader set of classes (e.g., selecting apples from fruits, a gearbox error from machine faults, or fraud from a set of transactions). The target class is considered the positive one, whereas all others are considered negative.

Usually, for a given problem, we may have a pool of several classifiers at our disposal. Canonical machine learning methods concentrate on selecting the single best classifier from the pool and delegating it to the problem-solving task. This approach seems very reasonable and is rooted in normal human behavior: when facing a problem, we tend to search for the most competent expert in a given area, not paying attention to less-renowned specialists. Yet when relying on only a single classifier, we discard the fact that other models from the pool may also offer a valuable contribution to the problem at hand. That is why a combined approach was proposed, utilizing the decisions of more than one classifier. Such methods, known as multiple classifier systems (MCS), are considered to be one of the most promising research directions in the current field of machine learning and pattern recognition (Jain, Duin, and Mao Citation2000).

Using MCS in OCC is an approach that still awaits proper attention. So far, several approaches have been proposed, based on simple bagging (Li and Zhang Citation2008), boosting (Yeh, Lee, and Lee Citation2009), or random subspaces (Cheplygina and Tax Citation2011). Most of the works on this topic were application oriented, focusing, for example, on image retrieval (Wu and Chung Citation2009) or on monitoring information networks (Mazhelis and Puuronen Citation2004). Therefore, there is a lack of works devoted to theoretical advances in combining one-class classifiers. Recently, the authors have proposed diversity measures for selecting one-class predictors for the ensemble (Krawczyk and Woźniak Citation2012a).

Several methods for combining one-class classifiers were proposed in Tax and Duin (Citation2001); they will be described in more detail in the section titled “Combining One-Class Classifiers.” One should notice that most one-class predictors base their decision on the distance from an object to the decision boundary. To apply fusion methods, we require the probability (or classification support) of object x for a given class. Therefore, to perform the classifier fusion, one must map the distance into a probability. The Euclidean distance has canonically been used in OCC; the majority of studies dealing with both the theory and applications of one-class classification have used this measure. So far, no deeper examination has been conducted of the properties of this measure in one-class tasks. Therefore, switching to different methods of measuring the distance between objects may have a beneficial influence on the classification step. This is a well-known approach in multiclass classification, but so far unexplored for one-class methods (Krawczyk and Woźniak Citation2012b). This work proposes to use different measures of distance and investigates their influence on the overall accuracy of the combined classifier.

The main contributions of this study are as follows:

  • A proposal to use different distance measures, to see whether they are more useful for the process of combining one-class classifiers than the canonically used Euclidean measure.

  • An investigation of the weighted one-class support vector machine with different distance metrics. This one-class classifier strongly relies on these metrics, because the weights of objects are calculated by measuring their distances from the center of the hypersphere.

  • Extensive computational experiments that allow some light to be shed on the strengths and weaknesses of the examined distance measures. Additionally, the correlation between the applied fusion method and the distance measure used is analyzed.

The manuscript is organized as follows: in the next section, the weighted one-class support vector machine is described briefly. The section following that presents the concept of combining one-class classifiers. “Distance Measures” deals with the distance metrics used. “Experimental Investigations” presents our experiments, and the final section concludes the paper.

Weighted One-Class Support Vector Machine

One-class classification aims at distinguishing between the available objects coming from the target distribution \(\omega_T\) and unknown outliers \(\omega_O\), which are unavailable during the classifier training step but may appear in the process of classifier exploitation. The one-class support vector machine (OCSVM; Schölkopf and Smola Citation2002) achieves this goal by computing a closed boundary in the form of a hypersphere enclosing all the objects from \(\omega_T\). During the exploitation phase, a decision about a new object is made by checking whether it falls inside the hypersphere. If so, the new object is labeled as one belonging to \(\omega_T\); otherwise it belongs to \(\omega_O\).

The center a and the radius R are the two parameters that are sufficient for describing such a decision hypersphere. To have a low acceptance of possible outliers, the volume of this d-dimensional hypersphere, which is proportional to \(R^d\), should be minimized in such a way that it tightly encompasses all available objects from \(\omega_T\). The minimization of \(R^d\) implies minimization with respect to \(R^2\). Following this, the minimization functional may be formulated as follows:

(1) \(F(R, a) = R^2\)
with respect to the constraint:
(2) \(\|x_i - a\|^2 \le R^2, \quad i = 1, \ldots, N,\)
where \(x_i\) are objects from \(\omega_T\) and N stands for the number of training objects. Additionally, to allow for the fact that there may have been some outliers in the training set and to increase the robustness of the trained classifier, some objects with distances to a greater than R are allowed in the training set but are associated with an additional penalty factor. This is done identically to a standard SVM, by the introduction of slack variables \(\xi_i\).

This concept can be further extended to the weighted one-class support vector machine (WOCSVM; Bicego and Figueiredo Citation2009) by the introduction of weights \(w_i\) that allow an importance measure to be associated with each of the training objects. This forces the slack variables \(\xi_i\) to be additionally controlled by \(w_i\). If a small weight \(w_i\) is associated with object \(x_i\), then the corresponding penalty term \(w_i \xi_i\) is small. In effect, the corresponding slack variable may be larger, allowing \(x_i\) to lie farther from the center a of the hypersphere. This reduces the impact of \(x_i\) on the shape of the decision boundary of the WOCSVM.

By using the above-mentioned ideas, we can modify the minimization functional:

(3) \(F(R, a, \xi) = R^2 + C \sum_{i=1}^{N} w_i \xi_i\)
with the modified constraints that almost all objects are within the hypersphere:
(4) \(\|x_i - a\|^2 \le R^2 + \xi_i, \quad i = 1, \ldots, N,\)
where \(\xi_i \ge 0\) and \(0 \le w_i \le 1\). Here, C stands for a parameter that controls the optimization process; the larger C is, the fewer outliers are allowed, at the cost of an increase in the volume of the hypersphere.

For establishing weights, we may use techniques dedicated to weighted multiclass support vector machines (Cyganek Citation2012). In this study, we propose to use the following formula:

(5) \(w_i = 1 - \frac{d(x_i, a)}{\max_{j} d(x_j, a) + \delta},\)
where \(d(x_i, a)\) is the distance between the object \(x_i\) and the center a, and \(\delta > 0\) is used to prevent the case of \(w_i = 0\).
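For illustration, the distance-based weighting can be sketched in a few lines of Python. The closed form below is an illustrative assumption consistent with the description above (objects far from the center receive small weights, and \(\delta\) keeps every weight positive); the center estimate is a simple mean rather than the optimized a, and the function name is ours:

```python
import numpy as np

def wocsvm_weights(X, delta=1e-2, dist=None):
    """Distance-based instance weights, sketching the idea of Eq. (5).

    Objects lying far from the center estimate receive small weights,
    so their slack variables are cheap and they influence the decision
    boundary less. `delta` keeps the farthest object's weight above 0.
    """
    if dist is None:
        # squared Euclidean distance, the canonical choice in the article
        dist = lambda u, v: float(np.sum((u - v) ** 2))
    a = X.mean(axis=0)                       # simple center estimate
    d = np.array([dist(x, a) for x in X])    # distances to the center
    return 1.0 - d / (d.max() + delta)
```

Objects closest to the center obtain weights near 1, while the farthest object obtains a small but strictly positive weight.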

Combining One-Class Classifiers

One-class boundary methods are based on computing the distance between the object x and the target class \(\omega_T\). To apply fusion methods, we require the probability (or classification support) of object x for a given class. Therefore, to conduct the fusion we use a heuristic mapping. We propose to use the following solution:

(6) \(\tilde{P}(x|\omega_T) = c_1 \exp\left(-\frac{d(x, \omega_T)}{c_2}\right),\)
which models a Gaussian distribution around the classifier, where \(d(x, \omega_T)\) is a distance metric, \(c_1\) is the normalization constant, and \(c_2\) is the scale parameter. Parameters \(c_1\) and \(c_2\) should be fitted to the target class distribution.
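A minimal sketch of this mapping, with illustrative default values for \(c_1\) and \(c_2\) (in practice both would be fitted to the target class distribution):

```python
import math

def distance_to_support(d, c1=1.0, c2=1.0):
    """Heuristic distance-to-support mapping of Eq. (6):
    P~(x|omega_T) = c1 * exp(-d / c2).

    c1 (normalization) and c2 (scale) are illustrative defaults here;
    the article fits both to the target class.
    """
    return c1 * math.exp(-d / c2)
```

The mapping is monotonically decreasing in the distance, so objects far from the target class receive low support.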

After performing such a mapping, one may use the proposed fusion functions (Tax and Duin Citation2001). This study applies all five of them, with the assumption that the pool consists of R one-class classifiers:

  1. Mean vote, which combines binary output labels of one-class classifiers. It is expressed by:

    (7) \(y_{mv}(x) = \frac{1}{R} \sum_{k=1}^{R} I\left(\tilde{P}_k(x|\omega_T) \ge \theta_k\right),\)
    where \(I(\cdot)\) is the indicator function and \(\theta_k\) is a classification threshold. When a threshold equal to 0.5 is applied, this rule transforms into a majority vote for binary problems.

  2. Mean weighted vote, which introduces the weighting of the base classifiers by \(f_{T,k}\), where \(f_{T,k}\) is the fraction of target class objects accepted by the kth classifier:

    (8) \(y_{mwv}(x) = \frac{1}{R} \sum_{k=1}^{R} \left[ f_{T,k}\, I\left(\tilde{P}_k(x|\omega_T) \ge \theta_k\right) + (1 - f_{T,k})\, I\left(\tilde{P}_k(x|\omega_T) < \theta_k\right) \right],\)

    which is a smoothed version of the mean vote method.

  3. Product of the weighted votes, which is defined as:

    (9) \(y_{pwv}(x) = \prod_{k=1}^{R} \left[ f_{T,k}\, I\left(\tilde{P}_k(x|\omega_T) \ge \theta_k\right) + (1 - f_{T,k})\, I\left(\tilde{P}_k(x|\omega_T) < \theta_k\right) \right]\)

  4. Mean of the estimated probabilities, which is expressed by:

    (10) \(y_{mp}(x) = \frac{1}{R} \sum_{k=1}^{R} \tilde{P}_k(x|\omega_T)\)

  5. Product combination of the estimated probabilities, which is expressed by:

    (11) \(y_{pp}(x) = \prod_{k=1}^{R} \tilde{P}_k(x|\omega_T)\)

    This fusion method assumes that the outlier object distribution is independent of x and thus uniform in the area around the target concept.
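The five fusion rules can be sketched as follows. The algebraic forms are our reading of Tax and Duin (2001); in particular, the smoothed form used for the product of the weighted votes is an assumption, and the function name is ours:

```python
import numpy as np

def fuse(P, theta, f, rule="mp"):
    """Fusion of R one-class supports for a single object x.

    P     : length-R array of mapped supports P_k(x|omega_T)
    theta : length-R array of classification thresholds theta_k
    f     : length-R array of fractions f_Tk of target-class objects
            accepted by each base classifier (weighted rules only)
    """
    P, theta, f = (np.asarray(a, dtype=float) for a in (P, theta, f))
    votes = (P >= theta).astype(float)        # binary outputs I(.)
    if rule == "mv":                          # mean vote
        return float(votes.mean())
    if rule == "mwv":                         # mean weighted vote
        return float(np.mean(f * votes + (1 - f) * (1 - votes)))
    if rule == "pwv":                         # product of weighted votes
        return float(np.prod(f * votes + (1 - f) * (1 - votes)))
    if rule == "mp":                          # mean of estimated probabilities
        return float(P.mean())
    if rule == "pp":                          # product of estimated probabilities
        return float(np.prod(P))
    raise ValueError(f"unknown rule: {rule}")
```

The fused score is then compared against an ensemble-level threshold to accept or reject the object as a member of the target class.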

Distance Measures

As one may see from the previous section, the distance measure used has a major impact on the performance of the WOCSVM and on the process of combining one-class classifiers. So far, only the Euclidean squared distance has been used for this problem (Tax and Duin Citation2001). In this work, we would like to investigate the performance of other distance metrics for one-class classifier fusion, to see how they cope with the learning paradigm of the OCSVM and to check whether there exists a correlation between specific distance measures and the fusion method used.

There exists a significant number of distance measures, derived from various fields such as mathematics, physics, information theory, computer science, and econometrics. A good review of existing distance measures is presented in Cha (Citation2007). Many machine learning algorithms, such as minimal distance classifiers (Domeniconi, Peng, and Gunopulos Citation2002), rely on the distance metric for the input data patterns. In recent years, many studies have demonstrated, both empirically and theoretically, that a proper metric can significantly improve performance in classification, clustering, and retrieval tasks (Hu, Yu, and Xie Citation2008).

A distance metric \(d(m, n)\) is a function for calculating the distance between two objects m and n. To be considered a distance metric, a given function has to fulfill the following three properties (Wilson and Martinez Citation1997):

  1. It is always greater than or equal to zero.

  2. The distance from an object to itself is always equal to zero.

  3. It obeys the triangle inequality, i.e., \(d(m, o) \le d(m, n) + d(n, o)\) for three points m, n, and o, for any choice of n.
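These properties can be checked empirically on a sample of points; the helper below is an illustrative sketch (a passing check is evidence, not a proof, of metricity):

```python
import itertools

def is_metric(dist, points, tol=1e-9):
    """Empirically check the three metric properties listed above:
    non-negativity, identity d(m, m) = 0, and the triangle inequality
    d(m, o) <= d(m, n) + d(n, o), over all triples of sample points.
    """
    for m in points:
        if dist(m, m) > tol:                  # identity
            return False
    for m, n, o in itertools.product(points, repeat=3):
        if dist(m, n) < -tol:                 # non-negativity
            return False
        if dist(m, o) > dist(m, n) + dist(n, o) + tol:
            return False                      # triangle inequality
    return True
```

Note that the Euclidean squared distance, although widely used, violates the triangle inequality (for collinear points at 0, 1, and 2, the squared distance gives 4 > 1 + 1), so strictly speaking it is a dissimilarity measure rather than a metric.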

This article investigates the performance of seven popular distance metrics, assuming that each object is described by k features indexed by i. The Euclidean squared distance, Canberra distance, Chebyshev distance, Manhattan distance, and cosine correlation are independent of the underlying data distributions, whereas the Pearson correlation distance and the uncentered Pearson correlation distance require estimation of the mean and the standard deviation on the training set.

  1. Euclidean squared distance:

    (12) \(d_E(m, n) = \sum_{i=1}^{k} (m_i - n_i)^2\)

  2. Canberra distance:

    (13) \(d_{Can}(m, n) = \sum_{i=1}^{k} \frac{|m_i - n_i|}{|m_i| + |n_i|}\)

  3. Chebyshev distance:

    (14) \(d_{Ch}(m, n) = \max_{i} |m_i - n_i|\)

  4. Manhattan distance:

    (15) \(d_{M}(m, n) = \sum_{i=1}^{k} |m_i - n_i|\)

  5. Pearson correlation distance, in which one must first compute the correlation parameter r:

    (16) \(r = \frac{1}{k} \sum_{i=1}^{k} \left( \frac{m_i - \bar{m}}{\sigma_m} \right) \left( \frac{n_i - \bar{n}}{\sigma_n} \right),\)

    where \(\bar{m}\) and \(\bar{n}\) are the average values and \(\sigma_m\) and \(\sigma_n\) are the standard deviations. Using this parameter, one may define a distance metric as:

    (17) \(d_P(m, n) = 1 - r,\)

    where \(d_P(m, n)\) takes values from [0, 2].

  6. Uncentered Pearson correlation distance is similar to Pearson correlation distance, but we assume that average values are equal to 0:

    (18) \(r_U = \frac{1}{k} \sum_{i=1}^{k} \left( \frac{m_i}{\sigma_m^{(0)}} \right) \left( \frac{n_i}{\sigma_n^{(0)}} \right), \quad \sigma_m^{(0)} = \sqrt{\frac{1}{k} \sum_{i=1}^{k} m_i^2}\)

    Using this parameter, one may define a distance metric as:

    (19) \(d_U(m, n) = 1 - r_U\)

  7. Cosine correlation:

    (20) \(d_{\cos}(m, n) = 1 - \frac{\sum_{i=1}^{k} m_i n_i}{\sqrt{\sum_{i=1}^{k} m_i^2} \sqrt{\sum_{i=1}^{k} n_i^2}}\)
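For reference, the seven measures can be implemented directly on feature vectors; the function names are ours, and the Canberra sum is undefined for components where both vectors are zero:

```python
import numpy as np

def euclidean_sq(m, n):
    return float(np.sum((m - n) ** 2))

def canberra(m, n):
    # undefined for components where both m_i and n_i are zero
    return float(np.sum(np.abs(m - n) / (np.abs(m) + np.abs(n))))

def chebyshev(m, n):
    return float(np.max(np.abs(m - n)))

def manhattan(m, n):
    return float(np.sum(np.abs(m - n)))

def pearson_dist(m, n):
    # d_P = 1 - r, where r is the sample correlation of the two vectors
    return float(1.0 - np.corrcoef(m, n)[0, 1])

def cosine_dist(m, n):
    return float(1.0 - np.dot(m, n) /
                 (np.sqrt(np.dot(m, m)) * np.sqrt(np.dot(n, n))))

def uncentered_pearson_dist(m, n):
    # with the means assumed to be zero, this coincides with cosine distance
    return cosine_dist(m, n)
```

Note that the uncentered Pearson correlation distance and the cosine correlation coincide algebraically; the article treats them separately because the former is derived by zeroing the means in the Pearson formula.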

Experimental Investigations

The experimental investigations were designed in a way that allows the following questions to be answered:

  • Is there any improvement in quality when switching to different types of distance measures in combining one-class classifiers?

  • How do the distance measures used influence the performance of WOCSVM ensembles?

  • Is there any correlation between the used distance metric and the selected fusion method?

Datasets

We have chosen 10 binary datasets in total: nine coming from the UCI Repository and an additional one originating from the chemoinformatics domain, describing the process of discovering pharmaceutically useful isoforms of the CYP 2C19 molecule. The dataset is available for download at SIAM (Citation2011).

Details of the chosen datasets are given in Table 1. We should point out that, so far, there have been no benchmark one-class datasets. Therefore, the only way to assess the performance of new methods in this field was to derive one-class problems by decomposing multiclass benchmarks, as in our previous works (Krawczyk and Woźniak Citation2012a). This article uses a simple, unified, and effective scheme for the transformation of multiclass sets into canonical one-class problems.

TABLE 1 Details of Datasets Used in the Experimental Investigation

Let’s assume that a dataset consists of objects drawn from a set of M classes. One class is chosen to become the target class \(\omega_T\); all objects from the remaining classes become outlier objects with the label \(\omega_O\). The pool of R individual classifiers is then trained with the normal procedure (such as cross-validation) on the objects from class \(\omega_T\), whereas the objects from \(\omega_O\) are used for the testing phase. Therefore, the testing set at each validation consists of all outlier objects and one fold of the objects from the target class. With this, it is possible to estimate the error on the outlier class and the generalization properties of the classifier on the target class. One should notice that from a single M-class problem this procedure may derive M separate one-class datasets.

In this work, as a target class we have chosen the minority class; the remaining class was used as outliers.
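The decomposition protocol described above can be sketched as follows; the function name and the array representation are ours:

```python
import numpy as np

def to_one_class(X, y, target_label=None):
    """Decompose a labeled dataset into a one-class problem: one class
    becomes the target omega_T, all remaining objects become outliers
    omega_O. By default the minority class is chosen as the target,
    following the experimental protocol of the article.
    """
    labels, counts = np.unique(y, return_counts=True)
    if target_label is None:
        target_label = labels[np.argmin(counts)]   # minority class
    target = X[y == target_label]     # used for training (via CV folds)
    outliers = X[y != target_label]   # used only during testing
    return target, outliers
```

At each validation step, the classifier is trained on target-class folds only, and tested on the held-out target fold together with all outliers.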

Set-Up

As the base classifier, a WOCSVM with a polynomial kernel was selected. The parameters of the kernel (the degree and the cost parameter) were established with the use of a tuning procedure. The pool of classifiers was constructed using the random subspace method (Ho Citation1998) and consisted of ten models. Each subspace contained 60% of the original feature set. The focus of this study is not on the problem of selecting classifiers for the committee; therefore, the entire pool was always used.
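The construction of the pool's feature subsets can be sketched as follows; the seeding and the rounding of the subspace size are our assumptions:

```python
import numpy as np

def random_subspaces(n_features, n_models=10, frac=0.6, seed=0):
    """Draw the feature subsets for a random subspace ensemble
    (Ho 1998), mirroring the set-up above: ten base models, each
    trained on 60% of the original features, sampled without
    replacement independently for each model.
    """
    rng = np.random.default_rng(seed)
    k = max(1, int(round(frac * n_features)))
    return [rng.choice(n_features, size=k, replace=False)
            for _ in range(n_models)]
```

Each base WOCSVM would then be trained on its own column subset of the target-class data.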

For the fusion methods, the threshold parameter \(\theta_k\) was established with the use of a grid search procedure.

The combined 5 × 2 cv F test (Alpaydin Citation1999) was carried out to assess the statistical significance of the obtained results.
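The statistic of the combined 5 × 2 cv F test can be computed from the ten per-fold accuracy differences; the sketch below follows Alpaydin (1999), with the resulting statistic compared against the F(10, 5) distribution:

```python
import numpy as np

def combined_5x2cv_f(diffs):
    """Alpaydin's combined 5x2cv F statistic.

    `diffs` is a 5x2 array of accuracy differences d_i^(j) between two
    classifiers (replication i, fold j). The statistic
        f = sum_ij (d_i^(j))^2 / (2 * sum_i s_i^2),
    with s_i^2 the per-replication variance, is approximately
    F-distributed with (10, 5) degrees of freedom under the null
    hypothesis of equal performance.
    """
    diffs = np.asarray(diffs, dtype=float)
    mean_d = diffs.mean(axis=1, keepdims=True)
    s2 = ((diffs - mean_d) ** 2).sum(axis=1)   # per-replication variances
    return float((diffs ** 2).sum() / (2.0 * s2.sum()))
```

The null hypothesis is rejected when the statistic exceeds the chosen critical value of F(10, 5).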

All experiments were carried out in the R language (R Development Core Team Citation2008).

Results and Discussion

The results of the experiment are given in Table 2.

TABLE 2 Classification Accuracy [%] on the Benchmark Datasets

From the experimental analysis, one may see that the Euclidean distance is not always the best solution for combining WOCSVMs. In some cases it offers the best performance, but in others, switching to a different distance metric seems beneficial for the quality of the recognition system. Among the six other investigated measures, the best performance was returned by the Pearson correlation distance, which in some cases significantly outperformed the Euclidean metric. Interestingly, the uncentered Pearson correlation distance displayed inferior performance, never behaving better than its more common version; in most cases, the canonical Pearson distance behaved statistically better than its uncentered variant. The worst performance was returned by the Manhattan distance in most cases; yet, for a few datasets its performance improved, which leads to the conclusion that it deals well with combining one-class classifiers over categorical or discrete features.

As for the combination methods, the best performance was returned by the mean weighted vote and the product of the weighted votes. The latter seemed to behave especially well when coupled with the Pearson correlation distance metric, which leads to the conclusion that this setting can boost the accuracy of one-class ensembles, especially for datasets that are based on real-number features or are imbalanced; we can observe a large gain in accuracy for the highly imbalanced CYP 2C19 isoform dataset. The mean of the estimated probabilities and the product combination of the estimated probabilities returned inferior results in comparison with the aforementioned fusion methods, but it is worth mentioning that both worked very well with the cosine correlation metric. This is further evidence that there exists a correlation between the performance of a selected type of one-class combiner and the chosen distance metric.

Finally, we would like to compare our findings with our previous research results. In our earlier paper (Krawczyk and Woźniak Citation2012b), we investigated the influence of distance measures on ensembles of OCSVMs, finding no strict trend in the data. In contrast, for the WOCSVM we observed consistently good performance of the Pearson correlation distance metric, which allows us to conclude that the WOCSVM copes well with it.

Conclusions and Future Works

This article addressed the issue of the distance metrics used in one-class classification. Boundary methods, such as the WOCSVM used here, base their decisions on the distance from an examined object to the decision surface. When combining several one-class classifiers, one must additionally map this distance into a probability. Therefore, the way the distance is measured has a strong influence on the creation of combined one-class classifiers. We have examined seven different distance measures with five different classifier fusion methods.

Experimental investigation, backed up by a statistical test of significance, showed that, in some cases, switching to a different distance metric may lead to a significant improvement in the overall accuracy of the ensemble. The WOCSVM seems to work very well with the combination of the product of the weighted votes fusion strategy and the Pearson correlation distance metric. This shows that the distance metric problem for combining one-class classifiers is a worthwhile research direction.

Our future works will concentrate on introducing a metric learning system for one-class classification, which may automatically select the most promising distance metric for a given dataset.

Funding

This work was supported by The Polish National Science Centre under the grant PRELUDIUM number DEC-2013/09/N/ST6/03504.

References

  • Alpaydin, E. 1999. Combined 5 × 2 cv f test for comparing supervised classification learning algorithms. Neural Computation 11(8):1885–1892.
  • Bicego, M. and M. A. T. Figueiredo. 2009. Soft clustering using weighted one-class support vector machines. Pattern Recognition 42(1):27–32.
  • Bishop, C. M. 1994. Novelty detection and neural network validation. Vision, Image and Signal Processing 141(4): 217–222.
  • Cha, S. H. 2007. Comprehensive study on distance/similarity measures between probability density functions. International Journal of Mathematical Modeling and Methods in Applied Sciences 1(4):300–307.
  • Cheplygina, V., and D. M. J. Tax. 2011. Pruned random subspace method for one-class classifiers. Lecture Notes in Computer Science 6713:96–105. Berlin, Heidelberg: Springer.
  • Cyganek, B. 2012. One-class support vector ensembles for image segmentation and classification. Journal of Mathematical Imaging and Vision 42(2–3):103–117.
  • Domeniconi, C., J. Peng, and D. Gunopulos. 2002. Locally adaptive metric nearest-neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(9):1281–1285.
  • Ho, T. K. 1998. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(8):832–844.
  • Hodge, V. J., and J. Austin. 2004. A survey of outlier detection methodologies. Artificial Intelligence Review 22(2):85–126.
  • Hu, Q., D. Yu, and Z. Xie. 2008. Neighborhood classifiers. Expert Systems with Applications 34(2):866–876.
  • Jain, A. K., R. P. W. Duin, and J. Mao. 2000. Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(1):4–37.
  • Koch, M. W., M. M. Moya, L. D. Hostetler, and R. J. Fogler. 1995. Cueing, feature discovery, and one-class learning for synthetic aperture radar automatic target recognition. Neural Networks 8(7–8):1081–1102.
  • Krawczyk, B., and M. Woźniak. 2012a. Combining diverse one-class classifiers. Lecture Notes in Artificial Intelligence 7209 (Part 2):590–601.
  • Krawczyk, B., and M. Woźniak. 2012b. Experiments on distance measures for combining one-class classifiers. In Proceedings of the FedCSIS 2012 Conference, 88–92, 9–12 September, IEEE, Wroclaw, Poland.
  • Li, C., and Y. Zhang. 2008. Bagging one-class decision trees. In Proceedings of 5th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2008), 2:420–423.
  • Mazhelis, O., and S. Puuronen. 2004. Combining one-class classifiers for mobile-user substitution detection. Paper presented at the ICEIS 2004—Proceedings of the Sixth International Conference on Enterprise Information Systems, 130–137, 14–17 April, Universidade Portucalense, Porto, Portugal.
  • R Development Core Team. 2008. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.
  • Schölkopf, B., and A. J. Smola. 2002. Learning with kernels: Support vector machines, regularization, optimization, and beyond. Adaptive Computation and Machine Learning. Cambridge, MA, USA: MIT Press.
  • SIAM. 2011. Proceedings of the Eleventh SIAM International Conference on Data Mining, SDM 2011, April 28–30, Mesa, Arizona, USA. SIAM/Omnipress. http://tunedit.org/challenge/QSAR
  • Tax, D. M. J. and R. P. W. Duin. 2001. Combining one-class classifiers. In Proceedings of the Second International Workshop on Multiple Classifier Systems, MCS ’01, 299–308. London, UK: Springer-Verlag.
  • Tax, D. M. J. and R. P. W. Duin. 2005. Characterizing one-class datasets. In Proceedings of the Sixteenth Annual Symposium of the Pattern Recognition Association of South Africa, 21–26.
  • Wilson, D. R., and T. R. Martinez. 1997. Improved heterogeneous distance functions. Journal of Artificial Intelligence Research 6:1–34.
  • Wu, R.-S., and W.-H. Chung. 2009. Ensemble one-class support vector machines for content-based image retrieval. Expert Systems with Applications 36(3):4451–4459.
  • Yeh, C.-Y., Z.-Y. Lee, and S.-J. Lee. 2009. Boosting one-class support vector machines for multi-class classification. Applied Artificial Intelligence 23(4):297–315.
