Abstract
This study considers the problem of representative selection: choosing a subset of data points that best represents the dataset as a whole. Such a subset needs to faithfully reflect the type of information contained in the entire set while minimizing redundancy. For this purpose, clustering might seem like a natural approach. However, existing clustering methods are not ideally suited for representative selection, especially when dealing with nonmetric data, in which only a pairwise similarity measure exists. In this article we propose δ-medoids, a novel approach that can be viewed as an extension of the k-medoids algorithm and is specifically suited for representative selection from nonmetric data. We empirically validate δ-medoids in two domains: music analysis and motion analysis. We also establish theoretical bounds on the performance of δ-medoids and on the hardness of representative selection in general.
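To make the problem concrete, the following is a minimal sketch of the core idea behind δ-style representative selection: greedily pick representatives so that every point lies within a similarity threshold δ of some chosen representative, using only a pairwise distance function (no vector-space embedding required). This is an illustrative simplification, not the paper's exact δ-medoids algorithm, which additionally refines the medoid of each induced cluster; the function and parameter names here are hypothetical.

```python
def greedy_delta_representatives(points, dist, delta):
    """Greedily select representatives so every point is within
    distance `delta` of some representative.

    Illustrative sketch only: the actual delta-medoids algorithm
    also iteratively re-selects each cluster's medoid to reduce
    redundancy. `dist` is any pairwise distance/dissimilarity
    function; no metric embedding is assumed.
    """
    reps = []
    for p in points:
        # Add p as a new representative only if no existing
        # representative already covers it within delta.
        if all(dist(p, r) > delta for r in reps):
            reps.append(p)
    return reps


# Example with scalar data and absolute difference as the distance.
reps = greedy_delta_representatives(
    [0, 1, 2, 10, 11], dist=lambda a, b: abs(a - b), delta=1.5
)
# → [0, 2, 10]
```

Note that, unlike k-medoids, the number of representatives is not fixed in advance; it emerges from the coverage threshold δ, which is the property that makes this formulation natural for representative selection.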
Notes
1 In certain contexts, metric learning (Xing et al. Citation2002; Davis et al. Citation2007) can be applied, but current methods are not well suited for data without vector space representation, and in some sense, learning a metric is of lesser interest for representative selection, because we care less about classification or the structural relations latent in the data.
2 For spectral clustering, the requirement is actually an affinity (or proximity) matrix.
3 In some cases, the distance matrix can be made sparse via KD-trees and nearest-neighbor approximations, which also require a metric embedding (Chen et al. Citation2011).
4 We note that no better approximation scheme is possible under standard complexity theoretic assumptions (Hochbaum and Shmoys Citation1985).
5 In fact, this proof applies for any value of k that cannot be directly manipulated by the algorithm.
6 In fact, this proof applies for any value of k that cannot be directly manipulated by the algorithm.