Original Articles

Representative Selection in Nonmetric Datasets


Abstract

This study considers the problem of representative selection: choosing a subset of data points from a dataset that best represents its overall set of elements. This subset needs to inherently reflect the type of information contained in the entire set, while minimizing redundancy. For such purposes, clustering might seem like a natural approach. However, existing clustering methods are not ideally suited for representative selection, especially when dealing with nonmetric data, in which only a pairwise similarity measure exists. In this article we propose δ-medoids, a novel approach that can be viewed as an extension of the k-medoids algorithm and is specifically suited for sample representative selection from nonmetric data. We empirically validate δ-medoids in two domains: music analysis and motion analysis. We also show some theoretical bounds on the performance of δ-medoids and the hardness of representative selection in general.
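To make the setting concrete, the following Python sketch shows one simple way a δ-medoids-style selection could be organized: a greedy pass that covers every point within dissimilarity δ of some representative, followed by a medoid refinement of each resulting cluster. This is only a minimal illustration under our own assumptions; the function name delta_select, the dist callable, the shuffling, the refinement loop, and the toy string data are all ours and do not reproduce the article's actual algorithm or experiments.

import random


def delta_select(points, dist, delta, max_iter=10, seed=0):
    """Greedy delta-cover with a medoid refinement pass.

    A minimal sketch of a delta-medoids-style selection, not the article's
    exact algorithm: after each pass, every point lies within dissimilarity
    `delta` of some selected representative, and each representative is the
    medoid of the points assigned to it.  `dist` may be any pairwise
    dissimilarity function; no vector embedding or triangle inequality is
    assumed.
    """
    rng = random.Random(seed)
    order = list(range(len(points)))
    reps = []                                    # indices of representatives
    for _ in range(max_iter):
        rng.shuffle(order)
        clusters = {r: [] for r in reps}         # rep index -> member indices
        for i in order:
            # Assign i to its nearest current representative if that
            # representative is within delta; otherwise i becomes a new one.
            best = min(reps, key=lambda r: dist(points[i], points[r])) if reps else None
            if best is not None and dist(points[i], points[best]) <= delta:
                clusters[best].append(i)
            else:
                reps.append(i)
                clusters[i] = [i]
        # Refinement: replace each representative by its cluster's medoid,
        # i.e. the member with the smallest total dissimilarity to the rest.
        reps = [min(members,
                    key=lambda c: sum(dist(points[c], points[m]) for m in members))
                for members in clusters.values() if members]
    return [points[r] for r in reps]


# Toy usage with a nonmetric string dissimilarity (hypothetical data):
words = ["cat", "cart", "car", "dog", "dig", "dot"]
overlap = lambda a, b: 1.0 - len(set(a) & set(b)) / max(len(set(a)), len(set(b)))
print(delta_select(words, overlap, delta=0.4))

The random shuffle at the start of each pass is a design choice in this sketch so that the greedy cover does not depend on a fixed input order; only the pairwise dissimilarity is ever queried, which is what makes this style of procedure applicable to nonmetric data.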


Notes

1 In certain contexts, metric learning (Xing et al. Citation2002; Davis et al. Citation2007) can be applied, but current methods are not well suited for data without vector space representation, and in some sense, learning a metric is of lesser interest for representative selection, because we care less about classification or the structural relations latent in the data.

2 For spectral clustering, the requirement is actually an affinity (or proximity) matrix.

3 In some cases, the distance matrix can be made sparse via KD-trees and nearest-neighbor approximations, which also require a metric embedding (Chen et al. Citation2011).

4 We note that no better approximation scheme is possible under standard complexity theoretic assumptions (Hochbaum and Shmoys Citation1985).

5 In fact, this proof applies for any value of k that cannot be directly manipulated by the algorithm.

6 In fact, this proof applies for any value of k that cannot be directly manipulated by the algorithm.

Additional information

Funding

This work has taken place in the Learning Agents Research Group (LARG) at the Artificial Intelligence Laboratory, The University of Texas at Austin. LARG research is supported in part by grants from the National Science Foundation (CNS-1330072, CNS-1305287), ONR (21C184-01), AFRL (FA8750-14-1-0070), AFOSR (FA9550-14-1-0087), and Yujin Robot.
