Abstract
This paper investigates the generic problem of model selection in the specific context of Music Information Retrieval (MIR). In MIR research, similarity measures are developed for ranking musical items with respect to their relevance to a user's musical query. The application of such similarity measures in MIR systems typically requires musical works to be divided into more manageable units. This involves two tasks: melody segmentation and voice separation. For both of these tasks, several computational models have been proposed in the symbolic domain. It seems reasonable to assume that those solutions that are most in accordance with human performance will result in the best ranking of retrieval output.
We conducted two experiments, each with twenty experts and twenty novices. In the melody segmentation experiment, we found high agreement among the participants. Evaluating algorithm output against participant data, we conclude that human output cannot be distinguished from the output of three of the segmentation algorithms (Grouper, IDyOM and LBDM). For voice separation, which we evaluated by means of a melody identification task, the situation is different: the combined results of two algorithms (Skyline and SSA) agreed best with the experimental results, and novice and expert performance differed. Several model selection criteria other than performance are discussed in the conclusion.
Acknowledgements
We would like to thank the authors of the algorithms who have kindly answered our questions and requests for cooperation, in alphabetical order: Sven Ahlbäck, Emilios Cambouropoulos, Elaine Chew, Søren Madsen, Marcus Pearce and David Temperley. We also thank all the participants in the experiments for their cooperation. Bas de Haas and two anonymous reviewers gave valuable comments on earlier versions of this article.
Notes
1. A model is considered surprising if the predicted outcomes are a small fraction of the plausible outcomes.
2. This evaluation of the melody segmentation algorithms replaces the preliminary evaluation that was presented in Nooijer et al. (2008a, b).