Abstract
This work argues that Music Information Retrieval (MIR) experiments require statistical evaluation. The argument is motivated by applying fundamental notions of statistical hypothesis testing to MIR research. Minimum requirements for statistical evaluation are developed, and the appropriate statistical techniques are introduced and exemplified in a genre classification context. Articles from the MIR literature are examined and criticized for their lack of statistical evaluation.
Notes
1. Some of the remaining papers should in principle have included a quantitative evaluation but reported only illustrative examples instead. The rest of the papers covered topics such as user interface design, databases, and user studies.
2. ISMIR 2004, 5th International Conference on Music Information Retrieval, Audiovisual Institute, Universitat Pompeu Fabra, Barcelona, Spain, 10–14 October 2004. Contest tasks: genre classification, artist identification, rhythm classification, melody estimation, and tempo induction; see http://ismir2004.ismir.net/ISMIRContest.html
3. To be more precise, we used the training set of the contest.