Original Articles

The Problem of Limited Inter-rater Agreement in Modelling Music Similarity

Pages 239-251 | Received 07 Oct 2015, Accepted 25 May 2016, Published online: 05 Jul 2016

Figures & data

Figure 1. Average FINE score inter-rater agreement for different intervals of FINE scores (solid line) ± one standard deviation (dash-dot lines). The dashed line indicates theoretical perfect agreement.

Figure 2. Average FINE score of the best-performing system (y-axis) versus year (x-axis), plotted as circles connected by a thick solid line. The three upper bounds are plotted as horizontal lines (solid, dashed and dash-dot).

Table 1. Comparison of the best system versus three upper bounds due to low inter-rater agreement. Mean FINE scores plus standard deviations and t-test statistics are shown. Differences that are not statistically significant are given in bold.

Figure 3. Inter-rater scores plotted as a histogram over all double-annotated pieces contained in the SALAMI data set for a tolerance of 0.5 s. Mean value plotted as a dashed line.

Table 2. Measures (mean and standard deviation) for the lower and upper bounds within the SALAMI data set.

Table 3. Measures (mean and standard deviation) for the lower and upper bounds within different genre classes of the SALAMI data set (tolerance of 0.5 s).

Table 4. Comparison of best algorithm per MIREX edition on the SALAMI data set versus upper bound for a tolerance of 0.5 s. Boundary recognition mean values and standard deviations, and paired t-test statistics are shown.

Table 5. Comparison of best algorithm per MIREX edition on the SALAMI data set versus upper bound for a tolerance of 3 s. Boundary recognition mean values and standard deviations, and paired t-test statistics are shown.