Original Articles

Prediction of Musical Affect Using a Combination of Acoustic Structural Cues

Pages 39-67 | Published online: 16 Feb 2007
 

Abstract

This study explores whether musical affect attribution can be predicted by a linear combination of acoustical structural cues. To that aim, a database of sixty musical audio excerpts was compiled and analyzed at three levels: judgments of affective content by subjects; judgments of structural content by musicological experts (i.e., "manual structural cues"); and extraction of structural content by an auditory-based computer algorithm ("acoustical structural cues"). In Study I, an affect space was constructed with Valence (gay-sad), Activity (tender-bold) and Interest (exciting-boring) as the main dimensions, using the responses of a hundred subjects. In Study II, manual and acoustical structural cues were analyzed and compared. Manual structural cues such as loudness and articulation could be accounted for in terms of a combination of acoustical structural cues. In Study III, the subjective responses of eight individual subjects were analyzed using the affect space obtained in Study I and modeled in terms of the structural cues obtained in Study II, using linear regression modeling. This worked better for the Activity dimension than for the Valence dimension, while the Interest dimension could not be accounted for. Overall, manual structural cues worked better than acoustical structural cues. In a final assessment study, a selected set of acoustical structural cues was used for building prediction models. The results indicate that musical affect attribution can partly be predicted using a combination of acoustical structural cues. Future research may focus on non-linear approaches, expansion of the dataset and subject pool, and refinement of acoustical structural cue extraction.
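To make the modelling step concrete, the following is a minimal sketch, not the authors' implementation, of how one affect dimension (e.g., Activity) could be predicted as a linear combination of acoustical structural cues using ordinary least squares regression. The cue names, data shapes and values are illustrative assumptions only.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# 60 excerpts x 4 hypothetical acoustical cues (e.g., loudness, roughness, onset rate, centroid)
X = rng.normal(size=(60, 4))
# Synthetic stand-in for per-excerpt ratings on one affect dimension (e.g., Activity)
y = 0.8 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.3, size=60)

model = LinearRegression().fit(X, y)
print("cue weights:", model.coef_)
# Cross-validated R^2 gives a rough idea of how much variance the cues explain
print("CV R^2:", cross_val_score(model, X, y, cv=5, scoring="r2").mean())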

Acknowledgements

This work is supported by the EU project MEGA IST-20410 and a grant from BOF 011V6701, Ghent University. The authors wish to thank J. Taelman for assistance, and an anonymous reviewer of JNMR for useful comments.

Notes

1. In estimating the frequency of terms used for music description and search by non-music experts, Kim and Belkin (2002) make a distinction between different categories: terms related to movements, neutral concepts, emotions, nature, objects, occasions or filmed events, and musical features. In a music description task, emotion-related terms are used most often, namely in 31% of the cases, while terms related to occasions or filmed events are used in 23% of the cases. In search tasks, emotion-related terms are used in 24% of the cases, while terms related to occasions or filmed events are used in 29%.

2. A number of computational models are nowadays available that extract perception-related properties from musical audio, such as onset (e.g., Klapuri, 1999; Smith, 1994), beat (e.g., Toiviainen, 2001; Large & Kolen, 1994; Scheirer, 1998; Laroche, 2003), consonance (e.g., Aures, 1985; Daniel & Weber, 1997; Leman, 2000a), pitch (e.g., Clarisse et al., 2002; De Mulder et al., 2004), harmony and tonality (e.g., Terhardt, 1974; Parncutt, 1989; Leman, 1995, 2000b), and timbre (e.g., Cosi et al., 1994; Toiviainen, 1996; De Poli & Prandoni, 1997).
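As a rough, present-day stand-in for such feature extractors (not the models cited above, nor the auditory toolbox used in this study), the following sketch pulls a few perception-related descriptors from an audio file with librosa; the file name is a placeholder.

import numpy as np
import librosa

# Placeholder path; any short audio excerpt will do
y, sr = librosa.load("excerpt.wav", sr=None, mono=True)

onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")  # onset times in seconds
tempo, _ = librosa.beat.beat_track(y=y, sr=sr)                 # global tempo estimate (BPM)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)       # timbre-related descriptor
rms = librosa.feature.rms(y=y)                                 # loudness-related descriptor

print("number of onsets:", len(onsets))
print("estimated tempo (BPM):", tempo)
print("mean spectral centroid (Hz):", float(np.mean(centroid)))
print("mean RMS energy:", float(np.mean(rms)))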

3. Apart from a preliminary study by Scheirer et al. (2000), we know of no other attempts that relate these and similar audio-extracted structural features to affect-based descriptions of music. Some researchers, however, have used artificial stimuli to study affect perception with an analysis-by-synthesis methodology. This methodology allows good control of conditions, so that very specific relationships can be revealed between structural properties of the musical audio and affect descriptions (e.g., Hevner, 1936; Rigg, 1939; Thompson & Robitaille, 1992; Juslin, 1997; Gagnon & Peretz, 2003). The method, however, does not guarantee that the resulting model will work for a large set of natural musical stimuli in application domains that are difficult to control.

4. The French rationalist philosopher René Descartes (1596 - 1650) held that the clear and distinct ideas of our intuition form the atomic entities from which all knowledge can be derived by deduction. In this article, we adopt a Cartesian methodology in the sense that all knowledge about affect descriptions of music is based on predefined atomic entities, namely auditory modules which extract particular structural features from musical audio. These auditory modules are inspired by the human auditory system, but they are largely predefined in the sense that they are based on the intuition of the programmer. Moreover, they are not learned or trained by any empirical method, although their accuracy has been tested on local problems; for example, the roughness module has been tested on data available in the psycho-acoustical literature. In a similar way, other modules have been tested using data available in the literature (see Appendix B).

5. The latter term, however, should be used with caution because the term "arousal", in this context, does not reflect psychophysical conditions but rather straightforward cognitive appraisals of perceived qualities in music. In that context, we believe it is better to use the term "activity".

6. Juslin (1997), using an analysis-by-synthesis methodology, applies multiple regression analysis to the pooled ratings of 54 subjects. The stimuli and experimental design led to high inter-rater reliabilities. The approach circumvents the problem of having to extract the acoustical features and seems to work well with high-level structural descriptors. However, its success may be due to carefully chosen examples of affective expressiveness. The strategy is less evident when natural musical stimuli from commercial recordings are used, together with a large number of subjects and automatically extracted low-level structural features, for application domains such as music information retrieval and interactive music systems.
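For illustration only (synthetic numbers, not Juslin's data or procedure), pooling ratings across raters before regression amounts to something like the following.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

ratings = rng.normal(size=(54, 60))     # 54 raters x 60 stimuli (placeholder values)
pooled = ratings.mean(axis=0)           # one pooled rating per stimulus

descriptors = rng.normal(size=(60, 5))  # 5 hypothetical high-level structural descriptors
model = LinearRegression().fit(descriptors, pooled)
print("R^2 on pooled ratings:", model.score(descriptors, pooled))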

7. For a more detailed account of the conceptual model, see Lesaffre et al. (2003) and Camurri et al. (2001).

8. The semantic differential is a contrasting pair of adjectives that can be used as a scaling instrument to quantitatively measure perceived qualities of an object. The choice of an initial set of m semantic differentials in the experimental setup induces an m-dimensional quantitative affect space.
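As an illustration of how such a space can be reduced to a few interpretable dimensions (a sketch with synthetic data, using PCA only as a stand-in for whatever dimension reduction is actually applied), consider:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# 60 excerpts rated on m = 8 bipolar adjective scales (7-point, synthetic values)
ratings = rng.integers(1, 8, size=(60, 8)).astype(float)

pca = PCA(n_components=3)
affect_space = pca.fit_transform(ratings)  # 60 excerpts x 3 affect dimensions
print("explained variance ratios:", pca.explained_variance_ratio_)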

9. The negative and positive poles of the axis have to be taken into account in the interpretation of the models that are developed in subsequent parts of this article.

10. Ambitus refers to the frequency range of an excerpt, from its lowest to its highest frequency.
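Purely as an illustration (not the extraction module used in this study), ambitus could be approximated from a fundamental-frequency track, for example with pYIN in librosa; the file name is a placeholder.

import numpy as np
import librosa

y, sr = librosa.load("excerpt.wav", sr=None, mono=True)  # placeholder path
f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                             fmax=librosa.note_to_hz("C7"), sr=sr)
f0 = f0[voiced]  # keep voiced frames only
print("ambitus: %.1f Hz to %.1f Hz" % (np.nanmin(f0), np.nanmax(f0)))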

11. Notice, however, that it would be possible to construct a low-dimensional affect space for each individual subject. We do not do this because the obtained spaces would tend to differ between subjects, which would prevent us from comparing the models we set up for each individual subject. Instead, the individual affect descriptions are projected into the inter-subjective space; in that sense, they are constrained by the inter-subjective affect semantics. This approach has the advantage that it provides a suitable inter-subjective basis for comparing individual subjects, on which a search for trends can be based. The projection is justified by the fact that the individuals can be assumed to be part of the same population as the subjects that yield the inter-subjective (Valence-Activity-Interest) affect space. We work with individual data, rather than global data, because we want to take into account the subtle differences among individual subjects.
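The projection can be sketched as follows (synthetic data; PCA is used here only as a stand-in for the actual construction of the inter-subjective space): fit the low-dimensional space on the pooled ratings, then map each individual subject's ratings into that same space rather than fitting a per-subject space.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)

pooled = rng.normal(size=(60, 8))        # pooled ratings: 60 excerpts x 8 scales (placeholder)
individual = rng.normal(size=(60, 8))    # one subject's ratings on the same scales (placeholder)

space = PCA(n_components=3).fit(pooled)  # inter-subjective affect space
coords = space.transform(individual)     # the subject's excerpts located in the shared space
print(coords.shape)                      # (60, 3)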

12. The present model, despite its limited accuracy, has been successfully used in an artistic production called The Unfolding of an Impossible Building - Short Version, a composition by René Mogensen. The performance took place during the IPEM40! festival at Vooruit, Ghent, 17 - 18 October 2003, in collaboration with dancers Domenico Giustino and Ophra Wolf, and guitarist Laura Maes. Valence and Activity parameters were extracted from musical audio in real time and used for projecting images in different colors onto the stage. The system used a combination of EyesWeb (for video processing) and Pure Data (as host for the real-time auditory model).
