Abstract
As part of research into content-based music information retrieval (MIR), this paper presents a preliminary attempt to automatically identify the language sung in popular music recordings. It is assumed that each language imposes its own constraints on the sequence of basic linguistic events produced when lyrics are sung; the acoustic structure of an individual language can therefore be characterized by statistically modelling those constraints. To this end, the proposed method employs vector clustering to convert a singing signal from a spectrum-based feature representation into a sequence of basic phonological units, whose dynamic characteristics are then analysed using bigram language models. Because the vector clustering is performed in an unsupervised manner, the resulting system requires no sophisticated linguistic knowledge and is therefore easily portable to new language sets. In addition, to eliminate interference from background music, we statistically estimate the background musical accompaniment of a song so that the vector clustering reflects only the solo singing voice within the accompanied signal.
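The pipeline summarized above — unsupervised vector clustering of frame-level features into a discrete unit sequence, followed by per-language bigram modelling and likelihood-based identification — can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the 2-D synthetic "features", the four-unit codebook, the k-means routine, and the transition matrices `trans_a`/`trans_b` are all invented for demonstration, standing in for spectral features extracted from real singing.

```python
import numpy as np

rng = np.random.default_rng(1)

# Four hypothetical "phonological units", each emitting noisy 2-D
# spectrum-like feature vectors around a distinct centre (illustrative).
CENTERS = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0], [5.0, 5.0]])

def synth(trans, n=400):
    """Random walk over unit indices with language-specific transition
    probabilities, emitting one noisy feature vector per frame."""
    states = [0]
    for _ in range(n - 1):
        states.append(rng.choice(4, p=trans[states[-1]]))
    return CENTERS[states] + rng.normal(scale=0.3, size=(n, 2))

def init_centroids(X, k):
    """Deterministic greedy farthest-point initialization."""
    cents = [X[0]]
    for _ in range(k - 1):
        d = ((X[:, None] - np.array(cents)[None]) ** 2).sum(-1).min(axis=1)
        cents.append(X[np.argmax(d)])
    return np.array(cents)

def kmeans(X, k, iters=20):
    """Unsupervised vector clustering: no phonetic labels are needed."""
    cents = init_centroids(X, k)
    for _ in range(iters):
        labels = ((X[:, None] - cents[None]) ** 2).sum(-1).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                cents[j] = X[labels == j].mean(axis=0)
    return cents

def quantize(X, cents):
    """Map each feature frame to the index of its nearest codebook unit."""
    return ((X[:, None] - cents[None]) ** 2).sum(-1).argmin(axis=1)

def bigram_model(seq, k):
    """Add-one-smoothed bigram transition probabilities over k units."""
    counts = np.ones((k, k))
    for a, b in zip(seq[:-1], seq[1:]):
        counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def log_likelihood(seq, model):
    return float(sum(np.log(model[a, b]) for a, b in zip(seq[:-1], seq[1:])))

# Two toy "languages" share the same units but favour opposite
# unit-to-unit transition patterns (forward vs. backward cycle).
trans_a = np.full((4, 4), 0.1)
trans_b = np.full((4, 4), 0.1)
for i in range(4):
    trans_a[i, (i + 1) % 4] = 0.7
    trans_b[i, (i - 1) % 4] = 0.7

X_a, X_b = synth(trans_a), synth(trans_b)
codebook = kmeans(np.vstack([X_a, X_b]), k=4)      # one shared codebook
model_a = bigram_model(quantize(X_a, codebook), 4)  # language model A
model_b = bigram_model(quantize(X_b, codebook), 4)  # language model B

X_test = synth(trans_a, n=200)                      # unseen "language A" singing
seq = quantize(X_test, codebook)
ll_a, ll_b = log_likelihood(seq, model_a), log_likelihood(seq, model_b)
print("identified language:", "A" if ll_a > ll_b else "B")
```

Note that identification rests entirely on the transition statistics of the unit sequence, mirroring the paper's premise that languages differ in the constraints governing successive linguistic events rather than in the inventory of units itself.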
Acknowledgement
This work was partially supported by the National Science Council, Taiwan, under Grants NSC93-2422-H-001-0004, NSC95-2221-E-001-034, and NSC95-2218-E-027-020.