Abstract
In a complex acoustic environment, several sound sources may simultaneously change their loudness, location, timbre, and pitch. Yet humans, like many other animals, are able to effortlessly integrate the multitude of cues arriving at their ears and to derive coherent percepts and judgments about the different attributes of each source. This facility to analyze an auditory scene is conceptually based on a multi-stage process in which sound is first analyzed in terms of relatively few perceptually significant attributes (the alphabet of auditory perception), followed by higher-level integrative processes that organize and group the extracted attributes according to specific context-sensitive rules (the syntax of auditory perception) [1]. The sound received at the two ears is processed for attributes including source location, acoustic ambience, and source attributes such as tone and pitch, timbre and intensity.
Decades of physiological and psychoacoustical studies [2,3] have revealed elegant strategies at various stages of the mammalian auditory system for the representation of the signal cues underlying auditory perception. This information has facilitated the development of biophysical models, mathematical abstractions, and computational algorithms of the early and central auditory stages, with the aim of capturing the functionality, robustness, and enormous versatility of the auditory system [4]. Numerous groups have implemented such algorithms in software and hardware, and have evaluated them by comparing their performance to human performance and against a range of robustness and flexibility requirements. Furthermore, these auditory-inspired processing strategies have been utilized in a wide range of applications, including acoustic diagnostic monitoring systems for machines and manufacturing processes, battlefield acoustic signal analysis, sound analysis and recognition systems, robust detection and recognition of multiple interacting faults, and detection and recognition of underwater transients and weak signals at low signal-to-noise ratios (SNRs) in acoustically cluttered environments [5,6].
We shall briefly review the auditory encoding of various sound attributes to illustrate the above ideas. We shall specifically focus on the percept of sound timbre: which acoustic cues are most intimately correlated with it? How are they represented at various stages of the auditory pathway? And how can the abstracted auditory signal-processing algorithms and representations be applied to measure speech intelligibility, to describe musical timbre, and to analyze complex auditory scenes?
Additional information
Notes on contributors
Shihab Shamma
Shihab Shamma received his BS degree in 1976 from Imperial College, London, UK. He received his MS and PhD degrees in Electrical Engineering from Stanford University in 1977 and 1980, respectively, and also received an MA in Slavic Languages and Literature from the same institution in 1980. He has been a member of the University of Maryland faculty since 1984, has been associated with the Systems Research Center since its inception in 1985, and received a joint appointment in 1990. He also holds a joint appointment with the University of Maryland Institute for Advanced Computer Studies, and has worked at the National Institutes of Health and Stanford University. His research interests include biological aspects of speech analysis and neural signal processing, and he has a number of publications to his credit.