Abstract
Objective assessment of intelligibility on the telephone is desirable for voice and speech assessment and rehabilitation. A total of 82 patients after partial laryngectomy read a standardized text that was recorded simultaneously with a headset and via telephone. Five experienced raters assessed intelligibility perceptually on a five-point scale. Objective evaluation was performed using support vector regression on the word accuracy (WA) and word correctness (WR) of a speech recognition system together with a set of prosodic features. WA and WR alone exhibited correlations with the human evaluation between |r| = 0.57 and |r| = 0.75. When prosodic features and WR were combined, the correlation was r = 0.79 for headset and r = 0.86 for telephone recordings. The same feature subset was optimal for both signal qualities. It consists of WR, the average duration of the silent pauses before a word, the standard deviation of the fundamental frequency over the entire sample, the standard deviation of jitter, and the ratio of the duration of the voiced sections to the duration of the entire recording.
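For readers unfamiliar with the two recognizer-based measures, the following is a minimal sketch of how WA and WR are commonly computed from a word-level alignment between the reference text and the recognizer output. It assumes the usual definitions, WR = (N - S - D) / N and WA = (N - S - D - I) / N in percent, where N is the number of reference words and S, D, and I are the numbers of substituted, deleted, and inserted words. The function names and the example word sequences are hypothetical and are not taken from the study.

```python
def align_counts(reference, hypothesis):
    """Count substitutions, deletions, and insertions in a minimum-edit
    alignment of two word lists (plain Levenshtein alignment, unit costs)."""
    n, m = len(reference), len(hypothesis)
    # cell[i][j] = (total_errors, subs, dels, ins) for reference[:i] vs hypothesis[:j]
    cell = [[None] * (m + 1) for _ in range(n + 1)]
    cell[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cell[i][0] = (i, 0, i, 0)      # only deletions
    for j in range(1, m + 1):
        cell[0][j] = (j, 0, 0, j)      # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, k = cell[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diag = (t, s, d, k)            # correct word
            else:
                diag = (t + 1, s + 1, d, k)    # substitution
            t, s, d, k = cell[i - 1][j]
            dele = (t + 1, s, d + 1, k)        # deletion
            t, s, d, k = cell[i][j - 1]
            inse = (t + 1, s, d, k + 1)        # insertion
            cell[i][j] = min(diag, dele, inse, key=lambda c: c[0])
    _, subs, dels, ins = cell[n][m]
    return subs, dels, ins


def word_scores(reference, hypothesis):
    """Return (WA, WR) in percent under the assumed standard definitions."""
    n = len(reference)
    if n == 0:
        raise ValueError("reference must contain at least one word")
    subs, dels, ins = align_counts(reference, hypothesis)
    wr = 100.0 * (n - subs - dels) / n
    wa = 100.0 * (n - subs - dels - ins) / n
    return wa, wr


# Hypothetical example: one substitution ("wind" -> "win") and one inserted "and".
reference = "the north wind and the sun".split()
hypothesis = "the north win and and the sun".split()
wa, wr = word_scores(reference, hypothesis)
print(f"WA = {wa:.1f} %, WR = {wr:.1f} %")
```

Because WA additionally penalizes inserted words, it is never larger than WR and can become negative when the recognizer inserts many words, whereas WR stays between 0 and 100 %.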
Acknowledgements
We would like to thank Maria Schuster M.D., Eva Uhl, Florian Hebel, and the speech therapists of the Department of Phoniatrics and Pediatric Audiology for obtaining the audio and perceptual evaluation data.
Declaration of interest: This work was funded by the German Cancer Aid (Deutsche Krebshilfe) under grant 107873. The responsibility for the content of this article lies with the authors. The authors report no conflicts of interest.