Abstract
Text-to-speech options on augmentative and alternative communication (AAC) devices are limited. Often, several individuals in a group setting use the same synthetic voice. This lack of customization may limit technology adoption and social integration. This paper describes our efforts to generate personalized synthesis for users with profoundly limited speech motor control. Existing voice banking and voice conversion techniques rely on recordings of clearly articulated speech from the target talker, which cannot be obtained from this population. Our VocaliD approach extracts prosodic properties from the target talker's source function and applies these features to a surrogate talker's database, generating a synthetic voice with the vocal identity of the target talker and the clarity of the surrogate talker. Promising intelligibility results suggest areas of further development for improved personalization.
Acknowledgements
We thank the speakers and listeners who participated for their time and involvement in this work.
Declaration of interest: The authors report no conflicts of interest. The authors alone are responsible for the content and writing of the article.
This research was supported in part by National Science Foundation Grants #0712821 and #1342102.
Notes
1. PSOLA represents speech as a sequence of raised cosine windowed regions of waveform that overlap. We refer to these windowed regions as epochs. For voiced segments, each epoch is centered on the estimated instant of glottal closure and is twice the fundamental period in length. For voiceless segments, the PSOLA epochs in ModelTalker are arbitrarily positioned at intervals determined by the F0 of adjacent voiced segments.
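The epoch extraction described above can be sketched as follows. This is a minimal illustration, not ModelTalker's implementation: the function name, the use of a Hann window as the raised cosine, and the assumption of pre-computed glottal closure instants (in samples) are all ours.

```python
import numpy as np

def extract_epochs(signal, gci_samples, f0_hz, fs):
    """Sketch of PSOLA-style epoch extraction (hypothetical helper).

    Each epoch is a raised-cosine (Hann) windowed region of waveform,
    centered on an estimated glottal closure instant (GCI) and two
    fundamental periods in length, so neighboring epochs overlap.
    """
    period = int(round(fs / f0_hz))   # samples per fundamental period
    epoch_len = 2 * period            # each epoch spans two periods
    window = np.hanning(epoch_len)    # raised-cosine window
    epochs = []
    for gci in gci_samples:
        start = gci - period          # center the window on the GCI
        if start < 0 or start + epoch_len > len(signal):
            continue                  # skip epochs falling off the ends
        epochs.append(signal[start:start + epoch_len] * window)
    return epochs
```

Because each epoch is two periods long and windowed to zero at its edges, overlapping epochs can be repositioned and overlap-added to modify pitch and timing, which is the core of the PSOLA technique.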
2. First differencing, rather than the IAIF procedure, was used to remove the source contribution from the surrogate talker's pitch periods. This choice reduced the amount of signal processing involved, and informal listening evaluations did not detect large perceptual differences between the two approaches.
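First differencing, y[n] = x[n] - x[n-1], acts as a crude high-pass filter that approximately cancels the glottal source's low-frequency spectral tilt. A minimal sketch (the function name is ours, and this is only an illustration of the operation, not the paper's full pipeline):

```python
import numpy as np

def first_difference(frame):
    """Crude source removal by first differencing: y[n] = x[n] - x[n-1].

    This stands in for a full glottal inverse-filtering procedure such
    as IAIF; it removes the DC component and attenuates low frequencies
    where the glottal source's energy is concentrated.
    """
    # Prepend the first sample so the output keeps the input's length.
    return np.diff(frame, prepend=frame[0])
```

A constant (DC) input maps to all zeros, and a slowly varying input is strongly attenuated, which is the sense in which the operation removes the source contribution.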