Research Article

Towards Personalized Speech Synthesis for Augmentative and Alternative Communication

Pages 226-236 | Received 10 Jul 2013, Accepted 01 Feb 2014, Published online: 15 Jul 2014
 

Abstract

Text-to-speech options on augmentative and alternative communication (AAC) devices are limited. Often, several individuals in a group setting use the same synthetic voice. This lack of customization may limit technology adoption and social integration. This paper describes our efforts to generate personalized synthesis for users with profoundly limited speech motor control. Existing voice banking and voice conversion techniques rely on recordings of clearly articulated speech from the target talker, which cannot be obtained from this population. Our VocaliD approach extracts prosodic properties from the target talker's source function and applies these features to a surrogate talker's database, generating a synthetic voice with the vocal identity of the target talker and the clarity of the surrogate talker. Promising intelligibility results suggest areas of further development for improved personalization.

Acknowledgements

We thank the speakers and listeners who participated for their time and involvement in this work.

Declaration of interest: The authors report no conflicts of interest. The authors alone are responsible for the content and writing of the article.

This research was supported in part by National Science Foundation Grants #0712821 and #1342102.

Notes

1. PSOLA represents speech as a sequence of overlapping regions of the waveform, each shaped by a raised cosine window. We refer to these windowed regions as epochs. For voiced segments, each epoch is centered on the estimated instant of glottal closure and is twice the fundamental period in length. For voiceless segments, the PSOLA epochs in ModelTalker are arbitrarily positioned at intervals determined by the F0 of adjacent voiced segments.
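
The epoch representation in Note 1 can be illustrated with a minimal sketch. It is not the ModelTalker implementation: it assumes a constant fundamental period and evenly spaced, hypothetical glottal closure marks, and the function names (`raised_cosine`, `extract_epochs`, `overlap_add`) are invented for illustration. Under those assumptions, epochs two periods long and spaced one period apart sum back to the original waveform wherever two epochs overlap.

```python
import numpy as np

def raised_cosine(length):
    # Periodic raised cosine (Hann) window; copies shifted by length/2 sum to 1.
    n = np.arange(length)
    return 0.5 - 0.5 * np.cos(2.0 * np.pi * n / length)

def extract_epochs(signal, marks, period):
    # One epoch per mark: 2 * period samples centered on the mark.
    window = raised_cosine(2 * period)
    return [signal[m - period:m + period] * window for m in marks]

def overlap_add(epochs, marks, period, length):
    # Place each epoch back at its mark and sum the overlapping regions.
    out = np.zeros(length)
    for epoch, m in zip(epochs, marks):
        out[m - period:m + period] += epoch
    return out

# Assumed toy input: a sinusoid with an 80-sample period
# (for example, 200 Hz at a 16 kHz sampling rate).
period = 80
signal = np.sin(2.0 * np.pi * np.arange(800) / period)

# Hypothetical closure instants, one per period (a real system estimates these).
marks = list(range(period, len(signal) - period + 1, period))

epochs = extract_epochs(signal, marks, period)
rebuilt = overlap_add(epochs, marks, period, len(signal))
```

Between the first and last mark every sample is covered by exactly two epochs, so `rebuilt` matches `signal` there. Changing the spacing at which epochs are overlap-added is what lets a PSOLA-style system alter pitch or duration.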

2. First differencing, rather than the IAIF procedure, was used to remove the source contribution from the surrogate talker's pitch periods. This choice reduced the amount of signal processing involved, and informal listening evaluations did not detect large perceptual differences.
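
As a rough illustration of the first differencing mentioned in Note 2, the sketch below computes y[n] = x[n] - x[n-1] over one pitch period. The function name `first_difference` and the convention of treating the sample before the period as zero are assumptions made here, not details from the article. First differencing acts as a simple high-frequency emphasis, which is commonly used to approximately offset the downward spectral tilt contributed by the glottal source.

```python
import numpy as np

def first_difference(pitch_period):
    # y[0] = x[0]; y[n] = x[n] - x[n-1] for n >= 1.
    # Assumption: the sample preceding the period is treated as zero.
    x = np.asarray(pitch_period, dtype=float)
    return np.diff(x, prepend=0.0)

# A slowly rising ramp is reduced to its constant slope.
ramp = np.array([1.0, 3.0, 5.0, 7.0])
flattened = first_difference(ramp)
```

The output has the same length as the input, so it can stand in for the original pitch period in later processing.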
