Abstract
Text-to-speech options on augmentative and alternative communication (AAC) devices are limited. Often, several individuals in a group setting use the same synthetic voice. This lack of customization may limit technology adoption and social integration. This paper describes our efforts to generate personalized synthesis for users with profoundly limited speech motor control. Existing voice banking and voice conversion techniques rely on recordings of clearly articulated speech from the target talker, which cannot be obtained from this population. Our VocaliD approach extracts prosodic properties from the target talker's source function and applies these features to a surrogate talker's database, generating a synthetic voice with the vocal identity of the target talker and the clarity of the surrogate talker. Promising intelligibility results suggest areas of further development for improved personalization.
Acknowledgements
We thank the speakers and listeners who participated for their time and involvement in this work.
Declaration of interest: The authors report no conflicts of interest. The authors alone are responsible for the content and writing of the article.
This research was supported in part by National Science Foundation Grants #0712821 and #1342102.
Notes
1. PSOLA represents speech as a sequence of raised cosine windowed regions of waveform that overlap. We refer to these windowed regions as epochs. For voiced segments, each epoch is centered on the estimated instant of glottal closure and is twice the fundamental period in length. For voiceless segments, the PSOLA epochs in ModelTalker are arbitrarily positioned at intervals determined by the F0 of adjacent voiced segments.
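The epoch extraction described above can be sketched as follows. This is a minimal illustration, not ModelTalker's implementation: the function name, the use of a Hann window as the raised cosine, and the assumption of pre-computed glottal closure instants (in samples) are all ours.

```python
import numpy as np

def extract_epochs(signal, gci_samples, f0_hz, fs):
    """Sketch of PSOLA-style epoch extraction (hypothetical helper).

    Each epoch is a raised-cosine (Hann) windowed region of waveform,
    centered on an estimated glottal closure instant (GCI) and two
    fundamental periods in length, so neighboring epochs overlap.
    """
    period = int(round(fs / f0_hz))   # samples per fundamental period
    epoch_len = 2 * period            # each epoch spans two periods
    window = np.hanning(epoch_len)    # raised-cosine window
    epochs = []
    for gci in gci_samples:
        start = gci - period          # center the window on the GCI
        if start < 0 or start + epoch_len > len(signal):
            continue                  # skip epochs falling off the ends
        epochs.append(signal[start:start + epoch_len] * window)
    return epochs
```

Because each epoch is two periods long and windowed to zero at its edges, overlapping epochs can be repositioned and overlap-added to modify pitch and timing, which is the core of the PSOLA technique.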
2. First differencing, rather than the IAIF procedure, was used to remove the source contribution from the surrogate talker's pitch periods. This choice reduced the amount of signal processing involved, and informal listening evaluations did not detect large perceptual differences between the two approaches.
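First differencing, y[n] = x[n] - x[n-1], acts as a crude high-pass filter that approximately cancels the glottal source's low-frequency spectral tilt. A minimal sketch (the function name is ours, and this is only an illustration of the operation, not the paper's full pipeline):

```python
import numpy as np

def first_difference(frame):
    """Crude source removal by first differencing: y[n] = x[n] - x[n-1].

    This stands in for a full glottal inverse-filtering procedure such
    as IAIF; it removes the DC component and attenuates low frequencies
    where the glottal source's energy is concentrated.
    """
    # Prepend the first sample so the output keeps the input's length.
    return np.diff(frame, prepend=frame[0])
```

A constant (DC) input maps to all zeros, and a slowly varying input is strongly attenuated, which is the sense in which the operation removes the source contribution.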