Automatika
Journal for Control, Measurement, Electronics, Computing and Communications
Volume 57, 2016 - Issue 1
Original scientific paper

Towards automatic cross-lingual acoustic modelling applied to HMM-based speech synthesis for under-resourced languages

Croatian title (translated): Application of automatic cross-lingual acoustic modelling to HMM-based speech synthesis for under-resourced language databases

, B.Sc., , Ph.D. & , Ph.D.
Pages 268-281 | Received 28 Oct 2014, Accepted 04 May 2015, Published online: 20 Jan 2017

