References
- D. Snyder, P. Ghahremani, D. Povey, et al., “Deep neural network-based speaker embeddings for end-to-end speaker verification,” in IEEE Spoken Language Technology Workshop (SLT), Dec 2016.
- S. O. Arik, M. Chrzanowski, A. Coates, et al., “Deep Voice: Real-time neural text-to-speech,” Feb 2017.
- S. Arik, G. Diamos, A. Gibiansky, et al., “Deep Voice 2: Multi-speaker neural text-to-speech,” in NIPS, 2017.
- T. Capes, P. Coles, A. Conkie, et al., “Siri on-device deep learning-guided unit selection text-to-speech system,” in Interspeech, 2017.
- J. Sotelo, S. Mehri, K. Kumar, et al., “Char2wav: End-to-end speech synthesis,” in ICLR workshop, 2017.
- W. Wang, S. Xu, and B. Xu, “First step towards end-to-end parametric TTS synthesis: Generating spectral parameters with neural attention,” in Proc. Interspeech, 2016, pp. 2243–2247.
- D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
- H. Valbret, E. Moulines, and J. P. Tubach, “Voice transformation using PSOLA technique,” in Proc. IEEE ICASSP, 1992, pp. 175–187.
- J. Simonin, L. Delphin-Poulat, and G. Damnati, “Gaussian density tree structure in a multi-Gaussian HMM-based speech recognition system,” in Proc. Int. Conf. Spoken Language Processing, 1998.
- M. Tamura, T. Masuko, K. Tokuda, et al., “Speaker adaptation for HMM-based speech synthesis system using MLLR,” in Proc. ESCA/COCOSDA Workshop on Speech Synthesis, Blue Mountains, Australia, 1998, pp. 273–276.
- A. van den Oord, S. Dieleman, H. Zen, et al., “WaveNet: A generative model for raw audio,” Sep 2016.
- Y. Wang, R. Skerry-Ryan, D. Stanton, et al., “Tacotron: Towards end-to-end speech synthesis,” in Interspeech, Aug 2017.
- J. Shen, R. Pang, R. J. Weiss, et al., “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” Dec 2017.
- W. Ping, K. Peng, A. Gibiansky, et al., “Deep Voice 3: Scaling text-to-speech with convolutional sequence learning,” in ICLR, 2018.
- S. Mehri, K. Kumar, I. Gulrajani, et al., “SampleRNN: An unconditional end-to-end neural audio generation model,” in ICLR, 2017.
- Y. Taigman, L. Wolf, A. Polyak, et al., “Voice synthesis for in-the-wild speakers via a phonological loop,” arXiv preprint arXiv:1707.06588, 2017.
- X. Gonzalvo, S. Tazari, C. Chan, et al., “Recent advances in Google real-time HMM-driven unit selection synthesizer,” in Interspeech, 2016.
- Y. N. Dauphin, A. Fan, M. Auli, et al., “Language modeling with gated convolutional networks,” in ICML, 2017.
- A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention is all you need,” arXiv preprint arXiv:1706.03762, 2017.
- J. Gehring, M. Auli, D. Grangier, et al., “Convolutional sequence to sequence learning,” in ICML, 2017.
- C. Raffel, M. T. Luong, P. J. Liu, et al., “Online and linear-time attention by enforcing monotonic alignments,” in ICML, 2017.
- J. Lorenzo-Trueba, F. Fang, X. Wang, et al., “Can we steal your vocal identity from the Internet: Initial investigation of cloning Obama’s voice using GAN, WaveNet and low-quality found data,” in Proc. Odyssey 2018: The Speaker and Language Recognition Workshop, 2018, pp. 240–247.
- G. K. Anumanchipalli, J. Chartier, and E. F. Chang, “Speech synthesis from neural decoding of spoken sentences,” Nature, vol. 568, no. 7753, pp. 493–498, 2019.