Invited Article

Automatic speech recognition: A primer for speech-language pathology researchers

Pages 599–609 | Received 28 Jun 2017, Accepted 28 Jul 2018, Published online: 09 Jan 2019
