Original Articles

Toward Spoken Human–Computer Tutorial Dialogues

Pages 289-323 | Published online: 15 Dec 2010
 

Abstract

Oral discourse is the primary form of human–human communication; hence, computer interfaces that communicate via unstructured spoken dialogues will presumably provide a more efficient, meaningful, and naturalistic interaction experience. Within the context of learning environments, there are theoretical positions supporting a speech facilitation hypothesis, which predicts that spoken tutorial dialogues will increase learning more than typed dialogues. We evaluated this hypothesis in an experiment where 24 participants learned computer literacy via a spoken and a typed conversation with AutoTutor, an intelligent tutoring system with conversational dialogues. The results indicated that (a) enhanced content coverage was achieved in the spoken condition; (b) learning gains for the two modalities were comparable and both greater than a no-instruction control; (c) although speech recognition errors were unrelated to learning gains, they were linked to participants' evaluations of the tutor; (d) participants adjusted their conversational styles when speaking compared to typing; (e) semantic and statistical natural language understanding approaches to comprehending learners' responses were more resilient to speech recognition errors than syntactic and symbolic-based approaches; and (f) simulated speech recognition errors had differential impacts on the fidelity of different semantic algorithms. We discuss the impact of our findings on the speech facilitation hypothesis and on human–computer interfaces that support spoken dialogues.

Notes

1. WER and WRR are standard metrics for assessing the reliability of automatic speech recognition systems. WER = [S + D + I]/N, where S, D, and I are the number of substitutions, deletions, and insertions in the automatically recognized text (with errors) when compared to the ideal text (no errors) of N words. WRR = 1 − WER.
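For illustration, a minimal sketch (not taken from the article) of how WER and WRR can be computed: the recognized transcript is aligned against the reference transcript with a standard Levenshtein alignment, whose total edit cost equals S + D + I over the N reference words. The function name and the example sentences below are hypothetical.

```python
def wer_wrr(reference: str, hypothesis: str) -> tuple[float, float]:
    """Compute WER = (S + D + I) / N and WRR = 1 - WER via word-level edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i          # i deletions
    for j in range(m + 1):
        dp[0][j] = j          # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j - 1],  # substitution
                                   dp[i - 1][j],      # deletion
                                   dp[i][j - 1])      # insertion
    wer = dp[n][m] / n if n else 0.0
    return wer, 1.0 - wer


if __name__ == "__main__":
    # Hypothetical example: a reference utterance and a noisy recognition of it
    ref = "the hard disk stores data permanently"
    hyp = "the hard disc stores data"
    wer, wrr = wer_wrr(ref, hyp)
    print(f"WER = {wer:.2f}, WRR = {wrr:.2f}")  # one substitution + one deletion over 6 words
```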

2. In this and subsequent analyses, the interaction order (spoken then typed vs. typed then spoken) was included as a between-subjects factor. However, the main effect for interaction order was never significant, nor were the two-way interactions between interaction order and other variables. Therefore, speaking first and then typing, versus typing first and then speaking, had no impact on the dependent variables.

3. p < .05 in this and subsequent analyses unless explicitly noted.

Acknowledgments. We thank our research colleagues in the Emotive Computing Group and the Tutoring Research Group at the University of Memphis (http://emotion.autotutor.org). Special thanks to Jeremiah Sullins and O'meed Entezari for their valuable contributions to this study.

Support. This research was supported by the National Science Foundation (REC 0106965, ITR 0325428, and HCC 0834847). Any opinions, findings, and conclusions or recommendations expressed in this article are those of the authors and do not necessarily reflect the views of NSF.

HCI Editorial Record. Received September 8, 2008. Revision received June 17, 2009. Accepted by John Anderson. Final manuscript received December 11, 2009. — Editor
