ABSTRACT
Background
While many measures exist for assessing discourse in aphasia, manual transcription, editing, and scoring are prohibitively labor intensive, a major obstacle to their widespread use by clinicians (Bryant et al. 2017; Cruice et al. 2020). Many tools also lack rigorous psychometric evidence of reliability and validity (Azios et al. 2022; Carragher et al. 2023). Establishing test reliability is the first step in our long-term goal of automating the Brief Assessment of Transactional Success in aphasia (BATS; Kurland et al. 2021) and making it accessible to clinicians and clinical researchers.
Aims
We evaluated multiple aspects of test reliability of the BATS by examining correlations between human/machine and human/human interrater edited transcripts, raw vs. edited transcripts, interrater scoring of main concepts, and test-retest performance. We hypothesized that automated methods of transcription and discourse analysis would demonstrate sufficient reliability to move forward with test development.
Methods & Procedures
We examined 576 story retelling narratives from a sample of 24 persons with aphasia (PWA) and familiar and unfamiliar conversation partners (CP). PWA retold stories immediately after watching/listening to short video/audio clips. CP retold stories after six-minute topic-constrained conversations with a PWA in which the dyad co-constructed the stories. We used two macrostructural measures to analyze the automated speech-to-text transcripts of story retells: 1) a modified version of a semi-automated tool for measuring main concepts (mainConcept: Cavanaugh et al. 2021); and 2) an automated natural language processing “pipeline” to assess topic similarity.
Outcomes & Results
Correlations between raw and edited scores were excellent, and interrater reliability on transcripts and main concept scoring was acceptable. Test-retest reliability on repeated stimuli was also acceptable, especially for aphasic story retellings, where there were actual within-subject repeated stimuli.
Conclusions
Results suggest that automated speech-to-text was sufficient in most cases to avoid the time-consuming, labor-intensive step of transcribing and editing discourse. Overall, our results suggest that automated natural language processing methods such as text vectorization and cosine similarity are a fast, efficient way to obtain a measure of topic similarity between two discourse samples. Although test-retest reliability for the semi-automated mainConcept method was generally higher than for automated methods of measuring topic similarity, we found no evidence of a difference between machine-automated and human-reliant scoring.
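To make the idea of text vectorization and cosine similarity concrete, the following is a minimal sketch and not the BATS pipeline itself. It assumes a simple bag-of-words vectorization (raw token counts) and a basic regular-expression tokenizer; the function names `vectorize`, `cosine_similarity`, and `topic_similarity` are hypothetical and chosen only for illustration. The actual pipeline may use a different vectorization scheme and additional preprocessing.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn a transcript into a bag-of-words vector (token -> count).

    Assumption: lowercase the text and keep runs of letters/apostrophes.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse count vectors.

    Returns 0.0 when either vector is empty, since the angle is undefined.
    """
    shared = vec_a.keys() & vec_b.keys()
    dot = sum(vec_a[t] * vec_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def topic_similarity(sample_1, sample_2):
    """Similarity between two discourse samples, from 0.0 to 1.0."""
    return cosine_similarity(vectorize(sample_1), vectorize(sample_2))
```

Under these assumptions, two retellings that use the same words in the same proportions score 1.0, and two retellings with no words in common score 0.0. Raw counts weight frequent function words heavily, so a practical pipeline would likely add steps such as stop-word removal or term weighting.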
Acknowledgments
Research reported in this publication was entirely supported by the National Institute on Deafness and Other Communication Disorders of the National Institutes of Health under award number R21DC020265. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. We would like to thank research assistants (Caroline Pare and Mia Tittmann) and all our study participants.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Supplementary material
Supplemental data for this article can be accessed online at https://doi.org/10.1080/02687038.2024.2351029