
Evaluating Fluency in Aphasia: Fluency Scales, Trichotomous Judgements, or Machine Learning

Pages 168-180 | Received 02 Aug 2022, Accepted 18 Jan 2023, Published online: 06 Feb 2023

ABSTRACT

Background

Speech-language pathologists (SLPs) and other clinicians often use aphasia batteries, such as the Western Aphasia Battery-Revised (WAB-R), to evaluate both severity and classification of aphasia. However, the fluency scale on the WAB-R is not entirely objective and has been found to have less than ideal inter-rater reliability, due to variability in weighing the importance of one dimension (e.g. articulatory effort or grammaticality) over another. This limitation has implications for aphasia classification. The subjectivity might be mitigated through the implementation of machine learning to identify fluent and non-fluent speech.

Aims

We hypothesized that two models, consisting of convolutional and recurrent neural networks, could be used to identify fluent and non-fluent aphasia as judged by SLPs, with greater reliability than use of the WAB-R fluency scale.

Methods & Procedures

The training and testing dataset for the networks was collected from the public domain, and the validation dataset was collected from participants in post-stroke aphasia studies. We used Kappa scores to evaluate inter-rater reliability among SLPs, and between the networks and SLPs.

Outcome and Results

Using public domain samples, the model for detecting non-fluent aphasia achieved high accuracy on the training dataset after 10 epochs (i.e., 10 passes of the algorithm through the entire dataset) and 81% accuracy on the testing dataset. The model for detecting fluent speech had high training accuracy and 83% testing accuracy. Across samples, using the WAB-R fluency scale, there was poor to perfect agreement among SLPs on the precise WAB-R fluency score, but substantial agreement on non-fluent (score of 0-4) versus fluent (score of 5-9). The agreement between the models and the SLPs was moderate for identifying non-fluent speech and substantial for identifying fluent speech. When SLPs were asked to identify each sample as fluent, non-fluent, or mixed (without using the fluency scale), the agreement between SLPs was almost perfect (Kappa 0.94). The agreement between the SLPs’ trichotomous judgement and the models was fair for detecting non-fluent speech and substantial for detecting fluent speech.

Conclusions

Results indicate that neither the WAB-R fluency scale nor the machine learning algorithms were as useful (reliable and valid) as a simple trichotomous judgement of fluent, non-fluent, or mixed by SLPs. These results, together with data from the literature, indicate that it is time to re-consider use of the WAB-R fluency scale for classification of aphasia. It is also premature, at present, to rely on machine learning to rate spoken language fluency.

Introduction

Several aphasia batteries have been designed to evaluate the severity of aphasia and to classify the aphasia into “classical” vascular syndromes (Geschwind, Citation1965; Marsh & Hillis, Citation2012). The distinction between fluent and non-fluent aphasia may be especially important, because fluent aphasia is usually associated with lesions posterior to the central sulcus, and non-fluent aphasia is usually associated with lesions anterior to the central sulcus, at least acutely (before recovery or reorganization of structure-function relationships; Ferro & Madureira, Citation1997). Studies of chronic stroke patients (e.g. Mirman et al., Citation2019) often do not confirm this relationship, likely because patients come to rely on a reorganized network of areas to support language. Furthermore, there is not a single, widely accepted definition of fluency. The Merriam-Webster Dictionary defines fluency as “the quality or state of being fluent.” With respect to spoken language, it further defines fluent as: “a) capable of using a language easily and accurately; and b) effortlessly smooth and flowing.” Thus, when we use the term fluency, our working definition is the quality of effortlessly producing smooth and flowing speech. However, we recognize that this definition does not capture all aspects of what distinguishes the speech of patients with acute anterior or posterior left hemisphere lesions.

The usefulness of aphasia classification, especially for research, has been controversial (see Brookshire, Citation1983 vs Caramazza, Citation1984), but clinical use of these classifications has persisted across decades. The clinical syndromes generally reflect the vascular territory involved, at least acutely after stroke (Ochfeld et al., Citation2010). However, many patients might not fit neatly into one syndrome; it has been claimed that up to 60% of people with aphasia might be “unclassifiable” using the Boston Diagnostic Aphasia Examination (Goodglass et al., Citation2001). The Western Aphasia Battery-Revised (WAB-R; Kertesz, Citation2006) is another commonly used test, which yields a single classification (using the Boston, or “classical”, aphasia syndrome labels) in all cases. However, classification depends on a unidimensional scale that requires raters to select a number from zero to nine for fluency, even though each score reflects multiple dimensions of speech that often vary in opposite directions. Further, the fluency scale has been reported to have relatively low inter-rater reliability across experienced SLPs, because of variability in weighing the importance of the different factors included on the fluency scale (Hula et al., Citation2010; Clough & Gordon, Citation2018; Clough & Gordon, Citation2020; Trupe, Citation1984). For example, some raters may place more importance on speech rate, sentence length, or pauses in speech, while others may place more importance on grammatical accuracy (see Hula et al. for an in-depth discussion of this and other psychometric limitations of the WAB). The subjectivity and less than ideal reliability of the scale suggest the need for an objective and reliable measure that can accurately differentiate between fluent and non-fluent aphasia in an individual, if one accepts that fluency is a valid construct in characterizing aphasia (which is a debatable issue). The implementation of machine learning mitigates the test-retest and inter-rater unreliability and subjectivity of rating fluency, which stem from raters placing more importance on different factors in judging fluency, although it does not address the issue of validity. Machine learning algorithms do not assign equal weighting to these factors, but instead determine the weight of each characteristic they detect that yields the best distinction between fluent and non-fluent speech. The variables used by the algorithms are not necessarily the same variables described in the WAB-R fluency scale. With the advent of new technologies, the potential applications for machine learning in the process of evaluation have increased. While machine learning has been applied to diagnosis in many diseases and disorders, it has only relatively recently been used in the classification or evaluation of aphasia (Järvelin & Juhola, Citation2011; Landrigan et al., Citation2021; Mahmoud et al., Citation2021; Matias-Guiu et al., Citation2019). We aimed to determine whether machine learning algorithms could differentiate between fluent and non-fluent aphasia with greater reliability than the WAB-R fluency scale, or with greater reliability than SLPs’ trichotomous judgement of fluent, non-fluent, or mixed made without relying on a scale.

Methodology

Machine Learning Algorithms

We implemented machine learning to develop models for detecting fluent and non-fluent speech, using Python, a widely used programming language for research. We used two different classes of models: one convolutional neural network to detect non-fluent (vs. normal or fluent) speech and one recurrent neural network to detect fluent (vs. normal or non-fluent) speech. Convolutional neural networks (CNN) are designed to analyze images with a spatial grid-like structure. Recurrent neural networks (RNN) are designed for sequential inputs, using previous outputs as inputs; they are used for tasks like speech recognition and handwriting recognition. Use of CNNs for image classification and use of RNNs for speech and language tasks is standard in machine learning. Both types of neural networks were implemented because each was most effective for one of the two tasks of objectively detecting non-fluent and fluent speech, respectively. That is, non-fluent aphasia can be differentiated by the manner in which people speak, often characterized by the variables considered in “fluency” ratings, including pauses, slowness, articulatory distortions, and impaired prosody (variations in rhythm, loudness, and pitch). These variations are captured by spectrograms; therefore, CNNs are effective in detecting non-fluent aphasia. However, individuals with fluent aphasia have a normal manner of speech (as described above), but the content fails to convey the intended meaning. Hence, a CNN is not ideal for capturing non-meaningful speech, since it cannot use the manner of speaking evident from the spectrogram to detect it. Thus, we used an RNN, because RNNs are frequently used for such language processing tasks as predicting the next word in a sentence. For the classification process, each video in the dataset was fed to both the CNN and the RNN, and we aggregated these two outputs to provide a final classification, as sketched below.
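
As a concrete illustration, the aggregation step can be sketched as follows (a minimal sketch in Python; the function and variable names are ours, but the combination rule follows the descriptions given here and in the Statistical Analysis section: both models firing yields “mixed”, neither firing yields “non-aphasic”).

    def combine_outputs(cnn_detects_nonfluent, rnn_detects_fluent):
        # Combine the two binary model outputs into a final classification.
        # "Mixed" when both models fire; "non-aphasic" when neither does.
        if cnn_detects_nonfluent and rnn_detects_fluent:
            return "mixed"
        if cnn_detects_nonfluent:
            return "non-fluent"
        if rnn_detects_fluent:
            return "fluent"
        return "non-aphasic"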

Speech samples

The data used for creating the algorithms were qualitative: videos of people with aphasia (fluent and non-fluent) and of speakers without aphasia. The videos used for training and testing the models were collected from YouTube and are thus considered secondary data. The videos collected from YouTube consisted of individuals with aphasia answering rudimentary questions from an SLP or describing certain images. The videos were posted by speech-language pathologists or neurologists, but we had our SLPs verify them as “fluent” or “non-fluent” (using the modal classification). The videos included a wide range of severity (from mild to severe aphasia). For videos in which multiple people spoke, we preprocessed the videos so that only the clips in which the patient was speaking remained. Each video was split into 3-second segments for the algorithm to analyze; fragmenting each video into segments allowed the models to analyze more data. For non-fluent aphasia samples (labelled on YouTube as Broca’s aphasia, global aphasia, or transcortical motor aphasia), there were 45 clips and 175 segments. For fluent aphasia samples (labelled on YouTube as Wernicke’s aphasia, transcortical sensory aphasia, or anomic aphasia), there were 71 clips and 192 segments. The videos showed 12 men and 10 women speakers, ranging from young adult to elderly (specific ages generally not provided). For speakers without aphasia, there were 95 clips and 1240 segments, including men and women from young adult to elderly, although we do not have information on the age of the participants or any independent evidence that they had no neurological or psychological disease. They were unanimously rated by our SLPs as non-aphasic. To differentiate non-fluent aphasia from normal speech using the CNN, each input video was split into 3-second segments, and each segment was converted to a spectrogram. The model interpreted the spectrograms and provided an output for the diagnosis based on the majority vote across all the spectrograms. To differentiate fluent aphasia from normal speech using the RNN, we converted the videos from speech to text, and each word was labelled with its word type (e.g., noun, pronoun, adjective) using a pre-trained PerceptronTagger (https://explosion.ai/blog/part-of-speech-pos-tagger-in-python). The word types were converted to sequences of “0s” and “1s” using one-hot encoding, which the model used to provide a classification.
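
The segmentation step can be sketched as follows (a minimal sketch; the paper states only that each video was split into 3-second segments, so the use of librosa and the 16 kHz sampling rate are our assumptions).

    import librosa

    def split_into_segments(audio_path, segment_seconds=3.0, sr=16000):
        # Load the clip's audio and return consecutive, non-overlapping
        # 3-second segments for the models to analyze.
        signal, sr = librosa.load(audio_path, sr=sr)
        step = int(segment_seconds * sr)
        return [signal[i:i + step] for i in range(0, len(signal) - step + 1, step)]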

For the validation dataset, we used videos from patients with post-stroke aphasia who participated in longitudinal or treatment studies at Johns Hopkins University School of Medicine. Each sample from patients with aphasia was elicited by asking the individual to describe the Cookie Theft picture from the Boston Diagnostic Aphasia Examination (also used in the National Institutes of Health Stroke Scale; Lyden et al., Citation1999). The following instructions were used: “Describe everything you see happening in this picture. Try to speak in sentences.” We used these videos to compare detection of fluent and non-fluent speech by the algorithms to the detection of fluent and non-fluent speech by seven certified SLPs (using the WAB-R fluency scale or a trichotomous judgement of fluent, non-fluent, or mixed). The SLPs had an average of 10 (range 3-40) years of clinical experience following completion of a Master’s degree or doctoral degree, and all worked at Johns Hopkins University School of Medicine.

The CNN Process

Convolutional Neural Networks (CNN) work best with images. Their basic components consist of three types of layers: convolutional, pooling, and fully-connected layers. A convolutional layer uses a filter, a square matrix that is smaller than the image’s dimensions. Starting from the top left corner of the image, it takes a subimage of the same dimensions as the filter and computes an element-wise product between this subimage and the filter. For example, for two 3x3 matrices, the top left element of one matrix is multiplied by the top left element of the other, and this is repeated for each of the nine positions; all of these products are then summed together. Thus, by applying the filter at all possible locations on the image, we can generate a new matrix of numbers containing all of these sums. After creating this new matrix, a pooling layer is typically used to reduce the image’s size while preserving pertinent information. One common method is MaxPooling, which replaces each subimage with its maximum value. Generally, we set the stride (the vertical and horizontal shift of the filter on the image) so that there are no overlapping pixels between any of the subimages. We can handle mismatched dimensions (e.g., a 27x27 image with a 4x4 kernel) by padding zeros at the edges of the image until the dimensions match. A BatchNormalization layer makes training more stable and faster by normalizing the outputs. Finally, a fully-connected layer is a standard neural network layer, containing neurons and connections between nodes. For our model, we used a Convolution2D layer of 16 filters with a stride length of 3 and Rectified Linear Unit (ReLU) activation. ReLU activation introduces non-linearity to a deep learning model and is among the most commonly used activation functions in deep learning. This was followed by a BatchNormalization layer and a MaxPooling layer. We repeated this sequence of three layers two more times, except that the Convolution2D layers had 32 and 64 filters, respectively. We then flattened the output, converting it from a multidimensional output into a single-dimensional output (for instance, a 2x2 matrix would be flattened into a 1x4). This was followed by a fully-connected layer of 128 neurons with a ReLU activation function and a BatchNormalization layer. A DropOut of 0.5 was used, meaning 50% of the outputs of the previous layer were randomly set to zero to help prevent overfitting of the weights. Finally, the last fully-connected layer had two neurons (non-fluent aphasia or not non-fluent aphasia) and a softmax activation function.
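
The layer sequence described above can be sketched in Keras roughly as follows (a sketch only: Keras is assumed, and the kernel size and spectrogram input shape are placeholders not specified in the paper).

    from tensorflow.keras import layers, models

    def build_cnn(input_shape=(256, 256, 1)):
        # Three Conv2D-BatchNormalization-MaxPooling blocks with 16, 32, and
        # 64 filters, followed by Flatten, Dense(128), Dropout(0.5), and a
        # two-neuron softmax output (non-fluent aphasia vs. not).
        model = models.Sequential()
        model.add(layers.Input(shape=input_shape))
        for n_filters in (16, 32, 64):
            model.add(layers.Conv2D(n_filters, kernel_size=3, strides=3,
                                    padding="same", activation="relu"))
            model.add(layers.BatchNormalization())
            model.add(layers.MaxPooling2D())
        model.add(layers.Flatten())
        model.add(layers.Dense(128, activation="relu"))
        model.add(layers.BatchNormalization())
        model.add(layers.Dropout(0.5))     # randomly zero 50% of activations
        model.add(layers.Dense(2, activation="softmax"))
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model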

Hence, the audio was converted to image form to be in a usable format for CNNs: each segment was converted to a spectrogram, in which the x-axis represents frequency on the mel scale, the y-axis represents time, and the color represents the loudness of the audio. The fluctuations of loudness across the various frequencies in the graph capture various features of fluency, such as syllable stress and speaking rate. Fourier Transforms decompose a signal into its constituent frequencies and the amplitudes of those frequencies. Using a graph of amplitude with respect to time, a new graph can be generated in polar coordinates: the radius is the graph’s amplitude at a certain time, and the theta value is that time. Essentially, a vector moves across the amplitude graph and the height of that vector is graphed with respect to time in polar coordinates. The shape of the polar graph depends on the speed at which the vector moves, because that changes the theta value. The key connection is that when the speed at which the vector moves (in cycles of the polar graph per second) aligns with one of the frequencies composing the amplitude graph (in beats per second), the center of the polar graph shifts away from the pole. Because the center moves off the pole only when the vector speed equals one of the frequencies in the amplitude graph, the various frequencies of the amplitude graph, in addition to their amplitudes, can be decomposed. By repeating this process on all the audio segments, we created a spectrogram to visualize how the amplitude at each frequency changes over time: time on the y-axis, a logarithmic (mel) frequency scale on the x-axis, and color to represent intensity. However, these spectrograms could differ in size depending on the duration of the video, and CNNs cannot use variable input sizes. To address this, we found the shortest audio length (about 3 seconds) and extracted as many segments of this length as possible from each audio clip. We used a voting algorithm, in which the clip-level output is the majority classification (non-fluent or not) of the model across all the segments.
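
A minimal sketch of the spectrogram conversion and the segment-level voting (librosa is assumed for the mel-scaled spectrogram; the paper does not name the audio library, and the parameter values are illustrative):

    import numpy as np
    import librosa

    def segment_to_spectrogram(segment, sr=16000):
        # Convert a 3-second audio segment into a log-scaled mel spectrogram
        # "image" whose intensity encodes loudness, for input to the CNN.
        mel = librosa.feature.melspectrogram(y=segment, sr=sr, n_mels=128)
        log_mel = librosa.power_to_db(mel, ref=np.max)
        return log_mel[..., np.newaxis]    # add a channel axis for the CNN

    def majority_vote(segment_predictions):
        # Clip-level output is the majority classification across segments.
        return max(set(segment_predictions), key=segment_predictions.count)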

The RNN Process

RNNs are frequently used for language and speech tasks in machine learning, such as predicting the next word in a sentence given a preceding phrase. In our case, we used a subtype of RNNs known as Long Short-Term Memory networks (LSTMs). They use a special structure enabling them to “remember” previous inputs. This structure includes the input, the cell state (essentially the memory), and the hidden state (analogous to the hidden layers of a standard feed-forward neural network). The input and hidden state first pass through the “forget gate”, which uses a sigmoid activation function to determine what previous and current information should be removed or kept. The cell state multiplies the corresponding information by a value between 0 and 1, depending on whether it should be kept or not. Next, the input and hidden state pass through the input gate: here, the current input is multiplied by both the hidden state and a weight matrix from the previous step. The updated cell state becomes the current cell state and is used for the next step. Finally, the “output gate” uses the hidden state to determine the output of the LSTM model. For our model, we used a bidirectional LSTM, which is similar to the previously described LSTM model except that it has one LSTM that passes inputs forward and one LSTM that passes inputs backward. This is followed by another LSTM layer and two dense layers: one containing 64 “neurons” with a ReLU activation and another containing 2 “neurons” with a softmax activation function. The final dense layer, with its two “neurons”, indicates fluent aphasia or not fluent aphasia. This is a standard structure in machine learning.
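
The layer sequence for the recurrent model can be sketched as follows (Keras assumed; the widths of the LSTM layers are placeholders, as the paper gives only the sizes of the two dense layers).

    from tensorflow.keras import layers, models

    def build_rnn(max_len, n_word_types):
        # Bidirectional LSTM, a second LSTM layer, then Dense(64, ReLU) and a
        # two-neuron softmax output (fluent aphasia vs. not fluent aphasia).
        model = models.Sequential()
        model.add(layers.Input(shape=(max_len, n_word_types)))  # one-hot word-type vectors
        model.add(layers.Bidirectional(layers.LSTM(64, return_sequences=True)))
        model.add(layers.LSTM(64))
        model.add(layers.Dense(64, activation="relu"))
        model.add(layers.Dense(2, activation="softmax"))
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model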

In the RNN process, the audio was converted from speech to text using an acoustic model and a language model. However, some features had to be extracted before using these models. One set of features was mel-frequency cepstral coefficients (MFCCs), numbers that essentially summarize the spectrogram. In addition to MFCCs, we used i-vectors, which describe characteristics of the recording in addition to speaker properties. We then fed the MFCCs and i-vectors into a model called Kaldi’s NNet3. The first layer, or the input layer, is a standard feed-forward layer, so all the nodes in one layer connect to all the nodes of the subsequent layer. The next layers are essentially an RNN model consisting of two hidden layers and an output layer. The model uses a slightly more sophisticated algorithm that saves computation time by eliminating some backtracking to previous layers. Finally, the model converts the neural network’s output into text and also produces other candidate transcriptions of the audio. We implemented an algorithm that uses the text from the other converted segments of the same clip and determines the similarity of each candidate to that text. We first converted each sentence to a vector using the Python library sklearn and its TfidfVectorizer method. We then computed a metric called cosine similarity between the text transcribed from the other segments and each candidate. Choosing the candidate with the highest similarity to the other words spoken in the clip improved the accuracy of the model.
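
The rescoring step can be sketched as follows (sklearn’s TfidfVectorizer and cosine similarity are named in the text; the function and variable names are ours).

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def pick_best_hypothesis(candidates, other_segments_text):
        # Choose the candidate transcription most similar (by TF-IDF cosine
        # similarity) to the text already recognised from the clip's other segments.
        vectorizer = TfidfVectorizer()
        matrix = vectorizer.fit_transform(candidates + [other_segments_text])
        similarities = cosine_similarity(matrix[:-1], matrix[-1]).ravel()
        return candidates[similarities.argmax()]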

After the speech was converted to text, the NLTK library was used to tag each word with its word type (e.g., noun, adjective). NLTK uses a pre-trained PerceptronTagger (with an accuracy of about 94%): it has weights for each input feature, and these weights are adjusted during the training phase based on whether the predicted tag was correct or not. It is less accurate for phonemic paraphasias or ambiguous words than for standard words. Predictions are made using a combination of the weights and the features. One-hot encoding (a process of converting categorical variables into series of zeroes and ones) was used to convert these word-type tags into a more usable form: a vector of zeroes and ones in which each column represents a different word type, the column representing the actual word type is one, and every other column is zero. We repeated this for every word in the sentence, creating a sequence of vectors composed of zeroes and ones. However, the spoken sentences were of different lengths, which was an issue because each sequence then contained a different number of vectors. As a result, we found the maximum sentence length and added all-zero vectors, which indicated silence, to the end of every sequence shorter than the maximum until it matched that length; this technique is known as zero padding in machine learning. We fed these sequences from each video to the RNN for training.
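
A minimal sketch of the tagging, one-hot encoding, and zero-padding steps (NLTK is named in the text; the tag-to-column mapping and other bookkeeping details are our assumptions).

    import numpy as np
    import nltk   # requires the 'punkt' and 'averaged_perceptron_tagger' data

    def encode_sentence(sentence, tag_index, max_len):
        # Tag each word with its word type, one-hot encode the tags, and
        # zero-pad the sequence to the length of the longest sentence.
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))   # e.g. [('cookie', 'NN'), ...]
        encoded = np.zeros((max_len, len(tag_index)))          # all-zero rows act as padding
        for i, (_, tag) in enumerate(tagged[:max_len]):
            if tag in tag_index:
                encoded[i, tag_index[tag]] = 1.0
        return encoded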

SLP Ratings

A group of 7 certified SLPs specializing in neurogenic communication disorders were provided access to the audiovisual samples on a secure server. All were very experienced in using the WAB-R. They were asked to first watch and listen to each video and score fluency using the WAB-R fluency scale in the same way they would use it if they were evaluating the person in clinic. Two weeks after submitting their scores independently, the SLPs were asked to watch and listen to the videos again, and to score each sample as “fluent” (“like in Wernicke’s, anomic, or transcortical sensory aphasia”), “non-fluent” (“like in Broca’s, global, or transcortical motor aphasia”), or “mixed” (only if the sample included elements of both “fluent” and “non-fluent” defined in this way). They were not given any other definition or characteristics of fluency. They submitted their classifications independently.

Statistical Analysis

We first evaluated agreement between the 7 certified SLPs in rating the 11 validation samples using Kappa scores: once using the entire WAB-R fluency scale to rate each sample, then using that scale to classify each sample as non-fluent (0-4) or fluent (5-10), and then using trichotomous SLP judgements of fluent, non-fluent, or mixed. Then, we evaluated the agreement between the SLP determination of fluent (5-9) and the algorithm determination of fluent. The SLP determination of fluent was scored as yes if any of the SLPs scored the sample as fluent (5-9). Likewise, we evaluated the SLP determination of non-fluent and the algorithm determination of non-fluent. The SLP determination of non-fluent was scored as yes if any of the SLPs scored the sample as non-fluent (0-4). We used this method, rather than an average score across SLPs, because (1) the mean score for all SLPs was often not equal to any SLP score, and (2) given the multidimensional nature of the scale, a judgement of fluent or non-fluent may be justifiable (correct) depending on which characteristic of fluency is weighed most strongly (see Discussion).

To compare the algorithms to the trichotomous SLP scores, the SLP score was considered “yes” for non-fluent if any SLP scored the sample as non-fluent or mixed (only one sample was scored as mixed), and considered “yes” for fluent if any SLP scored the sample as fluent or mixed. The model detection of “mixed” occurred when the RNN algorithm detected fluent speech and the CNN algorithm detected non-fluent speech in the same sample. All statistical analyses were run in STATA (version 15.1).
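
Although the analyses were run in STATA, the agreement statistics can be illustrated in Python as follows (a sketch only; Fleiss’ kappa for the multi-rater comparison and Cohen’s kappa for the SLP-versus-algorithm comparison are reasonable choices, but the exact STATA procedures are not restated here).

    from sklearn.metrics import cohen_kappa_score
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    def multi_rater_kappa(ratings):
        # ratings: samples x raters array of categorical codes (e.g. 11 x 7).
        counts, _ = aggregate_raters(ratings)   # samples x categories count table
        return fleiss_kappa(counts)

    def slp_vs_model_kappa(slp_determination, model_determination):
        # Per-sample yes/no determinations (e.g. non-fluent vs. not non-fluent).
        return cohen_kappa_score(slp_determination, model_determination)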

Results

Model results: training set

We used a train-test sequence on our dataset. The training dataset, consisting of 75 segments from non-aphasic speakers, 55 segments from speakers with Wernicke’s aphasia, and 34 segments from speakers with Broca’s aphasia, was used to train the CNN and RNN. The CNN achieved 91% accuracy, and the RNN achieved 87% accuracy on the training data. The testing dataset, which consisted of segments from 20 speakers without aphasia, 16 speakers classified as having fluent aphasia, and 11 speakers classified as having non-fluent aphasia, was used to evaluate the accuracy of the models. The CNN achieved 81% accuracy (Table 1), and the RNN achieved 82.5% accuracy on the testing dataset (Table 2).
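
The train-test sequence can be sketched as follows (using the build_cnn sketch above; the variable names are placeholders for the spectrogram segments and their labels, and the 10 epochs are taken from the Abstract).

    # Train on the training segments for 10 epochs, then evaluate on the
    # held-out testing segments.
    cnn = build_cnn()
    cnn.fit(train_spectrograms, train_labels, epochs=10)
    test_loss, test_accuracy = cnn.evaluate(test_spectrograms, test_labels)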

Table 1. Results of CNN in testing dataset.

Table 2. Results of RNN for testing dataset.

Agreement between SLPs (validation set)

Consistent with previous studies (Clough & Gordon, Citation2018, Citation2020; Trupe, Citation1984), there was not always good agreement between the fluency scores (0-9) assigned by the seven certified SLPs when using the WAB fluency scale. The Kappa score for the seven raters ranged from poor (<0, not significantly different from chance) to perfect (1.0) for the 11 samples, although the combined Kappa was 0.49 (moderate). There was perfect agreement among the seven raters for three samples: two with a unanimous fluency score of 0 and one with a unanimous score of 9. There was substantial agreement (Kappa = 0.83) for the dichotomous variable of fluent (score of 5-10) or non-fluent (score of 0-4); for eight of the samples there was perfect agreement on fluent or non-fluent. However, for three (27%) of the samples there was disagreement on fluent or non-fluent based on the score. For these three samples, the majority score (mode) was 3, 5, or 6.

There was almost perfect agreement (Kappa = 0.94) between the certified SLPs when asked to rate each sample as fluent (“like Wernicke’s, transcortical sensory, or anomic”), non-fluent (“like Broca’s, transcortical motor, or global aphasia”), or mixed.

Agreement between SLPs and algorithms (validation set)

Agreement between the SLP determination of non-fluent using the WAB-R fluency score and the algorithm determination of non-fluent was 72.7% (Kappa = 0.46), which suggests moderate agreement. The agreement between the SLP determination of fluent using the WAB-R fluency scale and the algorithm determination of fluent was 81.8% (Kappa = 0.52), which suggests substantial agreement.

Agreement between the SLP determination of non-fluent (trichotomous judgement) and the algorithm determination of non-fluent was 63.6%, with a Kappa score of 0.24, or fair agreement. Agreement between the SLP determination of fluent (trichotomous judgement) and the algorithm determination of fluent was 81.8%, with a Kappa score of 0.65, or substantial agreement.

There was one sample that was rated by all SLPs as fluent (9) but was not identified by the algorithm as fluent; rather, it was identified as non-fluent. The transcription of this sample (A) is provided below. The person spoke slowly, with some pauses, but in complete, grammatical sentences.

Sample A. Scored as Fluent by SLPs but Non-fluent by Algorithm

“There’s a child gettin some cookies out of a cookie jar and he’s bein assisted by another child. Their um look like their um chair’s tiltin over. Um next to them is uh um a grown woman uh doin some dishes. Look like dishes splashed on the w-floor. Um there’s a a scene from outside the house. Hmm. Spillin water in the kitchen floor. Too much it it it’s not too much detail in the … kitchen it’s just a woman dishes- doin dishes dryin dishes whatever. That’s it.”

Two samples were scored as both fluent (by the RNN) and non-fluent (by the CNN) by the algorithms. For one of these samples, the SLPs’ ratings were also mixed, with scores of 1, 3, 4, or 6. The transcription of this sample (B) is provided below.

Sample B. Scored Variably as Fluent or Non-fluent by SLPs and as Both Fluent and Non-fluent by the Algorithms.

“Okay um let’s see pre [unintelligible 2 syllables] um pre pull senten-sees um i can’t try. Tell me what I see? Um let’s see, see um xxx Egypt quadraplegic I don’t know what I see.”

The other sample that the algorithms detected as both fluent (by the RNN) and non-fluent (by the CNN) was scored as fluent by all of the SLPs (Sample C).

Sample C. Scored as Fluent by All SLPs and as Both Fluent and Non-fluent by the Algorithms

“Alright, this is a guy pushing on a stool. I can’t tell if he is a doctor or not, not a doctor, her. I can’t tell if he is a doctor or not. He’s not in a doctor. She looks like she’s looking up to … she’s splink, spink … I don’t know.. all right how does she? I guess she is just walking that up … walking this down. She is standing in the sink. I guess that is it.”

There was one sample that was identified as fluent by all SLPs using the WAB-R fluency score, and as normal (score of 10) by most of the SLPs, and that was not identified by the algorithms as either fluent or non-fluent (i.e., it was identified as non-aphasic), indicating good agreement on relatively normal samples.

Discussion

Here we show that SLPs are able to detect fluent or non-fluent speech, if simply asked to identify the sample (from picture description) as fluent (“like Wernicke’s, transcortical sensory, or anomic”) or non-fluent (“like Broca’s, transcortical motor, or global aphasia”) or mixed, with almost perfect agreement. However, when asked to use the WAB fluency scale to rate picture descriptions (of the Cookie Theft picture from the BDAE), there was relatively low agreement.

One can understand how some samples could be given variable scores on the WAB fluency scale (some suggesting non-fluent, and some suggesting fluent, aphasia). For example, consider Sample B: “Okay um let’s see pre [unintelligible 2 syllables] um pre pull senten-sees um i can’t try. Tell me what I see? Um let’s see, see um [unintelligible syllable] Egypt quadriplegic. I don’t know what I see.” This sample is consistent with at least parts of the four different scores that were given by the SLPs: 1= “Recurrent, brief stereotypic utterances with varied intonation; the emphasis or prosody may convey some meaning.” 3= “Longer, recurrent stereotypic or automatic utterances without information, or mumbling.” 4= “Halting, telegraphic speech; mostly single words; paraphasias; occasional prepositional phrases; severe word-finding difficulty. No more than two complete sentences with the exception of automatic sentences (e.g., “Oh I don’t know.”)”; 6= “More propositional sentences with normal syntactic patterns; may have paraphasias; significant word-finding difficulty and hesitations may be present.” The models also identified both fluent and non-fluent aspects of this sample.

The machine learning models showed substantial agreement with the SLPs’ ratings (using the WAB fluency scale) for judgements of fluent speech. The model used for identifying non-fluent speech showed moderate agreement with the SLPs using the WAB-R fluency scale, but only fair agreement with the SLPs using the trichotomous judgement.

Results indicate that SLPs do have a reliable ability to judge fluent or non-fluent speech, an ability that is hindered by using a unidimensional, subjective scale that attempts to define fluency on the basis of several different aspects of speech that can vary in opposite directions. Fluent speech can also be identified reliably using a machine learning algorithm. However, use of a machine learning algorithm to identify non-fluent speech did not adequately capture SLPs’ judgement of non-fluent speech (trichotomous judgement), but did capture aspects of the WAB-R fluency scale definition of non-fluent speech (with moderate agreement). It is possible that further attempts at developing machine learning algorithms could more successfully and reliably capture the SLPs’ judgement. It is likely that the models could be improved by training on more samples using the same language tasks, classified consistently as fluent or non-fluent (or mixed) by SLPs. These algorithms were developed using a relatively small number of samples from the public domain. There are many other ways in which the algorithms could be improved, such as marking linguistic factors other than part of speech (e.g., word frequency, imageability, and age of acquisition) for the RNNs.

The WAB-R has been previously reported to have high inter- and intra-rater reliability, but it is unclear whether the fluency measure was rigorously evaluated on a variety of samples in these validations. While one could argue that the WAB-R fluency scale might be more reliably used in rating the WAB-R picture description and spontaneous speech (answers to open-ended questions) than in rating the BDAE picture description, this seems unlikely. Previous studies that have reported poor reliability for this fluency scale did use the WAB-R picture and spontaneous speech subtest (Trupe, Citation1984; Clough & Gordon, Citation2018).

Implications of these findings are that the fluency scale from the WAB-R does not reliably help to classify patients into classic aphasia syndromes. Furthermore, its highly weighted contribution to the WAB-R Aphasia Quotient (AQ) is also problematic because it does not reliably reflect severity. Consider the fact that a score of 5 (“Often telegraphic but more fluent speech with some grammatical organization; marked word-finding difficulty. Paraphasias may be prominent; few, but more than two propositional sentences”) is weighted as “more severe” (fewer points toward normal) than a score of 7 (“Phonemic jargon with semblance to English syntax and rhythm with varied phonemes and neologisms. May talk excessively; must be fluent; characteristic of severe Wernicke’s aphasia.”). In calculating the WAB-R AQ, these scores carry a lot of weight, even though they are actually poor measures of aphasia severity. We would argue that a more appropriate outcome measure for clinical trials would be the WAB-R Summary Score (average scores for the auditory comprehension, naming, and repetition sections), as used in Lazar et al. (Citation2008), as this summary score would be based on more objective and reliable scores that more accurately reflect severity of aphasia. Alternatively, collapsing the fluency ratings from ten to three categories that are more clearly ordinal (as in Hula et al., Citation2010) contributes usefully to estimating overall aphasia severity.

Landrigan and colleagues (Citation2021) also found poor agreement between a data-driven classification based on machine learning and the classic syndrome-based classification. They found that lesion-based classification reached 75% accuracy for the data-driven categories and only 60% accuracy for categories based on the classic fluent versus non-fluent aphasia distinction. However, this poor accuracy could in part be due to the flawed assessments of fluent versus non-fluent aphasia currently in use. The authors favored abandoning the classic vascular syndromes in favor of a classification based on semantic versus phonological deficits. In fact, the data-driven classification is more in line with recent dual-stream models of language (Hickok & Poeppel, Citation2007; Saur et al., Citation2008).

Nevertheless, if aphasiologists wish to continue describing fluent versus non-fluent aphasia, our results indicate that they should rely on a gestalt trichotomous judgment to do so.

Limitations of our study include the relatively small number of samples used for the evaluations of agreement, drawn randomly from people living with stroke who participated in clinical studies of aphasia at Johns Hopkins Medical Institutions. Another limitation of the study is that language tasks used in the training set (YouTube videos) differed across samples and might have influenced fluency rating. Another limitation is that we did not evaluate different algorithms that have been or could be developed to evaluate fluent or non-fluent aphasia, but used two algorithms trained and tested with a relatively small number of samples.

Despite its limitations, the study makes an important clinical point: SLPs are more reliable in judging fluent, non-fluent, or mixed speech in persons with aphasia if they do not rely on the WAB-R fluency scale, but instead rely on a “gestalt” judgement. Future studies could develop machine learning algorithms that attempt to capture the “gestalt” judgements of SLPs, if the community of aphasiologists continues to use classic aphasia classifications. Furthermore, the paper makes the point that outcome measures for clinical trials should rely on objective measures of severity, such as the WAB-R Summary Score (Lazar et al., Citation2008), or a score that includes a 3-point ordinal scale to rate fluency, rather than the WAB-R fluency scale.

Acknowledgments

This research used data from studies supported by NIH (NIDCD), R01 DC05375 and P50 DC014664. We gratefully acknowledge this support and the participation of the people living with aphasia. We are also grateful to Lisa Bunker, PhD, CCC-SLP; Catherine Head Kelly, MA, CCC-SLP; Rachel Mace, MA, CCC-SLP; Kristina Ruch, MA, CCC-SLP; Melissa Stockbridge, PhD, CCC-SLP; Donna Tippett, MA, MPH, CCC-SLP; and Emilia Vitti, MA, CCC-SLP.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

This work was supported by the National Institutes of Health (the National Institute on Deafness and Other Communication Disorders) [P50 DC014664,R01 DC05375].

References

  • Brookshire, R. H. (1983). Subject description and generality of results in experiments with aphasic adults. The Journal of Speech and Hearing Disorders, 48(4), 342–346. 10.1044/jshd.4804.342
  • Caramazza, A. (1984). The logic of neuropsychological research and the problem of patient classification in aphasia. Brain and Language, 21(1), 9–20. 10.1016/0093-934X(84)90032-4
  • Clough, S., & Gordon, J. K. (2018). Understanding fluency in aphasia. Frontiers in Human Neuroscience, 12.
  • Clough, S., & Gordon, J. K. (2020). Fluent or nonfluent? Part A. Underlying contributors to categorical classifications of fluency in aphasia. Aphasiology, 34(5), 515–539. 10.1080/02687038.2020.1727709
  • Ferro, J. M., & Madureira, S. (1997). Aphasia type, age and cerebral infarct localisation. Journal of Neurology, 244(8), 505–509. 10.1007/s004150050133
  • Geschwind, N. (1965). Disconnexion syndromes in animals and man. II. Brain: A Journal of Neurology, 88(3), 585–644. 10.1093/brain/88.3.585
  • Goodglass, H., Kaplan, E., & Weintraub, S. (2001). The Boston Diagnostic Aphasia Examination. Lippincott Williams & Wilkins.
  • Hickok, G., & Poeppel, D. (2007). The cortical organization of speech processing. Nature Reviews. Neuroscience, 8(5), 393–402. 10.1038/nrn2113
  • Hula, W., Donovan, N. J., Kendall, D. L., & Gonzalez-Rothi, L. J. (2010). Item response theory analysis of the Western Aphasia Battery. Aphasiology, 24(11), 1326–1341. 10.1080/02687030903422502
  • Järvelin, A., & Juhola, M. (2011). Comparison of machine learning methods for classifying aphasic and non-aphasic speakers. Computer Methods and Programs in Biomedicine, 104(3), 349–357. 10.1016/j.cmpb.2011.02.015
  • Kertesz, A. (2006). Western Aphasia Battery–Revised. APA.
  • Landrigan, J.-F., Zhang, F., & Mirman, D. (2021). A data-driven approach to post-stroke aphasia classification and lesion-based prediction. Brain: A Journal of Neurology, 144(5), 1372–1383. 10.1093/brain/awab010
  • Lazar, R. M., Speizer, A. E., Festa, J. R., Krakauer, J. W., & Marshall, R. S. (2008). Variability in language recovery after first-time stroke. Journal of Neurology, Neurosurgery, and Psychiatry, 79(5), 530–534. 10.1136/jnnp.2007.122457
  • Lyden, P., Lu, M., Jackson, C., Marler, J., Kothari, R., Brott, T., & Zivin, J. (1999). Underlying structure of the National Institutes of Health Stroke Scale: results of a factor analysis. NINDS tPA Stroke Trial Investigators. Stroke, 30(11), 2347–2354. 10.1161/01.str.30.11.2347
  • Mahmoud, S. S., Kumar, A., Li, Y., Tang, Y., & Fang, Q. (2021). Performance Evaluation of Machine Learning Frameworks for Aphasia Assessment. Sensors (Basel, Switzerland), 21(8). 10.3390/s21082582
  • Marsh, E. B., & Hillis, A. E. (2012). Aphasia and stroke. In Stroke Syndromes: Third Edition. 10.1017/CBO9781139093286.015
  • Matias-Guiu, J. A., Díaz-Álvarez, J., Cuetos, F., Cabrera-Martín, M. N., Segovia-Ríos, I., Pytel, V., Moreno-Ramos, T., Carreras, J. L., Matías-Guiu, J., & Ayala, J. L. (2019). Machine learning in the clinical and language characterisation of primary progressive aphasia variants. Cortex; a Journal Devoted to the Study of the Nervous System and Behavior, 119, 312–323. 10.1016/j.cortex.2019.05.007
  • Mirman, D., Kraft, A. E., Harvey, D. Y., Brecher, A. R., & Schwartz, M. F. (2019). Mapping articulatory and grammatical subcomponents of fluency deficits in post-stroke aphasia. Cognitive, Affective, & Behavioral Neuroscience, 19(5), 1286–1298. 10.3758/s13415-019-00729-9
  • Ochfeld, E., Newhart, M., Molitoris, J., Leigh, R., Cloutman, L., Davis, C., Crinion, J., & Hillis, A. E. (2010). Ischemia in broca area is associated with broca aphasia more reliably in acute than in chronic stroke. Stroke, 41(2). 10.1161/STROKEAHA.109.570374
  • Saur, D., Kreher, B. W., Schnell, S., Kümmerer, D., Kellmeyer, P., Vry, M.-S., Umarova, R., Musso, M., Glauche, V., Abel, S., Huber, W., Rijntjes, M., Hennig, J., & Weiller, C. (2008). Ventral and dorsal pathways for language. Proceedings of the National Academy of Sciences of the United States of America, 105(46), 18035–18040. 10.1073/pnas.0805234105
  • Trupe, A. E. (1984). Reliability of rating spontaneous speech in the Western Aphasia Battery: Implications for classification. In R. H. Brookshire (Ed.), Clinical Aphasiology Conference Proceedings (pp. 55–69). BRK Publishers.