Original Articles

Dealing with the unknown – addressing challenges in evaluating unintelligible speech

Pages 169-184 | Received 16 Jan 2019, Accepted 20 May 2019, Published online: 30 May 2019

ABSTRACT

When investigating the interaction between speech production and intelligibility, unintelligible speech portions are often of particular interest. Therefore, the fact that the standard quantification of speech production – the Percentage of Consonants Correct (PCC) – is only computed on intelligible speech is unsatisfying. Our purpose was to evaluate a new quantification of speech production: the Percentage of Intelligible and Correct Syllables (PICS) designed to address this limitation. A secondary purpose was to explore a task designed to elicit connected speech – concurrent commenting – offering more control of intended speech production compared to free conversation. Nine children with SSD participated in four speech elicitation tasks, varying with respect to ecological validity, and to degree of control: (1) word imitation, (2) picture naming, (3) concurrent commenting of a silent short video, and (4) free conversation. Speech accuracy was analysed in terms of PCC-Revised (PCC-R) and PICS, and intelligibility with regards to the Proportion of Intelligible Syllables (PINTS). A strong correlation was observed between PICS and PCC-R, supporting the construct validity of PICS. Further, a moderate correlation was seen between PICS and PINTS, presumably reflecting that these measures capture different – although related – constructs. No difference was seen between concurrent commenting and free conversation in terms of articulation proficiency or intelligibility; however, this needs further investigation based on more data. Nevertheless, we suggest concurrent commenting as a useful method for eliciting connected speech; in retaining more control over intended target words compared to free conversation, this task may be particularly useful in the context of unintelligible speech.

Background

A key feature of speech sound disorders (SSDs) in children is problems in speech production that compromise their ability to make themselves understood (ASHA, Citation2018; Flipsen, Citation2010). An important part of the clinical care of children with SSDs is the assessment of speech production, often expressed as the extent to which speech production differs from expected target forms (Shriberg & Kwiatkowski, Citation1982). In assessing speech production, different considerations weigh into the decision of what type of speech elicitation task to base the assessment upon (Masterson, Bernhardt, & Hofheinz, Citation2005). Basing assessments on connected free speech is desirable from the perspective of ecological validity, but might prove challenging, as intended target words are often difficult to identify in the child’s speech (Flipsen, Citation2006; Gordon-Brannan & Hodson, Citation2000). Therefore, speech elicitation tasks that optimize ecological validity while retaining some control over target words are of value. However, a remaining challenge, even in speech materials where much of the speech topic is known, is to quantify speech production in stretches of speech that remain unintelligible. This draws attention to a potential need for alternative measures of speech production.

The standard quantification of speech production (hereafter: “articulation proficiency”) is the Percentage of Consonants Correct (PCC; Shriberg & Kwiatkowski, Citation1982), calculated as the proportion of consonants in a speech sample that match target consonants in the (assumed) intended production. As such, the PCC is a measure of speech accuracy, and cannot be calculated on unintelligible portions of speech, where the intended production is unknown. Although the PCC was originally designed for use on samples of connected speech (Shriberg & Kwiatkowski, Citation1982), it has subsequently also been used for samples of isolated words (e.g., Klintö, Salameh, Svensson, & Lohmander, Citation2011; Lousada, Jesus, Hall, & Joffe, Citation2014; McLeod, Crowe, & Shahaeian, Citation2015). Many different variants of the PCC have been suggested, each an adaptation to specific purposes (Shriberg, Austin, Lewis, McSweeny, & Wilson, Citation1997). One such adaptation is the PCC-Revised (PCC-R; Shriberg et al., Citation1997), where allophones (such as both usual and unusual distortions) are treated as correct, and only omissions and substitutions are counted as incorrect. However, like any quantification of articulation proficiency that accounts only for speech accuracy, no version of the PCC is applicable to portions of unintelligible speech. In other words, articulatory proficiency – as inferred from speech accuracy – cannot be determined in speech which is only partially intelligible (Flipsen, Hammer, & Yost, Citation2005). In the study of the association between articulation proficiency and effects on intelligibility, this restricted quantification of articulation proficiency is particularly unsatisfying, as it may leave significant portions of speech unaccounted for – portions that may, in fact, be of most interest. As an illustration, consider two different samples of spontaneous speech, A and B, with the same speech accuracy score of, say, 60%.
Whereas sample A is 100% intelligible, B is only 75% intelligible. Hence, for these two samples, the articulation proficiency score of 60% is based on samples of different sizes; for A it describes the complete sample, whereas, for B, it only describes 75% of the sample. Or, in other words, whereas 60% of the speech production in sample A is “accurate”, 60% of the 75% intelligible portion of B is “accurate”, i.e., only 45% of the complete sample. This illustrates an inherent complication in comparing speech accuracy across samples that vary with regards to intelligibility.
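The arithmetic in this illustration can be made explicit with a small helper function; the function name and values are illustrative only, not part of the study:

```python
def whole_sample_accuracy(accuracy_in_intelligible, proportion_intelligible):
    """Share of the complete sample that is both intelligible and accurately produced."""
    return accuracy_in_intelligible * proportion_intelligible

# Sample A: 60% accurate, 100% intelligible -> the score describes the whole sample
sample_a = whole_sample_accuracy(0.60, 1.00)
# Sample B: 60% accurate within the 75% intelligible portion -> only 45% of the
# complete sample is known to be accurate
sample_b = whole_sample_accuracy(0.60, 0.75)
```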

A potential remedy for the challenge of quantifying articulation proficiency in partially unintelligible speech could be to include identifiable units in the unintelligible portions of speech, categorizing these as incorrect. This, one may argue, would better reflect that speech production has indeed failed to meet identifiable targets, or in other words, that speech production adequacy is reduced. For example, counting and including the total number of consonants in the calculation of speech production adequacy – that is, including consonants in unintelligible portions of a speech sample – would result in a lower score than the original PCC score, as it would increase the value of the denominator in the equation. This score would then reflect the more general concept of speech production adequacy, rather than the more specific speech production accuracy. However, although this may be theoretically feasible, it would be practically inadvisable, as the tallying of consonants in unintelligible portions of speech is a difficult task, raising concerns regarding reliability. Instead, knowing that the tallying of syllables is reliable even in unintelligible stretches of speech (Flipsen, Citation2006), syllables may serve as more fitting units of interest.

In the quantification of intelligibility, the syllable has already been suggested as the unit of interest. Lagerberg, Åsberg, Hartelius, and Persson (Citation2014) have suggested a syllable-based measure of intelligibility, calculated as the proportion of perceived syllables out of the total number of syllables in a speech sample. Notably, no consideration is paid to whether listeners perceive syllables correctly or not; as such, the measure does not depend on knowledge of what the speaker intended to say (Lagerberg et al., Citation2014). A similar, although not identical, approach has been used in the study of speech recognition in individuals with hearing impairments (Seldran et al., Citation2011). Here, intended target words are known, and listeners’ performance is measured as the proportion of syllables correctly perceived. A potential advantage of using the syllable as the unit of interest, instead of the word as suggested by, for example, Gordon-Brannan and Hodson (Citation2000) and Shriberg et al. (Citation1997), is the generalization of findings across languages. Notably, languages differ regarding what constitutes a “word” – the “word” football shoe in English, for example, is made up of two orthographic words, whereas the corresponding “word” in Swedish, fotbollssko, is one orthographic word. Hence, a suggested intelligibility measure like the proportion of words understood (Gordon-Brannan & Hodson, Citation2000) would mean something quite different in English compared to Swedish (and other languages with a similar morphological structure), whereas an intelligibility measure based on the proportion of syllables understood can be assumed to be more comparable across languages.

Investigations of variation in speech accuracy across tasks have often compared picture naming tasks, where isolated words are elicited, to narrative or conversational tasks, from which connected speech is obtained (Klintö et al., Citation2011; Masterson et al., Citation2005; Morrison & Shriberg, Citation1992; Wolk & Meisler, Citation1998). Whereas many have reported somewhat lower speech accuracy in speech materials constituted of isolated words from speech production assessment tests (Masterson et al., Citation2005; Morrison & Shriberg, Citation1992; Wolk & Meisler, Citation1998), differences have generally been small. Klintö et al. (Citation2011) have reported an opposite pattern. Possible explanations for this may involve the different characteristics of the children included – Klintö et al. (Citation2011) studied the speech production of children with cleft palate, whereas the other listed studies were of children with SSDs of other aetiologies. Another explanation may be sought in the complexity of the elicited isolated words – some assessment materials may simply be more challenging than others.

When investigating new quantifications of articulation proficiency, it is desirable to include speakers and speech material that offer variation in terms of the construct in question, i.e., articulation proficiency. One way of doing that is to include children with different types and severities of SSD. Another way is to include elicitation tasks that vary with regards to control and complexity, as described above. The limitation of the PCC metric with regards to dealing with unintelligible portions of speech is, obviously, most apparent when quantifying articulation proficiency where intended target words are unknown, which is often the case in samples of conversational speech in children with SSD. A task suggested by Dollaghan, Campbell, and Tomlin (Citation1990) presents itself as an interesting adjunct to free conversation. Here, the task for the child is to describe ongoing events in a silent video, to an adult assessor who does not see what happens on the screen. For the assessment of speech accuracy, the fact that the topic, as well as the timing and order of events, is known, alleviates challenges in deciphering the speaker’s intended speech production. For the assessment of intelligibility, the fact that listeners cannot be expected to be familiar with the verbatim form of the utterances produced ameliorates concerns of listener familiarity with the speech material. Hence, the task of concurrent commenting offers beneficial features of speech materials in the context of studying links between articulation and intelligibility.

The aim of this investigation is to evaluate a new quantification of articulation proficiency, or more specifically, of speech production adequacy: the Percentage of Intelligible Correct Syllables (PICS). A secondary aim is to explore the task of concurrent commenting, in terms of its properties of articulation proficiency and intelligibility in relation to a task of free conversation. The following research questions are investigated:

  1. Does the Percentage of Intelligible Correct Syllables (PICS) provide a valid quantification of articulation proficiency (or, more specifically, of speech production adequacy)?

This question will be explored through analysis of construct and convergent validity, through correlation analysis against the PCC-R, and through analysis against the proportion of intelligible syllables (PINTS; Lagerberg et al., Citation2014), respectively. As the PICS and the PCC-R are both quantifications of articulation proficiency, a strong correlation between the two is expected. As intelligibility is assumed to be a different – although related – construct, a more moderate correlation is expected between PICS and PINTS.

  2. How do speech materials elicited through concurrent commenting differ from speech materials obtained from free conversation with regards to (a) articulation proficiency and (b) intelligibility?

This question will be operationalized as a comparison of the two types of speech materials with regards to PCC-R and PICS for quantification of articulation proficiency, and the PINTS (Lagerberg et al., Citation2014) for intelligibility. Higher levels of intelligibility are expected for concurrent commenting speech samples compared to free conversation, as listeners’ knowledge of the contents of the commented video clip makes them better equipped when interpreting spoken samples, compared to when the topic is less restricted.

Method

Participants

Nine children with SSD participated as speakers in the present study. The children were recruited from SLP clinics in Stockholm and Gothenburg, based on the criteria that (a) their treating SLP evaluated their problems in speech production as affecting intelligibility, and (b) they were between 4 and 10 years of age. All children were monolingual Swedish speakers, except one (ID 6), for whom Swedish was still the strongest language, by parental report. The children varied both with respect to the type of SSD and to functional intelligibility, as assessed via the Swedish version of the parental questionnaire Intelligibility in Context Scale (ICS; McLeod, Harrison, & McCormack, Citation2012), as presented in Table 1. (It can be noted that two of the children – ID 3 and 6 – had an ICS score of 5, i.e., the maximum score, indicating no parent-reported difficulties in making themselves understood in an everyday context. They still, however, fulfilled the inclusion criterion of having speech production problems that their SLP judged as affecting intelligibility.)

Table 1. Demographic information of participating speakers, including – apart from sex and age – average score value on the intelligibility in context scale (ICS; McLeod et al., Citation2012), and information regarding SLP diagnoses.

Material

For each of the participating children, speech collected through four speech elicitation tasks was included: (1) imitated isolated words, (2) isolated words elicited through picture naming, (3) concurrent commenting during the playback of a silent short film, and (4) free conversation. Each of these tasks is described in the following.

Imitation

Imitated isolated words were elicited through STI-CH – the Swedish Test of Intelligibility for Children (Lagerberg et al., Citation2015) – where the task for the child is to repeat back isolated words produced by an adult speaker through headphones. STI-CH is based on unique 60-item sets of words, designed to prevent assessor familiarity with the target words while still ensuring balance with respect to phonological characteristics (for details, see Lagerberg et al., Citation2015). In the present study, all nine sets of words were unique.

Picture naming

Isolated words were also elicited via picture naming, through LINUS (Blumenthal & Lundeborg Hammarström, Citation2014), a test used for assessment of phonological production. In LINUS, 107 photographic pictures representing isolated words are presented to the child, whose task is to name the pictures. Target words are primarily nouns and are selected for the purpose of representing all Swedish consonants at least twice, in initial, medial and final position (where possible). All Swedish vowels are also represented in stressed positions.

Concurrent commenting

Connected speech was elicited through a task of concurrent commenting, where the child was seated in front of a laptop screen to watch an episode of Pingu (either Pingu as a Chef or Pingu Helps around the House, both available from The Official Pingu Youtube Channel, https://www.youtube.com/user/pingu/featured). The selection of which episode to watch was negotiated between child and test leader; two children (ID 4 and 7) watched Pingu Helps around the House, whereas the other children watched Pingu as a Chef. The child was instructed to continuously describe the events in the film, to the adult assessor who could not see the screen, but who took notes of what the child said. Hence, the task was designed as a collaborative task between the child and test leader, with the challenge of covering as much as possible of the events in the film.

Free conversation

Connected speech was also elicited through conversation. This conversation was introduced by the test leader asking the child what they would do if they won one million Swedish crowns. The test leader strived to promote as much speech production as possible from the child, by supporting the child with short and open follow-up questions.

From the two tasks eliciting isolated words – the imitation and picture naming tasks – all word items were included. (In some cases, however, the child did not produce all items, which explains why there are not 60 imitation items and 107 naming items for all children in Table 2.) From the two tasks eliciting connected speech, the first 100 words were included, whenever possible, in line with procedures in similar studies (e.g., Klintö et al., Citation2011; Lagerberg et al., Citation2014). Utterances containing the 100th word were included as a whole. Unintelligible portions of speech were also included among these words (see below for a description of the estimated number of words in unintelligible portions of speech). In five cases, the 100-word limit was not reached. Also, following procedures in Klintö et al. (Citation2011) and Lagerberg et al. (Citation2014), a minimum limit of 50 words was set. Utterances only containing “ja”/“nej” (Eng. yes/no), or variants of these, were excluded. Interrupted words were also excluded.

Table 2. Sample descriptions listing the number of intelligible words in each of the four samples for each child speaker. For the connected speech samples – concurrent commenting and free conversation – the number of unintelligible syllables is given within parentheses.

For estimation of word count in unintelligible portions of speech, Flipsen’s (Citation2006) formula Intelligibility Index-PS was used. Following this procedure, the number of syllables per word is estimated, for each child, from the intelligible speech portions. The number of syllables in an unintelligible speech portion is then divided by this factor, yielding an estimated number of words in that unintelligible speech portion. One speaker (ID 4) met the minimum 50-word criterion only through the use of this formula. As seen in Table 2, the free conversation sample for this speaker contains only 26 intelligible words. It can be noted that for the PCC-R analysis (see below), this is considerably fewer than the recommended 100 words (Shriberg et al., Citation1997).
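The estimation procedure can be sketched as follows; this is a minimal illustration of the idea behind Flipsen’s (2006) formula with hypothetical counts, not the exact published formulation:

```python
def estimated_unintelligible_words(unintelligible_syllables,
                                   intelligible_syllables,
                                   intelligible_words):
    """Estimate the number of words in unintelligible stretches by dividing
    their syllable count by the child's syllables-per-word rate, estimated
    from the intelligible portions of the same sample."""
    syllables_per_word = intelligible_syllables / intelligible_words
    return unintelligible_syllables / syllables_per_word

# e.g., 30 unintelligible syllables; 150 syllables over 100 intelligible words
# gives 1.5 syllables/word, so an estimated 20 words
words = estimated_unintelligible_words(30, 150, 100)
```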

Recordings were conducted in quiet rooms, at preschools or in the children’s homes, with only the child and the test leader present. Sessions were video recorded with a Zoom Q8 camera, and audio was recorded both via the camera’s internal microphone and via a clip-on microphone Sennheiser MKE-P-C. Only the latter was used in the subsequent processing and analysis.

Procedure

Speech analysis

Annotation, phonological transcription, and the creation of master transcripts were conducted in ELAN (Wittenburg, Brugman, Russel, Klassmann, & Sloetjes, Citation2006). The speech material had been collected in two different projects, under identical conditions except regarding the construction of master transcripts. For five of the children (IDs 1, 2, 3, 6 and 7), the experimenter (who also conducted the recordings of the children) transcribed the continuous speech orthographically within a week after the recording. Then, caregivers were consulted to help decode unintelligible speech and verify words that were difficult to interpret. The resulting transcripts were designated as master transcripts. For the remaining four children, orthographic transcription was made by two SLP students (authors MÖCA and CO), who performed the transcription in consultation with each other; the caregivers were not available and therefore not involved in the process. For these samples, these transcripts constituted the master transcripts.

Phonological transcription was done in SAMPA (Wells, Citation2005). Phonological transcriptions of expected target forms were automatically generated, through lexicon-lookup in the Swedish NST lexicon (Andersen, Citation2005). For words not found in the pronunciation dictionary, phonological transcriptions were generated through grapheme-to-phoneme conversion (Bisani & Ney, Citation2008). For a more detailed description, see Strömbergsson, Edlund, Götze, and Nilsson Björkenstam (Citation2017). Phonological transcription of observed speech production was conducted by two SLP students (authors MÖCA and CO). First, the two transcribers calibrated their transcriptions of a first practice sample (which was not included in the analysis). Then, all speech samples were assigned numbers and randomly allotted to either of the two transcribers. Syllables evaluated as unintelligible in the master transcript were transcribed with ‘?’ characters in the phonological transcriptions.

Evaluation of intelligibility

Evaluations of intelligibility of the connected speech samples were conducted in two rounds, reflecting the order in which the speech samples had been collected (see above). As far as possible, conditions were kept stable between the first and second evaluation rounds; Table 3 summarizes the aspects that were not identical between rounds.

Table 3. Summary of differing conditions for the first and second round of evaluations of intelligibility.

Twenty individuals were recruited for the first evaluation round, 18 of whom were SLP students, and two of whom were newly graduated SLPs (with less than a year’s practice). All participants but one were female, and the age range was 22–34 years (M = 26.9 years, SD = 3.3 years). For the second evaluation round, 26 SLP students contributed evaluations as part of their course work during their second year of SLP studies. These students were aged 20–38 years (M = 24.2 years, SD = 4.0 years), all women. For all evaluations, only individuals with self-reported normal hearing and with Swedish as their dominant language were included.

Evaluations were conducted at university facilities, in quiet rooms with only evaluators and test leader present. The evaluators were equipped with Sony MDR-ZX550 headphones, seated by a computer, from which they could control the playback of speech samples. They were instructed not to listen to each sample more than twice. They could pause the samples at any time but were not allowed to replay parts of the sample. Further, they were instructed to orthographically transcribe all words and word parts they could understand, and to indicate all syllables they could not understand with “0”. This follows the procedure described in Lagerberg et al. (Citation2014). Also in line with Lagerberg et al. (Citation2014), the evaluators participating in the second evaluation round typed their transcriptions onto the computer. For the first evaluation round, on the other hand, transcriptions were made by pen and paper, and later re-typed verbatim onto a computer. For the evaluation of concurrent commenting samples, the participants were presented with the silent video used in the commenting task, whereas the evaluators of free conversation samples were informed that they would hear conversational samples representing the topic of winning a lot of money. For all evaluations, an introductory sample (not included in the analysis) was used as a practice sample, after which the evaluators were given a chance to ask the test leader for clarification regarding the procedure.

For the first evaluation round, five sets of samples were compiled, with four samples in each set. The 20 evaluators were randomly assigned to sample sets, such that each sample would obtain at least four evaluations. For the second evaluation round, five sets were also compiled, but instead with three samples in each set: two representing concurrent commenting, and one representing free conversation. The 26 evaluators were randomly assigned to one of these sets and transcribed 2–3 samples each, within the 45 min-long lab session they participated in. Only completely transcribed samples were included in the analysis.

Analyses

Percentage of consonants correct-revised (PCC-R)

The PCC-R (Shriberg et al., Citation1997) was calculated on all speech material, except for unintelligible syllables and utterances only containing “ja”/”nej” (Eng. yes/no), or variants of these. PCC-R was computed automatically, via a Perl script, and verified manually (on a 10% subset) by authors MÖCA and CO.
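The scoring rule can be sketched as below; this is an illustrative reimplementation, not the authors’ Perl script, and it assumes target and observed consonants have already been aligned, with distortions pre-mapped to their target phoneme (since PCC-R scores them as correct):

```python
def pcc_r(aligned_consonants):
    """aligned_consonants: (target, produced) pairs from an aligned transcription;
    None marks an omitted consonant. Only omissions and substitutions count as
    incorrect; distortions are assumed to be represented as the target phoneme."""
    correct = sum(1 for target, produced in aligned_consonants if produced == target)
    return 100.0 * correct / len(aligned_consonants)

# substitution (t -> k) and omission (s -> None) are incorrect; 2 of 4 are correct
score = pcc_r([("k", "k"), ("t", "k"), ("s", None), ("r", "r")])
```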

Percentage of intelligible correct syllables (PICS)

The Percentage of Intelligible Correct Syllables (PICS) was calculated on all speech material, based on the following equation:

PICS = n_accurate_syllables / (n_accurate_syllables + n_inaccurate_syllables + n_unintelligible_syllables) × 100

As for PCC-R, PICS was also computed automatically, via a Perl script, assigning the sum of all vowels and “?” characters in a sample as the total number of syllables in the sample (i.e., the denominator in the equation), and the number of exact syllable matches between target and observed transcriptions as the number of accurately produced intelligible syllables in the sample.
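A minimal sketch of this computation in Python (an illustration, not the authors’ Perl script; it assumes pre-segmented syllable strings, with “?” marking an unintelligible syllable in the observed transcription):

```python
def pics(target_syllables, observed_syllables):
    """PICS: exact target/observed syllable matches over ALL syllables, so both
    inaccurate and unintelligible ('?') syllables count in the denominator."""
    total = len(observed_syllables)
    accurate = sum(1 for target, observed in zip(target_syllables, observed_syllables)
                   if observed != "?" and observed == target)
    return 100.0 * accurate / total

# one accurate, one inaccurate, one unintelligible syllable -> 1 of 3
score = pics(["fot", "bolls", "sko"], ["fot", "bots", "?"])
```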

Percentage of intelligible syllables (PINTS)

For each orthographic listener transcription from the intelligibility evaluation, the transcribed syllables were tallied (by counting vowels, in accordance with Lagerberg et al., Citation2014), as well as the number of unintelligible syllables (i.e., “0” characters in the transcriptions). This was done automatically, via a Perl script. An intelligibility score for each listener transcription was calculated in accordance with Lagerberg et al. (Citation2014) as the number of transcribed syllables divided by the total number of syllables in the transcript, multiplied by 100. For each speech sample, the intelligibility scores obtained from all listener transcriptions of the sample were then combined into a median intelligibility score. This score was obtained by excluding the extreme scores (the highest 25% and the lowest 25% of the scores) and averaging the remaining middle half, in line with the procedure in Gordon-Brannan and Hodson (Citation2000). This intelligibility score represented the PINTS score for the specific speech sample.
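The two steps – a per-listener score, followed by combining listeners via the averaged middle half – can be sketched as follows; the syllable representation and variable names are illustrative assumptions:

```python
def pints(transcribed_syllables):
    """Per-listener intelligibility: syllables transcribed as words (not marked
    '0' for unintelligible) over all syllables in the transcript, times 100."""
    total = len(transcribed_syllables)
    understood = sum(1 for syll in transcribed_syllables if syll != "0")
    return 100.0 * understood / total

def middle_half_mean(scores):
    """Combine listener scores by dropping the lowest and highest quarter
    (integer division) and averaging the remaining middle scores."""
    ordered = sorted(scores)
    quarter = len(ordered) // 4
    middle = ordered[quarter:len(ordered) - quarter]
    return sum(middle) / len(middle)

# four listeners: the extreme scores (40 and 90) are dropped, 55 and 60 averaged
sample_score = middle_half_mean([40.0, 55.0, 60.0, 90.0])
```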

Reliability

As described above, PCC-R values were calculated automatically based on the comparison between automatically generated phonological transcriptions of expected target forms and manually created phonological transcriptions of observed speech production. In order to estimate the reliability of the manual phonological transcriptions, four samples transcribed by each of the transcribers were randomly selected and re-transcribed by the other transcriber. Complete word-by-word agreement between the two transcribers varied between 33% and 96% (M = 74%, SD = 20%). If only consonants were considered, word-by-word agreement varied between 79% and 99% (M = 92%, SD = 6%).

After three weeks, both transcribers also re-transcribed four randomly selected samples from their own original transcriptions, to enable an analysis of the intra-transcriber agreement. Complete word-by-word agreement varied between 65% and 98% (M = 89%, SD = 11%). If only consonants were considered, word-by-word agreement varied between 90% and 99% (M = 96%, SD = 3%).
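Agreement figures of the kind reported above can be computed with a simple comparison; this sketch assumes the two transcripts have already been aligned word by word:

```python
def word_by_word_agreement(transcript_a, transcript_b):
    """Percentage of aligned words with identical phonological transcriptions."""
    if len(transcript_a) != len(transcript_b):
        raise ValueError("transcripts must be aligned word by word")
    matches = sum(1 for a, b in zip(transcript_a, transcript_b) if a == b)
    return 100.0 * matches / len(transcript_a)

# two of three words transcribed identically
agreement = word_by_word_agreement(["kat", "sOl", "hUs"], ["kat", "sOl", "hus"])
```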

Ethical considerations

The present study is part of a larger project entitled “Real-time assessments of intelligibility in children’s connected speech”, which has been approved by the Regional Ethical Review Board Stockholm (No. 2016/1628–31/1).

Results

Validity of the percentage of intelligible correct syllables (PICS)

Table 4 shows the variation in articulation proficiency (as measured by PCC-R and PICS) across all four speech elicitation tasks, and in intelligibility (as measured by PINTS) across the two tasks involving connected speech. The figures in the table illustrate that PICS scores are consistently lower than PCC-R scores, but also that there is large variation among speech samples with regards to the three measures; for example, PCC-R scores range from 35.64% in the sample with the lowest articulation proficiency to 97.12% in the sample with the highest.

Table 4. Average (and SD) values of articulation proficiency, measured by PCC-R and PICS, across the four speech elicitation tasks (Imitation, naming, concurrent commenting and free conversation), and intelligibility, measured by PINTS, across the two tasks involving connected speech (Concurrent commenting and free conversation).

Figure 1 illustrates the association between the PICS and the reference measure of articulation proficiency (specifically: of speech accuracy), i.e., the PCC-R. In the figure, as well as in the correlation analysis, each datapoint represents a speech sample. As shown in Table 5, the correlation is very high, hence supporting the construct validity of the PICS. (This correlation remains high also when computed only across samples of connected speech: rs = 0.98, p < .001.) Regarding the association between the PICS and the reference measure of intelligibility (PINTS), Figure 1, together with the coefficient data in Table 5, illustrates a lower, although still significant, association. This lends support also to the convergent validity of the PICS, although with weaker strength.

Table 5. Spearman rank correlation coefficients (and p-values) for associations between the PICS, the PCC-R, and the PINTS.

Figure 1. Associations between (a) PICS and PCC-R across all four tasks, and (b) PICS and PINTS, across the two tasks involving connected speech. Each dot in the graphs represents a speech sample.


Comparison between concurrent commenting and free conversation

As seen from the figures in Table 4, the difference between the two types of connected speech materials – obtained either through concurrent commenting or through free conversation – is small, both with regards to speech accuracy and with regards to intelligibility. A Wilcoxon signed ranks test revealed no statistically significant difference between the two tasks, neither for PCC-R (Z = −0.65, p = .52), nor for PICS (Z = −0.89, p = .37), nor for PINTS (Z = −0.42, p = .68). It can be noted in Table 4 that, for all four speech tasks, the variation between samples in terms of speech accuracy and intelligibility is large, although it is smallest for the samples elicited through concurrent commenting.

Discussion

Our primary aim was to explore the validity of a novel measure of articulation proficiency – one that does not exclude unintelligible portions of speech – the Percentage of Intelligible and Correct Syllables (PICS). The strong correlation between the PICS and the standard measure of articulation proficiency, the PCC, supports the construct validity of the novel measure. In other words, although the PICS and the PCC may quantify different aspects of articulation proficiency (speech adequacy vs. speech accuracy), both can be regarded as quantifications of the more general construct of articulation proficiency. Further, the correlation with a quantitative measure of intelligibility also suggests convergent validity of the PICS. The expected moderate strength of the association between speech production adequacy as measured by the PICS and intelligibility can be assumed to reflect the fact that speech adequacy and intelligibility are related, but different, constructs. As a secondary aim, we investigated concurrent commenting as an alternative speech elicitation task to free conversation, comparing the two with regards to variation in articulation proficiency and intelligibility. The finding that neither articulation proficiency nor intelligibility differed between the two speech tasks may be seen as an indication that the two tasks pose comparable articulatory demands on the speakers, and that listener intelligibility is not aided by prior knowledge of speech content. However, as will be discussed below, we recommend further investigation before such conclusions can be reached.

Although the results serve to establish the PICS as a valid quantification of articulation proficiency, its advantages over the standard PCC metric (or variants thereof) may not be evident at first sight. Because extraordinary efforts (parental consultation) were made to resolve unintelligible speech portions in the majority of the speech samples, the portions that could be included in the calculation of PCC were larger than they would otherwise have been. It is reasonable to assume that the virtues of the PICS become more apparent for speech containing more unintelligible portions.

An important motivation for introducing concurrent commenting as a speech elicitation task is the contextual cues that the commented film provides when recovering the intended spoken message in the recorded speech. Had speech elicited through concurrent commenting been more intelligible than speech elicited through free conversation, that suggestion would have been supported. Yet, no such effect was evidenced in the present study – neither in the original transcriptions of the material, nor in the listener transcriptions underlying the PINTS figures. It can be noted, however, that for the least intelligible speaker (ID 4), with the lowest scores both on the ICS (see ) and in terms of the number of intelligible words in the original transcription of a speech sample (see ), there are in fact no unintelligible words in the original transcription of the concurrent commenting task. It is possible that the cues provided by known speech content are most apparent in speech samples that are less intelligible than the majority of the samples included here. We welcome further investigation of this issue based on more – and perhaps less intelligible – samples.

The variation in articulation proficiency (as captured by both PCC-R and PICS) is large across the included samples. Although not specified as a research question, it is interesting to relate the variation in articulation proficiency observed across tasks here to earlier comparisons of task-dependent variation in speech accuracy. In terms of average PCC-R and PICS values, higher levels of articulation proficiency were observed in connected speech samples than in samples of isolated words (see ). This tendency aligns with previous observations (Masterson et al., 2005; Morrison & Shriberg, 1992; Wolk & Meisler, 1998), whereas it runs counter to the findings of Klintö et al. (2011). However, as the speakers in the present study do not all represent the same SSD type, and as little is known about the nature of their SSD, we refrain from speculating about explanations for this tendency. Another observation relating to the variation in articulation proficiency across samples is that the variation between speakers – both in terms of standard deviation values and in terms of the range between minimum and maximum values – is smaller for concurrent commenting samples than for the other speech samples. (In fact, this pattern is not limited to articulation proficiency, but can also be observed for the measure of intelligibility.) We see no obvious explanation for this observation.

Conversational speech has been suggested as the type of speech material in which speakers are likely to display their greatest communicative competence (Flipsen, 2010). Different conversational settings may also influence the complexity of the language children produce; for example, Dollaghan et al. (1990) observed more complex language in narration than in conversation. Interestingly, in their comparison between samples retrieved either through free conversation or concurrent commenting, Dollaghan et al. (1990) observed higher linguistic complexity in the video narration task, both with regard to the proportion of linguistically complex utterances and with regard to utterance length. The present study did not include any analysis of linguistic complexity across the two types of connected speech task, but it should be acknowledged that this factor may affect evaluations of both articulation proficiency and intelligibility.

As the events in a video change rapidly, and their presentation does not adapt to the child’s narrating pace, the concurrent commenting task can be expected – in some respects – to place higher information-processing demands on the child than a spontaneous conversation task (Dollaghan et al., 1990). Allowing the children to prepare by first watching the video in silence was suggested by Dollaghan et al. (1990) as a way of reducing these demands. Although this was not done in the present study, we experienced no difficulty in obtaining sufficiently large commenting speech samples from the included children. Moreover, if this task had indeed been extra demanding, lower speech accuracy would have been expected in samples obtained through concurrent commenting than in the other tasks; there were no indications in that direction. In retrospect, we realize that other children may have responded differently, and we join Dollaghan et al.’s (1990) recommendation to allow children the opportunity to first watch the video in silence.

Limitations

Although efforts were made to maximize variation in articulation proficiency among the included samples, the variation in intelligibility across samples was smaller; specifically, the least intelligible samples were still more than 50% intelligible (as measured by the PINTS). In retrospect, it would have been preferable to also include less intelligible samples. In line with the reasoning above, this would have provided better opportunities for demonstrating the advantages of the PICS, which – based on the present study – can only be argued on a theoretical level.

For practical reasons, the intelligibility evaluations were conducted in two rounds. Even though efforts were made to keep conditions stable between the two evaluation rounds, it should be acknowledged that differences between conditions may have introduced some uncertainty into the evaluation of intelligibility. Most critically, caution is warranted when comparing intelligibility across speech tasks (i.e., conversation vs. concurrent commenting), as the concurrent commenting samples were evaluated in only one of the evaluation rounds.

The selection of video episode was negotiated between child and test leader, as a way of allowing the child some choice in the recording sessions. As a consequence, control over the balance of video content across the concurrent commenting samples was relaxed, resulting in two children selecting one video clip and the remaining seven selecting another. It cannot be ruled out that this influenced the results. For future studies involving concurrent commenting, a balanced set of samples is recommended.

Methods for estimating the number of words based on the number of syllables have been suggested (Flipsen, 2006; Shriberg et al., 1997), and were also used in the compilation of the speech material in the present study. Although we do not question the value of these suggestions, we propose that this estimation step may in fact not be necessary for the quantification of intelligibility, and that it should even be avoided in comparisons between language samples of different linguistic complexity, not to mention in comparisons across different languages. Instead, we suggest that sample length be based on the number of syllables rather than on the number of words – for the same reasons as discussed in the introduction: languages differ with regard to what constitutes a “word”, but can be assumed to be more comparable in terms of syllables.
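
As an illustration of this suggestion, both sample length and intelligibility can be counted directly in syllables from a syllabified transcription in which each unintelligible syllable is marked, with no words-per-syllable estimation step. The “X” marking convention and the function below are a hypothetical sketch, not the transcription protocol used in the study.

```python
def pints_sketch(transcription):
    """Proportion of intelligible syllables (PINTS-style) in a syllabified
    transcription where each unintelligible syllable is marked 'X'
    (a hypothetical convention). Sample length and intelligibility are
    both expressed in syllables, so no word-count estimation is needed."""
    syllables = transcription.split()
    intelligible = [s for s in syllables if s != "X"]
    return len(intelligible) / len(syllables)

# 8 transcribed syllables, 2 of them unintelligible.
print(pints_sketch("ba na na X X is run ning"))  # 0.75
```

Because the denominator is a syllable count, the same computation applies unchanged across languages with very different word structures.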

As described above, measuring speech adequacy rather than speech accuracy is relevant when quantifying articulation proficiency in speech where intelligibility is reduced and target words are unknown. Notably, in contexts where target words are known – such as in samples collected from naming or repetition tasks – speech accuracy and speech adequacy will amount to the same values. Such speech materials offer a convenient way of estimating articulation proficiency with traditional means, with obvious advantages in a clinical context. On the downside, however, these materials are less representative of everyday speech. For researchers and clinicians concerned with quantifying articulation proficiency under ecologically valid conditions – for which spontaneous conversational speech is probably preferred, with its concomitant challenge of partially unintelligible speech – the suggested PICS measure of speech production adequacy should therefore be of value. As for the question of why speech is unintelligible, the PICS does not provide answers. Rather, it can provide useful descriptive information for future research aiming to further uncover the links between speech production and intelligibility.
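
That accuracy and adequacy coincide when target words are known can be verified with a small sketch (hypothetical names and data, under the same assumed definitions as above): with known targets, every produced syllable can be classified, so the intelligible-only denominator equals the all-syllables denominator.

```python
def accuracy(syllables):
    """Correct / intelligible syllables (speech accuracy, PCC-style)."""
    intelligible = [correct for intell, correct in syllables if intell]
    return 100.0 * sum(intelligible) / len(intelligible)

def adequacy(syllables):
    """Intelligible-and-correct / all syllables (speech adequacy, PICS-style)."""
    return 100.0 * sum(intell and correct for intell, correct in syllables) / len(syllables)

# Naming task: targets known, so every syllable is classifiable -> all intelligible.
naming_sample = [(True, True), (True, False), (True, True), (True, True)]
assert accuracy(naming_sample) == adequacy(naming_sample) == 75.0
```

The two measures diverge only once unintelligible syllables enter the sample, which is precisely the situation the PICS is designed for.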

Conclusions

The Percentage of Intelligible and Correct Syllables (PICS) is a valid measure of articulation proficiency (specifically: of speech adequacy), in theory particularly useful in the evaluation of speech containing unintelligible portions to which standard measures of speech accuracy are not applicable. However, further investigation – including samples representing lower levels of intelligibility – is needed to confirm its usefulness with empirical evidence. The assumed advantage of concurrent commenting over free conversation with respect to deciphering the intended verbal message was not empirically supported in the present study; further investigation – again including less intelligible speech samples – is needed before firm conclusions on this issue can be reached.

Declaration of interest

The authors report no declarations of interest.

Acknowledgments

The work was supported by the Swedish Research Council (VR 2015-01525). Our sincere thanks to SLPs Matilda Emander and Norea Kjerrman, for contributing to data collection and analysis, and to all children and listeners for their participation.

Additional information

Funding

This work was supported by the Swedish Research Council [VR 2015-01525].

References

  • American Speech-Language-Hearing Association. (2018). Speech sound disorders: Articulation and phonology. Author. Retrieved from https://www.asha.org/practice-portal/clinical-topics/articulation-and-phonology/
  • Andersen, G. (2005). Gjennomgang og evaluering av språkresurser fra NSTs konkursbo [Review and evaluation of language resources from NST’s bankrupt’s estate]. Aksis, UNIFOB. Bergen, Norway: Bergen University.
  • Bisani, M., & Ney, H. (2008). Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(5), 434–451. doi:10.1016/j.specom.2008.01.002
  • Blumenthal, C., & Lundeborg Hammarström, I. (2014). LINUS. LINköpingsUnderSökningen: Ett fonologiskt testmaterial från 3 år [LINköpingsUnderSökningen: A phonological test material from 3 years]. Sweden: Department of Neuroscience and Locomotion/Speech Pathology, Linköping University. Retrieved from http://www.diva-portal.org/smash/get/diva2:737467/FULLTEXT01.pdf
  • Dollaghan, C. A., Campbell, T. F., & Tomlin, R. (1990). Video narration as a language sampling context. Journal of Speech and Hearing Disorders, 55(3), 582–590. doi:10.1044/jshd.5503.582
  • Flipsen, P. J. (2006). Measuring the intelligibility of conversational speech in children. Clinical Linguistics & Phonetics, 20(4), 303–312. doi:10.1080/02699200400024863
  • Flipsen, P. J. (2010). Factors associated with the intelligibility of conversational speech produced by children with cochlear implants. In P. Rhea & P. Flipsen Jr. (Eds.), Speech sound disorders in children: In honor of Lawrence D. Shriberg (pp. 225–246). San Diego, CA: Plural Publishing.
  • Flipsen, P. J., Hammer, J. B., & Yost, K. M. (2005). Measuring severity of involvement in speech delay: Segmental and whole-word measures. American Journal of Speech-Language Pathology, 14(4), 298–312. doi:10.1044/1058-0360(2005/029)
  • Gordon-Brannan, M., & Hodson, B. W. (2000). Intelligibility/severity measurements of prekindergarten children’s speech. American Journal of Speech-Language Pathology, 9(1), 141–150. doi:10.1044/1058-0360.0902.141
  • Klintö, K., Salameh, E.-K., Svensson, H., & Lohmander, A. (2011). The impact of speech material on speech judgement in children with and without cleft palate. International Journal of Language and Communication Disorders, 46(3), 348–360.
  • Lagerberg, T. B., Åsberg, J., Hartelius, L., & Persson, C. (2014). Assessment of intelligibility using children’s spontaneous speech: Methodological aspects. International Journal of Language & Communication Disorders, 49(2), 228–239. doi:10.1111/1460-6984.12067
  • Lagerberg, T. B., Hartelius, L., Johnels, J. Å., Ahlman, A.-K., Börjesson, A., & Persson, C. (2015). Swedish test of intelligibility for children (STI-CH) – validity and reliability of a computer-mediated single word intelligibility test for children. Clinical Linguistics & Phonetics, 29(3), 201–215. doi:10.3109/02699206.2014.987925
  • Lousada, M., Jesus, L. M. T., Hall, A., & Joffe, V. (2014). Intelligibility as a clinical outcome measure following intervention with children with phonologically based speech sound disorders. International Journal of Language & Communication Disorders, 49(5), 584–601. doi:10.1111/1460-6984.12095
  • Masterson, J. J., Bernhardt, B. H., & Hofheinz, M. K. (2005). A comparison of single words and conversational speech in phonological evaluation. American Journal of Speech-Language Pathology, 14(3), 229–241. doi:10.1044/1058-0360(2005/023)
  • McLeod, S., Crowe, K., & Shahaeian, A. (2015). Intelligibility in context scale: Normative and validation data for English-speaking preschoolers. Language, Speech, and Hearing Services in Schools, 46(3), 266–276. doi:10.1044/2015_LSHSS-14-0120
  • McLeod, S., Harrison, L. J., & McCormack, J. (2012). The intelligibility in context scale: Validity and reliability of a subjective rating measure. Journal of Speech, Language, and Hearing Research, 55(2), 648–656. doi:10.1044/1092-4388(2011/10-0130)
  • Morrison, J. A., & Shriberg, L. D. (1992). Articulation testing versus conversational speech sampling. Journal of Speech and Hearing Research, 35(2), 259–273. doi:10.1044/jshr.3502.259
  • Mukaka, M. M. (2012). Statistics corner: A guide to appropriate use of correlation coefficient in medical research. Malawi Medical Journal: the Journal of Medical Association of Malawi, 24, 69–71.
  • Seldran, F., Gallego, S., Micheyl, C., Veuillet, E., Truy, E., & Thai-Van, H. (2011). Relationship between age of hearing-loss onset, hearing-loss duration, and speech recognition in individuals with severe-to-profound high-frequency hearing loss. Journal of the Association for Research in Otolaryngology, 12(4), 519–534. doi:10.1007/s10162-011-0261-8
  • Shriberg, L. D., Austin, D., Lewis, B. A., McSweeny, J. L., & Wilson, D. L. (1997). The percentage of consonants correct (PCC) metric: Extensions and reliability data. Journal of Speech, Language, and Hearing Research, 40(4), 708–722. doi:10.1044/jslhr.4004.708
  • Shriberg, L. D., & Kwiatkowski, J. (1982). Phonological disorders III: A procedure for assessing severity of involvement. Journal of Speech and Hearing Disorders, 47(3), 256–270. doi:10.1044/jshd.4703.256
  • Strömbergsson, S., Edlund, J., Götze, J., & Nilsson Björkenstam, K. (2017). Approximating phonotactic input in children’s linguistic environments from orthographic transcripts. Proceedings of Interspeech 2017, Stockholm, Sweden, pp. 2213–2217.
  • Wells, J. C. (2005). SAMPA – computer readable phonetic alphabet. UCL Psychology & Language Sciences. Retrieved from http://www.phon.ucl.ac.uk/home/sampa/index.html
  • Wittenburg, P., Brugman, H., Russel, A., Klassmann, A., & Sloetjes, H. (2006). ELAN: a professional framework for multimodality research. Max Planck Institute for Psycholinguistics. Retrieved from http://pubman.mpdl.mpg.de/pubman/item/escidoc:60436:2/component/escidoc:60437/LREC%202006_Elan_Wittenburg.pdf
  • Wolk, L., & Meisler, A. W. (1998). Phonological assessment: A systematic comparison of conversation and picture naming. Journal of Communication Disorders, 31(4), 291–313. doi:10.1016/S0021-9924(97)00092-0