Original Article

Measurement and optimisation of the perceptual equivalence of the Dutch consonant-vowel-consonant (CVC) word lists using synthetic speech and list pairs

Received 07 Nov 2022, Accepted 11 Jan 2024, Published online: 07 Feb 2024

Abstract

Objectives

(1) to determine whether the standard Dutch word lists for speech audiometry are equally intelligible in normal-hearing listeners (Experiment 1), (2) to investigate whether synthetic speech can be used to create word lists (Experiment 1) and (3) to determine whether the list effect found in Experiment 1 can be reduced by combining two lists into pairs (Experiment 2).

Design

Participants performed speech tests in quiet with the original (natural) and synthetic word lists (Experiment 1). In Experiment 2, new participants performed speech tests with list pairs constructed from the original lists on the basis of the results of Experiment 1.

Study samples

Twenty-four and twenty-eight normal-hearing adults.

Results

There was a significant list effect in the natural speech lists but not in the synthetic speech lists. Variability in intelligibility was significantly higher in the former, with list differences of up to 20% at fixed presentation levels. The 95% confidence interval of a list with a score of approximately 70% is around 10%-points wider than that of a list pair.

Conclusions

The original Dutch word lists show large variations in intelligibility. List effects can be reduced by combining two lists per condition. Synthetic speech is a promising alternative to natural speech in speech audiometry in quiet.

Introduction

Speech recognition tests assess a listener’s ability to recognise speech and are routinely performed in clinics for hearing evaluation. They are of great importance in the diagnosis of hearing loss and in assessing the performance of listeners with hearing aids and cochlear implants (CIs). Typically, lists of words are presented at different sound pressure levels and the percentage correct is calculated at each level. Test lists must be perceptually balanced to ensure that the speech recognition test result is independent of the list used (ISO 8253-3:2022). Commonly used word lists in the English language are the PAL PB-50, CID W-22 and NU-6 (Wilson, McArdle, and Roberts Citation2008). Developing new word lists is a costly and time-consuming process: it consists of recording and editing utterances of a professional speaker, followed by listening experiments to create lists with equal recognition probabilities (ISO 8253-3:2022). Using a human speaker has some drawbacks, as it can be difficult to keep speech level and speaking rate constant (Versfeld et al. Citation2000). In addition, it is hardly possible to replace words at a later stage if, for example, a word becomes obsolete. Uttering meaningless syllables with articulation similar to that of meaningful syllables can also be challenging. Finally, it is not possible to instantaneously create new speech material for hearing training purposes. In an attempt to overcome these drawbacks, we investigated whether synthetic speech can serve as a viable alternative to natural speech in the context of speech audiometry.

With the advancement of text-to-speech (TTS) systems, the quality of synthetic speech has greatly improved over the past decade. Synthetic speech has numerous applications such as navigation, customer service, personal assistants and many more. In recent years, extensive research has been done on improving the speaking style of synthetic speech generation (Stanton, Wang, and Skerry-Ryan Citation2018; Zhang et al. Citation2019) and the efficiency of neural audio synthesis (Kalchbrenner et al. Citation2018). The Google Cloud TTS, for example, produces natural sounding speech (Grimshaw, Bione, and Cardoso Citation2018) that achieves a mean opinion score (MOS) of 4.53 (on a scale from 1 to 5) comparable to a MOS of 4.58 for professionally recorded speech (Shen et al. Citation2018). This TTS system uses the non-free Google Cloud Text-to-Speech API to convert text input into audio data of speech. It provides several voices, is available in different languages and variants, and applies DeepMind’s research in WaveNet and Google’s neural networks (van den Oord et al. Citation2016). Ibelings, Brand, and Holube (Citation2022) examined the subjective assessments of three TTS systems in the German language. They reported that the TTS system from the Acapela Cloud Service (Acapela Group, Solna, Sweden) outperformed Google Cloud TTS in terms of naturalness and overall impression, even though both systems were often rated as similar.

A few studies have explored the use of synthetic speech in speech-in-noise tests. Nuesse et al. (Citation2019) showed that synthetic German matrix sentences could replace recordings from natural speakers for testing speech recognition in noise. Ibelings, Brand, and Holube (Citation2022) found that, although the subjective ratings of the TTS system were worse than those of the natural reference, synthetic speech led to significantly better speech recognition scores in noise than natural speech. However, the difference in speech recognition scores was within the range of differences found between natural speakers. Ibelings, Brand, and Holube (Citation2022) also reported that listening effort was comparable for synthetic and natural speech. Schwarz et al. (Citation2022) compared speech recognition functions of German words in quiet (the Freiburg monosyllabic speech test) in the original (natural) voice and a synthetic voice and found no significant difference in the mean 50% correct level or the mean slope of the speech recognition function.

The Dutch lists of consonant-vowel-consonant (CVC) words (Bosman and Smoorenburg Citation1995) are the standard speech material used in Dutch clinics for performing speech audiometry in quiet. The material contains 15 lists of 12 CVC words each, mostly nouns, which commonly occur in the Dutch language. The words were uttered by a female speaker. Bosman (Citation1989) argued that strict phonetic balancing of the lists would not result in equally intelligible lists, as in Dutch certain phonemes predominantly occur in particular positions, making them highly predictable. Moreover, when compiling a phonetically balanced list, many syllables per list are required to accommodate the rare phonemes, and a long list of syllables does not necessarily imply high accuracy, as listeners may lose attention. Therefore, he did not phonetically balance the CVC lists. Instead, the phonetic differences were minimised in an effort to decrease between-list variability. This was done by selecting the initial consonants, vowels and final consonants from three sets of phonemes. The set of initial consonants consisted of /t, k, χ, b, d, v, z, n, l, j, w, h/, the set of vowels of /ɑ, ɛ, ɪ, ɔ, i, u, a, e, o, ɑu, ɛi/, and the set of final consonants of /p, t, k, f, s, χ, m, n, ŋ, l, j, w/. The vowels were balanced across lists, the consonants were not. Bosman also argued that the average frequency of occurrence of a word influences word perception, so this could lead to inequality of the lists in terms of intelligibility. The use of meaningless syllables would solve this particular problem, but would most likely result in a bias towards meaningful words, especially in elderly participants (Bosman Citation1989).

Word lists must be equally intelligible so that the scores per list are comparable when all other factors (i.e. listener, listening condition, etc.) are fixed (ISO 8253-3:2022). If not, the clinician is unable to determine whether differences in scores are due to differences in performance or differences in lists. Since the introduction of the CVC speech material more than 30 years ago, the CVC lists have been widely used in Dutch clinics and hearing centres to assess speech recognition ability in quiet and to evaluate the fitting of hearing aids or CIs. However, it has not been thoroughly investigated whether these lists are actually equally intelligible for both normal-hearing people and people with hearing loss. Bosman did evaluate the lists with 30 normal-hearing listeners, but at the time found the differences in intelligibility among the lists small enough for clinical applications (Bosman, personal communication). de Graaff et al. (Citation2018) found that the lists are not equally intelligible for CI users, as the CVC list averages differed significantly in the same condition across a group of CI users. They recommended counterbalancing lists across the participants with CIs or randomising the order of presentation of the lists when using the current set of Dutch CVC lists in studies. In addition, they suggested creating new CVC lists with equal intelligibility for both cochlear implant users and normal-hearing individuals. Note that the variability between speech list items is larger in hearing impaired (HI) listeners than in normal-hearing listeners (Bosman Citation1989). TTS systems can potentially facilitate the process of creating different types of speech material, for different target groups and for listeners with all kinds of hearing loss.

When creating new speech recognition tests, the standard approach is to apply level corrections to the speech list items to create perceptually balanced lists for normal-hearing listeners (ISO 8253-3:2022). However, this was not done for the standard Dutch CVC lists. Therefore, the first aim of the present study was to determine whether there is a list effect in the original natural CVC material in normal-hearing listeners. The second aim was to determine if we can use synthetic speech in speech audiometry for creating Dutch lists of CVC words. Specifically, we recreated Bosman’s (Citation1989) CVC lists with a text-to-speech (TTS) system and compared the intelligibility of both speech materials, the variability between lists and typical listener errors (Experiment 1). Overall, we hypothesised that synthetic speech is well suited for applications in clinical audiology because utterances are very constant in articulation, speaking rate and level. Based on these results, Experiment 2 was conducted to investigate whether the found list effect in the original CVC material can be reduced by combining lists into list pairs. This would allow Dutch audiologists to continue to use the current standard material and keep the results comparable with previous measurements.

Experiment 1

Participants

Twenty-four native Dutch-speaking adults (18–24 years, M = 21 ± 2 years, 10 males) participated. Normal-hearing individuals with pure-tone thresholds of ≤20 dB HL at all octave frequencies from 0.25 to 8 kHz and a type A tympanogram (i.e. indicating normal middle ear functioning) in at least one ear were included. Only one ear was tested in the subsequent speech recognition experiments. The mean pure-tone average [PTA] at 0.5, 1, 2, and 4 kHz of the tested ears was 3.1 ± 2.9 dB HL. Left and right ears were alternated between participants, unless only one ear qualified. The participants were recruited from the university campus. They were paid and gave written informed consent. The VUmc medical research ethics committee declared that the study did not need further ethics evaluation (protocol number: 2020.0758).

Materials

The original (natural) CVC material (Bosman and Smoorenburg Citation1995), i.e. 15 lists of 12 CVC words each, as well as the same CVC words generated synthetically with TTS software, was used for the listening experiment. The original material was read out by a female speaker. We applied the default scoring method, which is the percentage of correctly repeated phonemes while omitting the first word of each list. Hence, each list comprised 33 phonemes in 11 words that were scored as either correct or incorrect. The original materials were recorded with a Brüel & Kjaer 4155 microphone and a Brüel & Kjaer 2230 sound level meter connected to a Luxman KD 117 DAT recorder (16-bit resolution, 48 kHz sampling rate) (Bosman and Smoorenburg Citation1992).

The synthetic material contained the same words as the original, and it was created with a female synthetic voice, namely the Google Cloud TTS voice (nl-NL Wavenet D). Various TTS systems were considered for this study. We ultimately chose Google Cloud TTS because of its high quality speech, its published technology, and the ease of use of an API to create mass speech materials. Google Cloud TTS produced separate 16-bit WAV files for each word at a sample rate of 44.1 kHz.

After the synthetic speech material was created, the spectrum was adjusted to equate the long-term average speech spectrum (LTASS) of the synthetic speech with the natural speech. This was achieved by determining the difference in the power spectra, in 0.1 kHz frequency bins, between all natural spoken words and all synthetic spoken words using the “To Ltas” command from PRAAT (Boersma and Weenink Citation2007). Then, for each synthetic spoken word, these values were used to change the power spectrum in each frequency bin. The LTASS of the original synthetic speech was not very different from the LTASS of the natural speech (SD of 1/3 octave differences between approximately 0.5 and 8 kHz of 3.4 dB) which reduced to an SD of 0.47 dB after the correction. The LTASS equalisation was done to prevent differences in intelligibility between the natural and synthetic speech material being caused by differences in the LTASS. Next, the root mean square (RMS) levels of all synthetic words were equalised (SD of the level corrections was 0.3 dB).
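The per-bin spectral matching described above can be sketched in a few lines. The following is a minimal NumPy illustration under our own naming (`band_power`, `match_ltass` are not from the study, which used PRAAT’s “To Ltas” command; details such as windowing and exact band handling will differ):

```python
import numpy as np

def band_power(x, fs, bin_hz=100.0):
    """Average power per frequency bin (bin_hz wide, i.e. 0.1 kHz) of signal x."""
    spec = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    edges = np.arange(0.0, fs / 2 + bin_hz, bin_hz)
    idx = np.digitize(freqs, edges) - 1
    n_bins = len(edges) - 1
    power = np.zeros(n_bins)
    for b in range(n_bins):
        sel = idx == b
        power[b] = spec[sel].mean() if sel.any() else 1e-12
    return power

def match_ltass(word, fs, gain_db, bin_hz=100.0):
    """Apply a per-bin gain (in dB) to a word's spectrum and resynthesise.

    gain_db would be 10*log10(natural_power / synthetic_power), computed
    from the corpus-average band powers of the two speech materials.
    """
    spec = np.fft.rfft(word)
    freqs = np.fft.rfftfreq(len(word), 1.0 / fs)
    bins = (freqs // bin_hz).astype(int).clip(0, len(gain_db) - 1)
    spec *= 10.0 ** (gain_db[bins] / 20.0)
    return np.fft.irfft(spec, n=len(word))
```

After this step, a final broadband scaling per word would equalise the RMS levels, as described above.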

The presentation levels were based on speech recognition functions reported for normal-hearing listeners in the literature (Bosman Citation1989; Bosman and Smoorenburg Citation1995) and on pilot testing. The presentation levels were 17, 22, 27, 32, 37 and 65 dB SPL, and were the same for the natural and synthetic speech. With the lowest five levels, we aimed to achieve speech recognition scores between 30 and 90% in order to precisely determine the speech recognition function per list. We also presented speech at 65 dB SPL to determine whether a score of 100% could be achieved. The level of 65 dB SPL is representative of the speech level in daily life and is commonly used to assess speech recognition in quiet in clinical practice in the Netherlands.

Test procedure

A within-subject repeated-measures experimental design was used. The testing took place in a sound-treated booth and all stimuli were presented monaurally via Sennheiser HDA200 headphones through a Sound Blaster Creative soundcard (THX) connected to a PC (Dell Optiplex 780). Half of the participants started with the natural speech condition and the other half with the synthetic speech. Ideally, each participant would listen to all lists in both conditions at all six presentation levels. However, this would take far too long and the same words would appear very often (12 times) to the same listener, which could cause learning effects. Therefore, we presented each participant with all 15 lists twice in each speech condition, at two (instead of all six) presentation levels. Averaged across all participants, every list was presented 8 times at each of the 6 presentation levels. Per participant, the order of the 15 lists was kept fixed throughout the whole experiment so that presentations of the same list were as far apart as possible, to reduce learning effects. To make the experiment less fatiguing, the order of the presentation levels alternated between easier (32, 37 and 65 dB SPL) and more difficult (17, 22 and 27 dB SPL) levels. The order of the presentation levels was kept fixed for both speech materials within each participant. The experiment started with a practice list that was not scored, to absorb a potential learning effect. The practice list consisted of synthetic CVC words that did not occur in the original lists.

The participants were verbally instructed by the experimenter to repeat exactly what they had heard, even if it was a single phoneme or a nonsense syllable. The experimenter was the same throughout the whole experiment for all participants. Stimuli were not repeated and no feedback was provided. All answers were scored by the experimenter and the setup was double-blind to avoid biased results.

Statistical analysis

The approach proposed by Thornton and Raffin (Citation1978) is often used to calculate the test-retest reliability of speech recognition test scores. It assumes the same recognition probability for each item in a list. Hagerman (Citation1976) proposed a more accurate approach that takes into account the individual recognition probabilities of items in a list. Recently, Holube, Winkler, and Nolte-Holube (Citation2020a, Citation2020b) provided a general approach that took into account differences in recognition probabilities within lists and variance between lists. Equations (3) to (12) from Holube, Winkler, and Nolte-Holube (Citation2020a), using the approximation of Wilson (Citation1927), were used to calculate the confidence intervals. Note that the list scores and confidence intervals in Experiment 1 provide an estimate of the average recognition score across three presentation levels (22, 27 and 32 dB SPL).
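For a single list of n scored phonemes, the Wilson (1927) interval underlying these calculations can be sketched as follows. This shows only the Wilson component; the full approach of Holube, Winkler, and Nolte-Holube (Citation2020a) additionally accounts for unequal recognition probabilities within lists and between-list variance, which this sketch omits:

```python
import math

def wilson_ci(correct, n, z=1.96):
    """Wilson score interval (default 95%) for a proportion of correct items."""
    p = correct / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return centre - half, centre + half
```

For example, `wilson_ci(23, 33)` gives the interval for a single list (33 phonemes), and `wilson_ci(46, 66)` the narrower interval obtained at the same score when two lists are pooled.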

Levene’s test was used to assess whether the natural and synthetic SRTs and list scores had equal variances. To test for a list effect, a linear mixed model (LMM) was fitted to the data, with list score as the response variable, list number as a categorical covariate, presentation level as a three-level ordinal covariate (levels representing 22, 27 and 32 dB SPL), and random intercepts to control for intra-participant correlation. This model, referred to as the “full model”, was compared to a reduced model in which the “list number” covariate was omitted. The goodness-of-fit of the two models was then compared with a likelihood ratio test. A significant difference indicates that “list” contributes significantly to the score.
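The likelihood ratio test itself reduces to a chi-square test on twice the log-likelihood difference, with degrees of freedom equal to the number of parameters dropped in the reduced model (14 list coefficients for 15 lists). A minimal sketch, assuming the two log-likelihoods have already been obtained from fitted full and reduced models (`likelihood_ratio_test` is our own name):

```python
from scipy.stats import chi2

def likelihood_ratio_test(ll_full, ll_reduced, df):
    """Return the LR statistic and p-value for two nested models.

    ll_full / ll_reduced: maximised log-likelihoods of the nested models.
    df: difference in number of free parameters between the models.
    """
    lr = 2.0 * (ll_full - ll_reduced)
    return lr, chi2.sf(lr, df)
```

A small p-value means the full model fits significantly better, i.e. the “list” covariate contributes to the score.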

Results

The list-specific speech recognition (performance-intensity) functions for natural and synthetic speech are shown in Figure 1. Each dot represents the list-specific mean proportion correct of 8 normal-hearing participants for the given presentation level (not every participant was presented with every list at every presentation level). The lines show the maximum likelihood fits of logistic functions to the raw data per CVC list. The logistic function has two parameters: the speech recognition threshold (SRT) and the slope at 50% correct. The SRT is the speech level at which an individual can recognise 50% of the phonemes. The scores at 65 dB SPL were not used for the estimation of these functions. Figure 1 shows that the speech recognition functions per list were more homogeneous for the synthetic material than for the natural material. The mean SRT of the natural speech material (20.9 ± 1.8 dB SPL) was significantly lower than the mean SRT of the synthetic speech material (22.5 ± 0.7 dB SPL) (paired samples t-test, t(14) = 3.72, p = 0.006), with a significantly lower variance in the latter (Levene’s test, F(1,28) = 9.36, p = 0.005). The slopes of the synthetic lists (M = 3.8 ± 0.4%-points/dB) were significantly steeper than those of the natural lists (M = 3.4 ± 0.3%-points/dB) (paired samples t-test, t(14) = 3.02, p = 0.009). At 65 dB SPL, the participants reached average scores of 99.4 and 99.0% correct for the natural and synthetic speech, respectively.

Figure 1. Speech recognition functions (percent correct as a function of presentation level) for the original, natural CVC word lists (left panel) and for synthetic CVC word lists (right panel) for normal-hearing listeners. Each dot shows the average score of 8 participants. The lines show the maximum likelihood fits to the raw data.

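A two-parameter logistic function of this kind can be fitted by maximum likelihood in a few lines. The sketch below uses SciPy and the common parameterisation in which the slope at the 50% point equals the logistic rate divided by four; the fitting procedure actually used in the study may differ in detail, and `fit_psychometric` is our own name:

```python
import numpy as np
from scipy.optimize import minimize

def fit_psychometric(levels, n_correct, n_total):
    """ML fit of a logistic function to binomial phoneme scores.

    Returns (SRT in dB SPL, slope at 50% correct in %-points/dB).
    """
    levels = np.asarray(levels, float)
    k = np.asarray(n_correct, float)
    n = np.asarray(n_total, float)

    def nll(params):
        # Binomial negative log-likelihood of the logistic model.
        srt, beta = params
        p = 1.0 / (1.0 + np.exp(-beta * (levels - srt)))
        p = np.clip(p, 1e-9, 1 - 1e-9)
        return -np.sum(k * np.log(p) + (n - k) * np.log(1 - p))

    res = minimize(nll, x0=[np.median(levels), 0.2], method="Nelder-Mead")
    srt, beta = res.x
    # The derivative of the logistic at its midpoint is beta / 4.
    return srt, 100.0 * beta / 4.0
```

Fitting each list separately in this way yields the per-list SRTs and slopes that are compared between the natural and synthetic materials.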

Figure 2 shows the mean percent correct scores (i.e. list scores, based on 33 phonemes per list) with 95% confidence intervals for the natural (left panel) and synthetic (right panel) lists, averaged over presentation levels 22, 27, and 32 dB SPL. These three presentation levels yielded scores between 50 and 100% correct and are therefore the most clinically relevant (ISO 8253-3:2022). The dots in Figure 2 are based on 24 data points each (8 participants were presented with the same list at the same presentation level, and here, 3 presentation levels are pooled). The list scores of the natural speech ranged from 57.1% (list 6) to 77.8% (list 3). The variance in CVC list scores was significantly higher for natural speech (Figure 2, left panel) than for synthetic speech (Figure 2, right panel) (F(1,28) = 12.26, p = 0.002), indicating that the synthetic voice creates more equally intelligible lists.

Figure 2. Mean correct scores and 95% confidence intervals per CVC list for the natural (left panel) and synthetic (right panel) speech material, averaged over 3 presentation levels (22, 27 and 32 dB SPL).


There was a significant list effect in the original (natural) CVC material (χ2(14) = 119.97, p < 0.001). The list effect was still present when list 6 was omitted from the data (χ2(13) = 89.15, p < 0.001). There was no significant list effect in the synthetic lists (χ2(14) = 18.39, p = 0.19).

The phoneme confusions of the CVC syllables were very similar for natural and synthetic speech (Supplementary Material). To test for differences in phoneme confusions between natural and synthetic speech, we used the method proposed by Jürgens and Brand (Citation2009). This revealed no significant differences in phoneme confusions between the two voices, except for the consonant /d/, where the percentage correct scores for the natural and synthetic speech were 63% (95% confidence interval [58–68%]) and 47% [42–52%], respectively. All other phoneme confusions differed by no more than 10% between the two voices.

Finally, we checked whether word scores of frequently used words are different from less frequently used words. There was no significant correlation between the average word scores and their word frequency in daily use as retrieved from the “Wordfreq” open data set (Speer Citation2022) (Pearson correlation, R = 0.18).

Experiment 2

Experiment 1 showed that there is a significant list effect in the original CVC material. This has important practical consequences for Dutch speech audiometry: test conditions (for example, with and without a hearing aid) cannot be accurately compared, and speech recognition over time cannot be properly measured, if the existing list effect is ignored. To offer Dutch audiologists a practical solution, the aim of Experiment 2 was to investigate whether the list effect can be averaged out, or at least reduced, by using combinations of an easier and a more difficult list.

Participants

Twenty-eight new participants were recruited (mean age 25; range 20–32 years; 19 females). Pure-tone thresholds of both ears were better than or equal to 20 dB HL for all octave frequencies from 0.25 to 8 kHz. All participants gave written informed consent.

CVC list pairing

Based on the results from Experiment 1, we paired the original lists two by two so that the average scores of the seven list pairs were as similar as possible. Specifically, the lists were first ranked from worst to best scoring list and then the worst scoring list was paired with the best scoring list, the second worst list was paired with the second best list, etc. We chose to omit list 6 from the pairing as it yielded the lowest scores in both speech materials. The pairs were based on the average natural list scores at 22, 27 and 32 dB SPL and are: 3&8, 2&14, 1&15, 11&13, 4&7, 5&10 and 9&12. The order of the two lists within a pair was always the same.
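The ranking-and-pairing step can be expressed compactly. The sketch below is our own minimal illustration (`pair_lists` is not from the study), pairing the worst-scoring list with the best, the second worst with the second best, and so on:

```python
def pair_lists(scores):
    """Pair list numbers so each pair's mean score is close to the grand mean.

    scores: dict mapping list number -> mean list score (%).
    Returns pairs of (harder list, easier list).
    """
    ranked = sorted(scores, key=scores.get)          # worst ... best
    half = len(ranked) // 2
    return [(ranked[i], ranked[-1 - i]) for i in range(half)]
```

Applied to the fourteen average natural list scores from Experiment 1 (with list 6 omitted), this procedure yields the seven pairs listed above; each pair’s mean sits near the grand mean, which is what reduces the list effect.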

Test procedure

The test equipment (audiometer, soundcard, headphones and computer) and experimenter were the same as in Experiment 1. A within-subject repeated-measures design was used. Pure-tone thresholds were measured prior to the speech recognition test. Each participant was presented with the 7 list pairs consecutively, all at a fixed presentation level of 30 dB SPL, consistent with clinical practice, in which presentation levels are multiples of 5 dB. The order of the list pairs was counterbalanced across participants in a Latin-square manner. The experiment again started with a practice list that was not scored, to absorb a potential learning effect. Participants were instructed in the same way as in Experiment 1. The total duration of the testing procedure was ∼30 min.

Statistical analysis

Confidence intervals of the individual lists and list pairs were calculated according to Holube, Winkler, and Nolte-Holube (Citation2020a), as described in Experiment 1, except that now there was only one presentation level and all participants were presented with all lists (and therefore all list pairs). A list (pair) effect was also tested, as in Experiment 1. The approach by Holube, Winkler, and Nolte-Holube (Citation2020b) was used to estimate an overall confidence interval for a single list or a single list pair, taking into account between-list or between-list pair variability.

Results

The average list scores and average list pair scores with 95% confidence intervals are presented in the left and right panels of Figure 3, respectively. The individual list averages (Figure 3, left panel) are strongly correlated with those found in Experiment 1 (Figure 2, left panel; Pearson correlation, R = 0.89), and there is again a significant list effect on scores when comparing the full model with random intercepts and list as predictor to the random-intercepts-only model (χ2(13) = 107.07, p < 0.001), confirming the findings from Experiment 1. The difference between the highest and lowest scoring lists is 14.9%, similar to Experiment 1 when list 6 (the list omitted in Experiment 2) is left out (15.3%). For the list pairs (Figure 3, right panel) there is still a significant list pair effect (χ2(6) = 24.44, p < 0.001). The difference between the highest and lowest scoring list pairs, however, is only 5.0%. The confidence intervals of the list pair scores are, as expected, smaller than those of the individual lists because they are based on twice the number of phonemes: a list pair can be interpreted as an individual list with 66 phonemes. Note that the confidence intervals for individual list scores (Figure 3, left panel) are larger than those found in Experiment 1 because there was only one presentation level as opposed to three.

Figure 3. Mean correct scores per CVC list (left panel) and per CVC list pair (right panel) at 30 dB SPL presentation level. The confidence intervals are larger here than in Experiment 1 because there was only one presentation level.


The mean score did not increase over time, indicating no training effect in the speech task after the practice list at the start of the test (simple linear regression, slope = −0.1%-points per list pair, p = 0.384).

The between-list standard deviation (SD), i.e. the SD among average test list scores, was 4.3%; the between-list pair SD was 1.8%. Note that list 6 was omitted from this experiment, and the between-list variability would likely have been even greater if it had been included, as it yields the lowest scores of all lists. The mean list score (and therefore also the mean list pair score) was 71.7%. The lower and upper bounds of the 95% confidence interval for the true value of an individual list score of 71.7% were 55.8 and 84.6%, and 61.6 and 80.2% for a true list pair score of 71.7%, taking into account both the variability in phoneme recognition probabilities and between-list (pair) variability.

Discussion

The first aim of this study was to evaluate whether the lists of standard speech material used in Dutch clinical audiometry, i.e. the CVC lists, are equally intelligible. Our results show a significant list effect. Thus, the mean intelligibility of the natural CVC lists is not the same for all lists, with average list scores ranging from 57.1% (ME = 1.8%; list 6) to 77.7% (ME = 1.5%; list 3), averaged across presentation levels of 22, 27 and 32 dB SPL. This is an intelligibility difference of more than 20% between CVC lists that are assumed to be interchangeable. The score of the lowest scoring list (list 6) was much lower than that of the second lowest scoring list (62.1%, ME = 1.7%, list 8).

From our results, it seems that a similar phonetic composition of each list is insufficient to ensure equal intelligibility across the lists. Several factors can cause list effects in CVC material. Firstly, only the vowel composition of each list was constant; the imbalance in consonant composition may at least partly contribute to intelligibility differences. Secondly, lexical properties play a role in word recognition (Winkler, Carroll, and Holube Citation2020). For example, not all words occur equally often in everyday speech, and common words are recognised faster, the so-called word frequency effect (Broadbent Citation1967). However, this does not explain our results, as the correlation between the average word scores and their word frequency in daily use, as retrieved from the “Wordfreq” open data set (Speer Citation2022), is not significant (R = 0.18). Note that “Wordfreq” data are based on written language, not spoken language. Another lexical parameter is neighbourhood density, which refers to the number of words that are similar to a target word (Luce and Pisoni Citation1998). Words with many similar words, or neighbours, are said to have a dense neighbourhood, while words with few neighbours have a sparse neighbourhood. Both word frequency and neighbourhood density have been shown to affect speech recognition of monosyllabic words (Winkler, Carroll, and Holube Citation2020). It may be that the neighbourhood density of the words differs across lists; however, this is beyond the scope of the current study. Thirdly, no level corrections were applied to the words to create perceptually balanced lists. These factors, however, do not explain why there is significantly more variation between the natural lists than between the synthetic lists (p = 0.002). This is visible in Figure 1, where the speech recognition functions of the synthetic lists are more similar to each other than those of the natural lists.
In addition, no significant list effect was found in the synthetic lists, as opposed to the natural lists. Nuesse et al. (Citation2019) found that the speech production of the TTS system was at least as homogeneous as natural speech. Ibelings, Brand, and Holube (Citation2022) and Schwarz et al. (Citation2022) found similar SRTs and slopes for synthetic and natural speech. Presumably there are more irregularities in the pronunciation of a human speaker than of a synthetic speaker, and therefore a TTS system may produce more equally intelligible lists. This is an important and promising advantage of using synthetic speech when creating audiometric speech material in the future. Another major benefit of using TTS systems to create speech material is the considerably reduced development cost and evaluation time. Finally, TTS systems offer the possibility to produce any amount of additional material at a later stage.

The second aim of this study was to assess whether synthetic speech material can be used as an alternative to natural speech material for speech audiometry. Our results demonstrate that this is the case, at least for Dutch CVC words. The mean SRT of the synthetic speech material (22.5 dB SPL) was significantly higher (poorer) than the natural one (20.9 dB SPL) (p = 0.002). This is in line with some studies that found that TTS voices are generally less intelligible than natural voices (Aoki, Cohn, and Zellou Citation2022; Simantiraki, Cooke, and King Citation2018). However, other publications found no difference in intelligibility (Ibelings, Brand, and Holube Citation2022; Schwarz et al. Citation2022). Nevertheless, this average SRT-difference was small (1.6 dB) and not clinically relevant for speech recognition in quiet. In addition, phoneme confusions of both voices were highly correlated (R = 0.62), although this is strongly influenced by the use of meaningful words, which reduces the number of candidate phonemes. Note that the mean SRT of the natural CVC lists (20.9 dB SPL) is very close to the value of 20.7 dB SPL found by Bosman (Citation1989). At 65 dB SPL, all synthetic words were nearly perfectly intelligible, as the average synthetic speech recognition scores reached 99% at this level. Schwarz et al. (Citation2022) found that some isolated synthetic words in quiet were recognised significantly worse by the participants due to unnatural pronunciation. They used the TTS system of Acapela Group (Mons, Belgium) and found no significant difference between the synthetic and natural SRTs. The same Acapela TTS system was used by Nuesse et al. (Citation2019) to produce German Matrix sentences in noise. Ibelings, Brand, and Holube (Citation2022) used an improved version of the Acapela TTS system (Acapela Group, Solna, Sweden) based on deep neural networks to produce speech material of the GÖSA (i.e. everyday sentences in noise). 
All three studies concluded that German synthetic speech could replace recordings of natural speech for speech recognition tests in both quiet (Schwarz et al. 2022) and noise (Ibelings, Brand, and Holube 2022; Nuesse et al. 2019). Our results show that their findings generalise to speech recognition tests in linguistically similar languages such as Dutch, across different TTS systems and different speech material. Future research should determine whether the use of synthetic speech in speech tests is also suitable for unrelated languages (e.g. tonal languages).
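The SRT and slope values discussed above come from fitting a speech recognition function to scores measured at several presentation levels. The sketch below illustrates the principle with a toy logistic model and a brute-force grid search on made-up data; it is not the estimation procedure used in this study, and all values are illustrative assumptions.

```python
import math

def recognition_prob(level, srt, slope):
    """Logistic speech recognition function: proportion correct at a given
    presentation level (dB SPL). 'srt' is the 50%-correct level; 'slope'
    is the steepness at the SRT in proportion correct per dB."""
    return 1.0 / (1.0 + math.exp(-4.0 * slope * (level - srt)))

def fit_srt(levels, scores):
    """Toy least-squares grid search for (srt, slope); only an
    illustration of the principle, not a maximum-likelihood fit."""
    best = (float("inf"), None, None)
    for srt in (s / 10.0 for s in range(100, 401)):       # 10.0 .. 40.0 dB
        for slope in (k / 100.0 for k in range(2, 31)):   # 0.02 .. 0.30 per dB
            err = sum((recognition_prob(l, srt, slope) - p) ** 2
                      for l, p in zip(levels, scores))
            if err < best[0]:
                best = (err, srt, slope)
    return best[1], best[2]

# Hypothetical proportion-correct scores measured at four fixed levels
levels = [15, 20, 25, 30]
scores = [0.12, 0.45, 0.80, 0.95]
srt, slope = fit_srt(levels, scores)
print(f"SRT = {srt:.1f} dB SPL, slope = {slope:.2f} per dB")
```

With these made-up data points the fitted SRT lands near 20–21 dB SPL, which is the kind of value the comparisons above are based on.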

A disadvantage of the Google Cloud TTS system used in our study is that it is a cloud-hosted service whose output may change over time. The TTS application utilises the latest innovations in deep learning, including neural network-based speech synthesis, automatic speech segmentation, and word pronunciation modelling. Further progress in the field of TTS involves the development of larger and more comprehensive models, along with the use of larger datasets (Mehrish et al. 2023). Thus, although the output (in our case, the speech material) can change, it is reasonable to assume that the quality of the speech will only improve. On the other hand, the more realistic (i.e. human-like) synthetic speech becomes, the more likely it is that variability in pronunciation, articulation, speaking rate, etc. will be introduced. Reproducible speech synthesis would require capturing the state of the TTS system, which seems challenging given the steady innovations in this field. However, a similar argument applies to the reproducibility of the speech of a human speaker, which also changes over time.

Combining CVC lists and effects on reliability

The standard CVC lists are not equally intelligible for normal-hearing listeners. This is in line with previous research in cochlear implant users (de Graaff et al. 2018). The standard approach to minimising differences among speech lists is to apply level corrections that make each word (and therefore each list) equally intelligible (ISO 8253-3:2022). This approach has the advantage of providing steep speech recognition functions, and thus accurate estimates of the SRTs and accurate comparisons between patients and conditions. So, although our results clearly indicate that such adjustments would lead to a better Dutch test for speech audiometry, they would have significant drawbacks. Most importantly, the scores of the modified speech lists would not be comparable to previous scores, except for scores of 50%. The original CVC lists have been used in research and in clinics for more than 30 years, making the impact of such a change very large. Moreover, determining and applying level corrections for the words is a tedious and time-consuming task, requiring a very large number of participants in a listening experiment. It would be possible to interchange words between lists, based on the individual word scores obtained in this study, in order to equalise list intelligibility. However, the scores of the newly assembled lists would still not be comparable to previous results, and the interchange could further compromise the phonetic balance between lists. Therefore, we suggest omitting list 6 and using list pairs to reduce the list effect. In Experiment 2, we found that combining a "harder than average" list with an "easier than average" list greatly reduces the list effect (see right panel of ). We used the approach of Holube, Winkler, and Nolte-Holube (2020b) to estimate an overall confidence interval when using a single list or one of the proposed list pairs.
The binomial distribution, as suggested by Thornton and Raffin (1978), is often used for this purpose, but it neglects differences in recognition probabilities between the phonemes in a list and differences between lists. For example, the phoneme scores in List 1 range from 0.13 (word 2) to 0.88 (word 12), and these differences need to be taken into account (Supplementary Material Table 1, natural speech). The suggested list pairs reduce the width of the confidence interval from 28.7% to 18.6% around a score of 71.7%. The difference between the highest and lowest scoring list pairs is 5% (Experiment 2), as opposed to 20% for the individual lists (Experiment 1). We estimated the SRTs and slopes of the speech recognition functions for the seven list pairs using the data from Experiment 1. The range of the seven SRTs was only 1.8 dB, suggesting that the speech recognition functions of the list pairs are very similar to the average speech recognition function.
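The idea of combining a "harder than average" list with an "easier than average" list can be sketched as a simple greedy procedure: rank the lists by their mean score and pair the hardest remaining list with the easiest one, so that each pair's mean lands close to the overall mean. The per-list scores below are hypothetical illustrative values, not data from the experiments, and this is not necessarily the exact pairing procedure used in Experiment 2.

```python
def make_list_pairs(scores):
    """Greedy pairing: repeatedly pair the hardest remaining list
    with the easiest remaining one. 'scores' maps list ID to mean
    recognition score in percent."""
    ranked = sorted(scores, key=scores.get)  # list IDs, hardest first
    pairs = []
    while len(ranked) >= 2:
        hard = ranked.pop(0)
        easy = ranked.pop(-1)
        pairs.append(((hard, easy), (scores[hard] + scores[easy]) / 2))
    return pairs

# Hypothetical per-list scores in percent (illustrative values only)
example_scores = {1: 62.0, 2: 80.0, 3: 70.0, 4: 75.0, 5: 66.0, 7: 78.0}
pairs = make_list_pairs(example_scores)
for (hard, easy), mean in pairs:
    print(f"lists {hard}+{easy}: pair mean {mean:.1f}%")
```

In this toy example the individual list scores span 18 percentage points, whereas the pair means span only 1.5 points, mirroring the reduction from 20% to 5% reported above.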

Holube, Winkler, and Nolte-Holube (2020a, 2020b) used the Poisson binomial distribution to determine confidence intervals and critical differences for speech recognition scores. They showed that the confidence interval for a list score becomes smaller when the variability between the items in a list increases, whereas between-list variance widens the confidence interval for the score on a randomly chosen list. Wide confidence intervals and large critical differences are unwanted or even problematic. Usually, in clinical practice, the speech recognition function (the part between 50% and 100% intelligibility) is estimated by presenting lists at different presentation levels. The effect of between-list variability is likely to be less prominent this way, because the list differences are averaged out. However, when evaluating hearing aid or cochlear implant fittings, usually only one or two lists are presented at a fixed level, and between-list variability can then have a large impact on the result. Note that the between-list variance is added to the total variance of the score on a (random) list and may have a relatively small effect on the size of the confidence interval for a randomly chosen list. However, this may give an incorrect indication of the effect of a strongly deviating list, such as list 6 (see Experiment 1). For example, if the scores of two conditions differ just significantly according to the overall critical difference, the difference is no longer significant if we know that the lowest score was obtained with list 6. Thus, the critical difference depends on which lists are being compared. In practice it is very inconvenient to check which lists were used and to take the specific list differences into account when testing for significance.
Therefore, we recommend minimising between-list differences and calculating confidence intervals and critical differences for test lists using the approach of Holube, Winkler, and Nolte-Holube (2020b). Omitting list 6 and using the suggested list pairs is one way to achieve this.
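The difference between the plain binomial and the Poisson binomial treatment of a list score can be illustrated with a normal-approximation sketch of the confidence-interval half-width. The per-item probabilities below are made-up values spanning a similar range to List 1; this is a simplified sketch, not the exact computation of Holube, Winkler, and Nolte-Holube (2020b), which uses the full Poisson binomial distribution rather than a normal approximation.

```python
import math

def ci_halfwidth_binomial(p_mean, n, z=1.96):
    """95% CI half-width (normal approximation) for the mean score of n
    items that are all assumed to be equally difficult (plain binomial)."""
    return z * math.sqrt(p_mean * (1.0 - p_mean) / n)

def ci_halfwidth_poisson_binomial(ps, z=1.96):
    """95% CI half-width (normal approximation) when each item i has its
    own recognition probability p_i (Poisson binomial)."""
    n = len(ps)
    var_mean = sum(p * (1.0 - p) for p in ps) / n**2  # variance of the mean
    return z * math.sqrt(var_mean)

# Hypothetical per-item probabilities spanning 0.13..0.88 (illustrative)
ps = [0.13, 0.40, 0.50, 0.60, 0.70, 0.80, 0.88] * 5  # 35 items
p_mean = sum(ps) / len(ps)
hw_binomial = ci_halfwidth_binomial(p_mean, len(ps))
hw_poisson = ci_halfwidth_poisson_binomial(ps)
# Heterogeneous item difficulties REDUCE the variance of a list score,
# so the Poisson binomial interval is narrower than the plain binomial one.
print(f"binomial: +/-{hw_binomial:.3f}, Poisson binomial: +/-{hw_poisson:.3f}")
```

This reproduces the qualitative point above: within-list variability between items narrows the interval for a single list, while between-list variance (not modelled in this sketch) widens it.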

A limitation of using list pairs is that only seven pairs are available with the current material. In clinical practice it is common to repeat the same list over several appointments, for example to track progress during CI rehabilitation or to compare hearing aids, and this can potentially lead to a training effect (content learning). The results of Experiment 2 showed no learning effect, but no lists were repeated in that experiment, so a training effect from repeated lists has not been investigated here. The 15 unique word lists are available in different word orders (three per list), making a total of 45 lists, which may reduce a potential learning effect.

This study has demonstrated for the first time the list effects and the wide confidence intervals of the Dutch CVC test lists, which remain even after optimisation by combining two lists or by using synthetic speech material. Clinicians need to be aware of the limited reliability of the CVC test, especially when clinical decisions are based on it, such as the provision of a hearing aid or CI, or the comparison of different devices or settings. Presumably, other test formats such as sentence tests (Versfeld et al. 2000) yield much higher reliability for a given amount of testing time. The clinician should therefore select the most suitable test material for answering the clinical question, taking the reliability of the test outcome into account. The current paper provides the information about the Dutch CVC test needed to make this decision.

With the recommendation to combine lists, we offer Dutch audiologists a practical solution for dealing with the list effect in the existing CVC material. In the future, synthetic speech should be considered for the creation of new speech material. A synthetic speech test seems more suitable for clinical audiology than the current natural speech test because of its consistent pronunciation, ease of use, and comparable speech recognition scores per list.

Conclusion

The original CVC word lists (Bosman and Smoorenburg 1995) show wide variability in intelligibility, with some list differences exceeding 20%. Synthetic word lists show significantly less variability and yield speech recognition scores comparable to those of the natural word lists. Synthetic speech is thus a promising alternative to natural speech in speech audiometry. The list effects of the original material can be considerably reduced by combining two lists per condition. However, because even the list pairs have wide confidence intervals, any clinical decision based on CVC test results should be treated with caution.

Author contributions

C.S. designed the experiments; F.S.H. and S.P. collected and analysed the data; and S.P. wrote the manuscript. All authors discussed the results and commented on the manuscript.

Supplemental material


Acknowledgements

The authors would like to thank Hans van Beek for creating the software of the experiments. Special thanks to reviewers Birger Kollmeier and Hendrik Husstedt, and to IJA associate editor Inga Holube. Their constructive feedback and insightful comments greatly improved this paper.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The data that support the findings of this study are available from the corresponding author, S. Polspoel, upon reasonable request.

Additional information

Funding

This study was supported by Health∼Holland, Top Sector Life Sciences & Health (https://www.health-holland.com).

References

  • Aoki, N. B., M. Cohn, and G. Zellou. 2022. “The Clear Speech Intelligibility Benefit for Text-to-Speech Voices: Effects of Speaking Style and Visual Guise.” JASA Express Letters 2 (4): 045204. https://doi.org/10.1121/10.0010274
  • Boersma, P., and D. Weenink. 2007. PRAAT: Doing Phonetics by Computer (Version 5.2.34) [Computer software]. http://www.fon.hum.uva.nl/praat/
  • Bosman, A. J. 1989. “Speech Perception by the Hearing-Impaired.” PhD thesis, University of Utrecht.
  • Bosman, A. J., and G. F. Smoorenburg. 1995. “Intelligibility of Dutch CVC Syllables and Sentences for Listeners with Normal Hearing and With Three Types of Hearing Impairment.” Audiology 34 (5): 260–284. https://doi.org/10.3109/00206099509071918
  • Bosman, A. J., and G. F. Smoorenburg. 1992. Woordenlijst voor spraakaudiometrie [Word List for Speech Audiometry] (Compact Disc). Gouda, the Netherlands: Electro Medical Instrument bv & Veenhuis Medical Audio b.
  • Broadbent, D. E. 1967. “Word-Frequency Effect and Response Bias.” Psychological Review 74 (1): 1–15. https://doi.org/10.1037/h0024206
  • de Graaff, F., E. Huysmans, P. Merkus, S. Theo Goverts, and C. Smits. 2018. “Assessment of Speech Recognition Abilities in Quiet and in Noise: A Comparison Between Self-Administered Home Testing and Testing in the Clinic for Adult Cochlear Implant Users.” International Journal of Audiology 57 (11): 872–880. https://doi.org/10.1080/14992027.2018.1506168
  • Grimshaw, J., T. Bione, and W. Cardoso. 2018. “Who’s Got Talent? Comparing TTS Systems for Comprehensibility, Naturalness, and Intelligibility.” In P. Taalas, J. Jalkanen, L. Bradley & S. Thouësny (Eds.), Future-proof CALL: Language Learning as Exploration and Encounters – Short Papers from EUROCALL 2018 (pp. 83-88). Research-publishing.net. https://doi.org/10.14705/rpnet.2018.26.817
  • Hagerman, B. 1976. “Reliability in the Determination of Speech Discrimination.” Scandinavian Audiology 5 (4): 219–228. https://doi.org/10.3109/01050397609044991
  • Holube, I., A. Winkler, and R. Nolte-Holube. 2020a. “Does the Freiburg Monosyllabic Speech Test Contain 29 Words per List? Modeling the Reliability of the Freiburg Monosyllabic Speech Test in Quiet with the Poisson Binomial Distribution.” Zeitschrift für Audiologie (Audiological Acoustics).
  • Holube, I., A. Winkler, and R. Nolte-Holube. 2020b. “Modeling and Verifying the Test-Retest Reliability of the Freiburg Monosyllabic Speech Test in Quiet with the Poisson Binomial Distribution.” Zeitschrift für Audiologie (Audiological Acoustics) 2: 1–25.
  • Ibelings, S., T. Brand, and I. Holube. 2022. “Speech Recognition and Listening Effort of Meaningful Sentences Using Synthetic Speech.” Trends in Hearing 26: 23312165221130656. https://doi.org/10.1177/23312165221130656
  • Jürgens, T., and T. Brand. 2009. “Microscopic Prediction of Speech Recognition for Listeners with Normal Hearing in Noise Using an Auditory Model.” The Journal of the Acoustical Society of America 126 (5): 2635–2648. https://doi.org/10.1121/1.3224721
  • Kalchbrenner, N., E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. Oord, S. Dieleman, and K. Kavukcuoglu. 2018. “Efficient Neural Audio Synthesis.” International Conference on Machine Learning.
  • Luce, P. A., and D. B. Pisoni. 1998. “Recognizing Spoken Words: The Neighborhood Activation Model.” Ear and Hearing 19 (1):1–36. https://doi.org/10.1097/00003446-199802000-00001
  • Mehrish, A., N. Majumder, R. Bharadwaj, R. Mihalcea, and S. Poria. 2023. “A Review of Deep Learning Techniques for Speech Processing.” Information Fusion 99: 101869. https://doi.org/10.1016/j.inffus.2023.101869
  • Nuesse, T., B. Wiercinski, T. Brand, and I. Holube. 2019. “Measuring Speech Recognition With a Matrix Test Using Synthetic Speech.” Trends in Hearing 23: 2331216519862982. https://doi.org/10.1177/2331216519862982
  • Schwarz, T., M. Frenz, A. Bockelmann, and H. Husstedt. 2022. “Untersuchung einer synthetischen Stimme für den Freiburger Einsilbertest [Investigation of a Synthetic Voice for the Freiburg Monosyllabic Test].” GMS Zeitschrift für Audiologie (Audiological Acoustics) 4: Doc04.
  • Shen, J., R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, et al. 2018. “Natural TTS Synthesis by Conditioning WaveNet on MEL Spectrogram Predictions.” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4779–4783.
  • Simantiraki, O., M. Cooke, and S. King. 2018. “Impact of Different Speech Types on Listening Effort.” Interspeech 2018, 2267-2271. https://doi.org/10.21437/Interspeech.2018-1358
  • Speer, R. 2022. “Wordfreq 3.0.1.” rspeer/wordfreq: v3.0. Zenodo. https://doi.org/10.5281/zenodo.7199437
  • Stanton, D., Y. Wang, and R. J. Skerry-Ryan. 2018. “Predicting Expressive Speaking Style from Text in End-To-End Speech Synthesis.” IEEE Spoken Language Technology Workshop (SLT), 595-602. https://doi.org/10.1109/SLT.2018.8639682
  • Thornton, A. R., and M. J. Raffin. 1978. “Speech-Discrimination Scores Modeled as a Binomial Variable.” Journal of Speech and Hearing Research 21 (3): 507–518. https://doi.org/10.1044/jshr.2103.507
  • van den Oord, A., S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. 2016. “WaveNet: A Generative Model for Raw Audio.” ArXiv abs/1609.03499
  • Versfeld, N., L. Daalder, J. Festen, and T. Houtgast. 2000. “Method for the Selection of Sentence Materials for Efficient Measurement of the Speech Reception Threshold.” The Journal of the Acoustical Society of America 107 (3):1671–1684. https://doi.org/10.1121/1.428451
  • Wilson, E. B. 1927. “Probable Inference, the Law of Succession, and Statistical Inference.” Journal of the American Statistical Association 22 (158): 209–212. https://doi.org/10.1080/01621459.1927.10502953
  • Wilson, R. H., R. McArdle, and H. Roberts. 2008. “A Comparison of Recognition Performances in Speech-Spectrum Noise by Listeners with Normal Hearing on PB-50, CID W-22, NU-6, W-1 Spondaic Words, and Monosyllabic Digits Spoken by the Same Speaker.” Journal of the American Academy of Audiology 19 (6): 496–506. https://doi.org/10.3766/jaaa.19.6.5
  • Winkler, A., R. Carroll, and I. Holube. 2020. “Impact of Lexical Parameters and Audibility on the Recognition of the Freiburg Monosyllabic Speech Test.” Ear and Hearing 41 (1): 136–142. https://doi.org/10.1097/aud.0000000000000737
  • Zhang, Y.-J., S. Pan, L. He, and Z. Ling. 2019. “Learning Latent Representations for Style Control and Transfer in End-to-end Speech Synthesis.” ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6945–6949.