
Structure in talker variability: How much is there and how much can it help?

Pages 43-68 | Received 15 Jan 2018, Accepted 09 Jul 2018, Published online: 30 Jul 2018
 

ABSTRACT

One of the persistent puzzles in understanding human speech perception is how listeners cope with talker variability. One thing that might help listeners is structure in talker variability: rather than varying randomly, talkers of the same gender, dialect, age, etc. tend to produce language in similar ways. Listeners are sensitive to this covariation between linguistic variation and socio-indexical variables. In this paper I present new techniques based on ideal observer models to quantify (1) the amount and type of structure in talker variation (informativity of a grouping variable), and (2) how useful such structure can be for robust speech recognition in the face of talker variability (the utility of a grouping variable). I demonstrate these techniques in two phonetic domains—word-initial stop voicing and vowel identity—and show that these domains have different amounts and types of talker variability, consistent with previous, impressionistic findings. An R package (phondisttools) accompanies this paper, and the source and data are available from osf.io/zv6e3.


Acknowledgments

I gratefully acknowledge Cynthia Clopper, Shannon Heald, Andy Wedel, and Noah Nelson for sharing their measurements of speech production data with me. Without their generosity this work would not have been possible. The techniques proposed here were originally developed jointly with Kodi Weatherholtz. I thank Florian Jaeger for feedback on earlier versions of this work, as well as Rory Turnbull and two anonymous reviewers. I also thank the developers of the R language (R Core Team, 2017) as well as the following packages: tidyverse (Wickham, 2017), rmarkdown (Allaire et al., 2017), knitr (Xie, 2015), cowplot (Wilke, 2017), mvtnorm (Genz & Bretz, 2009), and ggbeeswarm (Clarke & Sherrill-Mix, 2017).

Disclosure statement

No potential conflict of interest was reported by the authors.

ORCID

Dave F. Kleinschmidt http://orcid.org/0000-0002-7442-2762

Notes

1 There are other, potentially important uses for tracking group-specific distributions, even when they don't aid speech perception per se. For instance, listeners could use group-specific phonetic cue distributions to infer the age, gender, regional origin, etc. of an unfamiliar talker (Kleinschmidt et al., 2018), and such inferences may play an important role in coordinating group behaviour (e.g. Cohen, 2012).

2 Using the mean and cov functions in R 3.4.1 (R Core Team, 2017).
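As a minimal sketch of what this looks like in practice (the data frame `vowels` and its columns talker, vowel, F1, and F2 are hypothetical stand-ins for the corpus data, not objects from phondisttools):

    ## Estimate each talker x vowel category's cue distribution with base R's
    ## mean and cov functions.
    by_category <- split(vowels, interaction(vowels$talker, vowels$vowel),
                         drop = TRUE)
    category_stats <- lapply(by_category, function(d) {
      cues <- as.matrix(d[, c("F1", "F2")])
      list(mu    = apply(cues, 2, mean),  # per-cue means (via mean)
           sigma = cov(cues))             # 2 x 2 covariance matrix
    })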

3 All three are available as R packages on Github: nspvowels, healdvowels, and votcorpora (which contains additional VOT measurements from other sources as well).

4 Without access to the raw data, it is not possible to normalise the Heald and Nusbaum (2015) vowels.

5 See, for instance, http://stanford.edu/jduchi/projects/general_notes.pdf, p. 13. The math is the same for the univariate special case, as with VOT.
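For reference, the closed form that note 5 points to is the standard KL divergence between two multivariate normal distributions (a textbook result, restated here rather than taken from the paper):

    D_{KL}\big(\mathcal{N}(\mu_0, \Sigma_0) \,\|\, \mathcal{N}(\mu_1, \Sigma_1)\big)
      = \tfrac{1}{2}\left[ \operatorname{tr}(\Sigma_1^{-1}\Sigma_0)
        + (\mu_1 - \mu_0)^\top \Sigma_1^{-1} (\mu_1 - \mu_0)
        - k + \ln\frac{\det \Sigma_1}{\det \Sigma_0} \right]

In the univariate special case (as with VOT) this reduces to

    D_{KL}\big(\mathcal{N}(\mu_0, \sigma_0^2) \,\|\, \mathcal{N}(\mu_1, \sigma_1^2)\big)
      = \ln\frac{\sigma_1}{\sigma_0}
        + \frac{\sigma_0^2 + (\mu_0 - \mu_1)^2}{2\sigma_1^2} - \tfrac{1}{2}.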

6 This is true even when considering just F1 or F2 in isolation. The KL divergence between distributions over two independent cues is the sum of the KL divergences for the individual cues; for vowels, the F1×F2 informativity is approximately equal to the sum of the individual F1 and F2 informativities.
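To make the additivity concrete, here is a small base-R check using the closed-form normal-KL expressions from note 5 (all distribution parameters below are made up for illustration; with exactly independent cues the equality is exact, whereas in the real data it holds only approximately):

    ## KL divergence between univariate normals (closed form).
    kl_norm_1d <- function(mu0, sd0, mu1, sd1) {
      log(sd1 / sd0) + (sd0^2 + (mu0 - mu1)^2) / (2 * sd1^2) - 0.5
    }

    ## KL divergence between multivariate normals (closed form).
    kl_norm_mvn <- function(mu0, Sigma0, mu1, Sigma1) {
      k <- length(mu0)
      0.5 * (sum(diag(solve(Sigma1) %*% Sigma0)) +
             t(mu1 - mu0) %*% solve(Sigma1) %*% (mu1 - mu0) -
             k + log(det(Sigma1) / det(Sigma0)))
    }

    ## Treat F1 and F2 as independent (diagonal covariance):
    kl_f1   <- kl_norm_1d(500, 50, 550, 60)
    kl_f2   <- kl_norm_1d(1500, 100, 1400, 120)
    kl_f1f2 <- kl_norm_mvn(c(500, 1500), diag(c(50, 100)^2),
                           c(550, 1400), diag(c(60, 120)^2))
    all.equal(kl_f1 + kl_f2, as.numeric(kl_f1f2))  # TRUE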

7 Thanks to Rory Turnbull for suggesting this interpretation.

8 I refer to previous evidence for more talker variability in vowels than stop voicing as “qualitative” because no attempt has been made to measure talker variability in a directly comparable way across the two systems, even though there have been quantitative measurements of talker variability in each system.

9 An ideal observer's actual responses (and thus its accuracy) in, for example, a phonetic classification task additionally depend on the decision rule (or loss function). However, any reasonable decision rule will be constrained by the amount of evidence in favour of the talker's intended category, and so the posterior probability of that category is a reasonable proxy for the current purpose. Also, note that using a winner-take-all decision rule with likelihoods derived from normal distributions is the same as quadratic discriminant analysis, as used for instance by Adank et al. (2004) in assessing the effectiveness of various vowel normalisation techniques.
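A minimal sketch of the posterior computation note 9 refers to, for a single univariate cue with normal likelihoods and uniform priors (the /b/–/p/ VOT parameters below are illustrative, not estimates from the corpora):

    ## Posterior probability an ideal observer assigns to each category,
    ## given a cue value x, normal category likelihoods, and prior
    ## probabilities (uniform by default).
    posterior_prob <- function(x, means, sds,
                               prior = rep(1 / length(means), length(means))) {
      lik  <- mapply(function(m, s) dnorm(x, mean = m, sd = s), means, sds)
      post <- lik * prior
      post / sum(post)
    }

    ## e.g. word-initial /b/ vs /p/ along VOT (ms); a winner-take-all rule
    ## over these posteriors is quadratic discriminant analysis, since each
    ## category has its own variance.
    posterior_prob(x = 30, means = c(b = 5, p = 60), sds = c(b = 10, p = 20))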

10 This is true even in the presence of additional (independent) cues.

11 This quantity is not exactly information in the information-theoretic sense because it's not weighted by the probability of observing cue x under the true category model.

12 This benefit is for first encountering a novel talker from a socio-indexical group prior to further adaptation. In the general discussion, I return to this point and why the utility measure might underestimate the benefit of implicit knowledge about group-specific category distributions, as this knowledge likely serves as the starting point for talker-specific adaptation (Kleinschmidt & Jaeger, 2015).

13 I do not report on significance of individual vowel effects here because they are estimated using a randomised procedure—both at the level of subsampling talkers to estimate the accuracy, and at the level of bootstrapping to estimate statistical significance—and all p>0.01 after correcting for false discovery rate. I found that, even with a reasonably large number of subsampling and bootstrap iterations (100 and 1000, respectively), individual effects that are weakly significant in one run (0.05>p>0.01) are often only “marginally significant” (0.1>p>0.05) in another. Properly assessing the reliability of these effects is best left to future experiments designed to detect them.
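As a rough sketch of the second level of this procedure, a bootstrap p-value per vowel followed by false-discovery-rate correction (the data are simulated placeholders and boot_p is an illustrative helper, not the paper's exact implementation):

    ## Bootstrap the mean of subsampled accuracy differences and compute a
    ## two-sided p-value, then correct across vowels for false discovery rate.
    set.seed(1)
    boot_p <- function(diffs, n_boot = 1000) {
      boot_means <- replicate(n_boot, mean(sample(diffs, replace = TRUE)))
      min(1, 2 * min(mean(boot_means <= 0), mean(boot_means >= 0)))
    }
    ## One vector of subsampled accuracy differences per vowel (simulated):
    diffs_by_vowel <- replicate(5, rnorm(100, mean = 0.01, sd = 0.05),
                                simplify = FALSE)
    p_raw <- vapply(diffs_by_vowel, boot_p, numeric(1))
    p_adj <- p.adjust(p_raw, method = "fdr")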

Additional information

Funding

This work was partially funded by Eunice Kennedy Shriver National Institute of Child Health and Human Development (NIH NICHD) R01 HD075797 and NIH NICHD F31 HD082893. The views expressed here are those of the author and not necessarily those of the funding agencies.
