ABSTRACT
Current theories of auditory comprehension assume that the segmentation of speech into word forms is an essential prerequisite to understanding. We present a computational model that does not seek to learn word forms, but instead decodes the experiences discriminated by the speech input. At the heart of this model is a discrimination learning network trained on full utterances. This network constitutes an atemporal long-term memory system. A fixed-width short-term memory buffer projects a constantly updated moving window over the incoming speech onto the network's input layer. In response, the memory generates temporal activation functions for each of the output units. We show that this new discriminative perspective on auditory comprehension is consistent with young infants' sensitivity to the statistical structure of the input. Simulation studies, both with artificial language and with English child-directed speech, provide a first computational proof of concept and demonstrate the importance of utterance-wide co-learning.
Acknowledgements
This research was supported by an Alexander von Humboldt research chair awarded to the first author. The authors are indebted to Petar Milin, James Blevins, and Lotte Meteyard for many discussions of the issues raised in this article, and to Dennis Norris and two other, anonymous, reviewers for excellent critical discussion.
Disclosure statement
No potential conflict of interest was reported by the authors.
Notes
1. Note that we associate the modifier polka-dot with three different lexomes. When a particular lexome is experienced especially frequently (e.g. in a series such as polka-dot dress, polka-dot shirt, polka-dot pants, etc.), it will acquire stronger associations during learning, and hence will dominate understanding. This is how our approach explains the experimental results that CARIN theory (Gagné, 2001; Gagné & Shoben, 1997) accounts for by means of abstract decontextualised concepts (such as “polka dot”) and an associated probability distribution over a set of abstract conceptual relations.
2. Milin et al. (2015) report a computational modelling study in which a Rescorla–Wagner network was trained on 4.8 million utterances (subtitles accompanying movie scenes) from an English subtitle corpus. With only a year's worth of reading experience (some 22 million word tokens, 84,000 types), the model correctly predicted the highest activations for all lexomes in an utterance for 65% of the utterances.
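For readers unfamiliar with the learning rule underlying such networks, the following is a minimal sketch of a Rescorla–Wagner update as typically used in discriminative learning. The cue and outcome names are invented toy examples, not the materials or implementation of Milin et al. (2015): each learning event presents a set of cues (e.g. sublexical units) together with a set of outcomes (lexomes), and every outcome's association weights to the present cues are nudged toward 1 if the outcome occurred and toward 0 if it did not.

```python
from collections import defaultdict

def rescorla_wagner_update(weights, cues, outcomes, all_outcomes,
                           eta=0.1, lam=1.0):
    """One Rescorla-Wagner learning event.

    weights: dict mapping (cue, outcome) pairs to association strengths.
    cues: the cues present in this utterance.
    outcomes: the lexomes present in this utterance.
    all_outcomes: the full set of outcomes known to the network.
    Only weights from cues present in the event are adjusted.
    """
    for o in all_outcomes:
        # summed activation of outcome o from the cues present
        activation = sum(weights[(c, o)] for c in cues)
        target = lam if o in outcomes else 0.0
        delta = eta * (target - activation)
        for c in cues:
            weights[(c, o)] += delta
    return weights

# Toy illustration: two recurring utterances with disjoint cue sets.
weights = defaultdict(float)
all_outcomes = {"DOG", "CAT"}
for _ in range(100):
    rescorla_wagner_update(weights, {"#d", "do", "og"}, {"DOG"}, all_outcomes)
    rescorla_wagner_update(weights, {"#c", "ca", "at"}, {"CAT"}, all_outcomes)
```

After repeated exposure, the cues of the first utterance strongly activate the lexome DOG and leave CAT unactivated, illustrating how frequently co-occurring cue–lexome pairs come to dominate understanding (as in note 1).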