
Learning fast while avoiding spurious excitement and overcoming cue competition requires setting unachievable goals: reasons for using the logistic activation function in learning to predict categorical outcomes

Pages 575-596 | Received 04 Jun 2020, Accepted 26 Apr 2021, Published online: 17 May 2021
 

ABSTRACT

Language learning often involves predicting categorical outcomes based on a set of cues. Error in predicting a categorical outcome is the difference between zero or one and the outcome’s current level of activation. The current activation level of a categorical outcome is argued to be a non-linear, logistic function of the activation the outcome receives from the cues. Crucially, the logistic activation function asymptotically approaches zero and one without ever reaching or overshooting them. This allows error-driven learning to avoid settling on spurious associations between cues and outcomes that never co-occur (“spurious excitement”). In an artificial language experiment, human learners likewise show no spurious excitement. The logistic activation function is compared to alternative solutions to spurious excitement and shown to have important advantages: it enables one-shot learning and steep, S-shaped learning curves, and it explains why cue competition in language learning can be overcome with additional training.
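To make the mechanism concrete, here is a minimal sketch, not the article's implementation, of error-driven learning with a logistic activation function in a Rescorla-Wagner-style learner; the function and variable names are illustrative assumptions.

```python
import numpy as np

def logistic(x, s=1.0):
    """Logistic activation: squashes summed cue support into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-s * x))

def update_weights(weights, present_cues, outcome_present, rate=0.1, s=1.0):
    """One error-driven update with a logistic activation function."""
    summed = sum(weights[c] for c in present_cues)   # activation received from present cues
    activation = logistic(summed, s)                 # stays strictly between 0 and 1
    target = 1.0 if outcome_present else 0.0
    error = target - activation                      # difference between 0/1 and activation
    for c in present_cues:                           # only present cues are updated
        weights[c] += rate * error
    return weights

# Cues that never co-occur with the outcome keep receiving negative updates,
# but the activation can never overshoot 0, so the error never flips sign
# and no spurious excitatory association can arise.
w = {"A": 0.0, "B": 0.0}
for _ in range(200):
    w = update_weights(w, {"A", "B"}, outcome_present=False)
print(w)   # both weights drift negative and level off
```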

Acknowledgements

I am grateful for financial support from the University of Oregon Fund for Faculty Excellence. Many thanks to Richard Futrell for a useful discussion, to Zara Harmon, Jessie Nixon and two anonymous reviewers for helpful comments on earlier versions of this manuscript, and to Zachary Houghton for assistance with running subjects. Parts of this work were presented in a lecture for the “ABRALIN en vivo” series.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Notes

1 In deep networks, which have a very large number of hidden layers, the logistic function has computational issues that have led to its replacement by rectified linear functions, which zero out input activations below a certain threshold but transmit them faithfully otherwise (e.g., X. Wang, Qin, Y. Wang et al., 2019). However, it is not clear that a truly deep approach is necessary for language learning: Arnold et al. (2017) showed that even simple two-layer networks should not be discounted, since they can perform as well as state-of-the-art deep networks on a challenging spoken word recognition task. Furthermore, the sigmoid function has the advantage of biological plausibility: unlike rectified linear units, it has a ceiling, just as biological neurons cannot exceed a maximum firing rate. From a learning-theoretic perspective, however, the proof is in capturing the effects of experience on behaviour.
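For concreteness, a minimal sketch (assumed, not taken from the article or the cited work) contrasting the two activation functions mentioned here:

```python
import numpy as np

def logistic(x):
    # Bounded in (0, 1); saturates for large |x|, analogous to a neuron's
    # maximum firing rate.
    return 1.0 / (1.0 + np.exp(-x))

def rectified_linear(x, threshold=0.0):
    # Zeroes out activation below the threshold, transmits it faithfully
    # otherwise; unbounded above.
    return np.where(x > threshold, x, 0.0)

x = np.array([-2.0, 0.0, 2.0, 20.0])
print(logistic(x))            # approx. [0.119 0.5   0.881 1.   ]
print(rectified_linear(x))    # [ 0.  0.  2. 20.]
```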

2 The only effect of s is to make the treatment of error more or less categorical. In the original RW model, error is treated continuously, and there is nothing special about making predictions that are categorically correct or incorrect. With very high slopes (dashed line in ), what counts as an error is inhibiting an outcome when one should have excited it, or vice versa; how far over the line one is does not matter much. With low slopes, distance from 0 and 1 exerts a continuous influence on learning rate, as in the original version of RW.
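A small sketch of this point, assuming the slope s simply scales the input to the logistic (i.e., activation = σ(s·x)); the numbers are illustrative:

```python
import numpy as np

def logistic(x, s):
    """Logistic activation with slope parameter s (an assumed parameterisation)."""
    return 1.0 / (1.0 + np.exp(-s * x))

net_input = np.array([-1.0, -0.1, 0.1, 1.0])

# Low slope: activation (and hence error) varies continuously with net input,
# as in the original Rescorla-Wagner treatment of error.
print(np.round(logistic(net_input, s=1.0), 3))    # [0.269 0.475 0.525 0.731]

# Very high slope: activation is nearly 0 or 1 on either side of the decision
# line, so error mainly reflects whether the prediction was categorically
# right or wrong, not how far over the line the net input was.
print(np.round(logistic(net_input, s=50.0), 3))   # [0.    0.007 0.993 1.   ]
```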

3 Jamieson et al. (2012) develop a configural model that accounts for all types of cue competition by proposing that (1) cues and outcomes are represented by features, and (2) expected feature values are not represented as robustly as unexpected ones. However, as in linear RW, cues can be overexpected and are then represented as opposite feature values. It is therefore unclear whether this model would avoid spurious excitement.

4 See also Hayes (2020, p. 34) for problematic predictions of the Z-shaped activation function of the humble teacher for speech category identification and (arguably) spurious cue interactions.

5 This early superiority of EA and AF is likely the right prediction because, as seen in , responding with X to A grows much more slowly than responding with X to any stimulus containing E or F, since A is paired with Y on AB and CA trials.

6 To determine significance, the model was run 64 times for each learning rate, matching the number of subjects. Learning rates are specified to the third decimal place. For example, a minimum learning rate of .007 means that .0064 yielded a non-significant result while .007 yielded a significant one.
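A hedged sketch of the procedure described here; the model-running and significance-testing functions below are placeholders standing in for the article's actual model and test, not its code:

```python
import numpy as np
from scipy import stats

N_SUBJECTS = 64  # matches the number of human subjects

def run_model(learning_rate, seed):
    """Placeholder: run one simulated subject and return its score on the
    critical contrast. The real model is the one described in the article."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=10 * learning_rate, scale=1.0)  # toy stand-in

def is_significant(learning_rate, alpha=0.05):
    scores = [run_model(learning_rate, seed) for seed in range(N_SUBJECTS)]
    # Placeholder test: one-sample t-test of the simulated scores against zero.
    res = stats.ttest_1samp(scores, popmean=0.0)
    return res.pvalue < alpha and res.statistic > 0

# Scan learning rates at the granularity used in the note and report the
# smallest one that yields a significant result.
for rate in np.round(np.arange(0.001, 0.101, 0.001), 3):
    if is_significant(rate):
        print("minimum significant learning rate:", rate)
        break
```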

7 There is debate about whether learning in these paradigms is really fast, or whether the apparent rapid changes in behaviour are due to decision-making processes (i.e., maximizing; e.g., McMurray et al., 2012). If it can be shown that the underlying association weights actually change slowly (e.g., by utilizing tasks that do not require learners to make categorical decisions), or that apparently fast learning involves the learner replaying the experience to themselves, learning a little from each repetition, this would provide evidence for RW as a model of learning.

8 Another theoretical advantage of the activation function approach is that it allows Λ to remain the product of cue and outcome salience, as proposed by Rescorla and Wagner (1972), rather than becoming a free parameter adjusted to fit the experimental design.
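For reference, the standard Rescorla-Wagner (1972) update, on the assumption that Λ in this note denotes the learning rate, i.e. the product of cue salience α and outcome salience β (the notation here is the conventional one, not necessarily the article's):

\[
\Delta V_{i} \;=\; \alpha_{i}\,\beta\,\Bigl(\lambda \;-\; \sum_{j\,\in\,\text{present cues}} V_{j}\Bigr)
\]

Under the activation-function approach discussed in the article, the summed associative strength is passed through the logistic function before the error term is computed, which is what keeps the prediction within (0, 1).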

9 These results differ from those reported by Dawson (2008, pp. 81, 83), who states that there is almost complete blocking of D and G in the logistic model. This discrepancy may be due to Dawson stopping training of the model as soon as it achieved criterial accuracy (“converged”), rather than allowing it to learn for a specified number of trials.

10 With length-based encoding and very fast learning rates (>.3), D by itself is incorrectly predicted to cue X (as in the linear model), but not because D itself is associated with X. Instead, length=1 becomes associated with X more strongly than D is associated with Y (because the length=1 stimuli cueing X are more diverse than those cueing Y, i.e., X has higher type frequency even though token frequencies are controlled). For all other encodings and learning rates, the configuration of D by itself always cues Y.

11 Griffiths et al. (2011) acknowledge that learning in associative models can be arbitrarily fast, but suggest that one-shot learning is still problematic because such models do not explain when learning can occur quickly. However, as shown by McClelland et al. (1995), these models do predict when learning can be fast: it can be fast to the extent that representations of cue configurations involve unique cues. Otherwise, catastrophic interference can result from rapid learning. That is, learning can be fast when the cue configurations predicting different outcomes are easily discriminable in the way they are represented.

12 One could argue that the “correct” activation for an outcome is its true probability of occurrence given the set of cues. Thus, when the probability of occurrence of an outcome given the cues is 0.5, the correct activation is 0.5. However, even in the linear RW model, there will always be error in predicting an outcome that is not deterministically predictable by the cues. Thus, the model considers its beliefs about a set of cues to be correct, and therefore not in need of updating, only if it is perfectly certain about which outcomes these cues predict.
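A minimal toy illustration of this point (an assumed single-cue setup, not the article's simulation): under a linear error-driven update, the weight for a cue paired with a 0.5-probable outcome converges toward 0.5, yet the per-trial error never vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)
rate, w = 0.05, 0.0
errors = []

# A single cue whose outcome occurs with probability 0.5: linear error-driven
# updates drive the weight toward 0.5, but the per-trial error never vanishes.
for _ in range(2000):
    target = float(rng.random() < 0.5)   # outcome present on half the trials
    error = target - w                   # linear activation: just the weight itself
    w += rate * error
    errors.append(error)

print("final weight:", round(w, 2))                      # close to 0.5
print("mean |error| over last 500 trials:",
      round(float(np.mean(np.abs(errors[-500:]))), 2))   # about 0.5, not 0
```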
