
Learning fast while avoiding spurious excitement and overcoming cue competition requires setting unachievable goals: reasons for using the logistic activation function in learning to predict categorical outcomes

Pages 575-596 | Received 04 Jun 2020, Accepted 26 Apr 2021, Published online: 17 May 2021
 

ABSTRACT

Language learning often involves predicting categorical outcomes based on a set of cues. Error in predicting a categorical outcome is the difference between zero or one and the outcome’s current level of activation. The current activation level of a categorical outcome is argued to be a non-linear, logistic function of the activation the outcome receives from the cues. Crucially, the logistic activation function asymptotically approaches zero and one without ever reaching or overshooting them. This allows error-driven learning to avoid settling on spurious associations between cues and outcomes that never co-occur (“spurious excitement”). In an artificial language experiment, human learners likewise show no spurious excitement. The logistic activation function is compared to alternative solutions to spurious excitement and shown to have important advantages: it enables one-shot learning and steep, S-shaped learning curves, and it explains why cue competition in language learning can be overcome with additional training.
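To make the mechanism concrete, here is a minimal sketch, not the article's implementation, of error-driven learning with a logistic activation function in a Rescorla-Wagner-style learner; the function and variable names are illustrative assumptions.

```python
import numpy as np

def logistic(x, s=1.0):
    """Logistic activation: squashes summed cue support into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-s * x))

def update_weights(weights, present_cues, outcome_present, rate=0.1, s=1.0):
    """One error-driven update with a logistic activation function."""
    summed = sum(weights[c] for c in present_cues)   # activation received from present cues
    activation = logistic(summed, s)                 # stays strictly between 0 and 1
    target = 1.0 if outcome_present else 0.0
    error = target - activation                      # difference between 0/1 and activation
    for c in present_cues:                           # only present cues are updated
        weights[c] += rate * error
    return weights

# Cues that never co-occur with the outcome keep receiving negative updates,
# but the activation can never overshoot 0, so the error never flips sign
# and no spurious excitatory association can arise.
w = {"A": 0.0, "B": 0.0}
for _ in range(200):
    w = update_weights(w, {"A", "B"}, outcome_present=False)
print(w)   # both weights drift negative and level off
```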

Acknowledgements

I am grateful for financial support from the University of Oregon Fund for Faculty Excellence. Many thanks to Richard Futrell for a useful discussion, to Zara Harmon, Jessie Nixon and two anonymous reviewers for helpful comments on earlier versions of this manuscript, and to Zachary Houghton for assistance with running subjects. Parts of this work were presented in a lecture for the “ABRALIN en vivo” series.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Notes

1 In deep networks, which have a very large number of hidden layers, the logistic function has computational issues that have led to its replacement by rectified linear functions, which zero out input activations below a certain threshold but transmit them faithfully otherwise (e.g., X. Wang, Qin, Y. Wang et al., 2019). However, it is not clear that a truly deep approach is necessary for language learning: Arnold et al. (2017) showed that even simple two-layer networks should not be discounted, since they can perform as well as state-of-the-art deep networks on a challenging spoken word recognition task. Furthermore, the sigmoid function has the advantage of biological plausibility: unlike rectified linear units, it has a ceiling, just as biological neurons cannot exceed a maximum firing rate. From a learning-theoretic perspective, however, the proof is in capturing the effects of experience on behaviour.
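For concreteness, a minimal sketch (assumed, not taken from the article or the cited work) contrasting the two activation functions mentioned here:

```python
import numpy as np

def logistic(x):
    # Bounded in (0, 1); saturates for large |x|, analogous to a neuron's
    # maximum firing rate.
    return 1.0 / (1.0 + np.exp(-x))

def rectified_linear(x, threshold=0.0):
    # Zeroes out activation below the threshold, transmits it faithfully
    # otherwise; unbounded above.
    return np.where(x > threshold, x, 0.0)

x = np.array([-2.0, 0.0, 2.0, 20.0])
print(logistic(x))            # approx. [0.119 0.5   0.881 1.   ]
print(rectified_linear(x))    # [ 0.  0.  2. 20.]
```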

2 The only effect of s is to make the treatment of error more or less categorical. In the original RW model, error is treated continuously, and there is nothing special about making predictions that are categorically correct or incorrect. With very high slopes (dashed line in ), what counts as an error is inhibiting an outcome when one should have excited it, or vice versa; how far over the line one is does not matter much. With low slopes, distance from 0 and 1 exerts a continuous influence on learning rate, as in the original version of RW.
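A small sketch of this point, assuming the slope s simply scales the input to the logistic (i.e., activation = σ(s·x)); the numbers are illustrative:

```python
import numpy as np

def logistic(x, s):
    """Logistic activation with slope parameter s (an assumed parameterisation)."""
    return 1.0 / (1.0 + np.exp(-s * x))

net_input = np.array([-1.0, -0.1, 0.1, 1.0])

# Low slope: activation (and hence error) varies continuously with net input,
# as in the original Rescorla-Wagner treatment of error.
print(np.round(logistic(net_input, s=1.0), 3))    # [0.269 0.475 0.525 0.731]

# Very high slope: activation is nearly 0 or 1 on either side of the decision
# line, so error mainly reflects whether the prediction was categorically
# right or wrong, not how far over the line the net input was.
print(np.round(logistic(net_input, s=50.0), 3))   # [0.    0.007 0.993 1.   ]
```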

3 Jamieson et al. (2012) develop a configural model that accounts for all types of cue competition by proposing that (1) cues and outcomes are represented by features, and (2) expected feature values are not represented as robustly as unexpected ones. However, as in linear RW, cues can be overexpected and are then represented as opposite feature values. It is therefore unclear whether this model would avoid spurious excitement.

4 See also Hayes (2020, p. 34) for problematic predictions of the Z-shaped activation function of the humble teacher for speech category identification and (arguably) spurious cue interactions.

5 This early superiority of EA and AF is likely the right prediction because, as seen in , responding with X to A grows much more slowly than responding with X to any stimulus containing E or F, since A is paired with Y on AB and CA trials.

6 To determine significance, the model was run 64 times for each learning rate, matching the number of subjects. Learning rates are specified to the third decimal place. For example, a minimum learning rate of .007 means that .0064 yielded a non-significant result while .007 yielded a significant one.
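A hedged sketch of the procedure described here; the model-running and significance-testing functions below are placeholders standing in for the article's actual model and test, not its code:

```python
import numpy as np
from scipy import stats

N_SUBJECTS = 64  # matches the number of human subjects

def run_model(learning_rate, seed):
    """Placeholder: run one simulated subject and return its score on the
    critical contrast. The real model is the one described in the article."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=10 * learning_rate, scale=1.0)  # toy stand-in

def is_significant(learning_rate, alpha=0.05):
    scores = [run_model(learning_rate, seed) for seed in range(N_SUBJECTS)]
    # Placeholder test: one-sample t-test of the simulated scores against zero.
    res = stats.ttest_1samp(scores, popmean=0.0)
    return res.pvalue < alpha and res.statistic > 0

# Scan learning rates at the granularity used in the note and report the
# smallest one that yields a significant result.
for rate in np.round(np.arange(0.001, 0.101, 0.001), 3):
    if is_significant(rate):
        print("minimum significant learning rate:", rate)
        break
```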

7 There is debate about whether learning in these paradigms is really fast, or whether the apparent rapid changes in behaviour are due to decision-making processes (i.e., maximizing; e.g., McMurray et al., 2012). If it can be shown that the underlying association weights actually change slowly (e.g., by utilizing tasks that do not require learners to make categorical decisions), or that apparently fast learning involves the learner replaying the experience to themselves, learning a little from each repetition, this would provide evidence for RW as a model of learning.

8 Another theoretical advantage of the activation function approach is that it allows Λ to remain the product of cue and outcome salience, as proposed by Rescorla and Wagner (1972), rather than becoming a free parameter adjusted to fit the experimental design.
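For reference, the standard Rescorla-Wagner (1972) update, on the assumption that Λ in this note denotes the learning rate, i.e. the product of cue salience α and outcome salience β (the notation here is the conventional one, not necessarily the article's):

\[
\Delta V_{i} \;=\; \alpha_{i}\,\beta\,\Bigl(\lambda \;-\; \sum_{j\,\in\,\text{present cues}} V_{j}\Bigr)
\]

Under the activation-function approach discussed in the article, the summed associative strength is passed through the logistic function before the error term is computed, which is what keeps the prediction within (0, 1).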

9 These results differ from those reported by Dawson (2008, pp. 81, 83), who states that there is almost complete blocking of D and G in the logistic model. This discrepancy may be due to Dawson stopping training of the model as soon as it achieved criterial accuracy (“converged”), rather than allowing it to learn for a specified number of trials.

10 With length-based encoding and very fast learning rates (>.3), D by itself is incorrectly predicted to cue X (as in the linear model), but not because D itself is associated with X. Instead, length=1 becomes associated with X more strongly than D is associated with Y (because the length=1 stimuli cueing X are more diverse than those cueing Y, i.e., X has higher type frequency even though token frequencies are controlled). For all other encodings and learning rates, the configuration of D by itself always cues Y.

11 Griffiths et al. (2011) acknowledge that learning in associative models can be arbitrarily fast, but suggest that one-shot learning is still problematic because such models do not explain when learning can occur quickly. However, as shown by McClelland et al. (1995), these models do predict when learning can be fast: it can be fast to the extent that representations of cue configurations involve unique cues. Otherwise, catastrophic interference can result from rapid learning. That is, learning can be fast when the cue configurations predicting different outcomes are easily discriminable in the way they are represented.

12 One could argue that the “correct” activation for an outcome is its true probability of occurrence given the set of cues. Thus, when the probability of occurrence of an outcome given the cues is 0.5, the correct activation is 0.5. However, even in the linear RW model, there will always be error in predicting an outcome that is not deterministically predictable by the cues. Thus, the model considers its beliefs about a set of cues to be correct, and therefore not in need of updating, only if it is perfectly certain about which outcomes these cues predict.
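A minimal toy illustration of this point (an assumed single-cue setup, not the article's simulation): under a linear error-driven update, the weight for a cue paired with a 0.5-probable outcome converges toward 0.5, yet the per-trial error never vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)
rate, w = 0.05, 0.0
errors = []

# A single cue whose outcome occurs with probability 0.5: linear error-driven
# updates drive the weight toward 0.5, but the per-trial error never vanishes.
for _ in range(2000):
    target = float(rng.random() < 0.5)   # outcome present on half the trials
    error = target - w                   # linear activation: just the weight itself
    w += rate * error
    errors.append(error)

print("final weight:", round(w, 2))                      # close to 0.5
print("mean |error| over last 500 trials:",
      round(float(np.mean(np.abs(errors[-500:]))), 2))   # about 0.5, not 0
```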
