Original Articles

Atomistic learning in non-modular systems

Pages 313-325 | Published online: 23 Jan 2007
 

Abstract

We argue that atomistic learning—learning that requires training only on a novel item to be learned—is problematic for networks in which every weight is available for change in every learning situation. This is potentially significant because atomistic learning appears to be commonplace in humans and most non-human animals. We briefly review various proposed fixes, concluding that the most promising strategy to date involves training on pseudo-patterns along with novel items, a form of learning that is not strictly atomistic, but which looks very much like it ‘from the outside’.

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. 0137255. We would like to acknowledge Chris May, Alexa Lee, Robert French, Jim Garson, and Paul Teller for helpful comments and suggestions on earlier drafts of this paper.

Notes

 Bayesian models, for instance, assume we can assign probabilities to individual beliefs and modify those probabilities individually as a function of current evidence.
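To make the contrast with network learning vivid, here is a minimal sketch of such individual updating; the belief, the prior, and the likelihoods are illustrative assumptions, not anything the Bayesian literature mandates.

    # Bayes' rule applied to one belief at a time, leaving all others untouched.
    def update(prior, p_e_given_h, p_e_given_not_h):
        """Posterior probability of a single belief after observing evidence e."""
        numerator = p_e_given_h * prior
        return numerator / (numerator + p_e_given_not_h * (1.0 - prior))

    # Belief: 'it rained last night'; evidence: the lawn is wet.
    posterior = update(prior=0.3, p_e_given_h=0.9, p_e_given_not_h=0.2)
    print(round(posterior, 3))  # 0.659 -- only this one belief's probability changes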

 Of course, given finite memory, there will always be a point where a system can no longer add new information without deleting old information. So this point only applies to cases where there is enough room in memory to add the new information in the first place. Also, this constraint is not intended to rule out cases where, upon learning an atom, an inference rule results in updating (and therefore possibly not conserving) other relevant beliefs.

 Networks can, of course, be stochastic, in which case there need not be a unique output for a given input. In multi-layer feed-forward networks, the output is W_n × W_{n-1} × ··· × W_1 × I, where I is the input vector and W_x is the weight matrix connecting layer x to layer x + 1. In recurrent nets, the process that generates an output vector from an input vector and the weights is more complex and varies with the architecture. Nevertheless, given a fixed architecture, the input and the weights suffice to fix the output vector (modulo whatever randomness there is in the system).
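A minimal NumPy sketch of this composition (the layer sizes and weights are arbitrary, and activation functions are suppressed, as in the formulation above):

    import numpy as np

    rng = np.random.default_rng(0)

    # Weight matrices for a net with layers of 4, 5, 3, and 2 units.
    W1 = rng.standard_normal((5, 4))  # connects layer 1 to layer 2
    W2 = rng.standard_normal((3, 5))  # connects layer 2 to layer 3
    W3 = rng.standard_normal((2, 3))  # connects layer 3 to layer 4

    I = rng.standard_normal(4)        # input vector

    # Given a fixed architecture, input and weights suffice to fix the output.
    output = W3 @ W2 @ W1 @ I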

 Any change in input–output behavior that can be effected by weight changes can be effected by fixing the weights and altering the activation functions of the individual units. In some cases, learning not only affects weights, but also effects changes in activation functions, e.g., by altering thresholds.
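A familiar special case: a unit's threshold behaves exactly like a bias weight on a clamped input, so the same behavioral change can be booked to the weights or to the activation function. A toy illustration (the unit and numbers are made up):

    # A threshold unit: fires iff the weighted sum of its inputs exceeds theta.
    def fires(weights, x, theta):
        return sum(w * xi for w, xi in zip(weights, x)) > theta

    # The same unit with theta folded into the weights as a bias on a clamped
    # input of 1; the activation function is now a fixed step at 0.
    def fires_with_bias(weights_and_bias, x):
        return sum(w * xi for w, xi in zip(weights_and_bias, x + [1.0])) > 0.0

    x = [0.6, 0.2]
    assert fires([1.0, 1.0], x, theta=0.5) == fires_with_bias([1.0, 1.0, -0.5], x)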

 The sense in which a network is said to be distributed (i.e., non-modular) should be distinguished from the sense in which its activation vectors are said to use a distributed encoding scheme. An encoding scheme for a vector v is fully distributed relative to a domain of representational targets D if every element of v is involved in the representation of every element of D. A fully distributed network can utilize a local encoding scheme, and a local network can utilize a distributed encoding scheme.
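An illustrative contrast between the two kinds of encoding scheme (the domain and the vectors are invented for the example):

    # Domain D = {'cat', 'dog', 'fox'}, each encoded as a three-element vector.

    # Local encoding: each vector element stands for at most one item in D.
    local = {'cat': [1, 0, 0], 'dog': [0, 1, 0], 'fox': [0, 0, 1]}

    # Fully distributed encoding: every vector element participates in the
    # representation of every item; no element can be read off in isolation.
    distributed = {'cat': [0.9, 0.1, 0.4],
                   'dog': [0.2, 0.8, 0.6],
                   'fox': [0.5, 0.7, 0.1]}

Either dictionary could serve as the activation scheme of a modular or a non-modular network; which units carry the vectors is a separate question from how the vectors carve up the domain.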

 In this case, the whole class is novel with respect to previous learning, in that the new pairs are not accommodated by generalization from previously learned cases. The pairs in the class are not ‘novel’ with respect to each other, since, owing to their similarity to each other, training on a few will improve performance on others via generalization. In the latter case, we have a whole class of ‘exception’ cases with inputs similar to a previously learned input.

 In a strictly unstructured domain, preventing generalization might be desirable. But cortical circuits cannot be expected to know in advance whether the domains they are recruited to learn are structured or not. Moreover, as we emphasized earlier, an important case of novel-case learning involves mastery of exception cases in otherwise highly structured domains. Blocking generalization in these cases would be catastrophic unless it could be blocked only around the exceptions, and that would evidently require supernatural prescience.

 For example, Page (2000). Page thinks that localist solutions can preserve generalization and graceful degradation. However, we are skeptical. First, the kind of generalization he is talking about is not the kind we describe here. Rather, it is the kind one finds in Rumelhart and McClelland's (1986) Jets and Sharks network: given the activation of a node that represents a Shark (or Jet), several other nodes that represent various properties of a typical Shark (or Jet) will be activated. This is a very different sort of generalization from the one discussed here, where what is relevant is similarity of response given similarity of input. Second, it is unclear whether Page's networks are local in the sense in which we use the term. He seems to be concerned with localism at the level of representation in the activation vectors, while we are discussing localism in the sense of modularity.

 This is an improvement on previous work in which catastrophic forgetting was avoided for networks trained on 'a series of "static" (non-temporal) patterns' but not for temporal patterns (Ans & Rousset, 1997; French, 1997; Robbins, 1995). Though these earlier networks would suffice to illustrate the learning of new I/O pairs without catastrophic interference, we use the RSRN because it represents the latest, most powerful incarnation of the pseudo-pattern strategy.
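For concreteness, here is a deliberately toy sketch of the pseudo-pattern strategy. It is our simplification, not the RSRN architecture itself: the one-layer linear net and delta-rule training are stand-ins for whatever learning procedure the real models use.

    import numpy as np

    rng = np.random.default_rng(1)

    W = rng.standard_normal((2, 3)) * 0.1   # a toy one-layer linear net

    def forward(x):
        return W @ x

    def delta_step(x, target, lr=0.05):
        global W
        W += lr * np.outer(target - forward(x), x)   # simple delta rule

    def pseudo_patterns(n):
        # Random inputs paired with the net's OWN current outputs. They
        # reflect whatever function W now computes, with no record of
        # which parts of that function were ever deliberately learned.
        xs = rng.uniform(-1, 1, size=(n, 3))
        return [(x, forward(x).copy()) for x in xs]

    def learn_novel_item(x_new, y_new, epochs=200, n_pseudo=20):
        rehearsal = pseudo_patterns(n_pseudo)   # snapshot of the old function
        for _ in range(epochs):
            delta_step(x_new, y_new)            # train on the novel item...
            for x, y in rehearsal:              # ...interleaved with rehearsal
                delta_step(x, y)

    learn_novel_item(np.array([1.0, -0.5, 0.2]), np.array([0.3, 0.8]))

From the outside, the system is given only the novel pair; the rehearsal items are generated internally, which is why the abstract describes the result as looking atomistic without strictly being so.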

 Dual-network architectures are not without neurological and theoretical precedent. See McClelland et al. (1995) for an account of such architectures in terms of the observed limitations and successes of connectionist networks. We present a brief summary in the next section.

 Even the proponents of pseudo-pattern rehearsal seem to understate this point. They write about pseudo-patterns reflecting what has been 'previously learned' as if the pseudo-pattern encoded which aspects of the resulting weight matrix (or of the function it computes) were due to a learning episode as opposed to being already there. But from what they tell us about pseudo-patterns, RSRNs draw no such distinction.

 Incidentally, the RSRN approach of using pseudo-patterns seems to solve both the problem of creating such representative knowledge and the problem of storing it.

 Consider intrusions, i.e., spontaneously generated but robust errors. These are difficult to explain in standard network models, because error signals during learning are generated only by comparison with correct responses. Really robust errors that are not simply the result of limited capacity have to be explicitly trained, and are therefore not spontaneous but due to existing errors in the training set. RSRNs, however, are trained on pseudo-patterns. Training on pseudo-patterns, as opposed to explicit retro-training, conserves everything equally: previously acquired responses to non-significant inputs are on all fours with previously learned responses to significant inputs. The process that preserves previous learning is blind to the distinction between inputs that are significant and those that are not. A lot of ‘trash’ is preserved. Preservation of ‘trash’ pairs induces generalization effects in exactly the same way that preservation of significant pairs does, and this can be expected to introduce intrusions when ‘trash’ inputs are sufficiently close to significant ones. Moreover, errors, once introduced, will tend to be preserved unless they are explicitly overwritten.

 For this last property, see Husbands, Smith, Jakobi, and O'Shea (1998).

 Of course, other beliefs may seem to 'come along for the ride', but here we must recognize that those additional beliefs have some epistemic justification. For example, upon learning my aunt's phone number, I may form beliefs such as that the prefix of her number is identical to the prefix of my dentist's number, or that her number contains no even digits. As we see it, these are beliefs that are inferred (with some epistemic justification) from the original belief, which pairs my aunt with a specific number. That other beliefs can be and are inferred from one belief is, on its own, no reason to hold that the one belief is not learned individually.

 We have some sympathy with this idea as well. See Cummins et al. (2004).

