The Acquisition of Anaphora by Simple Recurrent Networks

Pages 181-227 | Received 31 Jan 2012, Accepted 28 Aug 2012, Published online: 18 Jun 2013

Abstract

This article applies Simple Recurrent Networks (SRNs; Elman 1991, 1993) to the task of assigning an interpretation to reflexive and pronominal anaphora. This task demands more refined sensitivity to syntactic structure than has been previously explored. Measured quantitatively, SRNs perform quite well. However, the way in which they achieve such performance diverges in key respects from the target grammar: (i) linear N-V-reflexive/pronoun sequences affect the SRN's interpretations, even without a relevant structural relation, yielding errors unlike those made by humans; (ii) the SRN's representations distinguish sentence types, inhibiting structural generalization; (iii) the SRN's knowledge of the conditions on anaphoric dependencies fails to generalize to novel lexical items. These results have important consequences not only for the viability of SRNs as models of language learning but also for the systematicity of generalization in neural networks (Hadley 1994; Marcus 1998).
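
As a rough illustration of the architecture under discussion, the sketch below shows the forward pass of an Elman-style SRN. The layer sizes, initialization, and activation function are illustrative assumptions, not details of the simulations reported in the article.

```python
# Minimal sketch of an Elman-style Simple Recurrent Network (assumed layer
# sizes and initialization; not the authors' implementation). At each word,
# the hidden layer is computed from the current input together with a copy
# of the previous hidden state (the "context" units), and the output layer
# activates candidate referents.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SimpleRecurrentNetwork:
    def __init__(self, n_inputs, n_hidden, n_outputs, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(scale=0.1, size=(n_hidden, n_inputs))
        self.W_ctx = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
        self.W_out = rng.normal(scale=0.1, size=(n_outputs, n_hidden))
        self.context = np.zeros(n_hidden)    # copy of the previous hidden state

    def step(self, x):
        """Process one word (a one-hot input vector) and return output activations."""
        hidden = sigmoid(self.W_in @ x + self.W_ctx @ self.context)
        self.context = hidden                 # context units are a verbatim copy
        return sigmoid(self.W_out @ hidden)   # e.g., activations over candidate referents
```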

ACKNOWLEDGMENTS

For much useful discussion, suggestions, and criticisms, we are grateful to Esteban Buz, Janet Fodor, Christo Kirov, Jeff Lidz, Jeff Markowitz, Lisa Pearl, William Sakas, Paul Smolensky, John Stowe, and our anonymous reviewers, as well as to audiences at numerous colloquium and conference presentations. Financial support was provided by National Science Foundation (NSF) grant SBR-0446929. The material in this article is based in part on work done while the third author was employed as an NSF Program Director. Any opinions, findings, and conclusions expressed in this article are those of the authors and do not necessarily reflect the views of the U.S. National Science Foundation.

Notes

1 Other approaches to language learning marry statistical induction to a language-specific (UG) conception of inductive bias. See Yang (2004) and Pearl (2011) for representative examples. Such approaches, which we think are promising, hold out the possibility of reducing, though not eliminating, the explanatory burden placed on language-specific innate knowledge.

2 Joanisse & Seidenberg (2003) have also applied SRNs to the problem of anaphoric interpretation. Their focus, however, was quite different from ours: modeling the effects of impaired phonological working memory on performance in this domain. Perhaps because of this, they did not study in great detail the successes and failures of their intact network on the anaphora task.

3 Note that lexical abstractness is not universally assumed for the acquisition of all grammatical processes. For instance, Tomasello (1992, 2000), among others, argues that subcategorization frames and argument structure alternations are learned on a verb-by-verb basis. See Fisher (2002) for a different view. Whatever the resolution of this debate, we take it to be uncontroversial that the acquisition of anaphora does not work in this way, and indeed it is conspicuous that, to our knowledge, such a lexically specific proposal has never been made in this domain.

4 Hadley also defines a notion of weak systematicity, requiring only that an agent be able to process sentences that are novel combinations of its vocabulary, without the generalization of words to novel positions. One might argue that a network satisfying this condition demonstrates an extremely limited sort of lexical abstraction, but no more than the sort that might be exhibited by a bigram model, which can process novel sequences so long as the constituent bigrams have been observed during training.
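
To make the contrast concrete, here is a toy illustration (with made-up vocabulary, not the article's training corpus) of how a bigram model handles a novel sentence whose individual word-to-word transitions have all been seen before.

```python
# Minimal sketch (toy data assumed): a sentence never seen as a whole can
# still be processed as long as each of its word-to-word transitions
# (bigrams) appeared somewhere in training.
from collections import Counter

training = [["john", "sees", "mary"], ["mary", "admires", "john"]]
bigrams = Counter((s[i], s[i + 1]) for s in training for i in range(len(s) - 1))

novel = ["john", "sees", "mary", "admires", "john"]   # never seen as a whole
print(all((novel[i], novel[i + 1]) in bigrams for i in range(len(novel) - 1)))
# True: every transition was observed, so this weakly systematic model can
# handle the novel combination without placing any word in a novel position.
```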

5 Indeed, languages show a certain degree of variation in what characterizes the local domain for both pronouns and reflexives, though interestingly there is almost always complementarity between the contexts in which pronouns and reflexives may occur with a particular antecedent. For some relevant discussion, see Koster & Reuland (1991). We put aside discussion of this issue for the remainder of this article, apart from noting that in current work we are exploring whether there are any properties of SRNs that would lead us to expect such complementarity.

6 An anonymous reviewer observes that it might be more reasonable to provide probability distributions as target outputs. We do not do this, however, because we tentatively assume that these targets stem from the learner's hypothesis about the interpretation of a pronoun. Such hypotheses are derived from multiple clues, including aspects of the discourse context as well as her understanding of the speaker's intentions. Even if the learner's hypothesis is unlikely to be correct on every occasion, there is little reason to expect that this hypothesis will match the set of structurally accessible antecedents. Nonetheless, even in the face of unique target outputs, the optimal output (in the sense of minimizing error) will correspond to a probability distribution over possible outputs. In future work, we plan to explore the effect of introducing noise into the targets for pronominal reference.
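
The point that the error-minimizing output matches the distribution over targets can be illustrated with a small simulation; the particular distribution and sample size below are arbitrary assumptions, not values from the article.

```python
# Minimal sketch: with unique 0/1 targets sampled from a distribution over
# antecedents, the constant output vector that minimizes mean squared error
# is the underlying distribution itself, not any single referent.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical distribution over three candidate antecedents for one context.
true_dist = np.array([0.5, 0.333, 0.167])

# Sample unique (one-hot) targets, as in a regimen with a single target referent.
targets = rng.multinomial(1, true_dist, size=10000)

def mse(output):
    # Mean squared error of a constant output vector against the sampled targets.
    return np.mean(np.sum((targets - output) ** 2, axis=1))

print(mse(true_dist))                    # error when outputting the distribution
print(mse(np.array([1.0, 0.0, 0.0])))    # error when committing to one referent
# The distribution yields the lower error, so the optimal output corresponds
# to a probability distribution over possible targets.
```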

7 For a number of the simulations reported here, we have also performed identical simulations using Elman's starting-small regimen, in which the network's hidden-unit activations were reset during training after an increasingly long delay. In Elman's paper, this had an effect on learning identical to that of changing the relative proportion of simple and complex sentences over training: namely, it permitted the network to learn the long-distance subject-verb dependencies. In none of our simulations did we find that this starting-small regimen improved the network's performance, either in terms of error minimization or in terms of generalization to held-out sentence types.
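
For concreteness, the following is a minimal sketch of this memory-limitation regimen; HIDDEN_SIZE, train_step, and the phase schedule are placeholder assumptions, not parameters from the article or from Elman's simulations.

```python
# Minimal sketch of the "starting small" memory-limitation regimen: the SRN's
# context units are wiped after an increasingly long delay as training
# proceeds, so early training sees only short stretches of prior context.
import numpy as np

HIDDEN_SIZE = 40    # assumed hidden-layer size, for illustration only

def train_with_growing_memory(corpus, train_step, phases=(3, 4, 5, 6, 7)):
    """corpus: iterable of token sequences; train_step: performs one SRN
    weight update for (hidden, token) and returns the new hidden state."""
    for reset_window in phases:                  # memory window grows each phase
        for sentence in corpus:
            hidden = np.zeros(HIDDEN_SIZE)       # fresh context at sentence start
            for i, token in enumerate(sentence):
                hidden = train_step(hidden, token)
                if (i + 1) % reset_window == 0:  # simulate limited working memory
                    hidden = np.zeros(HIDDEN_SIZE)
```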

8 Here and throughout, percentages reported in this way give a 95% confidence interval.

9 This measure of accuracy on reflexive interpretation might be seen as too lenient: since network activations have a natural interpretation as probabilities, the referent with the highest activation is only the one that the network takes to be most likely. Given that the interpretation of a reflexive is always unique for a particular structure in our training set, it might be appropriate to require a nearly unimodal probability distribution to count as success. One way to do this is to require that the activation of the target referent be above some threshold. If we set this threshold to .8, average network accuracy decreases slightly, to 77%.
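
A minimal sketch of this stricter criterion, assuming the network outputs are stored as an array with one row of referent activations per test sentence (the array shapes are our assumption, not the authors' data format):

```python
# Thresholded accuracy: the target referent's activation must exceed a
# threshold, rather than merely being the highest among the candidates.
import numpy as np

def thresholded_accuracy(outputs, target_indices, threshold=0.8):
    """outputs: (n_trials, n_referents) activations; target_indices: (n_trials,)
    index of the structurally licensed antecedent on each trial."""
    target_acts = outputs[np.arange(len(outputs)), target_indices]
    return np.mean(target_acts > threshold)

# e.g., thresholded_accuracy(outputs, targets, threshold=0.8)
```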

10 We conducted a similar linearity analysis on sentences containing pronominal objects, using the comparisons described in section 5.2. Though this analysis did turn up many instances of linearity effects across all six of our networks, there were also three instances of statistically significant effects in the direction opposite to that predicted by linearity. As already noted, however, the overall level of performance on the task of pronominal reference was not very high, and network outputs did not closely reflect the desired activations: when the target activation was 0, the mean activation across networks was .292; when the target was .333, it was .287; and when the target was .5, it was .32. As a result, we are reluctant to conclude much from these results.

11 When a sentence included two levels of relative clause embedding, the outer relative was always a subject relative. The gender of the object of the outer relative was allowed to vary freely. The two sentence types are thus illustrated by the following pair of sentences:

(i) a. ObjRel: John who sees Alice who Harold visits admires himself.

b. SubjRel: John who sees Alice who visits Harold admires himself.

12 This puts aside a good many occurrences of reflexive pronouns, all inconsistent with a linearity-based generalization. However, these other sentences do not bear on the question of whether linearity is a possible source of information in the context in which it is confounded with a structural relation. Furthermore, as we shall see, SRNs appear to learn distinct generalizations about anaphoric dependencies in distinct structural contexts. Consequently, it seems appropriate to examine the support for such a generalization in a particular context.

13 We report results here for only five of the six networks that we trained using this regimen. Data from the sixth network, which was not excluded for any reason related to its behavior, have been corrupted and are no longer available for analysis.

14 Of course, there may be similarities between these representations that are being clouded by other distinctions, and these (clouded) similarities could nonetheless provide the network with a basis for generalization across sentence types. Nonetheless, to a first approximation, the kind of representational distinctiveness detected by HCA often diagnoses behaviorally relevant representational distinctions, and we therefore consider whether this is true in this case.
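
For reference, a hierarchical cluster analysis of this kind can be run over recorded hidden-unit vectors roughly as sketched below; the random stand-in data, labels, and linkage settings are our assumptions rather than the analysis pipeline actually used.

```python
# Minimal HCA sketch over hidden-unit activation vectors. In the real
# analysis the rows would be activations recorded from the trained SRN
# (e.g., at the reflexive), labeled by sentence type.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
hidden_states = rng.random((20, 40))            # (n_items, n_hidden_units), stand-in data
labels = [f"type_{i % 4}" for i in range(20)]   # hypothetical sentence-type labels

Z = linkage(hidden_states, method="average", metric="euclidean")
dendrogram(Z, labels=labels)    # clusters that track sentence type would
plt.show()                      # indicate representational distinctness
```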

15 Attempts to facilitate generalization by reducing the number of hidden units, as discussed in the last section, had no effect on the outcome in this case.

16 Indeed, Marcus (1998:266) already noted the similarity of his “a rose is a rose” problem to that of establishing a relation between a reflexive and its antecedent.

17 Marcus discusses a modification of his “a rose is a rose” experiment that also sidesteps the training space issue but in which the network nonetheless fails. To the previous training set, he adds sentences containing the previously withheld word rose that are not of the form an X is an X (the example he gives in the article is the bee sniffs the rose). In the face of such examples, the network does have reason to modify its connections to and from the input and output units corresponding to the word rose. However, even with this modification, the SRN still fails on the “a rose is a rose” task. Marcus suggests that this result nonetheless follows from his training space conception: he proposes that prediction in the different sentence types constitutes distinct functions that the network must compute, and the notion of training space is defined “with respect to some particular function.” Though Marcus's intuition about multiple functions is reasonably clear, it is not at all obvious what it would mean for the network to divide its task into separate functions or, even given a coherent definition of such a notion, why we should expect the network to divide it along the lines that seem most natural to the experimenter. Thus, without considerable analysis of the network's functioning, it will be difficult indeed to advance a function-specific training space argument.
