
Evaluation, use, and refinement of knowledge representations through acquisition modeling

Pages 126-147 | Received 04 Dec 2015, Accepted 18 Apr 2016, Published online: 11 Aug 2016
 

ABSTRACT

Generative approaches to language have long recognized the natural link between theories of knowledge representation and theories of knowledge acquisition. The basic idea is that the knowledge representations provided by Universal Grammar enable children to acquire language as reliably as they do because these representations highlight the relevant aspects of the available linguistic data. So, one reasonable evaluation of any theory of representation is how useful it is for acquisition. This means that when we have multiple theories for how knowledge is represented, we can try to evaluate these theoretical options by seeing how children might use them during acquisition. Computational models of the acquisition process are an effective tool for determining this, since they allow us to incorporate the assumptions of a representation into a cognitively plausible learning scenario and see what happens. We can then identify which representations work for acquisition and what those representations need to work. This in turn allows us to refine both our theories of how knowledge is represented and how those representations are used by children during acquisition. I discuss two case studies of this approach for representations in metrical stress and syntax and consider what we learn from this computational acquisition evaluation in each domain.

Acknowledgments

I am especially grateful to Jon Sprouse, Tim Ho, and Zephyr Detrano for being delightful collaborators on the projects discussed here, and to Joanna Lee for her help analyzing the metrical stress data. I’m also indebted to Jeff Lidz and Jeff Heinz for being wonderfully supportive of this work, and to Rachel Dudley for organizing the workshop where these ideas were presented. These ideas have additionally benefited from discussion with Pranav Anand, Misha Becker, Bob Berwick, Adrian Brasoveanu, Alex Clark, Sandy Chung, Bob Frank, Norbert Hornstein, Jim McCloskey, Armin Mester, Colin Phillips, William Sakas, Virginia Valian, Matt Wagers, Charles Yang, and several anonymous reviewers, as well as the audiences at ISA 2012, UMaryland Mayfest 2012, NYU Linguistics 2012, JHU Cognitive Sciences 2013, UC Irvine IMBS 2013, UC Irvine LPS 2013, UC Santa Cruz Linguistics 2014, BLS 2014, and GALANA 2015.

Funding

This work has additionally been supported by NSF grants BCS-0843896 and BCS-1347028.

Notes

1 The syllabified and stressed annotations for the word types are available at http://www.socsci.uci.edu/~lpearl/CoLaLab/uci-brent-syl-structure-Jul2014.xlsx.

2 In particular, OT theorists often reserve the term grammar for a mapping from underlying representations to surface representations (e.g., underlying /lɪ ɾɪl/ to surface /(lɪ́) ɾɪl/). This can easily correspond to multiple explicit rankings of the constraints available in the prodKR, as it does here for English. This contrasts with how I use the term grammar here, where I mean a single combination of the variable values in the prodKR—in particular, a single explicit ranking of the constraints. I will continue with this latter use of grammar in order to be fair to the HV and Hayes prodKRs, whose grammars are also defined by the distinct combinations of the variable values available (which are parameter values for those prodKRs).
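
To make this use of grammar concrete, here is a minimal sketch of how grammars-as-combinations of variable values can be enumerated for a constraint-ranking prodKR versus a parametric one. The constraint and parameter names are hypothetical placeholders, not the actual inventories of the prodKRs discussed in the article:

```python
from itertools import permutations, product

# Hypothetical constraint and parameter names, for illustration only;
# the actual prodKRs have their own inventories of variables.
ot_constraints = ["FtBin", "Trochaic", "Parse-Syl", "NonFinality"]
hv_parameters = {
    "extrametricality": ["none", "rightmost"],
    "foot_directionality": ["left", "right"],
    "boundedness": ["bounded", "unbounded"],
}

# Under the usage in this article, each distinct combination of variable
# values is one grammar: for the OT prodKR, each explicit ranking
# (i.e., permutation) of the constraints...
ot_grammars = list(permutations(ot_constraints))
print(len(ot_grammars))  # 4! = 24 explicit rankings

# ...and for a parametric prodKR, each assignment of parameter values.
hv_grammars = list(product(*hv_parameters.values()))
print(len(hv_grammars))  # 2 * 2 * 2 = 8 parameter-value combinations
```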

3 The HV prodKR allows three syllabic distinctions, the Hayes prodKR allows four, and the OT prodKR allows eight. See Pearl, Ho & Detrano (2014, in press) for a more detailed discussion of these distinctions.

4 I note that this is in the same spirit as the classical principle of Empirical Risk Minimization (ERM) in statistical learning theory (Vapnik 1992, 2013), where the learner picks a hypothesis that minimizes error on the training data. In this case, it would mean children should pick the hypothesis that best accounts for the data available, i.e., the data in their acquisitional intake. Interestingly, this can lead to a problem of overfitting, where the learner matches the data too closely and loses the ability to generalize. This is very much a problem children face, as they want a grammar that correctly accounts for data they have yet to encounter, i.e., one that generalizes appropriately. Many thanks to an anonymous reviewer for bringing my attention to this.
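
For concreteness, one standard statement of ERM is given below; in the acquisition setting sketched here, the hypothesis space would be the set of grammars a prodKR makes available and the data points would be the child's acquisitional intake:

```latex
% One standard statement of ERM: the learner selects the hypothesis
% \hat{h} from hypothesis space \mathcal{H} that minimizes average loss L
% over the n observed data points (x_i, y_i):
\hat{h} \;=\; \operatorname*{arg\,min}_{h \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^{n} L\bigl(h(x_i),\, y_i\bigr)
```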

5 This can be helpful for preventing overfitting, which happens when the learner attempts to account for data that actually aren’t relevant for generalization. That is, these data are in the input, but they’re anomalies in the sense that they aren’t generated by the productive system for the language. So, these data can lead the child astray when identifying the correct productive system for the language.

6 Interestingly, just because variation is removed doesn’t mean the productive data are completely captured by any single grammar. This can be seen in the learnability potentials that remain less than 1 even after variation is removed. This is because “productive” according to the Tolerance Principle is simply about relative frequency and doesn’t necessarily accord with the variables a prodKR uses to define its grammars. See Pearl, Ho & Detrano (in press) for an explicit demonstration of this point.
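
As a concrete illustration of the relative-frequency notion of productivity invoked here: under Yang's Tolerance Principle, a rule covering N relevant items is productive only if its exceptions number at most N / ln N. A minimal sketch of that threshold:

```python
import math

def tolerance_threshold(n: int) -> float:
    """Yang's Tolerance Principle threshold: a rule covering n relevant
    items can tolerate at most n / ln(n) exceptions and remain productive."""
    return n / math.log(n)

def is_productive(n_items: int, n_exceptions: int) -> bool:
    # Productive iff the exception count is at or below the threshold.
    return n_exceptions <= tolerance_threshold(n_items)

# For example, a pattern covering 100 word types tolerates ~21.7 exceptions:
print(round(tolerance_threshold(100), 1))  # 21.7
print(is_productive(100, 15))              # True
print(is_productive(100, 30))              # False
```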

7 Note that trigram encodings typically represent the beginning and ending of sequences with special symbols (start and end here), since this is relevant information.
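
A minimal sketch of this encoding, using start and end as the boundary symbols and a hypothetical sequence of symbols a, b, c; padding with two boundary symbols per side is one common convention for trigrams:

```python
def trigrams(symbols):
    """Extract trigrams from a symbol sequence, padding with boundary
    symbols so the beginning and end of the sequence are encoded too."""
    padded = ["start", "start"] + list(symbols) + ["end", "end"]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]

print(trigrams(["a", "b", "c"]))
# [('start', 'start', 'a'), ('start', 'a', 'b'), ('a', 'b', 'c'),
#  ('b', 'c', 'end'), ('c', 'end', 'end')]
```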

8 Note that all log probabilities are negative because raw probabilities are between 0 and 1, and so the log probability is between negative infinity and 0 (e.g., log(0.000001) = −6 while log(1) = 0). This means the numbers closer to zero are less negative and appear higher on the y axis; these represent structures judged by the modeled learner as “more acceptable.” Numbers further from zero are more negative and appear lower on the y axis; these represent structures judged “less acceptable.”
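
To illustrate with base-10 logs as in the example above, here is a small sketch; the structure names and probabilities are hypothetical, chosen only to show the mapping from raw probabilities to log probabilities:

```python
import math

# Hypothetical raw probabilities for two structures, for illustration only.
structures = {"structure_a": 0.001, "structure_b": 0.000001}

# Raw probabilities lie in (0, 1], so base-10 log probabilities lie in (-inf, 0].
log_probs = {s: math.log10(p) for s, p in structures.items()}
print(log_probs)  # approximately {'structure_a': -3.0, 'structure_b': -6.0}

# The log probability closer to zero corresponds to the structure the
# modeled learner judges more acceptable.
print(max(log_probs, key=log_probs.get))  # structure_a
```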

9 Because we currently don’t have a precise theory for translating probabilities into acceptability judgments, it makes less sense to look for a quantitative match: actual acceptability judgments are based on many factors that are not included in this model, such as lexical item choice, semantic probability, and processing difficulty (Schütze 1996; Cowart 1997; Keller 2000; Sprouse 2009).
