Formal vs. Processing Approaches to Syntactic Phenomena

The need for quantitative methods in syntax and semantics research

Pages 88-124 | Received 14 Dec 2009, Accepted 06 Aug 2010, Published online: 27 Oct 2010

Abstract

The prevalent method in syntax and semantics research involves obtaining a judgement of the acceptability of a sentence/meaning pair, typically by just the author of the paper, sometimes with feedback from colleagues. This methodology does not allow proper testing of scientific hypotheses because of (a) the small number of experimental participants (typically one); (b) the small number of experimental stimuli (typically one); (c) cognitive biases on the part of the researcher and participants; and (d) the effect of the preceding context (e.g., other constructions the researcher may have been recently considering). In the current paper we respond to some arguments that have been given in support of continuing to use the traditional nonquantitative method in syntax/semantics research. One recent defence of the traditional method comes from Phillips (2009), who argues that no harm has come from the nonquantitative approach in syntax research thus far. Phillips argues that there are no cases in the literature where an incorrect intuitive judgement has become the basis for a widely accepted generalisation or an important theoretical claim. He therefore concludes that there is no reason to adopt more rigorous data collection standards. We challenge Phillips' conclusion by presenting three cases from the literature where a faulty intuition has led to incorrect generalisations and mistaken theorising, plausibly due to cognitive biases on the part of the researchers. Furthermore, we present additional arguments for rigorous data collection standards. For example, allowing lax data collection standards has the undesirable effect that the results and claims will often be ignored by researchers with stronger methodological standards. Finally, we observe that behavioural experiments are easier to conduct in English than ever before, with the advent of Amazon.com's Mechanical Turk, a marketplace interface that can be used for collecting behavioural data over the internet.

Acknowledgments

The research reported here was supported by the National Science Foundation under Grant No. 0844472, “Bayesian Cue Integration in Probability-Sensitive Language Processing”. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

We would like to thank Diogo Almeida, Leon Bergen, Joan Bresnan, David Caplan, Nick Chater, Morten Christiansen, Mike Frank, Adele Goldberg, Helen Goodluck, Greg Hickok, Ray Jackendoff, Nancy Kanwisher, Roger Levy, Maryellen MacDonald, James Myers, Colin Phillips, Steve Piantadosi, Steve Pinker, David Poeppel, Omer Preminger, Ian Roberts, Greg Scontras, Jon Sprouse, Carson Schütze, Mike Tanenhaus, Vince Walsh, Duane Watson, Eytan Zweig, members of TedLab, and three anonymous reviewers for their comments on an earlier draft of this paper. We would also like to thank Kristina Fedorenko for her help in constructing the materials for the experiment in Case Study 3.

Notes

1In a nonquantitative (single-participant/single-item) version of the acceptability judgement task, a 2- or 3-point scale is typically used, usually consisting of “good”/“natural”/“grammatical”/“acceptable” vs. “bad”/“unnatural”/“ungrammatical”/“unacceptable” (usually annotated with an asterisk “*” in the papers reporting such judgements), and sometimes including a judgement of “in between”/“questionable” (usually annotated with a question mark “?”). In a quantitative version of the acceptability judgement task (with multiple participants and items), a fixed scale (a “Likert scale”) with five or seven points is typically used. Alternatively, a geometric scale is used where the acceptability of each target sentence is compared to a reference sentence. This latter method is known as magnitude estimation (Bard, Robertson, & Sorace, Citation1996). Although some researchers have hypothesised that magnitude estimation allows detecting more fine-grained distinctions than Likert scales (Bard et al., 1996; Featherston, Citation2005, 2007; Keller, Citation2000), controlled experiments using both Likert scales and magnitude estimation suggest that the two methods are equally sensitive (Fedorenko & Gibson, 2010a; Fukuda, Michel, Beecher, & Goodall, Citation2010; Weskott & Fanselow, Citation2008, Citation2009).
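For concreteness, the following is a minimal sketch (in Python; not part of the original paper) of how Likert and magnitude-estimation responses might each be normalised within participants before comparing the two scales' sensitivity to a condition manipulation. The column names ("participant", "condition", "rating") and the condition labels are hypothetical placeholders.

```python
# A hypothetical analysis sketch: normalise each participant's ratings and run
# a rough sensitivity check on the condition contrast. Assumed data layout:
# one row per trial with columns "participant", "condition", "rating".
import numpy as np
import pandas as pd
from scipy import stats

def normalise_ratings(df, magnitude=False):
    """Z-score each participant's ratings; log-transform magnitude estimates first,
    since they are ratio-scaled relative to a reference sentence."""
    df = df.copy()
    if magnitude:
        df["rating"] = np.log(df["rating"])
    df["z"] = df.groupby("participant")["rating"].transform(
        lambda r: (r - r.mean()) / r.std(ddof=0)
    )
    return df

def condition_effect(df):
    """Paired t-test on by-participant condition means (assumed condition labels)."""
    means = df.groupby(["participant", "condition"])["z"].mean().unstack()
    return stats.ttest_rel(means["acceptable"], means["unacceptable"])
```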

2It should be noted that some researchers have criticised the acceptability judgement method because it requires participants to be aware of language as an object of evaluation, rather than simply as a means of communication (Edelman & Christiansen, Citation2003). Whereas this concern is worth considering with respect to the research questions that are being evaluated, one should also consider the strengths of the acceptability judgement method: (1) it is an extremely simple and efficient task; and (2) results from acceptability judgement experiments are highly systematic across speakers and correlate with other dependent measures, presumably because the same factors affect participants’ responses across different measures (Schütze, Citation1996; Sorace & Keller, Citation2005).

3In analysing quantitative data, it is important to examine the distributions of individual responses in order to determine whether further analyses may be necessary, in cases where the population is not sufficiently homogeneous with respect to the phenomena in question. A wide range of analysis techniques are available for not only characterising the population as a whole, but also for detecting stable sub-populations within the larger population or for characterising stable individual differences (e.g., Gibson et al., Citation2009).
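As an illustration only, the sketch below shows one hypothetical way to look for stable sub-populations of the kind this note describes: cluster participants by their per-condition mean (normalised) ratings and inspect the resulting groups. The data layout and the choice of a Gaussian mixture model are assumptions for illustration, not the analyses used in the cited work.

```python
# Hypothetical sub-population check: group participants by their per-condition
# judgement profiles. Assumes a data frame with columns "participant",
# "condition", and a normalised rating "z" (as in the earlier sketch).
from sklearn.mixture import GaussianMixture

def find_subgroups(df, n_groups=2):
    # One row per participant, one column per condition (mean normalised rating).
    profiles = df.groupby(["participant", "condition"])["z"].mean().unstack()
    # Fit a mixture model and label each participant with a latent group.
    gm = GaussianMixture(n_components=n_groups, random_state=0).fit(profiles)
    profiles["group"] = gm.predict(profiles.drop(columns=[], errors="ignore"))
    return profiles  # inspect group sizes and per-group condition patterns
```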

4In fact, some methods in cognitive science and cognitive neuroscience were specifically developed to get at representational questions (e.g., lexical/syntactic priming methods, neural adaptation or multi-voxel pattern analyses in functional MRI).

5See http://www.talkingbrains.org/2010/06/weak-quantitative-standards-in.html for a recent presentation and discussion of some of these arguments.

6Colin Phillips and some of his former students/postdocs have commented to us that, in their experience, quantitative acceptability judgement studies almost always validate the claim(s) in the literature. This is not our experience, however. Most experiments that we have run to test some syntactic/semantic hypothesis in the literature end up providing us with a pattern of data that had not been known before the experiment (e.g., Breen, Fedorenko, Wagner, & Gibson, 2010; Fedorenko & Gibson, 2010a; Patel et al., 2009; Scontras & Gibson, 2010).

7Note that there is a tension between this claim and the Chomskyan idea that there is a universal language faculty possessed by all native speakers of a language, including naïve subjects. Moreover, as discussed above, the need to ignore irrelevant features of examples should be eliminated by good experimental design: the experimenter reduces the possibility of confounding influences by controlling theoretically irrelevant variables in the materials to be compared.

8These two sentences are of course not a minimal pair, because of several uncontrolled differences between the items, including (a) the wh-question in (11) is a matrix wh-question, while the wh-question in (12) is an embedded wh-question; and (b) the lexical items in the two sentences aren't the same (“I” vs. “you”). These differences were controlled in the experimental comparison reported below.

9An anonymous reviewer has noted that “the importance of vacuous movement has plummeted greatly since Barriers days, and thus that the judgements in question really don't matter all that much”. Although this may be true, this is certainly a case that, in the words of Phillips (2009), “adversely affected the progress of the field of syntax”. In particular, Chomsky's writings have a much greater impact than other syntacticians’ writings, so any errors in his work are exacerbated in the field for years to come. To give a specific example, the first author of this paper (Edward Gibson) began to work in the field of syntax in the late 1980s, but was so disenchanted by several of the judgements in Chomsky (1986) that he shifted his research focus to a different topic within the area of language research.

For summaries of issues relevant to basic experimental design for language research, see e.g., Ferreira (2005) and Myers (2009).
