
A computational approach to detecting collocation errors in the writing of non-native speakers of English

Pages 353-367 | Received 15 Jul 2008, Published online: 01 Oct 2008
 

Abstract

This paper describes the first prototype of an automated tool for detecting collocation errors in texts written by non-native speakers of English. Candidate strings are extracted by pattern matching over POS-tagged text. Since learner texts often contain spelling and morphological errors, the tool attempts to automatically correct them in order to reduce noise. For a measure of collocation strength, we use the rank-ratio statistic calculated over one billion words of native-speaker texts. Two human annotators evaluated the system's performance. We report the overall results, as well as detailed error analyses, and discuss possible improvements for the future.
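The rank-ratio statistic is only named in the abstract. As a rough illustration, here is a minimal Python sketch of one plausible formulation, in which a collocate's frequency rank in the target word's context is compared with its frequency rank in the corpus overall; the function names and toy counts are hypothetical, not the authors' implementation.

```python
from collections import Counter

def rank(counts):
    """Map each item to its 1-based frequency rank (most frequent = 1)."""
    ordered = sorted(counts, key=counts.get, reverse=True)
    return {item: i + 1 for i, item in enumerate(ordered)}

def rank_ratios(global_counts, local_counts):
    """
    For each collocate of a target word, divide its frequency rank in the
    corpus overall (global) by its rank in the target's context (local).
    A high ratio marks a word that is unusually prominent near the target,
    i.e. a strong collocation candidate.
    """
    g = rank(global_counts)
    l = rank(local_counts)
    return {w: g[w] / l[w] for w in local_counts if w in g}

# Hypothetical toy counts: how often each noun occurs anywhere (global)
# vs. as the direct object of the verb "take" (local).
global_counts = Counter({"medicine": 500, "shower": 300, "table": 4000})
local_counts  = Counter({"medicine": 40, "shower": 25, "table": 1})
print(rank_ratios(global_counts, local_counts))
# {'medicine': 2.0, 'shower': 1.5, 'table': 0.333...}
```

On these toy counts, "medicine" and "shower" score well above 1 as objects of "take", while the globally frequent but unrelated "table" scores well below 1, which is the behavior a collocation-strength measure needs.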

Acknowledgments

We would first like to thank our two annotators, Sarah Ohls and Vicky Pszonka, for their hard work. We would also like to thank the anonymous reviewers, as well as Klaus Zechner and Xiaoming Xi, for their valuable comments.

Notes

1. For example, the equivalent of the English collocation ‘strong tea’ would be something like ‘thick tea’ in Japanese; ‘take medicine’ would be ‘drink medicine’.

2. We used a maximum entropy POS tagger, which has an accuracy of approximately 97%. See Ratnaparkhi (1998) for details.

3. Our system takes into account that the collocates may not be immediately adjacent to each other, and allows for some syntactic variation. For example, the verb + direct object pattern allows the object noun to be modified by a determiner and one or more adjectives (see the first sketch after these notes).

4. The training set is not part of the actual annotation set.

5. For a discussion of error tagging in learner corpora in general, see Díaz-Negrillo and Fernández-Domínguez (2006) and the works cited therein.

6. The TOEFL CBT (computer-based test) administered between April 2001 and March 2003.

7. There were eight cases, or less than 0.6% of all strings, in which mistagging occurred that was unrelated to misspelling in the candidate string itself. Assuming that every misspelling is a source of mistagging, the 37 remaining misspellings (under ‘Spelling’) and the eight mistagged cases account for 23% of all miscategorizations. Van Rooy and Schäfer (2003) compared the performance of three taggers, TOSCA, Brill, and CLAWS, on learner texts, and found that correcting misspellings significantly improved the tagging accuracy of each. Our plan to upgrade the spelling correction portion of the system, mentioned above, and to re-tag the spell-corrected candidate strings (see the second sketch after these notes) is likely to alleviate this problem.
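To make note 3 concrete, here is a minimal sketch of how a verb + direct object pattern might be expressed as a regular expression over word/TAG text, with an optional determiner and any number of adjectives between the verb and the object noun. The tag conventions and the regex itself are illustrative assumptions, not the authors' actual patterns.

```python
import re

# Hypothetical encoding of the verb + direct object pattern from note 3,
# matched over "word/TAG" text: a verb in any inflection, an optional
# determiner, zero or more adjectives, then the head noun of the object.
VERB_OBJ = re.compile(
    r"(?P<verb>\S+)/VB[DGNPZ]?"   # verb (VB, VBD, VBG, VBN, VBP, VBZ)
    r"(?:\s+\S+/DT)?"             # optional determiner
    r"(?:\s+\S+/JJ[RS]?)*"        # zero or more adjectives
    r"\s+(?P<noun>\S+)/NNS?"      # object noun (NN or NNS)
)

tagged = "they/PRP take/VBP the/DT strong/JJ medicine/NN daily/RB"
match = VERB_OBJ.search(tagged)
if match:
    print(match.group("verb"), match.group("noun"))  # -> take medicine
```

The point of the intervening optional groups is that the extracted candidate pair is still (take, medicine) even when the noun is modified, so the same collocation is counted regardless of surface variation.

And for the plan in note 7, a minimal sketch of a spell-correct-then-re-tag step, using a toy lookup table in place of a real spelling corrector and NLTK's off-the-shelf tagger as a stand-in for the maximum entropy tagger; both substitutions are assumptions for illustration only.

```python
import nltk  # requires nltk plus its "averaged_perceptron_tagger" resource

def correct_spelling(tokens):
    """Stand-in corrector: a real system would consult a spelling module."""
    fixes = {"medecine": "medicine"}  # toy lookup table (hypothetical)
    return [fixes.get(t, t) for t in tokens]

def retag(candidate):
    """Spell-correct a candidate string, then re-tag the corrected tokens."""
    tokens = correct_spelling(candidate.split())
    return nltk.pos_tag(tokens)

print(retag("take the medecine"))
# e.g. [('take', 'VB'), ('the', 'DT'), ('medicine', 'NN')]
```

Re-tagging after correction matters because a misspelled token often receives the wrong tag, which in turn breaks the pattern matching shown above; tagging the corrected string restores the candidate pair.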
