Using collocation clusters to detect and correct English L2 learners’ collocation errors

Ping-Yu Huang & Nai-Lung Tsao
Pages 270-296 | Published online: 16 May 2019
 

Abstract

In this article, we describe an online English collocation explorer developed to help English L2 learners produce correct and appropriate collocations. Our tool, which visually represents relevant correct and incorrect collocations on a single webpage, is designed around the notions of collocation clusters and intercollocability proposed by Cowie and Howarth. As they pointed out, L2 learners generally cannot distinguish the true collocations in a collocation cluster (e.g., tell truth, state truth, and state fact) from the impossible combinations (e.g., *say fact and *say truth). Accordingly, our tool applies natural language processing techniques to construct collocation clusters that enable learners to differentiate easily between correct and incorrect pairs. Relying on data from a reference corpus, our system instantaneously computes the collocability of a user’s target combination (verb–noun or adjective–noun) with all other relevant words and presents the true collocations that L2 learners should master and the false ones they should avoid. To assess our tool, we investigated its performance in detecting and correcting learners’ V–N and A–N errors and obtained results comparable to those of most previous studies. In a pilot with 13 intermediate or upper-intermediate learners of English as a foreign language, our tool was found to help them self-correct their collocation errors effectively. Compared with similar tools or approaches, our tool requires far fewer data resources but still demonstrates a remarkable capability to detect and correct errors and to generate useful collocational knowledge in English.
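The cluster idea can be illustrated with a minimal sketch. It crosses a small set of verbs with a small set of nouns and marks each pair as attested or not. The function name `build_cluster` and the hard-coded `attested` set are assumptions made for illustration; the set stands in for a reference-corpus lookup and contains only the pairs named in the abstract, so this is not the tool's actual implementation.

```python
from itertools import product

def build_cluster(verbs, nouns, attested):
    """Cross every verb with every noun and flag each pair as attested or not.

    `attested` is a hypothetical stand-in for a reference-corpus lookup.
    """
    return {(v, n): (v, n) in attested for v, n in product(verbs, nouns)}

# Toy data taken from the examples in the abstract (not corpus output).
attested = {("state", "truth"), ("state", "fact")}
cluster = build_cluster(["say", "state"], ["truth", "fact"], attested)

for (verb, noun), ok in cluster.items():
    # Unattested pairs are printed with the asterisk convention used in the text.
    print(f"{'' if ok else '*'}{verb} {noun}")
```

Showing the whole grid at once is what lets a learner see that state combines with both nouns while say combines with neither.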

Notes

1 The mean reciprocal rank is a measure frequently adopted in information retrieval to evaluate the quality of collected responses. We discuss this measure in more detail later when we report the correction performance of our own tool.
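As a minimal sketch of the standard definition, the mean reciprocal rank averages 1/rank of the first correct suggestion over all test items, with 0 for items that have no correct suggestion. The function name and the toy suggestion lists below are assumptions for illustration and are not data from the study.

```python
def mean_reciprocal_rank(ranked_suggestions, gold_answers):
    """Average of 1/rank of the first correct suggestion per item (0 if none)."""
    total = 0.0
    count = 0
    for suggestions, gold in zip(ranked_suggestions, gold_answers):
        count += 1
        for rank, candidate in enumerate(suggestions, start=1):
            if candidate in gold:
                total += 1.0 / rank
                break  # only the first correct suggestion counts
    return total / count if count else 0.0

# Hypothetical ranked corrections for three erroneous combinations.
ranked = [["tell", "state", "speak"], ["gain", "earn"], ["grab"]]
gold = [{"tell"}, {"earn"}, {"learn"}]
mrr = mean_reciprocal_rank(ranked, gold)  # (1/1 + 1/2 + 0) / 3
```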

2 WordNet is a large English lexical database in which words are organized into sets of synonyms, each describing an individual concept. These sets, called synsets, are further structured and interconnected by a variety of lexical relations, including hypernymy, hyponymy, meronymy, holonymy, troponymy, entailment, and coordinate terms. WordNet can be used effectively as a dictionary or thesaurus. With its abundant knowledge and structure, it has been considered a valuable electronic tool for computational linguistics and NLP. WordNet is accessible via its web interface: http://wordnetweb.princeton.edu/perl/webwn (Princeton University, 2010).

3 In fact, Dahlmeier and Ng (2011) did consult two judges, who manually evaluated the candidate collocates found by their system. However, the judges were only required to check the top three suggestions. The researchers found that the combined features achieved precisions of 38.2%, 32.87%, and 29.3% at ranks 1–3, respectively.

4 NetCollo is available at: http://www.netcollo.info/

5 It should be noted that WordNet provides different lexical relations for different parts of speech. Synsets of adjectives, for example, are not interconnected by hypernym and hyponym relations. On NetCollo, we thus compute the semantic similarities of adjectives based solely on the synonym information in WordNet. The unavailability of this relation information, however, somewhat restricts the construction and usefulness of NetCollo clusters involving adjectives. We discuss this further in the next section when we present and evaluate NetCollo’s performance in correcting A–N errors.

6 The 5 (frequency) × 4.0 (MI) combination was used on 34 of the 242 tested items. In each of these 34 combinations, one or both components were low-frequency words (e.g., the word vocabulary in the wrong collocation *catch vocabulary). As noted, a NetCollo cluster is generated and developed through shared collocates. Accordingly, we had to lower the frequency threshold for the low-frequency words to obtain at least some collocates first, after which our tool could find other words that share collocates with them. In this study, if one or both components of a tested combination had a frequency lower than 2,000 in the BNC, we used the low-frequency setting. On the NetCollo interface, we also suggest that users lower the frequency threshold to 5 or 3 if they would like to input low-frequency keywords.
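The threshold rule in this note can be sketched as follows. The cutoff of 2,000 and the 5 × 4.0 and 10 × 4.0 settings come from the notes; everything else is an assumption made for illustration. In particular, the MI formula is the common corpus-linguistics form log2(f(x,y) · N / (f(x) · f(y))), the thresholds are applied inclusively to the co-occurrence count, and the function names are hypothetical. NetCollo's own computation may differ.

```python
import math

LOW_FREQ_CUTOFF = 2000  # BNC word frequency below which the low-frequency setting applies

def choose_thresholds(freq_a, freq_b):
    """Return (minimum co-occurrence frequency, minimum MI) for a word pair."""
    if freq_a < LOW_FREQ_CUTOFF or freq_b < LOW_FREQ_CUTOFF:
        return 5, 4.0   # low-frequency setting
    return 10, 4.0      # default setting

def mutual_information(pair_freq, freq_a, freq_b, corpus_size):
    """Assumed MI formula: log2 of observed over expected co-occurrence."""
    return math.log2(pair_freq * corpus_size / (freq_a * freq_b))

def passes_thresholds(pair_freq, freq_a, freq_b, corpus_size):
    """Check a pair against both thresholds (inclusive comparison is an assumption)."""
    min_freq, min_mi = choose_thresholds(freq_a, freq_b)
    if pair_freq < min_freq:
        return False
    return mutual_information(pair_freq, freq_a, freq_b, corpus_size) >= min_mi
```

For example, a pair that co-occurs 7 times passes the frequency check only under the low-frequency setting, which is why lowering the threshold lets rare words collect at least some collocates.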

7 The XML edition of the BNC comprises around 100 million running tokens.

8 In our evaluation, reliable clusters are those capable of suggesting correct alternatives to collocation errors. To examine which MI and frequency combination performed best, we tested MI scores ranging from 1.0 to 7.5 and frequency thresholds ranging from 5 to 60 and found that the 4.0 (MI) × 10 (frequency) combination collected the most gold answers to our V–N and A–N errors. We present the evaluation results in the Performance of collocation error correction section.
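A threshold search of this kind can be sketched as a simple grid search. The ranges (MI 1.0 to 7.5, frequency 5 to 60) are from the note; the step sizes of 0.5 and 5, the function name, and the `count_gold` callback are assumptions, since the note does not say how the grid was sampled or scored.

```python
def best_threshold_pair(count_gold, mi_values, freq_values):
    """Return the (MI, frequency) pair for which count_gold(mi, freq) is largest.

    `count_gold` is a hypothetical callback that runs the evaluation with the
    given thresholds and returns the number of gold answers retrieved.
    """
    best = None
    for mi in mi_values:
        for freq in freq_values:
            score = count_gold(mi, freq)
            # Strict comparison keeps the first pair found in case of ties.
            if best is None or score > best[0]:
                best = (score, mi, freq)
    if best is None:
        raise ValueError("empty grid")
    return best[1], best[2]

# Assumed sampling of the ranges given in the note.
mi_grid = [1.0 + 0.5 * i for i in range(14)]   # 1.0, 1.5, ..., 7.5
freq_grid = list(range(5, 61, 5))              # 5, 10, ..., 60
```

Passing the evaluation in as a callback keeps the search independent of how gold answers are counted.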

9 We suggest that advanced learners of English choose semantically similar words themselves to construct more useful clusters. When users search for the wrong collocation *make income on NetCollo, for example, they will see that similar nouns for income include profit, earnings, dividend, and amount, but NetCollo does not automatically provide appropriate corrections for make. Users, however, can use the Reselect function on NetCollo to choose the intuitively similar words earnings and wage for income and in this way obtain a good verb alternative: earn.

Additional information

Notes on contributors

Ping-Yu Huang

Ping-Yu Huang is an assistant professor at the General Education Center, Ming Chi University of Technology. His research focuses on corpus linguistics, digital language learning, and second language acquisition.

Nai-Lung Tsao

Nai-Lung Tsao is an assistant researcher at the Office of Information Services of Tamkang University. His main research interests include information retrieval and natural language processing.
