Can automated machine translation evaluation metrics be used to assess students’ interpretation in the language learning classroom?

Chao Han & Xiaolei Lu

Abstract

The use of translation and interpreting (T&I) in the language learning classroom is commonplace, serving various pedagogical and assessment purposes. Previous utilization of T&I exercises has been driven largely by their potential to enhance language learning, whereas the latest trend has begun to underscore T&I as a crucial skill to be acquired as part of transcultural competence for language learners and future language users. Despite their growing popularity and utility in the language learning classroom, assessing T&I is time-consuming, labor-intensive and cognitively taxing for human raters (e.g., language teachers), primarily because T&I assessment entails meticulous evaluation of informational equivalence between the source-language message and target-language renditions. One possible solution is to rely on automated quality metrics that were originally developed to evaluate machine translation (MT). In the current study, we investigated the viability of using four automated MT evaluation metrics, BLEU, NIST, METEOR and TER, to assess human interpretation. Essentially, we correlated the automated metric scores with the human-assigned scores (i.e., the criterion measure) from multiple assessment scenarios to examine the degree of machine-human parity. Overall, we observed fairly strong metric-human correlations for BLEU (Pearson’s r = 0.670), NIST (r = 0.673) and METEOR (r = 0.882), especially when the metric computation was conducted on the sentence level rather than the text level. We discussed these emerging findings and others in relation to the feasibility of operationalizing MT metrics to evaluate students’ interpretation in the language learning classroom.

Supplemental data for this article is available online at https://doi.org/10.1080/09588221.2021.1968915.
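
To make the metric-human comparison described in the abstract concrete, the following minimal sketch (not the authors’ actual pipeline) shows how sentence-level BLEU scores might be computed for student renditions and correlated with human-assigned scores. The renditions, references, human scores and library choices (NLTK, SciPy) are illustrative assumptions.

```python
# Minimal sketch: sentence-level BLEU scores correlated with human scores.
# All data below are hypothetical placeholders, not the study's materials.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import pearsonr

# Hypothetical tokenized data: one reference and one student rendition per
# sentence, plus a human-assigned score (e.g., on a 0-100 scale) for each.
references = [
    ["the", "economy", "grew", "rapidly", "last", "year"],
    ["officials", "announced", "the", "new", "policy", "today"],
    ["exports", "fell", "by", "five", "percent", "in", "march"],
]
candidates = [
    ["the", "economy", "expanded", "quickly", "last", "year"],
    ["officials", "announced", "a", "new", "policy", "today"],
    ["exports", "dropped", "five", "percent", "in", "march"],
]
human_scores = [72.0, 85.0, 80.0]

smooth = SmoothingFunction().method1  # avoids zero scores for short segments

# One metric score per sentence (segment-level scoring).
metric_scores = [
    sentence_bleu([ref], cand, smoothing_function=smooth)
    for ref, cand in zip(references, candidates)
]

# Metric-human correlation, analogous to the Pearson's r reported in the study.
r, p = pearsonr(metric_scores, human_scores)
print(f"Pearson's r = {r:.3f} (p = {p:.3f})")
```

The same correlational logic would apply to NIST, METEOR or TER scores, provided each metric is computed at the same level (sentence or text) as the human scores it is compared against.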

Disclosure statement

No potential conflict of interest was reported by the authors.

Funding

This study was supported by the National Education Examinations Authority and British Council English Assessment Research Grant and the Fundamental Research Funds for the Central Universities (no. 2072021116).

Notes on contributors

Dr. Chao Han’s research interests include testing and assessment issues in translation and interpreting (T&I), pedagogically-oriented T&I studies and methodological aspects of T&I research.

Dr. Xiaolei Lu’s research interests include corpus processing, translation technology and automated translation/interpreting assessment.

Notes

1 In MT, the quality estimation (QE) methodology can be used to estimate MT quality without recourse to target-language references. However, applying QE to evaluate T&I in the language learning classroom is, at present, only remotely possible, due to its sophisticated computation.

2 Please also refer to Han et al. (2021), from which our data were derived and in which the rater-generated scores based on the source- and target-language references were compared to examine human raters’ scoring patterns.

3 The undergraduate English majors selected interpreting as their optional course, while the postgraduate students majored in English language and literature with a special focus on English-Chinese interpreting.

4 TEM-4 and TEM-8 are two primary English proficiency tests developed specifically for English majors in China’s mainland.

5 In hindsight, we also calculated a new document-level metric, Corpus BLEU, which is based on micro-averaged n-gram precision computed at the corpus level (an illustrative sketch contrasting the two aggregation strategies follows these notes). The Corpus BLEU metric computed on the basis of one reference text had relatively strong correlations with the human rater-generated scores in each of the four conditions: r = 0.737 for the target-text, sentence-level condition; r = 0.723 for the target-text, text-level condition; r = 0.716 for the source-text, sentence-level condition; and r = 0.728 for the source-text, text-level condition. These coefficients were slightly larger (by approximately 0.05) than the correlation coefficients associated with the BLEU scores previously computed and reported in Table 5. This means that the previous text-level BLEU scores are largely comparable with the new document-level Corpus BLEU scores.

6 https://github.com/luxiaolei930/CALL-Metrics-Sentence-level-scoring
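
As an illustration of the distinction drawn in note 5, the following sketch (again using hypothetical data, not the study’s materials) contrasts document-level Corpus BLEU, which pools n-gram statistics over all segments (a micro-average), with a simple macro-average of sentence-level BLEU scores computed with NLTK.

```python
# Minimal sketch: document-level Corpus BLEU (micro-average) versus the mean
# of sentence-level BLEU scores (macro-average). Data are hypothetical.
from nltk.translate.bleu_score import corpus_bleu, sentence_bleu, SmoothingFunction

references = [
    [["the", "economy", "grew", "rapidly", "last", "year"]],
    [["officials", "announced", "the", "new", "policy", "today"]],
]
candidates = [
    ["the", "economy", "expanded", "quickly", "last", "year"],
    ["officials", "announced", "a", "new", "policy", "today"],
]

smooth = SmoothingFunction().method1

# Micro-average: n-gram counts are pooled across all segments before the
# precision and brevity penalty are computed.
doc_bleu = corpus_bleu(references, candidates, smoothing_function=smooth)

# Macro-average: each segment is scored separately, then the scores are averaged.
segment_bleus = [
    sentence_bleu(refs, cand, smoothing_function=smooth)
    for refs, cand in zip(references, candidates)
]
macro_bleu = sum(segment_bleus) / len(segment_bleus)

print(f"Corpus BLEU (micro-average): {doc_bleu:.3f}")
print(f"Mean sentence BLEU (macro-average): {macro_bleu:.3f}")
```

Because the two aggregation strategies weight segments differently, their scores can diverge when segment lengths or quality vary, which is why the comparability reported in note 5 is worth checking empirically.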
