ABSTRACT
Lexical diversity (LD) is an important indicator of second language lexical development. Much research has investigated LD indices, with a focus on learners of English. However, further research is needed on languages that are typologically distinct from English, such as Korean. In this study, we evaluated the reliability and validity of LD indices applied to argumentative writing produced by learners of Korean. The results indicated that HD-D, MATTR, and MTLD were reliable across different text lengths and were correlated with holistic proficiency scores. However, meaningful differences were found across Korean-specific tokenization types related to the way morphemes are processed.
Disclosure statement
We used the NIKL dataset, access to which requires permission. We strictly followed all stipulated guidelines to respect the interests of the data providers. No potential conflict of interest was reported by the author(s).
Supplementary material
Supplemental data for this article can be accessed online at https://doi.org/10.1080/15434303.2024.2311728
Notes
1 The National Institute of Korean Language (NIKL) is a government institution established with the aim of developing the Korean language and enhancing its usage in daily life. Its primary objectives include conducting research on language policies, managing linguistic data, and publishing Korean dictionaries.
2 Precision measures the proportion of the parsed tokens that were correctly identified. Recall measures the proportion of the actual tokens that were correctly identified and parsed by the tokenizer. To combine these two metrics into a single performance measure, we calculated the F1 score using the formula: 2 * (precision * recall)/(precision + recall).
3 An odds ratio shows the multiplicative change in the odds of an outcome associated with a one-unit increase (i.e., one proficiency level) in the predictor, assuming all other factors remain unchanged.
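As an illustration of the F1 calculation described in Note 2, the following is a minimal sketch rather than the evaluation procedure actually used in the study. It assumes that tokenizer output and reference tokens are compared as unordered bags of tokens, which ignores token position. The function name `tokenization_f1` and the example tokens are hypothetical.

```python
from collections import Counter


def tokenization_f1(predicted, gold):
    """Return (precision, recall, F1) for a tokenizer's output.

    Assumption: tokens are matched as a multiset (bag of tokens),
    so position in the sentence is ignored. This is a simplification.
    """
    pred_counts = Counter(predicted)
    gold_counts = Counter(gold)
    # Multiset intersection: tokens appearing in both lists,
    # counted at most as often as they occur in each.
    correct = sum((pred_counts & gold_counts).values())

    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0

    # Formula from Note 2: 2 * (precision * recall) / (precision + recall)
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1


# Hypothetical example: the tokenizer fails to split the final ending.
predicted_tokens = ["학교", "에", "가", "았다"]
gold_tokens = ["학교", "에", "가", "았", "다"]
precision, recall, f1 = tokenization_f1(predicted_tokens, gold_tokens)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

In this hypothetical case, three of the four predicted tokens match the reference (precision 0.75) and three of the five reference tokens are recovered (recall 0.60), giving an F1 of about 0.667.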