Abstract
The study reported in this article conducts a quantitative analysis of the Sesotho sa Leboa National Centre for Human Language Technology (NCHLT) part-of-speech annotated data set and uses it to compare the quality of the NCHLT and CTexT part-of-speech taggers. The two taggers were developed as part of the NCHLT Text project, and both are based on Taljard et al.’s fine-grained tagset, which is aligned to the morphological structure of Sesotho sa Leboa. A gold standard data set of 7 153 tokens is used to evaluate and compare the overall performance of the two taggers and to perform a fine-grained error analysis. We find that the NCHLT and CTexT taggers obtain 88.40% and 94.18% accuracy, respectively, and we describe the linguistic nature of the most frequent errors made by each tagger.
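As an illustration of the kind of evaluation summarised above, the following minimal sketch computes token-level accuracy against a gold standard and counts the (gold tag, predicted tag) confusions. It assumes that accuracy means the proportion of tokens whose predicted tag matches the gold tag. The function names, the toy tag labels and the five-token example are hypothetical; they are not taken from the study's data, tagset or tooling.

```python
from collections import Counter


def token_accuracy(gold_tags, predicted_tags):
    """Proportion of tokens whose predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must be the same length")
    if not gold_tags:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


def error_counts(gold_tags, predicted_tags):
    """Count each (gold, predicted) pair where the tagger disagrees with gold."""
    return Counter(
        (g, p) for g, p in zip(gold_tags, predicted_tags) if g != p
    )


# Hypothetical toy data: placeholder tags, not the fine-grained tagset itself.
gold = ["N", "V", "N", "ADJ", "V"]
predicted = ["N", "V", "V", "ADJ", "N"]

accuracy = token_accuracy(gold, predicted)
errors = error_counts(gold, predicted)

print(f"accuracy: {accuracy:.2%}")
for (gold_tag, predicted_tag), count in errors.most_common():
    print(f"gold={gold_tag} predicted={predicted_tag}: {count}")
```

Sorting the confusion pairs by frequency, as `most_common()` does here, is one simple way to surface the most frequent error types for closer linguistic inspection.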