Original Articles

Developing a tagset and tagger for the African languages of South Africa with special reference to Xhosa

Pages 223-237 | Published online: 12 Nov 2009
 

Abstract

There are currently two distinct, though not necessarily mutually exclusive, approaches to retrieving information from linguistic corpora. 'Corpus-driven' approaches rely solely on the corpus itself to yield significant patterns. With the exception of orthographic spacing, no annotations are added to the 'raw' corpus to guide searches and the retrieval of information. Typically, key word in context (KWIC) analyses are applied to relevant concordance lines to extract statistically significant lexical and grammatical patterns. In 'corpus-based' approaches, on the other hand, information is retrieved from an enriched corpus on the basis of annotations in the form of linguistic tags. That is, the annotations are used to direct searches to specific grammatical and lexical phenomena in the corpus.
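The contrast between the two retrieval styles can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names `kwic` and `by_tag`, the toy English tokens, and the tag labels (`DET`, `NOUN`, `VERB`, `CONJ`) are hypothetical and are not the tagset proposed in the article.

```python
from typing import List, Tuple


def kwic(tokens: List[str], keyword: str, window: int = 2) -> List[str]:
    """Corpus-driven style: match the raw word form and show its context."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


def by_tag(tagged: List[Tuple[str, str]], tag: str) -> List[str]:
    """Corpus-based style: select tokens by their annotation, not their form."""
    return [word for word, t in tagged if t == tag]


# Hypothetical toy data: a raw token list and the same tokens with tags.
raw = "the dogs run and the run ends".split()
tagged = [
    ("the", "DET"), ("dogs", "NOUN"), ("run", "VERB"), ("and", "CONJ"),
    ("the", "DET"), ("run", "NOUN"), ("ends", "VERB"),
]

kwic_lines = kwic(raw, "run")   # finds both uses of "run", verb and noun alike
verbs = by_tag(tagged, "VERB")  # finds only tokens annotated as verbs
```

In this toy case the raw search returns both occurrences of "run" and leaves the analyst to separate them, while the tag-based search can target one grammatical category directly.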

In this article, we propose a corpus-based approach and a tagset for use on a corpus of spoken language for the African languages of South Africa. Several linguistic phenomena, including fixed expressions, agglutination and morphemic merging, as well as spoken-language phenomena such as interrupted words, often affect tagging principles. These problematic phenomena are discussed and illustrated. The development of the tagset is based on the morphosyntactic properties of Xhosa, for reasons that are outlined in the article.

Manual tagging of a large corpus is a daunting and time-consuming task, and it is prone to various kinds of errors. This problem is addressed in two steps. First, a computer-based drag-and-drop tagger was developed to facilitate the manual tagging of a training corpus. This training corpus then serves as the input for developing an automatic tagger. The principles and procedures for the development of an automatic tagger for African languages are also discussed.
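How a manually tagged training corpus can feed an automatic tagger is sketched below. This is not the tagger described in the article: it is a most-frequent-tag baseline, a deliberately simple stand-in. The names `train` and `tag_tokens`, the toy English sentences, and the tag labels are all hypothetical.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Learn each word's most frequent tag from a manually tagged corpus."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
            all_tags[tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    default_tag = all_tags.most_common(1)[0][0]  # fallback for unseen words
    return lexicon, default_tag


def tag_tokens(tokens, lexicon, default_tag):
    """Tag new text by lookup, falling back to the overall most common tag."""
    return [(t, lexicon.get(t.lower(), default_tag)) for t in tokens]


# Hypothetical training corpus standing in for the manually tagged data.
training = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("chase", "VERB"), ("cats", "NOUN")],
]

lexicon, default_tag = train(training)
result = tag_tokens(["the", "cat", "barks", "loudly"], lexicon, default_tag)
```

A whole-word lookup of this kind is only a starting point. The abstract names agglutination and morphemic merging as complications, and these would presumably call for analysis below the word level.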
