Toward an effective Igbo part-of-speech tagger

Onyenwe, Ikechukwu E. and Hepple, Mark and Chinedu, Uchechukwu and Ezeani, Ignatius (2019) Toward an effective Igbo part-of-speech tagger. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 18 (4): 42. ISSN 2375-4699

Text (towards_effective_Igbo_pos_tagger)
towards_effective_Igbo_pos_tagger.pdf - Accepted Version
Available under License Creative Commons Attribution-NonCommercial.

Abstract

Part-of-speech (POS) tagging is a well-established technology for most Western European languages and a few other world languages, but it has not been evaluated on Igbo, an agglutinative African language. This article presents POS tagging experiments that use an Igbo corpus as a test bed for identifying the POS taggers and Machine Learning (ML) methods that can achieve good performance with the small dataset available for the language. The experiments use several well-known POS taggers developed for English or other European languages, along with different training data styles and sizes. Igbo has a number of language-specific characteristics that make effective POS tagging challenging. One interesting case is the wide use of verbs (and their nominalizations) that have an inherent noun complement; these form “linked pairs” in the POS tagging scheme but may appear discontinuously. Another issue is Igbo's highly productive agglutinative morphology, which can produce many variant word forms from a given root. This productivity is a key cause of the out-of-vocabulary (OOV) words observed during Igbo tagging. We report results of experiments on a promising direction for improving tagging performance on such morphologically inflected OOV words.
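
As a rough illustration of the OOV problem the abstract describes, the sketch below computes the share of test tokens whose surface form never occurs in the training data. It is not taken from the article: the function name `oov_rate` and the token lists are hypothetical, and the tokens are invented placeholders (a stem plus made-up suffixes), not real Igbo data.

```python
def oov_rate(train_tokens, test_tokens):
    """Return the fraction of test tokens whose word form is absent from training."""
    vocabulary = set(train_tokens)
    if not test_tokens:
        return 0.0
    unseen = sum(1 for token in test_tokens if token not in vocabulary)
    return unseen / len(test_tokens)


# Placeholder tokens (assumption, not real Igbo): one stem with invented suffixes.
# A productive morphology yields many such variants, so unseen forms are common.
train = ["stem", "stem-a", "other", "stem"]
test = ["stem", "stem-b", "stem-a-b", "other"]

rate = oov_rate(train, test)
print(f"OOV rate: {rate:.2f}")
```

In this toy setup, two of the four test forms share a stem with training words yet still count as OOV, which is the kind of case where morphological information could plausibly help a tagger.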

Item Type:
Journal Article
Journal or Publication Title:
ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)
Additional Information:
© ACM, 2019. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Transactions on Asian and Low-Resource Language Information Processing, 18, 4, 2019 http://doi.acm.org/10.1145/3314942
Subjects:
African language, corpora, corpus annotation, Igbo, language technology, machine learning, morphological analysis, natural language processing (NLP), part-of-speech (POS) tagging, POS tagger, tagset, text processing, general computer science
ID Code:
142310
Deposited By:
Deposited On:
30 Mar 2020 11:30
Refereed?:
Yes
Published?:
Published
Last Modified:
23 Nov 2024 01:37