MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition

Adelani, David Ifeoluwa and Neubig, Graham and Ruder, Sebastian and Rijhwani, Shruti and Beukman, Michael and Palen-Michel, Chester and Lignos, Constantine and Alabi, Jesujoba O. and Muhammad, Shamsuddeen Hassan and Nabende, Peter and Dione, Cheikh M. Bamba and Bukula, Andiswa and Mabuya, Rooweither and Dossou, Bonaventure F. P. and Sibanda, Blessing and Buzaaba, Happy and Mukiibi, Jonathan and Kalipe, Godson and Mbaye, Derguene and Taylor, Amelia and Kabore, Fatoumata Ouoba and Emezue, Chris Chinenye and Anuoluwapo, Aremu and Ogayo, Perez and Gitau, Catherine and Munkoh-Buabeng, Edwin and Koagne, Victoire Memdjokam and Tapo, Allahsera Auguste and Macucwa, Tebogo and Marivate, Vukosi and Mboning, Elvis and Gwadabe, Tajuddeen and Adewumi, Tosin P. and Ahia, Orevaoghene and Nakatumba-Nabende, Joyce and Mokono, Neo L. and Ezeani, Ignatius and Chukwuneke, Chiamaka and Adeyemi, Mofetoluwa and Hacheme, Gilles and Abdulmumin, Idris and Ogundepo, Odunayo and Yousuf, Oreen and Ngoli, Tatiana Moteu and Klakow, Dietrich (2022) MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition. arXiv, abs/22. ISSN 2331-8422

Full text not available from this repository.

Abstract

African languages are spoken by over a billion people, but are underrepresented in NLP research and development. The challenges impeding progress include the limited availability of annotated datasets, as well as a lack of understanding of the settings where current methods are effective. In this paper, we make progress towards solutions for these challenges, focusing on the task of named entity recognition (NER). We create the largest human-annotated NER dataset for 20 African languages, and we study the behavior of state-of-the-art cross-lingual transfer methods in an Africa-centric setting, demonstrating that the choice of source language significantly affects performance. We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points across 20 languages compared to using English. Our results highlight the need for benchmark datasets and models that cover typologically diverse African languages.

Item Type: Journal Article
Journal or Publication Title: arXiv
ID Code: 183640
Deposited By:
Deposited On: 17 Jan 2023 15:05
Refereed?: Yes
Published?: Published
Last Modified: 17 Jan 2023 15:05