MasakhaNEWS: News Topic Classification for African languages

Adelani, David Ifeoluwa and Chukwuneke, Chiamaka I. and Masiak, Marek and Azime, Israel Abebe and Alabi, Jesujoba Oluwadara and Tonja, Atnafu Lambebo and Mwase, Christine and Ogundepo, Odunayo and Dossou, Bonaventure F. P. and Oladipo, Akintunde and Nixdorf, Doreen and Emezue, Chris Chinenye and al-azzawi, Sana Sabah and Sibanda, Blessing K. and David, Davis and Ndolela, Lolwethu and Mukiibi, Jonathan and Ajayi, Tunde Oluwaseyi and Ngoli, Tatiana Moteu and Odhiambo, Brian and Mbonu, Chinedu E. and Owodunni, Abraham Toluwase and Obiefuna, Nnaemeka C. and Muhammad, Shamsuddeen Hassan and Abdullahi, Saheed Salahudeen and Yigezu, Mesay Gemeda and Gwadabe, Tajuddeen and Abdulmumin, Idris and Bame, Mahlet Taye and Awoyomi, Oluwabusayo Olufunke and Shode, Iyanuoluwa and Adelani, Tolulope Anu and Kailani, Habiba Abdulganiy and Omotayo, Abdul-Hakeem and Adeeko, Adetola and Abeeb, Afolabi and Aremu, Anuoluwapo and Samuel, Olanrewaju and Siro, Clemencia and Kimotho, Wangari and Ogbu, Onyekachi Raphael and Fanijo, Samuel and Ojo, Jessica and Awosan, Oyinkansola F. and Guge, Tadesse Kebede and Sari, Sakayo Toadoum and Nyatsine, Pamela and Sidume, Freedmore and Yousuf, Oreen and Oduwole, Mardiyyah and Kimanuka, Ussen and Tshinu, Kanda Patrick and Diko, Thina and Nxakama, Siyanda and Johar, Abdulmejid Tuni and Gebre, Sinodos and Mohamed, Muhidin and Mohamed, Shafie Abdi and Hassan, Fuad Mire and Mehamed, Moges Ahmed and Ngabire, Evrard and Stenetorp, Pontus (2023) MasakhaNEWS: News Topic Classification for African languages. Other. UNSPECIFIED.

Text (2304.09972v1)
2304.09972v1.pdf - Submitted Version
Available under License Creative Commons Attribution.

African languages are severely under-represented in NLP research due to a lack of datasets covering several NLP tasks. While there are individual language-specific datasets that are being expanded to different tasks, only a handful of NLP tasks (e.g. named entity recognition and machine translation) have standardized benchmark datasets covering several geographically and typologically diverse African languages. In this paper, we develop MasakhaNEWS -- a new benchmark dataset for news topic classification covering 16 languages widely spoken in Africa. We provide an evaluation of baseline models by training classical machine learning models and fine-tuning several language models. Furthermore, we explore several alternatives to full fine-tuning of language models that are better suited for zero-shot and few-shot learning, such as cross-lingual parameter-efficient fine-tuning (like MAD-X), pattern-exploiting training (PET), prompting language models (like ChatGPT), and prompt-free sentence-transformer fine-tuning (SetFit and the Cohere Embedding API). Our evaluation in the zero-shot setting shows the potential of prompting ChatGPT for news topic classification in low-resource African languages, achieving an average performance of 70 F1 points without leveraging additional supervision like MAD-X. In the few-shot setting, we show that with as few as 10 examples per label, we achieve more than 90% (i.e. 86.0 F1 points) of the performance of full supervised training (92.6 F1 points) by leveraging the PET approach.
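The zero-shot ChatGPT evaluation described above reduces topic classification to prompting an instruction-tuned model with the article and a fixed label set, then mapping the free-form response back to a label. A minimal sketch of that prompt-construction and answer-parsing step is below; the prompt wording and the `build_prompt`/`parse_label` helpers are illustrative assumptions, not the authors' exact prompt or code (the label names follow the MasakhaNEWS topic categories).

```python
# Illustrative sketch of zero-shot prompt-based news topic classification.
# The prompt template and helper names are assumptions for illustration.

LABELS = ["business", "entertainment", "health", "politics",
          "religion", "sports", "technology"]

def build_prompt(headline: str, text: str, labels=LABELS) -> str:
    """Build a single classification prompt for an instruction-tuned LM."""
    label_list = ", ".join(labels)
    return (
        f"Classify the following news article into one of these topics: {label_list}.\n\n"
        f"Headline: {headline}\n"
        f"Article: {text}\n\n"
        "Answer with exactly one topic label."
    )

def parse_label(response: str, labels=LABELS) -> str:
    """Map a free-form model response back to the closest known label."""
    lowered = response.lower()
    for label in labels:
        if label in lowered:
            return label
    return "unknown"  # fall back when the model answers off-list

# Example usage (the model call itself is omitted):
prompt = build_prompt("Super Eagles qualify for AFCON",
                      "Nigeria's national team secured qualification ...")
predicted = parse_label("The topic is Sports.")  # -> "sports"
```

In a real run, `prompt` would be sent to the model API and `parse_label` applied to the returned text; the paper's few-shot PET and SetFit approaches instead fine-tune on a handful of labeled examples per topic.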

Item Type:
Monograph (Other)
Additional Information:
Accepted to AfricaNLP Workshop @ICLR 2023 (non-archival)
ID Code:
Deposited By:
Deposited On:
12 Jun 2023 12:20
Last Modified:
12 Sep 2023 04:27