Wang, Jiayi and Adelani, David Ifeoluwa and Agrawal, Sweta and Rei, Ricardo and Briakou, Eleftheria and Carpuat, Marine and Masiak, Marek and He, Xuanli and Bourhim, Sofia and Bukula, Andiswa and Mohamed, Muhidin and Olatoye, Temitayo and Mokayede, Hamam and Mwase, Christine and Kimotho, Wangui and Yuehgoh, Foutse and Aremu, Anuoluwapo and Ojo, Jessica and Muhammad, Shamsuddeen Hassan and Osei, Salomey and Omotayo, Abdul-Hakeem and Chukwuneke, Chiamaka and Ogayo, Perez and Hourrane, Oumaima and Anigri, Salma El and Ndolela, Lolwethu and Mangwana, Thabiso and Mohamed, Shafie Abdi and Hassan, Ayinde and Awoyomi, Oluwabusayo Olufunke and Alkhaled, Lama and Al-Azzawi, Sana and Etori, Naome A. and Ochieng, Millicent and Siro, Clemencia and Njoroge, Samuel and Muchiri, Eric and Kimotho, Wangari and Momo, Lyse Naomi Wamba and Abolade, Daud and Ajao, Simbiat and Adewumi, Tosin and Shode, Iyanuoluwa and Macharm, Ricky and Iro, Ruqayya Nasir and Abdullahi, Saheed S. and Moore, Stephen E. and Opoku, Bernard and Akinjobi, Zainab and Afolabi, Abeeb and Obiefuna, Nnaemeka and Ogbu, Onyekachi Raphael and Brian, Sam and Otiende, Verrah Akinyi and Mbonu, Chinedu Emmanuel and Sari, Sakayo Toadoum and Stenetorp, Pontus (2023) AfriMTE and AfriCOMET: Empowering COMET to Embrace Under-resourced African Languages. Other. arXiv.
2311.09828v1.pdf - Published Version
Available under License Creative Commons Attribution.
Abstract
Despite the progress we have recorded in scaling multilingual machine translation (MT) models and evaluation data to several under-resourced African languages, it is difficult to measure accurately the progress we have made on these languages because evaluation is often performed with n-gram matching metrics like BLEU, which tend to correlate worse with human judgments. Embedding-based metrics such as COMET correlate better; however, the lack of evaluation data with human ratings for under-resourced languages, the complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and the limited language coverage of multilingual encoders have hampered their applicability to African languages. In this paper, we address these challenges by creating high-quality human evaluation data with a simplified MQM guideline for error-span annotation and direct assessment (DA) scoring for 13 typologically diverse African languages. Furthermore, we develop AfriCOMET, a COMET evaluation metric for African languages, by leveraging DA training data from high-resource languages and an African-centric multilingual encoder (AfroXLM-Roberta) to create the state-of-the-art evaluation metric for African-language MT with respect to Spearman-rank correlation with human judgments (+0.406).