Sinhala Encoder-only Language Models and Evaluation

Ranasinghe, Tharindu and Hettiarachchi, Hansi and Pathirana, Nadeesha Chathurangi Naradde Vidana and Premasiri, Damith and Uyangodage, Lasitha and Nanomi Arachchige, Isuri and Plum, Alistair and Rayson, Paul and Mitkov, Ruslan (2025) Sinhala Encoder-only Language Models and Evaluation. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vienna, Austria, pp. 8623-8636. ISBN 9798891762510

Full text not available from this repository.

Abstract

Recently, language models (LMs) have produced excellent results in many natural language processing (NLP) tasks. However, their effectiveness is highly dependent on available pre-training resources, which is particularly challenging for low-resource languages such as Sinhala. Furthermore, the scarcity of benchmarks to evaluate LMs is also a major concern for low-resource languages. In this paper, we address these two challenges for Sinhala by (i) collecting the largest monolingual corpus for Sinhala, (ii) training multiple LMs on this corpus and (iii) compiling the first Sinhala NLP benchmark (Sinhala-GLUE) and evaluating LMs on it. We show that the Sinhala LMs trained in this paper outperform both popular multilingual LMs, such as XLM-R, and existing Sinhala LMs in downstream NLP tasks. All the trained LMs are publicly available. We also make Sinhala-GLUE available as a public leaderboard, and we hope that it will enable further advancements in developing and evaluating LMs for Sinhala.

Item Type:
Contribution in Book/Report/Proceedings
ID Code:
231195
Deposited On:
09 Sep 2025 13:35
Refereed?:
Yes
Published?:
Published
Last Modified:
13 Sep 2025 11:46