NSina : A News Corpus for Sinhala

Hettiarachchi, Hansi and Dola Mullage, Damith and Uyangodage, Lasitha and Ranasinghe, Tharindu (2024) NSina : A News Corpus for Sinhala. In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). ELRA and ICCL, ITA, pp. 12307-12312. ISBN 9782493814104

Text (2024.lrec-main.1076)
2024.lrec-main.1076.pdf - Published Version
Available under License Creative Commons Attribution-NonCommercial.

Abstract

The introduction of large language models (LLMs) has advanced natural language processing (NLP), but their effectiveness depends largely on pre-training resources. This is especially evident in low-resource languages such as Sinhala, which face two primary challenges: a lack of substantial training data and limited benchmarking datasets. In response, this study introduces NSina, a comprehensive news corpus of over 500,000 articles from popular Sinhala news websites, along with three NLP tasks: news media identification, news category prediction, and news headline generation. The release of NSina aims to address the challenges of adapting LLMs to Sinhala, offering valuable resources and benchmarks for improving NLP in the Sinhala language. To date, NSina is the largest news corpus available for Sinhala.

Item Type: Contribution in Book/Report/Proceedings
Departments: Faculty of Science and Technology > School of Computing & Communications
ID Code: 221459
Deposited By: ep_importer_pure
Deposited On: 20 Nov 2024 14:35
Refereed?: Yes
Published?: Published
Last Modified: 27 Feb 2026 00:12
URI: https://eprints.lancs.ac.uk/id/eprint/221459
