Construction and annotation of a corpus of contemporary Nepali

Yadava, Yogendra and Hardie, Andrew and Lohani, Ram and Regmi, Bhim N. and Gurung, Srishtee and Gurung, Amar and McEnery, Tony and Allwood, Jens and Hall, Pat (2008) Construction and annotation of a corpus of contemporary Nepali. Corpora, 3 (2). pp. 213-225. ISSN 1749-5032

Full text not available from this repository.

Abstract

In this paper, we describe the construction of the 14-million-word Nepali National Corpus (NNC). This corpus includes both spoken and written data, the latter incorporating a Nepali match for FLOB and a broader collection of text. Additional resources within the NNC include parallel data (English–Nepali and Nepali–English) and a speech corpus. The NNC is encoded as Unicode text and marked up in CES-compatible XML. The whole corpus is also annotated with part-of-speech tags. We describe the process of devising a tagset and retraining tagger software for the Nepali language, for which there were no existing corpus resources. Finally, we explore some present and future applications of the corpus, including lexicography, NLP, and grammatical research.

Item Type:

Journal Article

Journal or Publication Title:

Corpora

Subjects:

General Psychology
Linguistics and Language
General Arts and Humanities
Language and Linguistics
Psychology (all)
Arts and Humanities (all)

Departments:

Faculty of Arts & Social Sciences > Linguistics & English Language
Faculty of Arts & Social Sciences

ID Code:

62715

Deposited By:

ep_importer_pure

Deposited On:

05 Mar 2013 14:17

Refereed?:

Yes

Published?:

Published

Last Modified:

30 Jun 2026 10:28

URI:

https://eprints.lancs.ac.uk/id/eprint/62715