Constructing corpora of South Asian languages.

Baker, Paul and Hardie, Andrew and McEnery, Tony and Jayaram, BD (2003) Constructing corpora of South Asian languages. In: Corpus Linguistics 2003, 2003-03-01.

Abstract

The EMILLE Project (Enabling Minority Language Engineering) was established to construct a 67-million-word corpus of South Asian languages. In addition, the project has had to address a number of issues related to establishing a language engineering (LE) environment for South Asian language processing, such as converting 8-bit language data into Unicode and producing a number of basic LE tools. This paper focuses on the corpus construction undertaken on the project and outlines the rationale behind data collection. In doing so, it highlights a number of issues for South Asian corpus building.

Item Type:
Contribution to Conference (Paper)
Journal or Publication Title:
Corpus Linguistics 2003
Uncontrolled Keywords:
/dk/atira/pure/researchoutput/libraryofcongress/p1
Subjects:
corpus; south asian languages; emille; encoding; unicode; annotation; corpus building; p philology. linguistics
ID Code:
104
Deposited By:
Deposited On:
16 Dec 2005
Refereed?:
No
Published?:
Published
Last Modified:
18 Dec 2023 02:23