EMILLE, A 67-million word corpus of Indic languages: Data collection, mark-up and harmonisation

Baker, Paul and Hardie, Andrew and McEnery, Tony and Cunningham, Hamish and Gaizauskas, Rob (2002) EMILLE, A 67-million word corpus of Indic languages: Data collection, mark-up and harmonisation. In: 3rd International Conference on Language Resources and Evaluation, LREC 2002, 2002-05-29 - 2002-05-31.

Full text not available from this repository.

Abstract

The paper describes developments to date on the EMILLE Project (Enabling Minority Language Engineering) being carried out at the Universities of Lancaster and Sheffield. EMILLE was established to construct a 67-million-word corpus of South Asian languages. In addition to undertaking this corpus construction, the project has had to address a number of related issues in the context of establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. The development of tools on EMILLE has contributed to the ongoing development of the LE architecture GATE.

Item Type:

Contribution to Conference (Paper)

Journal or Publication Title:

3rd International Conference on Language Resources and Evaluation, LREC 2002

Uncontrolled Keywords:

/dk/atira/pure/subjectarea/asjc/3300/3310

Subjects:

linguistics and language; language and linguistics; education; library and information sciences

Departments:

Faculty of Science and Technology > Lancaster Environment Centre

ID Code:

134896

Deposited By:

ep_importer_pure

Deposited On:

22 Jun 2019 09:46

Refereed?:

Yes

Published?:

Published

Last Modified:

10 Dec 2025 13:13

URI:

https://eprints.lancs.ac.uk/id/eprint/134896