Lancaster EPrints

Constructing corpora of South Asian languages.

Baker, Paul and Hardie, Andrew and McEnery, Tony and Jayaram, BD (2003) Constructing corpora of South Asian languages. In: Corpus Linguistics 2003, 2003-03-01, Lancaster.

    Abstract

    The EMILLE Project (Enabling Minority Language Engineering) was established to construct a 67-million-word corpus of South Asian languages. In addition, the project has had to address a number of issues related to establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. This paper focuses on the corpus construction undertaken on the project and outlines the rationale behind data collection. In doing so, it highlights a number of issues for South Asian corpus building.

    Item Type: Conference or Workshop Item (Paper)
    Journal or Publication Title: Corpus Linguistics 2003
    Uncontrolled Keywords: corpus ; South Asian languages ; EMILLE ; encoding ; Unicode ; annotation ; corpus building
    Subjects: P Language and Literature > P Philology. Linguistics
    Departments: Faculty of Arts & Social Sciences > Linguistics & English Language
    ID Code: 104
    Deposited By: Dr Andrew Hardie
    Deposited On: 16 Dec 2005
    Refereed?: No
    Published?: Published
    Last Modified: 05 Mar 2013 09:02
    URI: http://eprints.lancs.ac.uk/id/eprint/104
