Lancaster EPrints

The computational analysis of morphosyntactic categories in Urdu.

Hardie, Andrew (2004) The computational analysis of morphosyntactic categories in Urdu. PhD thesis, UNSPECIFIED.

PDF (hardie_thesis_2004.zip, 2794Kb)

    Abstract

    Urdu is a language of the Indo-Aryan family, widely spoken in India and Pakistan, and an important minority language in Europe, North America, and elsewhere. This thesis describes the development of a computer-based system for part-of-speech tagging of Urdu texts, consisting of a tagset, a set of guidelines for manual tagging or post-editing, and the tagger itself. The tagset is defined in accordance with a set of design principles derived from a survey of good practice in the field of tagset design, including compliance with the EAGLES guidelines on morphosyntactic annotation. These guidelines are shown to be extensible to languages, such as Urdu, that are closely related to those languages for which the guidelines were originally devised. The description of Urdu grammar given by Schmidt (1999) is used as a model of the language for the purpose of tagset design. Manual tagging is undertaken using this tagset; this process yields both a set of tagging guidelines and a set of manually tagged texts to serve as training data. A rule-based methodology is used to perform tagging in Urdu, and the justification for this choice is discussed. A suite of programs that function together within the Unitag architecture is described. In addition to a tokeniser, this system includes an analyser (Urdutag) based on lexical look-up and word-form analysis, and a disambiguator (Unirule) which removes contextually inappropriate tags using a set of 274 rules. While the system's final performance is not particularly impressive, this is largely due to a paucity of training data leading to a small lexicon, rather than any substantial flaw in the system.
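
    The two-stage design described in the abstract (an analyser that proposes candidate tags, followed by a disambiguator that removes contextually inappropriate ones) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy English lexicon, the tag names, the suffix heuristic, and the single rule are hypothetical and are not taken from Urdutag, Unirule, or the thesis tagset.

```python
from typing import Dict, List, Set

# Hypothetical toy lexicon (English, for readability): word form -> candidate tags.
# The real system uses an Urdu lexicon and a much richer tagset.
LEXICON: Dict[str, Set[str]] = {
    "the": {"DET"},
    "can": {"NOUN", "VERB", "AUX"},
    "rusts": {"VERB", "NOUN"},
}


def analyse(token: str) -> Set[str]:
    """Propose candidate tags by lexical look-up, then by word form."""
    key = token.lower()
    if key in LEXICON:
        return set(LEXICON[key])
    # Illustrative word-form fallback for unknown words (an assumption).
    if key.endswith("s"):
        return {"NOUN", "VERB"}
    return {"NOUN"}


def no_verb_after_determiner(prev_tags: Set[str], tags: Set[str]) -> Set[str]:
    """Hypothetical rule: drop verbal tags directly after an unambiguous DET."""
    if prev_tags == {"DET"}:
        reduced = tags - {"VERB", "AUX"}
        # Never remove the last remaining candidate tag.
        return reduced or tags
    return tags


def tag(tokens: List[str]) -> List[Set[str]]:
    """Analyse every token, then apply the rule left to right."""
    candidates = [analyse(t) for t in tokens]
    for i in range(1, len(candidates)):
        candidates[i] = no_verb_after_determiner(candidates[i - 1], candidates[i])
    return candidates


if __name__ == "__main__":
    for word, tags in zip(["the", "can", "rusts"], tag(["the", "can", "rusts"])):
        print(word, sorted(tags))
```

    In this sketch a rule only ever removes candidate tags and never removes the last one, so a token may remain ambiguous when no rule applies to it. A full disambiguator would apply many such rules (the thesis reports 274).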

    Item Type: Thesis (PhD)
    Uncontrolled Keywords: part-of-speech tagging ; morphosyntactic tagging ; Urdu ; Unicode ; rule-based tagging ; disambiguation ; EAGLES guidelines ; tagset ; lexicon
    Subjects: P Language and Literature > P Philology. Linguistics
    Departments: Faculty of Arts & Social Sciences > Linguistics & English Language
    ID Code: 106
    Deposited By: Dr Andrew Hardie
    Deposited On: 16 Dec 2005
    Refereed?: No
    Published?: Unpublished
    Last Modified: 05 Mar 2013 08:58
    URI: http://eprints.lancs.ac.uk/id/eprint/106
