Lancaster EPrints

Automatic standardisation of texts containing spelling variation: How much training data do you need?

Baron, Alistair and Rayson, Paul (2009) Automatic standardisation of texts containing spelling variation: How much training data do you need? In: Proceedings of the Corpus Linguistics Conference. Lancaster University, Lancaster.

    Abstract

    Large quantities of spelling variation in corpora, such as that found in Early Modern English, can cause significant problems for corpus linguistic tools and methods. Having texts with standardised spelling is key to making such tools and methods accurate and meaningful in their analysis. Gaining access to such versions of texts can be problematic, however, and manual standardisation of the texts is often too time-consuming to be feasible. Our solution is a piece of software named VARD 2, which can be used to manually and automatically standardise spelling variation in individual texts or in corpora of any size. This paper evaluates VARD 2’s performance on a corpus of Early Modern English letters and a corpus of children’s written English. The software’s ability to learn from manual standardisation is put under particular scrutiny as we examine what effect different levels of training have on its performance.

    Item Type: Contribution in Book/Report/Proceedings
    Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
    Departments: Faculty of Science and Technology > School of Computing & Communications
    ID Code: 42529
    Deposited By: ep_importer_comp
    Deposited On: 13 Aug 2010 18:01
    Refereed?: No
    Published?: Published
    Last Modified: 25 Mar 2014 23:26
    URI: http://eprints.lancs.ac.uk/id/eprint/42529
