Normalising the corpus of English dialogues (1560-1760) using VARD2 : decisions and justifications

Archer, Dawn and Kytö, Merja and Baron, Alistair and Rayson, Paul (2014) Normalising the corpus of English dialogues (1560-1760) using VARD2 : decisions and justifications. In: 35th ICAME conference, 2014-04-30 - 2014-05-04, University of Nottingham.

Full text not available from this repository.

Abstract

The development of (semi-)automatic tools such as VARD (Baron and Rayson, 2008) has afforded compilers of historical corpora the opportunity to normalise variant spellings relatively quickly – following, that is, a dedicated period of manual training using relevant corpus samples (see, e.g., Lehto et al. 2010). In the case of VARD2, this period of manual training involves the user: (i) reading a given text via the VARD interface; (ii) distinguishing variants within the text, either with the help of the tool’s recommended list of (ranked) candidate replacements or personally, by highlighting variant forms manually; (iii) choosing the most appropriate normalised form for each variant found, guided where relevant by VARD’s known variant list or f-score calculation (derived from, e.g., letter replacement rules, edit distance measures and/or phonetic matching algorithms); and (iv) replacing the variant with the normalised form, but in such a way that the original spelling is retained in an XML tag (Baron and Rayson, 2008). The corpus-linguistic argument for normalisation is that it helps improve automated techniques such as part-of-speech and keyword analysis, thereby allowing existing linguistic tools to be used unmodified (see, e.g., Archer et al. 2003; Rayson et al. 2007a/b; Rayson et al. 2009; Hiltunen and Tyrkkö 2013). But such normalisation needs to be handled sensitively, so that, for example, we can maintain within the text the original spelling of those forms which convey important morphosyntactic or orthographic information (as opposed to retaining these original spellings as part of the XML tag; see (iv)). Hence the inclusion of an IGNORE VARIANT facility within VARD. In this paper, we outline some of the decisions we have made, in respect to the Corpus of English Dialogues (CED), when determining which features required normalisation and which should be left as they were originally (and why).
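Steps (ii) to (iv) and the ignore facility can be illustrated with a minimal Python sketch. It is not VARD's implementation: the word lists are invented toy data, the ranking uses the standard library's `difflib` similarity as a stand-in for VARD's own scoring, and the `normalised` tag with its `orig` attribute is an assumed format chosen only to show the original spelling being retained.

```python
import difflib
import re
from xml.sax.saxutils import escape, quoteattr

# Toy data, invented for illustration only (not taken from VARD or the CED).
MODERN_LEXICON = {"virtue", "love", "the", "of", "is", "great", "said"}
KNOWN_VARIANTS = {"vertue": "virtue", "loue": "love"}  # variant -> normalised form
IGNORED = {"thou", "hast"}  # forms deliberately left as they are in the text


def rank_candidates(word, n=3):
    """Return candidate normalised forms for a word, best first."""
    lower = word.lower()
    if lower in KNOWN_VARIANTS:
        return [KNOWN_VARIANTS[lower]]
    # Stand-in for edit-distance style matching: difflib similarity ratio.
    return difflib.get_close_matches(lower, sorted(MODERN_LEXICON), n=n, cutoff=0.6)


def normalise(text):
    """Replace variants, keeping each original spelling in an XML attribute."""
    def replace(match):
        word = match.group(0)
        lower = word.lower()
        if lower in IGNORED or lower in MODERN_LEXICON:
            return word  # ignored on purpose, or already a modern form
        candidates = rank_candidates(word)
        if not candidates:
            return word  # no plausible replacement: leave for manual review
        best = candidates[0]
        if word[0].isupper():
            best = best.capitalize()
        # Hypothetical tag format; the original spelling survives in `orig`.
        return f"<normalised orig={quoteattr(word)}>{escape(best)}</normalised>"

    return re.sub(r"[A-Za-z]+", replace, text)


print(normalise("Thou hast said vertue is great"))
```

In this sketch, a form placed in `IGNORED` stays in the running text itself, whereas a normalised form keeps its original spelling only inside the tag; that is the distinction the abstract draws between ignoring a variant and normalising it.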
Compiled by Merja Kytö and Jonathan Culpeper, the CED covers a 200-year period (1560-1760) and contains speech-related texts representative of five genres – the courtroom, witness proceedings, comedy dramas, prose fiction and handbooks – as well as a group of texts subsumed under a miscellaneous category. In particular, we will discuss our treatment of: names; the genitive construction; auxiliaries and verbs; (open-hyphenated-closed) compounds; abbreviations; graphemes such as the tilde; terms which are now archaic, obsolete or rare; foreign terms; dialect terms; and personal pronouns. This work, although focussed on the CED, also has a wider aim: determining the feasibility of developing normalisation guidelines that are generalisable to other historical corpora such as ARCHER (A Representative Corpus of Historical English Registers) and EEBO (Early English Books Online). Hence, as part of the presentation, we will compare the normalisation decisions made in respect to the CED with those made in respect to the Early Modern English Medical Texts (see Lehto et al. 2010).

REFERENCES

Archer, D., McEnery, A. M., Rayson, P. and Hardie, A. (2003). Developing an automated semantic analysis system for Early Modern English. In D. Archer, P. Rayson, A. Wilson and A. M. McEnery (eds.) Proceedings of the Corpus Linguistics Conference 2003. Lancaster: University of Lancaster. 22–31.

Baron, A. and Rayson, P. (2009). Automatic standardization of texts containing spelling variation, how much training data do you need? In M. Mahlberg, V. González-Díaz and C. Smith (eds.) Proceedings of the Corpus Linguistics Conference, CL2009, University of Liverpool, UK, 20–23 July 2009. See http://ucrel.lancs.ac.uk/publications/cl2009/314_FullPaper.pdf

Baron, A., Rayson, P. and Archer, D. (2009). Word frequency and key word statistics in historical corpus linguistics. Anglistik: International Journal of English Studies, 20 (1), pp. 41–67.

Baron, A. and Rayson, P. (2008). VARD 2: A tool for dealing with spelling variation in historical corpora. Proceedings of the Postgraduate Conference in Corpus Linguistics, Aston University, Birmingham, UK, 22 May 2008. http://eprints.lancs.ac.uk/41666/1/BaronRaysonAston2008.pdf

A Corpus of English Dialogues 1560-1760. (2006). Compiled under the supervision of Merja Kytö (Uppsala University) and Jonathan Culpeper (Lancaster University).

Hiltunen, T. and Tyrkkö, J. (2013). Tagging Early Modern English Medical Texts. Corpus Analysis with Noise in the Signal (CANS) 2013. Lancaster University. See http://ucrel.lancs.ac.uk/cans2013/

Lehto, A., Baron, A., Ratia, M. and Rayson, P. (2010). Improving the precision of corpus methods: The standardized version of Early Modern English Medical Texts. In I. Taavitsainen and P. Pahta (eds.) Early Modern English Medical Texts: Corpus Description and Studies. Amsterdam: John Benjamins. 279–290.

Rayson, P., Archer, D., Baron, A. and Smith, N. (2007a). Tagging historical corpora – the problem of spelling variation. In Proceedings of Digital Historical Corpora, Dagstuhl-Seminar 06491, International Conference and Research Center for Computer Science, Schloss Dagstuhl, Wadern, Germany, 3rd-8th December 2006. ISSN 1862-4405. http://www.comp.lancs.ac.uk/~paul/publications/rabs_extAbs_dagstuhl06.pdf

Rayson, P., Archer, D., Baron, A., Culpeper, J. and Smith, N. (2007b). Tagging the Bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora. In: Proceedings of the Corpus Linguistics Conference 2007. Birmingham: University of Birmingham. http://comp.eprints.lancs.ac.uk/1528/1/192_Paper.pdf

Item Type:

Contribution to Conference (Paper)

Journal or Publication Title:

35th ICAME conference

Departments:

Faculty of Science and Technology > School of Computing & Communications

ID Code:

72803

Deposited By:

ep_importer_pure

Deposited On:

30 Jan 2015 12:46