Lancaster EPrints

Part-of-speech ratios in English corpora.

Hardie, Andrew (2007) Part-of-speech ratios in English corpora. International Journal of Corpus Linguistics, 12 (1). pp. 55-81. ISSN 1384-6655

Full text not available from this repository.

Abstract

Using part-of-speech (POS) tagged corpora, Hudson (1994) reports that approximately 37% of English tokens are nouns, where 'noun' is a superordinate category including nouns, pronouns and other word-classes. It is argued here that difficulties relating to the boundaries of Hudson's 'noun' category demonstrate that there is no uncontroversial way to derive such a superordinate category from POS tagging. Decisions regarding the boundary of the 'noun' category have small but statistically significant effects on the ratio that emerges for 'nouns' as a whole. Tokenisation and categorisation differences between tagging schemes make it problematic to compare the ratio of 'nouns' across different tagsets. The precise figures for POS ratios are therefore effectively artefacts of the tagset. However, these objections to the use of POS ratios do not apply to their use as a metric of variation for comparing data-sets tagged with the same tagging scheme.
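
The dependence of the ratio on the category boundary can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tag labels are simplified and hypothetical rather than drawn from any real tagset, the twelve-token sample is invented, and the two category definitions are not those used by Hudson (1994) or in the article. It only shows that the same tagged tokens yield different 'noun' ratios once the membership of the superordinate category changes.

```python
from collections import Counter

# Toy (word, tag) pairs. The tags are simplified, hypothetical labels,
# not those of any real tagging scheme.
tagged = [
    ("she", "PRON"), ("read", "VERB"), ("the", "DET"), ("long", "ADJ"),
    ("report", "NOUN"), ("in", "PREP"), ("London", "PROPN"), ("and", "CONJ"),
    ("it", "PRON"), ("surprised", "VERB"), ("two", "NUM"), ("colleagues", "NOUN"),
]

def category_ratio(tokens, member_tags):
    """Return the share of tokens whose tag is in the superordinate category."""
    counts = Counter(tag for _, tag in tokens)
    total = sum(counts.values())
    return sum(counts[tag] for tag in member_tags) / total

# Two illustrative boundaries for a superordinate 'noun' category (assumed).
narrow = {"NOUN", "PROPN"}
broad = narrow | {"PRON", "NUM"}

narrow_ratio = category_ratio(tagged, narrow)
broad_ratio = category_ratio(tagged, broad)
print(f"narrow: {narrow_ratio:.3f}, broad: {broad_ratio:.3f}")
```

On this toy sample the narrow definition gives 0.250 and the broad one 0.500. The gap is exaggerated by the tiny sample; the abstract describes the effect in real corpora as small but statistically significant.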

Item Type: Article
Journal or Publication Title: International Journal of Corpus Linguistics
Subjects: P Language and Literature > P Philology. Linguistics
Departments: Faculty of Arts & Social Sciences > Linguistics & English Language
ID Code: 1074
Deposited By: Dr Andrew Hardie
Deposited On: 30 Jan 2008 13:50
Refereed?: Yes
Published?: Published
Last Modified: 18 Sep 2013 15:57
URI: http://eprints.lancs.ac.uk/id/eprint/1074
