Understanding corpus text prototypicality: A multifaceted problem

Anthony, Laurence and Smith, Nicholas and Hoffmann, Sebastian and Rayson, Paul (2023) Understanding corpus text prototypicality: A multifaceted problem. In: International Computer Archive of Modern and Medieval English (ICAME 44), 2023-05-17 - 2023-05-21, North-West University and Emerald Resort.

Text (icame_44_anthony_et_al)
icame_44_anthony_et_al.pdf - Published Version
Available under License Creative Commons Attribution.

Abstract

Prototypicality is a complex, multifaceted concept relating to the centrality and typicality of examples in a category. While prominent in cognitive psychology and linguistics, it is often overlooked in corpus studies. Corpora are ideally built to be representative of a target domain or language variety. To achieve this goal, corpus builders need to identify an accurate sampling frame and collect relevant texts that capture the diversity of language in and across the sampling categories. In practice, however, corpora are built within the limitations of text availability, time, and human resources, leading to questions about the suitability/prototypicality of individual texts in a corpus and their effect on the representativeness of the corpus as a whole. Prototypicality also comes into play at the analysis stage. Most corpus analysis approaches, including concordance and keyword analysis, use the corpus as a whole as the unit of analysis. To validate findings, a necessary but often omitted step is the close reading of individual texts. Here, a significant challenge is identifying which texts to read. A researcher may decide to choose texts at random, but it is an open question whether such texts are representative/prototypical of the corpus. Prototypicality also comes into play when corpora are used for pedagogic purposes, such as Data-Driven Learning (DDL). In these situations, there is often an implicit conflation of two facets of prototypicality, namely frequency of use and closeness to an ideal, particularly in the case of expert writing. In this paper, we first outline the multifaceted character of corpus text prototypicality. Next, we describe experiments that attempt to rank the prototypicality of individual corpus texts at different linguistic levels as a guide to choosing texts for close reading or excluding texts from a corpus at the data collection stage.
Results using a modified version of the ProtAnt tool (Anthony and Baker, 2015) show prototypicality rankings can be dramatically affected by the linguistic level of analysis applied. Standard keywords effectively rank the prototypicality of texts in terms of topic, but the results can be enhanced using key semantic tags. On the other hand, key part-of-speech (POS) tags allow for a more nuanced view of text prototypicality centered on stylistics. The results also reveal the limitations of current corpus software tools and offer suggestions for how new tools might be developed to increase our understanding of prototypicality at the textual level.
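
As a rough illustration of keyword-based prototypicality ranking, the following Python sketch finds words that are unusually frequent in a target corpus relative to a reference corpus and then ranks each target text by how densely it contains those words. It is not the ProtAnt implementation or the modified version used in the paper. The function names, the whitespace tokenizer, the log-likelihood threshold of 3.84, and the scoring rule (the proportion of a text's word types that are keywords) are all assumptions made for this example. The same idea could in principle be applied to semantic or POS tags by replacing the word tokens with tag tokens.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness of one item.

    a, b: frequency of the item in the target and reference corpus.
    c, d: total token counts of the target and reference corpus.
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in the target corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def rank_texts(target_texts, reference_texts, threshold=3.84):
    """Rank target texts by keyword density (hypothetical scoring rule).

    target_texts: dict mapping a text name to its raw text.
    reference_texts: non-empty list of raw reference texts.
    Returns (text names from most to least prototypical, keyword set).
    """
    # Naive tokenization (assumption): lowercase and split on whitespace.
    tokens = {name: text.lower().split() for name, text in target_texts.items()}
    target_counts = Counter(t for toks in tokens.values() for t in toks)
    ref_counts = Counter(t for text in reference_texts for t in text.lower().split())
    c = sum(target_counts.values())
    d = sum(ref_counts.values())

    # Keep positive keywords only: relatively more frequent in the target
    # corpus and above the keyness threshold.
    keywords = set()
    for word, a in target_counts.items():
        b = ref_counts.get(word, 0)
        if a / c > b / d and log_likelihood(a, b, c, d) >= threshold:
            keywords.add(word)

    # Score each text by the proportion of its word types that are keywords.
    scores = {}
    for name, toks in tokens.items():
        types = set(toks)
        scores[name] = len(types & keywords) / len(types) if types else 0.0

    # sorted() is stable, so tied texts keep their input order.
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, keywords
```

Texts at the top of the returned ranking would be candidates for close reading, and texts at the bottom would be candidates for review or exclusion at the data collection stage. Scoring by keyword types rather than keyword tokens is one design choice among several; it keeps a single heavily repeated keyword from dominating a text's score.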

Item Type:

Contribution to Conference (Paper)

Journal or Publication Title:

International Computer Archive of Modern and Medieval English (ICAME 44)

Departments:

Faculty of Science and Technology > School of Computing & Communications

ID Code:

194282

Deposited By:

ep_importer_pure

Deposited On:

31 May 2023 15:30

Refereed?:

Yes

Published?:

Published

Last Modified:

09 Apr 2026 23:14

URI:

https://eprints.lancs.ac.uk/id/eprint/194282