Extending the key semantic domains method beyond English corpora : Wmatrix version 5

Rayson, Paul (2021) Extending the key semantic domains method beyond English corpora : Wmatrix version 5. In: Corpus Linguistics 2021, 2021-07-13 - 2021-07-16, University of Limerick.

Text (Handout)
Paper_124_Handout_Paul_Rayson.pdf - Published Version
Available under License Creative Commons Attribution-NonCommercial-ShareAlike.

Text (Poster)
Paper_124_Paul_Rayson.pdf - Published Version
Available under License Creative Commons Attribution-NonCommercial-ShareAlike.

Abstract

The key semantic domains method (Rayson, 2008), implemented in Wmatrix (versions 1 to 4), extends the keywords approach, which has been widely applied in corpus linguistics research. The method facilitates the discovery of concepts and groups of words, collected within semantic fields, that are unusually frequent or infrequent compared to a reference corpus, and it can exploit significance and effect size measures in the same way as the keywords approach. Key semantic domains have proved useful in a number of different areas of linguistic research: literary characterisation (Balossi, 2014), language of psychopaths (Hancock et al., 2013), corpus-assisted discourse analysis of social work writing (Leedham et al., 2020), enhancing critical thinking in higher education (O’Halloran, 2020), and the construction of newsworthiness (Potts et al., 2015). However, one important drawback is that the method is currently restricted to English, because Wmatrix relies on the CLAWS Part-of-Speech (POS) tagger (Garside and Smith, 1997) and the UCREL Semantic Analysis System (USAS) for English (Rayson et al., 2004). In recent years, semantic taggers for other languages have been developed (Piao et al., 2015; Piao et al., 2016). These taggers use freely available POS taggers and lemmatisers for the new languages and adapt a variety of methods: bilingual dictionaries, parallel aligned corpora, machine translation, and crowdsourcing to bootstrap the development of new semantic lexicons, and vector-based pre-trained embeddings and machine learning methods to improve contextual disambiguation (Ezeani et al., 2019). A beta version of the Spanish semantic tagger was previously incorporated into Wmatrix4. This poster will describe how the semantic taggers for further languages are being incorporated into Wmatrix5.
Crucially, community crowdsourcing needs to be supported so that the new semantic lexicons, which are at varying stages of development, can be extended and checked to improve their coverage and accuracy.
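
As an illustration of the comparison the abstract describes, the following is a minimal sketch of how semantic tag frequencies in a study corpus might be compared against a reference corpus. It assumes log-likelihood as the significance measure and Log Ratio as the effect size measure, which are common choices in keyness analysis; the abstract does not say which measures are used, and this is not the Wmatrix implementation. The function names, the zero-count smoothing value, and the 6.63 cut-off (the chi-square critical value for p < 0.01 with one degree of freedom) are assumptions made for the example.

```python
import math


def log_likelihood(a, b, c, d):
    """Log-likelihood (G2) for one item.

    a, b: frequency of the semantic tag in the study and reference corpus.
    c, d: total number of tagged tokens in the study and reference corpus.
    """
    expected_study = c * (a + b) / (c + d)
    expected_ref = d * (a + b) / (c + d)
    total = 0.0
    # Zero counts contribute nothing to the sum (0 * log 0 is taken as 0).
    if a > 0:
        total += a * math.log(a / expected_study)
    if b > 0:
        total += b * math.log(b / expected_ref)
    return 2 * total


def log_ratio(a, b, c, d, smoothing=0.5):
    """Binary log of the ratio of relative frequencies (effect size).

    A zero count is replaced by `smoothing` (an assumed value) so that
    the ratio stays defined.
    """
    a_adj = a if a > 0 else smoothing
    b_adj = b if b > 0 else smoothing
    return math.log2((a_adj / c) / (b_adj / d))


def key_domains(study, reference, threshold=6.63):
    """Return (tag, log-likelihood, log ratio) rows above the threshold.

    `study` and `reference` map semantic tags to frequencies. Rows are
    sorted by log-likelihood, highest first. A positive log ratio means
    the tag is overused in the study corpus, a negative one underused.
    """
    c = sum(study.values())
    d = sum(reference.values())
    rows = []
    for tag in set(study) | set(reference):
        a = study.get(tag, 0)
        b = reference.get(tag, 0)
        ll = log_likelihood(a, b, c, d)
        if ll >= threshold:
            rows.append((tag, ll, log_ratio(a, b, c, d)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical tag labels and invented counts, for illustration only.
study_counts = {"TAG_X": 100, "TAG_Y": 900}
reference_counts = {"TAG_X": 10, "TAG_Y": 990}
for tag, ll, lr in key_domains(study_counts, reference_counts):
    print(f"{tag}\tLL={ll:.2f}\tLogRatio={lr:.2f}")
```

Because the calculation operates only on tag frequencies, the same sketch would apply to any language for which a semantic tagger produces tags from a shared tagset.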

Item Type: Contribution to Conference (Poster)
Journal or Publication Title: Corpus Linguistics 2021
Departments: Faculty of Science and Technology > School of Computing & Communications
ID Code: 156972
Deposited By: ep_importer_pure
Deposited On: 13 Jul 2021 08:40
Refereed?: Yes
Published?: Published
Last Modified: 28 Mar 2026 00:15
URI: https://eprints.lancs.ac.uk/id/eprint/156972