Gillings, Mathew and Hollmann, Willem and Warmelink, Lara (2021) A corpus-based investigation into verbal cues to deception and their sociolinguistic distribution. PhD thesis, Lancaster University.
2020gillingsphd.pdf - Published Version
Restricted to Repository staff only until 18 May 2026.
Available under License Creative Commons Attribution.
Abstract
Research from psychology has shown that there are significant differences between the language of a truth teller and a liar, and investigating these differences in more detail may point towards cues to deception. However, the sociolinguistic nature of these cues has never been fully investigated. After collecting data from an experimental setting, this thesis uses corpus-based methods to investigate how cues vary across social groups. The thesis starts with an overview of the verbal deception detection literature, before describing a data collection experiment that included four interviews: a truthful and a deceptive description of each of two scenarios. The first scenario (based on the work of Porter and Yuille, 1996) was designed to mimic a police interrogation. The second scenario (based on the work of Warmelink et al., 2012) was designed as a mock border control interview. The transcriptions of these interviews formed the 125,473-word corpus, with truthful language accounting for 60,823 words, and deceptive language accounting for 64,650 words. The dataset was tagged according to part-of-speech and semantic category, and uploaded to CQPweb (Hardie, 2012). Before carrying out an analysis of this dataset, an operationalisation exercise took place to map each previously identified cue to deception onto the appropriate part-of-speech or semantic category for analysis using CQPweb. This was a theory-driven exercise, with each cue often consisting of several tags. In total, 8 cues were investigated: text length, pronoun usage (I/me and you), negation, filled pauses, exclusivisers, hedging, temporal information, and motion words. The analysis itself consisted of three parts: the first two were quantitative, and the third was qualitative in nature.
The two quantitative analyses were carried out using CQPweb (Hardie, 2012) and LIWC (Pennebaker et al., 2015) respectively; frequencies were gathered using each tool, and a mixed-effects model was used to identify significant differences across the data and consider sociolinguistic differences within it. The third analysis took a qualitative approach: traditional corpus-based processes (frequency breakdown, distribution, collocation analysis, and concordance analysis) were used to look further into the data and determine what was actually being measured by those particular tags and cues. This was therefore a contrastive analysis to investigate differences between the two tools, followed by a corpus-based analysis to investigate the results within context. The main CQPweb analysis identified 16 significant results. There were no significant differences for any of the cues across veracity conditions alone, but there were 5 significant interactions between veracity and another variable: negation (twice), exclusivisers, hedging, and temporal information. The remaining 11 results identified by the model were differences across a single variable: gender, age, socioeconomic status, or region. The cues involved were pronouns I/me, pronoun you, negation, filled pauses, exclusivisers, motion words, and text length. Overall, the analysis of this corpus suggests that there are indeed differences between truthful and deceptive language, but only when veracity interacts with other linguistic or sociolinguistic variables. It also suggests that sociolinguistic differences are much more frequent than differences across veracity conditions. On a qualitative level, linguistic context appears to be key to our understanding, and researchers working within deception detection must take this into account rather than relying on word-level findings.