Processing internet-derived text - creating a corpus of usenet messages.

Hoffmann, S. (2007) Processing internet-derived text - creating a corpus of usenet messages. Literary and Linguistic Computing, 22 (2). pp. 35-55. ISSN 1477-4615

Full text not available from this repository.

Abstract

In recent years, linguists have become increasingly interested in the language of the Internet, both as an object of investigation and as a source of authentic data to complement traditional electronic corpora. However, Internet-derived data is typically very messy, and a conversion process is often required before researchers can carry out a reliable quantitative investigation of the observed patterns with standard corpus tools. In this article, I discuss the technical and methodological aspects involved in creating a large corpus of asynchronous computer-mediated communication by downloading and post-processing hundreds of thousands of messages posted in twelve Usenet newsgroups. After describing how messages can be arranged into hierarchically structured discussion threads, I focus at some length on the strategies that are required to correctly assign authorship to the different textual elements in individual messages. My algorithms have a success rate of well over 90% for most newsgroups, and the resulting corpus can thus serve as a suitable basis for an investigation into the interactive strategies employed in this particular type of written communication.

Item Type:

Journal Article

Journal or Publication Title:

Literary and Linguistic Computing

Additional Information:

RAE_import_type: Journal article; RAE_uoa_type: Linguistics

Uncontrolled Keywords:

Linguistics and Language (ASJC 3310)

Subjects:

Linguistics and Language; Information Systems; P Philology. Linguistics

Departments:

Faculty of Arts & Social Sciences > Linguistics & English Language

ID Code:

3942

Deposited By:

ep_importer

Deposited On:

05 Mar 2008 13:07

Refereed?:

Yes

Published?:

Published

Last Modified:

10 Dec 2025 21:35

URI:

https://eprints.lancs.ac.uk/id/eprint/3942