Hoffmann, S. (2007) Processing Internet-derived text - creating a corpus of Usenet messages. Literary and Linguistic Computing, 22 (2). pp. 35-55.
Full text not available from this repository.
Abstract
In recent years, linguists have become increasingly interested in the language of the Internet, both as an object of investigation and as a source of authentic data to complement traditional electronic corpora. However, Internet-derived data is typically very messy, and a conversion process is often required before researchers can carry out a reliable quantitative investigation of the observed patterns with standard corpus tools. In this article, I discuss the technical and methodological aspects involved in creating a large corpus of asynchronous computer-mediated communication by downloading and post-processing hundreds of thousands of messages posted in twelve Usenet newsgroups. After describing how messages can be arranged into hierarchically structured discussion threads, I focus at some length on the strategies that are required to correctly assign authorship to the different textual elements in individual messages. My algorithms have a success rate of well over 90% for most newsgroups, and the resulting corpus can thus serve as a suitable basis for an investigation into the interactive strategies employed in this particular type of written communication.
| Item Type: | Article |
|---|---|
| Journal or Publication Title: | Literary and Linguistic Computing |
| Additional Information: | RAE_import_type : Journal article RAE_uoa_type : Linguistics |
| Subjects: | P Language and Literature > P Philology. Linguistics |
| Departments: | Faculty of Arts & Social Sciences > Linguistics & English Language |
| ID Code: | 3942 |
| Deposited By: | ep_importer |
| Deposited On: | 05 Mar 2008 13:07 |
| Refereed?: | Yes |
| Published?: | Published |
| Last Modified: | 26 Jul 2012 17:59 |
| URI: | http://eprints.lancs.ac.uk/id/eprint/3942 |