Design and construction of an openly available Urdu web corpus

Jehangir, H. and Hardie, A. (2024) Design and construction of an openly available Urdu web corpus. Corpora, 19 (3). pp. 363-373. ISSN 1749-5032

Full text not available from this repository.

Abstract

Urdu corpus linguistics is in its infancy, partly because the field lacks large, openly and freely accessible corpora. General-purpose Urdu corpora created to date are unsuitable as shared reference data for the field due to barriers of cost or copyright. The novel Lancaster Urdu Web Corpus (LUWC) is designed to fill this gap. It encompasses data from three news websites and an online chat forum. The corpus contains 24 million tokens and is part-of-speech (POS) tagged. To overcome problems with distributing a corpus whose texts’ intellectual property belongs to other parties, the LUWC is available through a CQPweb server, which does not allow access to the full underlying data. However, the accessibility of source URLs as text-level metadata gives users a means by which to see the full original context. In spite of issues of balance/representativeness, the LUWC can fulfil the role of a shared reference point for Urdu corpus analysis.

Item Type:
Journal Article
Journal or Publication Title:
Corpora
Subjects:
chat forum, e-newspapers, open-access, shared reference data, Urdu monolingual corpus
ID Code:
233640
Deposited On:
13 Nov 2025 12:40
Refereed?:
Yes
Published?:
Published
Last Modified:
13 Nov 2025 22:40