Design and construction of an openly available Urdu web corpus

Jehangir, H. and Hardie, A. (2024) Design and construction of an openly available Urdu web corpus. Corpora, 19 (3). pp. 363-373. ISSN 1749-5032

Full text not available from this repository.

Abstract

Urdu corpus linguistics is in its infancy, partly because the field lacks large, openly and freely accessible corpora. General-purpose Urdu corpora created to date are unsuitable as shared reference data for the field due to barriers of cost or copyright. The novel Lancaster Urdu Web Corpus (LUWC) is designed to fill this gap. It encompasses data from three news websites and an online chat forum. The corpus contains 24 million tokens and is part-of-speech (POS) tagged. To overcome problems with distributing a corpus whose texts’ intellectual property belongs to other parties, the LUWC is available through a CQPweb server, which disallows access to the full underlying data. However, the accessibility of source URLs as text-level metadata gives users a means by which to see the full original context. In spite of issues of balance/representativeness, the LUWC can fulfil the role of a shared reference point for Urdu corpus analysis.

Item Type:

Journal Article

Journal or Publication Title:

Corpora

Subjects:

chat forum, e-newspapers, open-access, shared reference data, Urdu monolingual corpus

Departments:

Faculty of Arts & Social Sciences > Linguistics & English Language

ID Code:

233640

Deposited By:

ep_importer_pure

Deposited On:

13 Nov 2025 12:40

Refereed?:

Yes

Published?:

Published

Last Modified:

13 Dec 2025 13:14

URI:

https://eprints.lancs.ac.uk/id/eprint/233640