COUNTER - COrpus of Urdu News TExt Reuse

Muhammad, Sharjeel and Nawab, Rao Muhammad Adeel and Rayson, Paul Edward (2017) COUNTER - COrpus of Urdu News TExt Reuse. Language Resources and Evaluation, 51 (3). pp. 777-803. ISSN 1574-020X

PDF (counter-lre-v4)
counter_lre_v4.pdf - Accepted Version
Available under License Creative Commons Attribution.

Abstract

Text reuse is the act of borrowing text from existing documents to create new texts. Freely available and easily accessible large online repositories are making the reuse of text not only more common in society but also harder to detect. A major hindrance to the development and evaluation of existing and new mono-lingual text reuse detection methods, especially for South Asian languages, is the unavailability of standardized benchmark corpora. Amongst other things, a gold standard corpus enables researchers to directly compare existing state-of-the-art methods. In our study, we address this gap by developing a benchmark corpus for one of the widely spoken but under-resourced languages, namely Urdu. The COUNTER (COrpus of Urdu News TExt Reuse) corpus contains 1,200 documents with real examples of text reuse from the field of journalism. It has been manually annotated at the document level with three levels of reuse: wholly derived, partially derived and non-derived. We also apply a number of similarity estimation methods to our corpus to show how it can be used for the development, evaluation and comparison of text reuse detection systems for the Urdu language. The corpus is a vital resource for the development and evaluation of text reuse detection systems in general and for the Urdu language in particular.

Item Type:

Journal Article

Journal or Publication Title:

Language Resources and Evaluation

Additional Information:

The final publication is available at Springer via http://dx.doi.org/10.1007/s10579-016-9367-2

Uncontrolled Keywords:

/dk/atira/pure/subjectarea/asjc/3300/3310

Subjects:

mono-lingual text reuse; urdu news corpus; urdu text reuse detection; corpus generation; linguistics and language; library and information sciences

Departments:

Faculty of Science and Technology > School of Computing & Communications

ID Code:

81443

Deposited By:

ep_importer_pure

Deposited On:

12 Sep 2016 15:34

Refereed?:

Yes

Published?:

Published

Last Modified:

30 Jun 2026 12:31

URI:

https://eprints.lancs.ac.uk/id/eprint/81443