The Multilingual Corpus of World’s Constitutions (MCWC) : MCWC

El-Haj, Mahmoud and Ezzini, Saad (2024) The Multilingual Corpus of World’s Constitutions (MCWC) : MCWC. In: The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, 2024-05-20 - 2024-05-25. (In Press)

mcwc-elhaj.pdf - Accepted Version
Available under License Creative Commons Attribution-NonCommercial.

Abstract

The “Multilingual Corpus of World’s Constitutions” (MCWC) is a rich resource available in English, Arabic, and Spanish, encompassing constitutions from various nations. This corpus serves as a vital asset for the NLP community, facilitating advanced research in constitutional analysis, machine translation, and cross-lingual legal studies. To ensure comprehensive coverage, for constitutions not originally available in Arabic and Spanish, we employed a fine-tuned state-of-the-art machine translation model. MCWC prepares its data to ensure high quality and minimal noise, while also providing valuable mappings of constitutions to their respective countries and continents, facilitating comparative analysis. Notably, the corpus offers pairwise sentence alignments across languages, supporting machine translation experiments. We utilise a leading Machine Translation model, fine-tuned on the MCWC to achieve accurate and context-aware translations. Additionally, we introduce an independent Machine Translation model as a comparative baseline. Fine-tuning the model on MCWC improves accuracy, highlighting the significance of such a legal corpus for NLP and Machine Translation. MCWC’s diverse multilingual content and commitment to data quality contribute to advancements in legal text analysis within the NLP community, facilitating exploration of constitutional texts and multilingual data analysis.

Item Type:
Contribution to Conference (Paper)
Journal or Publication Title:
The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Uncontrolled Keywords:
Research Output Funding/yes_internally_funded
Subjects:
constitutions, corpus, fine-tuning, machine translation, legal documents, yes - internally funded, no
ID Code:
219131
Deposited By:
Deposited On:
15 May 2024 12:55
Refereed?:
Yes
Published?:
In Press
Last Modified:
03 Nov 2024 00:53