Erjavec, Tomaž and Kopp, Matyáš and Ljubešić, Nikola and Kuzman, Taja and Rayson, Paul and Osenova, Petya and Ogrodniczuk, Maciej and Çöltekin, Çağrı and Koržinek, Danijel and Meden, Katja and Skubic, Jure and Rupnik, Peter and Agnoloni, Tommaso and Aires, José and Barkarson, Starkaður and Bartolini, Roberto and Bel, Núria and Calzada Pérez, María and Darģis, Roberts and Diwersy, Sascha and Gavriilidou, Maria and van Heusden, Ruben and Iruskieta, Mikel and Kahusk, Neeme and Kryvenko, Anna and Ligeti-Nagy, Noémi and Magariños, Carmen and Mölder, Martin and Navarretta, Costanza and Simov, Kiril and Tungland, Lars Magne and Tuominen, Jouni and Vidler, John and Vladu, Adina Ioana and Wissik, Tanja and Yrjänäinen, Väinö and Fišer, Darja (2024) ParlaMint II : advancing comparable parliamentary corpora across Europe. Language Resources and Evaluation. ISSN 1574-020X
Full text not available from this repository.
Abstract
The paper presents the results of the ParlaMint II project, which comprise comparable corpora of parliamentary debates of 29 European countries and autonomous regions, covering at least the period from 2015 to 2022, and containing over 1 billion words. The corpora are uniformly encoded, contain rich metadata about their 24 thousand speakers, and are linguistically annotated up to the level of Universal Dependencies syntax and named entities. The paper focuses on the enhancements made since the ParlaMint I project and presents the compilation of the corpora, including the encoding infrastructure, the use of GitHub, the production of individual corpora, the common pipeline for producing their distribution, and the use of CLARIN services for dissemination. It then gives a quantitative overview of the produced corpora, followed by the qualitative additions made within the ParlaMint II project, namely metadata localisation, the addition of new metadata, such as the political orientation of political parties, the machine translation of the corpora to English and its tagging with semantic classes, and the production of pilot speech corpora. Finally, outreach activities and further work are discussed.