Erjavec, Tomaž and Ogrodniczuk, Maciej and Osenova, Petya and Ljubešić, Nikola and Simov, Kiril and Pančur, Andrej and Rudolf, Michał and Kopp, Matyáš and Barkarson, Starkaður and Steingrímsson, Steinþór and Çöltekin, Çağrı and de Does, Jesse and Depuydt, Katrien and Agnoloni, Tommaso and Venturi, Giulia and Pérez, María Calzada and de Macedo, Luciana D. and Navarretta, Costanza and Luxardo, Giancarlo and Coole, Matthew and Rayson, Paul and Morkevičius, Vaidas and Krilavičius, Tomas and Darǵis, Roberts and Ring, Orsolya and van Heusden, Ruben and Marx, Maarten and Fišer, Darja (2023) The ParlaMint corpora of parliamentary proceedings. Language Resources and Evaluation, 57 (1). pp. 415-448. ISSN 1574-0218
s10579_021_09574_0.pdf - Published Version
Available under License Creative Commons Attribution.
Abstract
This paper presents the ParlaMint corpora, which contain transcriptions of sessions of 17 European national parliaments and total half a billion words. The corpora are uniformly encoded, contain rich metadata about 11 thousand speakers, and are linguistically annotated following the Universal Dependencies formalism and marked up with named entities. Samples of the corpora and conversion scripts are available from the project’s GitHub repository. The complete corpora are openly available for download via the CLARIN.SI repository, and for online exploration and analysis through the NoSketch Engine and KonText concordancers and the Parlameter interface.