Coole, Matthew and Rayson, Paul and Mariani, John (2020) LexiDB: Patterns & Methods for Corpus Linguistic Database Management. In: Proceedings of The 12th Language Resources and Evaluation Conference. European Language Resources Association (ELRA), Paris, pp. 3128-3135. ISBN 9791095546344
LREC2020_1_.pdf - Accepted Version
Available under License Creative Commons Attribution-NonCommercial.
2020.lrec_1.383.pdf - Published Version
Available under License Creative Commons Attribution-NonCommercial.
Abstract
LexiDB is a tool for storing, managing and querying corpus data. In contrast to other database management systems (DBMSs), it is designed specifically for text corpora. It improves on other corpus management systems (CMSs) because data can be added and deleted from corpora on the fly with the ability to add live data to existing corpora. LexiDB sits between these two categories of DBMSs and CMSs, more specialised to language data than a general-purpose DBMS but more flexible than a traditional static corpus management system. Previous work has demonstrated the scalability of LexiDB in response to the growing need to be able to scale out for ever-growing corpus datasets. Here, we present the patterns and methods developed in LexiDB for storage, retrieval and querying of multi-level annotated corpus data. These techniques are evaluated and compared to an existing CMS (Corpus Workbench CWB - CQP) and indexer (Lucene). We find that LexiDB consistently outperforms existing tools for corpus queries. This is particularly apparent with large corpora and when handling queries with large result sets.