Infrastructure for Semantic Annotation in the Genomics Domain

El-Haj, Mahmoud and Rutherford, Nathan and Coole, Matthew and Ezeani, Ignatius and Prentice, Sheryl and Ide, Nancy and Knight, Jo and Piao, Scott and Mariani, John and Rayson, Paul and Suderman, Keith (2020) Infrastructure for Semantic Annotation in the Genomics Domain. In: LREC 2020, Twelfth International Conference on Language Resources and Evaluation : LREC'20. European Language Resources Association (ELRA), Paris, pp. 6921-6929. ISBN 9791095546344

Text (genomics)
genomics.pdf - Accepted Version
Available under License Creative Commons Attribution-NonCommercial.

Text (2020.lrec-1.855)
2020.lrec_1.855.pdf - Published Version
Available under License Creative Commons Attribution-NonCommercial.

Abstract

We describe a novel super-infrastructure for biomedical text mining which incorporates an end-to-end pipeline for the collection, annotation, storage, retrieval and analysis of biomedical and life sciences literature, combining NLP and corpus linguistics methods. The infrastructure permits extreme-scale research on the open access PubMed Central archive. It combines an updatable Gene Ontology Semantic Tagger (GOST) for entity identification and semantic markup in the literature with an NLP pipeline scheduler (Buster) to collect and process the corpus, and a bespoke columnar corpus database (LexiDB) for indexing. The corpus database is distributed to permit fast indexing, and provides a simple web front-end with corpus linguistics methods for sub-corpus comparison and retrieval. GOST is also connected as a service in the Language Applications (LAPPS) Grid, where it is interoperable with other NLP tools and data in the Grid and can be combined with them in more complex workflows. In a literature-based discovery setting, we have created an annotated corpus of 9,776 papers with 5,481,543 words.

Item Type: Contribution in Book/Report/Proceedings

Departments:
Faculty of Science and Technology > School of Computing & Communications
Faculty of Health and Medicine > Medicine
Faculty of Science and Technology > Psychology

ID Code: 142283

Deposited By: ep_importer_pure

Deposited On: 12 Mar 2020 15:05

Refereed?: Yes

Published?: Published

Last Modified: 24 Jan 2026 00:03

URI: https://eprints.lancs.ac.uk/id/eprint/142283
