Data augmentation and transfer learning for cross-lingual Named Entity Recognition in the biomedical domain

Lancheros, B.S. and Corpas Pastor, G. and Mitkov, R. (2024) Data augmentation and transfer learning for cross-lingual Named Entity Recognition in the biomedical domain. Language Resources and Evaluation. ISSN 1574-020X


Abstract

The growing volume of biomedical data and the continued expansion of the internet have sharply increased the need for Information Extraction (IE) techniques. Named Entity Recognition (NER) is one such IE task, useful to professionals in many areas. Biomedical NER is needed in several settings, for instance the extraction and analysis of biomedical literature, relation extraction, the organisation of biomedical documents, and knowledge-base completion. However, the computational treatment of entities in the biomedical domain faces a number of challenges, including the high cost of annotation, ambiguity, and the lack of biomedical NER datasets in languages other than English. These difficulties have hampered data development, affecting both the domain itself and its multilingual coverage. The purpose of this study is to overcome the scarcity of biomedical NER data in Spanish, for which only two datasets exist, by developing a robust bilingual NER model. Inspired by back-translation, this paper leverages progress in Neural Machine Translation (NMT) to create a synthetic Spanish version of the Colorado Richly Annotated Full-Text (CRAFT) dataset. Additionally, an augmented CRAFT dataset is constructed by replacing 20% of the entities in the original dataset. We evaluate two training methods, concatenation of datasets and continuous training, to assess the transfer learning capabilities of transformers using the newly obtained datasets. The best-performing NER system achieved an F-1 score of 86.39% on the development set. The novel methodology proposed in this paper presents the first bilingual NER system and has the potential to improve applications across under-resourced languages.
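
The entity-replacement augmentation mentioned in the abstract could look roughly like the following Python sketch. It is not the authors' implementation: the abstract does not say what the replaced entities are swapped for or how the 20% are selected. The sketch assumes BIO-tagged token sequences, draws replacements from same-type mentions found in the dataset itself, and selects each mention independently with probability 0.2 (so about 20% are replaced on average). The names `find_spans` and `augment` and the sample sentences are hypothetical.

```python
import random
from collections import defaultdict


def find_spans(tags):
    """Return (start, end, type) for each BIO entity span; end is exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and start is not None and tag[2:] == etype
        if not inside:
            if start is not None:
                spans.append((start, i, etype))
                start, etype = None, None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
    return spans


def augment(sentences, fraction=0.2, seed=0):
    """Swap roughly `fraction` of entity mentions for same-type mentions.

    Assumption: replacements come from a pool built from the dataset itself.
    """
    rng = random.Random(seed)

    # Build a pool of mentions (token lists) for each entity type.
    pool = defaultdict(list)
    for tokens, tags in sentences:
        for start, end, etype in find_spans(tags):
            pool[etype].append(tokens[start:end])

    augmented = []
    for tokens, tags in sentences:
        new_tokens, new_tags, cursor = [], [], 0
        for start, end, etype in find_spans(tags):
            # Copy the text between the previous span and this one unchanged.
            new_tokens += tokens[cursor:start]
            new_tags += tags[cursor:start]
            # Each mention is replaced independently with probability `fraction`.
            if rng.random() < fraction:
                mention = rng.choice(pool[etype])
            else:
                mention = tokens[start:end]
            new_tokens += mention
            # Re-tag the mention, since its length may have changed.
            new_tags += ["B-" + etype] + ["I-" + etype] * (len(mention) - 1)
            cursor = end
        new_tokens += tokens[cursor:]
        new_tags += tags[cursor:]
        augmented.append((new_tokens, new_tags))
    return augmented


# Hypothetical sample data, for illustration only.
sample = [
    (["BRCA1", "interacts", "with", "tumour", "protein", "p53"],
     ["B-GENE", "O", "O", "B-GENE", "I-GENE", "I-GENE"]),
    (["Aspirin", "inhibits", "COX-2"],
     ["B-CHEM", "O", "B-GENE"]),
]
augmented_sample = augment(sample, fraction=0.2, seed=13)
```

Re-tagging each inserted mention keeps tokens and labels aligned even when the replacement has a different length from the original, which a token-level NER model needs in order to train on the augmented data.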

Item Type:
Journal Article
Journal or Publication Title:
Language Resources and Evaluation
Subjects:
Linguistics and Language; Library and Information Sciences
ID Code:
220419
Deposited On:
23 May 2024 10:40
Refereed?:
Yes
Published?:
Published
Last Modified:
24 May 2024 03:05