Comparing and combining a semantic tagger and a statistical tool for MWE extraction.

Piao, Scott Songlin and Rayson, Paul and Archer, Dawn and McEnery, Tony (2005) Comparing and combining a semantic tagger and a statistical tool for MWE extraction. Computer Speech and Language, 19 (4). pp. 378-397.

Full text not available from this repository.


Automatic extraction of multiword expressions (MWEs) presents a tough challenge for the NLP community and corpus linguistics. Indeed, although numerous knowledge-based symbolic approaches and statistically driven algorithms have been proposed, efficient MWE extraction remains an unsolved issue. In this paper, we evaluate the Lancaster UCREL Semantic Analysis System (henceforth USAS; Rayson, P., Archer, D., Piao, S., McEnery, T., 2004. The UCREL semantic analysis system. In: Proceedings of the LREC-04 Workshop, Beyond Named Entity Recognition: Semantic Labelling for NLP Tasks, Lisbon, Portugal, pp. 7–12) for MWE extraction, and explore the possibility of improving USAS by incorporating a statistical algorithm. Developed at Lancaster University, the USAS system automatically annotates English corpora with semantic category information. Employing a large-scale, semantically classified multiword expression template database, the system is also capable of detecting many multiword expressions, as well as assigning semantic field information to the MWEs extracted. Whilst USAS therefore offers a unique tool for MWE extraction, allowing us to both extract and semantically classify MWEs, it can sometimes suffer from low recall. Consequently, we have been comparing USAS, which employs a symbolic approach, to a statistical tool, which is based on collocational information, in order to determine the pros and cons of these different tools and, more importantly, to examine the possibility of improving MWE extraction by combining them. As we report in this paper, we have found a highly complementary relation between the different tools: USAS missed many domain-specific MWEs (law/court terms in this case), and the statistical tool missed many commonly used MWEs that occur at low frequencies (lower than three in this case). Due to their complementary relation, we are proposing that MWE coverage can be significantly increased by combining a lexicon-based symbolic approach and a collocation-based statistical approach.
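
The complementary relation described in the abstract can be illustrated with a minimal sketch. This is not the USAS system or the statistical tool evaluated in the paper: the function names, the toy lexicon, and the toy corpus are all hypothetical, MWEs are restricted to bigrams, and a raw frequency cut-off of three stands in for a real collocation measure.

```python
from collections import Counter


def lexicon_mwes(tokens, lexicon):
    """Symbolic side: keep bigrams listed in a hand-built MWE lexicon."""
    return {bg for bg in zip(tokens, tokens[1:]) if bg in lexicon}


def statistical_mwes(tokens, min_freq=3):
    """Statistical side: keep bigrams seen at least min_freq times.

    Raw frequency is only a stand-in for a collocation measure here.
    """
    counts = Counter(zip(tokens, tokens[1:]))
    return {bg for bg, n in counts.items() if n >= min_freq}


def combined_mwes(tokens, lexicon, min_freq=3):
    """Combine the two candidate sets by taking their union."""
    return lexicon_mwes(tokens, lexicon) | statistical_mwes(tokens, min_freq)


# Hypothetical toy data: a general-purpose lexicon and a court-domain text.
toy_lexicon = {("of", "course")}
toy_tokens = (
    "magistrates court sat . a magistrates court rose . "
    "this magistrates court adjourned . of course he agreed"
).split()

print(lexicon_mwes(toy_tokens, toy_lexicon))
print(statistical_mwes(toy_tokens))
print(combined_mwes(toy_tokens, toy_lexicon))
```

In this toy setting the lexicon lookup finds the common but infrequent "of course" and misses the domain term "magistrates court", while the frequency filter does the reverse; the union covers both, which mirrors the kind of coverage gain the abstract proposes.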

Item Type:
Journal Article
Journal or Publication Title:
Computer Speech and Language
Additional Information:
The study of multiword expressions (MWEs) has been a hot topic in computational linguistics in the last five years. This cross-disciplinary paper linked research from corpus-based natural language processing to corpus linguistics and showed that rule-based semantic heuristics and statistical extraction techniques were complementary. This paper was the culmination of a body of work submitted to Association for Computational Linguistics (ACL) conferences over a period of three years. It was part of a special issue of the journal Computer Speech and Language (CSL) and was consistently among the top twenty CSL articles downloaded for twelve months from April 2005.
RAE_import_type: Journal article
RAE_uoa_type: Computer Science and Informatics
Uncontrolled Keywords:
theoretical computer science; software; human-computer interaction; QA75 electronic computers. computer science
Deposited On:
21 Jun 2008 21:23
Last Modified:
16 Jan 2024 09:55