Learning to combine multiple string similarity metrics for effective toponym matching

Santos, Rui and Murrieta-Flores, Patricia and Martins, Bruno (2017) Learning to combine multiple string similarity metrics for effective toponym matching. International Journal of Digital Earth. ISSN 1753-8947

Manusc_Combining_Multiple_String_Similarity_Metrics_for_Effective_Toponym_Matching.pdf - Accepted Version
Available under License Creative Commons Attribution-NonCommercial.

Abstract

Several tasks related to geographical information retrieval and to the geographical information sciences involve toponym matching, that is, the problem of matching place names that share a common referent. In this article, we present the results of a wide-ranging evaluation of the performance of different string similarity metrics on the toponym matching task. We also report on experiments that use supervised machine learning to combine multiple similarity metrics, which has the natural advantage of avoiding the manual tuning of similarity thresholds. Experiments with a very large dataset show that the performance differences between the individual similarity metrics are relatively small, and that carefully tuning the similarity threshold is important for achieving good results. The methods based on supervised machine learning, particularly when considering ensembles of decision trees, can achieve good results on this task, significantly outperforming the individual similarity metrics.
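
The two approaches contrasted in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three metrics chosen here (normalized Levenshtein similarity, Jaccard over character bigrams, and the ratio from Python's `difflib.SequenceMatcher`), the function names, the example place names, and the 0.7 threshold are all assumptions made for illustration. The first approach thresholds a single metric; the second builds one feature vector per name pair, which could then be passed to a supervised classifier such as an ensemble of decision trees.

```python
from difflib import SequenceMatcher


def levenshtein_similarity(a, b):
    """Edit distance normalized to a similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    # Classic dynamic-programming edit distance, keeping one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def bigram_jaccard(a, b):
    """Jaccard coefficient over the sets of character bigrams."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not grams_a and not grams_b:
        # Strings shorter than two characters have no bigrams.
        return 1.0 if a == b else 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_features(a, b):
    """Feature vector for a name pair: one value per similarity metric."""
    a, b = a.lower(), b.lower()
    return [
        levenshtein_similarity(a, b),
        bigram_jaccard(a, b),
        SequenceMatcher(None, a, b).ratio(),
    ]


def is_match_by_threshold(a, b, threshold=0.7):
    """Single-metric baseline; the threshold is an illustrative guess."""
    return levenshtein_similarity(a.lower(), b.lower()) >= threshold


# Hypothetical example pair, not taken from the paper's dataset.
features = similarity_features("Lisboa", "Lisbon")
matched = is_match_by_threshold("Lisboa", "Lisbon")
```

In the single-metric setting the threshold has to be tuned by hand for each metric, whereas a classifier trained on labeled pairs of feature vectors learns the decision boundary from data.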

Item Type:
Journal Article
Journal or Publication Title:
International Journal of Digital Earth
Additional Information:
This is an Accepted Manuscript of an article published by Taylor & Francis in International Journal of Digital Earth on 06/09/2017, available online: http://www.tandfonline.com/doi/full/10.1080/17538947.2017.1371253
Uncontrolled Keywords:
/dk/atira/pure/subjectarea/asjc/1700/1712
Subjects:
duplicate detection, ensemble learning, geographic information retrieval, string similarity metrics, supervised learning, toponym matching, software, computer science applications, general earth and planetary sciences, earth and planetary sciences (all)
ID Code:
89481
Deposited By:
Deposited On:
08 Jan 2018 10:28
Refereed?:
Yes
Published?:
Published
Last Modified:
19 Sep 2024 02:06