Tracing science-technology-linkages: A machine learning pipeline for extracting and matching patent in-text references to scientific publications

Abbasiantaeb, Zahra and Verberne, Suzan and Wang, Jian (2025) Tracing science-technology-linkages: A machine learning pipeline for extracting and matching patent in-text references to scientific publications. Information Processing & Management, 62 (6): 104264.

Full text not available from this repository.

Abstract

Patent references to science provide a valuable paper trail for investigating the knowledge flow from science to technological innovation. Research on patent–paper links has mostly concentrated on front-page references, often neglecting the more complex in-text references. Therefore, we developed a three-stage machine-learning pipeline to extract and match patent in-text references to scientific publications. Our pipeline performs the following tasks: (1) extracting reference strings from patent texts, (2) parsing fields from these reference strings, and (3) matching references to publications in the Web of Science (WoS) database. We developed a training dataset consisting of 3,900 (and 3,901) manually annotated references from 392 (and 319) randomly selected EPO (and USPTO) patents. The first stage, reference extraction, achieved almost perfect results with a precision of 98.9% and a recall of 97.7% at the reference level. Overall, the pipeline demonstrated robust performance, with a precision of 96.8% and a recall of 91.9% at the unique patent-paper-pair level. Applying this pipeline to EPO and USPTO patents granted between 1990 and 2022, we identified 5,438,836 (and 20,432,189) references from 492,469 (and 1,449,398) EPO (and USPTO) patents, 2,763,779 (and 11,069,995) of which are matched to WoS publications. This extensive dataset is a valuable resource for studying science-technology linkages. We offer open access to this dataset, along with the associated code and training data.
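The three stages named in the abstract (extract, parse, match) and the evaluation at the unique patent-paper-pair level can be illustrated with a minimal sketch. This is not the authors' implementation: simple regular expressions stand in for the machine-learning models, a small dictionary stands in for the WoS database, and every name, pattern, patent number, and publication ID below is a hypothetical placeholder chosen for illustration.

```python
import re
from typing import NamedTuple, Optional


class ParsedRef(NamedTuple):
    author: str
    journal: str
    year: int


# Stage 1 (toy stand-in): find strings shaped like "Smith et al., Nature, 1999".
REF_RE = re.compile(r"[A-Z][a-z]+ et al\., [A-Z][A-Za-z ]+?, \d{4}")
# Stage 2 (toy stand-in): split one reference string into fields.
FIELD_RE = re.compile(
    r"(?P<author>[A-Z][a-z]+) et al\., (?P<journal>.+), (?P<year>\d{4})"
)


def extract_references(patent_text: str) -> list:
    """Stage 1: return candidate reference strings found in the patent text."""
    return REF_RE.findall(patent_text)


def parse_reference(ref: str) -> Optional[ParsedRef]:
    """Stage 2: parse author, journal, and year from one reference string."""
    m = FIELD_RE.fullmatch(ref)
    if m is None:
        return None
    return ParsedRef(m["author"], m["journal"], int(m["year"]))


def match_reference(parsed: ParsedRef, index: dict) -> Optional[str]:
    """Stage 3: look up the parsed fields in a publication index (exact match)."""
    return index.get((parsed.author.lower(), parsed.journal.lower(), parsed.year))


def run_pipeline(patent_id: str, patent_text: str, index: dict) -> set:
    """Return the set of unique (patent_id, publication_id) pairs."""
    pairs = set()
    for ref in extract_references(patent_text):
        parsed = parse_reference(ref)
        if parsed is None:
            continue
        pub_id = match_reference(parsed, index)
        if pub_id is not None:
            pairs.add((patent_id, pub_id))
    return pairs


def precision_recall(predicted: set, gold: set) -> tuple:
    """Precision and recall over unique patent-paper pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: the index, patent number, and IDs are made up.
TOY_INDEX = {
    ("smith", "nature", 1999): "PUB-1",
    ("jones", "cell", 2004): "PUB-2",
}
TOY_TEXT = (
    "The method of Smith et al., Nature, 1999 was adapted; "
    "see also Lee et al., Science, 2010."
)

predicted = run_pipeline("EP0000001", TOY_TEXT, TOY_INDEX)
gold = {("EP0000001", "PUB-1"), ("EP0000001", "PUB-3")}
precision, recall = precision_recall(predicted, gold)
```

In this toy run the second reference is extracted and parsed but has no entry in the index, so it contributes no pair. Counting unique pairs rather than individual reference strings means that a publication cited several times in the same patent is scored once.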

Item Type:
Journal Article
Journal or Publication Title:
Information Processing & Management
Uncontrolled Keywords:
Research Output Funding/yes_externally_funded
Subjects:
text mining, reference extraction, science technology linkage, citation analysis, patent analysis, yes - externally funded
ID Code:
230479
Deposited By:
Deposited On:
22 Jul 2025 15:50
Refereed?:
Yes
Published?:
Published
Last Modified:
22 Jul 2025 15:50