Edwards, Matthew and Rashid, Awais and Rayson, Paul (2018) Data quality measures for identity resolution. PhD thesis, Lancaster University.
2018edwardsphd.pdf - Published Version
Available under License Creative Commons Attribution-NonCommercial-NoDerivs.
Download (1MB)
Abstract
The explosion in popularity of online social networks has led to increased interest in identity resolution from security practitioners. Being able to connect together the multiple online accounts of a user can be of use in verifying identity attributes and in tracking the activity of malicious users. At the same time, privacy researchers are exploring the same phenomenon with interest in identifying privacy risks caused by re-identification attacks. Existing literature has explored how particular components of an online identity may be used to connect profiles, but few if any studies have attempted to assess the comparative value of information attributes. In addition, few of the methods being reported are easily comparable, due to difficulties with obtaining and sharing ground- truth data. Attempts to gain a comprehensive understanding of the identifiability of profile attributes are hindered by these issues. With a focus on overcoming these hurdles to effective research, this thesis first develops a methodology for sampling ground-truth data from online social networks. Building on this with reference to both existing literature and samples of real profile data, this thesis describes and grounds a comprehensive matching schema of profile attributes. The work then defines data quality measures which are important for identity resolution, and measures the availability, consistency and uniqueness of the schema’s contents. The developed measurements are then applied in a feature selection scheme to reduce the impact of missing data issues common in identity resolution. Finally, this thesis addresses the purposes to which identity resolution may be applied, defining the further application-oriented data quality measurements of novelty, veracity and relevance, and demonstrating their calculation and application for a particular use case: evaluating the social engineering vulnerability of an organisation.