Jackson, James and Francis, Brian and Mitra, Robin and Dove, Iain (2022) Using saturated models for data synthesis. In: Proceedings of the 36th International Workshop on Statistical Modelling : July 18-22, 2022 - Trieste, Italy. EUT Edizioni Università di Trieste, Trieste 2022, ITA, pp. 205-210. ISBN 9788855113090
Jackson_et_al._2022_IWSM.pdf - Published Version
Available under License Creative Commons Attribution.
Abstract
The use of synthetic data sets is becoming ever more prevalent, as regulations such as the General Data Protection Regulation (GDPR), which place greater demands on the protection of individuals’ personal data, are coupled with the conflicting demand to make more data available to researchers. This paper discusses the approach of synthesizing categorical data at the aggregated (contingency table) level using a saturated count model, which adds noise, and hence protection, to cell counts. The paper also discusses how distributional properties of synthesis models are intrinsic to generating synthetic data with suitable risk and utility profiles.
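As a rough illustration of synthesis at the contingency table level, the sketch below assumes a saturated Poisson model. The abstract does not name the count distribution, so the Poisson choice is an assumption made only for illustration. In a saturated model the fitted cell means equal the observed counts, so each synthetic count can be drawn with the observed count as its mean. The function name `synthesize_counts` and the example table are hypothetical and not taken from the paper.

```python
import numpy as np


def synthesize_counts(observed, rng):
    """Draw one synthetic table from a saturated Poisson model (assumed here).

    A saturated model reproduces the observed counts as fitted means,
    so each cell is sampled from Poisson(mean = observed count).
    """
    observed = np.asarray(observed)
    return rng.poisson(lam=observed)


# Hypothetical 2 x 3 contingency table of observed counts.
rng = np.random.default_rng(0)
observed = np.array([[12, 0, 5], [3, 40, 7]])
synthetic = synthesize_counts(observed, rng)
print(synthetic)
```

Under this Poisson assumption the noise variance of each cell equals its observed count, and cells observed as zero are always synthesized as zero. This is one example of how the distributional properties of the synthesis model shape the risk and utility of the output.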