Synthetic Data Feasibility: Benchmarking Differentially Private and Non-Differentially Private Synthetic Data Generation Methods
Conference: 65th ISI World Statistics Congress 2025
Format: CPS Abstract - WSC 2025
Keywords: differential privacy, machine learning, synthetic data
Abstract
Synthetic data, generated from existing data sources via various generation methods, is increasingly used to address a broad array of data science challenges, including dataset augmentation, de-biasing, and privacy-preserving data releases. Although synthetic datasets do not replicate their real counterparts exactly, synthetic data generation (SDG) methods can "memorize" input data, posing potential privacy risks. Moreover, real datasets often contain outliers whose privacy is particularly difficult to safeguard. Numerous definitions and forms of privacy have been proposed in the literature; this study focuses on privacy as the non-disclosure of an individual's participation in the dataset. A specific formulation called differential privacy (DP) guarantees that the output distribution of a randomized generation method changes by at most a bounded factor when a single individual's record is added to or removed from the input data, thereby limiting what any output reveals about an individual's participation.
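Formally, a randomized mechanism M satisfies (ε, δ)-differential privacy if, for every pair of datasets D and D' differing in a single individual's record and every measurable set of outputs S,

    \[ \Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S] + \delta, \]

where smaller ε and δ correspond to stronger privacy guarantees.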
Recent studies have explored integrating synthetic data into the data science and research pipeline. Typically, synthetic data is used to augment a dataset so that it becomes balanced with respect to target variables; an emerging trend, however, is replacing real datasets with synthetic ones within the pipeline. Findings in the literature generally indicate a tradeoff between utility and privacy: greater privacy in SDG methods often results in diminished machine learning performance on the synthetic data they produce.
This study evaluates the feasibility of synthetic data use by assessing SDG methods along three dimensions: fidelity, utility, and privacy. Fidelity measures the statistical similarity between synthetic and real data. Utility evaluates how effectively the synthetic data supports specific tasks. Privacy quantifies how much information about the real data can be inferred from the synthetic data through inference attacks. Classification tasks are performed on in-house datasets and surveys from the Bangko Sentral ng Pilipinas, as well as on publicly available tabular datasets. The SDG methods evaluated include generative adversarial networks, classification and regression trees, copulas, variational autoencoders, and Bayesian networks, along with differentially private variants of these methods. To improve the utility-privacy tradeoff, the study explores a modified train-on-synthetic-test-on-real (TSTR) pipeline: a pre-training phase in which an artificial neural network is trained on synthetic data, followed by fine-tuning on real data, with final utility assessed on a holdout set of real data. A sketch of this pipeline appears below.
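A minimal sketch of the modified TSTR pipeline, written in Python with scikit-learn under stated assumptions: the datasets here are hypothetical stand-ins (make_classification, with a noisy copy standing in for a true SDG output), and MLPClassifier stands in for the artificial neural network; the study's actual SDG methods and datasets are not shown.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import accuracy_score

    # Hypothetical stand-in for a real tabular dataset, split into a
    # training portion and a real-data holdout set for final evaluation.
    X_all, y_all = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_real, X_holdout, y_real, y_holdout = train_test_split(
        X_all, y_all, test_size=0.3, random_state=0)

    # Placeholder "synthetic" data; in the study this would come from an
    # SDG method (GAN, CART, copula, VAE, or Bayesian network).
    rng = np.random.default_rng(0)
    X_syn = X_real + rng.normal(scale=0.1, size=X_real.shape)
    y_syn = y_real

    # Phase 1: pre-train the ANN on synthetic data
    # (plain TSTR would stop here and evaluate on real data).
    model = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=200, random_state=0)
    model.fit(X_syn, y_syn)

    # Phase 2: fine-tune on real training data, warm-starting from the
    # pre-trained weights -- the "modified" step in the pipeline.
    model.set_params(warm_start=True, max_iter=50, learning_rate_init=1e-4)
    model.fit(X_real, y_real)

    # Final utility: performance on the holdout set of real data.
    print("holdout accuracy:", accuracy_score(y_holdout, model.predict(X_holdout)))

Plain TSTR would evaluate the model directly after the first fit; the warm-started second fit is the modification whose effect on utility the study measures.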
The study identifies the best-performing SDG methods in both non-private and private settings, analyzing differences in performance across the three evaluation dimensions. It also examines the effect of the modified TSTR pipeline on utility and identifies which SDG methods benefit most from the modification. The study concludes with insights into how central bank policies on data sharing and public release can be reinforced, and how the data science workflow can be enhanced, while ensuring that data privacy is maximally preserved.