65th ISI World Statistics Congress 2025

Simulated microdata to enhance access to public sector data

Author

Joseph Chien

Co-author

  • Isaac Norden
  • Marcus Robertson-Wall
  • Anders Holmberg

Conference

65th ISI World Statistics Congress 2025

Format: IPS Abstract - WSC 2025

Keywords: microdata, safe

Session: IPS 755 - Improving Access to Microdata for Researchers

Monday 6 October 10:50 a.m. - 12:30 p.m. (Europe/Amsterdam)

Abstract

The Australian Bureau of Statistics (ABS) is exploring simulated data techniques to generate safe microdata that enhance access to valuable public sector data assets for research purposes. These safe microdata are created using statistical information from aggregate statistics that have been cleared for safe release from the ABS DataLab. They offer a balance between data utility, accessibility, and privacy protection, and their advantages include:

1. Enabling researchers, communities, or any individual to explore research questions and develop models and code using microdata with realistic properties, without compromising confidentiality.
2. Providing a valuable training tool for universities and governments to explore and understand data.
3. Facilitating broader data access for researchers, communities, or any individual facing delays due to governance processes or resource limitations, thus accelerating research timelines.

The safe microdata generated through simulated data techniques are not intended to replace the real data assets. Rather, they give researchers and communities an opportunity to test hypotheses, explore basic statistical relationships, or fine-tune models that can then be evaluated against real data.

This research explores two approaches: the copula method, and a combined approach that pairs the Vale-Maurelli method with multinomial regression models. The copula method offers a flexible framework for simulating multivariate probability distributions with arbitrary marginal distributions and known intercorrelation structures. In contrast, the combined approach uses the Vale-Maurelli method for continuous variables and multinomial regression models for categorical variables. The Vale-Maurelli method simulates numeric variables using their first four moments (mean, variance, skewness, and kurtosis) and their covariance matrix, while multinomial regression models use numeric variables as predictors to generate categorical variables.
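To make the copula idea concrete, the following is a minimal sketch of a two-variable Gaussian copula, one common member of the copula family. It is not the ABS implementation: the function names, the choice of a Gaussian copula, the correlation value, and the exponential and lognormal marginals are all assumptions chosen for illustration. In practice the marginals and the dependence structure would presumably be estimated from cleared aggregate statistics.

```python
import math
import random
from statistics import NormalDist


def simulate_bivariate_gaussian_copula(rho, inv_cdf_x, inv_cdf_y, n, seed=0):
    """Draw n (x, y) pairs whose dependence follows a Gaussian copula with
    latent correlation rho and whose marginals are given by inverse CDFs."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    eps = 1e-12  # keep uniforms strictly inside (0, 1) for the inverse CDFs
    rows = []
    for _ in range(n):
        # Step 1: two standard normals with correlation rho.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # Step 2: the normal CDF maps each one to Uniform(0, 1).
        u1 = min(max(std_normal.cdf(z1), eps), 1.0 - eps)
        u2 = min(max(std_normal.cdf(z2), eps), 1.0 - eps)
        # Step 3: inverse marginal CDFs give the target distributions.
        rows.append((inv_cdf_x(u1), inv_cdf_y(u2)))
    return rows


def exponential_inv_cdf(mean):
    """Inverse CDF of an exponential distribution with the given mean."""
    return lambda u: -mean * math.log(1.0 - u)


def lognormal_inv_cdf(mu, sigma):
    """Inverse CDF of a lognormal whose log has mean mu and sd sigma."""
    normal = NormalDist(mu, sigma)
    return lambda u: math.exp(normal.inv_cdf(u))


# Hypothetical marginals and correlation, for illustration only.
sim = simulate_bivariate_gaussian_copula(
    rho=0.6,
    inv_cdf_x=exponential_inv_cdf(mean=10.0),
    inv_cdf_y=lognormal_inv_cdf(mu=10.0, sigma=0.5),
    n=5000,
)
```

The three steps separate the dependence structure (step 1) from the marginal shapes (step 3), which is what lets the marginals be arbitrary. Note that `rho` is the correlation of the latent normals; the Pearson correlation of the simulated variables will generally differ from it once non-normal marginals are applied.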

We compare the performance of these methods in preserving the statistical properties of both continuous and categorical variables. We evaluate their ability to maintain univariate distributions, bivariate relationships, and higher-order interactions present in the original data. This research contributes to the ongoing development of simulated data methodologies at the ABS, aiming to provide researchers with realistic, safe datasets for exploratory analysis and code testing before accessing the secure ABS DataLab environment. By comparing these approaches, we provide insights into their relative merits and limitations, guiding future implementations of simulated data in official statistics and broader data science and research applications.
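As a rough illustration of how univariate and bivariate fidelity might be checked, the sketch below compares an original and a simulated dataset using the two-sample Kolmogorov-Smirnov statistic for each variable and the absolute difference in Pearson correlation for each pair of variables. These particular metrics, the function names, and the dict-of-lists data layout are assumptions made for this example; the abstract does not state which measures the authors use, and the sketch covers numeric variables only.

```python
import bisect
import math


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of samples a and b (0 = identical, 1 = disjoint)."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for x in a + b:
        fa = bisect.bisect_right(a, x) / len(a)
        fb = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(fa - fb))
    return gap


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def utility_report(original, simulated):
    """Compare two datasets given as {column name: list of numbers}.

    Returns per-column KS statistics (univariate fidelity) and per-pair
    absolute correlation differences (bivariate fidelity)."""
    cols = list(original)
    ks = {c: ks_statistic(original[c], simulated[c]) for c in cols}
    corr_gap = {
        (c, d): abs(pearson(original[c], original[d])
                    - pearson(simulated[c], simulated[d]))
        for i, c in enumerate(cols)
        for d in cols[i + 1:]
    }
    return ks, corr_gap
```

Smaller values indicate that the simulated data track the original more closely on that measure. Categorical variables and higher-order interactions would need other tools, such as comparing contingency tables or model coefficients fitted to each dataset.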