An Efficient Semi-automated Data Export Checking System to Avoid Confidential Data Leakage
Conference
65th ISI World Statistics Congress 2025
Format: IPS Abstract - WSC 2025
Keywords: control, feature, supervised learning
Session: IPS 755 - Improving Access to Microdata for Researchers
Monday 6 October 10:50 a.m. - 12:30 p.m. (Europe/Amsterdam)
Abstract
CASD gives researchers access to confidential data inside secure spaces. Once their work is complete and their results are sufficiently aggregated, they are allowed to export their output. The exported output can then be used outside the secure space, for example in a publication.
In 2023, CASD introduced a first version of its output checking system based on a machine learning model. Our objective is to raise warnings on outputs that do not comply with statistical secrecy rules. To do so, we rely on past checks performed by our expert output-checking team.
This new version introduces several improvements that make the system more reliable and efficient. The model now predicts acceptance or refusal for each file contained in the output. Previously, it could only produce a prediction for an export as a whole, an approach that performed less well. Training was done on a larger number of examples, which broadens the variety of files covered and improves reliability.
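As an illustration of what file-level prediction makes possible, the sketch below turns per-file refusal scores into a list of files to review and an export-level warning. It is only a sketch under stated assumptions: the names `FilePrediction`, `flag_files` and `export_needs_review`, the 0.5 threshold and the "warn if any file is flagged" rule are hypothetical and are not taken from the CASD system.

```python
from dataclasses import dataclass


@dataclass
class FilePrediction:
    """Hypothetical container for one file's model score."""
    filename: str
    p_refuse: float  # assumed: model-estimated probability of refusal


def flag_files(predictions, threshold=0.5):
    """Return the names of files whose refusal score reaches the threshold."""
    return [p.filename for p in predictions if p.p_refuse >= threshold]


def export_needs_review(predictions, threshold=0.5):
    """Assumed rule: warn on the export as soon as one file is flagged."""
    return len(flag_files(predictions, threshold)) > 0


predictions = [
    FilePrediction("table_1.csv", 0.82),
    FilePrediction("figure_2.png", 0.10),
    FilePrediction("notes.txt", 0.35),
]
flagged = flag_files(predictions)
needs_review = export_needs_review(predictions)
```

With file-level scores, a human checker can be pointed to the specific files that triggered the warning rather than to the export as a whole.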
First, we will describe how the system processes files. A transformation step divides each file into components of different types (Text, Image and Table). Because file formats are heterogeneous, this step requires various libraries. We then apply feature engineering techniques to these components to obtain the features on which the model is trained. We will give more details on how we choose and apply these techniques for each component type.
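The two steps described above, splitting a file into typed components and then computing features per component, can be sketched as follows. This is a minimal illustration using only the Python standard library, not the actual CASD pipeline: the `Component` class, the splitter registry and the specific features (row counts, smallest numeric cell, digit ratio) are assumptions chosen for the example, and Image components are left out.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical typed piece of a file: 'text', 'image' or 'table'."""
    kind: str
    payload: object


def split_csv(raw):
    """Treat CSV content as a single table component (rows of strings)."""
    return [Component("table", list(csv.reader(io.StringIO(raw))))]


def split_plain_text(raw):
    """Treat plain text content as a single text component."""
    return [Component("text", raw)]


# Assumed registry: one splitter per file extension.
SPLITTERS = {".csv": split_csv, ".txt": split_plain_text}


def split_file(name, raw):
    """Dispatch to the splitter registered for the file extension."""
    for ext, splitter in SPLITTERS.items():
        if name.lower().endswith(ext):
            return splitter(raw)
    raise ValueError(f"no splitter registered for {name!r}")


def table_features(rows):
    """Illustrative table features; small numeric cells may hint at low counts."""
    values = []
    for row in rows:
        for cell in row:
            try:
                values.append(float(cell))
            except ValueError:
                pass  # non-numeric cell such as a header or a label
    return {
        "n_rows": len(rows),
        "n_cols": max((len(r) for r in rows), default=0),
        "n_numeric": len(values),
        "min_numeric": min(values, default=0.0),
    }


def text_features(text):
    """Illustrative text features: size and share of digit characters."""
    return {
        "n_chars": len(text),
        "n_tokens": len(text.split()),
        "digit_ratio": sum(c.isdigit() for c in text) / max(len(text), 1),
    }


def featurize(component):
    """Pick the feature function matching the component type."""
    if component.kind == "table":
        return table_features(component.payload)
    if component.kind == "text":
        return text_features(component.payload)
    raise NotImplementedError(f"no features for kind {component.kind!r}")


components = split_file("out.csv", "region,count\nA,12\nB,3\n")
features = [featurize(c) for c in components]
```

Keeping one splitter per format and one feature function per component type means a new file format only needs a new splitter, while the feature code stays shared.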
Titouan will present how we combine these two steps using databases and prepare the data for training. He will also introduce the technical system that allows the model to be used every day.
Pengfei will explain how we use an MLOps (Machine Learning Operations) workflow to automate model training and deployment. In particular, he will show how MLOps tools can speed up model development and delivery.
Rémy will share his team's experience of using this model and the improvements this technology has brought. In particular, he will discuss the critical role that human checkers play in the success of this project.
Finally, we will assess performance and discuss the most effective predictors for this model.
We would like to give a live demonstration of our system, on simulated data, during our presentation.
Figures/Tables
schema-arch