Quantifying the contribution of individual records to the reidentification risk of (pseudo)anonymized datasets
Conference
64th ISI World Statistics Congress
Format: CPS Paper
Keywords: anonymization, extreme-value theory, privacy, pseudonymization, reidentification, risk
Session: CPS 07 - Statistical estimation II
Monday 17 July 8:30 a.m. - 9:40 a.m. (Canada/Eastern)
Abstract
The reidentification of individuals or business establishments in (pseudo)anonymized microdata may expose sensitive data and lead to fines and reputational damage for the data’s custodians. The QaR method (AFNOR, 2020) proposes a measure of the reidentification risk of a dataset, and a statistical technique, based on extreme-value theory, to estimate it. This risk measure is valuable in several ways: it gauges the effectiveness of whatever disclosure control the custodians apply to the data; it could be reported to regulatory authorities to demonstrate the custodians’ level of care for the data subjects’ privacy; and it can be used to calculate an insurance premium against unauthorized disclosure, or the amount of money that custodians need on their balance sheet to cover potential financial damages due to such disclosure.
The present paper deals with a particular aspect of the methodology: the quantification of the contribution of each record to the dataset’s risk. It discusses the importance of this quantification and its high computational cost for very large datasets, and proposes metrics that are faster to compute and could serve as proxies for record contribution. The results for some of these proxies are promising, but more investigation is needed.