Anonymised Geospatial Health Data: A New Method for Small-Area Geomasking
Conference
65th ISI World Statistics Congress 2025
Format: CPS Abstract - WSC 2025
Session: CPS 71 - Spatial Data and Geomasking
Tuesday 7 October 4 p.m. - 5 p.m. (Europe/Amsterdam)
Abstract
Geographic health analyses require linking health data with spatial data. The accuracy with which spatial patterns can be identified depends directly on the precision of the spatial data. At the same time, the privacy of individuals must be protected, which is why linking health data with address data or precise geocoordinates is generally not permitted. Instead, health data are usually aggregated to administrative levels such as countries, administrative districts or municipalities, which severely limits the potential for analysing the data.
Armstrong et al. (1999) introduced geographical masks to increase analytical potential while preserving individual anonymity. Recent advances in digitalisation and data availability have revived interest in this topic, while stricter data protection laws are increasing the need for anonymisation methods. Yet research lacks standardised approaches for comparing and evaluating existing methods with regard to the degree of anonymity achieved and the data utility preserved. The widely used spatial k-anonymity measure proves insufficient in practical application (Hafner et al., 2019). Spatial k-anonymity defines k-1 as the number of masked coordinates that are closer to the original point than that point's own masked coordinate (Hampton et al., 2010). Under certain circumstances, however, anonymised points can be unambiguously assigned to original addresses even though spatial k-anonymity holds. Moreover, most applications anonymise only the geographical features and do not take into account links with other features, such as socio-demographic characteristics. Because health data must always be viewed in their overall context, geographic health research needs to consider not only the link between health data and spatial data but also other sensitive characteristics and quasi-identifiers in order to achieve meaningful and differentiated results. To guarantee the anonymity of the data subjects (e.g. persons diagnosed with a specific ICD code) in this setting, or to be able to measure it at all, spatial k-anonymity must be extended with approaches such as l-diversity (Machanavajjhala et al., 2007) or t-closeness (Li et al., 2006). In addition, the actual anonymity of geomasked datasets must be evaluated by considering different attack scenarios.
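The definition of spatial k-anonymity quoted above can be illustrated with a minimal sketch. It follows one literal reading of that sentence: for each record, k-1 is the number of other masked coordinates lying closer to the original point than the record's own masked coordinate. The function name `spatial_k`, the toy coordinates and the assumption of planar (projected) coordinates with Euclidean distance are illustrative choices, not taken from the abstract.

```python
import math

def spatial_k(originals, masked):
    """Per-record spatial k under one literal reading of the definition:
    k_i - 1 = number of OTHER masked points that lie closer to original
    point i than its own masked point does.
    Assumes planar coordinates, so Euclidean distance is meaningful."""
    ks = []
    for i, (orig, own_mask) in enumerate(zip(originals, masked)):
        own_dist = math.dist(orig, own_mask)
        closer = sum(
            1
            for j, other in enumerate(masked)
            if j != i and math.dist(orig, other) < own_dist
        )
        ks.append(closer + 1)
    return ks

# Hypothetical toy data: three original addresses and their masked positions.
originals = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
masked = [(3.0, 0.0), (9.0, 1.0), (1.0, 1.0)]

k_values = spatial_k(originals, masked)
print(k_values)  # [2, 1, 1]
```

In this toy example two of the three records have k = 1, meaning no other masked point lies closer to the original location than the record's own masked point. A per-record view like this is only a starting point; it does not cover the attack scenarios or the additional attributes discussed above.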
Whether a dataset is anonymous also depends substantially on the additional knowledge available to an attacker. This article therefore presents comprehensive criteria for assessing the data security of georeferenced health data. It also presents an aggregating method that forms the smallest possible spatial units within which measurable anonymity is guaranteed. The practical suitability of the method is demonstrated by applying it to a synthetically generated health dataset for the city of Cologne.
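To make the idea of measurable anonymity within aggregated units more concrete, the following sketch checks each unit against a minimum group size k and a minimum number of distinct sensitive values (distinct l-diversity). It is not the aggregation method presented in the article, whose details the abstract does not give; the function name `check_units`, the unit identifiers, the thresholds and the records are all hypothetical.

```python
from collections import defaultdict

def check_units(records, k_min, l_min):
    """records: iterable of (unit_id, sensitive_value) pairs.
    For each unit, report the group size k, the number of distinct
    sensitive values l, and whether both thresholds are met."""
    groups = defaultdict(list)
    for unit_id, value in records:
        groups[unit_id].append(value)
    report = {}
    for unit_id, values in groups.items():
        k = len(values)
        l = len(set(values))
        report[unit_id] = {"k": k, "l": l, "ok": k >= k_min and l >= l_min}
    return report

# Hypothetical records: (aggregated unit, ICD code of the diagnosis).
records = [
    ("A", "E11"), ("A", "I10"), ("A", "J45"),
    ("B", "E11"), ("B", "E11"), ("B", "E11"),
]

result = check_units(records, k_min=3, l_min=2)
print(result["A"])  # {'k': 3, 'l': 3, 'ok': True}
print(result["B"])  # {'k': 3, 'l': 1, 'ok': False}
```

Unit B contains three records and so satisfies the size threshold, yet all three share the same diagnosis, so knowing that a person lives in B reveals the diagnosis. This is the kind of case in which a size-based criterion alone is not enough.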