Modernising probabilistic linking at the Australian Bureau of Statistics and its potential to improve multisource statistics production
Conference
65th ISI World Statistics Congress 2025
Format: IPS Abstract - WSC 2025
Tuesday 7 October 10:50 a.m. - 12:30 p.m. (Europe/Amsterdam)
Abstract
This paper describes the ABS’s experience with SpLink and its potential to improve the quality and capability of data integration. In particular, it considers the potential of the methodology and tool to address data quality challenges across government, to handle large and multiple data sources, and to measure linkage uncertainty, both to obtain accurate measures of linkage quality and to provide a means of accounting for linkage error in analyses.
One example is adjusting regression models fitted to linked data so that the estimates of the parameters and their variances are unbiased. We will also outline how we plan to use this capability to construct statistical networks between people, jobs and employers or other entities, which can be analysed through a Knowledge Graph.
The ABS’s initial results with SpLink are very promising, showing consistency with the current deterministic method and faster turnaround of linkages. The probabilistic linking method (Fellegi-Sunter) also provides more accurate and statistically defensible measures of the match probability of each linked pair, and hence a better overall assessment of linkage quality. The use of SpLink will help the ABS meet the growing demand for linked data products across federal and state jurisdictions.
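To make the idea of a match probability concrete, the following is a minimal sketch of the Fellegi-Sunter calculation for a single record pair. It is illustrative only and is not SpLink's API: the function names, the m and u values and the prior are hypothetical, and the sketch assumes that comparison fields agree or disagree independently given match status.

```python
import math


def match_weight(agreements, m, u):
    """Sum the log2 Bayes factors over all comparison fields.

    agreements: dict mapping field name -> True (agrees) or False (disagrees)
    m: dict of P(field agrees | records are a true match)
    u: dict of P(field agrees | records are not a match)
    Assumes conditional independence between fields.
    """
    total = 0.0
    for field, agrees in agreements.items():
        if agrees:
            total += math.log2(m[field] / u[field])
        else:
            total += math.log2((1 - m[field]) / (1 - u[field]))
    return total


def match_probability(weight, prior):
    """Convert a match weight and a prior match probability to a posterior."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** weight
    return posterior_odds / (1 + posterior_odds)


# Hypothetical parameters for two linkage variables.
m = {"surname": 0.95, "birth_year": 0.90}
u = {"surname": 0.01, "birth_year": 0.05}

pair = {"surname": True, "birth_year": False}
probability = match_probability(match_weight(pair, m, u), prior=0.001)
```

In this sketch, agreement on a field raises the weight by the log of m/u and disagreement lowers it by the log of (1 - m)/(1 - u), so each pair receives a probability rather than a yes/no link decision.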
We also seek to open up possibilities for linking multiple data sources. The quality of a linkage inevitably hinges on the quality and consistency of the linkage variables in each dataset. Datasets that may become available in the future, especially those from private sector sources, are unlikely to carry the same legal requirement for registrants to provide accurate information that applies to data collected by many federal government agencies. From a statistical point of view, it is therefore critical to measure linkage uncertainty, both to obtain accurate measures of linkage quality and to provide a means of accounting for linkage error in analyses. SpLink facilitates this at scale by automatically producing all the many-to-many links along with their match probabilities.
SpLink makes many-to-many links a realistic, high-turnover proposition. This, in turn, enables the construction of statistical networks between people, jobs and employers or other entities, which can be analysed through a Knowledge Graph. By bringing together the units model definitions, connection probabilities, and a schema of rules for allowable connections, it will be possible to undertake sophisticated analyses.
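As a rough illustration of how unit definitions, connection probabilities and a schema of allowable connections might fit together, the sketch below builds a small weighted network from many-to-many links. It is not the ABS design: the unit types, the ALLOWED schema, the identifiers and the probabilities are all hypothetical, and the path calculation assumes that links along a path are independent.

```python
from collections import defaultdict

# Hypothetical schema: which unit types may be connected, and in which direction.
ALLOWED = {("person", "job"), ("job", "employer")}


def build_graph(links, unit_types, allowed=ALLOWED, threshold=0.0):
    """Keep only links permitted by the schema and at or above the threshold.

    links: iterable of (source_id, target_id, match_probability)
    unit_types: dict mapping unit id -> unit type
    Returns a dict of dicts: graph[source][target] = probability.
    """
    graph = defaultdict(dict)
    for src, dst, prob in links:
        pair = (unit_types[src], unit_types[dst])
        if pair not in allowed or prob < threshold:
            continue
        graph[src][dst] = prob
    return dict(graph)


def path_probability(graph, path):
    """Probability that every link along the path is correct.

    Assumes the links are independent, which is a simplification.
    """
    prob = 1.0
    for a, b in zip(path, path[1:]):
        prob *= graph[a][b]
    return prob


unit_types = {"p1": "person", "j1": "job", "e1": "employer"}
links = [
    ("p1", "j1", 0.9),
    ("j1", "e1", 0.8),
    ("p1", "e1", 0.7),  # person-to-employer is not in the schema, so it is dropped
]

graph = build_graph(links, unit_types)
person_to_employer = path_probability(graph, ["p1", "j1", "e1"])
```

Carrying the probabilities on the edges, rather than thresholding them away at linkage time, is what lets downstream analyses account for linkage uncertainty.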