65th ISI World Statistics Congress 2025

Estimating missed links between administrative data lists using dual systems estimation

Format: IPS Abstract - WSC 2025

Keywords: Bayesian, data-linkage, error rates

Session: IPS 798 - Assessment and Improvement of Data Quality Through Use of Auxiliary Information and Record Linkage

Tuesday 7 October 10:50 a.m. - 12:30 p.m. (Europe/Amsterdam)

Abstract

New Zealand has run a Census every five years since 1877, with four exceptions. Stats NZ has a long-term goal of developing an alternative census model based on government administrative data, supported by survey data. An administrative data-first approach aims to produce a wider range of statistics that are delivered faster and offer more detailed insights, and to improve the efficiency of data collection by reusing data that the government has already invested time and effort to collect. It also reduces our vulnerability to external events (such as pandemics, natural disasters, and severe weather events), which have delayed earlier censuses. An ultimate goal of this programme is to produce high-quality, timely population estimates from linked administrative data.

We are developing a methodology for producing these estimates based on dual systems estimation (DSE) using two lists of administrative data. These lists are linked using a largely automated method. The estimates will be biased by linkage error, in the form of missed and false links between the two lists. Missed links are pairs of records that refer to the same person but were not linked by the automated method; false links are pairs of records that refer to different people but were erroneously linked.
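To make the effect of missed links concrete, the following is a minimal sketch of the classical two-list (Lincoln-Petersen) form of DSE. It is an illustration only, not the methodology described in this abstract; the function name and all counts are hypothetical.

```python
def dual_system_estimate(n_list_a: int, n_list_b: int, n_linked: int) -> float:
    """Classical Lincoln-Petersen estimate of population size from two lists.

    n_list_a, n_list_b: number of records on each list
    n_linked: number of records linked between the two lists
    """
    if n_linked <= 0:
        raise ValueError("at least one linked record is required")
    return n_list_a * n_list_b / n_linked


# Hypothetical counts, chosen only for illustration.
records_a = 1000
records_b = 900
true_links = 800    # links that actually exist between the lists
found_links = 760   # links found automatically (40 missed links)

estimate_without_error = dual_system_estimate(records_a, records_b, true_links)
estimate_with_missed = dual_system_estimate(records_a, records_b, found_links)

print(estimate_without_error)  # 1125.0
print(estimate_with_missed > estimate_without_error)  # True
```

Because the number of links sits in the denominator, missed links push the population estimate upwards in this simple form, which is why an estimate of the number of missed links is needed for adjustment.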

False links can be mitigated by a conservative linking strategy, allowing us to concentrate on estimating and adjusting for missed links.

We have developed a method of estimating the number of missed links using a DSE on the links themselves. This requires two independent methods of linking. We formulate our method in a Bayesian framework, and estimate the number of missed links for small domains, with associated uncertainty. We demonstrate this method on simulated data and on real administrative data.
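The idea of applying DSE to the links themselves can be sketched with a classical point estimate. This is not the Bayesian formulation referred to above, and it gives no uncertainty measure; it only shows the counting logic under the assumption that the two linking methods are independent. The function name and the example links are hypothetical.

```python
def estimate_missed_links(links_method_1: set, links_method_2: set) -> float:
    """Point estimate of links missed by BOTH methods, assuming independence.

    Each argument is a set of (record_id_list_a, record_id_list_b) pairs
    found by one linking method.
    """
    found_by_both = len(links_method_1 & links_method_2)
    only_method_1 = len(links_method_1 - links_method_2)
    only_method_2 = len(links_method_2 - links_method_1)
    if found_by_both == 0:
        raise ValueError("the two methods must share at least one link")
    # Under independence, the unobserved cell of the 2x2 table is
    # (only method 1) * (only method 2) / (both methods).
    return only_method_1 * only_method_2 / found_by_both


# Hypothetical links: 900 found by both, 60 only by method 1, 30 only by method 2.
shared = {(i, i) for i in range(900)}
method_1 = shared | {(i, i) for i in range(900, 960)}
method_2 = shared | {(i, i) for i in range(960, 990)}

missed_by_both = estimate_missed_links(method_1, method_2)
estimated_true_links = len(method_1 | method_2) + missed_by_both

print(missed_by_both)        # 2.0
print(estimated_true_links)  # 992.0
```

In this sketch, a link missed by the production linkage could be one found only by the second method, or one found by neither; the estimate above covers only the second group.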

When applying this method to administrative data lists, it is difficult to find two independent sets of linking variables. One workaround, when only a limited number of linking variables is available, is to split the components of these variables to create two sets of linking variables, each of which is specific enough to define matches between the two lists. We find that this technique tends to induce dependency between missed links on the two sets of variables. We extend the method to deal with this dependency between the two linking methods.
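One generic way to see why this dependency matters is to index it by the odds ratio of the two-by-two table of links (found or missed by each method). The sketch below is a sensitivity illustration under that assumption; it is not the extension developed by the authors, and the function name, the odds ratio values, and the counts are all hypothetical.

```python
def missed_links_with_dependence(
    found_by_both: int,
    only_method_1: int,
    only_method_2: int,
    odds_ratio: float = 1.0,
) -> float:
    """Links missed by both methods, for an assumed odds ratio.

    odds_ratio = (both * neither) / (only_1 * only_2).
    A value of 1.0 corresponds to independent linking methods; values
    above 1.0 mean links missed by one method tend to be missed by the other.
    """
    if found_by_both <= 0:
        raise ValueError("the two methods must share at least one link")
    if odds_ratio <= 0:
        raise ValueError("odds_ratio must be positive")
    return odds_ratio * only_method_1 * only_method_2 / found_by_both


# Hypothetical counts: 900 links found by both, 60 and 30 by one method only.
independent = missed_links_with_dependence(900, 60, 30, odds_ratio=1.0)
dependent = missed_links_with_dependence(900, 60, 30, odds_ratio=3.0)

print(independent)  # 2.0
print(dependent)    # 6.0
```

With positive dependence, the independence-based estimate understates the number of links missed by both methods, so ignoring the dependency would leave part of the linkage bias uncorrected.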