A Metadata-based Framework for Combining Data Sources in Official Statistics
Conference
65th ISI World Statistics Congress 2025
Format: CPS Abstract - WSC 2025
Keywords: combining-data, metadata
Session: CPS 61 - Data Integration in Official Statistics
Monday 6 October 4 p.m. - 5 p.m. (Europe/Amsterdam)
Abstract
National statistical institutes have access to an increasing number of sources for the production of official statistics. When designing a new statistical output, it is not always clear whether, and how, the intended output can be produced from the variety of input sources available. Important considerations that affect their use are the unit types, populations, periods, and variables covered by the available sources. A metadata-based framework can be used to describe the processes by which tabular sources are combined.
The proposed presentation discusses a metadata-based framework for combining data sources, as well as the practicalities of implementing an automated solution-finding approach. The framework provides a standardised way of describing the process of combining administrative, survey and big data sources. It can be used to document the steps taken when producing multi-source statistics. It can also be used during the design phase of a potential new statistical output, by testing whether and how the intended output can be made from the input sources available. This is done by searching for a path between a set of available sources and the intended output. Such a path is composed of basic data manipulations and modelling steps, which are defined in the framework. Because the framework requires only metadata as input, the design phase of a new statistic can be carried out without accessing the data themselves, and hence without privacy concerns.
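As an illustration of the path-existence idea, the following is a minimal sketch and not the framework's actual specification. It assumes that each source is described by four metadata fields (unit type, population, period, variables) and that only two basic manipulations exist: joining two tables with identical unit type, population and period, and aggregating to a coarser unit type. The names `Meta`, `join`, `aggregate`, `AGGREGATION` and `path_exists`, as well as the example sources, are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Meta:
    """Metadata description of one tabular source (no actual data)."""
    unit: str
    population: str
    period: str
    variables: frozenset


# Hypothetical aggregation hierarchy: fine unit type -> next coarser unit type.
AGGREGATION = {"person": "municipality", "municipality": "country"}


def join(a, b):
    """Combine the variables of two tables on the same units, population and period."""
    if (a.unit, a.population, a.period) != (b.unit, b.population, b.period):
        return None
    return Meta(a.unit, a.population, a.period, a.variables | b.variables)


def aggregate(a):
    """Move a table one level up the assumed aggregation hierarchy."""
    coarser = AGGREGATION.get(a.unit)
    if coarser is None:
        return None
    return Meta(coarser, a.population, a.period, a.variables)


def satisfies(have, want):
    """True if a derived table matches the target and contains all its variables."""
    same_frame = (have.unit, have.population, have.period) == (
        want.unit, want.population, want.period)
    return same_frame and want.variables <= have.variables


def path_exists(sources, target):
    """Breadth-first search over metadata descriptions reachable from the sources."""
    known = set(sources)
    frontier = deque(sources)
    while frontier:
        current = frontier.popleft()
        if satisfies(current, target):
            return True
        candidates = [aggregate(current)]
        candidates += [join(current, other) for other in list(known)]
        for new in candidates:
            if new is not None and new not in known:
                known.add(new)
                frontier.append(new)
    return False


# Hypothetical mobility example: a travel survey and a population register.
survey = Meta("person", "residents", "2023", frozenset({"trips"}))
register = Meta("person", "residents", "2023", frozenset({"age_group"}))
target = Meta("municipality", "residents", "2023", frozenset({"trips", "age_group"}))

print(path_exists([survey, register], target))
print(path_exists([survey], target))
```

The search terminates because the set of reachable descriptions is finite: variables can only be unions of the source variables, and the hierarchy has a fixed number of levels. Only metadata objects are manipulated, which mirrors the point that no microdata are needed at the design stage.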
An implementation of the A* algorithm finds such paths fully automatically when multiple sources are combined. This is especially relevant for national statistical institutes, which tend to have access to a large collection of data sources. The framework and its implementation offer the greatest benefits when a large number of data sources is available. The framework is applied to a case study of mobility data to demonstrate its use, and its most important considerations are discussed.
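The role of A* can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation referred to in the abstract: a state is a pair of unit type and available variables, the only steps are "join with an original source at the same unit type" and "aggregate one level", and the step costs, the hierarchy `LEVELS`, the heuristic and the example sources are all hypothetical.

```python
import heapq
from itertools import count

# Hypothetical aggregation hierarchy (fine to coarse) and step costs.
LEVELS = ["person", "municipality", "country"]
JOIN_COST, AGGREGATE_COST = 1.0, 2.0
INF = float("inf")


def heuristic(state, target):
    """Lower bound on remaining cost: aggregations still needed, plus at
    least one join if any target variable is still missing."""
    unit, variables = state
    levels_left = LEVELS.index(target[0]) - LEVELS.index(unit)
    if levels_left < 0:
        return INF  # disaggregation is not a step in this sketch
    missing = target[1] - variables
    return levels_left * AGGREGATE_COST + (JOIN_COST if missing else 0.0)


def successors(state, sources):
    """Yield (label, next_state, cost) for every step allowed from this state."""
    unit, variables = state
    level = LEVELS.index(unit)
    if level + 1 < len(LEVELS):
        coarser = LEVELS[level + 1]
        yield f"aggregate to {coarser}", (coarser, variables), AGGREGATE_COST
    for name, (s_unit, s_vars) in sources.items():
        if s_unit == unit and not s_vars <= variables:
            yield f"join {name}", (unit, variables | s_vars), JOIN_COST


def a_star(sources, target):
    """Return (total_cost, steps) of a cheapest path to the target, or None."""
    tie = count()  # tie-breaker so the heap never compares states
    frontier, best = [], {}
    for name, state in sources.items():
        h = heuristic(state, target)
        if h < INF:
            best[state] = 0.0
            heapq.heappush(frontier, (h, next(tie), 0.0, state, [f"start from {name}"]))
    while frontier:
        _, _, cost, state, steps = heapq.heappop(frontier)
        if state[0] == target[0] and target[1] <= state[1]:
            return cost, steps
        if cost > best.get(state, INF):
            continue  # a cheaper route to this state was already expanded
        for label, nxt, step_cost in successors(state, sources):
            new_cost = cost + step_cost
            h = heuristic(nxt, target)
            if h < INF and new_cost < best.get(nxt, INF):
                best[nxt] = new_cost
                heapq.heappush(
                    frontier, (new_cost + h, next(tie), new_cost, nxt, steps + [label]))
    return None


# Hypothetical mobility sources, described by metadata only.
sources = {
    "travel_survey": ("person", frozenset({"trips"})),
    "population_register": ("person", frozenset({"age_group"})),
    "road_sensors": ("municipality", frozenset({"traffic_count"})),
}
target = ("municipality", frozenset({"trips", "age_group", "traffic_count"}))

result = a_star(sources, target)
print(result)
```

In this example the cheapest path joins the two person-level sources, aggregates to municipality level, and then joins the sensor data, for a total cost of 4.0 under the assumed costs. The heuristic never overestimates the remaining cost, which is what allows A* to return a cheapest path. A realistic heuristic would also have to account for populations, periods and modelling steps.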
The framework enables researchers, particularly in the field of official statistics, to answer various questions for any set of data sources and any intended statistic, provided that sufficient metadata are available. Questions that can be answered include "Can a given set of sources be combined to produce an intended statistic?", "How can a given set of sources be combined to produce an intended statistic?", "What is the minimally required granularity level of input sources for an intended statistic?", and "Given a set of potential new sources, which source is the most valuable to acquire for a specific goal?".