Audit sampling inference for official survey statistics
Conference
64th ISI World Statistics Congress
Format: CPS Abstract
Keywords: auditing, data register, machine learning, model accuracy, Rao-Blackwellization
Session: CPS 38 - Survey statistics I
Tuesday 18 July 8:30 a.m. - 9:40 a.m. (Canada/Eastern)
Abstract
For a finite population, the prediction estimator associates a prediction with each population unit. These predictions may come from a model trained on the sample (a machine learning model or any other model) or may be given in advance (as when using a register processed from administrative sources). We consider audit sampling (or simply auditing) inference for this prediction estimator: "wherever the goal of survey sampling is to produce a point estimate of some target parameter of a given finite population, auditing aims not to estimate the target parameter itself but some chosen error measure of any given estimator of the target parameter, which may be biased due to failure of the underlying model assumptions or other favourable conditions that are necessary" (Zhang, 2021).
The inference framework is design-based: given a finite population, a random sample is drawn under a probability design, while the outcomes of interest and any other values known independently of the sampling are treated as fixed. Design-based auditing inference is valid regardless of the models or algorithms underlying the estimator being assessed.
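As an illustration of the design-based setting described above, the following is a minimal sketch for the case where predictions are given in advance. It assumes simple random sampling without replacement and two illustrative error measures (the error of the predicted total and the unit-level mean squared prediction error); the function name `audit_srs` and these choices are assumptions made for the example, not necessarily the design or error measures used by the authors.

```python
import numpy as np

def audit_srs(pred_sample, y_sample, N):
    """Audit given predictions from an SRSWOR audit sample (illustrative sketch).

    pred_sample : predictions for the audited units
    y_sample    : observed outcomes for the same units
    N           : population size
    """
    pred_sample = np.asarray(pred_sample, dtype=float)
    y_sample = np.asarray(y_sample, dtype=float)
    n = y_sample.size

    # Unit-level prediction errors; treated as fixed values in the population.
    e = pred_sample - y_sample

    # Expansion estimator of the error of the predicted total,
    # i.e. of sum(pred) - sum(y) over the whole population.
    error_total_hat = N * e.mean()

    # Standard SRSWOR variance estimator, with finite population correction.
    var_error_total_hat = N**2 * (1 - n / N) * e.var(ddof=1) / n

    # Design-unbiased estimator of the population mean squared prediction error.
    mspe_hat = np.mean(e**2)

    return {
        "error_total_hat": error_total_hat,
        "var_error_total_hat": var_error_total_hat,
        "mspe_hat": mspe_hat,
    }
```

Only the randomness of the sample selection enters these estimators, so no assumption about how the predictions were produced is needed.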
For sample-trained models, the auditing sample is also the training sample. In practice, this means that we have to train the model on a subsample and carry out the audit on the complementary subsample, in analogy with cross-validation techniques. Rao-Blackwellization is used to recover the efficiency lost to the reduced sample sizes. When predictions are given in advance, Rao-Blackwellization is also useful for auditing whenever we introduce models for the prediction errors.
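The train/audit split and its Rao-Blackwellization can be sketched as follows. This is a hedged illustration only: it assumes an equal-probability sample, uses ordinary least squares as a stand-in for the trained model, takes the mean squared prediction error as the error measure, and approximates the Rao-Blackwellized estimator (the average over all possible splits of the observed sample) by Monte Carlo over random splits. The function name `rb_audit_mspe` and its parameters are hypothetical.

```python
import numpy as np

def rb_audit_mspe(X, y, n_train, n_splits=200, seed=0):
    """Monte Carlo Rao-Blackwellized audit of a sample-trained model (sketch)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = y.size
    if not 0 < n_train < n:
        raise ValueError("n_train must be strictly between 0 and the sample size")

    # Design matrix with an intercept column (OLS is only a stand-in model).
    A = np.column_stack([np.ones(n), X])

    per_split = np.empty(n_splits)
    for b in range(n_splits):
        # Random split: training subsample and complementary audit subsample.
        perm = rng.permutation(n)
        train, audit = perm[:n_train], perm[n_train:]

        # Train on the subsample only.
        beta, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)

        # Audit on the units that were not used for training.
        resid = y[audit] - A[audit] @ beta
        per_split[b] = np.mean(resid**2)

    # Averaging over splits approximates the conditional expectation given
    # the observed sample, which cannot increase the variance of the estimate.
    return per_split.mean(), per_split
```

Note that the audited quantity is the accuracy of a model trained on `n_train` units, not on the full sample; the spread of `per_split` gives a rough idea of how much a single split would vary.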
We run several simulations with both synthetic and survey data to show that auditing inference provides useful and meaningful accuracy measures, even when the models are misspecified.