64th ISI World Statistics Congress

2023 Waksberg Award Invited Address

Organiser

Mr Jean-François Beaumont

Participants

  • Mr Jean-François Beaumont (Chair)

  • Prof. Raymond L. Chambers (Presenter/Speaker)
  • The Missing Information Principle - A Paradigm for Analysis of Messy Sample Survey Data

  • Category: International Association of Survey Statisticians (IASS)

    Abstract

    Sample surveys have been employed as tools for administration and research for over a century. In that time, they have primarily served to collect data for enumerative purposes, i.e., for describing the observable characteristics of a well-defined finite population. Estimation of these characteristics has typically been based on weighting and repeated-sampling, or design-based, inference.
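
    As a purely illustrative aside (the notation here is an addition, not part of the abstract), the prototypical design-based estimator of a finite population total weights each sampled unit by the inverse of its inclusion probability, as in the Horvitz-Thompson form

    \[
      \hat{T}_y \;=\; \sum_{i \in s} \frac{y_i}{\pi_i},
      \qquad
      T_y \;=\; \sum_{i \in U} y_i ,
    \]

    where U denotes the finite population, s the realised sample and \pi_i the inclusion probability of unit i; unbiasedness and variance are then assessed over repeated sampling from the fixed population.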

    However, from survey sampling's early days, sample data have also been used for modelling the unobservable processes that gave rise to the finite population data. This type of use has been termed analytic, and from a sample survey perspective it is typically secondary. That is, the analysis is not specifically allowed for in the sampling design, and often involves integrating the sample data with data from secondary sources. Here weighting-based methods are also in common use, mainly because of the wide availability of software for enumerative analysis. But these methods are inefficient, especially when realised sample sizes are small, and incapable of dealing with multiple secondary data sources.
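
    To make the contrast concrete (again, this sketch and its notation are additions rather than material from the abstract), the weighting-based route to analytic inference typically solves a design-weighted, or pseudo-likelihood, estimating equation of the form

    \[
      \sum_{i \in s} \frac{1}{\pi_i}\, u_i(\theta) \;=\; 0 ,
    \]

    where u_i(\theta) is the estimating function (for example, the score contribution) for unit i under the assumed population model. The inverse-probability weights protect against informative sampling but add no modelling information of their own, which is one source of the inefficiency noted above.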

    Alternative approaches to inference in these situations, drawing inspiration from mainstream statistical modelling, have been strongly promoted. The principal focus of these has been on allowing for informative sampling. Modern approaches in survey sampling, though, are more focussed on situations where the sample data are in fact part of a more complex set of data sources, all carrying relevant information about the process of interest. Though one of these sources can be information about the sampling process, it is usually not the one of primary interest, in many cases because the impact of the sampling process can be allowed for by conditioning on known covariates correlated with the sample design variables. Instead, the focus is on allowing for coverage errors, measurement errors and response errors in the statistical modelling process.

    When an efficient modelling method like maximum likelihood is preferred, the issue becomes one of how maximum likelihood can be modified to account for these complex data features. From a frequentist perspective, application of the Missing Information Principle (MIP) provides a clear way forward in this regard. In this talk I will review the development of MIP-based ideas in survey sampling over the last forty years, in large part from the perspective of work by my colleagues and myself since the mid-80s. I will endeavour to convey the generality of the underlying idea of the MIP, and to show its links to well-known optimal prediction theory as well as to finite population-based parametric inference. Finally, I will tackle a modern survey sampling scenario that can be expected to become more common given the rapid growth in auxiliary data sources for survey data analysis. This is where sampled records from one accessible source or register are linked to records from another, less accessible, source, with values of the response variable of interest drawn from this second source, and where a key output is small area estimates of the response variable for domains defined on the first source.
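
    As a rough sketch of the underlying idea (the notation is illustrative and not taken from the abstract), the MIP characterises maximum likelihood for the observed, or available, data in terms of complete-data quantities: the observed-data score is the conditional expectation of the complete-data score given what was actually observed,

    \[
      s_{\mathrm{obs}}(\theta) \;=\; E\{\, s_{\mathrm{full}}(\theta) \mid \text{observed data} \,\},
    \]

    and the information in the observed data is the complete-data information reduced by the information lost through the incomplete observation process,

    \[
      \mathcal{I}_{\mathrm{obs}}(\theta) \;=\; \mathcal{I}_{\mathrm{full}}(\theta) \;-\; \mathcal{I}_{\mathrm{miss}}(\theta).
    \]

    In the survey setting described above, the 'complete data' would include the values affected by coverage, measurement, response and linkage errors, so the conditioning set becomes the linked, observed survey data together with any auxiliary sources.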