Filling in the Blanks: Augmenting Survey Data Imputation with External Data and Rubin's SIR Algorithm
Conference
65th ISI World Statistics Congress 2025
Format: CPS Abstract - WSC 2025
Keywords: external data, income data, multiple imputation by chained equations, sampling/importance resampling algorithm
Session: CPS 28 - Nonresponse Bias and Missing Data in Surveys
Wednesday 8 October 4 p.m. - 5 p.m. (Europe/Amsterdam)
Abstract
Authors: Char Hilgers and Sabine Zinn
Multiple imputation of missing values in survey data analysis is a state-of-the-art technique. Typically, methods like multivariate imputation by chained equations (mice, van Buuren 2018) are employed, replacing missing values on a variable-by-variable basis. Generally, the information used for imputation comes from the survey dataset being analysed. Valid analysis results are achieved when the missing values are either missing completely at random (MCAR) or missing at random (MAR). However, the situation becomes more complex if the values are missing not at random (MNAR).
Two approaches are commonly used to deal with this issue. The first incorporates sensitivity analyses into the imputation, with the aim of making the imputation as robust as possible. The second enriches the data set to be imputed with further information, so that an MNAR mechanism becomes MAR and the imputation and analysis of the imputed data can be valid. The advantages of the second approach are clear, but often all variables in the data set are already included in the imputation and the suspicion of MNAR still remains.
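The idea that extra information can turn an MNAR mechanism into a MAR one can be illustrated with a small simulation. The Python sketch below is a simplified illustration, not the authors' method: the auxiliary indicator z, the income means, and the missingness rates are all hypothetical, and a stratified mean stands in for a full imputation model that conditions on z.

```python
import random

rng = random.Random(7)
n = 50_000

# Hypothetical setup: income depends on an auxiliary indicator z,
# and the chance of a missing income value depends only on z.
rows = []
for _ in range(n):
    z = rng.random() < 0.5
    income = rng.gauss(3.0 if z else 2.0, 0.5)   # arbitrary units
    p_missing = 0.6 if z else 0.1
    observed = rng.random() >= p_missing
    rows.append((z, income, observed))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Benchmark that is only available in a simulation: the mean of all incomes.
true_mean = mean(inc for _, inc, _ in rows)

# Without z, missingness is related to income itself, so the mean of the
# observed values is biased.
naive_mean = mean(inc for _, inc, obs in rows if obs)

# With z available, missingness is MAR given z: observed means within each
# stratum of z are unbiased and can be recombined using the stratum shares.
adjusted_mean = 0.0
for level in (True, False):
    stratum = [(inc, obs) for z, inc, obs in rows if z == level]
    share = len(stratum) / n
    adjusted_mean += share * mean(inc for inc, obs in stratum if obs)

print(f"true {true_mean:.3f}  naive {naive_mean:.3f}  adjusted {adjusted_mean:.3f}")
```

In this setup the naive estimate falls clearly below the benchmark, while the estimate that uses z recovers it closely. When no such variable is available inside the survey, the bias cannot be removed this way.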
We present a new method that integrates external data into the mice imputation process to reduce the risk of MNAR and better justify the assumption of a MAR mechanism.
Specifically, we integrate Rubin's SIR (Sampling/Importance Resampling) algorithm (Rubin 1987) into the mice framework to incorporate external distribution information for the variable of interest. Importance ratios, derived from the differences between the external distribution and the survey data's estimated distribution, guide the selection of replacement values for missing data. We also provide an estimate of uncertainty introduced by the method. In addition to using an external distribution, this method allows the use of imputed data for the same variable from another survey dataset, making it powerful enough to inform the imputation in one dataset through another. For this to work, besides the variable of interest, there must be a sufficient overlap of other variables measured in the same way.
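The resampling step described above can be sketched in a few lines of Python. This is a generic sampling/importance resampling illustration under stated assumptions, not the authors' implementation inside mice: both distributions are assumed to be normal on a log-income scale, the parameter values are invented, and the helper names normal_pdf and sir_draw are hypothetical.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std. dev. sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def sir_draw(candidates, survey_pdf, external_pdf, n_draws, rng):
    """Resample candidates with probability proportional to the
    importance ratio external density / survey-based density."""
    ratios = [external_pdf(c) / survey_pdf(c) for c in candidates]
    total = sum(ratios)
    weights = [r / total for r in ratios]
    # random.choices samples with replacement according to the weights.
    return rng.choices(candidates, weights=weights, k=n_draws)


rng = random.Random(2025)

# Assumed values: the survey-based model places log income too low,
# while the external distribution is centred higher.
survey_mu, external_mu, sigma = 10.0, 10.3, 0.5

# Step 1: draw many candidate values from the survey-based model.
candidates = [rng.gauss(survey_mu, sigma) for _ in range(20_000)]

# Step 2: keep a smaller set, resampled by importance ratio.
resampled = sir_draw(
    candidates,
    survey_pdf=lambda x: normal_pdf(x, survey_mu, sigma),
    external_pdf=lambda x: normal_pdf(x, external_mu, sigma),
    n_draws=2_000,
    rng=rng,
)

mean_candidates = sum(candidates) / len(candidates)
mean_resampled = sum(resampled) / len(resampled)
print(f"candidates {mean_candidates:.3f}  resampled {mean_resampled:.3f}")
```

The resampled values are pulled towards the external distribution because candidates that are more plausible under it receive larger importance ratios. The candidate pool is kept much larger than the number of draws so that the resampled set approximates the target distribution well.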
We demonstrate the effectiveness of our new approach with a simulation example involving the imputation of a typical income variable. We also apply the method to two datasets from the German Socio-Economic Panel Study (SOEP), where the multiply imputed income variable from one dataset is used to inform the imputation in the other.
References
(i) Rubin, D. B. (1987). The calculation of posterior distributions by data augmentation: Comment: A noniterative sampling/importance resampling alternative to the data augmentation algorithm for creating a few imputations when fractions of missing information are modest: The SIR algorithm. Journal of the American Statistical Association, 82(398), 543-546.
(ii) Van Buuren, S. (2018). Flexible imputation of missing data. CRC Press.