A score function to prioritize editing in household survey data: a machine learning approach
Conference
65th ISI World Statistics Congress 2025
Format: IPS Abstract - WSC 2025
Session: IPS 959 - Sharing and Accessing Granular Administrative Data
Wednesday 8 October 2 p.m. - 3:40 p.m. (Europe/Amsterdam)
Abstract
Errors in household finance survey data collection can lead to inaccurate population estimates. Manual case-by-case revision has traditionally been used to identify and edit potential errors and omissions in the data, such as omitted or misreported assets, income, and debts. Selective editing strategies aim to reduce the editing burden by prioritizing cases through a score function. However, applying traditional selective editing strategies to household finance survey data is challenging because of their underlying assumptions. Using data from the Spanish Survey of Household Finances, we develop a machine learning approach that, during the editing phase, classifies cases according to whether they are affected by severe errors and omissions. We compare the performance of several supervised classification algorithms and find that a Gradient Boosting Trees classifier outperforms the competitors. We then use the resulting score to prioritize cases and incorporate data editing effort into the choice of an optimal classification threshold.
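As a rough illustration of how editing effort could enter the choice of a classification threshold, the sketch below picks the cut-off that minimizes a simple total cost. The abstract does not specify the cost function, so the linear cost, the function name `pick_threshold`, the parameters `edit_cost` and `miss_cost`, and the toy scores and labels are all hypothetical assumptions made for illustration only; they are not the authors' method or data.

```python
def pick_threshold(scores, labels, edit_cost=1.0, miss_cost=5.0):
    """Return (threshold, total_cost) minimizing an assumed linear cost.

    scores: classifier scores, higher means more likely to contain a severe error.
    labels: 1 if the case truly has a severe error, 0 otherwise.
    edit_cost: assumed cost of manually reviewing one flagged case.
    miss_cost: assumed cost of leaving one severe error unedited.
    """
    # Candidate thresholds: every observed score, plus "flag nothing".
    candidates = sorted(set(scores)) + [float("inf")]
    best = None
    for t in candidates:
        flagged = [s >= t for s in scores]
        n_edited = sum(flagged)
        n_missed = sum(1 for f, y in zip(flagged, labels) if y == 1 and not f)
        cost = edit_cost * n_edited + miss_cost * n_missed
        if best is None or cost < best[1]:
            best = (t, cost)
    return best


# Toy data (hypothetical): six cases with scores and true error labels.
scores = [0.95, 0.80, 0.60, 0.40, 0.20, 0.05]
labels = [1, 1, 0, 1, 0, 0]

threshold, cost = pick_threshold(scores, labels)

# Cases at or above the threshold, in descending score order, form the editing queue.
queue = sorted((i for i, s in enumerate(scores) if s >= threshold),
               key=lambda i: -scores[i])
print(threshold, cost, queue)
```

Raising `miss_cost` relative to `edit_cost` pushes the threshold down and sends more cases to manual review; lowering it does the opposite. In practice the labels would come from a held-out set of previously edited cases rather than from the cases being prioritized.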