A score function to prioritize editing in household survey data: a machine learning approach

Author

Sandra Garcia-Uribe

Co-author

Nicolas Forteza

Conference

65th ISI World Statistics Congress

Format: IPS Abstract - WSC 2025

Session: IPS 959 - Sharing and Accessing Granular Administrative Data

Wednesday 8 October 2 p.m. - 3:40 p.m. (Europe/Amsterdam)

Abstract

Errors in household finance survey data collection can lead to inaccurate population estimates. Manual case-by-case revision has traditionally been used to identify and edit potential errors and omissions in the data, such as omitted or misreported assets, income, and debts. Selective editing strategies aim to reduce the editing burden by prioritizing cases through a scoring function. However, applying traditional selective editing strategies to household finance survey data is challenging because of the assumptions those strategies rely on. Using data from the Spanish Survey of Household Finances, we develop a machine learning approach that, during the editing phase, classifies cases according to whether they are affected by severe errors and omissions. We compare the performance of several supervised classification algorithms and find that a Gradient Boosting Trees classifier outperforms the competitors. We then use the resulting score to prioritize cases and factor data editing effort into the choice of an optimal classification threshold.
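The last step, ranking cases by score and letting editing effort inform the threshold, can be illustrated with a minimal sketch. It is not the authors' method: the linear cost function, the `review_cost` and `miss_cost` parameters, and the toy scores and labels below are all assumptions made for illustration. The scores stand in for the output of any fitted classifier.

```python
from typing import List, Sequence, Tuple


def prioritize(scores: Sequence[float]) -> List[int]:
    """Return case indices ordered from highest to lowest error score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def total_cost(
    scores: Sequence[float],
    labels: Sequence[int],
    threshold: float,
    review_cost: float,
    miss_cost: float,
) -> float:
    """Hypothetical cost: effort spent on flagged cases plus a penalty
    for each severely erroneous case (label 1) left unreviewed."""
    flagged = sum(1 for s in scores if s >= threshold)
    missed = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    return review_cost * flagged + miss_cost * missed


def best_threshold(
    scores: Sequence[float],
    labels: Sequence[int],
    review_cost: float = 1.0,
    miss_cost: float = 5.0,
) -> Tuple[float, float]:
    """Scan every observed score (plus 'review nothing') and return the
    (threshold, cost) pair with the lowest assumed cost."""
    candidates = sorted(set(scores)) + [float("inf")]
    costs = [
        (total_cost(scores, labels, t, review_cost, miss_cost), t)
        for t in candidates
    ]
    cost, threshold = min(costs)
    return threshold, cost


# Toy validation data (made up): classifier scores and true error flags.
toy_scores = [0.2, 0.9, 0.4, 0.1, 0.8, 0.6]
toy_labels = [0, 1, 1, 0, 1, 0]

review_order = prioritize(toy_scores)
threshold, cost = best_threshold(toy_scores, toy_labels)
print(review_order, threshold, cost)
```

Under these assumed costs, raising `miss_cost` relative to `review_cost` pushes the chosen threshold down, so more cases are sent to manual review. Lowering it has the opposite effect.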