65th ISI World Statistics Congress 2025

Automated Coding Approaches Using Machine Learning in the U.S. Consumer Expenditure Diary Survey

Conference

65th ISI World Statistics Congress 2025

Format: IPS Abstract - WSC 2025

Keywords: household expenditures, machine learning

Abstract

The U.S. Bureau of Labor Statistics (BLS) publishes data on the expenditures, income, and demographic characteristics of consumers in the United States through the Consumer Expenditure (CE) Surveys. The U.S. Census Bureau collects the data on behalf of BLS through two surveys: the Interview Survey, which captures major or recurring items via a computer-assisted personal interview (CAPI) instrument, and the Diary Survey, which captures minor or frequently purchased items via a self-administered respondent diary. Respondents describe Diary Survey expenditures in free-form text.

The CE Diary Survey collects approximately 30,000 expenditure records each month. Historically, labeling the expenditure entries and grouping them into the appropriate expenditure categories, known as item codes, has involved two agencies, two item code structures, and multiple layers of manual review and decision-making. The CE Diary Autocoder, launched in 2024, applies a natural language processing (NLP) approach to classify diary entries into item codes, significantly reducing processing time and costs.

This presentation will use illustrative examples to show how free-form text in a respondent diary maps to item codes, describe the methods that make up the CE Diary Autocoder, explain how human intervention factors into the automated process, and summarize high-level results.
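
As a rough illustration of the general idea (not the CE Diary Autocoder itself, whose methods this abstract does not specify), the sketch below classifies free-form diary text into item codes and sends low-confidence predictions to manual review. Everything in it is an assumption made for illustration: the naive Bayes model is a simple stand-in for whatever NLP approach is actually used, the training entries and item codes (`DAIRY`, `FUEL`, `APPAREL`) are invented, and the 0.6 confidence threshold is arbitrary.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical training data: (diary text, item code).
# The codes are invented labels, not actual CE item codes.
TRAINING = [
    ("whole milk gallon", "DAIRY"),
    ("cheddar cheese", "DAIRY"),
    ("yogurt", "DAIRY"),
    ("unleaded gasoline", "FUEL"),
    ("gas fill up", "FUEL"),
    ("diesel fuel", "FUEL"),
    ("mens t-shirt", "APPAREL"),
    ("blue jeans", "APPAREL"),
    ("socks", "APPAREL"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesCoder:
    """Multinomial naive Bayes with add-one smoothing (illustrative stand-in)."""

    def __init__(self, examples):
        self.word_counts = defaultdict(Counter)  # item code -> token counts
        self.class_counts = Counter()            # item code -> number of entries
        self.vocab = set()
        for text, code in examples:
            self.class_counts[code] += 1
            for tok in tokenize(text):
                self.word_counts[code][tok] += 1
                self.vocab.add(tok)

    def predict_proba(self, text):
        """Return a dict mapping each item code to its posterior probability."""
        tokens = tokenize(text)
        total = sum(self.class_counts.values())
        log_scores = {}
        for code, n in self.class_counts.items():
            denom = sum(self.word_counts[code].values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in tokens:
                score += math.log((self.word_counts[code][tok] + 1) / denom)
            log_scores[code] = score
        # Normalize in log space to avoid underflow.
        top = max(log_scores.values())
        exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
        z = sum(exp_scores.values())
        return {c: v / z for c, v in exp_scores.items()}


def route(coder, text, threshold=0.6):
    """Auto-assign the top item code if confident, else flag for manual review."""
    probs = coder.predict_proba(text)
    code, p = max(probs.items(), key=lambda kv: kv[1])
    status = "auto" if p >= threshold else "manual_review"
    return status, code, p


coder = NaiveBayesCoder(TRAINING)
for entry in ["gallon of milk", "xyz"]:
    status, code, p = route(coder, entry)
    print(f"{entry!r}: {status} (best guess {code}, p={p:.2f})")
```

In this sketch, an entry whose words overlap with the training examples is coded automatically, while an entry with no recognizable words falls below the threshold and is routed to a human reviewer. Where that threshold sits determines how the workload splits between automated and manual coding.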