65th ISI World Statistics Congress 2025

Quality Assurance of the Re-coding to NACE Rev. 2.1, Combining Model and Manual Coding

Format: IPS Abstract - WSC 2025

Keywords: coding, llm, quality assurance

Session: IPS 799 - Real-World Machine Learning Applications in Official Statistics

Thursday 9 October 10:50 a.m. - 12:30 p.m. (Europe/Amsterdam)

Abstract

NACE Rev. 2 is the current statistical classification of economic activities in the European Union and ensures a common standard for European statistics. Implementing NACE Revision 2.1 is demanding for European countries. A major part of the transition is the re-coding of units in the business registers. Previously, re-coding has mostly been done through surveys and manual coding, which often results in high costs and increased response burden.

Quality demands on NACE coding are especially high in Sweden because the business register is used for both statistical and administrative purposes. Hence, the re-coding must be of high quality both in terms of distributions and at the unit level. In previous quality assurance processes, several human coders repeated the re-coding. Because of budget restrictions, it may not be feasible to repeat this process over the entire nomenclature for the current revision.

Because the performance of large language models (LLMs) has improved, several countries are investigating the possibility of using LLMs to reduce manual coding. However, the model approach does more than reduce the use of manual resources. It may also be used to develop effective quality assurance, since a model can process the entire population instead of a sample.

However, model coding cannot always rely on an LLM alone, because high-quality textual data may be missing. Model coding may therefore span from simple rule-based methods to pre-trained LLMs and classical supervised learning. Using several methods may allow a more accurate evaluation of quality than a single model does.
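One way to picture how several coding methods could support quality evaluation is to compare their proposals unit by unit and treat disagreement as a signal for manual review. The following is a minimal sketch under that assumption, not the procedure used at Statistics Sweden; the function name `combine_codes`, the method labels, the `min_agreement` threshold, and the placeholder codes `X1` and `X2` are all hypothetical.

```python
from collections import Counter


def combine_codes(proposals, min_agreement=1.0):
    """Combine code proposals from several methods for one unit.

    proposals: dict mapping a method name (hypothetical labels such as
    "rules", "supervised", "llm") to the code it proposes, or None when
    the method could not code the unit (for example, missing text).
    """
    # Keep only the methods that actually produced a code.
    votes = [code for code in proposals.values() if code is not None]
    if not votes:
        # No method could code the unit: it has to be coded manually.
        return {"code": None, "agreement": 0.0, "needs_manual": True}

    # Most frequent proposal and the share of methods that support it.
    code, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)

    # Assumption: units below the agreement threshold go to manual coding.
    return {
        "code": code,
        "agreement": agreement,
        "needs_manual": agreement < min_agreement,
    }


# Placeholder codes, not real NACE codes.
print(combine_codes({"rules": "X1", "supervised": "X1", "llm": "X1"}))
print(combine_codes({"rules": None, "supervised": "X1", "llm": "X2"}))
```

With the default threshold, only units on which all available methods agree are accepted without review; lowering `min_agreement` trades manual effort against unit-level risk.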

Lastly, we present a quality assurance process that focuses on combining manual labour with model coding. The quality assurance process includes: 1. Model inference; 2. Design inference with auxiliary information; 3. Manual coding supported by models; 4. Re-use of manually coded data. Further, we discuss how the quality assurance process interacts with the re-coding methods and the quality demands, and how the theoretical process was carried out in practice at Statistics Sweden during the first half of 2025.
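The abstract does not specify which estimator is used for design inference with auxiliary information. As one illustration of the general idea, the sketch below estimates the overall accuracy of model coding from a manually coded sample that is stratified by an auxiliary variable known for every unit, for example whether the models agreed. The stratified estimator, the function name `stratified_accuracy`, and the field names `N`, `n` and `correct` are assumptions made for this example.

```python
def stratified_accuracy(strata):
    """Estimate the share of correctly coded units in the population.

    strata: list of dicts, one per stratum, with
      N        number of units in the stratum (known for the whole population)
      n        number of units sampled for manual coding
      correct  number of sampled units where model code == manual code
    Assumes simple random sampling within each stratum.
    """
    total = sum(s["N"] for s in strata)
    estimate = 0.0
    for s in strata:
        if s["n"] <= 0:
            raise ValueError("every stratum needs at least one sampled unit")
        # Weight the within-stratum sample accuracy by the stratum's population share.
        estimate += (s["N"] / total) * (s["correct"] / s["n"])
    return estimate


# Made-up counts, for illustration only.
example = [
    {"N": 900, "n": 50, "correct": 48},  # units where the models agreed
    {"N": 100, "n": 50, "correct": 30},  # units where the models disagreed
]
print(round(stratified_accuracy(example), 3))
```

Because the auxiliary variable is available for the entire population, manual effort can be concentrated in the strata where errors are expected while the estimate still refers to all units.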