65th ISI World Statistics Congress 2025 | The Hague

Real-World Machine Learning Applications in Official Statistics

Organiser

Yingfu Xie

Participants

  • Dr Barteld Braaksma (Chair)

  • Prof. Dr. Piet Daas (Presenter/Speaker)
    Identifying online platforms: Model development, validation and type-I error reduction

  • Jenny Pocknee (Presenter/Speaker)
    Anomaly detection and estimating edit values for business administrative data

  • Dr David Salgado (Presenter/Speaker)
    Early estimates for short-term business statistics with ML models for statistical units prediction

  • Malte Schierholz (Presenter/Speaker)
    Extraction of CO2 emissions from corporate sustainability reports

  • Mr Petter Ehn Wingårdh (Presenter/Speaker)
    Quality assurance of the re-coding to NACE Rev. 2.1, combining model and manual coding

  • Mr Vincent Charles Galvin (Discussant)

  • Jens Malmros (Discussant)

  • Category: International Statistical Institute

    Proposal Description

    National statistical institutes and other public-sector bodies face challenges such as reduced budgets, ever-larger data volumes, and demand for more rapid statistics, and they struggle to find ways to tackle them. At the same time, modern artificial intelligence (AI) technology, including large language models (LLMs) and machine learning (ML), is transforming society in many ways. Statistical institutes and other public-sector bodies increasingly recognize the potential of AI and ML approaches to address these challenges.

    This session presents five real-world ML applications from various statistical agencies and central banks, showing how these approaches are delivering benefits. The topics are: identifying online platforms based on their website text; extracting CO2 emissions from corporate reports; using ML to predict missing values and nowcast business statistics; combining LLMs and manual coding when updating businesses' NACE codes to NACE Rev. 2.1; and using unsupervised ML to detect anomalies in large business datasets. The presentations cover both supervised and unsupervised ML, and both newer methods (LLMs and natural language processing, NLP) and classic ML classification and regression. They highlight not only the potential of ML to streamline statistical processes, reduce costs, and improve the quality and timeliness of official statistics, but also the challenges and limitations encountered during this transition.

    Our session brings together researchers from two continents and six countries, with a balanced mix of genders and seniority levels.

    Below is a brief introduction to the presentations included in this session:

    Statistics Netherlands: This presentation tackles the challenge of identifying online platforms within business registers by leveraging a text classification model trained on website content. The model was further refined to minimize false positives while maintaining accuracy.
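
    As a rough illustration of trading false positives against coverage, the sketch below picks the lowest score threshold whose false positive rate (type-I error) on a validation set stays under a target. It is not the presenters' method: the function name, the scores, and the labels are all hypothetical, and the classifier is assumed to output a score between 0 and 1 per website.

```python
import math

def pick_threshold(scores, labels, max_false_positive_rate):
    """Return the lowest threshold whose false positive rate is within the target.

    scores: classifier scores on a validation set (higher = more platform-like).
    labels: 1 for a true online platform, 0 otherwise.
    A unit is classified as a platform when its score >= threshold.
    """
    negatives = sum(1 for y in labels if y == 0)
    # math.inf is always feasible: it classifies nothing as a platform.
    for threshold in sorted(set(scores)) + [math.inf]:
        false_positives = sum(
            1 for s, y in zip(scores, labels) if y == 0 and s >= threshold
        )
        if false_positives / negatives <= max_false_positive_rate:
            return threshold

# Hypothetical validation data, invented for illustration only.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]

lenient = pick_threshold(scores, labels, max_false_positive_rate=0.2)
strict = pick_threshold(scores, labels, max_false_positive_rate=0.0)
print(lenient, strict)
```

    A stricter target pushes the threshold up, so fewer non-platforms are flagged at the cost of missing some true platforms.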

    LMU, München: This project investigates the use of machine learning (or possibly large language models, LLMs) to automate the extraction of CO2 emission data from corporate sustainability reports. This approach has the potential to significantly improve data collection efficiency in environmental statistics.

    Statistics Spain: To address the timeliness limitations of traditional survey-based statistics, this presentation explores the use of ML models for early estimation. The proposed method involves predicting missing data points in surveys, allowing for more timely business statistics.

    Statistics Sweden: This work tackles the challenge of quality assurance in business register re-coding when upgrading to NACE 2.1. It proposes an approach that combines machine learning models, likely large language models (LLMs) due to the emphasis on text processing, with manual coding. This method offers the potential to improve efficiency and accuracy while optimizing resource allocation.
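
    One common way to combine model and manual coding is to accept the model's code only when its confidence is high and send the rest to human coders. The sketch below assumes that setup; the abstract does not say this is the rule used. The class, the function, the 0.9 cut-off, and the code labels are hypothetical placeholders, not real NACE Rev. 2.1 codes.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    unit_id: str
    code: str          # placeholder label, not a real NACE Rev. 2.1 code
    confidence: float  # model's probability for the proposed code

def route(predictions, accept_at=0.9):
    """Split predictions into auto-accepted codes and a manual-coding queue."""
    auto, manual = [], []
    for p in predictions:
        (auto if p.confidence >= accept_at else manual).append(p)
    return auto, manual

# Hypothetical model output, invented for illustration only.
predictions = [
    Prediction("unit-1", "code_A", 0.97),
    Prediction("unit-2", "code_B", 0.55),
    Prediction("unit-3", "code_A", 0.91),
    Prediction("unit-4", "code_C", 0.72),
]

auto, manual = route(predictions)
print([p.unit_id for p in auto], [p.unit_id for p in manual])
```

    Under this assumption, quality assurance could then focus manual effort on the queue, plus a random sample of auto-accepted units to check the model's error rate.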

    Australian Bureau of Statistics: This presentation explores the use of unsupervised machine learning methods, such as Local Outlier Factor (LOF) and Isolation Forest (IsoF), to detect anomalies and estimate edits in large business datasets. This approach can significantly improve the efficiency of data validation processes by prioritizing human intervention for the most suspicious data points.
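
    To make the idea of prioritizing suspicious records concrete, here is a minimal pure-Python sketch of the Local Outlier Factor score. It illustrates the general technique only, not the presenters' pipeline. The records are invented, the two features are assumed to be on a comparable scale already, and ties and duplicate points are not handled.

```python
import math

def local_outlier_factor(points, k=3):
    """Return one LOF score per point; values well above 1 suggest an outlier."""
    n = len(points)
    # k nearest neighbours of each point as (distance, index) pairs.
    neighbours = []
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbours.append(dists[:k])
    # Distance from each point to its k-th nearest neighbour.
    k_distance = [nb[-1][0] for nb in neighbours]
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for nb in neighbours:
        reach = [max(k_distance[j], d) for d, j in nb]
        lrd.append(len(reach) / sum(reach))
    # LOF: mean density of the neighbours relative to the point's own density.
    return [
        sum(lrd[j] for _, j in nb) / (len(nb) * lrd[i])
        for i, nb in enumerate(neighbours)
    ]

# Hypothetical scaled business records (e.g. turnover, wages); the last is odd.
records = [
    (1.0, 1.1), (1.2, 0.9), (0.9, 1.0), (1.1, 1.2),
    (1.0, 0.8), (0.8, 1.1), (1.3, 1.0), (5.0, 5.2),
]

scores = local_outlier_factor(records, k=3)
# Review the most suspicious records first.
review_order = sorted(range(len(records)), key=lambda i: scores[i], reverse=True)
print(review_order)
```

    Sorting by score gives analysts a review queue, so manual checking starts with the records that differ most from their neighbours. For production-scale data, a library implementation such as scikit-learn's `LocalOutlierFactor` or `IsolationForest` would be the more usual choice.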