65th ISI World Statistics Congress 2025

Extraction of CO2 emissions from corporate sustainability reports

Author

Malte Schierholz

Conference

65th ISI World Statistics Congress 2025

Format: IPS Abstract - WSC 2025

Keywords: information-extraction, llms, sustainability-reports

Session: IPS 799 - Real-World Machine Learning Applications in Official Statistics

Thursday 9 October 10:50 a.m. - 12:30 p.m. (Europe/Amsterdam)

Abstract

Financial regulators and central banks are increasingly integrating sustainability aspects into their operations, but significant data gaps remain. The Corporate Sustainability Reporting Directive (CSRD) requires all large European enterprises to publish their greenhouse gas emissions (in CO2 equivalents) annually in their management report, annual report, or sustainability report. The amount of information available is immense: a value and a unit for each scope, namely direct emissions (Scope 1), indirect energy-related emissions (Scope 2), and other indirect emissions (Scope 3). However, the data are spread over thousands of PDF documents published on company websites, historically often without adhering to official standards or guidelines. Until now, private companies have extracted carbon emissions and other indicators from these PDF documents and sold them in a structured, tabular format to the Bundesbank and other public authorities. Yet although extracting values from PDF documents appears to pose little difficulty, agreement between the values extracted by different companies is rather low. Given this unsatisfactory situation, we leverage Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) to build several fully automated data extraction pipelines, which we compare with data bought from private providers and evaluate against a specially curated gold-standard dataset of our own. We share open-source software with the community, enabling everyone to extract CO2-related indicators from company sustainability reports.
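
To make the evaluation step concrete, the following is a minimal sketch of how extracted values might be scored against a gold standard. It is not the authors' pipeline: the `Emission` record, the unit table, the `(company, year, scope)` key, and the 1% tolerance are all assumptions chosen for illustration. The point it shows is that value and unit have to be compared together, since the same quantity can be reported as 1,200 t or 1.2 kt.

```python
import math
from dataclasses import dataclass

# Hypothetical unit table: factor converting a reported unit to tonnes CO2e.
UNIT_TO_TONNES = {"kg": 0.001, "t": 1.0, "kt": 1_000.0, "Mt": 1_000_000.0}


@dataclass(frozen=True)
class Emission:
    """One extracted or gold-standard figure (illustrative structure)."""
    scope: int    # 1, 2, or 3
    value: float  # number as printed in the report
    unit: str     # must be a key of UNIT_TO_TONNES

    def tonnes(self) -> float:
        # Normalise to tonnes CO2e so different units become comparable.
        return self.value * UNIT_TO_TONNES[self.unit]


def matches(extracted, gold, rel_tol=0.01):
    """True if the extracted figure equals the gold figure within rel_tol."""
    if extracted is None:  # nothing was extracted for this entry
        return False
    return extracted.scope == gold.scope and math.isclose(
        extracted.tonnes(), gold.tonnes(), rel_tol=rel_tol
    )


def accuracy(extracted, gold):
    """Share of gold entries matched; dict keys are (company, year, scope)."""
    if not gold:
        raise ValueError("gold standard is empty")
    hits = sum(matches(extracted.get(key), g) for key, g in gold.items())
    return hits / len(gold)


if __name__ == "__main__":
    gold = {
        ("A", 2023, 1): Emission(1, 1200.0, "t"),
        ("A", 2023, 2): Emission(2, 3.5, "kt"),
    }
    extracted = {
        ("A", 2023, 1): Emission(1, 1.2, "kt"),     # same quantity, other unit
        ("A", 2023, 2): Emission(2, 3500.0, "kg"),  # right digits, wrong unit
    }
    print(accuracy(extracted, gold))
```

In this sketch a figure with the right digits but the wrong unit counts as a miss, which is one plausible way low agreement between providers could arise; the same function could be applied to a provider's data in place of a pipeline's output.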