Extraction of CO2 emissions from corporate sustainability reports
Conference
65th ISI World Statistics Congress 2025
Format: IPS Abstract - WSC 2025
Keywords: information-extraction, llms, sustainability-reports
Session: IPS 799 - Real-World Machine Learning Applications in Official Statistics
Thursday 9 October 10:50 a.m. - 12:30 p.m. (Europe/Amsterdam)
Abstract
Financial regulators and central banks are increasingly integrating sustainability aspects into their operations, but significant data gaps remain. The Corporate Sustainability Reporting Directive (CSRD) requires all large European enterprises to publish their greenhouse gas emissions (in CO2-equivalents) annually in their management report, annual report, or sustainability report. The amount of information available is immense: a value and a unit for each scope, namely direct emissions (Scope 1), indirect energy-related emissions (Scope 2), and other indirect emissions (Scope 3). However, the data are spread over thousands of PDF documents published on company websites, and historically these documents have often not adhered to official standards or guidelines. Until now, private companies have extracted carbon emissions and other indicators from these PDF documents and sold them in a structured, tabular format to the Bundesbank and other public authorities. Although extracting values from PDF documents appears to pose little difficulty, the agreement between values extracted by different companies is rather low. Given this unsatisfactory situation, we leverage Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) to build several fully automated data extraction pipelines. We compare their output with data bought from private providers and evaluate both against a specially curated gold standard dataset of our own. We share open-source software with the community, enabling anyone to extract CO2-related indicators from company sustainability reports.
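As an illustration of the evaluation step, the following minimal Python sketch shows one way extracted values could be scored against a gold standard after converting units to tonnes of CO2-equivalents. It is not the software described in this abstract. The record layout, the `TO_TONNES` conversion table, the `agreement` function, and the 1% tolerance are all assumptions made for this example.

```python
from dataclasses import dataclass
import math

# Assumed conversion factors to tonnes of CO2-equivalents (illustrative only).
TO_TONNES = {"kg": 1e-3, "t": 1.0, "kt": 1e3, "Mt": 1e6}


@dataclass(frozen=True)
class EmissionRecord:
    """Hypothetical record: one value and unit per company, year, and scope."""
    company: str
    year: int
    scope: int    # 1, 2, or 3
    value: float
    unit: str     # must be a key of TO_TONNES

    def tonnes(self) -> float:
        # Normalise the reported value to tonnes so records are comparable.
        return self.value * TO_TONNES[self.unit]


def agreement(extracted, gold, rel_tol=0.01):
    """Share of gold records matched by an extracted value within rel_tol."""
    if not gold:
        return 0.0
    # Index extracted records by (company, year, scope) for direct lookup.
    by_key = {(r.company, r.year, r.scope): r for r in extracted}
    hits = 0
    for g in gold:
        e = by_key.get((g.company, g.year, g.scope))
        if e is not None and math.isclose(e.tonnes(), g.tonnes(), rel_tol=rel_tol):
            hits += 1
    return hits / len(gold)


# Invented example data: the Scope 2 record carries a wrong unit.
gold = [
    EmissionRecord("A", 2023, 1, 1200.0, "t"),
    EmissionRecord("A", 2023, 2, 3.4, "kt"),
]
extracted = [
    EmissionRecord("A", 2023, 1, 1.2, "kt"),     # same quantity, different unit
    EmissionRecord("A", 2023, 2, 3400.0, "kg"),  # off by a factor of 1000
]
score = agreement(extracted, gold)
```

Comparing after unit normalisation matters in this sketch because a correctly read number paired with a wrong unit would otherwise be counted as a match.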