65th ISI World Statistics Congress 2025

Framework for Database Classification

Format: CPS Abstract - WSC 2025

Keywords: big data, natural language processing, price index

Session: CPS 74 - Statistical Modelling of Price Indices and Food Baskets

Wednesday 8 October 4 p.m. - 5 p.m. (Europe/Amsterdam)

Abstract

FGV IBRE, the Brazilian Institute of Economics of Fundação Getulio Vargas, is a center of excellence for research, analysis and production of economic statistics. With a physical presence in 15 Brazilian capitals, it collects more than 300,000 prices of products, services and other primary data every month. For 70 years, its mission has been to contribute to the economic and social development of Brazil. The statistics produced by FGV IBRE include price indices, surveys and reference prices.
Data classification is a crucial step in using large unstructured databases. Since we began using web scraping and scanner data techniques, one of our main challenges has been to develop a framework capable of classifying these data continuously.
To integrate databases from different sources with the items in the CPI (Consumer Price Index) consumption basket, it is essential to develop a robust framework for classifying these databases on a large scale. Developing this framework involves two main challenges.
The first challenge is generating a well-calibrated training database to support the classification process. To do this, we use the inputs that already make up the CPI, so that the model's association of terms from the new databases maintains the existing quality standard. Where inputs have low representation, new categories will need to be created to improve the model's performance.
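As an illustration of how low-representation inputs might be detected when building the training database, the following is a minimal sketch only. The example descriptions, the category names and the `MIN_EXAMPLES` threshold are all hypothetical and are not taken from the actual CPI data.

```python
from collections import Counter

# Assumed threshold: categories with fewer labelled examples are flagged.
MIN_EXAMPLES = 3

# Hypothetical labelled inputs: (product description, CPI basket item).
cpi_examples = [
    ("white rice 5 kg", "rice"),
    ("long grain rice 1 kg", "rice"),
    ("parboiled rice 2 kg", "rice"),
    ("whole milk 1 l", "milk"),
    ("skimmed milk 1 l", "milk"),
    ("uht milk 1 l", "milk"),
    ("ground coffee 500 g", "coffee"),
]


def low_representation(examples, min_examples=MIN_EXAMPLES):
    """Return the categories that have fewer than `min_examples` examples."""
    counts = Counter(label for _, label in examples)
    return {label: n for label, n in counts.items() if n < min_examples}


flagged = low_representation(cpi_examples)
print(flagged)
```

Categories flagged this way would be candidates for review before training, for instance by collecting more examples or revising the category scheme.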
The second challenge is implementing a natural language processing (NLP) model capable of classifying these data. At this stage, we use two classes of models: traditional and more recent. The traditional models include AdaBoost, Random Forest, XGBoost and support vector machines (SVM). Among the more recent models, we use Transformer-based models.
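To make the traditional model class concrete, the sketch below trains two of the named model families on product descriptions using scikit-learn. It is an assumption-laden illustration, not the project's implementation: the TF-IDF character n-gram features, the hyperparameters and the toy data are all hypothetical choices.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical training descriptions and their CPI basket items.
train_texts = [
    "white rice 5 kg bag",
    "long grain rice 1 kg",
    "whole milk 1 l carton",
    "skimmed milk 1 l",
    "ground coffee 500 g",
    "roasted coffee beans 250 g",
]
train_labels = ["rice", "rice", "milk", "milk", "coffee", "coffee"]


def build_models():
    """Build one text-classification pipeline per model family."""
    def features():
        # Character n-grams (an assumed choice) tolerate abbreviations and typos.
        return TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))

    return {
        "svm": make_pipeline(features(), LinearSVC()),
        "random_forest": make_pipeline(
            features(), RandomForestClassifier(n_estimators=100, random_state=0)
        ),
    }


models = build_models()
new_texts = ["white rice 2 kg", "whole milk 2 l"]
predictions = {}
for name, model in models.items():
    model.fit(train_texts, train_labels)
    predictions[name] = model.predict(new_texts).tolist()

print(predictions)
```

Keeping every model behind the same pipeline interface makes it straightforward to compare candidates on the same held-out data.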
The tool under development carries out the classification process in five steps:
1. Data entry;
2. Data classification;
3. Separation of well-classified data for use;
4. User validation;
5. Data feedback.
Steps 4 and 5 are essential for maintaining the classification process in the long term. In step 4, the user can correct wrong classifications, and in step 5, these corrections are used as input for new training, allowing the model to learn continuously.
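The five steps can be sketched as a single feedback cycle. The code below is a minimal illustration under stated assumptions: `TokenVoteClassifier` is a toy stand-in for the real NLP model, the `THRESHOLD` value is hypothetical, and only low-confidence items are sent to the user, which is one possible reading of step 4.

```python
from collections import Counter, defaultdict

THRESHOLD = 0.6  # assumed confidence cut-off for "well-classified" data


class TokenVoteClassifier:
    """Toy stand-in model: each known token votes for the labels it was seen with."""

    def fit(self, examples):
        self.token_labels = defaultdict(Counter)
        for text, label in examples:
            for token in text.lower().split():
                self.token_labels[token][label] += 1
        return self

    def predict(self, text):
        tokens = text.lower().split()
        votes = Counter()
        for token in tokens:
            # One vote per label this token has been seen with.
            votes.update(set(self.token_labels.get(token, ())))
        if not votes:
            return None, 0.0
        label, count = votes.most_common(1)[0]
        # Confidence: share of the description's tokens backing the winner.
        return label, count / len(tokens)


def run_cycle(model, training, incoming, ask_user):
    accepted, corrections = [], []
    for text in incoming:                        # step 1: data entry
        label, confidence = model.predict(text)  # step 2: classification
        if confidence >= THRESHOLD:              # step 3: keep well-classified data
            accepted.append((text, label))
        else:                                    # step 4: user validation
            corrections.append((text, ask_user(text, label)))
    updated = training + corrections             # step 5: feedback
    return accepted, updated, TokenVoteClassifier().fit(updated)


training = [
    ("white rice 5kg", "rice"),
    ("brown rice 1kg", "rice"),
    ("whole milk 1l", "milk"),
    ("skimmed milk 1l", "milk"),
]
model = TokenVoteClassifier().fit(training)

# Hypothetical reviewer who assigns the correct category to doubtful items.
accepted, new_training, new_model = run_cycle(
    model, training, ["white rice 2kg", "oat drink 1l"],
    ask_user=lambda text, suggested: "oat drink",
)
print(accepted)
print(new_model.predict("oat drink 1l"))
```

In this toy run the oat drink description is first routed to the user because only one of its three tokens is known; after the correction is fed back, the retrained model classifies it directly.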
Implementing this project is vital for allowing databases from different sources to be used in calculating price indices, increasing the precision and scope of the economic statistics produced by FGV IBRE.