E-shops transaction data for price statistics
Conference
65th ISI World Statistics Congress 2025
Format: CPS Abstract - WSC 2025
Session: CPS 44 - Data Collection and Analytical Techniques for Price Indexing
Tuesday 7 October 4 p.m. - 5 p.m. (Europe/Amsterdam)
Abstract
In the past decade, web scraping has become a popular tool used by National Statistical Institutes to complement official statistics. In particular, it offers an efficient way of collecting prices of products sold online, which need to be incorporated into the compilation of consumer price indices (CPIs) in order to capture the economic activity of online consumer purchases. However, web-scraped prices are offer prices and, unlike scanner data collected from physical retailers, do not capture realized purchases, i.e. information on sales and quantities sold for individual products. Realized online purchases can only be obtained directly from e-shops. To better reflect consumer behavior in online sales, the Statistical Office of the Slovak Republic (SO SR) negotiated the collection of e-shop transaction data directly from the owner of the website platform on which the majority of e-shops on the Slovak market advertise and sell their products. The transactions of daily online sales for individual products are collected automatically through a web-based tool that transfers the data to the SO SR server, so the daily data include information on realized sales and quantities sold for individual products. At the outset of the paper, we describe the automated process of data collection, its advantages compared to web scraping, data quality assurance, data completeness and integrity, and the application of data filters prior to the CPI compilation. The main objective of the paper is to show the compilation of price indices using e-shop transaction data. Selecting the most appropriate price index formula is not straightforward, particularly for products with a high churn and price series with high inherent fluctuation. In the empirical analysis, we evaluate both weighted and unweighted price indices to examine the contribution of sales volumes to the fluctuation of the index value.
Moreover, recent research shows that multilateral methods for compiling price indices are more suitable when the product sample has a high churn over time. Hence, we also estimate price indices using the GEKS method and the regression-based time-product dummy (TPD) model.
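To illustrate the difference between the index types mentioned above, the following is a minimal sketch and not the implementation used in the paper. It assumes an unweighted Jevons index and a weighted Törnqvist index as bilateral formulae (the abstract does not name the specific formulae evaluated) and builds a GEKS index from them. The data, the names `PRICES` and `QUANTITIES`, and all function names are hypothetical; product "C" disappears and product "D" appears to mimic product churn.

```python
import math

# Hypothetical toy data (not from the paper): price and quantity sold
# per product and period. "C" drops out in period 2, "D" enters in period 1.
PRICES = {
    0: {"A": 10.0, "B": 20.0, "C": 5.0},
    1: {"A": 11.0, "B": 19.0, "C": 5.5, "D": 8.0},
    2: {"A": 12.0, "B": 21.0, "D": 7.5},
}
QUANTITIES = {
    0: {"A": 100, "B": 50, "C": 10},
    1: {"A": 90, "B": 60, "C": 5, "D": 20},
    2: {"A": 80, "B": 55, "D": 40},
}


def matched_items(s, t):
    """Products sold in both period s and period t."""
    return sorted(set(PRICES[s]) & set(PRICES[t]))


def jevons(s, t):
    """Unweighted index: geometric mean of matched price relatives."""
    items = matched_items(s, t)
    logs = [math.log(PRICES[t][i] / PRICES[s][i]) for i in items]
    return math.exp(sum(logs) / len(logs))


def tornqvist(s, t):
    """Weighted index: price relatives weighted by average expenditure shares."""
    items = matched_items(s, t)
    exp_s = {i: PRICES[s][i] * QUANTITIES[s][i] for i in items}
    exp_t = {i: PRICES[t][i] * QUANTITIES[t][i] for i in items}
    total_s, total_t = sum(exp_s.values()), sum(exp_t.values())
    log_index = sum(
        0.5 * (exp_s[i] / total_s + exp_t[i] / total_t)
        * math.log(PRICES[t][i] / PRICES[s][i])
        for i in items
    )
    return math.exp(log_index)


def geks(bilateral, base, t):
    """GEKS index: geometric mean, over all link periods l in the window,
    of the chained bilateral comparisons base -> l -> t."""
    periods = sorted(PRICES)
    logs = [math.log(bilateral(base, l) * bilateral(l, t)) for l in periods]
    return math.exp(sum(logs) / len(logs))


if __name__ == "__main__":
    for t in sorted(PRICES):
        print(t, jevons(0, t), tornqvist(0, t), geks(tornqvist, 0, t))
```

Because the Törnqvist index satisfies the time reversal property, the resulting GEKS index is transitive: the comparison from period 0 to period 2 equals the product of the comparisons from 0 to 1 and from 1 to 2, so it is free of chain drift within the window. Comparing the Jevons and Törnqvist columns gives a rough sense of how much sales volumes influence the index.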