Effectiveness of selected methods of data integration in tourism statistics
Conference
Format: CPS Abstract
Keywords: probabilistic linkage, web scraping
Abstract
Marek Cierpial-Wolan, Assoc. Prof., Statistics Poland, University of Rzeszow
Dominik Rozkrut, PhD, Statistics Poland, University of Szczecin
The modern world is shaped by many threats, both global and local. World economies face growing problems such as instability caused by numerous armed conflicts, an unprecedented scale of global migration and, since 2020, the coronavirus pandemic. These circumstances have a range of socio-economic effects and, in particular, significantly affect the tourism industry. As a consequence, huge information needs regarding tourism have emerged. They stem from both the specifics of the tourism market (short-term and sometimes incidental business) and activity in the gray zone. The demand for high-quality information available in real time requires the integration of data from various sources: administrative records, databases built from full and sample surveys, and especially big data. The development of innovative methods of data integration is therefore becoming an imperative for academia and institutional statistics. This article evaluates the usefulness of the following probabilistic methods of data linkage and deduplication: Natural Language Processing (NLP), a machine learning algorithm (K-Nearest Neighbors (K-NN) using TF-IDF techniques) and fuzzy matching. These methods were used to combine data obtained by web scraping of booking portals (Booking.com, Hotels.com and AirBnB.com) with the tourism survey frame. The effective use of multiple portals was made possible not only by text algorithms but also by photo comparison algorithms, such as comparing the similarity of color histograms and comparing visuals using feature descriptors and digital fingerprints. These algorithms analyze the visual features of the images assigned to each offer.
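The abstract does not describe how the K-NN matching with TF-IDF was implemented. As an illustration only, the following minimal Python sketch shows one common way to apply the idea to facility names: each name is turned into a TF-IDF vector over character trigrams, and every scraped name is paired with its nearest frame names by cosine similarity. The function names, the trigram size and the smoothed IDF formula are assumptions made for this example, not details taken from the study.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split a lower-cased, space-padded name into overlapping character n-grams."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Return one L2-normalised TF-IDF vector (as a dict) per document."""
    grams = [Counter(char_ngrams(d, n)) for d in docs]
    doc_freq = Counter(g for counts in grams for g in counts)
    total = len(docs)
    vectors = []
    for counts in grams:
        # Smoothed IDF (an assumption): log((1 + N) / (1 + df)) + 1
        vec = {g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1.0)
               for g, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(g, 0.0) for g, w in a.items())


def nearest_neighbours(frame_names, scraped_names, k=1):
    """For each scraped name, return the k most similar frame names with scores."""
    vectors = tfidf_vectors(frame_names + scraped_names)
    frame_vecs = vectors[:len(frame_names)]
    scraped_vecs = vectors[len(frame_names):]
    result = []
    for name, vec in zip(scraped_names, scraped_vecs):
        scored = sorted(((cosine(vec, fv), fn)
                         for fn, fv in zip(frame_names, frame_vecs)),
                        reverse=True)
        result.append((name, scored[:k]))
    return result
```

Calling `nearest_neighbours(frame_names, scraped_names, k=3)` returns, for each scraped offer, up to three candidate frame records with their similarity scores. In a linkage workflow, such candidates would typically be accepted or rejected against a threshold chosen on labeled pairs.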
The paper evaluates how data acquired by web scraping of tourism portals affect the quality of the tourism statistics frame and, in turn, the results of monthly surveys of accommodation facilities. The study showed that the most useful of the tested methods was fuzzy matching based on Levenshtein's algorithm combined with Vincenty's formula. In addition, the integration significantly improved the quality of the tourism survey frame (the number of facilities grew by 1.1% through newly added ones) and thus made it possible to correct the results of the survey of the supply side of tourism in Poland.
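The abstract names the combination of Levenshtein's algorithm and Vincenty's formula but gives no decision rule. The sketch below is one possible reading, stated as an assumption: two records are treated as the same facility when their names are similar enough (normalised Levenshtein similarity) and their coordinates are close enough (Vincenty's inverse formula on the WGS-84 ellipsoid). The thresholds `min_similarity` and `max_metres`, the record layout and all function names are hypothetical.

```python
import math


def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def name_similarity(a, b):
    """Levenshtein distance rescaled to [0, 1]; 1.0 means identical names."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def vincenty_metres(lat1, lon1, lat2, lon2, max_iter=200, tol=1e-12):
    """Geodesic distance in metres on WGS-84 using Vincenty's inverse formula."""
    a, f = 6378137.0, 1 / 298.257223563
    b = (1 - f) * a
    u1 = math.atan((1 - f) * math.tan(math.radians(lat1)))
    u2 = math.atan((1 - f) * math.tan(math.radians(lat2)))
    big_l = math.radians(lon2 - lon1)
    sin_u1, cos_u1 = math.sin(u1), math.cos(u1)
    sin_u2, cos_u2 = math.sin(u2), math.cos(u2)
    lam = big_l
    for _ in range(max_iter):
        sin_lam, cos_lam = math.sin(lam), math.cos(lam)
        sin_sigma = math.hypot(cos_u2 * sin_lam,
                               cos_u1 * sin_u2 - sin_u1 * cos_u2 * cos_lam)
        if sin_sigma == 0:
            return 0.0  # coincident points
        cos_sigma = sin_u1 * sin_u2 + cos_u1 * cos_u2 * cos_lam
        sigma = math.atan2(sin_sigma, cos_sigma)
        sin_alpha = cos_u1 * cos_u2 * sin_lam / sin_sigma
        cos_sq_alpha = 1 - sin_alpha ** 2
        # cos_sq_alpha is zero for lines along the equator
        cos_2sm = (cos_sigma - 2 * sin_u1 * sin_u2 / cos_sq_alpha
                   if cos_sq_alpha > 1e-15 else 0.0)
        c = f / 16 * cos_sq_alpha * (4 + f * (4 - 3 * cos_sq_alpha))
        lam_new = big_l + (1 - c) * f * sin_alpha * (
            sigma + c * sin_sigma * (cos_2sm + c * cos_sigma * (-1 + 2 * cos_2sm ** 2)))
        converged = abs(lam_new - lam) < tol
        lam = lam_new
        if converged:
            break
    else:
        raise ValueError("Vincenty iteration did not converge (near-antipodal points)")
    u_sq = cos_sq_alpha * (a * a - b * b) / (b * b)
    big_a = 1 + u_sq / 16384 * (4096 + u_sq * (-768 + u_sq * (320 - 175 * u_sq)))
    big_b = u_sq / 1024 * (256 + u_sq * (-128 + u_sq * (74 - 47 * u_sq)))
    delta_sigma = big_b * sin_sigma * (cos_2sm + big_b / 4 * (
        cos_sigma * (-1 + 2 * cos_2sm ** 2)
        - big_b / 6 * cos_2sm * (-3 + 4 * sin_sigma ** 2) * (-3 + 4 * cos_2sm ** 2)))
    return b * big_a * (sigma - delta_sigma)


def is_same_facility(rec_a, rec_b, min_similarity=0.8, max_metres=150.0):
    """Hypothetical rule: names must be similar AND locations must be close.

    Each record is a (name, latitude, longitude) tuple; both thresholds are
    illustrative and would need tuning on real data.
    """
    (name_a, lat_a, lon_a), (name_b, lat_b, lon_b) = rec_a, rec_b
    return (name_similarity(name_a, name_b) >= min_similarity
            and vincenty_metres(lat_a, lon_a, lat_b, lon_b) <= max_metres)
```

Requiring both conditions guards against two typical errors: identically named facilities in different towns, and different facilities sharing one building or address.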
Keywords: probabilistic record linkage, web scraping, comparing visuals
JEL: C1, C81, Z32