Determination of the continuity of the enterprise statistical unit using the Levenshtein distance technique

Author

Peter Palosi

Conference

64th ISI World Statistics Congress

Format: IPS Abstract

Keywords: algorithm, business, continuity, database, economic, enterprise, programming, statistics

Session: IPS 434 - New methods and sources in the modernisation of economic statistics

Monday 17 July 10 a.m. - noon (Canada/Eastern)

Abstract

In European business statistics, the enterprise has been developed as the statistical unit for businesses. An enterprise consists of one or more legal units, and its composition can change over the years, which raises the question of how to determine its continuity. This is a crucial concept in business statistics: tracking enterprises consistently over time gives a clear picture of the economy and of where businesses choose to establish and grow their operations. Data quality in official economic statistics has always been a focal point for statisticians and policy makers.

In the statistical business register, which is the main data source of economic statistics, a unique ID is assigned to each enterprise. However, if the composition of the enterprise changes, there is no guarantee that it retains its original ID, so a loss of continuity can occur. In European business statistics practice, the criteria for defining the continuity of an enterprise are its controlling legal unit, its main economic activity and its location. If at least two of these three elements change, continuity is considered broken: a new enterprise is deemed to have been created and the previous one to have ceased to exist. Solving this problem is not straightforward, because slight changes can occur in the controlling legal unit’s name or in the address of the main location, and these should not necessarily count as a continuity break.
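The two-of-three rule described above can be sketched as follows. This is a minimal illustration rather than the HCSO implementation: the names `EnterpriseSnapshot`, `is_continuous` and the `same` comparison hook are hypothetical, and the sample values are made up. By default the comparison is an exact match; a tolerant comparison for names and addresses could be passed in through `same`.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical record holding the three continuity criteria for one period.
@dataclass(frozen=True)
class EnterpriseSnapshot:
    controlling_legal_unit: str
    main_activity: str
    location: str

CRITERIA = ("controlling_legal_unit", "main_activity", "location")

def is_continuous(
    before: EnterpriseSnapshot,
    after: EnterpriseSnapshot,
    same: Callable[[str, str], bool] = lambda x, y: x == y,
) -> bool:
    """Return True if fewer than two of the three criteria changed."""
    changed = sum(
        not same(getattr(before, name), getattr(after, name))
        for name in CRITERIA
    )
    return changed < 2

# Made-up example data: only the location differs, so continuity holds.
old = EnterpriseSnapshot("Example Kft.", "4711", "Budapest, Fő utca 12")
moved = EnterpriseSnapshot("Example Kft.", "4711", "Szeged, Kossuth utca 5")
# Activity and location both differ, so continuity is broken.
changed_twice = EnterpriseSnapshot("Example Kft.", "5610", "Szeged, Kossuth utca 5")

print(is_continuous(old, moved))          # True
print(is_continuous(old, changed_twice))  # False
```

With exact matching, a trivial spelling difference in a name or address would count as a change, which is the weakness the abstract goes on to address.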

The Hungarian Central Statistical Office (HCSO) chose the Levenshtein distance algorithm to detect subtle differences in enterprise names and location addresses, so that it can provide more accurate results than an exact-match test when determining enterprise continuity. The Levenshtein distance is a well-known string metric for measuring the difference between two sequences, introduced by Vladimir Levenshtein in 1965. It counts the minimum number of single-character insertions, deletions or substitutions needed to transform one sequence into the other. It is still used today in various fields such as spell checking, DNA analysis, plagiarism detection and speech recognition. Numerous generic implementations of the algorithm in multiple programming languages can be found online, although applying it to a specific problem is usually not trivial.
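For readers unfamiliar with the metric, one of the generic implementations mentioned above might look like the following. It is a textbook dynamic-programming sketch that keeps only two rows of the distance table; it is not the HCSO code, and the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the rows short

    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

The running time is proportional to the product of the two string lengths, which is unproblematic for short strings such as names and addresses.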

This talk will present an implementation of the Levenshtein distance algorithm covering the whole process of determining enterprise continuity. It includes the steps of retrieving the data from the database in the required format, preparing the input data for the algorithm, and interpreting and using the output data. The presentation may be useful for statisticians who are interested in practical IT solutions and eager to try new methods for compiling economic statistics and improving their data quality. In our experience, the algorithm performs efficiently in this field, so it may be worth considering whether it can also be used in other domains such as sectoral or social statistics.
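As a rough illustration of the input preparation and output interpretation steps, the sketch below normalises a string before comparison and turns an already computed edit distance into a yes/no decision. Everything in it is an assumption made for illustration: the abstract does not describe the actual cleaning rules, and the helper names `normalize` and `is_minor_change` as well as the 0.2 threshold are hypothetical.

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Prepare a name or address for comparison (assumed cleaning rules)."""
    # Unify the Unicode representation of accented letters and ignore case.
    text = unicodedata.normalize("NFC", text).casefold()
    # Replace punctuation with spaces; accented letters are kept as they are.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())

def is_minor_change(distance: int, a: str, b: str, max_ratio: float = 0.2) -> bool:
    """Treat an edit distance as a minor change if it is small relative to
    the longer string. The 0.2 default is a hypothetical threshold."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True  # two empty strings are identical
    return distance / longest <= max_ratio

print(normalize("  Fő  utca 12. "))  # fő utca 12
```

Relating the distance to the string length means that a single differing character weighs more in a short name than in a long address; whether a relative or an absolute threshold suits a given register is a design choice that would need to be checked against real data.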