Application of k-medoids for Data Clustering
Conference
65th ISI World Statistics Congress
Format: CPS Poster - WSC 2025
Keywords: clustering, elbow plot, k-medoids, labor burdens, silhouette method, transportation infrastructure
Abstract
FGV IBRE, the Brazilian Institute of Economics of the Getulio Vargas Foundation, is a center of excellence in the research, analysis, and production of economic statistics. With a physical presence in 15 Brazilian capitals, it collects more than 300,000 prices of products and services, along with other primary data, every month. For 70 years, its mission has been to contribute to Brazil's economic and social development. The statistics it produces include price indices, surveys, and reference prices.
Reference prices of resources are objective and well-founded parameters that provide transparency, quality, speed, and cost-effectiveness in procuring products, works, and/or services. They ensure that budgets respect appropriate price levels in public procurements.
The cost composition of a specific engineering service comprises the materials used in executing the service, the machines and equipment required, and the labor. For labor, the reference prices are obtained from a sample of information collected from official, public secondary data sources.
The total reference labor cost is broken down into four components: salaries, social labor burdens, complementary labor burdens, and additional labor burdens.
In this study, the objective is to cluster labor categories so that a single total labor burden can be defined per group, within the scope of transportation infrastructure in São Paulo. To this end, a statistical clustering technique known as k-medoids is applied. Its principle is to partition the data into k groups, called clusters, and to select a representative for each cluster, called a medoid: the element with the smallest average distance to the other elements in its cluster. Because a medoid is always an actual observation, each cluster of labor categories is represented by one of its own member categories.
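The partitioning principle described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the study: it uses the simple alternating (assign, then update) variant of k-medoids rather than the classic PAM swap search, Euclidean distance is an assumption, and the six two-dimensional points are made-up burden rates that serve only as toy input.

```python
import random


def pairwise_distances(points):
    """Euclidean distance matrix for a list of equal-length numeric tuples."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5
            dist[i][j] = dist[j][i] = d
    return dist


def assign(dist, medoids):
    """Label each element with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
        for i in range(len(dist))
    ]


def k_medoids(dist, k, max_iter=100, seed=0):
    """Alternating k-medoids on a precomputed distance matrix."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(dist)), k)  # random initial representatives
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                # Only possible with duplicate points; keep the old medoid.
                new_medoids.append(medoids[c])
                continue
            # The medoid is the member with the smallest total (hence average)
            # distance to the other members of its cluster.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Hypothetical toy data: (social burden rate, complementary burden rate).
points = [(0.80, 0.10), (0.82, 0.12), (0.79, 0.11),
          (1.20, 0.30), (1.22, 0.28), (1.18, 0.31)]
medoids, labels = k_medoids(pairwise_distances(points), k=2)
print("medoid indices:", medoids)
print("cluster labels:", labels)
```

Each returned medoid is an index into `points`, so the representative of a cluster is always one of the original categories rather than an artificial average.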
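The number of clusters, k, must be defined beforehand, and the elbow plot is used to choose it. The plot displays the sum of within-cluster squared deviations for each candidate value of k. This sum decreases as k grows, so the goal is to find the "elbow": the point beyond which adding clusters yields only a marginal gain in within-cluster similarity and is no longer worth the loss of simplicity.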
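The elbow curve can be illustrated with a small, hedged Python sketch. Two assumptions are made for brevity: the best medoids for each k are found by exhaustive search, which is only feasible for tiny toy inputs, and the elbow is located with a second-difference heuristic instead of being read off the plot by eye. The nine points are invented and form three tight groups.

```python
from itertools import combinations


def squared_distance(p, q):
    """Squared Euclidean distance between two numeric tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def best_cost(points, k):
    """Smallest within-cluster sum of squared deviations over all sets of k medoids."""
    return min(
        sum(min(squared_distance(p, points[m]) for m in medoids) for p in points)
        for medoids in combinations(range(len(points)), k)
    )


def elbow_curve(points, max_k):
    """Map each k in 1..max_k to its within-cluster sum of squared deviations."""
    return {k: best_cost(points, k) for k in range(1, max_k + 1)}


def locate_elbow(curve):
    """Heuristic (an assumption): the k where the curve bends most sharply,
    measured by the largest second difference of the cost."""
    ks = sorted(curve)
    return max(ks[1:-1], key=lambda k: curve[k - 1] - 2 * curve[k] + curve[k + 1])


# Hypothetical toy data with three well-separated groups.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (10.0, 0.0), (10.1, 0.0), (10.0, 0.1)]
curve = elbow_curve(points, max_k=6)
for k in sorted(curve):
    print(k, round(curve[k], 3))
print("elbow at k =", locate_elbow(curve))
```

Plotting `curve` with k on the horizontal axis gives the elbow plot itself; on real data the bend is usually less sharp than in this toy example, which is why a second check is useful.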
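The silhouette method is used to validate this choice. The silhouette is an individual measure calculated for each element of the dataset: it compares how close the element is to the members of its own cluster with how close it is to the nearest other cluster, and ranges from -1 to 1. The Average Silhouette Coefficient (ASC), the mean of these values, classifies the clustering structure on a scale as robust, reasonable, weak, or non-existent.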
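As an illustration, the silhouette of element i is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is its mean distance to the other members of its own cluster and b(i) is the smallest mean distance to the members of any other cluster. The sketch below computes it in Python on made-up points. The cutoffs in `interpret_asc` are an assumption: they follow the commonly cited rule of thumb attributed to Kaufman and Rousseeuw, and the abstract does not state which thresholds were actually used.

```python
def euclidean(p, q):
    """Euclidean distance between two numeric tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def silhouette_values(points, labels):
    """Silhouette s(i) for every element, given one cluster label per element."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own or len(clusters) < 2:
            values.append(0.0)  # convention for singletons or a single cluster
            continue
        # a: mean distance to the other members of the element's own cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


def average_silhouette(points, labels):
    """Average Silhouette Coefficient (ASC): the mean of the individual values."""
    values = silhouette_values(points, labels)
    return sum(values) / len(values)


def interpret_asc(asc):
    """Assumed cutoffs (rule of thumb, not taken from the abstract)."""
    if asc > 0.70:
        return "robust"
    if asc > 0.50:
        return "reasonable"
    if asc > 0.25:
        return "weak"
    return "non-existent"


# Hypothetical toy data with two well-separated groups.
points = [(0.80, 0.10), (0.82, 0.12), (0.79, 0.11),
          (1.20, 0.30), (1.22, 0.28), (1.18, 0.31)]
good_asc = average_silhouette(points, [0, 0, 0, 1, 1, 1])
poor_asc = average_silhouette(points, [0, 1, 0, 1, 0, 1])
print(round(good_asc, 3), interpret_asc(good_asc))
print(round(poor_asc, 3), interpret_asc(poor_asc))
```

Comparing the two labelings shows why the ASC works as a validator: a partition that matches the natural groups scores close to 1, while one that mixes them scores far lower.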
In this work, k was set to 8 based on the elbow plot. The resulting clusters are robust, with an ASC greater than 0.75. K-medoids thus grouped the more than 100 labor categories considered in the study into 8 clusters and selected 8 representative professionals, one per cluster. The labor burdens therefore need to be calculated only for the 8 labor categories chosen as medoids, which simplifies both the presentation of the results and the calculation effort.