65th ISI World Statistics Congress 2025

Selecting Between Clustering Methods: An Innovative Approach

Format: IPS Abstract - WSC 2025

Session: IPS 1039 - Artificial Intelligence in Medicine

Wednesday 8 October 10:50 a.m. - 12:30 p.m. (Europe/Amsterdam)

Abstract

The paper presents a new method for validating clustering partitions, addressing the challenges associated with assessing clustering algorithms when true class labels are unknown. Clustering algorithms aim to group similar objects, but determining the quality of these groupings can be difficult without external benchmarks. The proposed method combines internal and relative validation criteria with machine learning (ML) algorithms to create a ranking system for different clustering partitions. This method stands out by explicitly considering the structure of the dataset’s features and offering a robust solution for high-dimensional data.
Traditional validation methods for clustering rely on external, internal, or relative criteria. External criteria compare the clustering results with a known external classification, but such information is not always available. Internal criteria rely solely on properties of the dataset, such as the proximity matrix, and can fail to capture the true quality of a partition. Relative criteria compare a set of partitions but do not take the dataset’s feature structure into account. The proposed method fills this gap by using ML algorithms to assess the coherence between the clustering results and the dataset’s features, making it more flexible and accurate across varied contexts.
The validation process works by using the assigned clusters as a response variable and the features of the dataset as independent variables in a machine learning model. The model's performance, represented by an index such as accuracy, sensitivity, or specificity, serves as an indicator of the clustering algorithm’s effectiveness. Rather than assessing the absolute quality of a partition, the method ranks partitions relative to one another. This ranking highlights the clustering algorithms that best capture the underlying structure of the data.
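As a minimal sketch of this idea, the Python below scores each candidate partition by how well a classifier can predict its cluster labels from the features, then sorts the partitions by that score. The abstract does not name a specific model or resampling scheme, so the 1-nearest-neighbour classifier, the 5-fold split, the toy data, and the names `predict_1nn`, `cv_accuracy`, and `rank_partitions` are all assumptions made for illustration.

```python
from math import dist


def predict_1nn(train_X, train_y, x):
    """Return the label of the training point closest to x."""
    nearest = min(range(len(train_X)), key=lambda j: dist(train_X[j], x))
    return train_y[nearest]


def cv_accuracy(X, labels, n_folds=5):
    """Cross-validated accuracy of predicting cluster labels from features."""
    hits = 0
    for fold in range(n_folds):
        # Deterministic folds: point i is held out in fold i % n_folds.
        train = [i for i in range(len(X)) if i % n_folds != fold]
        held_out = [i for i in range(len(X)) if i % n_folds == fold]
        train_X = [X[i] for i in train]
        train_y = [labels[i] for i in train]
        hits += sum(
            predict_1nn(train_X, train_y, X[i]) == labels[i] for i in held_out
        )
    return hits / len(X)


def rank_partitions(X, partitions):
    """Sort partitions from most to least predictable from the features."""
    scores = {name: cv_accuracy(X, labels) for name, labels in partitions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data (assumed): two visibly separated groups in two dimensions.
X = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.8), (0.9, 0.4), (0.3, 0.6),
     (10.0, 10.0), (10.4, 9.7), (9.8, 10.5), (10.2, 10.1), (9.6, 9.9)]

# Two hypothetical partitions of the same ten points.
partitions = {
    "alternating": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],  # ignores the feature structure
    "coherent": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],     # follows the two groups
}

ranking = rank_partitions(X, partitions)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this toy case the partition that follows the two visible groups is perfectly predictable from the features and ranks first, while the alternating labelling is not. Any other classifier or performance index, such as sensitivity or specificity, could be substituted for the accuracy used here.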
To test the effectiveness of the proposed approach, the authors conducted a simulation study. They generated datasets with varying levels of noise and applied 11 classical clustering algorithms. The ML-based validation method was then used to rank the partitions produced by these algorithms. The results showed that the method correctly ranked the partitions, with higher-quality partitions (those with less noise) receiving better rankings. This demonstrates the method's ability to rank partitions without knowing the true classification of the data, making it a powerful tool for cluster validation.
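The logic of this check can be illustrated on a much smaller scale. The sketch below does not reproduce the authors' simulation: instead of running 11 clustering algorithms on noisy datasets, it degrades a known partition by flipping a fixed share of its labels and scores each version with a leave-one-out nearest-centroid classifier. The data, the flipping scheme, and the function names `loo_accuracy` and `flip_every` are hypothetical.

```python
from math import dist


def loo_accuracy(X, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier on the labels."""
    hits = 0
    for i, x in enumerate(X):
        # Group every point except the held-out one by its assigned label.
        groups = {}
        for j, (row, label) in enumerate(zip(X, labels)):
            if j != i:
                groups.setdefault(label, []).append(row)
        centroids = {
            label: tuple(sum(col) / len(rows) for col in zip(*rows))
            for label, rows in groups.items()
        }
        predicted = min(centroids, key=lambda label: dist(centroids[label], x))
        hits += predicted == labels[i]
    return hits / len(X)


def flip_every(labels, step):
    """Flip the binary label at every `step`-th index (assumed noise model)."""
    return [1 - y if i % step == 0 else y for i, y in enumerate(labels)]


# Two well-separated one-dimensional groups with a known partition.
X = [(float(v),) for v in range(10)] + [(100.0 + v,) for v in range(10)]
truth = [0] * 10 + [1] * 10

partitions = {
    "clean": truth,
    "20% flipped": flip_every(truth, 5),  # 4 of 20 labels flipped
    "35% flipped": flip_every(truth, 3),  # 7 of 20 labels flipped
}

# The scorer sees only the features and the partition labels, never `truth`.
scores = {name: loo_accuracy(X, labels) for name, labels in partitions.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)   # {'clean': 1.0, '20% flipped': 0.8, '35% flipped': 0.65}
print(ranking)  # ['clean', '20% flipped', '35% flipped']
```

Under these assumptions the score falls as more labels are flipped, so the ranking recovers the order of partition quality without access to the true labels at scoring time.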
Compared with other methods, the proposed approach offers several advantages. Unlike external validation methods, it does not require external reference labels. It also considers the structure of the dataset’s features, addressing a limitation of internal and relative validation criteria. Finally, it can handle high-dimensional data, which is often a challenge for existing cluster validation methods.
In conclusion, the proposed validation method provides a novel and effective way to rank clustering algorithms by their ability to partition data according to its underlying structure. By leveraging machine learning algorithms, it offers a more robust and versatile solution for cluster validation. The authors suggest that future work could explore the method’s applicability to big data scenarios, where computational efficiency becomes crucial.