64th ISI World Statistics Congress

Tell me who your friends are, and I will tell you who you are: The role of the second most frequent group in cluster labeling

Format: CPS Abstract

Keywords: classification, clustered-data, interpretation

Session: CPS 06 - Clustering

Monday 17 July 8:30 a.m. - 9:40 a.m. (Canada/Eastern)

Abstract

Conventional classification techniques often achieve unsatisfactory results in multidimensional and unbalanced datasets. Unsupervised classification methods, such as cluster analysis, can expand the researcher’s horizons in discovering new features and providing unexpected interpretations. On the other hand, supervised classification methods can benefit from unsupervised findings. The interpretation of cluster analysis results is one of the most important steps toward identifying potential classes. Adding a new variable that represents the discovered classes to a classification model can significantly improve classification accuracy. We focus on the labeling of cluster analysis results, which often depends on the human perception of the studied object. Clusters obtained from datasets that mix numerical and categorical data are often difficult to interpret. When experts label the classes obtained from the cluster analysis, they usually focus on the most frequent group in each categorical variable. The observations that fall into the same cluster but have a different value of the categorical variable usually receive less attention. The main goal of the present study is to highlight the group of observations with the second most frequent value of a categorical variable in the same cluster in order to learn about the similarities and dissimilarities of different population groups. In certain situations, the second most frequent group is no less informative than the most frequent group. As a well-known idiom says, "Tell me who your friends are, and I will tell you who you are". The ability to draw conclusions from similarities between different members of the same cluster is even more interesting for unbalanced datasets. In these cases, the minority value of the categorical variable often fails to “cause” the clustering algorithm to create a separate cluster, and the observations are “forced” to join clusters with observations that are close enough.
Our goal is to identify situations where focusing on the second most frequent group in the cluster improves the interpretation, the labeling, and, finally, the classification results.
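
As a rough illustration of the idea, the sketch below tabulates the most frequent and the second most frequent value of one categorical variable within each cluster. It is not the authors' procedure: the function name `top_two_per_cluster`, the variable `occupation`, and the toy data are hypothetical, and the cluster assignments are assumed to come from some earlier clustering step.

```python
from collections import Counter, defaultdict


def top_two_per_cluster(cluster_ids, categories):
    """Return {cluster: [(value, count, share), ...]} holding the two most
    frequent values of a categorical variable in each cluster."""
    by_cluster = defaultdict(Counter)
    for cid, cat in zip(cluster_ids, categories):
        by_cluster[cid][cat] += 1

    summary = {}
    for cid, counts in by_cluster.items():
        total = sum(counts.values())
        # most_common(2) yields the dominant value and the runner-up
        summary[cid] = [(value, n, n / total) for value, n in counts.most_common(2)]
    return summary


# Hypothetical toy data: cluster assignments and one categorical variable.
cluster_ids = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
occupation = ["nurse", "nurse", "nurse", "teacher", "teacher",
              "engineer", "engineer", "engineer", "engineer", "nurse"]

summary = top_two_per_cluster(cluster_ids, occupation)
for cid in sorted(summary):
    print(cid, summary[cid])
```

In this toy example, a label based only on the dominant value would call cluster 1 "engineer" and overlook the nurse who was grouped with the engineers. The runner-up entry makes that minority group visible for closer inspection.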