Regularized k-POD clustering for missing data
Conference
65th ISI World Statistics Congress 2025
Format: CPS Poster - WSC 2025
Keywords: clustering, k-means, missing-data
Abstract
k-means clustering is one of the most popular clustering methods. Its main idea is to find k cluster centers and then cluster the data points by assigning each to its nearest center. Although missing data are common in many applications, clustering with missing data has received far less attention. Recently, k-POD clustering was proposed as a natural extension of k-means clustering to missing data; it alternates between imputing the missing entries from the current cluster centers and running k-means clustering on the completed data. It requires no assumptions on the missingness mechanism and is applicable to high-dimensional data, even with large proportions of missing entries. However, the cluster centers estimated by k-POD clustering are generally biased, which makes the resulting clustering unreliable. In this work, we focus on reducing this bias and improving the performance of k-POD clustering. Motivated by the observation that the bias often arises when there are noise features that contribute nothing to the clustering, we propose a novel regularized k-POD clustering method that penalizes the cluster centers feature-wise. This makes it possible to shrink the cluster centers in the noise features and thereby reduce the bias. The optimization procedure is based on the majorization-minimization algorithm, which ensures convergence. In numerical experiments and in applications to real-world data, the proposed method shows lower bias in the estimated cluster centers and better clustering performance than the other methods compared.
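The alternation described in the abstract for basic (unregularized) k-POD can be illustrated with a minimal sketch. This is not the authors' implementation: the function name kpod, the column-mean initial fill, the random-row initialization of the centers, and the use of a single Lloyd step per iteration instead of a full k-means run are all assumptions made for illustration. The proposed feature-wise penalty is not shown, since the abstract does not give its form.

```python
import numpy as np


def kpod(X, k, n_iter=100, seed=0):
    """Simplified k-POD sketch (hypothetical helper, not the authors' code).

    Missing entries are marked with NaN. Assumes every feature has at
    least one observed value and that k <= number of rows.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Initial fill with observed column means (an assumption).
    col_means = np.nanmean(X, axis=0)
    X_filled = np.where(missing, col_means, X)

    # Initialize centers from k distinct rows of the filled data (an assumption).
    centers = X_filled[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)

    for _ in range(n_iter):
        # Assignment step: nearest center in squared Euclidean distance.
        dists = ((X_filled[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)

        # Update step: each center becomes the mean of its members.
        # An empty cluster keeps its previous center.
        for j in range(k):
            members = new_labels == j
            if members.any():
                centers[j] = X_filled[members].mean(axis=0)

        # Imputation step: refill missing entries from the assigned centers.
        X_filled = np.where(missing, centers[new_labels], X)

        # Stop once the assignments no longer change.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

    return labels, centers, X_filled
```

Calling kpod(X, k=3) on an array X with NaN entries returns the cluster labels, the estimated centers, and the completed data matrix. Observed entries are never altered; only the missing entries are overwritten, using the center of the cluster each point is currently assigned to.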