VLDB2026

Highly-Efficient Large-Scale k-means with Individual Fairness

Shengkun Zhu, Jinshan Zeng, Yuan Sun, Sheng Wang, Yiming Wang, Yushuai Ji, Feiping Nie, Xiaodong Li, Zhiyong Peng

摘要

Traditional k -means minimizes the sum of squared error (SSE) but may treat data points unequally, as some are assigned to significantly distant centroids. This leads to unfair outcomes in downstream tasks such as facility location planning, where each cluster corresponds to a specific share of limited resources. To address this, we modify the objective of k -means via exponential tilting , which emphasizes the impact of distant data points and yields a new objective: the tilted SSE. We propose TKM, which optimizes via coordinate descent and stochastic gradient descent, and improves fairness by shifting centroids toward underrepresented groups. We adopt the within-cluster variance to quantify fairness among individuals within the same group, which provably reduces extreme disparities in outcomes. To improve efficiency, we propose FastTKM, which uses stochastic dynamics to estimate the tilted SSE with lower computational cost. We theoretically demonstrate that, under our proposed methods, the variance decreases with t , a scaling factor that controls the degree of centroid deviation. Furthermore, our methods exhibit time and space complexities comparable to the classical Lloyd's heuristic. Experimentally, our methods outperform six baselines in terms of clustering utility and fairness across twelve real-world datasets. In terms of efficiency, our methods achieve thousand-fold speedups in running time and reduction in memory usage, with this factor growing as the dataset size increases.