ICML2022

Massively Parallel k-Means Clustering for Perturbation Resilient Instances

Vincent Cohen-Addad, Vahab S. Mirrokni, Peilin Zhong

6 citations

Abstract

We consider k-means clustering of n data points in Euclidean space in the Massively Parallel Computation (MPC) model, a computational model which is an abstraction of modern massively parallel computing system such as MapReduce. Recent work provides evidence that getting O(1)approximate k-means solution for general input points using o(log n) rounds in the MPC model may be impossible under certain conditions [Ghaffari, Kuhn & Uitto'2019] . However, the real-world data points usually have better structures. One instance of interest is the set of data points which is perturbation resilient [Bilu & Linial'2010]. In particular, a point set is αperturbation resilient for k-means if perturbing pairwise distances by multiplicative factors in the range [1, α] does not change the optimum k-means clusters. We bypass the worst case lower bound by considering the perturbation resilient input points and showing o(log n) rounds k-means clustering algorithms for these instances in the MPC model. Specifically, we show a fully scalable (1 + ε)-approximate k-means clustering algorithm for O(α)-perturbation resilient instance in the MPC model using O(1) rounds and O ε,d (n 1+1/α 2 +o(1) ) total space. If the space per machine is sufficiently larger than k, i.e., at least k • n Ω(1) , we also develop an optimal kmeans clustering algorithm for O(α)-perturbation resilient instance in MPC using O(1) rounds and O d (n 1+o(1) • (n 1/α 2 + k)) total space.