STOC2025
Almost Optimal PAC Learning for k-Means
Vincent Cohen-Addad, Silvio Lattanzi, Chris Schwiegelshohn
Abstract
Given a set of points, the k-means clustering problem consists of finding a partition of a set of points into k clusters such that the sum of squared Euclidean distances between the points and their assigned centers is minimized. In this paper, we consider learning bounds for this problem. That is, given a set of n samples P drawn independently from some unknown but fixed distribution D, how quickly does a solution computed on P converge to the optimal clustering of D? The currently fastest provable rate of convergence of the order →k/nmin(k,logklog<sup>2</sup>(n/k)) is due to [Appert, Catoni, 2021] with the best known lower bound being of the order λk/n due to [Bartlett, Linder, and Lugosi, 1998]. We give learning bounds with both optimal dependency on the sample size n and nearly optimal dependency on k by proving a convergence rate of the order of →klogk/n.