ICML2024

Analyzing Dα seeding for k-means

Étienne Bamas, Sai Ganesh Nagarajan, Ola Svensson

Abstract

One of the most popular clustering algorithms is the celebrated D α seeding algorithm (also know as k-means++ when α = 2) by Arthur and Vassilvitskii (2007) , who showed that it guarantees in expectation an O(2 2α • log k)-approximate solution to the (k,α)-clustering cost (where distances are raised to the power α) for any α ≥ 1. More recently, Balcan, Dick, and White (2018) observed experimentally that using D α seeding with α > 2 can lead to a better solution with respect to the standard k-means objective (i.e. the (k, 2)-clustering cost). In this paper, we provide a rigorous understanding of this phenomenon. For any α > 2, we show that D α seeding guarantees in expectation an approximation factor of