NeurIPS2020

Fast and Accurate $k$ -means++ via Rejection Sampling

Vincent Cohen-Addad, Silvio Lattanzi, Ashkan Norouzi-Fard, Christian Sohler, Ola Svensson

25 citations

Abstract

k-means++ [4] is a widely used clustering algorithm that is easy to implement, has nice theoretical guarantees and strong empirical performance. Despite its wide adoption, k-means++ sometimes suffers from being slow on large data-sets so a natural question has been to obtain more efficient algorithms with similar guarantees. In this paper, we present a near linear time algorithm for k-means++ seeding. Interestingly our algorithm obtains the same theoretical guarantees as k-means++ and significantly improves earlier results on fast k-means++ seeding. Moreover, we show empirically that our algorithm is significantly faster than k-means++ and obtains solutions of equivalent quality. * Equal contribution † Work was partially done while author was visiting researcher at Google Research, Switzerland. 3 Even assuming a constant number of Lloyd's algorithm steps. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.