ICML2020

Sets Clustering

Ibrahim Jubran, Murad Tukan, Alaa Maalouf, Dan Feldman

10,342 citations

Abstract

The input to the sets-k-means problem is an integer k ≥ 1 and a set P = P 1 , • • • , P n of fixed sized sets in R d . The goal is to compute a set C of k centers (points) in R d that minimizes the sum P ∈P min p∈P,c∈C p -c 2 of squared distances to these sets. An ε-core-set for this problem is a weighted subset of P that approximates this sum up to 1 ± ε factor, for every set C of k centers in R d . We prove that such a core-set of O(log 2 n) sets always exists, and can be computed in O(n log n) time, for every input P and every fixed d, k ≥ 1 and ε ∈ (0, 1). The result easily generalized for any metric space, distances to the power of z > 0, and M-estimators that handle outliers. Applying an inefficient but optimal algorithm on this coreset allows us to obtain the first PTAS (1 + ε approximation) for the sets-kmeans problem that takes time near linear in n. This is the first result even for sets-mean on the plane (k = 1, d = 2). Open source code and experimental results for document classification and facility locations are also provided.