ICLR2023

Near-optimal Coresets for Robust Clustering

Lingxiao Huang, Shaofeng H.-C. Jiang, Jianing Lou, Xuan Wu

被引用 1 次

摘要

We consider robust clustering problems in Rd\mathbb{R}^d, specifically kk-clustering problems (e.g., kk-Median and kk-Means with mm outliers, where the cost for a given center set CRdC \subset \mathbb{R}^d aggregates the distances from CC to all but the furthest mm data points, instead of all points as in classical clustering. We focus on the εε-coreset for robust clustering, a small proxy of the dataset that preserves the clustering cost within εε-relative error for all center sets. Our main result is an εε-coreset of size O(m+poly(kε1))O(m + \mathrm{poly}(k ε^{-1})) that can be constructed in near-linear time. This significantly improves previous results, which either suffers an exponential dependence on (m+k)(m + k) [Feldman and Schulman, SODA'12], or has a weaker bi-criteria guarantee [Huang et al., FOCS'18]. Furthermore, we show this dependence in mm is nearly-optimal, and the fact that it is isolated from other factors may be crucial for dealing with large number of outliers. We construct our coresets by adapting to the outlier setting a recent framework [Braverman et al., FOCS'22] which was designed for capacity-constrained clustering, overcoming a new challenge that the participating terms in the cost, particularly the excluded mm outlier points, are dependent on the center set CC. We validate our coresets on various datasets, and we observe a superior size-accuracy tradeoff compared with popular baselines including uniform sampling and sensitivity sampling. We also achieve a significant speedup of existing approximation algorithms for robust clustering using our coresets.