ICLR2023
Near-optimal Coresets for Robust Clustering
Lingxiao Huang, Shaofeng H.-C. Jiang, Jianing Lou, Xuan Wu
1 citation
Abstract
We consider robust clustering problems in , specifically -clustering problems (e.g., -Median and -Means with outliers, where the cost for a given center set aggregates the distances from to all but the furthest data points, instead of all points as in classical clustering. We focus on the -coreset for robust clustering, a small proxy of the dataset that preserves the clustering cost within -relative error for all center sets. Our main result is an -coreset of size that can be constructed in near-linear time. This significantly improves previous results, which either suffers an exponential dependence on [Feldman and Schulman, SODA'12], or has a weaker bi-criteria guarantee [Huang et al., FOCS'18]. Furthermore, we show this dependence in is nearly-optimal, and the fact that it is isolated from other factors may be crucial for dealing with large number of outliers. We construct our coresets by adapting to the outlier setting a recent framework [Braverman et al., FOCS'22] which was designed for capacity-constrained clustering, overcoming a new challenge that the participating terms in the cost, particularly the excluded outlier points, are dependent on the center set . We validate our coresets on various datasets, and we observe a superior size-accuracy tradeoff compared with popular baselines including uniform sampling and sensitivity sampling. We also achieve a significant speedup of existing approximation algorithms for robust clustering using our coresets.