NeurIPS2021

Coresets for Clustering with Missing Values

Vladimir Braverman, Shaofeng H.-C. Jiang, Robert Krauthgamer, Xuan Wu

被引用 19 次

摘要

We provide the first coreset for clustering points in Rd\mathbb{R}^d that have multiple missing values (coordinates). Previous coreset constructions only allow one missing coordinate. The challenge in this setting is that objective functions, like kk-Means, are evaluated only on the set of available (non-missing) coordinates, which varies across points. Recall that an ϵ\epsilon-coreset of a large dataset is a small proxy, usually a reweighted subset of points, that (1+ϵ)(1+\epsilon)-approximates the clustering objective for every possible center set. Our coresets for kk-Means and kk-Median clustering have size (jk)O(min(j,k))(ϵ1dlogn)2(jk)^{O(\min(j,k))} (\epsilon^{-1} d \log n)^2, where nn is the number of data points, dd is the dimension and jj is the maximum number of missing coordinates for each data point. We further design an algorithm to construct these coresets in near-linear time, and consequently improve a recent quadratic-time PTAS for kk-Means with missing values [Eiben et al., SODA 2021] to near-linear time. We validate our coreset construction, which is based on importance sampling and is easy to implement, on various real data sets. Our coreset exhibits a flexible tradeoff between coreset size and accuracy, and generally outperforms the uniform-sampling baseline. Furthermore, it significantly speeds up a Lloyd's-style heuristic for kk-Means with missing values.