SIGMOD2025

Robust Statistical Analysis on Streaming Data with Near-Duplicates in General Metric Spaces

Qin Zhang

Abstract

This paper considers statistical analysis on noisy datasets where near-duplicate elements need to be treated as identical ones. We focus on two basic problems, distinct elements and ℓ 0 -sampling, in the data stream model where the sequence of elements can only be scanned once using a limited space, under which a comprehensive data deduplication step before statistical analysis is not feasible. Previous streaming algorithms for these problems could only handle noisy datasets in O (1)-dimensional Euclidean spaces. In this paper, we propose sublinear-space streaming algorithms that work for noisy datasets in any metric space. We also give a lower bound result showing that solving the distinct elements problem on noisy datasets in general metric spaces is inherently more difficult than solving it on noiseless datasets and on noisy datasets in O (1)-dimensional Euclidean spaces.