SIGMOD2025

SieveSketch: A Fine-grained and Adaptive Sketch Framework for Accurate Frequency Estimation

Shishi Zhang, Yaping Xu, Lu Tang

摘要

Estimating item frequencies in data streams is a fundamental task that supports a wide range of applications. To improve accuracy, existing algorithms typically employ filters to handle cold (infrequent) and hot (frequent) items separately. However, their accuracy often degrades across different data streams due to fixed parameter settings. Once the filter reaches its capacity, it can no longer effectively distinguish target items, resulting in a significant drop in accuracy. To achieve higher accuracy and better adaptability to data streams, we propose SieveSketch, a novel framework for frequency estimation in data stream processing. Inspired by two observations of narrow cold-item frequency range and different sensitivity of items to hash collisions, SieveSketch proposes adaptive scaling to adjust the count range of each counter with few bits (e.g. 4 bits) to record massive cold items efficiently, and takes a frequency-oriented counting method to process items at a more fine-grained level to improve the accuracy. We theoretically analyze the error bound of SieveSketch. We conduct extensive experiments on real-world and synthetic datasets, and the results show that, compared to the state-of-the-art, SieveSketch reduces the estimation error by up to 222.1 times.