SIGMOD2025
Randomized Sketches for Quantile in LSM-tree based Store
Ziling Chen, Shaoxu Song
被引用 1 次
摘要
Quantiles are costly to compute exactly but can be efficiently estimated by quantile sketches. Extensive works on summarizing streaming data, such as KLL sketch, focus on minimizing the cost in memory to provide certain error guarantees. For the problem of quantile estimation of values in LSM-tree based stores, streaming methods have an expensive I/O cost linear to data size N. Since disk components (chunks and SSTables) in the LSM-tree are immutable once flushed, quantile sketches can be pre-computed as a type of statistics to reduce I/O cost and accelerate queries. Unfortunately, to provide deterministic additive εN error guarantees on queried data, all pre-computed deterministic sketches of queried chunks each with size N_c should provide εN_c error guarantee, resulting in no improvement in the linear I/O cost. In this study, we propose pre-computing randomized sketches which provide randomized additive error guarantees. Our major technical contributions include (1) randomized sketches for data chunks constructed in flush events, which are proved to be optimal and achieve an I/O cost proportional to √(N), (2) hierarchical randomized sketches for SSTables constructed in compaction events, that further improve the asymptotic I/O cost, (3) the KLL sketch summarizing proposed pre-computed sketches is proved to be more accurate than that summarizing streaming data, and proved to achieve sublinear I/O cost while achieving the same memory complexity as in the streaming settings. Extensive experiments on synthetic and real datasets demonstrate the superiority of the proposed techniques. The approach is deployed in an LSM-tree based time-series database Apache IoTDB.