VLDB 2025
Eureka: Enabling Fine-Grained Access and Range Queries on Compressed Scientific Data via Data-Index Co-Compression
Ning Yan, Sheng Di, Kai Zhao, Lipeng Wan
Abstract
Handling large-scale scientific data in high-performance computing (HPC) environments poses significant challenges, including excessive I/O, high storage costs, and slow query performance. Traditional approaches often require full data decompression and scans, making them impractical for real-time or interactive analysis. To address these limitations, we introduce Eureka, a unified data-index co-compression framework that enables fine-grained access and efficient range queries on compressed scientific datasets. Eureka integrates spatial domain decomposition with block-wise error-bounded lossy compression to support selective decompression. It constructs a hierarchical AVL-tree index during compression to capture block-level value ranges, enabling fast pruning during query execution. To reduce metadata overhead, the index itself is also compressed while ensuring recall-preserving results. Experiments on six diverse HPC simulation datasets show that Eureka achieves up to 25x data compression and over 300x index compression, surpassing state-of-the-art compressors such as SZ3 and ZFP in rate-distortion performance. Additionally, Eureka delivers over 30x speedup for low-selectivity range queries, making it a scalable and efficient solution for modern scientific data analysis.
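The pruning idea in the abstract can be illustrated with a minimal sketch. This is not Eureka's implementation, and it makes several simplifying assumptions: the data is a 1-D list, uniform scalar quantization stands in for the block-wise error-bounded lossy compressor, and a flat list of per-block (min, max) pairs stands in for the hierarchical AVL-tree index. All function names (`compress_block`, `build`, `range_query`) are hypothetical.

```python
# Illustrative sketch only: block-wise error-bounded quantization plus a
# per-block value-range index used to skip blocks during a range query.
# Stand-ins (assumptions): uniform quantization instead of a real lossy
# compressor, a flat list instead of a hierarchical AVL tree.

def compress_block(values, error_bound):
    """Quantize each value to an integer bin of width 2 * error_bound.

    Reconstructing a bin centre keeps the pointwise error within
    error_bound (up to floating-point rounding).
    """
    step = 2.0 * error_bound
    return [round(v / step) for v in values]


def decompress_block(codes, error_bound):
    """Map integer bin codes back to their bin-centre values."""
    step = 2.0 * error_bound
    return [c * step for c in codes]


def build(data, block_size, error_bound):
    """Split data into blocks, compress each, and record its value range.

    The index stores the min/max of the *reconstructed* values, so pruning
    on the index can never discard a block that holds a matching value.
    """
    step = 2.0 * error_bound
    blocks, index = [], []
    for start in range(0, len(data), block_size):
        codes = compress_block(data[start:start + block_size], error_bound)
        blocks.append(codes)
        index.append((min(codes) * step, max(codes) * step))
    return blocks, index


def range_query(blocks, index, block_size, error_bound, lo, hi):
    """Return (position, value) pairs with lo <= value <= hi.

    Only blocks whose [min, max] range overlaps [lo, hi] are decompressed;
    the second return value counts how many blocks that was.
    """
    hits, decompressed = [], 0
    for b, (bmin, bmax) in enumerate(index):
        if bmax < lo or bmin > hi:
            continue  # block range cannot contain a match: skip it
        decompressed += 1
        for offset, v in enumerate(decompress_block(blocks[b], error_bound)):
            if lo <= v <= hi:
                hits.append((b * block_size + offset, v))
    return hits, decompressed


if __name__ == "__main__":
    data = [i * 0.1 for i in range(1000)]  # synthetic, monotonically increasing
    blocks, index = build(data, block_size=100, error_bound=0.01)
    hits, touched = range_query(blocks, index, 100, 0.01, lo=25.0, hi=27.5)
    print(f"{len(hits)} matches, {touched} of {len(blocks)} blocks decompressed")
```

On smooth data, where each block spans a narrow range of values, most blocks fail the overlap test for a narrow query and are never decompressed. That is the effect the abstract attributes to the block-level value-range index; a tree-structured index would additionally avoid visiting every block's entry.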