VLDB 2020

PIDS: Attribute Decomposition for Improved Compression and Query Performance in Columnar Storage

Hao Jiang, Chunwei Liu, Qi Jin, John Paparrizos, Aaron J. Elmore

Abstract

We propose PIDS (Pattern Inference Decomposed Storage), a storage method that decomposes string attributes in columnar stores. Using an unsupervised approach, PIDS identifies common patterns in string attributes from relational databases and uses each discovered pattern to split the attribute into sub-attributes. This has two benefits. First, by storing and encoding each sub-attribute individually, PIDS achieves a compression ratio comparable to Snappy and Gzip. Second, by decomposing the attribute, PIDS can push many query operators down to the sub-attributes, which minimizes I/O and potentially expensive comparison operations and so speeds up the execution of those operators.
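The decomposition idea can be illustrated with a minimal Python sketch. It is not the PIDS implementation: the pattern below is written by hand rather than inferred, and the names `PATTERN`, `decompose`, and `rows_with_prefix`, as well as the sample values, are hypothetical. The sketch only shows how a string column might be split into sub-attribute columns and how an equality predicate could then be evaluated on a single sub-attribute instead of the full strings.

```python
import re

# Hypothetical hand-written pattern: letters, then two numeric fields.
# PIDS infers such patterns automatically; that step is not shown here.
PATTERN = re.compile(r"([A-Z]+)-(\d+)-(\d+)")


def decompose(values):
    """Split each string into sub-attribute columns.

    Values that do not match the pattern are kept in a separate
    outlier map (an assumption of this sketch) and leave None
    placeholders in the sub-attribute columns.
    """
    columns = [[] for _ in range(PATTERN.groups)]
    outliers = {}
    for row, value in enumerate(values):
        match = PATTERN.fullmatch(value)
        if match is None:
            outliers[row] = value
            for column in columns:
                column.append(None)
            continue
        for column, part in zip(columns, match.groups()):
            column.append(part)
    return columns, outliers


def rows_with_prefix(columns, prefix):
    """Evaluate an equality predicate on the first sub-attribute only."""
    return [row for row, part in enumerate(columns[0]) if part == prefix]


values = ["MIR-0042-17", "MIR-0043-02", "USA-0001-99", "unknown"]
columns, outliers = decompose(values)
matching_rows = rows_with_prefix(columns, "MIR")
```

In this sketch the predicate touches only the first sub-attribute column, so the remaining columns need not be read at all. Each sub-attribute column could also be encoded separately (for example, dictionary encoding for the letter field), which is the intuition behind the compression claim.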