VLDB2020
DeltaPQ: Lossless Product Quantization Code Compression for High Dimensional Similarity Search
Runhui Wang, Dong Deng
被引用 31 次
摘要
High dimensional data is ubiquitous and plays an important role in many applications. However, the size of high dimensional data is usually excessively large. To alleviate this problem, in this paper, we develop novel techniques to compress and search high dimensional data. Specifically, we first apply vector quantization, a classical lossy data compression method. It quantizes a high dimensional vector to a sequence of small integers, namely the quantization code. Then, we propose a novel lossless compression algorithm, DeltaPQ, to further compress the quantization codes. DeltaPQ organizes the quantization codes in a tree structure and stores the differences between two quantization codes rather than the original codes. Among the exponential number of possible tree structures, we develop an efficient algorithm, whose time and space complexity are linear to the number of codes, to find the one with optimal compression ratio. The approximate nearest neighbor search query can be processed directly on the compressed data with small space overhead in a few bytes. Many similarity measures can be supported, such as inner product, cosine similarity, Euclidean distance, and Lp-norm. Experimental results on five large-scale real-world datasets show that DeltaPQ achieves a compression ratio of up to 5 (and often greater than 2) on the quantization codes whereas the state-of-art general-purpose lossless compression algorithms barely work.