VLDB 2025

QStore: Quantization-Aware Compressed Model Storage

Raunak Shah, Zhaoheng Li, Yongjoo Park


Abstract

Modern applications commonly leverage large, multi-modal foundation models in complex workflows that demand the storage and usage of similar models in multiple precisions. A straightforward approach is to maintain a separate file for each model precision (e.g., INT8, BF16), which is indeed the approach taken by model providers such as HuggingFace and Ollama. However, this approach incurs excessive storage costs because a higher-precision model (e.g., BF16) is a superset of a lower-precision model (e.g., INT8) in terms of information. Unfortunately, maintaining only the higher-precision model and requiring users to convert the precision dynamically is not desirable either, because every user of a lower-precision model must then pay the cost of downloading the higher-precision model and converting its precision.

In this paper, we present QStore, a unified, lossless compression format for efficiently storing a model in two (high and low) precisions simultaneously. Instead of storing the low- and high-precision models separately, QStore stores the low-precision model together with only the residual information needed to reconstruct the high-precision model. This residual information is significantly smaller than the original high-precision model, which yields high storage cost savings. Moreover, QStore does not compromise model loading speed: the low-precision model can still be loaded quickly, while the high-precision model can be reconstructed efficiently by merging the low-precision data and the residual with QStore's lightweight decoding. We evaluate QStore on multiple precisions of popular foundation models and show that it reduces overall storage cost by up to 2.2× while enabling up to 1.7× faster model saving and up to 1.8× faster model loading compared with existing approaches.
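
To make the "low-precision model plus residual" idea concrete, the following is a minimal toy sketch and not QStore's actual encoding, which the abstract does not specify. It assumes symmetric absmax INT8 quantization, uses IEEE float16 as a stand-in for BF16 (Python's `struct` module has no BF16 format), and takes the residual to be the XOR of the original and dequantized bit patterns. All function names (`f16_bits`, `encode`, `load_low`, `load_high`) are hypothetical.

```python
import struct

def f16_bits(x: float) -> int:
    """16-bit IEEE half-precision pattern of x, as an unsigned integer."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def bits_f16(b: int) -> float:
    """Inverse of f16_bits: decode a 16-bit pattern to a Python float."""
    return struct.unpack("<e", struct.pack("<H", b))[0]

def to_f16(x: float) -> float:
    """Round x to the nearest float16-representable value."""
    return bits_f16(f16_bits(x))

def encode(weights):
    """Split float16 weights into (scale, int8 codes, residual bit patterns).

    Assumption: symmetric absmax quantization; the residual is the XOR
    between the true bit pattern and the dequantized bit pattern.
    """
    scale = max(abs(w) for w in weights) / 127 or 1.0  # 1.0 if all zeros
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    residual = [f16_bits(w) ^ f16_bits(qi * scale) for w, qi in zip(weights, q)]
    return scale, q, residual

def load_low(scale, q):
    """Low-precision path: the INT8 codes are usable without the residual."""
    return scale, q

def load_high(scale, q, residual):
    """High-precision path: merge the INT8 codes with the residual, bit-exactly."""
    return [bits_f16(f16_bits(qi * scale) ^ r) for qi, r in zip(q, residual)]

# Usage: the high-precision weights are recovered exactly.
weights = [to_f16(w) for w in [0.1, -0.25, 0.5, 0.003, -0.77]]
scale, q, residual = encode(weights)
assert load_high(scale, q, residual) == weights
```

In this toy the residual still occupies 16 bits per weight before any compression; the sketch only demonstrates that the two stored parts suffice for lossless reconstruction, and says nothing about the compression ratios reported above.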