ICLR2025
QERA: an Analytical Framework for Quantization Error Reconstruction
Cheng Zhang, Jeffrey T. H. Wong, Can Xiao, George Anthony Constantinides, Yiren Zhao
Abstract
The growing number of parameters and computational demands of large language models (LLMs) present significant challenges for their efficient deployment. Recently, there has been increasing interest in quantizing weights to extremely low precision while offsetting the resulting error with low-rank, high-precision error reconstruction terms. The combination of quantization and low-rank approximation is now popular in both adapter-based, parameter-efficient fine-tuning methods such as LoftQ (Li et al., 2023) and low-precision inference techniques including ZeroQuant-V2 (Yao et al., 2023). Usually, the low-rank terms are calculated via the singular value decomposition (SVD) of the weight quantization error, minimizing the Frobenius and spectral norms of the weight approximation error. Recent methods like LQ-LoRA (Guo et al., 2023) and LQER (Zhang et al., 2024a) introduced hand-crafted heuristics to minimize errors in layer outputs (activations) rather than weights, resulting in improved quantization results. However, these heuristic-based methods lack an analytical solution to guide the design of quantization error reconstruction terms. In this paper, we revisit this problem and formulate an analytical framework, named Quantization Error Reconstruction Analysis (QERA), and offer a closed-form solution to the problem. We show that QERA benefits both existing low-precision fine-tuning and inference methods: QERA achieves a fine-tuned accuracy gain of $\Delta_{\text{acc}} = 6.05\%$ for 2-bit RoBERTa-base on GLUE compared to LoftQ, and obtains $\Delta_{\text{acc}} = 2.97\%$ higher post-training quantization accuracy for 4-bit Llama-3.1-70B compared to ZeroQuant-V2 and $\Delta_{\text{ppl}} = -0.28$ lower perplexity on WikiText2 compared to LQER. We open-source our code and models at github.com/ChengZhang-98/QERA.
Published as a conference paper at ICLR 2025
Works such as ZeroQuant-V2 (Yao et al., 2023) and LQER (Zhang et al., 2024a) have shown that adding a high-precision low-rank component, with rank as low as 8 or 32, can recover considerable model performance for 3- or 4-bit weight quantization.
Although both quantized parameter-efficient fine-tuning (QPEFT) and post-training quantization (PTQ) methods have demonstrated substantial performance improvements in lowering the computational overhead of LLMs, a theoretical analysis of quantization error reconstruction is lacking. Usually, $A_k$ and $B_k$ are calculated by applying truncated singular value decomposition (SVD) to the weight quantization error $(W - \widetilde{W})$, minimizing the Frobenius and spectral norms of the weight approximation error. However, recent work on activation-aware quantization and knowledge distillation implies that minimizing layer output error may lead to a greater performance gain than minimizing weight approximation error (Lin et al., 2024; Liu et al., 2023a; Shao et al., 2023). Besides the unsettled minimization objective, it has remained unclear whether there exists a theoretically optimal solution for the values of $A_k$ and $B_k$, and if so, how one can solve for it.
A better, theoretically grounded initialization of $A_k$ and $B_k$ brings direct benefits for both QPEFT and PTQ. In QPEFT, the initialization of LoRA (Hu et al., 2021), which uses element-wise Gaussian random values for $A_k$ and zeros for $B_k$, struggles under aggressive quantization since the quantization error can derail fine-tuning. In PTQ, the quantized model performance depends on how the low-rank terms are computed, given a specific quantization function $q(\cdot)$ and rank $k$.
In this paper, we aim to provide an analytical framework for the quantization error reconstruction problem. To demonstrate the effectiveness of our theoretical framework, we further apply our analytical solutions to state-of-the-art QPEFT and PTQ methods and show significant performance improvements under the same computational budget.
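The conventional SVD-based computation of the low-rank terms described above can be sketched in a few lines of NumPy. This is a minimal illustration of the baseline, not the QERA solution: `quantize_uniform` is a hypothetical round-to-nearest stand-in for $q(\cdot)$, the layer convention $y = xW$ and the random calibration inputs `x` are assumptions, and all function names are made up for this example.

```python
import numpy as np


def quantize_uniform(w, bits=3):
    """Hypothetical stand-in for q(.): symmetric per-tensor round-to-nearest."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def svd_low_rank_terms(w, w_q, k):
    """Rank-k factors A_k, B_k from the truncated SVD of the error (W - W_q)."""
    u, s, vt = np.linalg.svd(w - w_q, full_matrices=False)
    a_k = u[:, :k] * s[:k]  # (m, k): top-k left singular vectors, scaled
    b_k = vt[:k, :]         # (k, n): top-k right singular vectors
    return a_k, b_k


def layer_output_error(x, w, w_approx):
    """Frobenius norm of y - y_approx for an assumed linear layer y = x @ W."""
    return np.linalg.norm(x @ w - x @ w_approx)


rng = np.random.default_rng(0)
w = rng.standard_normal((64, 48))   # toy weight matrix
x = rng.standard_normal((16, 64))   # toy calibration inputs (assumption)
k = 8

w_q = quantize_uniform(w, bits=3)
a_k, b_k = svd_low_rank_terms(w, w_q, k)
w_rec = w_q + a_k @ b_k             # quantized weight plus low-rank correction

weight_err_before = np.linalg.norm(w - w_q)
weight_err_after = np.linalg.norm(w - w_rec)
output_err_after = layer_output_error(x, w, w_rec)

print(f"weight error: {weight_err_before:.4f} -> {weight_err_after:.4f}")
print(f"layer output error with correction: {output_err_after:.4f}")
```

By the Eckart-Young theorem, this choice of $A_k B_k$ is the best rank-$k$ approximation of $(W - \widetilde{W})$ in the Frobenius and spectral norms, which is why the weight error always shrinks here. The sketch only reports the layer output error; nothing in the construction uses `x`, so it offers no guarantee about that quantity.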
Specifically, our contributions are as follows:
• We show that the commonly used objective for solving the quantization error reconstruction problem in prior work, i.e., minimizing the weight approximation error (e.g., $\|W - \widetilde{W}\|_p$), does not guarantee a reduced model output error. Instead, we show that minimizing the layer output error (e.g., $\|y - \widetilde{y}\|_p$) is closely related to minimizing the model output error.
• We derive the analytical solution to the low-rank terms $A_k$ and $B_k$ by minimizing the layer output error. We demonstrate that under a statistical assumption, this solution can be found in a particularly computationally efficient manner, also explaining the success of LQER.
• We empirically demonstrate the effectiveness of our solutions by applying them to state-of-the-art QPEFT and PTQ methods. Our analytical framework, QERA, significantly improves the performance of these methods. For example, QERA achieves $\Delta_{\text{acc}} = 6.05\%$ higher accuracy for 2-bit RoBERTa-base on GLUE compared to LoftQ, improving the fine-tuning accuracy and efficiency. Moreover, QERA obtains $\Delta_{\text{acc}} = 2.97\%$ higher accuracy than ZeroQuant-V2 when quantizing LLaMA-3-70B to 4 bits, averaged across six tasks. This