NeurIPS 2025

Quantization Error Propagation: Revisiting Layer-Wise Post-Training Quantization

Yamato Arai, Yuma Ichikawa

Cited by 22

Abstract

Layer-wise post-training quantization (PTQ) is a promising technique for compressing large language models (LLMs) because it is simple and effective and requires no retraining. However, recent progress in this area is saturating, underscoring the need to revisit its core limitations and explore further improvements. We address this challenge by identifying a key limitation of existing layer-wise PTQ methods: quantization errors grow across layers, which significantly degrades performance, particularly in low-bit regimes. To address this fundamental issue, we propose Quantization Error Propagation (QEP), a general, lightweight, and scalable framework that enhances layer-wise PTQ by explicitly propagating quantization errors and compensating for the accumulated errors. QEP also offers a tunable propagation mechanism that prevents overfitting and controls computational overhead, enabling the framework to adapt to various architectures and resource budgets. Extensive experiments on several LLMs demonstrate that QEP-enhanced layer-wise PTQ achieves substantially higher accuracy than existing methods. Notably, the gains are most pronounced in the extremely low-bit quantization regime.

Introduction

Large Language Models (LLMs) have achieved impressive performance in various natural language processing tasks, including open-ended text generation, multi-step reasoning, and dialogue modeling. Notable examples include ChatGPT [Achiam et al., 2023] and the Llama family [Touvron et al., 2023, Grattafiori et al., 2024]. However, deploying LLMs cost-effectively remains difficult because of their substantial memory usage and computational demands [Chen et al., 2023]. This limitation is especially critical for edge computing and latency-sensitive applications. To address these challenges, a wide range of model compression techniques, such as quantization [