ICML 2025

BoA: Attention-aware Post-training Quantization without Backpropagation

Junhan Kim, Ho-Young Kim, Eulrang Cho, Chungman Lee, Joonyoung Kim, Yongkweon Jeon

Abstract

Post-training Quantization (PTQ)

As model complexity has grown explosively, the performance of LLMs has advanced. This growth in scale has brought a corresponding increase in computational cost, so compression is required. Quantization is a promising solution and an essential step for deploying LLMs on resource-constrained devices that mainly support fixed-point arithmetic. Given the model complexity and the resources required (e.g., training costs and available datasets), quantization-aware training (QAT) is not practical for compressing LLMs with billions of parameters. Recent studies have therefore focused more on PTQ.

Additional Processing Time Incurred by Attention-aware Hessians

Since the proposed attention-aware Hessians model the row-wise dependency, the quantization error of a given row can be compensated for by updating the other rows. To do so, the rows must be quantized sequentially, not simultaneously. For example, the second row can be quantized only after it has been updated to compensate for the quantization error of the first row. This sequential processing is the source of the additional processing time.
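
The sequential, row-by-row compensation described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a generic Hessian-based (OBS/GPTQ-style) error-compensation update applied along the row dimension, a uniform rounding grid, and a symmetric positive-definite row Hessian. The names `quantize_rows_sequentially`, `round_to_grid`, `H_row`, and `step` are hypothetical and chosen only for illustration.

```python
import numpy as np


def round_to_grid(x, step):
    """Round values to the nearest multiple of `step` (a toy uniform quantizer)."""
    return np.round(x / step) * step


def quantize_rows_sequentially(W, H_row, step=0.1):
    """Quantize the rows of W one at a time, compensating each row's error.

    W     : (n_rows, n_cols) weight matrix.
    H_row : (n_rows, n_rows) symmetric positive-definite matrix that models
            the dependency between rows (an assumed stand-in for a row-wise
            Hessian; how it is built is outside this sketch).
    """
    W = np.asarray(W, dtype=np.float64).copy()
    Q = np.zeros_like(W)
    H_inv = np.linalg.inv(H_row)
    n_rows = W.shape[0]

    for i in range(n_rows):
        # Row i has already absorbed the compensation from rows 0..i-1.
        Q[i] = round_to_grid(W[i], step)
        err = (W[i] - Q[i]) / H_inv[i, i]

        # Update the not-yet-quantized rows to compensate for row i's error.
        W[i + 1:] -= np.outer(H_inv[i + 1:, i], err)

        # Remove row i from the inverse Hessian (Gaussian elimination step).
        H_inv[i + 1:, i + 1:] -= (
            np.outer(H_inv[i + 1:, i], H_inv[i, i + 1:]) / H_inv[i, i]
        )

    return Q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 6))
    A = rng.normal(size=(4, 16))
    H_row = A @ A.T + 1e-2 * np.eye(4)  # random SPD matrix for the demo
    print(quantize_rows_sequentially(W, H_row, step=0.25))
```

Because row `i + 1` depends on the updated value produced while processing row `i`, the loop cannot be parallelized across rows, which is where the extra processing time comes from. When `H_row` is diagonal, the rows are independent, no compensation is applied, and the sketch reduces to plain rounding.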