ICML 2025
BoA: Attention-aware Post-training Quantization without Backpropagation
Junhan Kim, Ho-Young Kim, Eulrang Cho, Chungman Lee, Joonyoung Kim, Yongkweon Jeon
Abstract
Post-training Quantization (PTQ)
The performance of LLMs has advanced with the explosive growth in model complexity, but this growth in scale has brought a corresponding increase in computational cost, so compression is required. Quantization is a promising solution and an essential step for deploying LLMs on resource-constrained devices that mainly support fixed-point arithmetic. Given the model complexity and the resources required (e.g., training cost and available datasets), quantization-aware training (QAT) is not practical for compressing LLMs with billions of parameters. Recent studies have therefore focused more on PTQ.

Additional Processing Time Incurred by Attention-aware Hessians
Because the proposed attention-aware Hessians model the row-wise dependency, the quantization error of one row can be compensated for by updating the other rows. To do so, the rows must be quantized sequentially, not simultaneously. For example, the second row is quantized only after it has been updated to compensate for the quantization error of the first row. This sequential processing is what incurs the additional processing time.
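
The sequential, row-by-row procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a generic OBS/GPTQ-style compensation rule applied along the row dimension, a fixed-scale round-to-nearest quantizer, and a symmetric positive-definite row-wise Hessian. The names `quantize_rtn`, `quantize_rows_sequentially`, `H_row`, `scale`, and `bits` are hypothetical and chosen only for illustration.

```python
import numpy as np


def quantize_rtn(w, scale=0.1, bits=4):
    """Uniform round-to-nearest quantizer (hypothetical stand-in)."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return np.clip(np.round(w / scale), qmin, qmax) * scale


def quantize_rows_sequentially(W, H_row, scale=0.1, bits=4):
    """Quantize the rows of W one at a time, compensating later rows.

    W:     (n_rows, n_cols) weight matrix.
    H_row: (n_rows, n_rows) symmetric positive-definite matrix that
           models the dependency between rows (assumed to be given).
    """
    W = np.array(W, dtype=np.float64)          # work on a copy
    H_inv = np.linalg.inv(np.asarray(H_row, dtype=np.float64))
    Q = np.zeros_like(W)

    for i in range(W.shape[0]):
        # Row i has already absorbed the errors of rows 0..i-1.
        Q[i] = quantize_rtn(W[i], scale, bits)

        # Quantization error of row i, scaled by the inverse-Hessian diagonal.
        err = (W[i] - Q[i]) / H_inv[i, i]

        # Update the not-yet-quantized rows to compensate for that error.
        col = H_inv[i + 1:, i].copy()
        W[i + 1:] -= np.outer(col, err)

        # Remove row i from the inverse Hessian (Gaussian elimination step).
        H_inv[i + 1:, i + 1:] -= np.outer(col, H_inv[i, i + 1:]) / H_inv[i, i]

    return Q
```

Each iteration depends on the updates made by all previous iterations, so the loop over rows cannot be run in parallel in this sketch. When `H_row` is the identity matrix, no row influences another and the result reduces to plain round-to-nearest.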