ICLR 2026

GlowQ: Group-Shared LOw-Rank Approximation for Quantized LLMs

Selim An, Il hong Suh, Yeseong Kim

Abstract

Quantization techniques such as BitsAndBytes (Dettmers et al., 2022), AWQ (Lin et al., 2024), and GPTQ (Frantar et al., 2022) are widely used as standard methods for deploying large language models, but they often degrade accuracy at low-bit representations, e.g., 4 bits. Low-rank correction methods (e.g., LQER (Zhang et al., 2024a), QERA (Zhang et al., 2024b), ASER (Zhao et al., 2025)) have been proposed to mitigate this issue; however, they restore all layers and insert error-correction modules into every decoder block, which increases latency and memory overhead. To address this limitation, we propose GlowQ, a group-shared low-rank approximation for quantized LLMs that caches a single shared right factor per input-sharing group and restores only the groups or layers that yield the highest accuracy benefit. GlowQ computes the high-precision projection once per input-sharing group and reuses it across the group's modules, reducing parameter and memory overhead while retaining the expressivity of layer-specific corrections. We also propose a selective variant, GlowQ-S, that applies the cached shared module only where it provides the largest benefit. Compared with strong baselines, our approach reduces TTFB by 5.6% and increases throughput by 9.6% on average, while reducing perplexity on WikiText-2 by 0.17% and increasing downstream accuracy by 0.42 percentage points. The selective model GlowQ-S further reduces latency, cutting TTFB by 23.4% and increasing throughput by 37.4%, while maintaining accuracy within 0.2 percentage points on average. Code is available at https://github.com/ahnselim/GlowQ.
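To make the idea of a shared right factor concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes that the modules in an input-sharing group (for example, the query, key, and value projections of one attention block) have their quantization errors stacked along the output dimension and factored with a truncated SVD, so that one right factor serves the whole group while each module keeps its own left factor. The quantizer `fake_quantize`, the function names, and the stacked-SVD fitting rule are all illustrative assumptions; the abstract does not specify how the factors are computed.

```python
import numpy as np


def fake_quantize(w, bits=4):
    """Symmetric per-row round-to-nearest quantizer (illustrative stand-in only)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero on all-zero rows
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def fit_group_shared_correction(weights, rank, bits=4):
    """Fit one shared right factor and per-module left factors for a group.

    weights: dict mapping a (hypothetical) module name to an array of shape
             (d_out_i, d_in); all modules consume the same input, so d_in matches.
    Returns (quantized, left, shared_right) where, for each module name n,
    weights[n] is approximated by quantized[n] + left[n] @ shared_right.
    """
    names = list(weights)
    quantized = {n: fake_quantize(weights[n], bits) for n in names}
    # Assumption: stack the per-module quantization errors along the output axis.
    stacked_error = np.concatenate([weights[n] - quantized[n] for n in names], axis=0)
    u, s, vt = np.linalg.svd(stacked_error, full_matrices=False)
    shared_right = vt[:rank]            # (rank, d_in), shared by the whole group
    left_all = u[:, :rank] * s[:rank]   # (sum of d_out_i, rank)
    left, start = {}, 0
    for n in names:
        stop = start + weights[n].shape[0]
        left[n] = left_all[start:stop]  # module-specific left factor
        start = stop
    return quantized, left, shared_right


def group_forward(x, quantized, left, shared_right):
    """Apply every module in the group to x, projecting x onto the shared factor once."""
    z = x @ shared_right.T  # high-precision projection, computed once per group
    return {n: x @ quantized[n].T + z @ left[n].T for n in quantized}
```

In this sketch the projection `z` is computed a single time and reused by every module in the group, which is the source of the savings the abstract describes; a per-layer scheme would instead compute one such projection per module. A selective variant in the spirit of GlowQ-S would simply skip the `z @ left[n].T` term for groups judged to benefit little, although the selection criterion is not given in the abstract and is not modeled here.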