ACL 2025
One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments
Ke Yi, Yuhui Xu, Heng Chang, Yuan Meng, Tong Zhang, Jia Li
Abstract
Large Language Models (LLMs) have advanced rapidly but impose significant memory demands. Quantization can alleviate this memory bottleneck, but current methods typically require lengthy training to recover accuracy at low bit widths. Consequently, deploying across scenarios with different resource constraints requires repeating that training for each scenario, which compounds the cost. It is therefore beneficial to train a single once-for-all (OFA) supernet that can supply optimal subnets for downstream applications. To extend the once-for-all setting to LLMs, we decouple the shared weights to mitigate interference between subnets and integrate Low-Rank adapters to improve training efficiency. Furthermore, we observe that traditional uniform sampling allocates training resources unevenly across subnets. We introduce a non-parametric scheduler that adjusts the sampling rate for each quantization configuration, achieving a more balanced allocation among subnets with varying demands. We validate the approach on the LLaMA families and Mistral in downstream evaluations, demonstrating high performance while significantly reducing deployment time when multiple scenarios must be served.

1 The work was conducted during Ke Yi's visit to The Hong Kong University of Science and Technology (Guangzhou). * denotes equal contribution. ‡ denotes program leader. † denotes equal advising.
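The sampling imbalance mentioned in the abstract can be illustrated with a small sketch. It is not the paper's scheduler. It assumes that a quantization configuration assigns each layer one bit width from a hypothetical candidate set `BITS = (2, 3, 4)`, and it compares per-layer uniform sampling against a hypothetical mixture sampler (`MIXTURE`, `sample_balanced`) introduced here purely for illustration. Under these assumptions, uniform sampling concentrates the average bit width near the middle of the range, so configurations with very low or very high average bit width are rarely drawn.

```python
import random

# Assumption: each layer is quantized to one of these bit widths.
BITS = (2, 3, 4)

# Hypothetical mixture of per-layer distributions over BITS:
# skewed low, uniform, and skewed high. Not taken from the paper.
MIXTURE = (
    (0.8, 0.1, 0.1),
    (1 / 3, 1 / 3, 1 / 3),
    (0.1, 0.1, 0.8),
)


def sample_uniform(num_layers, rng):
    """Draw each layer's bit width independently and uniformly."""
    return [rng.choice(BITS) for _ in range(num_layers)]


def sample_balanced(num_layers, rng):
    """Pick one mixture component, then draw every layer from it."""
    weights = rng.choice(MIXTURE)
    return rng.choices(BITS, weights=weights, k=num_layers)


def mean_bits(config):
    """Average bit width of one configuration."""
    return sum(config) / len(config)


def spread(sampler, num_layers=32, trials=2000, seed=0):
    """Return (min, max) of the average bit width over many draws."""
    rng = random.Random(seed)
    means = [mean_bits(sampler(num_layers, rng)) for _ in range(trials)]
    return min(means), max(means)


if __name__ == "__main__":
    print("uniform :", spread(sample_uniform))
    print("balanced:", spread(sample_balanced))
```

Running the script prints the range of average bit widths each sampler reaches. The mixture sampler covers a wider range, which is the kind of rebalancing a sampling scheduler aims for; the actual scheduler may work differently.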