ICML 2025
Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models
Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu
Abstract
Although State Space Models (SSMs) are emerging as an efficient alternative to Transformers, deploying SSMs on both cloud and edge devices is challenging because of limited resources. Model quantization reduces model size and leverages hardware acceleration, and recent efforts on SSM quantization have focused on optimizing a particular model or bit-width. However, different bit-widths are essential for different scenarios, such as W4A8 for boosting large-batch decoding speed and W4A16 for enhancing generation speed in short-prompt, single-user applications. We present Quamba2, which is compatible with W8A8, W4A8, and W4A16 for both Mamba1 and Mamba2 backbones, addressing the growing demand for SSM deployment. Building on the channel order preservation and activation persistence of SSMs, we propose an offline approach that quantizes the inputs of the linear recurrence to 8 bits by sorting and clustering the input x, combined with per-state-group quantization for the input-dependent parameters B and C. To ensure compute-invariance in the SSM output, we rearrange the weights offline according to the clustering sequence. We show that Quamba2-8B outperforms two state-of-the-art SSM quantization methods and delivers 1.3× and 3× speed-ups in the pre-filling and generation stages, respectively, while offering a 4× memory reduction with only a 1.6% average accuracy drop. The evaluation on MMLU shows the generalizability and robustness of our framework. The code and quantized models are released at the link.
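The sort-then-group idea and the weight rearrangement described above can be illustrated with a minimal sketch. This is not the released Quamba2 code: the function names, the toy calibration values, and the use of equal-size contiguous groups in place of a real clustering step are all assumptions made for illustration. It shows only that (a) giving channels with similar calibrated ranges their own 8-bit scale lowers rounding error compared with a single per-tensor scale, and (b) applying the same channel permutation to the activations and to the weights that consume them leaves the output unchanged.

```python
def symmetric_scale(max_abs, bits=8):
    """Scale mapping [-max_abs, max_abs] onto the signed integer range."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit
    return max_abs / qmax if max_abs > 0 else 1.0


def fake_quantize(value, scale, bits=8):
    """Quantize to a signed integer level, then map back to a float."""
    qmax = 2 ** (bits - 1) - 1
    level = max(-qmax, min(qmax, round(value / scale)))
    return level * scale


def sort_and_group(channel_max, num_groups):
    """Sort channels by calibrated maximum and split them into groups.

    Equal-size contiguous groups stand in for clustering here
    (an assumption of this sketch, not the paper's procedure).
    """
    order = sorted(range(len(channel_max)), key=lambda c: channel_max[c])
    size = -(-len(order) // num_groups)  # ceiling division
    groups = [order[i:i + size] for i in range(0, len(order), size)]
    scales = [symmetric_scale(max(channel_max[c] for c in g)) for g in groups]
    return order, groups, scales


# Hypothetical calibration statistics and one activation vector.
channel_max = [0.1, 8.0, 0.2, 7.5]
x = [0.05, -6.0, 0.15, 7.0]
w = [1.0, 0.5, -2.0, 0.25]  # weights of a layer that consumes x

# Baseline: a single scale for the whole tensor.
tensor_scale = symmetric_scale(max(channel_max))
tensor_err = sum(abs(v - fake_quantize(v, tensor_scale)) for v in x)

# Sorted and grouped: one scale per group of similar-range channels.
order, groups, scales = sort_and_group(channel_max, num_groups=2)
group_err = sum(
    abs(x[c] - fake_quantize(x[c], s))
    for g, s in zip(groups, scales)
    for c in g
)

# Compute-invariance: permuting activations and weights together
# gives the same dot product (up to floating-point summation order).
y_original = sum(w[c] * x[c] for c in range(len(x)))
y_permuted = sum(w[c] * x[c] for c in order)

print(f"per-tensor error: {tensor_err:.4f}, grouped error: {group_err:.4f}")
```

Because the permutation is fixed from calibration data, it can be applied to the weights once, offline, so no reordering is needed at inference time in this sketch.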