NeurIPS 2023

QuIP: 2-Bit Quantization of Large Language Models With Guarantees

Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, Christopher De Sa

Cited 372 times

Abstract

This work studies post-training parameter quantization in large language models (LLMs). We introduce quantization with incoherence processing (QuIP), a new method based on the insight that quantization benefits from incoherent weight and Hessian matrices, i.e., from the weights being even in magnitude and the directions in which it is important to round them accurately being unaligned with the coordinate axes. QuIP consists of two steps: (1) an adaptive rounding procedure minimizing a quadratic proxy objective; (2) efficient pre- and post-processing that ensures weight and Hessian incoherence via multiplication by random orthogonal matrices. We complement QuIP with the first theoretical analysis for an LLM-scale quantization algorithm, and show that our theory also applies to an existing method, OPTQ. Empirically, we find that our incoherence preprocessing improves several existing quantization algorithms and yields the first LLM quantization methods that produce viable results using only two bits per weight. Our code can be found at https://github.com/Cornell-RelaxML/QuIP.

Remarks. The number of rows being quantized is m, and each quantization method operates across the n entries of each row. For all rounding methods described by Eq. (2), and for all positive semidefinite H, choosing Q to be nearest rounding achieves the same worst-case proxy loss as stochastic rounding, but a better average proxy loss. Note that the worst case for comparing LDLQ against these baselines occurs when H is diagonal (see Theorem 1 and Lemma 3). Assuming incoherence, as we do, is a natural way to exclude such cases.

Quantization With Incoherence Processing: Incoherence Processing Step

Next, we leverage the above incoherence analysis to introduce incoherence processing, the second step of the QuIP algorithm. Our strategy will be to pre-process weight and Hessian matrices to ensure
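
The second step named in the abstract, pre- and post-processing by multiplication with random orthogonal matrices, can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the repository linked above for that). The function names, the dense QR-based sampling of the orthogonal matrices, the convention W -> U W V^T and H -> V H V^T, the toy sizes, and the use of plain `np.round` as the quantizer are all assumptions made for illustration. The sketch shows two properties that follow from orthogonality alone: the quadratic proxy loss tr((W_hat - W) H (W_hat - W)^T) is unchanged by the transform, and a single outlier weight gets spread across many entries.

```python
import numpy as np


def random_orthogonal(n, rng):
    """Sample an n x n orthogonal matrix from the QR factorization of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    # Flip column signs so the diagonal of R would be positive; Q stays orthogonal.
    return q * np.sign(np.diag(r))


def proxy_loss(w_hat, w, h):
    """Quadratic proxy objective tr((W_hat - W) H (W_hat - W)^T)."""
    d = w_hat - w
    return np.trace(d @ h @ d.T)


def incoherence_preprocess(w, h, rng):
    """Map W (m x n) to U W V^T and H (n x n) to V H V^T (assumed convention)."""
    m, n = w.shape
    u = random_orthogonal(m, rng)
    v = random_orthogonal(n, rng)
    return u @ w @ v.T, v @ h @ v.T, u, v


def incoherence_postprocess(w_hat_t, u, v):
    """Undo the transform on the quantized weights: U^T W_hat V."""
    return u.T @ w_hat_t @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n = 8, 16
    w = 0.01 * rng.standard_normal((m, n))
    w[0, 0] = 100.0                      # one outlier weight
    x = rng.standard_normal((64, n))
    h = x.T @ x                          # a PSD stand-in for the Hessian

    w_t, h_t, u, v = incoherence_preprocess(w, h, rng)
    print("max |W| before:", np.abs(w).max(), "after:", np.abs(w_t).max())

    w_hat_t = np.round(w_t)              # toy quantizer: nearest integer
    w_hat = incoherence_postprocess(w_hat_t, u, v)
    print("proxy loss, transformed:", proxy_loss(w_hat_t, w_t, h_t))
    print("proxy loss, original:   ", proxy_loss(w_hat, w, h))
```

Because U and V are orthogonal, any rounding procedure can be run on the transformed pair and its result mapped back without changing the proxy loss it achieved. Dense orthogonal matrices are used here only for clarity; at LLM scale they would be costly to store and apply, which is why the abstract stresses that the processing must be efficient.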