ICLR 2026

Arbitrary-Order Block SignSGD for Memory-Efficient LLM Fine-Tuning

Yijie Zhou, Shi Pu

Abstract

We propose ABSignSGD, a block-coordinate variant of sign-based descent with flexible block selection that enables memory- and runtime-efficient full-parameter fine-tuning of large language models. We present a unified convergence analysis under mild conditions, covering both the base method and a majority-vote extension for distributed training. The latter improves communication efficiency by aggregating only gradient signs rather than averaging full gradients. Experiments on Qwen3-8B, Llama3-8B, and Qwen3-32B, spanning mathematical reasoning and general instruction-following tasks, show that ABSignSGD converges faster per iteration and delivers superior downstream performance while reducing both runtime and memory usage compared to existing methods. Ablation studies further indicate that the memoryless sign-based update naturally complements block-wise updates, explaining the method's strong empirical performance.

* Corresponding author. The code is available at https://github.com/yijiezcn/ABSignSGD.
† Excludes the 2M GB of half-precision weights stored by all methods.
‡ For low-rank projection methods, the original papers omit communication budgets; sending full gradients costs 4M GB, orders of magnitude more than the other methods, and even low-rank gradients remain comparable to LoRA and far above ABSignSGD.
§ A double checkmark denotes additional runtime speedup from arbitrary-order block updates.

CONTRIBUTIONS

(i) We introduce ABSignSGD, a block-coordinate variant of SignSGD that enables arbitrary-order block updates, allowing us to tailor the update policy for maximal efficiency (e.g., depth-biased updates; see Contribution (iii)). This design delivers substantial memory and runtime savings while preserving competitive convergence and downstream performance. We further extend the method to distributed training with ABSignSGD-MV, which aggregates only gradient signs via majority vote, thereby achieving extreme communication efficiency.
(ii) We establish theoretical convergence guarantees under mild assumptions, providing a unified analysis for ABSignSGD and ABSignSGD-MV. Specifically, both methods achieve an $O(1/\sqrt{K})$ convergence rate under arbitrary block selection schemes, provided the update intervals are bounded.
(iii) We introduce a depth-biased update that prioritizes deeper layers, providing runtime speedup without sacrificing performance. Extensive experiments on fine-tuning Qwen3-8B and Llama3-8B for mathematical reasoning and instruction-following show that ABSignSGD achieves the lowest memory footprint, fastest runtime, and superior downstream performance among memory-efficient optimizers. A targeted ablation study further pinpoints the factors driving its effectiveness.

* Note: In the event of a tie (as seen in Step 3 where T1 = 6 and T4 = 6), we prioritize the shallower block to ensure coverage, though any consistent tie-breaking rule works.
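
To make the two ingredients named in the contributions concrete, the following is a minimal sketch of a block-wise sign update and an element-wise majority vote over sign vectors. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy quadratic objective, the block names, the step size, and the tie handling in the vote (a tie yields 0, i.e. no update) are all hypothetical choices made for this example, and the block order is simply supplied by the caller rather than produced by the depth-biased policy.

```python
from typing import Dict, List, Sequence, Tuple

Params = Dict[str, List[float]]


def sign(x: float) -> int:
    """Return -1, 0, or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def block_sign_step(params: Params, block: str,
                    grad: Sequence[float], lr: float) -> None:
    """Update one block in place using only the sign of its gradient.

    No optimizer state (momentum, second moments) is kept, so the update
    is memoryless; all other blocks are left untouched.
    """
    params[block] = [p - lr * sign(g) for p, g in zip(params[block], grad)]


def majority_vote(worker_signs: Sequence[Sequence[int]]) -> List[int]:
    """Element-wise majority vote over per-worker sign vectors.

    Assumption for this sketch: a tie produces 0 for that coordinate.
    """
    return [sign(sum(column)) for column in zip(*worker_signs)]


# --- Hypothetical toy problem: f(x) = 0.5 * ||x||^2, split into blocks ---

def toy_grad(params: Params, block: str) -> List[float]:
    """Gradient of the toy objective restricted to one block (equals x)."""
    return list(params[block])


def toy_loss(params: Params) -> float:
    return 0.5 * sum(p * p for values in params.values() for p in values)


def run_toy(order: Sequence[str], lr: float = 0.1) -> Tuple[Params, List[float]]:
    """Apply block sign steps in the caller-supplied (arbitrary) order."""
    params = {"layer0": [1.0, -1.0], "layer1": [0.5, 2.0], "layer2": [-2.0, 0.3]}
    losses = [toy_loss(params)]
    for block in order:
        block_sign_step(params, block, toy_grad(params, block), lr)
        losses.append(toy_loss(params))
    return params, losses


if __name__ == "__main__":
    # Any visiting order is allowed; this one revisits the last block.
    _, losses = run_toy(["layer2", "layer0", "layer1", "layer2"])
    print([round(v, 3) for v in losses])
    # Three workers vote on the sign of each of three coordinates.
    print(majority_vote([[1, -1, 1], [1, 1, -1], [-1, -1, 1]]))
```

In the distributed setting described in Contribution (i), each worker would contribute the signs of its local gradient for the selected block, and the voted sign vector would take the place of `sign(g)` in the step; the sketch keeps the two functions separate so that either can be read on its own.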