ICLR2026

DeMo: Decoupled Momentum Optimization

Bowen Peng, Lizhang Chen, Baiyu Su, Jeffrey Quesnelle, Diederik P Kingma, qiang liu

7 citations

Abstract

Scaling neural network training increasingly depends on synchronous dataparallelism, yet full-precision gradient all-reduce imposes a severe communication bottleneck. We propose Decoupled Momentum Optimization (DeMo), a drop-in replacement for any momentum-based optimizers that significantly reduces the communication bandwidth while maintaining convergence. DeMo (i) decouples local momentum updates, (ii) applies a fast orthonormal transform (e.g., DCT) followed by top-k sparsification, and (iii) reuses the momentum buffer as error feedback via momentum subtraction. This design reduces perstep communication by up to two orders of magnitude with minimal computational overhead. Experiments on 300M-and 1B-parameter DeMo language models show DeMo transmits up to 85× less data per GPU than AdamW-DDP while achieving comparable loss and accuracy. DeMo is topology-agnostic and enables training across multi-datacenter or Ethernet-based setups. Code is available at https://github.com/bloc97/DeMo .