ICLR 2025
Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs
Shane Bergsma, Nolan Simran Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, Joel Hestness
Abstract
LLMs are commonly trained with a learning rate (LR) warmup, followed by cosine decay to 10% of the maximum (10× decay). In a large-scale empirical study, we show that under an optimal peak LR, a simple linear decay-to-zero (D2Z) schedule consistently outperforms other schedules when training at compute-optimal dataset sizes. D2Z is superior across a range of model sizes, batch sizes, datasets, and vocabularies. Benefits increase as dataset size increases. Leveraging a novel interpretation of AdamW as an exponential moving average of weight updates, we show how linear D2Z optimally balances the demands of early training (moving away from initial conditions) and late training (averaging over more updates in order to mitigate gradient noise). In experiments, a 610M-parameter model trained for 80 tokens-per-parameter (TPP) using D2Z achieves lower loss than when trained for 200 TPP using 10× decay, corresponding to an astonishing 60% compute savings. Models such as Llama2-7B, trained for 286 TPP with 10× decay, could likely have saved a majority of compute by training with D2Z.

All the main experiments were run on Cerebras CS-3 systems.

We present a large-scale empirical study to determine which schedules work best in which situations, and why. We focus on both compute-efficient and over-trained models. According to Chinchilla scaling laws (Hoffmann et al., 2022), the fewest FLOPs to achieve a given loss is obtained when models are trained for around 20 tokens-per-parameter (TPP). It is also common to train for more than 20 TPP because smaller, over-trained models are cheaper to serve (Touvron et al., 2023a). Our experiments (across various model scales, vocabulary sizes, and dataset sources) reveal a consistent outcome: when all schedules use their optimal peak LR, linear decay-to-zero
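
To make the two schedules compared above concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a linear warmup to the peak LR followed by either linear decay-to-zero (D2Z) or cosine decay to 10% of the peak (10× decay). The function names, the 0-indexed step convention, and the warmup length are hypothetical choices made for illustration. The second helper reproduces the 60% figure under the assumption that, for a fixed model size, training compute scales in proportion to the number of training tokens.

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_steps, schedule="linear_d2z"):
    """Learning rate at a 0-indexed step (hypothetical helper, for illustration).

    Assumes linear warmup to peak_lr over warmup_steps, then decay over the
    remaining steps according to the chosen schedule.
    """
    if step < warmup_steps:
        # Linear warmup: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps

    # Fraction of the decay phase completed, clipped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min((step - warmup_steps) / decay_steps, 1.0)

    if schedule == "linear_d2z":
        # Linear decay-to-zero: LR falls in a straight line from peak_lr to 0.
        return peak_lr * (1.0 - progress)
    if schedule == "cosine_10x":
        # Cosine decay to 10% of the peak LR (the "10x decay" baseline).
        min_lr = 0.1 * peak_lr
        return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
    raise ValueError(f"unknown schedule: {schedule}")


def compute_savings(tpp_short, tpp_long):
    """Fraction of compute saved by training for tpp_short instead of tpp_long.

    Assumes the same model size and compute proportional to training tokens.
    """
    return 1.0 - tpp_short / tpp_long


if __name__ == "__main__":
    total, warmup, peak = 1000, 100, 1e-3  # illustrative values only
    for s in (0, 99, 100, 550, 1000):
        d2z = lr_at_step(s, total, peak, warmup, "linear_d2z")
        cos = lr_at_step(s, total, peak, warmup, "cosine_10x")
        print(f"step {s:4d}  D2Z {d2z:.2e}  cosine-10x {cos:.2e}")
    print(f"savings at 80 vs 200 TPP: {compute_savings(80, 200):.0%}")
```

The two schedules share the same peak and differ in where they end: D2Z finishes at exactly zero, while 10× decay finishes at one tenth of the peak. With the proportional-compute assumption, training for 80 TPP rather than 200 TPP uses 80/200 = 40% of the compute, which matches the 60% savings quoted in the abstract.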