ICML 2025
Learning Dynamics in Continual Pre-Training for Large Language Models
Xingjin Wang, Howe Tissue, Lu Wang, Linjing Li, Daniel Dajun Zeng
Abstract
Continual Pre-Training (CPT) is a popular and effective method for applying strong foundation models to specific downstream tasks. In this work, we explore the learning dynamics throughout the CPT process for large language models. We specifically focus on how general and downstream domain performance evolves at each training step, with performance measured by validation losses. We observe that the CPT loss curve fundamentally characterizes a transition from an initial pre-training trajectory to a new, domain-specific one, conceptualized as a shift between two hidden loss curves. This transition can be described by decoupling the effects of distribution shift and learning rate annealing. We derive a CPT scaling law that combines these two factors, enabling the prediction of loss at any (continual) training step and across various learning rate schedules. Our formulation provides a comprehensive understanding of several critical factors in CPT, including loss potential, peak learning rate, training steps, and replay ratio. Moreover, our approach can be adapted to optimize training hyper-parameters for different CPT goals, such as balancing general and domain-specific performance. Extensive experiments demonstrate that our scaling law holds across various CPT datasets and hyper-parameters.