ICML 2025

The Butterfly Effect: Neural Network Training Trajectories Are Highly Sensitive to Initial Conditions

Gül Sena Altintas, Devin Kwok, Colin Raffel, David Rolnick

Abstract

Neural network training is unstable: even when it succeeds in converging to a solution, it may not consistently reach the same solution. Prior work found that in the early (chaotic) phase of training, training (SGD) noise can cause the same network to diverge to disconnected minima [1, 2], as measured via barriers (Eq 1). Knowing whether training and fine-tuning are stable matters in practice: model averaging benefits from connected solutions, while ensembling benefits from diverse solutions.

• But how unstable is training, really? Is early-phase training stable to perturbations smaller than training noise, and is late-phase training unstable to perturbations larger than training noise?
• How does pre-training affect stability? Does stability depend on the amount of pre-training, and on the specific combination of pre-training and fine-tuning tasks?
• Are some model architectures, task domains, or hyperparameters more stable than others?
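To make the notion of a barrier concrete, here is a minimal sketch in pure Python. It assumes the definition common in the linear mode connectivity literature: the largest amount by which the loss along the straight line between two solutions exceeds the linear interpolation of the endpoint losses. Eq 1 is not reproduced in this section, so the exact form used by the authors may differ. The names `interpolate`, `loss_barrier`, `double_well`, and `bowl` are hypothetical, and the one-parameter toy losses stand in for a real network.

```python
from typing import Callable, List, Sequence

Params = Sequence[float]


def interpolate(theta_a: Params, theta_b: Params, alpha: float) -> List[float]:
    """Return alpha * theta_a + (1 - alpha) * theta_b, element-wise."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(theta_a, theta_b)]


def loss_barrier(
    loss_fn: Callable[[Params], float],
    theta_a: Params,
    theta_b: Params,
    num_points: int = 11,
) -> float:
    """Estimate the loss barrier on the line segment between two solutions.

    Assumed definition (may differ from the paper's Eq 1): the maximum over
    alpha in [0, 1] of
        loss(interpolated params) - (alpha * loss_a + (1 - alpha) * loss_b),
    evaluated on an evenly spaced grid of num_points values of alpha.
    """
    loss_a = loss_fn(theta_a)
    loss_b = loss_fn(theta_b)
    worst = 0.0  # the excess is exactly zero at both endpoints
    for i in range(num_points):
        alpha = i / (num_points - 1)
        mid_loss = loss_fn(interpolate(theta_a, theta_b, alpha))
        baseline = alpha * loss_a + (1.0 - alpha) * loss_b
        worst = max(worst, mid_loss - baseline)
    return worst


def double_well(theta: Params) -> float:
    """Toy loss with two separated minima at x = -1 and x = +1."""
    x = theta[0]
    return (x * x - 1.0) ** 2


def bowl(theta: Params) -> float:
    """Toy convex loss; any two points are connected without a barrier."""
    return theta[0] ** 2


if __name__ == "__main__":
    print("double well barrier:", loss_barrier(double_well, [-1.0], [1.0]))
    print("bowl barrier:", loss_barrier(bowl, [-1.0], [1.0]))
```

In this toy setting the two minima of the double well are separated by a positive barrier, while two points on the convex bowl have none. Under the assumed definition, the first case corresponds to "disconnected" solutions and the second to "connected" ones.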