ICLR 2026
Fantastic Pretraining Optimizers and Where to Find Them
Kaiyue Wen, David Leo Wright Hall, Tengyu Ma, Percy Liang
Cited by 61
Abstract
AdamW has long been the dominant optimizer in language model pretraining, despite numerous claims that alternative optimizers offer 1.4 to 2× speedup. We posit that two methodological shortcomings have obscured fair comparisons and hindered practical adoption: (i) unequal hyperparameter tuning and (ii) limited or misleading evaluation setups. To address these two issues, we conduct a systematic study of ten deep learning optimizers across four model scales (0.1B-1.2B parameters) and data-to-model ratios (1-8× the Chinchilla optimum). We find that fair and informative comparisons require rigorous hyperparameter tuning and evaluations across a range of model scales and data-to-model ratios, performed at the end of training. First, optimal hyperparameters for one optimizer may be suboptimal for another, making blind hyperparameter transfer unfair. Second, the actual speedup of many proposed optimizers over well-tuned baselines is lower than claimed and decreases with model size to only 1.1× for 1.2B-parameter models. Third, comparing intermediate checkpoints before reaching the target training budgets can be misleading, as rankings between two optimizers can flip during training due to learning rate decay. Through our thorough investigation, we find that all the fastest optimizers, such as Muon and Soap, use matrices as preconditioners, multiplying gradients with matrices rather than entry-wise scalars. However, the speedup of matrix-based optimizers is inversely proportional to model scale, decreasing from 1.4× over AdamW for 0.1B-parameter models to merely 1.1× for 1.2B-parameter models.

setting of training BERT and T5-like models. They also observe that the ordering of optimizers can flip during learning rate decay, and that AdamW remains competitive when properly tuned. Semenov et al. [2025] investigate the same topic of benchmarking optimizers for pre-training.
These two works agree on many high-level points, e.g., (i) non-zero weight decay and decaying to a small learning rate are essential for pretraining; (ii) variance-reduced AdamW variants such as Mars show non-trivial speedups over vanilla AdamW.

Comparison with concurrent work Semenov et al. [2025]

However, our results differ in the relative performance of matrix-level optimizers. Our paper shows that matrix-level optimizers such as Muon can achieve significant speedups over variance-reduced versions of AdamW, such as Mars, whereas their study finds that AdEMAMix and Mars outperform Muon. Our initial investigation suggests this discrepancy largely stems from the differences in the batch sizes used in experiments. Their most extensively tuned experiments on 130M models use batch sizes of only 0.1M and 0.02M tokens, whereas our experiments operate with tuned batch sizes of at least 0.4M tokens. This difference is likely a result of different hardware regimes. We leverage 128 TPU-v5lite chips (which are approximately equivalent to 12 H100 GPUs), where only larger batches (larger than 0.4M tokens) can fully utilize the parallelism in the compute. Their experiments, by contrast, appear to be primarily conducted on 1-8 H100 GPUs, where smaller batches might be preferable. Since Mars and AdEMAMix both perform gradient averaging and variance reduction, these methods are advantageous in their noise-dominated small-batch regime, whereas in our larger-batch setting these benefits diminish and matrix-level optimizers become more competitive. In their larger-scale experiments, Semenov et al. [2025] increase the model size to 720M parameters and the batch size to 1M tokens, which is closer to our setting.
However, we differ in hyperparameter tuning methodology: (i) we conduct more extensive sweeps over learning rates in our 520M-scale experiments and generally find higher values (4e-3 to 8e-3) than their choices (1e-3 to 2e-3), which are transferred from smaller scales and, for some optimizers, not tuned; and (ii) for Muon, we separately tune the learning rates for the embedding layers and the matrix weights, which also improves its performance. These differences highlight the sensitivity of optimizer benchmarking to hardware and tuning strategies, underscoring the importance of carefully controlled experimental design when comparing optimizer performance.

Methodology

In this section, we detail the experimental design and evaluation protocol that underpin our empirical investigation. In Section 3.1, we specify the general setup for all subsequent studies. We then describe our three-phase hyperparameter-tuning framework: Phase I (in Section 3.2) performs fine-grained coordinate-descent sweeps across multiple model sizes and data-to-model ratios to identify scaling-sensitive parameters; Phase II (in Section 3.3) refines these sensitive parameters on mid-scale settings and selects the most promising optimizers; and Phase III (in Section 3.4) extrapolates hyperparameter scaling laws to the 1.2-billion-parameter regime. Together, these protocols ens