ICLR 2025

The Optimization Landscape of SGD Across the Feature Learning Strength

Alexander B. Atanasov, Alexandru Meterez, James B. Simon, Cengiz Pehlevan

Abstract

We consider neural networks (NNs) where the final layer is down-scaled by a fixed hyperparameter $\gamma$. Recent work has identified $\gamma$ as controlling the strength of feature learning. As $\gamma$ increases, network evolution changes from "lazy" kernel dynamics to "rich" feature-learning dynamics, with a host of associated benefits including improved performance on common tasks. In this work, we conduct a thorough empirical investigation of the effect of scaling $\gamma$ across a variety of models and datasets in the online training setting. We first examine the interaction of $\gamma$ with the learning rate $\eta$, identifying several scaling regimes in the $\gamma$-$\eta$ plane which we explain theoretically using a simple model. We find that the optimal learning rate $\eta^*$ scales non-trivially with $\gamma$. In particular, $\eta^* \propto \gamma^2$ when $\gamma \ll 1$ and $\eta^* \propto \gamma^{2/L}$ when $\gamma \gg 1$ for a feed-forward network of depth $L$. Using this optimal learning rate scaling, we proceed with an empirical study of the under-explored "ultra-rich" $\gamma \gg 1$ regime. We find that networks in this regime display characteristic loss curves, starting with a long plateau followed by a drop-off, sometimes followed by one or more additional staircase steps. We find networks of different large $\gamma$ values optimize along similar trajectories up to a reparameterization of time. We further find that optimal online performance is often found at large $\gamma$ and could be missed if this hyperparameter is not tuned. Our findings indicate that analytical study of the large-$\gamma$ limit may yield useful insights into the dynamics of representation learning in performant models.
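
As a rough illustration of the two learning-rate scalings stated above, the following minimal Python sketch computes a learning rate from $\gamma$ and the depth $L$. The function name `scaled_lr`, the prefactor `base_lr`, and the hard switch between the two power laws at $\gamma = 1$ are assumptions made here for illustration only; the abstract states just the asymptotic behavior for $\gamma \ll 1$ and $\gamma \gg 1$, not the constants or the crossover.

```python
def scaled_lr(gamma: float, depth: int, base_lr: float = 1.0) -> float:
    """Hypothetical helper: learning rate following the stated power laws.

    eta* ~ gamma^2        for gamma << 1
    eta* ~ gamma^(2/L)    for gamma >> 1, with L = depth

    Assumptions (not from the paper): the two laws are joined at
    gamma = 1, and base_lr is an arbitrary tunable prefactor.
    """
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    if depth < 1:
        raise ValueError("depth must be at least 1")
    # Both branches equal base_lr at gamma = 1, so the rule is continuous.
    exponent = 2.0 if gamma < 1.0 else 2.0 / depth
    return base_lr * gamma ** exponent


if __name__ == "__main__":
    # Print the rule for a depth-3 network over several decades of gamma.
    for gamma in (0.01, 0.1, 1.0, 10.0, 100.0):
        print(f"gamma={gamma:g}  lr={scaled_lr(gamma, depth=3):.6g}")
```

In practice `base_lr` would be found by a sweep at one value of $\gamma$; the sketch only shows how a tuned value might then be carried to other values of $\gamma$ under the stated scalings.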