NeurIPS 2022

Fast Mixing of Stochastic Gradient Descent with Normalization and Weight Decay

Zhiyuan Li, Tianhao Wang, Dingli Yu

Cited by 16

Abstract

We prove the Fast Equilibrium Conjecture proposed by Li et al. [1], i.e., stochastic gradient descent (SGD) on a scale-invariant loss (e.g., using networks with various normalization schemes) with learning rate $\eta$ and weight decay factor $\lambda$ mixes in function space in $\tilde{O}(1/(\eta\lambda))$ steps, under two standard assumptions: (1) the noise covariance matrix is non-degenerate and (2) the minimizers of the loss form a connected, compact and analytic manifold. The analysis uses the framework of Li et al. [2] and shows that for every $T > 0$, the iterates of SGD with learning rate $\eta$ and weight decay factor $\lambda$ on the scale-invariant loss converge in distribution in $\ln(1 + T\lambda/\eta)/(4\eta\lambda)$ iterations as $\eta\lambda \to 0$ while satisfying $\eta \le O(\lambda) \le O(1)$. Moreover, the evolution of the limiting distribution can be described by a stochastic differential equation that mixes to the same equilibrium distribution for every initialization around the manifold of minimizers as $T \to \infty$.
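To get a feel for the scales involved, the iteration count quoted in the abstract, ln(1 + Tλ/η) / (4ηλ) with learning rate η and weight decay factor λ, can be evaluated numerically. The following is a minimal sketch, not code from the paper: the function names and the sample values of η, λ and T are hypothetical, and the result is only the value of the stated expression. The theorem itself is an asymptotic statement as ηλ → 0, and the Õ(1/(ηλ)) bound hides logarithmic factors and constants.

```python
import math


def iterations_to_converge(eta: float, lam: float, T: float) -> float:
    """Evaluate ln(1 + T*lam/eta) / (4*eta*lam), the expression from the abstract.

    eta: learning rate, lam: weight decay factor, T: time horizon of the
    limiting process. The function name and argument names are illustrative.
    """
    if eta <= 0 or lam <= 0 or T <= 0:
        raise ValueError("eta, lam and T must all be positive")
    # log1p(x) computes ln(1 + x) accurately even when x is small.
    return math.log1p(T * lam / eta) / (4.0 * eta * lam)


def mixing_scale(eta: float, lam: float) -> float:
    """The 1/(eta*lam) scale of the mixing time, ignoring log factors and constants."""
    if eta <= 0 or lam <= 0:
        raise ValueError("eta and lam must be positive")
    return 1.0 / (eta * lam)


if __name__ == "__main__":
    # Hypothetical values, chosen only so that eta <= lam <= 1.
    eta, lam = 1e-2, 1e-2
    for T in (1.0, 10.0, 100.0):
        steps = iterations_to_converge(eta, lam, T)
        print(f"T={T:>6}: about {steps:,.0f} iterations "
              f"(1/(eta*lam) = {mixing_scale(eta, lam):,.0f})")
```

With η and λ fixed, the count grows only logarithmically in T, while halving either η or λ at a fixed ratio Tλ/η doubles it, which is the 1/(ηλ) dependence the conjecture describes.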