NeurIPS2022

The alignment property of SGD noise and how it helps select flat minima: A stability analysis

Lei Wu, Mingze Wang, Weijie Su

被引用 58 次

摘要

The phenomenon that stochastic gradient descent (SGD) favors flat minima has played a critical role in understanding the implicit regularization of SGD. In this paper, we provide an explanation of this striking phenomenon by relating the particular noise structure of SGD to its linear stability (Wu et al., 2018) . Specifically, we consider training over-parameterized models with square loss. We prove that if a global minimum θ * is linearly stable for SGD, then it must satisfy , where H(θ * ) F , B, η denote the Frobenius norm of Hessian at θ * , batch size, and learning rate, respectively. Otherwise, SGD will escape from that minimum exponentially fast. Hence, for minima accessible to SGD, the sharpness-as measured by the Frobenius norm of the Hessian-is bounded independently of the model size and sample size. The key to obtaining these results is exploiting the particular structure of SGD noise: The noise concentrates in sharp directions of local landscape and the magnitude is proportional to loss value. This alignment property of SGD noise provably holds for linear networks and random feature models (RFMs), and is empirically verified for nonlinear networks. Moreover, the validity and practical relevance of our theoretical findings are also justified by extensive experiments on CIFAR-10 dataset. However, these works all make unrealistic (even wrong) over-simplifications of SGD noise (see the related work section for more details) in their analysis. In addition, instead of studying SGD, they all consider the continuous-time stochastic differential equation (SDE), which is a good modeling of SGD only in finite time and small learning rate (LR) regime [25] . It is generally unclear how this SDE modeling is relevant for understanding SGD with a large LR-a regime preferred in practice. Consequently, these works only provide intuitive and empirical analyses, lacking a quantitative characterization of when and how SGD favors flat minima. Another line of works [46, 29] relate the selection bias of SGD to the dynamical stability. In overparameterized case, all global minima are fixed points of SGD but their dynamical stabilities can be very different. At unstable minima, a small perturbation will drive SGD to leave away, whereas, for stable minima, SGD can stay around and even converge back after initial perturbations. Thus SGD prefers stable minima over unstable ones. Specifically, [46, 29] analyze the linear stability [1] of SGD, showing that a linearly stable minimum must be flat and uniform. Different from SDE-based analysis, this stability-based analysis is relevant for large-LR SGD and is even empirically accurate in predicting the properties of minima selected by SGD [46, 20, 8] . In this work, we follow the linear stability analysis in [46, 29] but take the particular geometry-aware structure of SGD noise into consideration. We establish a direct connection between linear stability and flatness, which allows us to obtain a quantitative characterization of how the learning rate and batch affect the flatness of minima accessible to SGD. In contrast, [46, 29] have to introduce the another quantity: non-uniformity together with flatness to characterize linear stability because of neglecting the noise structure. Setup Let (x i , y i ) n i=1 with x i ∈ R d , y i ∈ R K be the training set and f (•; θ) with θ ∈ R p be our model. The model size is defined to be the number of parameters p and in this paper. For simplicity, we will always assume K = 1 and the extension to the case of K > 1 is straightforward. Let L i (θ) = 1 2 |f (x i ; θ)y i | 2 be the fitting error at the i-th sample and L(θ) = 1 n n i=1 L i (θ) be the empirical risk. We shall focus on the over-parameterized case in the sense that inf θ L(θ) = 0. To minimize L(•), we consider the mini-batch SGD: