NeurIPS2023
Convergence of Adam Under Relaxed Assumptions
Haochuan Li, Alexander Rakhlin, Ali Jadbabaie
被引用 108 次
摘要
In this paper, we provide a rigorous proof of convergence of the Adaptive Moment Estimate (Adam) algorithm for a wide class of optimization objectives. Despite the popularity and efficiency of the Adam algorithm in training deep neural networks, its theoretical properties are not yet fully understood, and existing convergence proofs require unrealistically strong assumptions, such as globally bounded gradients, to show the convergence to stationary points. In this paper, we show that Adam provably converges to ǫ-stationary points with O(ǫ -4 ) gradient complexity under far more realistic conditions. The key to our analysis is a new proof of boundedness of gradients along the optimization trajectory of Adam, under a generalized smoothness assumption according to which the local smoothness (i.e., Hessian norm when it exists) is bounded by a sub-quadratic function of the gradient norm. Moreover, we propose a variance-reduced version of Adam with an accelerated gradient complexity of O(ǫ -3 ). not converge. That being said, the counter-examples depend on the hyper-parameters of Adam, i.e., they are constructed after picking the hyper-parameters. Therefore, it does not rule out the possibility of obtaining convergence guarantees for problem-dependent hyper-parameters, as also pointed out by (Shi et al., 2021; Zhang et al., 2022) . Many recent works have developed convergence analyses of Adam with various assumptions and hyperparameter choices. Zhou et al. (2018b) show Adam with certain hyper-parameters can work on the counterexamples of (Reddi et al., 2018) . De et al. (2018) prove convergence for general non-convex functions assuming gradients are bounded and the signs of stochastic gradients are the same along the trajectory. The analysis in (D'efossez et al., 2020) also relies on the bounded gradient assumption. Guo et al. (2021) assume the adaptive stepsize is upper and lower bounded by two constants, which is not necessarily satisfied unless assuming bounded gradients or considering the AdaBound variant (Luo et al., 2019) . (Zhang et al., 2022; Wang et al., 2022) consider very weak assumptions. However, they show either 1) "convergence" only to some neighborhood of stationary points with a constant radius, unless assuming the strong growth condition; or 2) convergence to stationary points but with a slower rate. Variants of Adam. After Reddi et al. ( 2018 ) pointed out the non-convergence issue with Adam, various variants of Adam that can be proved to converge were proposed (