ICLR2022

High Probability Bounds for a Class of Nonconvex Algorithms with AdaGrad Stepsize

Ali Kavis, Kfir Yehuda Levy, Volkan Cevher

51 citations

Abstract

In this paper, we propose a new, simplified high probability analysis of AdaGrad for smooth, non-convex problems. More specifically, we focus on a particular accelerated gradient (AGD) template (Lan, 2020) , through which we recover the original AdaGrad and its variant with averaging, and prove a convergence rate of O(1/ √ T ) with high probability without the knowledge of smoothness and variance. We use a particular version of Freedman's concentration bound for martingale difference sequences (Kakade & Tewari, 2008) which enables us to achieve the best-known dependence of log(1/δ) on the probability margin δ. We present our analysis in a modular way and obtain a complementary O(1/T ) convergence rate in the deterministic setting. To the best of our knowledge, this is the first high probability result for AdaGrad with a truly adaptive scheme, i.e., completely oblivious to the knowledge of smoothness and uniform variance bound, which simultaneously has best-known dependence of log(1/δ). We further prove noise adaptation property of AdaGrad under additional noise assumptions. * A Viterbi fellow This alternative perspective to adaptivity is crucial because most existing analysis, both for classical and adaptive methods, assume to have access to smoothness constant, bound on gradients (Reddi et al., 2018) and even noise variance (Ghadimi & Lan, 2013) . In practice, it is difficult, if not impossible, to compute or even estimate such quantities. For this purpose, in the setting of (P) we study a class of adaptive gradient methods that enable us to handle noisy gradient feedback without requiring the knowledge of the objective's smoothness modulus, noise variance or a bound on gradient norms. We summarize our contributions as follows: 1. We provide a modular, simple high probability analysis for AdaGrad-type adaptive methods. 2. We present the first optimal high probability convergence result of the original AdaGrad algorithm for non-convex smooth problems. Concretely, (a) we analyze a fully adaptive step-size, oblivious to Lipschitz constant and noise variance, (b) we obtain the best known dependence of log(1/δ) on the probability margin δ. (c) we show that under sub-Gaussian noise model, AdaGrad adapts to noise level with high probability, i.e, as variance σ → 0, convergence rate improves, 1/ √ T → 1/T . 3. We present new extensions of AdaGrad that include averaging and momentum primitives, and prove similar high probability bounds for these methods, as well. Concretely, we study a general adaptive template which individually recovers AdaGrad, AdaGrad with averaging and adaptive RSAG (Ghadimi & Lan, 2016) for different parameter choices. In the next section, we will provide a broad overview of related work with an emphasis on the recent developments. Section 3 formalizes the problem setting and states our blanket assumptions. Section 4 introduces the building blocks of our proposed proof technique while proving convergence results for AdaGrad. We generalize the convergence results of AdaGrad for a class of nonconvex, adaptive algorithms in Section 5. Finally, we present concluding remarks in the last section. RELATED WORK Adaptive methods for stochastic optimization As an extended version of the online (projected) GD (Zinkevich, 2003 ), AdaGrad (Duchi et al., 2011) is the pioneering work behind most of the contemporary adaptive optimization algorithms Adam, AmsGrad and RmsProp (Tieleman & Hinton, 2012) to name a few. Simply put, such AdaGrad-type methods compute step-sizes on-the-fly by accumulating gradient information and achieve adaptive regret bounds as a function of gradient history (see also