NeurIPS2023

Convex and Non-convex Optimization Under Generalized Smoothness

Haochuan Li, Jian Qian, Yi Tian, Alexander Rakhlin, Ali Jadbabaie

71 citations

Abstract

Classical analysis of convex and non-convex optimization methods often requires the Lipshitzness of the gradient, which limits the analysis to functions bounded by quadratics. Recent work relaxed this requirement to a non-uniform smoothness condition with the Hessian norm bounded by an affine function of the gradient norm, and proved convergence in the non-convex setting via gradient clipping, assuming bounded noise. In this paper, we further generalize this non-uniform smoothness condition and develop a simple, yet powerful analysis technique that bounds the gradients along the trajectory, thereby leading to stronger results for both convex and non-convex optimization problems. In particular, we obtain the classical convergence rates for (stochastic) gradient descent and Nesterov's accelerated gradient method in the convex and/or non-convex setting under this general smoothness condition. The new analysis approach does not require gradient clipping and allows heavy-tailed noise with bounded variance in the stochastic setting. * Equal contribution. Table 1: Summary of the results. ǫ denotes the sub-optimality gap of the function value in convex settings, and the gradient norm in non-convex settings. " * " denotes optimal rates. Method Convexity ℓ-smoothness Gradient complexity GD Strongly convex No requirement O Related work Gradient-based optimizaiton. The classical gradient-based optimization problems for the standard Lipschitz smooth functions have been well studied for both convex [Nemirovskij and Yudin, 1983 , Nesterov, 2003 , d'Aspremont et al., 2021] and non-convex functions. In the convex setting, the goal is to reach an ǫ-sub-optimal point x satisfying f (x) -inf x f (x) ≤ ǫ. It is well known that GD achieves the O(1/ǫ) gradient complexity and NAG achieves the accelerated O(1/ √ ǫ) complexity which is optimal among all gradientbased methods. For strongly convex functions, GD and NAG achieve the O(κ log(1/ǫ)) and O( √ κ log(1/ǫ)) complexity respectively, where κ is the condition number and the latter is again optimal. In the non-convex setting, the goal is to find an ǫ-stationary point x satisfying ∇f (x) ≤ ǫ, since finding a global minimum is NP-hard in general. It is well known that GD achieves the optimal O(1/ǫ 2 ) complexity which matches the lower bound in [Carmon et al., 2017] . In the stochastic setting for unbiased stochastic gradient with bounded variance, SGD achieves the optimal O(1/ǫ 4 ) complexity [Ghadimi and Lan, 2013] , matching the lower bound in [Arjevani et al., 2019] . In this paper, we obtain the classical rates in terms of ǫ for all the above-mentioned methods and settings, under a far more general smoothness condition. Generalized smoothness. The (L 0 , L 1 )-smoothness condition proposed by Zhang et al. [2019] was studied by many follow-up works. Under the same condition, [Zhang et al., 2020] considers momentum in the updates and improves the constant dependency of the convergence rate for SGD with clipping derived in [Zhang et al., 2019] . [Qian et al., 2021] studies gradient clipping in incremental gradient methods, [Zhao et al., 2021] studies stochastic normalized gradient descent, and [Crawshaw et al., 2022] studies a generalized SignSGD method, under the (L 0 , L 1 )-smoothess condition. [Reisizadeh et al., 2023] studies variance reduction for (L 0 , L 1 )-smooth functions. [Chen et al., 2023] proposes a new notion of α-symmetric generalized smoothness, which is roughly as general as (L 0 , L 1 )-smoothness. [Wang et al., 2022] analyzes convergence of Adam and provides a lower bound which shows non-adaptive SGD may diverge. In the stochastic setting, the abovementioned works either consider the strong assumption of bounded gradient noise or require a very large batch size that depends on ǫ, which essentially reduces the analysis to the deterministic setting. [Faw et al., 2023] proposes an AdaGrad-type algorithm in order to relax the bounded noise assumption. Perhaps due to the lower bounds in [Zhang et al., 2019 , Wang et al., 2022] , all the above works study methods with an adaptive stepsize. In this and our concurrent work [Li et al., 2023] , we further generalize the smoothness condition and analyze various methods under this condition through bounding the gradients along the trajectory. Function class In this section, we discuss the function class of interest where the objective function f lies. We start with the following two standard assumptions in the literature of unconstrained optimization, which will be assumed throughout Sections 4 and 5 unless explicitly stated. Assumption 1. The objective function f is differentiable and closed within its open domain X .