ICLR2025

Complexity Lower Bounds of Adaptive Gradient Algorithms for Non-convex Stochastic Optimization under Relaxed Smoothness

Michael Crawshaw, Mingrui Liu

Abstract

Recent results in non-convex stochastic optimization demonstrate the convergence of popular adaptive algorithms (e.g., AdaGrad) under the (L 0 , L 1 )-smoothness condition, but the rate of convergence is a higher-order polynomial in terms of problem parameters like the smoothness constants. The complexity guaranteed by such algorithms to find an ǫ-stationary point may be significantly larger than the optimal complexity of Θ ∆Lσ 2 ǫ -4 achieved by SGD in the L-smooth setting, where ∆ is the initial optimality gap, σ 2 is the variance of stochastic gradient. However, it is currently not known whether these higher-order dependencies can be tightened. To answer this question, we investigate complexity lower bounds for several adaptive optimization algorithms in the (L 0 , L 1 )-smooth setting, with a focus on the dependence in terms of problem parameters ∆, L 0 , L 1 . We provide complexity bounds for three variations of AdaGrad, which show at least a quadratic dependence on problem parameters ∆, L 0 , L 1 . Notably, we show that the decorrelated variant of AdaGrad-Norm requires at least Ω ∆ 2 L 2 1 σ 2 ǫ -4 stochastic gradient queries to find an ǫ-stationary point. We also provide a lower bound for SGD with a broad class of adaptive stepsizes. Our results show that, for certain adaptive algorithms, the (L 0 , L 1 )-smooth setting is fundamentally more difficult than the standard smooth setting, in terms of the initial optimality gap and the smoothness constants. Published as a conference paper at ICLR 2025 ent (NAG) exhibit linear convergence, but the iteration complexity of NAG is faster than GD by a factor of √ κ, where κ is the condition number of the objective function (Nesterov, 2013) . Our contributions can be summarized as follows: 1. In Theorem 1, we provide a complexity lower bound of Ω ∆ 2 L 2 1 σ 2 ǫ -4 for Decorrelated AdaGrad-Norm (which uses decorrelated step sizes and a shared learning rate for all coordinates) under (L 0 , L 1 )-smoothness and almost surely bounded gradient noise. The proof uses a novel construction of a difficult objective for which Decorrelated AdaGrad-Norm may diverge (depending on the choice of hyperparameter), combined with a highdimensional objective (adapted from Drori & Shamir ( 2020 )). This lower bound matches the upper bound of AdaGrad-Norm in two out of three dominating terms, and only differs in terms of σ. See Section 4 for a comparison between these upper and lower bounds. 2. In Theorem 2, we lower bound the complexity of Decorrelated AdaGrad by , where γ is a hyperparameter. The proof uses a novel highdimensional objective for which the algorithm diverges when η ≥ Ω(γ/(L 1 σ)). Theorem 3 extends this result for the original AdaGrad algorithm, achieving a lower bound of Ω ∆ 2 L 2 0 ǫ -4 . While our lower bound for AdaGrad is weaker than for its decorrelated counterpart, this complexity is still larger than the optimal smooth rate in regimes when ∆ or the smoothness constants are large compared to σ. 3. In Theorem 4, we consider the setting of (L 0 , L 1 )-smoothness and gradient noise bounded by an affine function of the gradient norm. For SGD with a broad class of adaptive step sizes, we show a lower bound that is nearly quadratic in the problem parameters ∆, L 1 . This is proven by showing that one of the following must hold: (1) adaptive SGD can be forced into a biased random walk with a constant probability of divergence, or (2) the adaptive step size is O 1/(∆L 2 1 ) when optimizing a function with gradient magnitude equal to ǫ, which leads to slow convergence. The remainder of the paper is structured as follows. We discuss related work in Section 2, then give the formal problem statement in Section 3. We then present our complexity lower bounds for Decorrelated AdaGrad-Norm (Section 4), Decorrelated AdaGrad and the original AdaGrad (Section 5), and adaptive SGD (Section 6). We conclude with Section 7. RELATED WORK Relaxed Smoothness. Relaxed smoothness was introduced by Zhang et al. (2020b), who showed that GD with normalization converges faster than GD under this condition. This inspired a line of work focusing on efficient algorithms under this condition. Zhang et al. (2020a) showed an improved analysis of gradient clipping, and Jin et al. (2021) considered a non-convex distributionally robust optimization satisfying this condition. Several recent works (Liu et al., 2022; Crawshaw et al., 2023a;b) designed communication-efficient federated learning algorithms under relaxed smoothness. Li et al. (2024) analyzed gradient-based methods without gradient clipping under generalized smoothness. Crawshaw et al. (2022) studied a coordinate-wise version of relaxed smoothness, empirically showed that transformers satisfy this condition, and designed a generalized signSGD algorithm with convergence guarantees. Chen et al. (2023) proposed a new notion of α-symmetric generalized smoothness and analyzed a class of normalized GD algorithms. More recently, a few works hav