ICLR2026

Clipped Gradient Methods for Nonsmooth Convex Optimization under Heavy-Tailed Noise: A Refined Analysis

Zijian Liu

被引用 1 次

摘要

Optimization under heavy-tailed noise has become popular recently, since it better fits many modern machine learning tasks, as captured by empirical observations. Concretely, instead of a finite second moment on gradient noise, a bounded p\mathfrak{p}-th moment where p(1,2]\mathfrak{p}\in(1,2] has been recognized to be more realistic (say being upper bounded by σlp\sigma_{\mathfrak{l}}^{\mathfrak{p}} for some σl0\sigma_{\mathfrak{l}}\geq0). A simple yet effective operation, gradient clipping, is known to handle this new challenge successfully. Specifically, Clipped Stochastic Gradient Descent (Clipped SGD) guarantees a high-probability rate O(σlln(1/δ)T1p1)\mathcal{O}(\sigma_{\mathfrak{l}}\ln(1/\delta)T^{\frac{1}{\mathfrak{p}}-1}) (resp. O(σl2ln2(1/δ)T2p2)\mathcal{O}(\sigma_{\mathfrak{l}}^{2}\ln^{2}(1/\delta)T^{\frac{2}{\mathfrak{p}}-2})) for nonsmooth convex (resp. strongly convex) problems, where δ(0,1]\delta\in(0,1] is the failure probability and TNT\in\mathbb{N} is the time horizon. In this work, we provide a refined analysis for Clipped SGD and offer two rates, O(σldeff12pln11p(1/δ)T1p1)\mathcal{O}(\sigma_{\mathfrak{l}}d_{\mathrm{eff}}^{-\frac{1}{2\mathfrak{p}}}\ln^{1-\frac{1}{\mathfrak{p}}}(1/\delta)T^{\frac{1}{\mathfrak{p}}-1}) and O(σl2deff1pln22p(1/δ)T2p2)\mathcal{O}(\sigma_{\mathfrak{l}}^{2}d_{\mathrm{eff}}^{-\frac{1}{\mathfrak{p}}}\ln^{2-\frac{2}{\mathfrak{p}}}(1/\delta)T^{\frac{2}{\mathfrak{p}}-2}), faster than the aforementioned best results, where deff1d_{\mathrm{eff}}\geq1 is a quantity we call the generalized effective dimension. Our analysis improves upon the existing approach in two respects: better utilization of Freedman's inequality and finer bounds for clipping error under heavy-tailed noise. In addition, we extend the refined analysis to convergence in expectation and obtain new rates that break the known lower bounds. Lastly, to complement the study, we establish new lower bounds for both high-probability and in-expectation convergence. Notably, the in-expectation lower bounds match our new upper bounds, indicating the optimality of our refined analysis for convergence in expectation.