ICML2024
Tuning-Free Stochastic Optimization
Ahmed Khaled, Chi Jin
13 citations
Abstract
Large-scale machine learning problems make the cost of hyperparameter tuning ever more prohibitive. This creates a need for algorithms that can tune themselves on-the-fly. We formalize the notion of "tuning-free" algorithms that can match the performance of optimally-tuned optimization algorithms up to polylogarithmic factors given only loose hints on the relevant problem parameters. We consider in particular algorithms that can match optimally-tuned Stochastic Gradient Descent (SGD). When the domain of optimization is bounded, we show tuning-free matching of SGD is possible and achieved by several existing algorithms. We prove that for the task of minimizing a convex and smooth or Lipschitz function over an unbounded domain, tuning-free optimization is impossible. We discuss conditions under which tuning-free optimization is possible even over unbounded domains. In particular, we show that the recently proposed DoG and DoWG algorithms are tuning-free when the noise distribution is sufficiently well-behaved. For the task of finding a stationary point of a smooth and potentially nonconvex function, we give a variant of SGD that matches the best-known high-probability convergence rate for tuned SGD at only an additional polylogarithmic cost. However, we also give an impossibility result that shows no algorithm can hope to match the optimal expected convergence rate for tuned SGD with high probability. on x 0 -x * (as in optimally tuned SGD) must suffer regret from potentially exponential regret. If we do not insist on a linear dependence on x 0 -x * , then the best achievable convergence bound scales x 0 -x * 3 , and this is tight (Mhammedi and Koolen, 2020) . None of the aforementioned lower bounds apply to the setting of stochastic optimization, since in general online learning assumes an adversarial oracle, which is stronger than a stochastic oracle. Tuning-free algorithms in the deterministic setting. Gradient descent augmented with line search (Nesterov, 2014; Beck, 2017) is tuning-free for smooth convex and nonconvex optimization. Bisection search (Carmon and Hinder, 2022) is tuning-free for both convex and smooth as well as convex and Lipschitz optimization, as is a restarted version of gradient descent with Polyak stepsizes (Hazan and Kakade, 2019). In the smooth setting, the adaptive descent method of (Malitsky and Mishchenko, 2020) is also tuning-free. There are also accelerated methods (Lan et al., 2023) , methods for the Lipschitz setting (Defazio and Mishchenko, 2023), methods based on online learning (Orabona, 2023), and others. Algorithms for the stochastic setting. Observe that because online learning is a more general setting than the stochastic one, we can apply algorithms from online convex optimization here, like e.g. (Mhammedi and Koolen, 2020) coupled with an appropriate online-to-batch conversion (Hazan, 2022) . In more recent work (Carmon and Hinder, 2022; Ivgi et al., 2023) , we see algorithmic developments specific to the stochastic setting. We discuss the convergence rates these algorithms achieve in more detail in Section 4.1. Other hyperparameter tuning approaches. In practice, hyperparameters are often found by grid search, random search, or methods based on Bayesian optimization (Bischl et al., 2023) ; None of these approaches come with efficient theoretical guarantees. Another approach is "meta-optimization" where we have a sequence of optimization problems and seek to minimize the cumulative error over this sequence. Often, another optimization algorithm is then used to select the learning rates, e.g. hypergradient descent (Baydin et al., 2018). Meta-optimization approaches are quite difficult to establish theoretical guarantees for, and only recently have some theoretical results been shown (Chen and Hazan, 2023). Our setting in this paper is different, since rather than seek to minimize regret over a sequence of optimization problems, we have a single function and an oracle that gives us (stochastic) gradient estimates for this function. Concurrent work. In concurrent work, Carmon and Hinder ( 2024 ) and Attia and Koren (2024) also study lower bounds for first-order stochastic optimization. In both papers, like in our work, the algorithm is provided with a certain range that the problem parameters fall in (what we term as hints) and must make use of only that to minimize the function with stochastic gradient evaluations. Carmon and Hinder (2024) study what is the minimum possible multiplicative factor slowdown any algorithm must suffer compared to optimally-tuned baselines when provided access only to hints, which they term the price of adaptivity. They provide lower bounds for stochastic convex optimization for Lipschitz functions in expectation and with high probability, and also consider the case where some of the problem parameters have no uncertainty (e.g. when we know the Lipschitz constant but not the initial distance to the optimum). Our lower bound in this setting (