ICML2021

Bilevel Optimization: Convergence Analysis and Enhanced Design

Kaiyi Ji, Junjie Yang, Yingbin Liang

343 citations

Abstract

Bilevel optimization has arisen as a powerful tool for many machine learning problems such as meta-learning, hyperparameter optimization, and reinforcement learning. In this paper, we investigate the nonconvex-strongly-convex bilevel optimization problem. For deterministic bilevel optimization, we provide a comprehensive convergence rate analysis for two popular algorithms respectively based on approximate implicit differentiation (AID) and iterative differentiation (ITD). For the AID-based method, we orderwisely improve the previous convergence rate analysis due to a more practical parameter selection as well as a warm start strategy, and for the ITD-based method we establish the first theoretical convergence rate. Our analysis also provides a quantitative comparison between ITD and AID based approaches. For stochastic bilevel optimization, we propose a novel algorithm named stocBiO, which features a sample-efficient hypergradient estimator using efficient Jacobian-and Hessianvector product computations. We provide the convergence rate guarantee for stocBiO, and show that stocBiO outperforms the best known computational complexities orderwisely with respect to the condition number κ and the target accuracy . We further validate our theoretical results and demonstrate the efficiency of bilevel optimization algorithms by the experiments on meta-learning and hyperparameter optimization. Algorithm Gc(f, ) Gc(g, ) JV(g, ) HV(g, ) Gc(f, ) and Gc(g, ): number of gradient evaluations w.r.t. f and g. κ : condition number. JV(g, ): number of Jacobian-vector products ∇ x ∇ y g(x, y)v. Notation O: omit log 1 terms. HV(g, ): number of Hessian-vector products ∇ 2 y g(x, y)v. Table 2. Comparison of bilevel stochastic optimization algorithms. Algorithm Gc(F, ) Gc(G, ) JV(G, ) HV(G, ) TTSA (Hong et al., 2020) O(poly(κ) -5 2 use poly(κ) because Hong et al. 2020 does not provide the explicit dependence on κ. bilevel optimizers for the deterministic setting, and proposes a novel algorithm for the stochastic setting with order-level lower computational complexity than the existing results.