ICML2024
SILVER: Single-loop variance reduction and application to federated learning
Kazusato Oko, Shunta Akiyama, Denny Wu, Tomoya Murata, Taiji Suzuki
2 citations
Abstract
Most variance reduction methods require multiple times of full gradient computation, which is timeconsuming and hence a bottleneck in application to distributed optimization. We present a singleloop variance-reduced gradient estimator named SILVER (SIngle-Loop VariancE-Reduction) for the finite-sum non-convex optimization, which does not require multiple full gradients but nevertheless achieves the optimal gradient complexity. Notably, unlike existing methods, SILVER provably reaches second-order optimality, with exponential convergence in the Polyak-Łojasiewicz (PL) region, and achieves further speedup depending on the data heterogeneity. Owing to these advantages, SILVER serves as a new base method to design communication-efficient federated learning algorithms: we combine SILVER with local updates, which gives the best communication rounds and number of communicated gradients across all range of Hessian heterogeneity, and, at the same time, guarantees second-order optimality and exponential convergence in the PL region. Single-loop Variance Reduction for Federated Learning Table 1. Comparison of communication rounds and complexity for (2). 2 Algorithms Communication rounds Client sampling FedAvg (NC) (Karimireddy et al., 2020) σ 2 c pε 4 + σc ε 3 + 1 ε 2 ✓ SCAFFOLD (NC) (Karimireddy et al., 2020) σ 2 pKε 4 + 1 ε 2 ( P p ) 2 3 ✓ SCAFFOLD (NC, quad) (Karimireddy et al., 2020) σ 2 P Kε 4 + 1 Kε 2 + ζ ε 2 × MimeMVR (NC) (Karimireddy et al., 2021) (Murata and Suzuki, 2021) 1 (Tyurin and Richtárik, 2022a) MimeSGD (PL) (Karimireddy et al., 2021 ) finding ε-first-order stationary points; SOSP: finding (ε, δ)-second-order stationary points; PL: finding ε-solutions under Polyak-Łojasiewicz (PL) condition with µ; Quad: only valid for quadratics. P : total number of clients, p: client sample size (if allowed), K: local update steps between communications, ζ, ζ ′ : Hessian heterogeneity (ζ ≤ ζ ′ ), σc: variance of ∇fi(x) -∇f (x), σ: variance of ∇fi,j(x) -∇fi(x), ω: compression rate of gradients.