ICLR2026

Proving the Limited Scalability of Centralized Distributed Optimization via a New Lower Bound Construction

Alexander Tyurin

Abstract

We consider centralized distributed optimization in the classical federated learning setup, where nn workers jointly find an ε\varepsilon-stationary point of an LL-smooth, dd-dimensional nonconvex function ff, having access only to unbiased stochastic gradients with variance σ2\sigma^2. Each worker requires at most hh seconds to compute a stochastic gradient, and the communication times from the server to the workers and from the workers to the server are τs\tau_{\textnormal{s}} and τw\tau_{\textnormal{w}} seconds per coordinate, respectively. One of the main motivations for distributed optimization is to achieve scalability with respect to nn. For instance, it is well known that the distributed version of SGD has a variance-dependent runtime term hσ2LΔnε2,\frac{h \sigma^2 L \Delta}{n \varepsilon^2}, which improves with the number of workers n,n, where Δ:=f(x0)f,\Delta := f(x^0) - f^*, and x0Rdx^0 \in \mathbb{R}^d is the starting point. Similarly, using unbiased sparsification compressors, it is possible to reduce both the variance-dependent runtime term and the communication runtime term from τwdLΔε\tau_{\textnormal{w}} d \frac{L \Delta}{\varepsilon} to τwdLΔnε+τwdhσ2nεLΔε,\frac{\tau_{\textnormal{w}} d L \Delta}{n \varepsilon} + \sqrt{\frac{\tau_{\textnormal{w}} d h \sigma^2}{n \varepsilon}} \cdot \frac{L \Delta}{\varepsilon}, which also benefits from increasing n.n. However, once we account for the communication from the server to the workers τs\tau_{\textnormal{s}}, we prove that it becomes infeasible to design a method using unbiased random sparsification compressors that scales both the server-side communication runtime term τsdLΔε\tau_{\textnormal{s}} d \frac{L \Delta}{\varepsilon} and the variance-dependent runtime term hσ2LΔε2,\frac{h \sigma^2 L \Delta}{\varepsilon^2}, better than poly-logarithmically in nn, even in the homogeneous (i.i.d.) case, where all workers access the same function or distribution. Indeed, when τsτw,\tau_{\textnormal{s}} \simeq \tau_{\textnormal{w}}, our lower bound is Ω~(min[h(σ2nε+1)LΔε+τsdLΔε,  hLΔε+hσ2LΔε2]).\tilde{\Omega}(\min[h (\frac{\sigma^2}{n \varepsilon} + 1) \frac{L \Delta}{\varepsilon} + {\tau_{\textnormal{s}} d \frac{L \Delta}{\varepsilon}},\; h \frac{L \Delta}{\varepsilon} + {h \frac{\sigma^2 L \Delta}{\varepsilon^2}}]). To establish this result, we construct a new ``worst-case'' function and develop a new lower bound framework that reduces the analysis to the concentration of a random sum, for which we prove a concentration bound. These results reveal fundamental limitations in scaling distributed optimization, even under the homogeneous (i.i.d.) assumption.