ICLR2026
Proving the Limited Scalability of Centralized Distributed Optimization via a New Lower Bound Construction
Alexander Tyurin
Abstract
We consider centralized distributed optimization in the classical federated learning setup, where workers jointly find an -stationary point of an -smooth, -dimensional nonconvex function , having access only to unbiased stochastic gradients with variance . Each worker requires at most seconds to compute a stochastic gradient, and the communication times from the server to the workers and from the workers to the server are and seconds per coordinate, respectively. One of the main motivations for distributed optimization is to achieve scalability with respect to . For instance, it is well known that the distributed version of SGD has a variance-dependent runtime term which improves with the number of workers where and is the starting point. Similarly, using unbiased sparsification compressors, it is possible to reduce both the variance-dependent runtime term and the communication runtime term from to which also benefits from increasing However, once we account for the communication from the server to the workers , we prove that it becomes infeasible to design a method using unbiased random sparsification compressors that scales both the server-side communication runtime term and the variance-dependent runtime term better than poly-logarithmically in , even in the homogeneous (i.i.d.) case, where all workers access the same function or distribution. Indeed, when our lower bound is To establish this result, we construct a new ``worst-case'' function and develop a new lower bound framework that reduces the analysis to the concentration of a random sum, for which we prove a concentration bound. These results reveal fundamental limitations in scaling distributed optimization, even under the homogeneous (i.i.d.) assumption.