KDD2025

SwitchTop-k: Scaling Top-k Compression on Programmable Switches

Yijun Li, Jiawei Huang, Jingling Liu, Zhaoyi Li, Wanchun Jiang, Jianxin Wang

Abstract

Distributed deep learning is widely deployed in data centers to provide services such as image classification and speech recognition. To reduce training time, Top-k compression has become one of the most popular ways to shrink the volume of gradient data. Nevertheless, we observe that existing Top-k compression solutions are inefficient in large-scale distributed training because of gradient build-up, missed Top-k gradients, and high compression overhead at the end hosts. To address these problems, we propose SwitchTop-k, which improves the accuracy of selecting Top-k values while ensuring a high compression rate and zero compression overhead. Specifically, SwitchTop-k offloads Top-k compression from the end hosts to programmable switches, thus alleviating gradient build-up and compression overhead. Meanwhile, we propose a sketch-based solution that achieves high accuracy in selecting the global Top-k gradients. We also co-design the switch logic and the end-host logic to improve the communication efficiency of uncompressed traffic. Finally, we implement SwitchTop-k on Intel Tofino switches and integrate it with PyTorch. Experimental results show that SwitchTop-k reduces iteration time by up to 91% compared with existing compression algorithms.
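To make the two accuracy problems named above concrete, the following is a minimal sketch of conventional host-side Top-k compression, not of SwitchTop-k's switch logic or its sketch-based selection. The function names (`top_k_sparsify`, `aggregate_sparse`, `global_top_k`) and the toy sizes are assumptions chosen for illustration. It shows that merging per-worker Top-k results can yield more than k entries (gradient build-up) and can differ from the Top-k of the fully aggregated gradient (missed Top-k gradients).

```python
import heapq
import random


def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries of a dense gradient as {index: value}."""
    top = heapq.nlargest(k, range(len(grad)), key=lambda i: abs(grad[i]))
    return {i: grad[i] for i in top}


def aggregate_sparse(sparse_grads):
    """Sum sparse gradients; the result holds the union of all workers' indices."""
    total = {}
    for sparse in sparse_grads:
        for index, value in sparse.items():
            total[index] = total.get(index, 0.0) + value
    return total


def global_top_k(grads, k):
    """Reference answer: Top-k of the element-wise sum of all dense gradients."""
    dense_sum = [sum(column) for column in zip(*grads)]
    return top_k_sparsify(dense_sum, k)


if __name__ == "__main__":
    # Hypothetical toy setup: 8 workers, 1000-dimensional gradients, k = 10.
    rng = random.Random(0)
    dim, k, workers = 1000, 10, 8
    grads = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(workers)]

    merged = aggregate_sparse([top_k_sparsify(g, k) for g in grads])
    exact = global_top_k(grads, k)

    # Build-up: the merged result may carry many more than k entries.
    print("entries after merging local Top-k:", len(merged))
    # Missed gradients: only part of the true global Top-k may be present.
    print("global Top-k indices recovered:", len(set(merged) & set(exact)))
```

In this toy model each worker compresses independently, so an index whose individual contributions are small on every worker can never reach the aggregate even when their sum is large. That is the gap that selecting Top-k values on the aggregated gradient is meant to close.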