ICML 2025

Ladder-Residual: Parallelism-Aware Architecture for Accelerating Large Model Inference with Communication Overlapping

Muru Zhang, Mayank Mishra, Zhongzhu Zhou, William Brandon, Jue Wang, Yoon Kim, Jonathan Ragan-Kelley, Shuaiwen Leon Song, Ben Athiwaratkun, Tri Dao


Abstract

Large language model inference is both memory-intensive and time-consuming, often requiring distributed algorithms to scale efficiently. Various model parallelism strategies are used to partition computation across multiple devices, reducing memory load and computation time. However, model parallelism requires communication between GPUs, which limits the gains obtained by scaling up the number of devices. We introduce Ladder Residual, a simple architectural modification applicable to all residual-based models that enables straightforward overlapping to hide the latency of communication. Our insight is that in addition to system optimizations, the model architecture can also be redesigned to decouple communication from computation. While Ladder Residual can allow communication-computation decoupling in conventional parallelism patterns, we focus on Tensor Parallelism in this paper, which is particularly bottlenecked by its heavy communication. For a Transformer model with 70B parameters, applying Ladder Residual to all its layers can achieve a 29% end-to-end wall-clock speedup at inference time with sharding over 8 devices. We train 1.2B and 3.5B Ladder Residual-based Transformer models from scratch and observe comparable performance to a standard dense Transformer baseline. We also show that it is possible to convert parts of the Llama-3.1 8B model to our Ladder Residual architecture with minimal accuracy degradation by retraining for only 3B tokens.
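To make the overlapping idea concrete, the following is a minimal sketch of a toy latency model, not the paper's implementation or measurements. It assumes that decoupling means a module can start computing while the all-reduce of the preceding module is still in flight, and that every module has the same compute and all-reduce cost. The function names, the per-module timings, and the module count are all hypothetical and chosen only for illustration.

```python
def blocking_latency(num_modules: int, compute_us: int, allreduce_us: int) -> int:
    """Toy model of standard tensor parallelism: each module waits for its
    own all-reduce to finish before the next module may start."""
    return num_modules * (compute_us + allreduce_us)


def overlapped_latency(num_modules: int, compute_us: int, allreduce_us: int) -> int:
    """Toy model of overlapped execution (assumption): module i computes
    while the all-reduce of module i-1 is in flight, so each steady-state
    step costs max(compute, all-reduce) instead of their sum."""
    if num_modules == 0:
        return 0
    # The first compute has nothing to overlap with, and the final
    # all-reduce has no later compute to hide behind.
    steady_state = (num_modules - 1) * max(compute_us, allreduce_us)
    return compute_us + steady_state + allreduce_us


if __name__ == "__main__":
    # Hypothetical numbers: 160 modules, 1000 us compute, 400 us all-reduce.
    modules, compute_us, allreduce_us = 160, 1000, 400
    base = blocking_latency(modules, compute_us, allreduce_us)
    ladder = overlapped_latency(modules, compute_us, allreduce_us)
    print(f"blocking:   {base} us")
    print(f"overlapped: {ladder} us")
    print(f"speedup:    {base / ladder:.2f}x")
```

Under this model the benefit depends on the ratio of communication to compute: when the all-reduce is shorter than the compute it is fully hidden, and when it is longer the communication itself becomes the per-step cost. Real speedups also depend on kernel scheduling, interconnect bandwidth, and batch size, none of which this sketch captures.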