ICML 2025

DSP: Dynamic Sequence Parallelism for Multi-Dimensional Transformers

Xuanlei Zhao, Shenggan Cheng, Chang Chen, Zangwei Zheng, Ziming Liu, Zheming Yang, Yang You

Abstract

Scaling multi-dimensional transformers to long sequences is important across various domains. The large memory requirements and slow processing speed of such sequences make sequence parallelism necessary. All existing approaches fall under the category of embedded sequence parallelism, which is limited to sharding along a single sequence dimension and thereby introduces significant communication overhead. However, multi-dimensional transformers involve independent computation across multiple sequence dimensions. To exploit this property, we propose Dynamic Sequence Parallelism (DSP) as a novel abstraction of sequence parallelism. DSP dynamically switches the parallel dimension according to the computation stage, using an efficient resharding strategy. DSP offers significant reductions in communication costs, adaptability across modules, and ease of use with minimal constraints. Experiments demonstrate DSP's superiority over state-of-the-art sequence parallelism methods, with throughput improvements ranging from 32.2% to 10× and at least a 50% reduction in communication volume.
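To make the idea of switching the parallel dimension concrete, the following is a minimal single-process sketch, not the authors' implementation. It assumes a two-dimensional sequence of shape T × S (for example, temporal and spatial), N simulated devices, and T and S both divisible by N. The function names (`shard_rows`, `shard_cols`, `all_to_all_reshard`) are hypothetical, and the all-to-all exchange is simulated with plain Python lists rather than a real communication library.

```python
def shard_rows(x, n):
    """Split a T x S matrix (list of rows) into n shards along T (dim 0)."""
    step = len(x) // n
    return [x[i * step:(i + 1) * step] for i in range(n)]


def shard_cols(x, n):
    """Split a T x S matrix into n shards along S (dim 1)."""
    step = len(x[0]) // n
    return [[row[j * step:(j + 1) * step] for row in x] for j in range(n)]


def all_to_all_reshard(row_shards):
    """Simulate an all-to-all that turns T-sharding into S-sharding.

    Simulated device i cuts its local block into n column chunks and
    "sends" chunk j to device j. Device j stacks the chunks it receives
    in sender order, so it ends up with every row but only its own
    slice of the columns.
    """
    n = len(row_shards)
    step = len(row_shards[0][0]) // n
    out = []
    for j in range(n):            # receiving device
        received = []
        for i in range(n):        # sending device
            for row in row_shards[i]:
                received.append(row[j * step:(j + 1) * step])
        out.append(received)
    return out


# Toy sizes (assumptions for illustration only).
T, S, N = 4, 6, 2
x = [[t * S + s for s in range(S)] for t in range(T)]

# Stage 1: computation along S, so the data is sharded along T.
row_shards = shard_rows(x, N)

# Stage 2: computation along T, so the data is resharded along S.
col_shards = all_to_all_reshard(row_shards)

# The resharded layout matches sharding the full tensor along S directly.
assert col_shards == shard_cols(x, N)
```

In this sketch each simulated device always holds a complete copy of the dimension being computed on, so the computation within a stage needs no communication; data moves only at the switch between stages.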