ICLR 2026

DSA: Efficient Inference For Video Generation Models via Distributed Sparse Attention

Shenggui Li, Runyu Lu, Qiaoling Chen, Haiyan Yin, Yueming Lyu, Yonggang Wen, Ivor Tsang, Tianwei Zhang

Abstract

Diffusion Transformer models have driven rapid advances in video generation, achieving state-of-the-art quality and flexibility. However, their attention mechanism remains a major performance bottleneck because its dense computation scales quadratically with sequence length. To overcome this limitation and reduce generation latency, we propose DSA, a novel attention mechanism that integrates sparse attention with distributed inference for diffusion-based video generation. By leveraging carefully designed parallelism strategies and scheduling, DSA significantly reduces redundant computation while preserving global context. Extensive experiments on benchmark datasets demonstrate that, when deployed on 8 GPUs, DSA achieves up to a 1.43× inference speedup over the existing distributed method and a 10.79× speedup over single-GPU inference.