AAAI 2026

Minute-Long Videos with Dual Parallelisms

Zeqing Wang, Bowen Zheng, Xingyi Yang, Zhenxiong Tan, Yuecong Xu, Xinchao Wang

Abstract

Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. To address this, we propose a novel distributed inference strategy, termed DualParal. The core idea is that, instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs. However, a naive implementation of this division faces a key limitation: because diffusion models require synchronized noise levels across frames, it serializes the very parallelisms it was meant to combine. We leverage a block-wise denoising scheme to handle this. Namely, we process a sequence of frame blocks through the pipeline with progressively decreasing noise levels. Each GPU handles a specific block and layer subset while passing previous results to the next GPU, enabling asynchronous computation and communication. To further optimize performance, we incorporate two key enhancements. First, a feature cache is implemented on each GPU to store and reuse features from the prior block as context, minimizing inter-GPU communication and redundant computation. Second, we employ a coordinated noise initialization strategy, ensuring globally consistent temporal dynamics by sharing initial noise patterns across GPUs without extra resource costs. Together, these enable fast, artifact-free, and infinitely long video generation. Applied to the latest diffusion transformer video generator, our method efficiently produces 1,025-frame videos with up to 6.54× lower latency and 1.48× lower memory cost on 8×RTX 4090 GPUs.

* Corresponding Author. Preprint. Under review.

hidden [11, 17] or input [25, 12] sequences using a full model replica on each device. However, they incur high memory overhead because the entire model is held on every device [25, 12].
In contrast, pipeline parallelism [6] mitigates memory usage by partitioning the model across devices as a device pipeline [9, 16, 24]. An ideal solution would therefore combine sequence parallelism with pipeline parallelism to maximize speed and minimize memory usage. However, naively combining the two creates a fundamental conflict. The core issue stems from the inherent synchronization property of video diffusion models: all input tokens must pass through an entire layer together before any can move on. In pipeline parallelism, this means the full input must finish processing on one device (e.g., Device 1) before passing to the next (e.g., Device 2). This requirement directly contradicts sequence parallelism, which splits the input across devices. As a result, all distributed parts must be gathered back onto a single device for serialized processing on specific model layers. Only then can all parts enter the next pipeline stage, i.e., the next device. This repeated gathering serializes computation and negates the benefits of sequence parallelism, reintroducing a serial bottleneck and significant communication overhead.

To address this conflict, we propose a novel distributed inference strategy, termed DualParal. At a high level, DualParal divides both the video sequence and the model into chunks and applies parallel processing across both.
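
The scheduling conflict described above can be illustrated with a toy timing model. The sketch below is not the DualParal implementation; it is a minimal illustration under simplifying assumptions that are ours, not the paper's: each (frame block, device stage) pair costs exactly one tick, communication is free, and only a single pass through the model is counted. The function names `gathered_schedule` and `pipelined_schedule` are hypothetical. The first models the naive combination, in which the whole sequence is gathered and processed on one stage before any of it moves on; the second models a block-wise pipeline in which each block advances to the next device as soon as it finishes the current one.

```python
from typing import List, Tuple

Work = Tuple[int, int]  # (block index, device index)


def gathered_schedule(num_blocks: int, num_devices: int) -> List[List[Work]]:
    """Naive combination (toy model): all blocks must clear a stage together.

    The full sequence is gathered on one device per stage, so only one
    (block, device) unit of work runs per tick.
    """
    ticks: List[List[Work]] = []
    for device in range(num_devices):
        for block in range(num_blocks):
            ticks.append([(block, device)])
    return ticks


def pipelined_schedule(num_blocks: int, num_devices: int) -> List[List[Work]]:
    """Block-wise pipeline (toy model): block b reaches device d at tick b + d.

    Different devices work on different blocks during the same tick.
    """
    total_ticks = num_blocks + num_devices - 1
    ticks: List[List[Work]] = []
    for t in range(total_ticks):
        # Device d handles block (t - d) if that block exists.
        active = [(t - d, d) for d in range(num_devices) if 0 <= t - d < num_blocks]
        ticks.append(active)
    return ticks


if __name__ == "__main__":
    for name, build in [("gathered", gathered_schedule),
                        ("pipelined", pipelined_schedule)]:
        schedule = build(4, 4)
        print(f"{name}: {len(schedule)} ticks")
        for t, active in enumerate(schedule):
            print(f"  tick {t}: {active}")
```

With 4 blocks and 4 devices, this toy model needs 16 ticks for the gathered schedule and 7 for the pipelined one. The gap comes entirely from overlap: in the pipelined schedule up to four devices are busy in the same tick, whereas the gathered schedule keeps only one busy at a time. Real speedups depend on factors this sketch ignores, such as attention cost, inter-GPU transfer time, and the number of denoising steps.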