ICLR 2026

DPad: Efficient Diffusion Language Models with Suffix Dropout

Xinhua Chen, Sitao Huang, Cong Guo, Chiyue Wei, Yintao He, Jianyi Zhang, Hai Helen Li, Yiran Chen

Abstract

Diffusion-based Large Language Models (dLLMs) parallelize text generation by framing decoding as a denoising process, but suffer from high computational overhead since they predict all future suffix tokens at each step while retaining only a small fraction. We propose Diffusion Scratchpad (DPad), a training-free method that restricts attention to a structured subset of suffix tokens, preserving fidelity while eliminating redundancy. DPad integrates two strategies: (i) a sliding window, which maintains a fixed-length suffix window, and (ii) distance-decay dropout, which deterministically removes distant suffix tokens before attention computation. This concise design is compatible with existing optimizations such as parallel decoding and prefix caching, and lends itself to a lightweight implementation. Comprehensive evaluations across multiple benchmarks on LLaDA and Dream models demonstrate that DPad delivers up to 61.4× speedup over vanilla dLLMs while maintaining comparable accuracy, highlighting its potential for efficient and scalable long-sequence inference.
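The two strategies can be illustrated with a minimal sketch. The abstract does not specify the decay schedule, so the geometric stride used here is an assumption made purely for illustration, and the function name and the parameters `window`, `dense`, and `growth` are hypothetical rather than taken from the DPad implementation.

```python
def select_suffix_positions(block_end, seq_len, window, dense=4, growth=2):
    """Pick which suffix positions stay visible to attention.

    Illustrative only: the geometric stride schedule is an assumption,
    not the schedule used by DPad.

    block_end: index of the first suffix token (end of the current block)
    seq_len:   total sequence length
    window:    sliding-window length; suffix tokens beyond it are dropped
    dense:     number of nearest suffix tokens kept without gaps
    growth:    integer factor by which the gap widens after the dense region
    """
    # Sliding window: nothing past block_end + window is considered.
    limit = min(seq_len, block_end + window)
    kept = []
    pos = block_end
    stride = 1
    while pos < limit:
        kept.append(pos)
        # Distance-decay dropout: once the dense region is filled,
        # widen the gap so distant tokens are kept ever more sparsely.
        if len(kept) >= dense:
            stride *= growth
        pos += stride
    return kept


if __name__ == "__main__":
    # Suffix starts at position 8 in a 64-token sequence, window of 16.
    print(select_suffix_positions(block_end=8, seq_len=64, window=16))
```

Because the selection depends only on positions, it is deterministic and can be computed before attention, which is what allows the dropped tokens to be skipped entirely rather than masked after the fact.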