ICLR 2026
Pyramid Patchification Flow for Visual Generation
Hui Li, Baoyou Chen, Li jiaye, Jingdong Wang, Siyu Zhu
Abstract
Diffusion Transformers (DiTs) typically use the same patch size across all timesteps, enforcing a constant token budget throughout sampling. In this paper, we introduce Pyramidal Patchification Flow (PPFlow), which reduces the number of tokens at high-noise timesteps to improve sampling efficiency. The idea is simple: use larger patches at higher-noise timesteps and smaller patches at lower-noise timesteps. The implementation is easy: share the DiT's transformer blocks across timesteps, and learn separate linear projections for the different patch sizes in the patchify and unpatchify layers. Unlike Pyramidal Flow, which operates on pyramid representations, our approach operates on full latent representations, eliminating trajectory "jump points" and thus avoiding re-noising tricks during sampling. Training from a pretrained SiT-XL/2 requires only limited additional training FLOPs and delivers denoising speedups while preserving image generation quality; training from scratch achieves a comparable sampling speedup, e.g., in SiT-B. Trained from the text-to-image model FLUX.1, PPFlow achieves speedups at resolutions from 512 to 2048 with comparable quality.
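The token-budget idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The two-stage schedule, the 0.5 threshold, the patch sizes 2 and 4, the 32x32x4 latent, the hidden width of 1152, and all function names are assumptions chosen only for illustration. The sketch shows how a noise-dependent patch size changes the number of tokens fed to the shared transformer blocks, and what shapes the per-patch-size linear projections would take.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    max_noise: float  # stage applies while noise_level <= max_noise
    patch: int        # patch side length, in latent pixels


# Hypothetical two-stage schedule (an assumption, not taken from the paper).
# noise_level lies in [0, 1], where 1 means pure noise and 0 means clean data.
SCHEDULE = (Stage(max_noise=0.5, patch=2), Stage(max_noise=1.0, patch=4))


def patch_size_for(noise_level, schedule=SCHEDULE):
    """Return the patch size of the first stage that covers noise_level."""
    for stage in schedule:
        if noise_level <= stage.max_noise:
            return stage.patch
    raise ValueError("noise_level must lie in [0, 1]")


def num_tokens(height, width, patch):
    """Token count for non-overlapping patch x patch patchification."""
    if height % patch or width % patch:
        raise ValueError("latent size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def projection_shapes(channels, patch, hidden):
    """(in, out) weight shapes of the linear layers tied to one patch size."""
    patch_dim = patch * patch * channels
    return {"patchify": (patch_dim, hidden), "unpatchify": (hidden, patch_dim)}


if __name__ == "__main__":
    height = width = 32  # assumed latent grid
    channels = 4         # assumed latent channels
    hidden = 1152        # assumed transformer width
    for noise in (0.9, 0.6, 0.4, 0.1):
        p = patch_size_for(noise)
        print(noise, p, num_tokens(height, width, p),
              projection_shapes(channels, p, hidden))
```

Under these assumed settings, doubling the patch side length at high noise cuts the token count by a factor of four, while only the small input and output projections differ between stages; the transformer blocks in between would be shared.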