ICLR 2026

Pyramidal Patchification Flow for Visual Generation

Hui Li, Baoyou Chen, Li Jiaye, Jingdong Wang, Siyu Zhu

Cited by 1

Abstract

Diffusion Transformers (DiTs) typically use the same patch size for Patchify at all timesteps, which enforces a constant token budget at every timestep. In this paper, we introduce Pyramidal Patchification Flow (PPFlow), which reduces the number of tokens at high-noise timesteps to improve sampling efficiency. The idea is simple: use larger patches at higher-noise timesteps and smaller patches at lower-noise timesteps. The implementation is easy: share the DiT's transformer blocks across timesteps, and learn separate linear projections for the different patch sizes in Patchify and Unpatchify. Unlike Pyramidal Flow, which operates on pyramid representations, our approach operates on full latent representations, eliminating trajectory "jump points" and thus avoiding re-noising tricks during sampling. Training from a pretrained SiT-XL/2 requires only +8.9% additional training FLOPs and delivers a 2.02× denoising speedup while maintaining image generation quality; training from scratch achieves a comparable sampling speedup, e.g., a 2.04× speedup for SiT-B. Starting from the text-to-image model FLUX.1, PPFlow achieves a 1.61× to 1.86× speedup at resolutions from 512 to 2048 with comparable quality.
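
As a rough illustration of the idea in the abstract, the sketch below shows how a timestep-dependent patch size changes the token count and the width of the Patchify/Unpatchify projections. It is a minimal sketch, not the authors' implementation: the two-stage schedule, the 0.5 threshold, the noise-level convention (1 = pure noise), the 32×32×4 latent shape, and all function names are assumptions made for illustration.

```python
# Illustrative sketch only. The stage thresholds, the noise-level convention,
# and every name below are assumptions, not values taken from the paper.

# (minimum noise level, patch size), ordered from the noisiest stage to the cleanest.
HYPOTHETICAL_STAGES = [(0.5, 4), (0.0, 2)]


def select_patch_size(noise_level, stages=HYPOTHETICAL_STAGES):
    """Return the patch size for a noise level in [0, 1], where 1 is pure noise."""
    if not 0.0 <= noise_level <= 1.0:
        raise ValueError("noise_level must lie in [0, 1]")
    for min_noise, patch_size in stages:
        if noise_level >= min_noise:
            return patch_size
    raise ValueError("stages must include a stage starting at noise level 0.0")


def num_tokens(height, width, patch_size):
    """Number of tokens after splitting an H x W latent into square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("latent size must be divisible by the patch size")
    return (height // patch_size) * (width // patch_size)


def patch_dim(channels, patch_size):
    """Flattened patch length: input width of a Patchify projection
    and output width of the matching Unpatchify projection."""
    return channels * patch_size * patch_size


if __name__ == "__main__":
    latent_h, latent_w, latent_c = 32, 32, 4  # assumed latent shape
    for level in (0.9, 0.1):
        p = select_patch_size(level)
        print(
            f"noise={level}: patch={p}, "
            f"tokens={num_tokens(latent_h, latent_w, p)}, "
            f"projection width={patch_dim(latent_c, p)}"
        )
```

Under these assumptions, the high-noise stage processes 64 tokens instead of 256. Because the flattened patch length differs between stages (64 versus 16), each patch size needs its own linear projection into and out of the shared transformer width, which is consistent with the abstract's description of separate projections around shared blocks. The sketch does not model attention cost or the split of sampling steps between stages, so it says nothing about the reported speedups.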