ICLR 2026

Dual-Path Condition Alignment for Diffusion Transformers

Changhao Peng, Yuqi Ye, Shuangjun Du, Wenxu Gao, Wei Gao

Abstract

Denoising-based generative models have been significantly advanced by the representation-alignment (REPA) loss, which leverages pre-trained visual encoders to guide intermediate network features. However, REPA's reliance on external visual encoders introduces two critical challenges: potential distribution mismatch between the encoder's training data and the generation target, and the high computational cost of pre-training. Inspired by the observation that REPA primarily helps early layers capture robust semantics, we propose an unsupervised alternative that requires neither an external visual encoder nor the assumption of a consistent data distribution. We introduce DUal-Path condition Alignment (DUPA), a novel self-alignment framework that independently noises an image multiple times, processes the resulting noisy latents through a decoupled diffusion transformer, and then aligns the derived conditions, i.e., the low-frequency semantic features extracted from each path. Experiments demonstrate that DUPA achieves FID $=$ 1.46 on ImageNet 256 $\times$ 256 with only 400 training epochs, outperforming all methods that do not rely on external supervision. DUPA is also model-agnostic and can be readily applied to any denoising-based generative model, demonstrating its scalability and generalizability. Code is available at https://github.com/PCH-gg/DUPA and https://openi.pcl.ac.cn/OpenAIDriving/DUPA.
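To make the dual-path idea concrete, the following is a minimal, illustrative sketch rather than the released implementation. The abstract does not specify the noise schedule, the form of the alignment loss, or the network architecture, so everything here is an assumption: `add_noise` uses a simple linear interpolation between data and Gaussian noise, `toy_condition_encoder` is a hypothetical stand-in (a moving-average low-pass filter) for the condition-extraction part of the decoupled transformer, and the alignment term is taken to be one minus cosine similarity. A 1-D signal stands in for an image latent.

```python
import math
import random


def add_noise(x, t, rng):
    """Noise a clean signal at level t in [0, 1].

    Assumed schedule: linear interpolation between data and Gaussian noise.
    """
    return [(1.0 - t) * xi + t * rng.gauss(0.0, 1.0) for xi in x]


def toy_condition_encoder(x_noisy, window=4):
    """Hypothetical stand-in for the condition path.

    A moving average acts as a crude low-pass filter, mimicking the idea
    of extracting low-frequency features from a noisy input.
    """
    n = len(x_noisy)
    return [sum(x_noisy[i:i + window]) / window for i in range(n - window + 1)]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b + 1e-12)


def dual_path_alignment_loss(x, t_a, t_b, rng):
    """Noise the same input twice, independently, and align the two conditions.

    Assumed loss: 1 - cosine similarity (0 when the conditions agree).
    """
    cond_a = toy_condition_encoder(add_noise(x, t_a, rng))
    cond_b = toy_condition_encoder(add_noise(x, t_b, rng))
    return 1.0 - cosine_similarity(cond_a, cond_b)


rng = random.Random(0)
signal = [math.sin(0.1 * i) for i in range(64)]  # toy stand-in for an image latent

# Two independently noised views of the same input.
loss = dual_path_alignment_loss(signal, t_a=0.3, t_b=0.6, rng=rng)

# Sanity check: with no noise on either path the conditions coincide.
loss_clean = dual_path_alignment_loss(signal, t_a=0.0, t_b=0.0, rng=rng)
```

In this sketch, minimizing the loss pushes the two conditions toward the same direction regardless of the noise realization. In an actual training setup this term would presumably be added to the standard denoising objective, with the conditions produced by learned transformer layers rather than a fixed filter.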