ICLR2026
LucidFlux: Caption-Free Photo-Realistic Image Restoration via a Large-Scale Diffusion Transformer
Song Fei, Tian Ye, Lujia Wang, Lei Zhu
Abstract
Image restoration (IR) aims to recover images degraded by unknown mixtures while preserving semantics-conditions under which discriminative restorers and UNet-based diffusion priors often oversmooth, hallucinate, or drift. We present LucidFlux, a caption-free IR framework that adapts a large diffusion transformer (Flux.1) without image captions. Our LucidFlux introduces a lightweight dualbranch conditioner that injects signals from the degraded input and a lightly restored proxy to respectively anchor geometry and suppress artifacts. Then, a timestep-and layer-adaptive modulation schedule is designed to route these cues across the backbone's hierarchy, in order to yield coarse-to-fine and context-aware updates that protect the global structure while recovering texture. After that, to avoid the latency and instability of text prompts or Vision-Language Model (VLM) captions, we enforce caption-free semantic alignment via SigLIP features extracted from the proxy. A scalable curation pipeline further filters large-scale data for structure-rich supervision. Across synthetic and in-the-wild benchmarks, our LucidFlux consistently outperforms strong open-source and commercial baselines, and ablation studies verify the necessity of each component. LucidFlux shows that, for large DiTs, when, where, and what to condition on-rather than adding parameters or relying on text prompts-is the governing lever for robust and caption-free image restoration in the wild.