ICLR 2025

Warm Diffusion: Recipe for Blur-Noise Mixture Diffusion Models

Hao-Chien Hsueh, Wen-Hsiao Peng, Ching-Chun Huang

Abstract

Diffusion probabilistic models have achieved remarkable success in generative tasks across diverse data types. While recent studies have explored alternative degradation processes beyond Gaussian noise, this paper bridges two key diffusion paradigms: hot diffusion, which relies entirely on noise, and cold diffusion, which uses only blurring without noise. We argue that hot diffusion fails to exploit the strong correlation between high-frequency image detail and low-frequency structures, leading to random behaviors in the early steps of generation. Conversely, while cold diffusion leverages image correlations for prediction, it neglects the role of noise (randomness) in shaping the data manifold, resulting in out-of-manifold issues and partially explaining its performance drop. To integrate both strengths, we propose Warm Diffusion, a unified Blur-Noise Mixture Diffusion Model (BNMD) that controls blurring and noise jointly. Our divide-and-conquer strategy exploits the spectral dependency in images, simplifying score model estimation by disentangling the denoising and deblurring processes. We further analyze the Blur-to-Noise Ratio (BNR) using spectral analysis to investigate the trade-off between model learning dynamics and changes in the data manifold. Extensive experiments across benchmarks validate the effectiveness of our approach for image generation.

INTRODUCTION

Diffusion probabilistic models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Nichol & Dhariwal, 2021; Song et al., 2022) have gained significant attention for their ability to learn data distributions through denoising, leading to impressive generation quality. These generative models typically employ a stochastic process that gradually transforms complex data distributions into simpler forms by adding a small amount of Gaussian noise in each forward iteration, eventually arriving at a simple Gaussian distribution.
The reverse process uses a neural network to model the score (Hyvärinen & Dayan, 2005) of a noise-level-dependent marginal distribution, iteratively refining the denoised samples to recover the input data distribution. However, learning this score estimator is domain-agnostic: it focuses solely on recovering the underlying signal by removing Gaussian noise, without considering the inherent properties of the modeled data. While this universal approach is effective for various data modalities, we argue that it leaves room for improvement in modeling images. Specifically, it overlooks the strong correlation between high-frequency image detail and low-frequency structures, a relationship we term spectral dependency. This correlation suggests that an efficient image generation process should progress from common low-frequency components to diverse high-frequency detail.
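To make the three degradation styles discussed above concrete, the following is a minimal sketch on a toy 1-D signal. It is not the BNMD formulation of this paper: the function names (`gaussian_kernel`, `blur`, `degrade`), the parameters `sigma_blur` and `sigma_noise`, the circular boundary handling, and the blur-then-add-noise ordering are all assumptions chosen only for illustration.

```python
import math
import random


def gaussian_kernel(sigma_blur):
    """Normalized 1-D Gaussian kernel; sigma_blur <= 0 yields the identity kernel."""
    if sigma_blur <= 0:
        return [1.0]
    radius = max(1, int(3 * sigma_blur))
    weights = [math.exp(-0.5 * (k / sigma_blur) ** 2)
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(signal, sigma_blur):
    """Circular convolution of a 1-D signal with a Gaussian kernel."""
    kernel = gaussian_kernel(sigma_blur)
    radius = len(kernel) // 2
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j - radius) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


def degrade(signal, sigma_blur, sigma_noise, rng):
    """Hypothetical mixture: blur first, then add i.i.d. Gaussian noise."""
    blurred = blur(signal, sigma_blur)
    return [v + sigma_noise * rng.gauss(0.0, 1.0) for v in blurred]


rng = random.Random(0)

# Toy "image row": a step edge. Because the convolution is circular,
# the signal also has a second edge at the wrap-around point.
x0 = [0.0] * 16 + [1.0] * 16

hot = degrade(x0, sigma_blur=0.0, sigma_noise=0.5, rng=rng)   # noise only
cold = degrade(x0, sigma_blur=2.0, sigma_noise=0.0, rng=rng)  # blur only
warm = degrade(x0, sigma_blur=2.0, sigma_noise=0.5, rng=rng)  # blur and noise
```

In this sketch the blur-only result is fully deterministic given `x0`, whereas the noise-only result keeps the sharp edge but perturbs every sample independently; the mixed case removes high-frequency detail and injects randomness at the same time, which is the regime the text refers to as warm.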