ICLR 2025
How Much is a Noisy Image Worth? Data Scaling Laws for Ambient Diffusion
Giannis Daras, Yeshwanth Cherapanamjeri, Constantinos Daskalakis
Abstract
The quality of generative models depends on the quality of the data they are trained on. Creating large-scale, high-quality datasets is often expensive and sometimes impossible, e.g. in certain scientific applications where there is no access to clean data due to physical or instrumentation constraints. Ambient Diffusion and related frameworks train diffusion models solely on corrupted data (which is usually cheaper to acquire), but ambient models significantly underperform models trained on clean data. We study this phenomenon at scale by training more than 80 models on data with different corruption levels across three datasets ranging from 30,000 to ≈1.3M samples. We show that it is impossible, at these sample sizes, to match the performance of models trained on clean data when training only on noisy data. Yet, a combination of a small set of clean data (e.g. 10% of the total dataset) and a large set of highly noisy data suffices to reach the performance of models trained solely on similar-size datasets of clean data, and in particular to achieve near state-of-the-art performance. We provide theoretical evidence for our findings by developing novel sample complexity bounds for learning from Gaussian Mixtures with heterogeneous variances. Our theoretical model suggests that, for large enough datasets, the effective marginal utility of a noisy sample is exponentially worse than that of a clean sample. Providing a small set of clean samples can significantly reduce the sample size requirements for noisy data, as we also observe in our experiments.

INTRODUCTION

A key factor behind the remarkable success of modern generative models, from image Diffusion Models (DMs) to Large Language Models (LLMs), is the curation of large-scale datasets (Gadre et al., 2023; Li et al., 2024). However, in certain applications, access to high-quality data is scarce, expensive, or impossible.
For example, in Magnetic Resonance Imaging (MRI) the quality of the data is proportional to the time spent in the scanner (Jalal et al., 2021) and, in black-hole imaging, it is never possible to get full measurements of the object of interest (Lin et al., 2024). Constructing a (copyright-free) large-scale dataset of high-quality general domain images is also an expensive and complex process. Enterprise text-to-image DMs rely on proprietary datasets, often acquired from third-party vendors at significant cost (Betker et al., 2023; Imagen-Team-Google et al., 2024). State-of-the-art open-source DMs are typically trained by crawling a large pool of images from the Web and filtering them for quality with a pipeline that deems each sample either suitable or unsuitable for training (Gadre et al., 2023). However, this binary treatment of samples is problematic because low-quality images often still contain useful information. For example, a blurry image might be discarded by the filtering pipeline to avoid blurry generations at inference time, yet the image might still contain important information about the world, such as the types of objects present in the scene. Recently, there has been a growing interest in developing frameworks for training generative models using corrupted data, e.g. from blurry or noisy images (