ICLR 2025
Denoising Autoregressive Transformers for Scalable Text-to-Image Generation
Jiatao Gu, Yuyang Wang, Yizhe Zhang, Qihang Zhang, Dinghuai Zhang, Navdeep Jaitly, Joshua M. Susskind, Shuangfei Zhai
Abstract
Diffusion models have become the dominant approach for visual generation. They are trained by denoising a Markovian process that gradually adds noise to the input. We argue that the Markovian property limits the model's ability to fully utilize the generation trajectory, leading to inefficiencies during training and inference. In this paper, we propose DART, a transformer-based model that unifies autoregressive (AR) and diffusion within a non-Markovian framework. DART iteratively denoises image patches spatially and spectrally using an AR model that has the same architecture as standard language models. DART does not rely on image quantization, which enables more effective image modeling while maintaining flexibility. Furthermore, DART seamlessly trains with both text and image data in a unified model. Our approach demonstrates competitive performance on class-conditioned and text-to-image generation tasks, offering a scalable, efficient alternative to traditional diffusion models. Through this unified framework, DART sets a new benchmark for scalable, high-quality image synthesis.
* Work done as part of an internship at Apple.
train on high resolution images directly, requiring either cascaded models (Ho et al., 2022), multiscale approaches (Gu et al., 2023), or preprocessing of images to autoencoder codes at lower resolutions (Rombach et al., 2022). These limitations can stem from their reliance on the Markovian assumption, which simplifies the generative process but restricts the model to seeing only the generation from the previous step. This often leads to inefficiencies during training and inference, as the individual denoising steps are unaware of the trajectory of generations from prior steps.
In parallel, autoregressive models, such as GPT-4 (Achiam et al., 2023), have shown great success in modeling long-range dependencies in sequential data, particularly in the field of natural language processing.
These models efficiently cache computations and manage dependencies across time steps, which has also inspired research into adapting autoregressive models for image generation. However, early efforts such as PixelCNN (Oord et al., 2016), while promising, suffered from high computational costs due to pixel-wise generation. More recent models like VQ-GAN (Esser et al., 2021a) and related work (Yu et al., 2022; Team, 2024; Tian et al., 2024) learn models of quantized images in a compressed latent space; Li et al. (2024) propose to generate directly in such a space without quantization by employing a diffusion-based loss function. However, these methods fail to fully leverage the progressive denoising benefits of diffusion models, resulting in limited global context and error propagation during generation.
To address these limitations, we propose Denoising AutoRegressive Transformer (DART), a novel generative model that integrates autoregressive modeling within a non-Markovian diffusion framework (Song et al., 2021) (Fig. 2). The non-Markovian formulation in DART enables the model to leverage the full generative trajectory during training and inference, while retaining the progressive modeling benefits of diffusion models, resulting in more efficient and flexible generation compared to traditional diffusion and autoregressive approaches. Additionally, DART introduces two key improvements to address the limitations of the non-Markovian approach: (1) token-level autoregressive modeling (DART-AR), which captures dependencies between image tokens autoregressively, enabling finer control and improved generation quality, and (2) a flow-based refinement module (DART-FM), which enhances the model's expressiveness and smooths transitions between denoising steps. These extensions make DART a flexible and efficient framework capable of handling a wide range of tasks, including class-conditional, text-to-image, and multimodal generation.
DART offers a scalable, efficient alternative to traditional diffusion models, achieving competitive performance on standard benchmarks for class-conditioned (e.g., ImageNet (Deng et al., 2009)) and text-to-image generation.
To summarize, major contributions of our work include:
• We propose DART, a novel non-Markovian diffusion model that leverages the full denoising trajectory, leading to more efficient and flexible image generation compared to traditional approaches.
• We propose two key improvements: DART-AR and DART-FM, which improve the expressiveness and coherence throughout the non-Markovian generation process.
• DART achieves competitive performance in both class-conditioned and text-to-image generation tasks, offering a scalable and unified approach for high-quality, controllable image synthesis.
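The contrast drawn above between Markovian and non-Markovian denoising can be made concrete with a toy attention mask. The sketch below is illustrative only and is not taken from the paper: the function name `trajectory_mask`, its parameters, and the choice of full attention within a denoising step are all assumptions. It treats the denoising trajectory as a sequence of steps ordered from noisiest to cleanest, each holding a fixed number of patch tokens. Under the Markovian setting a token sees only its own step, whereas under the non-Markovian setting it also sees every earlier, noisier step.

```python
import numpy as np


def trajectory_mask(num_steps: int, tokens_per_step: int,
                    markovian: bool = False) -> np.ndarray:
    """Hypothetical helper: boolean attention mask over a denoising trajectory.

    Returns an (N, N) array with N = num_steps * tokens_per_step, where
    mask[q, k] is True when query token q may attend to key token k.
    Steps are ordered from noisiest (index 0) to cleanest.
    """
    # Denoising-step index of every token in the flattened sequence.
    step = np.repeat(np.arange(num_steps), tokens_per_step)
    q_step, k_step = step[:, None], step[None, :]
    if markovian:
        # Assumption: a Markovian denoiser sees only the current noisy image.
        return q_step == k_step
    # Non-Markovian: the current step and all earlier (noisier) steps are visible.
    return k_step <= q_step


mask_nm = trajectory_mask(num_steps=3, tokens_per_step=2)
mask_m = trajectory_mask(num_steps=3, tokens_per_step=2, markovian=True)
```

With 3 steps of 2 tokens each, the non-Markovian mask permits 24 query-key pairs against 12 for the Markovian one; the additional pairs are exactly the connections to earlier points on the trajectory that a Markovian denoiser cannot use.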