NeurIPS 2025
Sparse Image Synthesis via Joint Latent and RoI Flow
Ziteng Gao, Jay Zhangjie Wu, Mike Zheng Shou
Abstract
Natural images often exhibit underlying sparse structures, with information density varying significantly across spatial locations. However, most generative models rely on dense grid-based pixels or latents, neglecting this inherent sparsity. In this paper, we explore a visual generation paradigm based on sparse non-grid latent representations. Specifically, we design a sparse autoencoder that represents an image, with high reconstruction quality, as a small number of latents together with their positional properties (i.e., regions of interest, RoIs). We then explore training flow-matching transformers jointly on non-grid latents and RoI values. To the best of our knowledge, we are the first to address spatial sparsity using RoIs in the generative process. Experimental results show that our sparse flow-based transformers perform competitively with dense grid-based counterparts at significantly lower compute, and reach a competitive 2.76 FID with just 64 latents on class-conditional ImageNet 256 × 256 generation.

* The corresponding author. 39th Conference on Neural Information Processing Systems (NeurIPS 2025).

Figure 1: Left: conventional autoencoders encode pixels into latent grid representations. Right: our method encodes them into fewer non-grid latents with regions of interest (RoIs).

every timestep, our model learns to estimate both latent and RoI velocity from initial noise to target samples. Unlike prior grid-based latent approaches, our method dynamically adjusts the spatial positions of latents during sampling by solving ordinary differential equations (ODEs) at inference, allowing adaptive refinement of both content and spatial focus. We show the feasibility of representing and generating images with sparse latents and RoIs on the challenging ImageNet benchmark.
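As a rough illustration of the joint sampling idea described above, the following minimal sketch integrates latent values and RoI values together with fixed-step Euler updates. It is not the authors' implementation. Several things here are assumptions made only for illustration: each RoI is written as four normalized numbers (center x, center y, width, height), time runs from t = 0 (noise) to t = 1 (data), and `toy_velocity` is a hypothetical stand-in for the trained transformer that simply points each sample toward a known target.

```python
import random


def toy_velocity(latents, rois, t, target_latents, target_rois):
    """Hypothetical stand-in for a learned velocity model (assumption).

    Returns, for both latents and RoIs, the velocity of a straight path
    that reaches the known target at t = 1: v = (target - x) / (1 - t).
    A real model would predict these velocities without seeing the target.
    """
    scale = 1.0 / (1.0 - t)
    v_lat = [[(b - a) * scale for a, b in zip(z, zt)]
             for z, zt in zip(latents, target_latents)]
    v_roi = [[(b - a) * scale for a, b in zip(r, rt)]
             for r, rt in zip(rois, target_rois)]
    return v_lat, v_roi


def euler_sample(velocity_fn, latents, rois, num_steps):
    """Jointly integrate latents and RoIs from t = 0 to t = 1 with Euler steps."""
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # always strictly below 1, so toy_velocity never divides by zero
        v_lat, v_roi = velocity_fn(latents, rois, t)
        # Content (latents) and spatial focus (RoIs) move in the same step.
        latents = [[a + dt * v for a, v in zip(z, vz)]
                   for z, vz in zip(latents, v_lat)]
        rois = [[a + dt * v for a, v in zip(r, vr)]
                for r, vr in zip(rois, v_roi)]
    return latents, rois


# Toy setup: 3 latents of dimension 4 (the paper uses e.g. 64 latents;
# these small sizes are chosen only to keep the example readable).
random.seed(0)
num_latents, latent_dim = 3, 4
target_latents = [[random.uniform(-1.0, 1.0) for _ in range(latent_dim)]
                  for _ in range(num_latents)]
target_rois = [[0.5, 0.5, 0.2, 0.3],
               [0.25, 0.75, 0.1, 0.1],
               [0.8, 0.2, 0.3, 0.4]]

# Both latents and RoIs start from Gaussian noise.
init_latents = [[random.gauss(0.0, 1.0) for _ in range(latent_dim)]
                for _ in range(num_latents)]
init_rois = [[random.gauss(0.0, 1.0) for _ in range(4)]
             for _ in range(num_latents)]


def velocity_fn(z, r, t):
    return toy_velocity(z, r, t, target_latents, target_rois)


final_latents, final_rois = euler_sample(velocity_fn, init_latents, init_rois,
                                         num_steps=10)
```

With this oracle velocity the Euler trajectory ends on the targets up to floating-point error, which makes the sketch easy to check. In practice the velocity would come from a trained network, and a higher-order ODE solver could replace the Euler step.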
Our proposed sparse flow autoencoder, SF-VAE, can represent 256 × 256 images with just 64 latents at 0.70 reconstruction FID (rFID), or with as few as 32 latents at 1.70 rFID. The presented sparse flow-based transformers, SF-SiTs, perform on par with diffusion/flow-based grid-based transformers [11, 10]. The largest SF-SiT, the XL variant, reaches 2.76 FID with classifier-free guidance [12] on the class-conditional ImageNet generation benchmark with just 64 latents.

2 Related Work

Diffusion models. In recent years, generative models have marked a breakthrough in visual synthesis [1, 13, 14]. Commercial systems like DALL-E [2] or FLUX [6] are typically rooted in denoising diffusion architectures. The seminal work, denoising diffusion probabilistic models [15], treats image generation as a gradual denoising trajectory that iteratively refines pure noise into target images. Building on this foundation, subsequent advances further accelerated and refined diffusion-based generation. Improved variants, including [16, 17, 9], investigate the training and sampling trajectories, enabling high-quality results with fewer sampling steps. Latent diffusion models [1] democratize high-resolution image synthesis by operating in a compressed latent space with reduced computational cost. Follow-up work, including diffusion/flow-matching transformers [18, 11, 10], also follows this convention to speed up training. Although training diffusion models directly on raw pixels is technically feasible [19, 20], the preference for latent space modeling stems from practical challenges: raw pixel data often contains high-frequency details and perceptually complex patterns that are computationally intensive and difficult for diffusion processes to model effectively.

Latent space for diffusion models. A compact latent space is crucial for diffusion models to achieve high-quality image synthesis.
Latent diffusion models [1] propose first training an autoencoder that maps raw pixels to a latent space, which is typically 8× spatially downsampled with 4 channels, reaching a compression rate of 48 (each 8 × 8 patch of 3-channel pixels, i.e., 192 values, is mapped to 4 latent values). Follow-up work on autoencoders, including [5, 21, 22], mainly investigates the number of channels and shows that increasing it can improve the quality of diffusion samples with larger transformer models. The recently proposed deep compression autoencoders (DC-AE) [23] compress the latent space at more aggressive spatial downsampling rates, e.g., 32 or 64, further reducing the training cost of diffusion models. However, the structure of latents for diffusion models remains underexplored. Most autoencoders for diffusion models encode pixels into dense 2D grid-based latents and ignore the underlying non-uniform and sparse structure of natural images, where a background region may warrant fewer latents than the foreground. In this paper, we study this sparsity as well as visual non-unifo