ICLR 2025

FreCaS: Efficient Higher-Resolution Image Generation via Frequency-aware Cascaded Sampling

Zhengqiang Zhang, Ruihuang Li, Lei Zhang

Abstract

While image generation with diffusion models has achieved great success, generating images at resolutions higher than the training size remains challenging due to the high computational cost. Current methods typically perform the entire sampling process at full resolution and process all frequency components simultaneously, which contradicts the inherent coarse-to-fine nature of latent diffusion models and wastes computation on premature high-frequency details at early diffusion stages. To address this issue, we introduce an efficient Frequency-aware Cascaded Sampling framework, FreCaS for short, for higher-resolution image generation. FreCaS decomposes the sampling process into cascaded stages with gradually increasing resolutions, progressively expanding frequency bands and refining the corresponding details. We propose an innovative frequency-aware classifier-free guidance (FA-CFG) strategy that assigns different guidance strengths to different frequency components, directing the diffusion model to add new details in the expanded frequency domain of each stage. Additionally, we fuse the cross-attention maps of previous and current stages to avoid synthesizing unfaithful layouts. Experiments demonstrate that FreCaS significantly outperforms state-of-the-art methods in image quality and generation speed. In particular, FreCaS is about 2.86× and 6.07× faster than ScaleCrafter and DemoFusion, respectively, in generating a 2048×2048 image using a pre-trained SDXL model, and achieves FID_b improvements of 11.6 and 3.7, respectively. FreCaS can be easily extended to more complex models such as SD3. The source code of FreCaS can be found at https://github.com/xtudbxk/FreCaS.

Introduction

In recent years, diffusion models such as Imagen (Saharia et al., 2022), SDXL (Podell et al., 2023), PixArt-α (Chen et al., 2023) and SD3 (Esser et al., 2024) have achieved remarkable success in generating high-quality natural images.
However, these models face challenges in generating very high-resolution images due to the increased complexity of the high-dimensional space. Although efficient diffusion models, including ADM (Dhariwal & Nichol, 2021), CascadedDM (Ho et al., 2022) and LDM (Rombach et al., 2022), have been developed, the computational burden of training diffusion models from scratch for high-resolution image generation remains substantial. As a result, popular diffusion models, such as SDXL (Podell et al., 2023) and SD3 (Esser et al., 2024), primarily focus on generating images at 1024 × 1024 resolution. It is thus increasingly attractive to explore training-free strategies for generating images at higher resolutions, such as 2048 × 2048 and 4096 × 4096, using pre-trained diffusion models.

MultiDiffusion (Bar-Tal et al., 2023) is among the first works to synthesize higher-resolution images using pre-trained diffusion models. However, it suffers from issues such as object duplication, which substantially reduce image quality. To address these issues, Jin et al. (2024) proposed manually adjusting the scale of entropy in the attention operations. He et al. (2023) and Huang et al. (2024)