ACL 2024
Diffusion Lens: Interpreting Text Encoders in Text-to-Image Pipelines
Michael Toker, Hadas Orgad, Mor Ventura, Dana Arad, Yonatan Belinkov
Abstract
Text-to-image diffusion models (T2I) use a latent representation of a text prompt to guide the image generation process. However, the encoder that produces the text representation is largely unexplored. We propose the DIFFUSION LENS, a method for analyzing the text encoder of T2I models by generating images from its intermediate representations. Using the DIFFUSION LENS, we perform an extensive analysis of two recent T2I models. We find that the text encoder gradually builds prompt representations across multiple scenarios. Complex scenes describing multiple objects are composed progressively and more slowly than simple scenes; earlier layers encode the concepts in the prompts without a clear interaction, which emerges only in later layers. Moreover, the retrieval of uncommon concepts requires further computation until a faithful representation of the prompt is achieved. Concepts are built from coarse to fine, with details being added until the very late layers. Overall, our findings provide valuable insights into the text encoder component in T2I pipelines.

1

[...] the computation process by which the text encoder builds the prompt representation?". To this end, we propose the DIFFUSION LENS, a method for analyzing the representations at intermediate layers of the text encoder.

Current T2I architectures use a pre-trained transformer (Vaswani et al., 2017) as their text encoder. Usually, to generate images, the input prompt is passed through the text encoder and the representation after the final layer is used to condition the diffusion process. The DIFFUSION LENS conditions the diffusion process on intermediate representations of the prompt, leading to visually-coherent, human-understandable images for most layers (Figure 1). Notably, the DIFFUSION LENS relies solely on the pre-trained weights of the model and does not depend on any specific task or external modules.
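The contrast between the usual pipeline and the lens can be illustrated with a small toy sketch. It is not the authors' implementation: the encoder blocks are random stand-ins, the names `make_toy_block`, `encode_all_layers`, and `conditioning_from_layer` are hypothetical, and the layer norm omits the learned scale and shift that a real encoder has. It also assumes that an intermediate state is passed through the same final normalization as the last layer's output before it is handed to the diffusion model.

```python
import math
import random


def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def make_toy_block(dim, rng):
    """Hypothetical stand-in for one encoder block: a random linear map with tanh, plus a residual."""
    weights = [[rng.gauss(0.0, 0.5) for _ in range(dim)] for _ in range(dim)]

    def block(hidden):
        out = []
        for tok in hidden:
            mixed = [math.tanh(sum(w * x for w, x in zip(row, tok))) for row in weights]
            out.append([x + m for x, m in zip(tok, mixed)])
        return out

    return block


def encode_all_layers(embeddings, blocks):
    """Return every hidden state: the input embeddings followed by the output of each block."""
    states = [embeddings]
    for block in blocks:
        states.append(block(states[-1]))
    return states


def conditioning_from_layer(states, layer):
    """Select one layer's hidden states and apply the final layer norm to each token.

    Assumption: intermediate states receive the same final normalization
    that the last layer's output would receive.
    """
    return [layer_norm(tok) for tok in states[layer]]


rng = random.Random(0)
num_layers, num_tokens, dim = 8, 3, 4
blocks = [make_toy_block(dim, rng) for _ in range(num_layers)]
embeddings = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]
states = encode_all_layers(embeddings, blocks)

standard = conditioning_from_layer(states, num_layers)  # usual pipeline: last layer
lens = conditioning_from_layer(states, 4)               # lens-style: an intermediate layer
```

In a real pipeline, `lens` would take the place of `standard` as the prompt conditioning given to the diffusion model, with everything else left unchanged.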
Comparing images generated from different layers, we reveal patterns that emerge during the computa- [...]

• Through rigorous experiments, we uncover how complexity, commonality, and syntactic structure influence the computation process of text encoders.

Ultimately, we shed light on text encoder dynamics, and hope this method aids the community in building and evaluating T2I models.

2 Diffusion Lens

Preliminaries. Current text-to-image diffusion models comprise two main components (Saharia et al., 2022; Ramesh et al., 2022): a language model used as a text encoder that takes the textual prompt as input and produces latent representations; and a diffusion model that is conditioned on the representations from the text encoder and generates an image from an initial input noise.

The language model in the T2I pipeline is typically a transformer model. Transformer models consist of a chain of transformer blocks, each composed of three sub-blocks: attention, multi-layer perceptron, and layer norm (Vaswani et al., 2017). We denote the transformer block at layer $l$ as $F^l$. The input to the model is a sequence of $T$ word embeddings, denoted as $h^0 = [h^0_1, \ldots, h^0_T]$. Then, the output of the transformer block at layer $l$ is a sequence of hidden states $h^{l+1}$:

$$h^{l+1} = F^l(h^l) \tag{1}$$

The output representations of the last block, $L$, go through a final layer norm, denoted as $\mathrm{ln}_f$. Then, they condition the image generation process through cross-attention layers, resulting in an image $I$. We abstract this process as:

[...] senting the intermediate state of the text encoder as interpreted by the diffusion model.

3 Experimental Setup

Models. The experiments are performed on Stable Diffusion 2.1 (denoted SD, Rombach et al., 2022) and Deep Floyd (denoted DF, StabilityAI, 2023).
SD is an open-source implementation of latent diffusion (Rombach et al., 2022), with OpenCLIP-ViT/H (Ilharco et al., 2021) as the text encoder. DF is another open-source implementation of latent diffusion inspired by Saharia et al. (2022), with a frozen T5-XXL (Raffel et al., 2020) as the text encoder. We usually only report the results on DF, unless there is a difference between the models, which we then discuss. The full results on SD are given in Appendix E.

Data. Depending on the specific experiment, we either curate prompt templates and automatically generate a list of prompts from a collected list of concepts we are interested in investigating, or use a list of natural, handwritten prompts from COCO (Lin et al., 2015). The data for each experiment is detailed in the next sections. With each prompt, we generate images that are conditioned on representations from every fourth layer in the model, which serves as a representative subset. This results in 7 images for DF (which ha