CVPR 2025

Visual Lexicon: Rich Image Features in Language Space

Xudong Wang, Xingyi Zhou, Alireza Fathi, Trevor Darrell, Cordelia Schmid

Abstract

Figure 1. Given the cute corgi painting in the top-left corner, how can we extract a visual representation that captures semantic-level information, such as object categories and layouts, while preserving rich visual details like image styles, textures, and colors? We introduce ViLex, a model that generates image representations in the text vocabulary space, acting as a new visual "language" while retaining intricate visual details that are difficult, if not impossible, to convey in natural language. The images in the 2×2 grid, generated under different diffusion noises, are highly similar to one another both semantically and visually; they are created by using ViLex as "text" prompts for text-to-image diffusion models.