NeurIPS 2023

T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation

Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, Xihui Liu

293 citations

Abstract

Fig. 1: Failure cases of Stable Diffusion v2 [1]. Our compositional text-to-image generation benchmark consists of four categories: attribute binding (including color, shape, and texture), generative numeracy, object relationships (including 2D/3D-spatial relationships and non-spatial relationships), and complex compositions. Example prompts shown in the figure: attribute-color ("A red book and a yellow vase"); attribute-shape ("An oval coffee table and a rectangular rug"); attribute-texture ("A metallic spoon and a glass vase"); numeracy ("Four swans and two suitcases"); 2D-spatial relationship ("A book on the left of a bird"); 3D-spatial relationship ("A cat in front of a chair"); non-spatial relationship ("A woman is holding a yoga mat"); complex compositions ("The sharp blue scissors cut through the thick white paper").

Despite the impressive advances in text-to-image models, they often struggle to compose complex scenes containing multiple objects with various attributes and relationships. To address this challenge, we present T2I-CompBench++, an enhanced benchmark for compositional text-to-image generation. T2I-CompBench++ comprises 8,000 compositional text prompts categorized into four primary groups: attribute binding, object relationships, generative numeracy, and complex compositions. These are further divided into eight sub-categories, including newly introduced ones such as 3D-spatial relationships and numeracy. In addition to the benchmark, we propose enhanced evaluation metrics designed to assess these diverse compositional challenges. These include a detection-based metric tailored for evaluating 3D-spatial relationships and numeracy,