ACL 2024

ICC: Quantifying Image Caption Concreteness for Multimodal Dataset Curation

Moran Yanuka, Morris Alper, Hadar Averbuch-Elor, Raja Giryes

Abstract

Web-scale training on paired text-image data is becoming increasingly central in multimodal learning, but is challenged by the highly noisy nature of datasets in the wild. Standard data filtering approaches succeed in removing mismatched text-image pairs, but permit semantically related but highly abstract text. In this work, we propose a new metric, Image Caption Concreteness (ICC), that evaluates caption text without an image reference to measure its concreteness and relevancy for use in multimodal learning. Our approach leverages strong foundation models for measuring visual-semantic information loss in multimodal representations. We demonstrate that this strongly correlates with human evaluation of concreteness in both single-word and sentence-level texts. Moreover, we show that curation using ICC complements existing approaches and succeeds in distilling multimodal web-scale datasets for more effective learning.

[...] captions irrelevant to their images, or using rule-based proxies such as measuring the complexity of captions via semantic parsing (Radenovic et al., 2023). However, these approaches fail to identify captions that are highly abstract and may contain subjective, non-visual information, despite being semantically aligned with the image and having a sufficiently complex grammar. Figure 1 shows examples of such image-caption pairs. A caption such as "It does not look like something I would want to eat" is semantically related to the image, but a model trained to predict this caption from its image may learn to hallucinate details, e.g., liking a certain type of food in this example, which are not visually grounded and are highly subjective.
In this vein, we consider the visual concreteness of image captions, referring to the degree to which text describes a specific visual scene that can be vividly imagined (as opposed to abstract text that may correspond to many possible visual representations). Visual concreteness provides a complementary dimension of textual quality to consider for vision-and-language tasks, as filtering captions by concreteness is a natural way to encourage visually-grounded predictions.

Figure 2: Predicting visual concreteness scores of image captions with our method. We first acquire information using a semantic-bottleneck autoencoder (SBA, top left) and a visual-bottleneck autoencoder (VBA, bottom left). We then distill a weighted combination of their reconstruction scores into a smaller language model (LM, right), which learns to produce ICC scores for new texts. We visualize reconstruction scores for highly concrete ("A black dog") and highly abstract ("A nice location") texts. High and low scores are colored in green and red, respectively. As illustrated, our final score, which combines the two pipelines, yields more accurate concreteness predictions.

We propose the Image Caption Concreteness (ICC) metric for quantifying the visual concreteness of image captions calculated from text alone, i.e., without an image reference.
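As a rough illustration of the weighted combination mentioned in the Figure 2 caption, the sketch below blends per-caption reconstruction scores from the two autoencoders into a single target that a smaller LM could be trained to predict. The function name `combine_scores`, the convex form of the blend, the default weight of 0.5, and the example scores are all assumptions made for illustration; the text here does not specify how the weighting is chosen.

```python
from typing import List, Sequence


def combine_scores(
    sba_scores: Sequence[float],
    vba_scores: Sequence[float],
    sba_weight: float = 0.5,  # hypothetical value, not taken from the paper
) -> List[float]:
    """Blend SBA and VBA reconstruction scores into one target per caption.

    Assumes both score lists are aligned by caption and lie in [0, 1].
    """
    if len(sba_scores) != len(vba_scores):
        raise ValueError("score lists must have the same length")
    if not 0.0 <= sba_weight <= 1.0:
        raise ValueError("sba_weight must be in [0, 1]")
    # Convex combination: the two weights sum to one (an assumption).
    return [
        sba_weight * s + (1.0 - sba_weight) * v
        for s, v in zip(sba_scores, vba_scores)
    ]


# Made-up scores for a concrete and an abstract caption.
captions = ["A black dog", "A nice location"]
targets = combine_scores(sba_scores=[0.9, 0.2], vba_scores=[0.7, 0.1])
for caption, target in zip(captions, targets):
    print(f"{caption!r}: {target:.2f}")
```

Under these assumptions, a caption that both pipelines reconstruct well receives a high target, while one that either pipeline reconstructs poorly is pulled down in proportion to that pipeline's weight.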
We measure concreteness using autoencoding pipelines with visual-semantic information bottlenecks, previously used for other aims (Kamath et al., 2023; Yang et al., 2023). Specifically, we use a semantic-bottleneck autoencoder that identifies how well an LLM recovers the input caption from its semantic CLIP embedding, and a visual-bottleneck autoencoder that leverages the competence of text-to-image generative models. Our ICC metric is distilled from these pipelines; see Figure 2.

Extensive experiments show ICC's effectiveness in filtering multimodal web-scale data for downstream tasks such as image captioning and text-based image retrieval. We will release our data, code, and trained models, anticipating the use of ICC for further tasks that require curation of web-scale visually-grounded text.

2 Method

Given an image caption (of an unseen image), we aim to predict its degree of visual concreteness. Our underlying assumption is that more visually concrete text can be mapped to or from a visual representation with less information loss. Con[...]

[...] generates multiple images from a caption and measures the average similarity between the text and generated images. Due to its high computational cost, we only evaluate it on a statistically-significant portion of the single-word benchmark, which contains nearl