EMNLP 2025

AbsVis - Benchmarking How Humans and Vision-Language Models "See" Abstract Concepts in Images

Tarun Tater, Diego Frassinelli, Sabine Schulte im Walde


Abstract

Abstract concepts like mercy and peace often lack clear visual grounding, which makes it challenging to study how they are associated with images. To address this, we introduce AbsVis, a dataset of 675 images annotated with 14,175 concept-explanation pairs from humans and two Vision-Language Models (VLMs: Qwen and LLaVA), where each concept is supported by a textual explanation. We compare human and VLM attributions in terms of diversity, abstractness, and alignment, and find that humans attribute more varied concepts. AbsVis also includes 2,680 human preference judgments evaluating the quality of a subset of these annotations, showing that overlapping concepts (attributed by both humans and VLMs) are most preferred. Explanations clarify and strengthen the perceived attributions, both from humans and VLMs. Finally, we show that VLMs can approximate human preferences, and we use these preferences to fine-tune VLMs via Direct Preference Optimization (DPO), yielding improved alignment with preferred concept-explanation pairs.
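For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO loss for a single preference pair, where the "chosen" item would be the preferred concept-explanation pair and the "rejected" item the dispreferred one. The abstract does not describe the training setup, so the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions rather than the authors' implementation.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    either the policy being fine-tuned or the frozen reference model.
    beta=0.1 is an assumed value, not one taken from the paper.
    """
    # Implicit rewards: how much more likely the policy makes each
    # response compared with the reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)), computed stably as softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy and the reference model agree on both responses, the margin is zero and the loss equals log 2; the loss falls below that value as the policy raises the relative likelihood of the preferred pair.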