ACL 2025

V-ALPHASOCIAL: Benchmark and Self-Reflective Chain-of-Thought Generation for Visual Social Commonsense Reasoning

Zongyu Lin, Zhikun Xu, Xiaohan Song, Yixin Wan, Xingcheng Yao, Tsung-Han Lin, Selina Song, Pranav Subbaraman, Ben Zhou, Kai-Wei Chang, Yizhou Sun

Cited 3 times


Abstract

Social commonsense reasoning naturally involves both the verbal and non-verbal cues of a social interaction. It is important for Large Vision-Language Models (VLMs) to leverage both textual and visual information when performing tasks such as social understanding and reasoning. However, while current LLMs have shown good social reasoning capabilities in textual contexts, whether they can effectively incorporate visual information in social comprehension remains under-explored. To narrow this gap, we first construct and propose a benchmark, V-SOCIAL, featuring well-aligned text and visual content and tailored to assess visual social commonsense in multimodal foundation models. Through experiments with V-SOCIAL, we find that even the most advanced VLM, GPT-4o, often falls short in social commonsense reasoning. This highlights the critical need to enhance the social grounding of VLMs. One major obstacle to improving it is the lack of high-quality data with good reasoning processes. To overcome this obstacle, we introduce V-ALPHASOCIAL, a novel method that generates high-quality chain-of-thought reasoning paths from unlabeled data. We design a visual reasoning reward model to improve the VLM, and then iteratively refine both the VLM and the reward model. Our extensive analysis showcases how our method enhances social commonsense reasoning, offering an effective approach that facilitates deeper exploration into the field.

[...] 1979), incorporating not only textual but also visual cues such as gestures, facial expressions, and actions. Integrating these features from visual modalities into social commonsense reasoning tasks is crucial for understanding and improving models' social commonsense reasoning holistically. By combining a visual encoder with an LLM, Vision-Language Models have demonstrated decent performance on a wide range of tasks such as image and video captioning and understanding (Wang and Zhao, 2023; Lin et al., 2023).
These models show potential in processing nuanced and context-rich social interactions. However, there is a lack of comprehensive benchmarks specifically designed to evaluate their abilities in visual social commonsense reasoning. Current benchmarks are often constrained by high-quality and aligned multi-modal