EMNLP 2023

ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos

Te-Lin Wu, Zi-Yi Dou, Qingyuan Hu, Yu Hou, Nischal Reddy Chandra, Marjorie Freedman, Ralph M. Weischedel, Nanyun Peng


Abstract

Multimodal counterfactual reasoning is a vital ability for AI systems. It involves predicting the outcomes of hypothetical circumstances based on vision and language inputs, which enables AI models to learn from failures and explore hypothetical scenarios. Despite its importance, only a few benchmark datasets target the evaluation of counterfactual reasoning abilities in multimodal models. Furthermore, existing datasets either cover reasoning only over synthetic environments or focus only on specific types of events (e.g., traffic collisions), making it hard to reliably benchmark model generalization across diverse real-world scenarios and reasoning dimensions. To overcome these limitations, we develop a video question answering dataset, ACQUIRED, which consists of 3.7K annotated videos, encompassing a wide range of event types and including both first- and third-person viewpoints, ensuring real-world diversity. In addition, each video is annotated with questions that span three distinct dimensions of reasoning (physical, social, and temporal), which allows a comprehensive evaluation of models' counterfactual reasoning abilities along multiple aspects. We benchmark several state-of-the-art language-only and multimodal models on our dataset, and experimental results demonstrate a significant performance gap (>13%) between models and humans. The findings suggest that multimodal counterfactual reasoning remains an open challenge and that ACQUIRED is a comprehensive and reliable benchmark for inspiring future research in this direction. Our dataset and code are at: https://github.com/PlusLabNLP/acquired