ACL 2024

M³CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought

Qiguang Chen, Libo Qin, Jin Zhang, Zhi Chen, Xiao Xu, Wanxiang Che


Abstract

Multi-modal Chain-of-Thought (MCoT) requires models to leverage knowledge from both textual and visual modalities for step-by-step reasoning, and it has gained increasing attention. Nevertheless, current MCoT benchmarks still face several challenges: (1) absence of visual modal reasoning, (2) single-step visual modal reasoning, and (3) missing domains, which together hinder the development of MCoT. Motivated by this, we introduce a novel benchmark (M³CoT) to address the above challenges, advancing multi-domain, multi-step, and multi-modal CoT. Additionally, we conduct a thorough evaluation involving abundant MCoT approaches on Vision Large Language Models (VLLMs). We further highlight that current VLLMs still struggle to reason correctly in M³CoT and that a large gap remains between existing VLLMs and human performance in M³CoT, despite their superior results on previous MCoT benchmarks. To our knowledge, we take the first meaningful step toward the multi-domain, multi-step, and multi-modal scenario in MCoT. We hope that M³CoT can serve as a valuable resource, providing a pioneering foundation in multi-domain, multi-step, multi-modal chain-of-thought research.

* Corresponding Author

[Figure: example questions (Q), options (O), rationales (R), and answers (A) illustrating (a) absence of visual modal reasoning, (b) single-step visual modal reasoning, and (c) multi-step visual modal reasoning.]
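To make the distinction between absent, single-step, and multi-step visual modal reasoning concrete, the following is a minimal sketch and not the benchmark's actual data format. The class names, the `uses_image` flag, and the `reasoning_type` helper are all hypothetical, introduced only for illustration. The sample texts are taken from the figure fragments, with elisions kept as "…", and the `uses_image` values are assumptions assigned by hand.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RationaleStep:
    text: str
    uses_image: bool  # hypothetical annotation: does this step rely on the image?


@dataclass
class MCoTSample:
    question: str
    answer: str
    steps: List[RationaleStep] = field(default_factory=list)


def reasoning_type(sample: MCoTSample) -> str:
    """Classify a sample by how many rationale steps rely on the image."""
    visual_steps = sum(1 for step in sample.steps if step.uses_image)
    if visual_steps == 0:
        return "absence of visual modal reasoning"
    if visual_steps == 1:
        return "single-step visual modal reasoning"
    return "multi-step visual modal reasoning"


# (a) The rationale can be followed from the text alone (assumed flag values).
stem = MCoTSample(
    question="… supports the plant … Which part do we usually eat?",
    answer="(B) the stem",
    steps=[RationaleStep("… we usually eat is the stem. It supports the plant …", False)],
)

# (b) One step that looks at the image.
feather = MCoTSample(
    question="Which property matches this object?",
    answer="(B) soft",
    steps=[RationaleStep("… The feather is soft …", True)],
)

# (c) Two steps, each grounded in a different part of the image.
tower = MCoTSample(
    question="What is the purpose of the tower?",
    answer="(C) To indicate the time and wind direction …",
    steps=[
        RationaleStep("The wind vane on top … indicate the wind direction.", True),
        RationaleStep("The clock on top … it is used to indicate the time.", True),
    ],
)

for sample in (stem, feather, tower):
    print(reasoning_type(sample))
```

Counting image-dependent steps is only one possible reading of "multi-step"; the benchmark's own criteria may differ.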