EMNLP 2025

MUCAR: Benchmarking Multilingual Cross-Modal Ambiguity Resolution for Multimodal Large Language Models

Xiaolong Wang, Zhaolu Kang, Wangyuxuan Zhai, Xinyue Lou, Yunghwei Lai, Ziyue Wang, Yawen Wang, Kaiyu Huang, Yile Wang, Peng Li, Yang Liu

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated significant advances across numerous vision-language tasks. They have shown promising capability in aligning visual and textual modalities, allowing them to process image-text pairs with clear and explicit meanings. However, resolving the inherent ambiguities present in real-world language and visual contexts remains a challenge. Existing multimodal benchmarks typically overlook linguistic and visual ambiguities, relying mainly on unimodal context for disambiguation and thus failing to exploit the mutual clarification potential between modalities. To bridge this gap, we introduce MUCAR, a novel and challenging benchmark designed explicitly for evaluating multimodal ambiguity resolution across multilingual and cross-modal scenarios. MUCAR includes: (1) a multilingual dataset where ambiguous textual expressions are uniquely resolved by corresponding visual contexts, and (2) a dual-ambiguity dataset that systematically pairs ambiguous images with ambiguous textual contexts, with each combination carefully constructed to yield a single, clear interpretation through mutual disambiguation. Extensive evaluations involving 19 state-of-the-art multimodal models, encompassing both open-source and proprietary architectures, reveal substantial gaps compared to human-level performance, highlighting the need for future research into more sophisticated cross-modal ambiguity comprehension methods that further push the boundaries of multimodal reasoning.

* Equal contribution. Corresponding authors.

[Figure: example ambiguity items; images not reproduced]

C: "I'm going to the bank." "Go fishing? Good Luck!"
Q: Is there a misunderstanding between them?
Answer: Yes. Explanation: The "Bank" in this sentence means a bank where money can be deposited and withdrawn.
Answer: No. Explanation: The "Bank" in this sentence means the riverbank.
Homonymy

C: "I'm standing on the shoulders of giants now."
Q: Does this sentence have a metaphor?
Answer: Yes. Explanation: Each generation innovates and develops on the basis of its predecessors.
Answer: No. Explanation: This is a real scene from "Gulliver's Travels".
Polysemy

C: The chicken is ready to eat.
Q: What is the subject in the sentence going to eat?
Answer: Chicken feed. Explanation: The chicken itself is hungry and ready to eat something.
Answer: Chicken. Explanation: The chicken is cooked and prepared, so it is ready for someone to eat.
Semantics

C: 我的门没有锁。 (Either "My door has no lock." or "My door is not locked.")
Q: 上文的"锁"是动词还是名词? (Is "锁" above a verb or a noun?)
Answer: 名词。(Noun.