AAAI 2026

When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion?

Qilang Ye, Wei Zeng, Meng Liu, Jie Zhang, Yupeng Hu, Zitong Yu, Yu Zhou

Abstract

Can Multimodal Large Language Models (MLLMs) discern confused objects that are visually present but audio-absent? To study this, we introduce a new benchmark, AV-ConfuseBench, which simulates an "Audio-Visual Confusion" scene by modifying the sound that corresponds to an object in the video, e.g., muting the sounding object and asking the MLLM "Is there a/an muted-object sound?". Experimental results reveal that MLLMs, such as Qwen2.5-Omni and Gemini 2.5, struggle to recognize non-existent audio because their reasoning is visually dominated. Motivated by this observation, we introduce RL-CoMM, a Reinforcement Learning-based Collaborative Multi-MLLM built upon the Qwen2.5-Omni foundation. RL-CoMM includes two stages: 1) To alleviate visually dominated ambiguities, we introduce an external model, a Large Audio Language Model (LALM), as the reference model to generate audio-only reasoning. We then design a Step-wise Reasoning Reward function that enables the MLLM to self-improve its audio-visual reasoning against the audio-only reference. 2) To ensure accurate answer prediction, we introduce Answer-centered Confidence Optimization to reduce the uncertainty caused by potential differences between heterogeneous reasoning paths. Extensive experiments on audio-visual question answering and audio-visual hallucination show that RL-CoMM improves accuracy by 10∼30% over the baseline model with limited training data. Repository: https://github.com/rikeilong/AVConfusion
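To make the muted-object probe concrete, the following is a minimal sketch of how one such sample could be assembled. It is an illustration under stated assumptions, not the benchmark's actual construction pipeline: the names `ConfusionSample`, `article_for`, and `make_muted_sample` are hypothetical, muting is approximated by zeroing a span of waveform samples, and the "no" ground-truth label is assumed from the fact that the muted object's sound is absent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ConfusionSample:
    """Hypothetical container for one muted-object probe."""
    audio: List[float]  # waveform with the target object's span silenced
    question: str       # yes/no probe about the muted object
    answer: str         # assumed ground truth: the sound is absent


def article_for(noun: str) -> str:
    """Pick 'an' before a vowel letter, otherwise 'a' (rough heuristic)."""
    return "an" if noun and noun[0].lower() in "aeiou" else "a"


def make_muted_sample(audio: List[float], start: int, end: int,
                      muted_object: str) -> ConfusionSample:
    """Silence samples in [start, end) and build the matching question.

    Assumption: the object's sound occupies a known sample span and
    zeroing that span is an acceptable stand-in for muting it.
    """
    muted = list(audio)  # copy so the caller's waveform is not modified
    for i in range(max(0, start), min(len(muted), end)):
        muted[i] = 0.0
    question = f"Is there {article_for(muted_object)} {muted_object} sound?"
    return ConfusionSample(audio=muted, question=question, answer="no")


sample = make_muted_sample([0.5, -0.2, 0.3, 0.1], 1, 3, "dog")
print(sample.question)
```

The video frames are left untouched in this sketch, so the object stays visually present while its sound is gone, which is the mismatch the benchmark is described as probing.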