CVPR 2025

Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation

Henghui Du, Guangyao Li, Chang Zhou, Chunjie Zhang, Alan Zhao, Di Hu

Abstract

[Figure 1 panels: example prompts and responses for four types of audio-visual scene understanding task (AVE/AVVP/AVQA/AVS/ARIG/…), handled by MLLMs.]

Temporal localization. Prompt: "Please describe the events and time range that occurred in the video." Response: "A red car appears from a distance and drives down a dirt road, kicking up dust and creating a cloud of smoke. The car is visible and audible from the 4th to the 9th second."

Spatial localization. Prompt: "Please recognize the category of object that makes the sound and then output the location coordinates."

Pixel-level understanding. Prompts: "Please segment out the object that makes sound on the left." and "Please segment out the sounding object."

Spatio-temporal reasoning. Prompt: "What is the left instrument of the first sounding instrument?" Response: "In the video, three people are playing musical instruments in front of a Christmas tree. The man on the left is playing the cello, the man in the middle is playing the violin, and the man on the right is playing the piano. At the 2nd second, the piano sounds first. Then, starting from the 4th second, three instruments play together. The instrument on the left of the piano is the cello. So the answer is cello."

Figure 1. We present Crab, a unified audio-visual scene understanding model with explicit cooperation, which can complete various audio-visual tasks. It is trained on an instruction-tuning dataset with an explicit reasoning process, which clarifies the cooperative relationship among tasks. Furthermore, to alleviate the interference caused by learning from complex audio-visual data and to facilitate concrete cooperation, an interaction-aware LoRA structure is designed to enable the model to focus on different aspects of data interaction.