WWW2026

CCAF: Coarse-to-fine Cross-Modal Alignment and Fusion for Multimodal Sentiment Analysis

Xianbing Zhao, Shengzun Yang, Buzhou Tang

Abstract

Multimodal sentiment analysis (MSA) has advanced considerably in recent years. Existing MSA methods focus primarily on learning coarse-grained representations of the different modalities and performing global cross-modal alignment or fusion. However, these approaches often neglect valuable fine-grained sentiment cues that arise from local cross-modal interactions. Furthermore, jointly aligning and fusing complex global and local cross-modal information poses a significant challenge in MSA tasks. To address these issues, we propose CCAF, a novel MSA framework that simultaneously captures coarse-grained and fine-grained cross-modal sentiment cues through global and local cross-modal alignment and fusion. Our approach consists of three key components: i) optimal transport-based global and local cross-modal alignment, which separately aligns valuable global and local sentiment cues across modalities; ii) global and local cross-modal gated attention, which fuses the aligned global and local cross-modal representations, respectively; and iii) a prototype-informed information bottleneck, which uses learnable sentiment prototypes and contrastive prototype matching to eliminate redundant cross-modal information at both the global and local levels. Extensive experiments on two publicly available MSA datasets demonstrate the effectiveness and superiority of the proposed model.
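To make the first component more concrete, the following is a minimal sketch of optimal transport-based alignment between two sets of feature vectors. The abstract does not specify the transport formulation, so everything here is an assumption for illustration: entropic regularization solved with Sinkhorn iterations, a cosine-distance cost, uniform marginals, and the hypothetical function names `cosine_cost`, `sinkhorn`, and `align`. It is not the authors' implementation.

```python
import math


def cosine_cost(xs, ys):
    """Cost matrix C[i][j] = 1 - cosine similarity of xs[i] and ys[j]."""
    def norm(v):
        # Fall back to 1.0 for a zero vector to avoid division by zero.
        return math.sqrt(sum(a * a for a in v)) or 1.0

    return [
        [1.0 - sum(a * b for a, b in zip(x, y)) / (norm(x) * norm(y)) for y in ys]
        for x in xs
    ]


def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals (assumed, not from the paper)."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform mass over source items
    b = [1.0 / m] * m  # uniform mass over target items
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def align(xs, ys, eps=0.1):
    """Map each xs[i] to the plan-weighted mean of ys (barycentric projection)."""
    plan = sinkhorn(cosine_cost(xs, ys), eps)
    dim = len(ys[0])
    aligned = []
    for row in plan:
        mass = sum(row)
        aligned.append(
            [sum(row[j] * ys[j][d] for j in range(len(ys))) / mass for d in range(dim)]
        )
    return plan, aligned


# Toy example: two "text" vectors and three "audio" vectors (made-up values).
text = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
plan, aligned = align(text, audio)
```

In this toy setting the transport plan assigns most of the first text vector's mass to the first audio vector, which points in the same direction. Under the same assumptions, the sketch could be applied once to pooled utterance-level features (global) and once to token- or frame-level features (local); how the paper actually separates the two levels is not stated in the abstract.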