WWW2026

Multi-view Semantic Contrastive Alignment for Multimodal Recommendation

Jiuqiang Li, Hongjun Wang

Abstract

Multimodal recommendation integrates the multimodal features of items with historical user behaviors to improve recommendation accuracy on online media platforms. Most existing methods apply cross-modal learning over multimodal features to augment node representations. However, these approaches face two key challenges: i) the augmented representations offer limited information gain for interaction prediction in the collaborative view, and ii) the semantic discrepancy between the collaborative view and the modality-augmented features remains inadequately addressed. To overcome these obstacles, we present a new Multi-view Semantic Contrastive Alignment (MSCA) approach for multimodal recommendation, which models and aligns node representations from multiple views. Specifically, we introduce a multi-view semantic pattern encoder that learns basic embeddings from the collaborative view and independently captures augmented semantic patterns from the item-item structural view and the intra-modal view. Furthermore, we design a semantic contrastive alignment task that mitigates the semantic divergence between collaborative embeddings and augmented representations by maximizing their mutual consistency, thereby facilitating an effective integration of the two. Comprehensive experiments on three benchmark datasets confirm that the proposed MSCA consistently outperforms diverse state-of-the-art baselines.
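
To make the alignment idea concrete, the following is a minimal sketch of one common way to maximize consistency between two views of the same nodes: an InfoNCE-style contrastive loss. The abstract does not state the exact objective used by MSCA, so the loss form, the function name `info_nce`, the temperature value, and the toy data are all assumptions made for illustration only.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(collab, augmented, temperature=0.2):
    """InfoNCE-style loss between two views of the same nodes (assumed form).

    Row i of `collab` and row i of `augmented` describe the same node and
    form the positive pair; every other row of `augmented` is a negative.
    """
    c = [normalize(v) for v in collab]
    a = [normalize(v) for v in augmented]
    total = 0.0
    for i, ci in enumerate(c):
        # Temperature-scaled cosine similarity to every augmented embedding.
        logits = [sum(x * y for x, y in zip(ci, aj)) / temperature for aj in a]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the positive pair.
        total += log_denominator - logits[i]
    return total / len(c)


# Toy data (hypothetical): 8 nodes with 16-dimensional embeddings.
random.seed(0)
n, d = 8, 16
collab = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
# A view that is nearly consistent with the collaborative embeddings.
aligned = [[x + 0.01 * random.gauss(0, 1) for x in row] for row in collab]
# A view with no relation to the collaborative embeddings.
unrelated = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

loss_aligned = info_nce(collab, aligned)
loss_unrelated = info_nce(collab, unrelated)
print(f"aligned view loss:   {loss_aligned:.4f}")
print(f"unrelated view loss: {loss_unrelated:.4f}")
```

Under these assumptions, the loss is lower when the augmented view agrees with the collaborative view, so minimizing it pulls the two representations of each node together while pushing apart representations of different nodes.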