WWW2026

Dual-Branch Multi-Granularity Network with Structured Contrastive Ranking for Cross-Modal Retrieval

Zihao Chen, Chenyang Bu, Shengwei Ji, Xindong Wu

Abstract

Cross-modal retrieval (CMR) has advanced considerably through methods that map image and text features into a shared embedding space. However, these methods still face two persistent challenges: (1) semantic sparsity, where discriminative cues are confined to localized regions, making implicit visual evidence difficult to identify; and (2) ranking uncertainty under semantic ambiguity, where models struggle to maintain the correct retrieval order when candidates share similar contexts. To address these issues, we propose the Dual-Branch Multi-Granularity Network (DBMG) with Structured Contrastive Ranking. DBMG enriches visual semantics by using a multimodal large language model to generate auxiliary descriptions, aligns sparse cues through a dual-branch architecture that captures both global and local interactions, and enforces ranking consistency via a three-stage contrastive objective that progressively optimizes category clustering, instance alignment, and margin-based ranking. Extensive experiments on four standard CMR benchmarks show that DBMG outperforms 12 strong baselines, with an average mAP improvement of 15.91%, establishing a new state of the art. The code is available at https://github.com/DMiC-Lab-HFUT/DBMG.
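
To make the margin-based ranking idea concrete, the following is a minimal illustrative sketch of a generic bidirectional hinge ranking loss over matched image-text pairs. It is not the DBMG objective itself, whose exact form is not given in this abstract. The function names `cosine` and `margin_ranking_loss`, the use of cosine similarity, and the margin value of 0.2 are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def margin_ranking_loss(image_embs, text_embs, margin=0.2):
    """Generic bidirectional hinge ranking loss (illustrative assumption,
    not the paper's objective).

    image_embs[i] and text_embs[i] are assumed to form a matched pair;
    every other pairing in the batch is treated as a negative.
    """
    n = len(image_embs)
    # Similarity matrix: sim[i][j] compares image i with text j.
    sim = [[cosine(image_embs[i], text_embs[j]) for j in range(n)]
           for i in range(n)]
    loss = 0.0
    for i in range(n):
        pos = sim[i][i]
        for j in range(n):
            if j == i:
                continue
            # Image-to-text: negative text j should score below the match.
            loss += max(0.0, margin + sim[i][j] - pos)
            # Text-to-image: negative image j should score below the match.
            loss += max(0.0, margin + sim[j][i] - pos)
    return loss / n


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print(margin_ranking_loss(images, aligned_texts))
    print(margin_ranking_loss(images, swapped_texts))
```

In this sketch the loss is zero once every matched pair outscores all mismatched pairs by at least the margin, and it grows when a mismatched candidate is ranked above the true match, which is the ordering behavior that a margin-based ranking stage is meant to enforce.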