WWW2026

HL-CMR: Hypergraph Learning for Cross-Modal Retrieval

Guohui Ding, Jing Li, Yimin Xu, Rui Zhou

Abstract

Cross-modal retrieval is a fundamental task in multimedia understanding: given a query in one modality (e.g., an image), it retrieves semantically similar samples in another modality (e.g., text). Existing methods focus mainly on point-to-point comparisons between individual samples and overlook the many-to-many structural relationships that are widespread in real-world scenarios. Yet these relationships, formed by multiple samples sharing similar semantics, are crucial for achieving semantic alignment and for constructing accurate shared semantic representations. To address this, we propose HL-CMR, a novel hypergraph-based cross-modal retrieval approach that explicitly establishes many-to-many associations among samples through a label-driven hypergraph construction mechanism combined with differentiated hyperedge weighting. In addition, because traditional unidirectional cross-attention restricts the direction of information interaction, we design a bidirectional cross-attention structure in which image and text each serve as the query source, achieving symmetric semantic enhancement between the modalities. The resulting joint image-text representations are then used as hypergraph vertices, further strengthening the model's ability to align cross-modal semantics. Since constructing a global hypergraph over a large-scale sample set would incur a high computational cost, we build the hypergraph at the batch level and use global label co-occurrence frequency to supervise its construction, so that the local hypergraph better captures global semantics. Experimental results on three benchmark cross-modal retrieval datasets show that our model outperforms existing state-of-the-art methods.
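To make the label-driven construction concrete, the following is a minimal sketch of a batch-level hypergraph and is not the authors' implementation. It assumes that each vertex is one image-text pair with a multi-hot label vector, that each label present in the batch defines one hyperedge, and that hyperedge weights come from a hypothetical per-label vector `global_weight` standing in for the global label co-occurrence statistics. The function names and the normalized propagation operator (the standard hypergraph form D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}) are likewise illustrative assumptions.

```python
import numpy as np


def build_label_hypergraph(labels, global_weight=None):
    """Build a batch-level hypergraph from multi-hot labels.

    labels: (n_samples, n_labels) binary matrix; row i describes vertex i.
    global_weight: optional (n_labels,) vector (hypothetical stand-in for
        global label statistics) used to weight the hyperedges.
    Returns the incidence matrix H, the hyperedge weights w, and the indices
    of the labels that became hyperedges.
    """
    labels = np.asarray(labels, dtype=float)
    # One hyperedge per label that occurs at least once in the batch.
    active = np.flatnonzero(labels.sum(axis=0) > 0)
    H = labels[:, active]
    if global_weight is None:
        w = np.ones(len(active))
    else:
        w = np.asarray(global_weight, dtype=float)[active]
        w = w / w.sum()  # normalize so the weights sum to 1
    return H, w, active


def hypergraph_propagation(H, w):
    """Return Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} for incidence H, weights w."""
    dv = H @ w            # weighted vertex degrees
    de = H.sum(axis=0)    # hyperedge degrees (positive for active labels)
    dv_inv_sqrt = np.zeros_like(dv)
    mask = dv > 0         # vertices with no label keep a zero row/column
    dv_inv_sqrt[mask] = dv[mask] ** -0.5
    M = dv_inv_sqrt[:, None] * H
    return (M * (w / de)) @ M.T


# Toy batch: 4 image-text pairs, 4 labels (the last label is unused).
labels = [[1, 0, 1, 0],
          [1, 1, 0, 0],
          [0, 1, 0, 0],
          [1, 0, 1, 0]]
H, w, active = build_label_hypergraph(labels, global_weight=[4.0, 2.0, 2.0, 1.0])
A = hypergraph_propagation(H, w)
```

In this toy batch, samples 0, 1, and 3 share the first label and therefore sit in the same hyperedge, which is the many-to-many association that a pairwise comparison would miss. Multiplying `A` by a matrix of vertex features would mix each vertex with the others in its hyperedges, in proportion to the hyperedge weights.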