WWW2026

Grasp: Refining Semantic Graphs into Purified Knowledge for Cross-Modal Communication

Liang Chen, Xiaoding Wang, Limei Lin, Dajin Wang, Zhiquan Liu, Jie Wu

Abstract

The explosive growth of multimodal web data demands communication that transmits meaning rather than raw bits. Existing semantic-communication systems often fail under noise, missing modalities, and distribution shifts because they optimize surface features instead of modality-invariant knowledge. We present Grasp, a knowledge-centric framework for cross-modal communication. Grasp segments streams into semantic blocks and builds a graph over them; a lightweight graph neural network (GNN) produces schedulable, importance-weighted representations. At its core is knowledge purification: we minimize an upper bound on conditional mutual information to perform a three-way disentanglement into strongly related, weakly related, and task-irrelevant components, so that only essential semantics are transmitted while non-essential factors are suppressed. To maintain synchrony, we introduce one-to-two temporal contrastive learning, which achieves triple alignment of video, audio, and text despite sampling asynchrony. For efficient transmission, Grasp uses a cross-modal shared vector-quantization codebook (a discrete knowledge codebook) updated by multimodal attention. At the receiver, a soft-recovery mechanism leverages this shared knowledge to robustly reconstruct semantics under low signal-to-noise ratio (SNR) or missing modalities, yielding graceful degradation. Across web tasks, including cross-modal retrieval and missing-modality inference, Grasp improves knowledge consistency, semantic fidelity, and downstream performance over strong baselines while maintaining low latency. These results show that communication structured around purified knowledge is key to building robust, semantic-aware systems for the modern web.

CCS Concepts

• Theory of computation → Semantics and reasoning; • Computing methodologies → Machine learning.
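As a rough illustration of the shared-codebook idea mentioned in the abstract, the following minimal sketch quantizes an embedding to its nearest codeword and then reconstructs it at the receiver as a softmax-weighted mixture of codewords. It is not the Grasp implementation: the abstract does not specify the soft-recovery rule, so the distance-based softmax, the `temperature` parameter, the function names, and the toy codebook values are all assumptions made for illustration.

```python
import math

def squared_distances(vec, codebook):
    """Squared Euclidean distance from vec to every codeword."""
    return [sum((v - c) ** 2 for v, c in zip(vec, code)) for code in codebook]

def nearest_code(vec, codebook):
    """Hard quantization: index of the codeword closest to vec."""
    dists = squared_distances(vec, codebook)
    return min(range(len(codebook)), key=dists.__getitem__)

def soft_recover(received, codebook, temperature=1.0):
    """Assumed soft recovery: softmax-weighted mixture of codewords.

    Codewords closer to the received vector get larger weights; a small
    temperature approaches hard nearest-codeword decoding.
    """
    dists = squared_distances(received, codebook)
    d_min = min(dists)  # shift by the minimum for numerical stability
    weights = [math.exp(-(d - d_min) / temperature) for d in dists]
    total = sum(weights)
    dim = len(received)
    return [
        sum(w * code[i] for w, code in zip(weights, codebook)) / total
        for i in range(dim)
    ]

# Toy codebook shared by sender and receiver (hypothetical values).
CODEBOOK = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

embedding = [0.9, 0.1]                     # sender-side representation
index = nearest_code(embedding, CODEBOOK)  # discrete symbol to transmit
sent = CODEBOOK[index]

# Hypothetical channel noise added to the transmitted codeword.
noisy = [sent[0] - 0.3, sent[1] + 0.35]
recovered = soft_recover(noisy, CODEBOOK, temperature=0.1)
```

In this sketch the receiver never needs the original embedding: it only needs the shared codebook and the corrupted observation. Raising `temperature` spreads weight across more codewords, which is one simple way to picture graceful degradation as noise grows.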