WWW2026

VisionST: Coordinating Cross-modal Traffic Prediction with Interactive Geo-image Encoding

Jinwen Chen, Hao Miao, Chenxi Liu, Yan Zhao, Kai Zheng

Abstract

Traffic prediction plays a pivotal role in contemporary web technologies, enabling intelligent web services such as route planning and remote traffic management. Many recent deep learning methods for traffic prediction rely solely on historical traffic observations to predict future ones. However, traffic is influenced by diverse factors, such as road networks and social events, that appear in different modalities. Most existing methods focus on a single modality and therefore fail to capture the traffic patterns that arise across these factors, which results in sub-optimal performance. Web-sourced geo-images, e.g., satellite imagery, carry rich contextual information and offer an effective way to represent diverse modalities. To unleash the power of such geo-images, we propose VisionST, a Vision-augmented Spatial-Temporal Neural Network, which coordinates cross-modal traffic prediction with interactive geo-image encoding. To bolster resilience against highly intricate and overlapping traffic patterns, VisionST features a visual semantic extraction mechanism and a pattern-guided aggregation mechanism. The former extracts node-level visual tokens and node-to-node visual relation patterns from geo-referenced images. The latter generates relation patterns that encompass visual, spatial, and temporal aspects, and constrains nodes to interact with these relation patterns in order to exchange contextual information. Extensive experiments on large-scale real-world datasets demonstrate the effectiveness of the proposed solutions, showing that VisionST consistently outperforms state-of-the-art baselines.
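
As a rough illustration of what it can mean for nodes to interact with a shared set of relation patterns, the sketch below uses generic scaled dot-product attention: each node scores every pattern, normalizes the scores with a softmax, and adds the weighted pattern mixture to its own embedding. This is not the VisionST architecture, which the abstract does not specify at this level. The function name `attend_to_patterns`, the residual update, and the toy vectors are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend_to_patterns(node_embeddings, patterns):
    """Hypothetical helper: let each node read from a bank of pattern vectors.

    Generic cross-attention, assumed here for illustration only.
    Returns (updated_embeddings, attention_weights).
    """
    dim = len(patterns[0])
    scale = math.sqrt(dim)
    updated, weights = [], []
    for h in node_embeddings:
        # Similarity of this node to every pattern, turned into a distribution.
        attn = softmax([dot(h, p) / scale for p in patterns])
        # Weighted mixture of the patterns, one value per embedding dimension.
        context = [sum(w * p[j] for w, p in zip(attn, patterns)) for j in range(dim)]
        # Residual update: keep the node's own features and add the context.
        updated.append([x + c for x, c in zip(h, context)])
        weights.append(attn)
    return updated, weights


# Toy data (made up): three nodes and two patterns in a 2-dimensional space.
nodes = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
patterns = [[2.0, 0.0], [0.0, 2.0]]
updated, weights = attend_to_patterns(nodes, patterns)
```

In this toy setting the first node attends mostly to the first pattern, the second node mostly to the second, and the third node splits its attention evenly. In a cross-modal model the pattern bank would presumably be learned from visual, spatial, and temporal signals rather than fixed by hand.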