CVPR 2025

SpatialCLIP: Learning 3D-aware Image Representations from Spatially Discriminative Language

Zehan Wang, Sashuai Zhou, Shaoxuan He, Haifeng Huang, Lihe Yang, Ziang Zhang, Xize Cheng, Shengpeng Ji, Tao Jin, Hengshuang Zhao, Zhou Zhao

Abstract

Contrastive Language-Image Pre-training (CLIP) learns robust visual models through language supervision, making it a crucial visual encoding technique for various applications. However, CLIP struggles to comprehend spatial concepts in images, potentially restricting the spatial intelligence of CLIP-based AI systems. In this work, we propose SpatialCLIP, an enhanced version of CLIP with better spatial understanding capabilities. To capture the intricate 3D spatial relationships in images, we improve both the "visual model" and the "language supervision" of CLIP. Specifically, we design a 3D-inspired ViT to replace the standard ViT in CLIP. By lifting 2D image tokens into 3D space and incorporating design insights from point cloud networks, our visual model gains greater potential for spatial perception. Meanwhile, captions with accurate and detailed spatial information are very rare. To explore better language supervision for spatial understanding, we re-caption images and perturb their spatial phrases to create negative descriptions, which compels the visual model to seek spatial cues in order to distinguish these hard negative captions. With the enhanced visual model, we introduce SpatialLLaVA, which follows the same training protocol as LLaVA-1.5, to investigate the importance of visual representations for the spatial intelligence of MLLMs. Furthermore, we create SpatialBench, a benchmark specifically designed to evaluate CLIP models and MLLMs on spatial reasoning. SpatialCLIP and SpatialLLaVA achieve substantial performance improvements, demonstrating stronger capabilities in spatial perception and reasoning, while maintaining comparable results on general-purpose benchmarks.

Recently, Contrastive Language-Image Pretraining (CLIP) models [13, 22, 47, 53, 73] have demonstrated the ability to

* Equal Contribution. † Corresponding author.

[Figure: example captions with their CLIP scores; values in parentheses are the change relative to the accurate caption.]

Accurate: The darker-haired cat is behind the white cat. CLIP Score: 22.28
Category Error: The darker-haired bird is behind the white cat. CLIP Score: 20.43 (-1.85)
Attribute Error: The light-haired cat is behind the white cat. CLIP Score: 21.22 (-1.06)
Spatial Error: The darker-haired cat is to the right of the white cat. CLIP Score: 23.66 (+1.38)

Accurate: The white refrigerator is closer to the camera than brown door. CLIP Score: 29.74
Category Error: The white cabinet is closer to the camera brown door. CLIP Score: 25.66 (-4.08)
Attribute Error: The dark refrigerator is closer to the camera than opened door. CLIP Score: 29.39 (-0.35)
Spatial Error: The white refrigerator is further from the camera than brown door. CLIP Score: 29.62 (-0.12)

Accurate: The dark coffee machine is to the left of the mug. CLIP Score: 22.10
Category Error: The ceramic cup is to the left of the mug. CLIP Score: 19.48 (-2.62)
Attribute Error: The old broken coffee machine is to the left of the mug. CLIP Score: 21.27 (-0.83)
Spatial Error: The dark coffee machine is in front of the mug. CLIP Score: 22.53 (+0.43)

Accurate: Two people are standing behind a pile of huge-sized pipes. CLIP Score: 26.62
Category Error: Two people are standing behind a pile of huge-sized papers. CLIP Score: 16.40 (-10.22)
Attribute Error: Two people are standing behind a pile of small pipes. CLIP Score: 25.78 (-0.84)
Spatial Error: Two people are standing inside a pile of huge-sized pipes.