ICLR 2025

Refining CLIP's Spatial Awareness: A Visual-Centric Perspective

Congpei Qiu, Yanhao Wu, Wei Ke, Xiuxiu Bai, Tong Zhang

Abstract

Contrastive Language-Image Pre-training (CLIP) excels in global alignment with language but exhibits limited sensitivity to spatial information, leading to strong performance in zero-shot classification tasks but underperformance in tasks requiring precise spatial understanding. Recent approaches have introduced Region-Language Alignment (RLA) to enhance CLIP's performance in dense multimodal tasks by aligning regional visual representations with corresponding text inputs. However, we find that CLIP ViTs fine-tuned with RLA suffer a notable loss in spatial awareness, which is crucial for dense prediction tasks. To address this, we propose the Spatial Correlation Distillation (SCD) framework, which preserves CLIP's inherent spatial structure and mitigates the above degradation. To further enhance spatial correlations, we introduce a lightweight Refiner that extracts refined correlations directly from CLIP before feeding them into SCD, based on an intriguing finding that CLIP naturally captures high-quality dense features. Together, these components form a robust distillation framework that enables CLIP ViTs to integrate both visual-language and visual-centric improvements, achieving state-of-the-art results across various open-vocabulary dense prediction benchmarks.

Published as a conference paper at ICLR 2025

[Figure 1 panels: (a) Dense Feature Quality: t-SNE of dense features for CLIP, CLIPSelf, and Ours; unsupervised segmentation mIoU with bar labels CLIP, CLIPSelf, RegionCLIP, Ours and values 22.6, 17.1, 16.2, 23.0 in extraction order. (b) Refined-Spatial-Correlation guided RLA: IE, Dense Feature, Language Supervision (RLA), Refiner, Visual-centric Supervision ℒ_SCD.]

Figure 1: (a) Evaluation of dense feature quality. We visualize the object-level dense features of the image encoder with t-SNE and present the unsupervised segmentation results. Existing Region-Language Alignment methods lead to significant degradation of visual-centric feature quality. (b) The framework of our fine-tuning structure. We design an additional visual-centric branch for RLA to enhance the model's spatial awareness.

[...] in the visual-centric quality of dense features. We attribute it to the lack of spatial granularity in language supervision, which compromises the model's capacity for rich visual-centric perception, rendering RLA methods suboptimal for OV dense prediction tasks. Given these insights, our objective is to improve the model's spatial awareness during the RLA process, enhancing OV dense prediction from both visual-centric and vision-language perspectives.

In this paper, we propose a Spatial-Correlation-guided Region-Language Alignment (SC-RLA) framework, designed to preserve the spatial awareness of CLIP ViTs during the RLA process. One key challenge is domain conflict: the RLA process projects dense visual embeddings into a text-oriented domain, making them incompatible with visual-centric objectives. To address this, we extend the correlation distillation mechanism (Li et al., 2020; Zhang & Ma, 2023), which focuses on preserving the consistency of spatial relationships between visual concepts encoded by the dense features, to the cross-modal domain, enabling the transfer of visual-centric spatial knowledge. Specifically, we distill spatial correlations from the original CLIP ViT into the student model, enforcing consistency in spatial correlations during fine-tuning and thereby preserving the model's spatial awareness.

While our experiments validate the effectiveness of SC-RLA in preserving CLIP's spatial awareness, a significant limitation persists: CLIP's native spatial awareness remains suboptimal (Wei et al., 2023), which constrains the full potential of SC-RLA. To mitigate this issue, we propose a self-supervised refinement mechanism aimed at enhancing the spatial awareness of CLIP ViTs, thereby improving the supervision quality of SC-RLA.
This approach is motivated by a key observation: CLIP ViTs exhibit strong inherent spatial awareness once irrelevant semantic contaminants in CLIP's feature map are filtered out. Building on this insight, we introduce a lightweight module, the Refiner, which generates high-quality spatial refinements from the frozen CLIP ViTs. This process unlocks the dense perception capabilities of the model in a visual-centric manner, without requiring external supervision. By integrating the Refiner into the SC-RLA pipeline, we present R-SC-RLA, a robust framework that enhances CLIP ViTs from both visual-centric and vision-language perspectives.

The effectiveness of our method is experimentally validated on open-vocabulary dense prediction tasks, including object detection and image segmentation. With only a few epochs of fine-tuning on small datasets like COCO (Lin et al., 2014), our method achieves non-trivial performance improvements when integrated with recent RLA methods such as CLIPSelf (Wu et al., 2023b) and RegionCLIP (Zhong et al., 2022) for object detection tasks. For the segmentation benchmarks, our method also improves the performance of the recent state-of-the-art model Cat-Seg