CVPR 2025
Scaling Vision Pre-Training to 4K Resolution
Baifeng Shi, Boyi Li, Han Cai, Yao Lu, Sifei Liu, Marco Pavone, Jan Kautz, Song Han, Trevor Darrell, Pavlo Molchanov, Hongxu Yin
Abstract
Figure 1. Left: Regular vision models such as SigLIP [44] process images at a low resolution (e.g., 378 × 378 pixels), which is insufficient for many everyday tasks such as spotting a stop sign while driving. In contrast, PS3 both encodes low-res features and efficiently processes high-res information from 4K-resolution images via top-down patch selection, i.e., selectively processing the patches relevant to any given text prompt. Top Right: SigLIP is pre-trained by contrasting global vision features with global captions, which is costly for high-resolution images. PS3 is pre-trained with an additional contrast between local high-res features and local captions, enabling pre-training at 4K resolution at 79× lower cost than SigLIP. Bottom Right: VILA-HD uses PS3 to selectively process high-res regions based on the user prompt, outperforming state-of-the-art MLLMs such as Qwen2-VL [38] on the proposed 4KPro benchmark while achieving a 2.96× speedup.
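To make the idea of top-down patch selection concrete, the following is a minimal sketch, not the PS3 implementation. It assumes that each image patch and the text prompt have already been embedded into a shared feature space, scores patches by cosine similarity to the prompt, and keeps the top k for high-res processing. The function names (`cosine`, `select_patches`), the scoring rule, and the toy features are all hypothetical choices made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def select_patches(patch_feats, text_feat, k):
    """Return indices of the k patches most similar to the text embedding.

    Hypothetical helper: the actual selection mechanism in PS3 may differ.
    Indices are ordered from most to least similar.
    """
    scores = [cosine(p, text_feat) for p in patch_feats]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy example: four patches with 2-D features (made-up values).
# The prompt embedding points along the first axis.
patch_feats = [[0.0, 1.0], [1.0, 0.0], [0.7, 0.7], [-1.0, 0.0]]
text_feat = [1.0, 0.0]
selected = select_patches(patch_feats, text_feat, k=2)
```

In this toy setting only the selected patches would be passed on to the high-res encoder, so the cost of high-res processing scales with k rather than with the total number of patches in the 4K image.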