ICLR 2025

Efficient and Context-Aware Label Propagation for Zero-/Few-Shot Training-Free Adaptation of Vision-Language Model

Yushu Li, Yongyi Su, Adam Goodge, Kui Jia, Xun Xu

Abstract

Vision-language models (VLMs) have transformed computer vision by enabling zero-shot image understanding, allowing models to generalize to unseen tasks without task-specific training. This paper reviews recent advances in VLMs, focusing on architectures, pretraining strategies, and applications in zero-shot image classification, object detection, and visual reasoning. We propose a framework that integrates contrastive learning, multimodal prompt tuning, and baseline prompts to improve zero-shot performance. Experiments on ImageNet, MS COCO, and Visual Genome demonstrate superior accuracy and robustness. We also address ethical challenges, such as dataset biases, and propose mitigation strategies. Future directions include scalable and fair VLMs for real-world applications.