CVPR 2025

ImagineFSL: Self-Supervised Pretraining Matters on Imagined Base Set for VLM-based Few-shot Learning

Haoyuan Yang, Xiaoou Li, Jiaming Lv, Xianjun Cheng, Qilong Wang, Peihua Li

Abstract

Adapting CLIP models for few-shot recognition has recently attracted significant attention. Despite considerable progress, these adaptations remain hindered by the pervasive challenge of data scarcity. Text-to-image models, capable of generating abundant photorealistic labeled images, offer a promising solution. However, existing approaches simply treat synthetic images as complements to real images, rather than as standalone knowledge repositories stemming from distinct foundation models. To overcome this limitation, we frame synthetic images as an imagined base set (iBase), i.e., an independent, large-scale synthetic dataset encompassing diverse concepts. Building on this perspective, we introduce ImagineFSL, a novel CLIP adaptation methodology that pretrains on iBase and then fine-tunes for downstream few-shot tasks. We find that, compared to no pretraining, both supervised and self-supervised pretraining are beneficial, with the latter providing better performance. Based on this finding, we propose an improved self-supervised method tailored for few-shot scenarios, enhancing the transferability of representations from synthetic to real image domains. Additionally, we present a systematic and scalable pipeline that employs chain-of-thought and in-context learning techniques, harnessing foundation models to automatically generate diverse, realistic images. Validated across eleven datasets, our methods consistently outperform state-of-the-art approaches by substantial margins.