ICLR 2025

Accessing Vision Foundation Models via ImageNet-1K

Yitian Zhang, Xu Ma, Yue Bai, Huan Wang, Yun Fu

Abstract

Fixing a foundation model (e.g., one trained via self-supervised learning) and adapting only a simple task-specific model is sufficient for many problems.

This lecture will cover the following foundation models for vision:
• Discriminative models (e.g., self-supervised models, CLIP)
• Generative models (e.g., text-to-image diffusion models)
• Vision-specific models (e.g., Segment Anything (SAM))

Specifically, this lecture will answer (or at least hint at answers to) the following questions:
• How to train foundation models?
• What are the zero-shot capabilities of foundation models?
• How to exploit foundation models on specific tasks?

DINO v2 [Oquab et al., 2023]
• DINO v2 is better at transferring to vision tasks
• Semantic segmentation on ADE20K, Cityscapes, and Pascal VOC with frozen features
• Depth estimation on NYUd, KITTI, and NYUd -> SUN RGB-D with frozen features
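The idea of keeping a foundation model fixed and training only a simple task-specific model on its frozen features can be sketched as a linear probe. The example below is a minimal illustration, not the setup of any particular model: `frozen_backbone` is a hypothetical stand-in (a fixed random projection followed by tanh) for a real pretrained encoder such as DINO v2, the toy data is synthetic, and the head is a plain logistic regression trained by gradient descent. Only the head's weights are updated.

```python
import math
import random

random.seed(0)

IN_DIM, FEAT_DIM = 4, 8

# Hypothetical stand-in for a pretrained backbone (assumption, not a real model):
# a fixed random projection followed by tanh. These weights are never updated.
BACKBONE_W = [[random.gauss(0, 1) for _ in range(IN_DIM)] for _ in range(FEAT_DIM)]


def frozen_backbone(x):
    """Map an input vector to a feature vector using the fixed weights."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in BACKBONE_W]


def make_toy_data(n):
    """Synthetic two-class data: class 1 is centered at +1, class 0 at -1."""
    data = []
    for _ in range(n):
        label = random.randint(0, 1)
        center = 1.0 if label == 1 else -1.0
        x = [center + random.gauss(0, 0.5) for _ in range(IN_DIM)]
        data.append((x, label))
    return data


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(features, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression head on precomputed (frozen) features."""
    w = [0.0] * FEAT_DIM
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * FEAT_DIM
        grad_b = 0.0
        for f, y in zip(features, labels):
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) - y
            for j in range(FEAT_DIM):
                grad_w[j] += err * f[j]
            grad_b += err
        # Gradient step on the head only; the backbone is untouched.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, f):
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0


train_set = make_toy_data(200)
test_set = make_toy_data(200)

# Features are extracted once, since the backbone never changes.
train_feats = [frozen_backbone(x) for x, _ in train_set]
train_labels = [y for _, y in train_set]

probe_w, probe_b = train_linear_probe(train_feats, train_labels)

correct = sum(predict(probe_w, probe_b, frozen_backbone(x)) == y for x, y in test_set)
accuracy = correct / len(test_set)
print(f"linear-probe test accuracy: {accuracy:.3f}")
```

Because the backbone is frozen, its features can be computed once and cached, which is what makes this kind of adaptation cheap compared with fine-tuning the whole model. With a real foundation model, the same pattern would apply with the encoder's output replacing `frozen_backbone`.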