CVPR2025

Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis

Arpita Chowdhury, Dipanjyoti Paul, Zheda Mai, Jianyang Gu, Ziheng Zhang, Kazi Sajeed Mehrab, Elizabeth G. Campolongo, Daniel I. Rubenstein, Charles V. Stewart, Anuj Karpatne, Tanya Y. Berger-Wolf, Yu Su, Wei-Lun Chao


Abstract

The supplementary is organized as follows.

• Appendix A: Related Work
• Appendix B: Details of Architecture Variant (cf. subsection 2.4 of the main paper)
• Appendix C: Dataset Details (cf. subsection 3.1 of the main paper)
• Appendix D: Inner Workings of Visualization (cf. subsection 2.3 of the main paper)
• Appendix E: Additional Experiment Settings (cf. subsection 3.1 of the main paper)
• Appendix F: Additional Experiment Results and Analysis (cf. subsection 3.2 of the main paper)
• Appendix G: More Visualizations of Different Datasets (cf. Figure 4 of the main paper)

A. Related Work

Pre-trained Vision Transformer. Vision Transformers (ViTs) [9], pre-trained on massive amounts of data, have become indispensable to modern AI development. For example, ViTs pre-trained with millions of image-text pairs via a contrastive objective function (e.g., a CLIP-ViT model) show unprecedented zero-shot capability and robustness to distribution shifts, and serve as the encoders for various powerful generative models (e.g., Stable Diffusion [35] and LLaVA [19]). Domain-specific CLIP-based models like BioCLIP [38] and RemoteCLIP [18], trained on millions of specialized image-text pairs, outperform general-purpose CLIP models within their respective domains. Moreover, ViTs trained with self-supervised objectives on extensive sets of well-curated images, such as DINO and DINOv2 [4, 29], effectively capture fine-grained localization features that explicitly reveal object and part boundaries. We employ DINO, DINOv2, and BioCLIP as our backbone models in light of our focus on fine-grained analysis.

Prompting Vision Transformer. Traditional approaches to adapting pre-trained transformers, namely full fine-tuning and linear probing, face challenges: the former is computationally intensive and prone to overfitting, while the latter struggles with task-specific adaptation [22, 23].
Prompting, first popularized in natural language processing (NLP), addressed such challenges by prepending task-specific instructions to input text, enabling large language models like GPT-3 to perform zero-shot and few-shot learning effectively [3]. Recently, prompting has been introduced in vision transformers (ViTs) to enable efficient adaptation while leveraging the vast capabilities of pre-trained ViTs [12, 42, 53].
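
To make the idea of prompting a ViT concrete, the sketch below shows one common way a few learnable prompt tokens can be inserted into the token sequence of a frozen backbone. It is only an illustration under stated assumptions, not the method of this paper or of any cited work: the names `make_tokens` and `prepend_prompts`, the sizes, the random vectors standing in for real embeddings, and the placement of the prompts right after the class token are all hypothetical choices.

```python
import random


def make_tokens(count, dim, rng):
    """Return `count` random vectors of length `dim` (stand-ins for embeddings)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


def prepend_prompts(cls_token, prompt_tokens, patch_tokens):
    """Build the prompted sequence [CLS, prompt_1..prompt_P, patch_1..patch_N]."""
    return [cls_token] + list(prompt_tokens) + list(patch_tokens)


rng = random.Random(0)
dim = 8            # hypothetical embedding width
num_patches = 16   # e.g. a 4x4 grid of image patches
num_prompts = 3    # hypothetical number of learnable prompt tokens

# In a real setup the class and patch tokens come from the frozen backbone;
# the prompt tokens would be the only parameters updated during adaptation.
cls_token = make_tokens(1, dim, rng)[0]
patch_tokens = make_tokens(num_patches, dim, rng)
prompt_tokens = make_tokens(num_prompts, dim, rng)

sequence = prepend_prompts(cls_token, prompt_tokens, patch_tokens)
print(len(sequence), len(sequence[0]))
```

The resulting sequence is what the transformer layers would attend over. Because only the prompt vectors are trained, the number of tuned parameters is `num_prompts * dim` rather than the full size of the backbone.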