CVPR 2025

BiomedCoOp: Learning to Prompt for Biomedical Vision-Language Models

Taha Koleilat, Hojat Asgariandehkordi, Hassan Rivaz, Yiming Xiao

Abstract

Recent advancements in vision-language models (VLMs), such as CLIP, have demonstrated substantial success in self-supervised representation learning for vision tasks. However, effectively adapting VLMs to downstream applications remains challenging, as their accuracy often depends on time-intensive and expertise-demanding prompt engineering, while full model fine-tuning is costly. This is particularly true for biomedical images, which, unlike natural images, typically suffer from limited annotated datasets, unintuitive image contrasts, and nuanced visual features. Recent prompt learning techniques, such as Context Optimization (CoOp), aim to tackle these issues but still fall short in generalizability. Meanwhile, the exploration of prompt learning for biomedical image analysis remains highly limited. In this work, we propose BiomedCoOp, a novel prompt learning framework that enables efficient adaptation of BiomedCLIP for accurate and highly generalizable few-shot biomedical image classification. Our approach achieves effective prompt context learning by leveraging semantic consistency with average prompt ensembles from Large Language Models (LLMs) and knowledge distillation with a statistics-based prompt selection strategy. We conducted comprehensive validation of our proposed framework on 11 medical datasets across 9 modalities and 10 organs against existing state-of-the-art methods, demonstrating significant improvements in both accuracy and generalizability. The code is publicly available at https://github.com/HealthX-Lab/BiomedCoOp.

* Corresponding author

Introduction

Vision-language models such as Contrastive Language-Image Pre-training (CLIP) [37], which align visual and textual information through contrastive pre-training, allow the exploration of open-set visual concepts, thanks to the adoption of natural language supervision. However, the success of these models often relies heavily on the quality of the textual prompts that guide their predictions, while full-model fine-tuning of large-scale VLMs is impractical.
To mitigate these issues, prompt learning, which optimizes textual prompts in vision-language models [25, 50, 51], has emerged as a critical technique for enhancing performance without extensive fine-tuning. Notably, the pioneering work of Context Optimization (CoOp) [51] introduced this approach for CLIP by treating text prompts as learnable context vectors while keeping the pre-trained model weights frozen. Meanwhile, other approaches [16, 19, 47] focus on lightweight few-shot adaptation through Adapters [18] and Linear Probes [37], offering parameter-efficient solutions for model adaptation in downstream tasks.

Unlike natural images, biomedical images span a wide range of contrasts and modalities, depending on the image acquisition devices and parameters. These images, such as MRI and ultrasound, often have unique visual appearances that can be more difficult to interpret than typical photographs. In addition, image features (e.g., color, texture, shape, and anatomical context) that are related to physiological and pathological changes are more nuanced and complex to describe, and can differ between image modalities. Finally, due to privacy concerns and the high level of clinical expertise required for annotation, large datasets of well-annotated biomedical images are scarce for developing clinical deep learning models.

While VLMs and the associated prompt learning techniques have shown success across natural image datasets and benchmarks, their application in the biomedical imaging domain (e.g., diagnosis), which poses distinct challenges, remains largely under-explored. Given the unique domain knowledge involved in biomedical images, the backbone vision-language model for prompt learning may require tailored pre-training for the best outcome.
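
To make the CoOp-style idea mentioned above more concrete (learnable context vectors optimized against frozen pre-trained weights), the following is a minimal, self-contained sketch in plain Python. It is an illustration under strong assumptions, not the CoOp or BiomedCoOp implementation: the "text encoder" is a hypothetical toy (mean pooling plus a fixed linear map) rather than CLIP or BiomedCLIP, the image features are synthetic, and gradients are estimated by finite differences instead of backpropagation. All names (`W_TEXT`, `CLASS_TOKENS`, `encode_text`, `train_context`) are invented for this example.

```python
import math
import random

random.seed(0)
DIM = 4     # toy embedding width
N_CTX = 2   # number of learnable context tokens

def rand_vec(n, scale=1.0):
    return [random.gauss(0.0, scale) for _ in range(n)]

# Frozen stand-ins for a pre-trained model (hypothetical toy values, never updated).
W_TEXT = [rand_vec(DIM) for _ in range(DIM)]     # "text encoder" weights
CLASS_TOKENS = [rand_vec(DIM), rand_vec(DIM)]    # class-name token embeddings

def encode_text(ctx, class_token):
    """Toy frozen text encoder: mean-pool [context tokens, class token], then a fixed linear map."""
    tokens = ctx + [class_token]
    pooled = [sum(tok[i] for tok in tokens) / len(tokens) for i in range(DIM)]
    return [sum(row[i] * pooled[i] for i in range(DIM)) for row in W_TEXT]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-8)

def loss(ctx, data, temperature=0.1):
    """Mean cross-entropy over a softmax of image-text cosine similarities."""
    total = 0.0
    for image_feat, label in data:
        logits = [cosine(image_feat, encode_text(ctx, tok)) / temperature
                  for tok in CLASS_TOKENS]
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_z - logits[label]
    return total / len(data)

def context_gradient(ctx, data, eps=1e-4):
    """Central finite-difference gradient with respect to the context vectors only."""
    grads = []
    for t in range(N_CTX):
        row = []
        for i in range(DIM):
            up = [tok[:] for tok in ctx]
            down = [tok[:] for tok in ctx]
            up[t][i] += eps
            down[t][i] -= eps
            row.append((loss(up, data) - loss(down, data)) / (2 * eps))
        grads.append(row)
    return grads

def train_context(data, steps=100, lr=0.1):
    """Gradient descent on the context; a step is kept only if it lowers the loss."""
    ctx = [rand_vec(DIM) for _ in range(N_CTX)]
    initial = current = loss(ctx, data)
    for _ in range(steps):
        grads = context_gradient(ctx, data)
        trial = [[ctx[t][i] - lr * grads[t][i] for i in range(DIM)]
                 for t in range(N_CTX)]
        trial_loss = loss(trial, data)
        if trial_loss < current:
            ctx, current = trial, trial_loss
        else:
            lr *= 0.5   # step too large: shrink it and try again
    return ctx, initial, current

# Synthetic few-shot set: 4 noisy "image features" per class around a random prototype.
prototypes = [rand_vec(DIM), rand_vec(DIM)]
few_shot_data = [
    ([p + n for p, n in zip(prototypes[label], rand_vec(DIM, 0.1))], label)
    for label in (0, 1) for _ in range(4)
]

learned_ctx, loss_before, loss_after = train_context(few_shot_data)
print(f"loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

Only `learned_ctx` changes during training; `W_TEXT` and `CLASS_TOKENS` are read but never written, which mirrors the parameter-efficient character of prompt learning. In this toy encoder the context is shared across classes and enters only through mean pooling, so it is far less expressive than context tokens passed through a real transformer text encoder.
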
Biomed-specific VLMs, such as BiomedCLIP [48], which is pretrained on 15 million biomedical image-text pairs from internet resources, are better suited for biomedical tasks

This CVPR paper is the Open Access version, provided by the Computer Vision Foundation. Except for this watermark, it is identical to the accepted version; the final published version of the proceedings is available on IEEE Xplore.