WWW2026
PLIKD: Prompt Learning with Instance-aware Knowledge Distillation for Web-scale Semantic Image Classification
Jianye Xie, Chunhua Hu, Lianyong Qi, Fan Wang, Xiaolong Xu, Haolong Xiang, Xuyun Zhang, Shichao Pei, Amin Beheshti, Wanchun Dou, Xiaokang Zhou
Abstract
With the rapid growth of multi-modal content on the Web, robust vision-language models are essential for semantic understanding and classification of web images under diverse and dynamic contexts, supporting Web applications such as multimedia search and recommendation. Prompt learning has proven effective for enhancing vision-language models in semantic image classification tasks. However, previous methods often generalize poorly: the learned prompts tend to overfit the base classes seen during training, which degrades performance on unseen classes and under distribution shifts. This issue is especially acute for Web-scale data, where new classes emerge and distributions shift dynamically. To address these limitations, we propose PLIKD, a novel prompt learning method that integrates instance-aware knowledge distillation for robust Web-scale semantic image classification. Specifically, PLIKD introduces an instance-aware knowledge extraction module, which uses multi-modal large language models in a step-by-step strategy to extract external knowledge for each image instance. To incorporate this extracted knowledge, PLIKD further introduces an instance-aware knowledge distillation module, which consists of two key steps: (1) a dual-teacher strategy for robust and informative knowledge distillation, and (2) fine-grained cross-modal alignment via Smooth and Sparse Optimal Transport. Extensive experiments demonstrate that PLIKD significantly improves generalization to both seen and unseen classes and remains robust under distribution shifts, outperforming existing state-of-the-art methods on Web-scale semantic image classification.
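The abstract does not give the form of the dual-teacher objective, so the following is only a generic sketch of how a student distribution might be distilled from two teachers at once. It assumes a temperature-softened KL divergence per teacher, mixed by a weight; the function names and the `alpha` and `temperature` parameters are hypothetical and are not taken from PLIKD.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def dual_teacher_kd_loss(student_logits, teacher_a_logits, teacher_b_logits,
                         alpha=0.5, temperature=2.0):
    """Hypothetical dual-teacher loss: a weighted sum of two KL terms.

    alpha weights teacher A against teacher B (an assumption made for
    illustration; the paper may combine its teachers differently).
    """
    student = softmax(student_logits, temperature)
    teacher_a = softmax(teacher_a_logits, temperature)
    teacher_b = softmax(teacher_b_logits, temperature)
    return (alpha * kl_divergence(teacher_a, student)
            + (1.0 - alpha) * kl_divergence(teacher_b, student))
```

Under these assumptions the loss is zero when the student matches both teachers and grows as the student's class distribution drifts from either one, so `alpha` controls how much each teacher constrains the learned prompts.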