ICLR 2026

RLAP-CLIP: Continual Multimodal Learning with Prototype Adaptation and Difficulty-Aware Routing

Ruikun Luo, Jiarui Wang, Yuan Gao, Jing Yang, Jieming Yang, Song Wu, Hai Jin, Xiaoyu Xia

Abstract

Vision-language models such as CLIP achieve strong zero-shot performance through contrastive pre-training but face significant challenges in class-incremental image classification. When learning new classes sequentially, current methods suffer from degraded prototype quality caused by passive averaging, and they underutilize their visual adaptation capabilities. We propose RLAP-CLIP, which addresses these limitations through three components. First, Reinforcement Learning-based Prototype Optimization (RLPO) formulates prototype construction as a reinforcement learning problem, actively optimizing class separability rather than relying on simple averaging. Second, difficulty-aware cross-modal fusion uses a mixture-of-experts architecture to route samples through specialized processing pathways based on their complexity. Third, dual-modal prompting balances visual and textual adaptation. Experiments on eight image classification benchmarks spanning general classification, fine-grained recognition, and domain-shift scenarios demonstrate consistent improvements: RLAP-CLIP achieves average accuracy gains of up to 4.52 percentage points and final accuracy improvements of up to 6.26 percentage points over state-of-the-art methods.
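For readers unfamiliar with the "passive averaging" baseline that the abstract contrasts against, the following is a minimal illustrative sketch, not the RLAP-CLIP method itself. It assumes each class prototype is the re-normalized mean of L2-normalized feature vectors and that prediction picks the prototype with the highest cosine similarity. The function names (`l2_normalize`, `mean_prototype`, `classify`) and the two-dimensional toy features are hypothetical and stand in for real CLIP embeddings.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def mean_prototype(features):
    """Passive-averaging baseline (assumed form): mean of unit-length features, re-normalized."""
    normalized = [l2_normalize(f) for f in features]
    dim = len(normalized[0])
    mean = [sum(f[i] for f in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def classify(feature, prototypes):
    """Return the label whose prototype has the highest cosine similarity to the feature."""
    query = l2_normalize(feature)
    return max(
        prototypes,
        key=lambda label: sum(a * b for a, b in zip(query, prototypes[label])),
    )


# Hypothetical toy features standing in for image embeddings of two classes.
toy_features = {
    "cat": [[1.0, 0.1], [0.9, 0.2]],
    "dog": [[0.1, 1.0], [0.2, 0.8]],
}
prototypes = {label: mean_prototype(feats) for label, feats in toy_features.items()}
prediction = classify([0.8, 0.3], prototypes)
```

In this sketch every sample contributes equally to its class prototype, regardless of how well the result separates the classes. That equal weighting is the behavior the abstract describes RLPO as replacing with an actively optimized construction.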