CVPR 2024
Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters
Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Ping Hu, Dong Wang, Huchuan Lu, You He
Abstract
Continual learning can empower vision-language models to continuously acquire new knowledge without requiring access to the entire historical dataset. However, mitigating performance degradation in large-scale models is non-trivial due to (i) parameter shifts throughout lifelong learning and (ii) the significant computational burden of full-model tuning. In this work, we present a parameter-efficient continual learning framework to alleviate long-term forgetting in incremental learning with vision-language models. Our approach dynamically expands a pre-trained CLIP model by integrating Mixture-of-Experts (MoE) adapters in response to new tasks. To preserve the zero-shot recognition capability of vision-language models, we further introduce a Distribution Discriminative Auto-Selector (DDAS) that automatically routes in-distribution and out-of-distribution inputs to the MoE adapter and the original CLIP, respectively. Through extensive experiments across various settings, our proposed method consistently outperforms previous state-of-the-art approaches while reducing the parameter training burden by 60%. Our code is available at https://github.com/JiazuoYu/MoE-Adapters4CL
[...] data with historical datasets. In contrast, Continual Learning (CL), which offers an efficient incremental training strategy, emerges as a solution by focusing on new data at each training stage. However, CL faces the significant hurdle of "catastrophic forgetting", where a model loses previously acquired knowledge upon learning new tasks [24, 52]. To remedy this issue, one popular solution in current CL methods [1, 16, 23, 43] is to develop dynamic expansion frameworks that incrementally add task-specific components to a shared base model (see Figure 1(a)). Although these methods show promise in memorization and scalability, they cannot distinguish unseen data and thus overlook zero-shot transfer capability.
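As a rough illustration of the routing described in the abstract, the sketch below mixes a few small experts through a top-k router and sends inputs that look out-of-distribution back to the unmodified features. It is not the authors' implementation: the names `MoEAdapter`, `select_branch`, and `forward`, the linear experts, and the distance-to-prototype rule standing in for DDAS are all assumptions made for illustration, and the frozen backbone is reduced to an identity pass-through.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


class MoEAdapter:
    """Hypothetical adapter: a router mixes the top-k of several linear experts."""

    def __init__(self, dim, num_experts=4, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # One router row per expert; each expert is a small dim x dim matrix.
        self.router = [[rng.gauss(0, 0.1) for _ in range(dim)]
                       for _ in range(num_experts)]
        self.experts = [[[rng.gauss(0, 0.1) for _ in range(dim)]
                         for _ in range(dim)]
                        for _ in range(num_experts)]

    def gate(self, x):
        """Return (expert_index, weight) pairs for the top-k experts."""
        logits = matvec(self.router, x)
        top = sorted(range(len(logits)), key=lambda i: logits[i],
                     reverse=True)[:self.top_k]
        weights = softmax([logits[i] for i in top])
        return list(zip(top, weights))

    def __call__(self, x):
        out = list(x)  # residual: input features plus the weighted expert updates
        for idx, weight in self.gate(x):
            update = matvec(self.experts[idx], x)
            out = [o + weight * u for o, u in zip(out, update)]
        return out


def select_branch(x, prototype, threshold):
    """Toy stand-in for a distribution selector (assumption, not DDAS itself):
    inputs close to a stored prototype count as in-distribution."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, prototype)))
    return "adapter" if dist <= threshold else "frozen"


def forward(x, adapter, prototype, threshold):
    """Route in-distribution inputs through the adapter, others past it."""
    if select_branch(x, prototype, threshold) == "adapter":
        return adapter(x)
    return list(x)  # out-of-distribution: features are left unchanged
```

In this toy setup, an input near the prototype is updated by the two highest-scoring experts, while a distant input is returned untouched, mirroring the idea of preserving the original model's behavior on unseen data.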
Recent advancements such as ZSCL [79] have brought zero-shot transfer ability into continual learning by leveraging a pretrained Vision-Language Model (VLM). As illustrated in Figure 1(b), this method relies on knowledge distillation to integrate zero-shot generalization ability from the frozen CLIP and uses parameter regularization to prevent knowledge degradation in continual learning. However, these designs often entail large computational burdens and exhibit limitations in long-term memorization. It is then natural to ask whether we can combine the merits of the pretrained foundation model and the dynamic expansion strategy to form an effective system with robust memorization and zero-shot transfer abilities.
Recently, Parameter-Efficient Fine-Tuning (PEFT) methods [22, 28, 30, 66, 74, 77] have demonstrated that large-scale models can quickly adapt to downstream tasks by fine-tuning only lightweight adapters. This inspires us to build a dynamic expansion framework on a VLM with task-specific adapters to relieve the parameter burden in long-term CL. Nevertheless, the intuitive approach of stacking adapters during incremental learning introduces a dependency on task identity. This poses challenges in practical scenarios such as class-incremental learning, where task identity may be unavailable. Furthermore, the use of independent adapters neglects the potential for inter-task knowledge [...]