ICLR 2026

Reversible Primitive–Composition Alignment for Continual Vision–Language Learning

Canran Xiao, Tianxiang Xu, Siyuan Ma, Yiyang Jiang, Haoyu Gao, Yuhan Wu

Abstract

Vision-language (VL) models are increasingly deployed in non-stationary settings, yet under sequential adaptation they often preserve primitive recognition while losing compositional structure, especially with tight rehearsal budgets and no task IDs. We address this gap by asking how a continual VL system can maintain structurally dependable behaviour while safeguarding zero-shot performance. We introduce COMPO-REALIGN, a structure-first recipe built around three components: a reversible composer that maps primitive embeddings to compositions by design, a multi-positive InfoNCE that jointly aligns textual and composed views of the same target, and a spectral trust region that clips updates when alignment sensitivity inflates. Across compositional DIL and multi-domain MTIL retrieval, COMPO-REALIGN sets a new state of the art, improves over the strongest prior by +2.4 R@1, and reduces forgetting by 40%. We provide a compact, reversible alignment head with geometry-aware training for compositionally robust VL continual learning.

* Corresponding author

RELATED WORK

Continual VL under non-stationary streams. Early continual captioning framed forgetting as transient-vs-shared dynamics in sequence models, introducing task-conditioned gating and gradient masking to protect recurrent states and vocabularies (Del Chiaro et al., 2020). For contrastive VL, recent work scales to multi-domain retrieval and pretraining: momentum/distillation and topology-aware objectives curb drift across datasets and time (e.g., BMU-MoCo for video-text (Gao et al., 2022), Open-VCLIP for zero-shot video (Weng et al., 2023), CTP for VL continual pretraining with compatible momentum and topology preservation (Zhu et al., 2023)). At web scale, TiC-CLIP shows that warm-starting from the last checkpoint plus replay offers a practical path close to retraining-from-scratch (Garg et al., 2024).
For retrieval, DKR emphasizes rectifying mismatched affinities before distillation to avoid propagating earlier errors (Cui et al., 2024). Much of this line has focused on task/domain retention and large-scale training mechanics (Zhang et al., 2025). However, real deployments also require compositional robustness, i.e., preserving how attributes and objects bind, when rehearsal is scarce and task identities are unknown.

Zero-shot stability and structure preservation. A second line studies how to keep VL geometry stable so zero-shot transfer remains reliable. Mod-X preserves off-diagonal similarity structure to maintain negative-pair geometry across domains (Ni et al., 2023), ZSCL performs reference-set distillation with weight averaging to protect zero-shot predictions (Zheng et al., 2023), CTP distils neighbourhood/topological relations (Zhu et al., 2023), and ZAF stabilizes consecutive zero-shot outputs on unlabeled data as a strong anti-forgetting signal (Gao et al., 2024). Probabilistic finetuning (CLAP4CLIP) further improves calibration and continual robustness (Jha et al., 2024). These approaches strengthen global stability but still leave open whether the model retains the internal structure that enables binding: for instance, whether a composition embedding can reliably support recovering its primitive set and resist counterfactual swaps.

Against this backdrop, this paper targets the above pain point from a structure-first perspective: we use a minimal head that (i) treats textual and composed representations as joint positives to keep the "meaning of a composition" anchored, (ii) makes the primitive-composition map reversible by design so binding remains recoverable.
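
To make the two design choices named above concrete, the following is a minimal illustrative sketch, not the COMPO-REALIGN implementation. It rests on assumptions that this section does not confirm: the reversible map is stood in for by an additive coupling step (a standard invertible construction), and the joint-positive objective is written as one common multi-positive InfoNCE form that averages the log-likelihood over all positives. The function names `couple`, `uncouple`, `toy_shift`, and `multi_positive_info_nce`, the temperature value, and the toy vectors are all hypothetical.

```python
import math


def couple(a, b, shift):
    """Additive coupling (assumed stand-in for a reversible composer):
    keep primitive `a` unchanged and offset primitive `b` by shift(a)."""
    return a, [bi + si for bi, si in zip(b, shift(a))]


def uncouple(y1, y2, shift):
    """Exact inverse of `couple`: subtract the same offset to recover `b`."""
    return y1, [yi - si for yi, si in zip(y2, shift(y1))]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def multi_positive_info_nce(anchor, candidates, positive_idx, temperature=0.07):
    """Average negative log-likelihood of every positive under a softmax
    over all candidates (one common multi-positive InfoNCE form)."""
    logits = [cosine(anchor, c) / temperature for c in candidates]
    top = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return -sum(logits[i] - log_z for i in positive_idx) / len(positive_idx)


def toy_shift(a):
    """Hypothetical shift function; any function of `a` keeps the map invertible."""
    return [2.0 * x + 1.0 for x in a]


# Reversibility: both primitives are recoverable from the coupled pair.
attribute, obj = [1.0, 2.0], [0.5, -0.5]
y1, y2 = couple(attribute, obj, toy_shift)
recovered_attribute, recovered_obj = uncouple(y1, y2, toy_shift)

# Joint positives: candidates 0 and 1 play the roles of the textual view and
# the composed view of the same target; candidates 2 and 3 are negatives.
image = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
loss = multi_positive_info_nce(image, candidates, positive_idx=[0, 1])
```

In this sketch the inverse is exact whatever `toy_shift` computes, which is the sense in which binding stays recoverable, and the loss decreases only when the anchor is close to both positives at once rather than to either one alone.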