ICLR 2026
Identifying Robust Neural Pathways: Few-Shot Adversarial Mask Tuning for Vision-Language Models
Wonjeong Choi, Sejong Ryu, Jungmoon Lee, Dong-Jun Han, Jaekyun Moon
Abstract
Recent vision-language models (VLMs), such as CLIP, have demonstrated remarkable transferability across a wide range of downstream tasks by effectively leveraging the joint text-image embedding space, even with only a few data samples. Despite their impressive performance, these models remain vulnerable to adversarial attacks, raising significant concerns about their security and reliability in practical deployments. To address this issue, we propose Adversarial Mask Tuning (AdvMask), a method that effectively enhances the robustness of VLMs without directly modifying their pre-trained weights. Instead, AdvMask learns a set of binary masks that selectively deactivate model parameters vulnerable to adversarial perturbations. By identifying robust neural pathways within the vision encoder, AdvMask facilitates the generation of features and predictions that are resistant to adversarial attacks. Furthermore, we introduce a Layer-wise Adaptive Feature Alignment (LAFA) loss, specifically designed to optimize AdvMask in few-shot scenarios. The LAFA loss adaptively aligns intermediate-layer features from clean and adversarial samples across each transformer block, enhancing the representational robustness of the model. Experimental results across multiple benchmarks confirm that our AdvMask approach substantially outperforms existing adversarial tuning techniques for VLMs, especially in few-shot settings. Our code is available at https://github.com/wonjeongchoi/AdvMask.
Published as a conference paper at ICLR 2026
[...] to resist adversarial perturbations. While these approaches only require updating a small number of learnable parameters, they overlook the inherent properties of the model's pre-trained structure (i.e., neurons), limiting their capability to produce robust representations against adversarial attacks.
Other works attempt to directly fine-tune the model using adversarial training strategies; however, these approaches can lead to overfitting in few-shot settings (where only a small number of labeled samples are available for each downstream task) and may compromise the generalization ability of the original pre-trained VLM. Furthermore, several methods targeting zero-shot robustness (Yu et al., 2024; Mao et al., 2023) rely on a held-out dataset for adversarial tuning (i.e., no task-specific samples are available), but they often fail to achieve satisfactory performance on downstream tasks. The effectiveness of these approaches largely depends on the quality of the held-out dataset. An extended discussion of related work is provided in Sec. 4.
Motivated by these challenges, in this work, we aim to answer the following key question: What is the most effective way to achieve robustness against adversarial attacks on pre-trained VLMs in few-shot downstream settings?
Key Ideas. Unlike previous methods that predominantly focus on prompt adaptation or direct parameter updates, we propose an adversarial mask tuning (AdvMask) approach that searches for a robust subnetwork within well-trained VLMs as a promising alternative. Inspired by recent studies (Zheng et al., 2023; Zhao et al., 2020; Lin et al., 2020) demonstrating the effectiveness of identifying neural pathways for adapting large-scale pre-trained models, we introduce a novel perspective of a robust neural pathway, which, to the best of our knowledge, has not been explored in previous works. Specifically, given a few samples from the downstream task, our goal is to learn a binary mask that identifies a subnetwork structure within the pre-trained VLM, one that not only facilitates downstream adaptation but also inherently resists adversarial perturbations.
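To make the idea of a binary mask over frozen weights concrete, the following is a minimal sketch of a single masked linear layer. It is an illustration under stated assumptions, not the implementation used in this paper: the helper names (`binarize`, `masked_linear`), the real-valued `scores`, and the rule of thresholding scores at zero to obtain the mask are all hypothetical choices made for this example.

```python
# Illustrative sketch only. The score-thresholding rule and all names
# below are assumptions for this example, not the paper's implementation.

def binarize(scores, threshold=0.0):
    """Turn real-valued mask scores into a 0/1 mask (1 keeps the weight)."""
    return [[1 if s > threshold else 0 for s in row] for row in scores]


def masked_linear(x, weights, mask):
    """Compute y = (W * M) x, where W stays frozen and M is a binary mask."""
    return [
        sum(w * m * xi for w, m, xi in zip(w_row, m_row, x))
        for w_row, m_row in zip(weights, mask)
    ]


# Frozen pre-trained weights (2 outputs, 3 inputs); never modified.
W = [[0.5, -1.0, 2.0],
     [1.5, 0.25, -0.5]]

# Hypothetical learnable scores; in a mask-tuning setup only these would
# be updated on the few-shot data.
scores = [[0.3, -0.2, 0.8],
          [0.1, 0.4, -0.6]]

M = binarize(scores)          # selects the active subnetwork
x = [1.0, 2.0, 3.0]
y = masked_linear(x, W, M)    # forward pass through the selected pathway
```

In this sketch the pre-trained weights `W` are left untouched; changing `scores` only changes which entries of `W` take part in the forward pass, which is the sense in which a mask selects a subnetwork.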
Consequently, by identifying the robust neural pathway, our approach selectively emphasizes robust features during forward passes, substantially improving the adversarial robustness. Interestingly, we demonstrate that such a robust neural pathway indeed exists (further intuitive explanations are provided in Sec. 3.3).
Within our AdvMask training paradigm, we introduce the Layer-wise Adaptive Feature Alignment (LAFA) loss, which enables enhanced robustness and stability. Previous objective functions for adversarial tuning (Mao et al., 2023; Zhou et al., 2024) primarily provide supervision at the final output stage (i.e., the joint text-image embedding space), overlooking the importance of robust intermediate representations within the vision encoder. In contrast, our LAFA loss explicitly guides the intermediate representations of each transformer block to be robust against adversarial perturbations by closely aligning features extracted from adversarial samples with their corresponding clean sample features. Additionally, to effectively handle the limited data in few-shot settings, we adopt an adaptive weighting mechanism based on predictive reliability. Specifically, within our LAFA loss, features from samples that the model predicts correctly with high confidence provide more reliable alignment signals, whereas s