CVPR2025

FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering

Chengyue Huang, Brisa Maneechotesuwan, Shivang Chopra, Zsolt Kira

Abstract

Visual question answering (VQA) systems face significant challenges when adapting to real-world data shifts, especially in multi-modal contexts. While robust fine-tuning strategies are essential for maintaining performance across in-distribution (ID) and out-of-distribution (OOD) scenarios, current evaluation settings are primarily uni-modal or limited to specific types of OOD shift, offering little insight into the complexities of multi-modal contexts. In this work, we propose a new benchmark, FRAMES-VQA (Fine-Tuning Robustness Across Multi-Modal Shifts in VQA), for evaluating robust fine-tuning for VQA tasks. We utilize ten existing VQA benchmarks, including VQAv2, IV-VQA, VQA-CP, OK-VQA and others, and categorize them into ID, near OOD and far OOD datasets covering uni-modal, multi-modal and adversarial distribution shifts. We first conduct a comprehensive comparison of existing robust fine-tuning methods. We then quantify the distribution shifts by calculating the Mahalanobis distance using uni-modal and multi-modal embeddings extracted from various models. Further, we perform an extensive analysis to explore the interactions between uni- and multi-modal shifts as well as modality importance for ID and OOD samples. These analyses offer valuable guidance on developing more robust fine-tuning methods to handle multi-modal distribution shifts. The code is available at https://github.com/chengyuehuang511/FRAMES-VQA.

* Equal contribution.

and clipart, challenging models to generalize across different styles and representations. Similarly, various ImageNet variants [10, 22, 39, 50] introduce shifts through image variations, adversarial examples, rendering transformations, and changes in texture or background. Collectively, these datasets provide a comprehensive framework for assessing how well models withstand visual distribution changes.
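To make the shift quantification mentioned in the abstract concrete, the following is a minimal sketch of a Mahalanobis-distance shift score. It assumes that a Gaussian (mean and covariance) is fitted to ID training embeddings and that the score of a test set is the mean distance of its embeddings to that Gaussian. The function names (`fit_gaussian`, `mahalanobis`, `shift_score`), the ridge term `eps`, and the synthetic embeddings are illustrative assumptions, not the authors' implementation; in practice the embeddings would come from the image, question, or joint representations of a backbone.

```python
import numpy as np

def fit_gaussian(train_emb, eps=1e-6):
    """Fit mean and inverse covariance to ID embeddings of shape (n, d).

    eps is a small ridge term (an assumption here) that keeps the
    covariance matrix invertible.
    """
    mu = train_emb.mean(axis=0)
    cov = np.cov(train_emb, rowvar=False) + eps * np.eye(train_emb.shape[1])
    return mu, np.linalg.inv(cov)

def mahalanobis(emb, mu, cov_inv):
    """Per-sample Mahalanobis distance sqrt((x - mu)^T Sigma^{-1} (x - mu))."""
    diff = emb - mu
    sq = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(sq, 0.0))  # clamp tiny negative round-off

def shift_score(train_emb, test_emb):
    """Mean distance of test embeddings to the ID training distribution."""
    mu, cov_inv = fit_gaussian(train_emb)
    return float(mahalanobis(test_emb, mu, cov_inv).mean())

# Synthetic stand-ins for embeddings (hypothetical data, 8 dimensions).
rng = np.random.default_rng(0)
id_train = rng.normal(0.0, 1.0, size=(2000, 8))
id_test = rng.normal(0.0, 1.0, size=(500, 8))   # same distribution as training
far_ood = rng.normal(3.0, 1.0, size=(500, 8))   # mean-shifted distribution

id_score = shift_score(id_train, id_test)
ood_score = shift_score(id_train, far_ood)
```

Under these assumptions, a larger score indicates that a dataset lies farther from the ID training distribution, so `ood_score` exceeds `id_score` for the synthetic data above. Computing the score separately on image, question, and joint embeddings gives one score per modality.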
While robust fine-tuning algorithms are widely examined under distribution shifts in a single modality (images), few studies have explored robust fine-tuning for VQA tasks, where distribution shifts are multi-modal and models must adapt to variations across both visual and textual inputs. Apart from visual shifts [1], there are question shifts [15, 41] involving variations in phrasing, structure, or vocabulary, as well as answer shifts [2] with changes in answer distributions such as frequency and formatting. Beyond uni-modal shifts, these variations may occur simultaneously across visual, question, and answer inputs [9, 31, 42, 44, 49], posing an even greater challenge as models must generalize across complex, combined shifts.

Therefore, we build upon our preliminary exploration [25] and propose a benchmark, FRAMES-VQA (Fine-Tuning Robustness Across Multi-Modal Shifts in VQA), to systematically evaluate the robustness of fine-tuning in VQA tasks. We leverage ten existing VQA datasets and categorize distribution shifts into uni-modal and multi-modal types, quantified by Mahalanobis distance across various backbones to capture both near and far OOD scenarios. We conduct a comprehensive comparison of the existing robust fine-tuning baselines on ID and OOD performance using the benchmark. Furthermore, we analyze shift scores and modality importance across fine-tuning methods.

To summarize, our contributions are:

• We propose FRAMES-VQA for evaluating robust fine-tuning in VQA, including ten VQA datasets categorized by uni-modal (e.g., image, question) and multi-modal shifts. We quantify dataset shifts under different modalities using Mahalanobis distance and embeddings from different backbones.

• We perform an in-depth comparison of robust fine-tuning

This CVPR paper is the Open Access version, provided by the Computer Vision Foundation.
Except for this watermark, it is identical to the accepted version; the final published version of the proceedings is available on IEEE Xplore.