ICLR 2026

Reasoning-Driven Multimodal LLM for Domain Generalization

Zhipeng Xu, Zilong Wang, Xinyang Jiang, Dongsheng Li, De Cheng, Nannan Wang

Abstract

This paper addresses the domain generalization (DG) problem in deep learning. While most DG methods focus on enforcing visual feature invariance, we leverage the reasoning capability of multimodal large language models (MLLMs) and explore the potential of constructing reasoning chains that derive image categories to achieve more robust predictions under domain shift. To this end, we systematically study the role of reasoning in DG using DomainBed-Reasoning, a newly constructed extension of the DomainBed dataset, in which each sample is paired with class-relevant reasoning chains. Our analysis reveals two key challenges: (i) fine-tuning MLLMs with reasoning chains for classification is more challenging than direct label supervision, since the model must optimize complex reasoning sequences before label prediction; and (ii) mismatches in reasoning patterns between supervision signals and fine-tuned MLLMs lead to a trade-off between semantic richness (informative but harder to optimize) and optimization efficiency (easier to optimize but less informative). To address these issues, we propose RD-MLDG (Reasoning-Driven Multimodal LLM for Domain Generalization), a framework with two components: (i) MTCT (Multi-Task Cross-Training), which introduces an additional direct classification pathway to guide reasoning supervision; and (ii) SARR (Self-Aligned Reasoning Regularization), which preserves the semantic richness of reasoning chains while mitigating reasoning-pattern mismatches via iterative self-labeling. Experiments on standard DomainBed datasets (PACS, VLCS, OfficeHome, TerraInc) demonstrate that RD-MLDG achieves state-of-the-art performance, highlighting reasoning as a promising complementary signal for robust out-of-domain generalization.

Introduction

In real-world scenarios, deep learning models are often required to generalize reliably to unseen distributions. However, due to domain shift, their performance often degrades substantially when deployed in new environments. To address this issue, Domain Generalization (DG) has emerged as a key research area, aiming to learn representations and prediction functions that transfer well to unseen domains using only source-domain data. Existing DG approaches generally fall into four categories: invariant representation learning, data augmentation, regularization, and meta-learning. While effective, these methods primarily focus on feature-level invariance and often fail to capture higher-level cross-domain commonalities, limiting their ability to generalize in complex scenarios. Recently, multimodal large language models (MLLMs) have made significant progress and demonstrate strong reasoning capabilities. This capability offers a new opportunity for DG: rather than