ACL 2025

R2-MultiOmnia: Leading Multilingual Multimodal Reasoning via Self-Training

Leonardo Ranaldi, Federico Ranaldi, Giulia Pucci


Abstract

Reasoning is an intricate process that transcends both language and vision; because of its inherently modality-agnostic nature, developing effective multilingual and multimodal reasoning capabilities is a substantial challenge for Multimodal Large Language Models (MLLMs). These models struggle to activate complex reasoning behaviours, such as step-wise explanation, questioning, and reflection, particularly in multilingual settings where high-quality supervision across languages is lacking. Recent works have introduced eclectic strategies to enhance MLLMs’ reasoning; however, they remain limited to a single language. To align MLLMs’ reasoning capabilities across languages and improve performance across modalities, we propose R2-MultiOmnia, a modular approach that instructs the models to abstract key elements of the reasoning process and then refine reasoning trajectories via self-correction. Specifically, we instruct the models to produce multimodal synthetic demonstrations by bridging modalities and then self-improve their capabilities. To stabilise learning and the structure of the reasoning process, we propose Curriculum Learning Reasoning Stabilisation with structured output rewards, which gradually refines the models’ ability to learn and deliver robust reasoning processes. Experiments show that R2-MultiOmnia improves multimodal reasoning and achieves aligned performance across languages, approaching that of strong models.