WWW2026

LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Multilingual Text-Centric VQA

Jing Huang, Zhiya Tan, Shutao Gong, Fanwei Zeng, Joey Tianyi Zhou, Changtao Miao, Huazhe Tan, Weibin Yao, Jianshu Li

Abstract

Multilingual Text-Centric Visual Question Answering (TEC-VQA) has become crucial for real-world applications, as it requires fine-grained understanding of and reasoning over multilingual scene text. Recent advances in vision-language models (VLMs) have demonstrated strong potential in tackling multimodal tasks. However, most existing approaches rely primarily on textual Chain-of-Thought (CoT) and provide limited support for multilingual multimodal reasoning. To address this gap, we introduce LaV-CoT, the first Language-aware Visual CoT framework with Multi-Aspect Reward Optimization. LaV-CoT incorporates an interpretable multi-stage reasoning pipeline consisting of text summary with bounding box, language identification, spatial object-level captioning, and step-by-step logical reasoning. To improve reasoning accuracy and cross-lingual generalization, we complement supervised fine-tuning with a novel, verifiable Multi-Aspect Reward Optimization that incorporates rewards for linguistic consistency, structural fidelity, and response accuracy. Extensive evaluations on public datasets, including MMMB, Multilingual MMBench, and MTVQA, show that LaV-CoT outperforms open-source models of similar size by up to 9.5% in accuracy, surpasses open-source models more than twice its size, and exceeds several state-of-the-art proprietary models. Moreover, LaV-CoT has been integrated into our online Intelligent Document Processing platform, where an online A/B test shows an improvement of approximately 8.7% in acceptance rate, validating its effectiveness in industrial deployment and commercial applications. Our code is available at https://github.com/HJNVR/LaV-CoT.
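
To make the idea of a verifiable multi-aspect reward concrete, the following is a minimal illustrative sketch, not the authors' implementation. It scores a model output on the three aspects named in the abstract (structural fidelity, linguistic consistency, response accuracy) and combines them as a weighted sum. The stage tag names, the exact-match accuracy check, the comparison of a declared language code, and the weights are all assumptions introduced here for illustration; the abstract does not specify any of them.

```python
import re

# Hypothetical stage tags: the abstract names the stages, not their markup.
STAGES = ("summary", "language", "caption", "reasoning", "answer")

# Hypothetical weights for (structure, language, accuracy); they sum to 1.
WEIGHTS = (0.2, 0.2, 0.6)


def extract(tag, output):
    """Return the stripped content of <tag>...</tag>, or None if absent."""
    m = re.search(rf"<{tag}>(.*?)</{tag}>", output, flags=re.DOTALL)
    return m.group(1).strip() if m else None


def structure_reward(output):
    """Fraction of expected stages present as well-formed tag pairs."""
    found = sum(1 for tag in STAGES if extract(tag, output) is not None)
    return found / len(STAGES)


def language_reward(output, expected_lang):
    """1.0 if the declared language code matches the expected one."""
    declared = extract("language", output)
    return 1.0 if declared is not None and declared.lower() == expected_lang.lower() else 0.0


def accuracy_reward(output, gold_answer):
    """1.0 on a case-insensitive exact match of the final answer."""
    answer = extract("answer", output)
    return 1.0 if answer is not None and answer.lower() == gold_answer.strip().lower() else 0.0


def multi_aspect_reward(output, expected_lang, gold_answer):
    """Weighted sum of the three aspect rewards, in [0, 1]."""
    w_s, w_l, w_a = WEIGHTS
    return (
        w_s * structure_reward(output)
        + w_l * language_reward(output, expected_lang)
        + w_a * accuracy_reward(output, gold_answer)
    )


if __name__ == "__main__":
    sample = (
        "<summary>Sign text [10,20,110,60]</summary>"
        "<language>fr</language>"
        "<caption>A red exit sign</caption>"
        "<reasoning>The sign reads Sortie.</reasoning>"
        "<answer>Sortie</answer>"
    )
    print(multi_aspect_reward(sample, expected_lang="fr", gold_answer="sortie"))
```

Each component is a deterministic check on the output, which is one way to read "verifiable": the reward can be recomputed from the text alone, without a learned reward model. A real system would likely use softer accuracy metrics and a language detector applied to the generated text rather than a declared code.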