ICML 2025
Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger
Qi Yang, Chenghao Zhang, Lubin Fan, Kun Ding, Jieping Ye, Shiming Xiang
Abstract
Recent advancements in Large Vision-Language Models (LVLMs) have significantly improved performance in Visual Question Answering (VQA) tasks through multimodal Retrieval-Augmented Generation (RAG). However, existing methods still face challenges, such as the scarcity of knowledge containing reasoning examples and erratic responses induced by retrieved knowledge. To address these issues, we propose a multimodal RAG framework, termed RCTS, which enhances LVLMs by constructing a Reasoning Context-enriched knowledge base and applying a Tree Search re-ranking method. Specifically, we introduce a self-consistent evaluation mechanism to enrich the knowledge base with intrinsic reasoning patterns. We further propose Monte Carlo Tree Search with Heuristic Rewards (MCTS-HR) to prioritize the most relevant examples. This ensures that LVLMs can leverage high-quality contextual reasoning for better and more consistent responses. Extensive experiments demonstrate that our framework achieves state-of-the-art performance across multiple VQA datasets, significantly outperforming both In-Context Learning (ICL) and Vanilla-RAG methods. These results highlight the effectiveness of our knowledge base and re-ranking method in improving LVLMs.
[Figure: pipeline diagram, labels recovered from extraction: User Question, Retrieval Model, MLP, Embeddings, Similarity Search, Knowledge Base, Top-N Contexts, MCTS Re-ranking, Reasoning Contexts, LVLM, Answer. Knowledge-base entries appear as Q / A / C triples, e.g. "Q: Which rhetorical appeal is primarily used in this ad? A: logos (reason) C: Identify the type of appeal: ...".]
[Figure: example reasoning contexts describing an image of the Great Wall of China, each written as numbered steps, e.g. "1. Identify the Image Content: The image shows the Great Wall of China, ..." and "2. Understand the Historical Context: The Great Wall was ...".]
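As a rough illustration of the retrieve-then-re-rank flow outlined in the abstract, the sketch below first selects the top-N knowledge-base entries by cosine similarity and then searches over orderings of those candidates with a tree search. It is a generic UCT-style Monte Carlo Tree Search, not the paper's MCTS-HR: the heuristic rewards are not specified in this section, so a placeholder reward stands in for them. The function names (`top_n_by_cosine`, `mcts_rerank`, `toy_reward`), the toy embeddings, and the relevance scores are all hypothetical.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_n_by_cosine(query_vec, kb_vecs, n):
    """Indices of the n knowledge-base vectors most similar to the query."""
    ranked = sorted(range(len(kb_vecs)),
                    key=lambda i: cosine(query_vec, kb_vecs[i]), reverse=True)
    return ranked[:n]


class Node:
    """Tree node holding a partial ordering of candidate indices."""

    def __init__(self, order, parent=None):
        self.order = order      # tuple of candidate indices chosen so far
        self.parent = parent
        self.children = {}      # next candidate index -> child Node
        self.visits = 0
        self.total_reward = 0.0


def mcts_rerank(candidates, reward_fn, k, iterations=200, c=1.4, seed=0):
    """Search for a high-reward ordering of k candidates (generic UCT sketch)."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    k = min(k, len(candidates))
    rng = random.Random(seed)
    root = Node(())
    best_order, best_reward = None, -math.inf
    for _ in range(iterations):
        node = root
        # Selection and expansion: descend by UCT until a new child can be added.
        while len(node.order) < k:
            untried = [x for x in candidates
                       if x not in node.order and x not in node.children]
            if untried:
                choice = rng.choice(untried)
                node.children[choice] = Node(node.order + (choice,), parent=node)
                node = node.children[choice]
                break
            node = max(node.children.values(),
                       key=lambda ch: ch.total_reward / ch.visits
                       + c * math.sqrt(math.log(node.visits) / ch.visits))
        # Rollout: complete the partial ordering with random remaining candidates.
        remaining = [x for x in candidates if x not in node.order]
        rng.shuffle(remaining)
        full = node.order + tuple(remaining[:k - len(node.order)])
        reward = reward_fn(full)
        if reward > best_reward:
            best_order, best_reward = full, reward
        # Backpropagation: update statistics along the path back to the root.
        while node is not None:
            node.visits += 1
            node.total_reward += reward
            node = node.parent
    return list(best_order), best_reward


# Hypothetical relevance scores; a real system would score contexts differently.
RELEVANCE = {0: 0.9, 1: 0.2, 2: 0.6, 3: 0.1}


def toy_reward(order):
    """Placeholder reward: relevance discounted by position in the ordering."""
    return sum(RELEVANCE[i] / (pos + 1) for pos, i in enumerate(order))


kb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]  # toy embeddings
candidates = top_n_by_cosine([1.0, 0.1], kb, n=3)
order, reward = mcts_rerank(candidates, toy_reward, k=2)
print(candidates, order, round(reward, 3))
```

The two stages are kept separate on purpose: similarity search is cheap and narrows the pool, while the tree search spends its budget only on orderings of that small pool. Swapping `toy_reward` for a reward computed from model responses would leave the search loop unchanged.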