EMNLP 2025

VistaWise: Building Cost-Effective Agent with Cross-Modal Knowledge Graph for Minecraft

Honghao Fu, Junlong Ren, Qi Chai, Deheng Ye, Yujun Cai, Hao Wang

Abstract

Large language models (LLMs) have shown significant promise in embodied decision-making tasks within virtual open-world environments. Nonetheless, their performance is hindered by the absence of domain-specific knowledge. Methods that finetune on large-scale domain-specific data entail prohibitive development costs. This paper introduces VistaWise, a cost-effective agent framework that integrates cross-modal domain knowledge and finetunes a dedicated object detection model for visual analysis. It reduces the requirement for domain-specific training data from millions of samples to a few hundred. VistaWise integrates visual information and textual dependencies into a cross-modal knowledge graph (KG), enabling a comprehensive and accurate understanding of multimodal environments. We also equip the agent with a retrieval-based pooling strategy to extract task-related information from the KG, and a desktop-level skill library to support direct operation of the Minecraft desktop client via mouse and keyboard inputs. Experimental results demonstrate that VistaWise achieves state-of-the-art performance across various open-world tasks, highlighting its effectiveness in reducing development costs while enhancing agent performance.

* The work was done during an internship at HKUST(GZ).
† Corresponding Author.
‡ Equal Contribution.

[Figure: panels labeled KG Construction, Memory Stack, and Desktop-level Skill Library. Skill 1 from the skill library panel:]

    import time

    import pyautogui

    def mine(duration):
        # Hold the left mouse button for `duration` seconds, then release it.
        pyautogui.mouseDown(button='left')
        time.sleep(duration)
        pyautogui.mouseUp(button='left')
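The abstract describes a cross-modal KG and a retrieval-based pooling strategy only at a high level, so the following is a minimal sketch under stated assumptions rather than the authors' implementation. It assumes the graph can be modeled as a dictionary of nodes, that textual dependencies are "requires" edges, that object detections attach a bounding box to a node, and that retrieval keeps the nodes reachable from the task target. The names `DEPENDENCIES`, `build_graph`, `retrieve_subgraph`, `requires`, and `seen_at`, as well as the example entries and coordinates, are hypothetical.

```python
from collections import deque

# Hypothetical textual dependencies: item -> items needed to obtain it.
# These entries are illustrative and are not taken from the paper.
DEPENDENCIES = {
    "wooden_pickaxe": ["planks", "stick", "crafting_table"],
    "crafting_table": ["planks"],
    "stick": ["planks"],
    "planks": ["log"],
    "log": [],
    "wool": ["sheep"],
    "sheep": [],
}


def build_graph(dependencies, detections):
    """Merge textual dependencies and visual detections into one graph."""
    graph = {}
    for item, needs in dependencies.items():
        graph[item] = {"requires": list(needs), "seen_at": None}
    for label, box in detections.items():
        # A detected object that is absent from the text still gets a node.
        node = graph.setdefault(label, {"requires": [], "seen_at": None})
        node["seen_at"] = box
    return graph


def retrieve_subgraph(graph, target):
    """Keep only the nodes reachable from the target via 'requires' edges."""
    kept = {}
    queue = deque([target])
    while queue:
        item = queue.popleft()
        if item in kept or item not in graph:
            continue
        kept[item] = graph[item]
        queue.extend(graph[item]["requires"])
    return kept


# Assumed detector output: label -> bounding box (x1, y1, x2, y2) in pixels.
detections = {"log": (412, 230, 468, 301)}
graph = build_graph(DEPENDENCIES, detections)
subgraph = retrieve_subgraph(graph, "wooden_pickaxe")
print(sorted(subgraph))
```

In this sketch the retrieved subgraph for `wooden_pickaxe` excludes the unrelated `wool` and `sheep` nodes, which illustrates how task-related retrieval could keep the context passed to the LLM small.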