AAAI 2026

Steering Visuomotor Policy in Open Worlds via Cross-View Goal Alignment

Shaofei Cai, Zhancun Mu, Anji Liu, Yitao Liang

Abstract

We aim to develop a goal specification method that is semantically clear, spatially sensitive, domain-agnostic, and intuitive for human users to guide agent interactions in 3D environments. Specifically, we propose a novel cross-view goal alignment framework that allows users to specify target objects using segmentation masks from their own camera views rather than from the agent's observations. We highlight that behavior cloning alone fails to align the agent's behavior with human intent when the human and agent camera views differ significantly. To address this, we introduce two auxiliary objectives, a cross-view consistency loss and a target visibility loss, which explicitly enhance the agent's spatial reasoning ability. Building on this, we develop ROCKET-2, a state-of-the-art agent trained in Minecraft that achieves a 3× to 6× improvement in inference efficiency over ROCKET-1. We show that ROCKET-2 can directly interpret goals from human camera views, enabling better human-agent interaction. Remarkably, ROCKET-2 demonstrates zero-shot generalization: despite being trained exclusively on the Minecraft dataset, it can adapt and generalize to other 3D environments such as Doom, DMLab, and Unreal through a simple action space mapping. The project page is available at https://craftjarvis.github.io/ROCKET-2/.

Figure 1 | Using cross-view goal specification, we are the first to show that AI agents can complete complex tasks such as building a bridge and damaging the dragon in Minecraft. The agent also demonstrates impressive zero-shot generalization to other 3D games. Panels: trade, build a bridge, use a portal, activate ender portal, shoot dragon (Minecraft); rescue (Unreal); combat (Doom).
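The abstract mentions that transfer to other 3D environments relies on "a simple action space mapping" but gives no details at this point. The sketch below illustrates what such a mapping could look like. Every name in it (`MinecraftAction`, `to_target_action`, the button names, and `turn_threshold`) is a hypothetical placeholder chosen for illustration, not the interface used by ROCKET-2 or by any particular simulator. It assumes a target environment with discrete turn buttons and no vertical look.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinecraftAction:
    """Hypothetical, simplified Minecraft-style action (not the paper's action space)."""
    forward: bool = False
    back: bool = False
    left: bool = False
    right: bool = False
    attack: bool = False
    use: bool = False
    camera_yaw: float = 0.0    # degrees; positive means turn right
    camera_pitch: float = 0.0  # degrees; positive means look down


def to_target_action(action: MinecraftAction, turn_threshold: float = 1.0) -> dict:
    """Map a Minecraft-style action onto a hypothetical Doom-like button set.

    Assumptions made for this sketch:
      * the target environment has no vertical look, so camera_pitch is dropped;
      * continuous yaw is discretized into TURN_LEFT / TURN_RIGHT buttons;
      * opposing movement keys cancel each other out.
    """
    return {
        "MOVE_FORWARD": action.forward and not action.back,
        "MOVE_BACKWARD": action.back and not action.forward,
        "MOVE_LEFT": action.left and not action.right,
        "MOVE_RIGHT": action.right and not action.left,
        "ATTACK": action.attack,
        "USE": action.use,
        "TURN_LEFT": action.camera_yaw < -turn_threshold,
        "TURN_RIGHT": action.camera_yaw > turn_threshold,
    }
```

Under these assumptions the policy itself is left unchanged, and only this thin adapter is written per environment, which is one plausible reading of why the mapping can be described as simple.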