Published as a conference paper at ICLR 2026

RAVEN: End-to-end Equivariant Robot Learning with RGB Cameras

David Klee, Boce Hu, Andrew Cole, Heng Tian, Dian Wang, Robert Platt, Robin Walters

Abstract

Recent work has shown that equivariant policy networks can achieve strong performance on robot manipulation tasks with limited human demonstrations. However, existing equivariant methods typically require structured inputs, such as 3D point clouds or top-down camera views, which prevents their use in low-cost setups or dynamic environments. In this work, we propose the first SE(3)-equivariant policy learning framework that operates with only RGB image observations. The key insight is to treat image-based data as collections of rays that, unlike 2D pixels, transform under 3D roto-translations. Extensive experiments, both in simulation with diverse robot configurations and in real-world settings, demonstrate that our method consistently surpasses strong baselines in both performance and efficiency. Our project page is available at https://dmklee.github.io/raven.

... flexible policy head that supports both world-frame (absolute) and end-effector-frame (relative) action representations, allowing compatibility with a wide range of robotic systems and control interfaces. Unlike prior equivariant methods that often suffer from slow training due to heavy architectures, our method achieves SE(3)-equivariance with a training speed that is even faster than a baseline Diffusion Policy, but with significantly improved performance.

This paper makes the following contributions:

• We introduce an encoder that expresses an image as a set of SE(3)-equivariant geometric tokens. Tokens from different, arbitrarily placed cameras can be seamlessly combined to produce a single, consistent representation of the world.

• Based on this encoder, we propose RAVEN, the first end-to-end SE(3)-equivariant robotic policy learning framework that can operate directly on RGB image inputs.

• We characterize our method both in simulation and on a physical robot. It outperforms the strongest baseline by 12% over 12 MimicGen tasks, 17% over 6 DexMimicGen tasks, and 35% over 4 real-world tasks. Finally, while prior equivariant models are often slow to train, our method trains approximately 1.6× faster than the previous equivariant diffusion method.
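
To make the ray-based view of images concrete, the following is a minimal NumPy sketch, not the RAVEN implementation. It assumes a standard pinhole camera with known intrinsics and a known camera-to-world pose; the helper names (`rotation`, `pixel_rays`, `transform_rays`, `pose`, `relative_action`) and all numeric values are hypothetical. It checks two geometric facts behind the text above: rays through fixed pixels transform under a 3D roto-translation of the camera (origins are rotated and translated, directions are rotated), and an action expressed relative to the end-effector frame is unchanged when the whole scene is transformed, while a world-frame action moves with it.

```python
import numpy as np

def rotation(axis, angle):
    """Rotation matrix about `axis` by `angle` radians (Rodrigues' formula)."""
    x, y, z = np.asarray(axis, dtype=float) / np.linalg.norm(axis)
    S = np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])
    return np.eye(3) + np.sin(angle) * S + (1.0 - np.cos(angle)) * (S @ S)

def pixel_rays(K, R_wc, t_wc, pixels):
    """World-frame rays through pixels of a pinhole camera (assumed model).

    K: (3, 3) intrinsics; R_wc, t_wc: camera-to-world rotation and translation;
    pixels: (N, 2) array of (u, v). Returns (origins, unit directions).
    """
    uv1 = np.concatenate([pixels, np.ones((len(pixels), 1))], axis=1)
    dirs_cam = uv1 @ np.linalg.inv(K).T      # back-project into the camera frame
    dirs_world = dirs_cam @ R_wc.T           # rotate into the world frame
    dirs_world /= np.linalg.norm(dirs_world, axis=1, keepdims=True)
    origins = np.broadcast_to(t_wc, dirs_world.shape).copy()
    return origins, dirs_world

def transform_rays(R, t, origins, dirs):
    """Apply a roto-translation (R, t): origins move, directions only rotate."""
    return origins @ R.T + t, dirs @ R.T

def pose(R, t):
    """Pack a rotation and translation into a 4x4 homogeneous transform."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def relative_action(T_ee, T_target):
    """Target pose expressed in the current end-effector frame."""
    return np.linalg.inv(T_ee) @ T_target

# Hypothetical camera, pixels, and global transform g = (R, t).
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R_wc, t_wc = rotation([1.0, 2.0, 0.5], 0.7), np.array([0.5, -0.2, 1.0])
pixels = np.array([[320.0, 240.0], [100.0, 50.0], [600.0, 400.0]])
R, t = rotation([0.3, -1.0, 2.0], 1.1), np.array([0.1, 0.4, -0.3])

# Rays from the moved camera equal the transformed rays of the original camera.
o, d = pixel_rays(K, R_wc, t_wc, pixels)
o_g, d_g = pixel_rays(K, R @ R_wc, R @ t_wc + t, pixels)
o_t, d_t = transform_rays(R, t, o, d)
rays_equivariant = np.allclose(o_g, o_t) and np.allclose(d_g, d_t)

# Absolute targets move with g; the end-effector-relative action does not.
G = pose(R, t)
T_ee = pose(rotation([0.0, 0.0, 1.0], 0.2), np.array([0.3, 0.0, 0.4]))
T_target = pose(rotation([0.0, 1.0, 0.0], -0.5), np.array([0.4, 0.1, 0.2]))
rel = relative_action(T_ee, T_target)
rel_g = relative_action(G @ T_ee, G @ T_target)
relative_invariant = np.allclose(rel, rel_g)
```

In this sketch the pixel coordinates themselves never change; only the ray origins and directions attached to them do, which is the sense in which rays, unlike 2D pixels, carry a well-defined action of 3D roto-translations.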