ICLR 2026

$\pi^3$: Permutation-Equivariant Visual Geometry Learning

Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, Tong He

Cited by 148

Abstract

We introduce $\pi^3$, a feed-forward neural network that offers a novel approach to visual geometry reconstruction, breaking the reliance on a conventional fixed reference view. Previous methods often anchor their reconstructions to a designated viewpoint, an inductive bias that can lead to instability and failures if the reference is suboptimal. In contrast, $\pi^3$ employs a fully permutation-equivariant architecture to predict affine-invariant camera poses and scale-invariant local point maps without any reference frames. This design not only makes our model inherently robust to input ordering, but also leads to higher accuracy and performance. These advantages enable our simple and bias-free approach to achieve state-of-the-art performance on a wide range of tasks, including camera pose estimation, monocular/video depth estimation, and dense point map reconstruction. Code and models are available at Pi3.

INTRODUCTION

Visual geometry reconstruction, a long-standing and fundamental problem in computer vision, holds substantial potential for applications such as augmented reality (Engel et al., 2023), robotics (Zhu et al., 2024), and autonomous navigation (Mur-Artal et al., 2015). While traditional methods addressed this challenge using iterative optimization techniques like Bundle Adjustment (BA) (Hartley & Zisserman, 2003), the field has recently seen remarkable progress with feed-forward neural networks. End-to-end models like DUSt3R (Wang et al., 2024) and its successors have demonstrated the power of deep learning for reconstructing geometry from image pairs (