ICLR 2026

TTT3R: 3D Reconstruction as Test-Time Training

Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, Anpei Chen

Cited by 85

Abstract

have long been the foundation for 3D structure reconstruction and camera pose estimation. These methods rely on associating 2D correspondences [4, 25, 47, 50, 63] or minimizing reprojected photometric errors [29, 30], followed by bundle adjustment (BA) [1, 12, 77, 79, 80, 86] for structure and motion refinement. Although highly effective when assembled into comprehensive systems [50, 66], these approaches often struggle under small camera parallax or in ill-posed conditions (e.g., dynamic or textureless scenes), leading to degraded performance. Recent work, such as MegaSaM [43] and VIPE [35], has demonstrated progress in adapting traditional SLAM paradigms to dynamic scenes by integrating semantic segmentation [35, 39], optical flow [35, 43, 104, 105], and geometric constraints [35, 39, 43, 48, 104]. Concurrently, methods like VGGT-SLAM [49] and seek improved robustness by integrating learned front-ends [51, 87, 91]. However, these methods require iterative optimization on top of off-the-shelf estimates, where synchronization barriers often lead to cumulative errors and high computational overhead. This reliance hinders real-time online inference and learning scalability (e.g., the 'tabula rasa' blank slate limitation [89]). In this work, we investigate data-driven feed-forward models with generalizable priors to enable dense 3D reconstruction even from dynamic and textureless video sequences.