ICLR 2026

TTT3R: 3D Reconstruction as Test-Time Training

Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, Anpei Chen


Abstract

Classical structure-from-motion (SfM) and SLAM pipelines have long been the foundation for 3D structure reconstruction and camera pose estimation. These methods rely on associating 2D correspondences [4, 25, 47, 50, 63] or minimizing reprojected photometric errors [29, 30], followed by bundle adjustment (BA) [1, 12, 77, 79, 80, 86] for structure and motion refinement. Although highly effective when assembled into comprehensive systems [50, 66], these approaches often degrade under small camera parallax or in ill-posed settings (e.g., dynamic or textureless scenes). Recent work, such as MegaSaM [43] and VIPE [35], has made progress in adapting traditional SLAM paradigms to dynamic scenes by integrating semantic segmentation [35, 39], optical flow [35, 43, 104, 105], and geometric constraints [35, 39, 43, 48, 104]. Concurrently, methods like VGGT-SLAM [49] seek improved robustness by integrating learned front-ends [51, 87, 91]. However, these methods still require iterative optimization on top of off-the-shelf estimates, where synchronization barriers often lead to cumulative errors and high computational overhead. This reliance hinders real-time online inference and learning scalability (e.g., the 'tabula rasa' blank slate limitation [89]). In this work, we investigate data-driven feed-forward models with generalizable priors to enable dense 3D reconstruction even from dynamic and textureless video sequences.