CVPR 2025
Advancing Semantic Future Prediction through Multimodal Visual Sequence Transformers
Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, Nikos Komodakis
Abstract
5 IACM-Forth
Figure 1 (panels: Inputs, Predictions, Oracle). Our framework predicts future semantic segmentation and depth maps using a multimodal transformer architecture. Leveraging masked visual modeling and cross-modal fusion, it excels in future semantic prediction, achieving state-of-the-art results in both tasks.