CVPR 2025

Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better

Zihang Lai, Andrea Vedaldi

Abstract

Figure 1. Left: The Tracktention Layer is a plug-and-play module that can convert an image-based network (e.g., for monocular depth prediction) into a state-of-the-art video network (e.g., for video depth prediction). It does so by integrating the output of any modern, off-the-shelf point tracker via track cross-attention. Right: For example, Tracktention achieves efficient, state-of-the-art video depth prediction by transforming Depth Anything into a video depth model. See Tab. 2 for detailed results. * Single-image models.