ICLR2026

DispViT: Direct Stereo Disparity Regression with a Single-Stream Vision Transformer

Tongfan Guan, Jiaxin Guo, Tianyu Huang, Jinhu Dong, Chen Wang, Yun-Hui Liu

摘要

Deep stereo disparity estimation has long been dominated by a matching-centric paradigm, built on constructing cost volumes and iteratively refining local correspondences. Despite its success, this paradigm exhibits an intrinsic vulnerability: visual ambiguities from occlusion or non-Lambertian surfaces invevitably induce errorneous matches that refinement cannot recover. This paper introduces DispViT, a new architecture that establishes a regression-centric paradigm. Instead of explicit matching, DispViT directly regresses disparity from tokenized binocular representations using a single-stream Vision Transformer. This is enabled by a set of lightweight yet critical designs, such as a probability-based disparity parameterization for stable training and an asymmetrically initialized stereo tokenizer for effective view distinction. To better align the two views during stereo tokenization, we introduce a novel shift-embedding mechanism that encodes different disparity shifts into channel groups, preserving geometric cues even under large view displacements. A lightweight refinement module then sharpens the regressed disparity map for fine-grained accuracy. By prioritizing holistic regression over explicit matching, DispViT streamlines the stereo pipeline while improving robustness and efficiency. Experiments on standard benchmarks show that our approach achieves state-of-the-art accuracy, with strong resilience to matching ambiguities and wide disparity ranges. Code will be released.