ICLR 2026

Quantized Visual Geometry Grounded Transformer

Weilun Feng, Haotong Qin, Mingqiang Wu, Chuanguang Yang, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu

Cited by 10


Abstract

Published as a conference paper at ICLR 2026

[...] while maintaining reconstruction accuracy above 98% of its full-precision counterpart. This demonstrates the vast advantages and practicality of QuantVGGT in resource-constrained scenarios. Our code is released at https://github.com/wlfeng0509/QuantVGGT.

Introduction

Recent advances in learning-based 3D reconstruction have demonstrated unprecedented capabilities in recovering dense geometry and camera trajectories directly from image sequences. Traditional approaches (Mur-Artal et al., 2015; Mur-Artal & Tardós, 2017; Schonberger & Frahm, 2016; Hartley & Zisserman, 2003) are grounded in geometric priors and optimization, but their reliance on handcrafted design choices and iterative solvers often leads to limited scalability and reduced robustness in complex scenes. In contrast, large-scale deep models have shifted the paradigm toward data-driven frameworks, offering remarkable generalization ability across diverse environments (Wang et al., 2025b; Yang et al., 2025). A milestone in this evolution is the Visual Geometry Grounded Transformer (VGGT) (Wang et al., 2025a). This 1.2B-parameter model unifies multiple 3D tasks, including dense depth estimation, point map regression, camera pose prediction, and point tracking, within a single forward pass, consistently surpassing task-specialized counterparts. Despite its success, the billion-scale parameterization of VGGT incurs prohibitive computational and memory costs, severely restricting its deployment in real-world scenarios.

Model quantization (Gholami et al., 2022; Jacob et al., 2018) is an effective compression technique that converts a model's weights and activations from high-precision floating-point values to low-precision integers.
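As a minimal illustration of the float-to-integer conversion described above, the following sketch applies symmetric per-tensor uniform quantization to a small list of weights. The function names, the 8-bit setting, and the example values are assumptions chosen for illustration; they are not taken from QuantVGGT.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers using one shared scale (symmetric, per-tensor).

    Illustrative helper, not part of any library or of QuantVGGT.
    """
    qmax = 2 ** (num_bits - 1) - 1              # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step, then clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to approximate floats."""
    return [qi * scale for qi in q]


weights = [0.02, -0.5, 0.31, 1.27, -0.08]       # made-up example values
q, scale = quantize_symmetric(weights, num_bits=8)
recovered = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, recovered))
```

Because a single scale is set by the largest magnitude, one extreme value stretches the step size for every other entry, which is why outliers and heavy tails hurt this kind of scheme.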
While this technique has been widely validated in large language models (Frantar et al., 2022; Xiao et al., 2023) and 2D vision models (Yuan et al., 2022; Wu et al., 2024), the quantization of billion-scale 3D reconstruction transformers such as VGGT remains largely unexplored. In our study, we identify two model-specific properties of VGGT that make its quantization particularly challenging:

❶ The presence of data-independent special tokens (camera and register tokens). Unlike regular image tokens that are encoded from input images, these tokens are pretrained and injected into image tokens to encode global context and cross-view geometry. This data-independent property causes activation distributions to deviate from typical patterns, amplifying heavy tails and producing extreme channel and token variance. These skewed statistics are unfriendly to standard quantization, leading to substantial information loss.

❷ The inherent semantic complexity of 3D data. Each input sequence involves non-identical and complex views, meaning that the underlying semantic space is both high-dimensional and highly redundant. For quantization calibration, the calibration set should ideally capture the dominant data distribution. If the calibration samples are rare outliers or lack diversity, the estimated quantization ranges become biased and fail to generalize, causing performance degradation on unseen scenes. Thus, sample diversity and representativeness are far more critical than in 2D vision tasks.

To address these challenges, we present the first systematic investigation of Post-Training Quantization (PTQ) for VGGT and propose a tailored framework, QuantVGGT. Our approach introduces Dual-Smoothed Fine-Grained Quantization (DSFQ), which mitigates skewed statistics by combining (1) a pre-global rotation via Hadamard transforms to disperse outliers and smooth heavy-tailed distributions, and (2) a post-local smoothing step that normalizes channel-level variance in the rotated space.
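The two DSFQ steps named above can be sketched on a toy linear layer. This is an illustration under stated assumptions, not the authors' implementation: the rotation uses an orthonormal Walsh-Hadamard transform H (so that X W = (X H)(H W)), and the local smoothing uses a SmoothQuant-style per-channel scale, which is an assumed stand-in for whatever rule the paper uses. All function names and the example matrices are hypothetical.

```python
import math


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform of a list whose length is a power of two."""
    n = len(vec)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    out = list(vec)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return [v / math.sqrt(n) for v in out]


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rotate_and_smooth(x, w, eps=1e-8):
    """x: tokens x channels, w: channels x outputs. Returns (x_s, w_s) with x_s @ w_s == x @ w."""
    # Step 1 (global rotation): H is symmetric and orthonormal, so X W = (X H)(H W).
    x_rot = [fwht(row) for row in x]
    w_rot = [list(r) for r in zip(*[fwht(col) for col in zip(*w)])]
    # Step 2 (local smoothing, assumed SmoothQuant-style): balance per-channel ranges.
    scales = []
    for j in range(len(w_rot)):
        x_max = max(abs(row[j]) for row in x_rot)
        w_max = max(abs(v) for v in w_rot[j])
        scales.append(math.sqrt(max(x_max, eps) / max(w_max, eps)))
    x_s = [[v / s for v, s in zip(row, scales)] for row in x_rot]
    w_s = [[v * s for v in row] for row, s in zip(w_rot, scales)]
    return x_s, w_s


# Made-up activations with one outlier channel (index 2) and a small weight matrix.
x = [[0.1, -0.2, 8.0, 0.05], [0.3, 0.1, -7.5, -0.1]]
w = [[0.2, -0.1], [0.05, 0.3], [-0.15, 0.1], [0.25, 0.02]]

x_s, w_s = rotate_and_smooth(x, w)
reference = matmul(x, w)
transformed = matmul(x_s, w_s)
peak_before = max(abs(v) for row in x for v in row)
peak_after_rotation = max(abs(v) for row in x for v in fwht(row))
```

In this toy case the rotation spreads the outlier channel across all channels, lowering the peak activation magnitude, while the layer output is unchanged up to floating-point error. The smoothing step then equalizes the per-channel ranges of activations and weights before any quantizer is applied.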
Additionally, to overcome calibration instability, we design Noise-Filtered Diverse Sampling (NFDS), which leverages deep-layer activation statistics to filter noisy extremes and employs frame-aware clustering aligned with VGGT's inductive biases. Together, these components yield robust, efficient, and accurate quantization of billion-scale 3D reconstruction transformers. Our contributions are summarized as follows:

1. We provide the first systematic analysis of PTQ on VGGT, highlighting quantization challenges rooted in its data-independent tokens and multi-view activation statistics.
2. We propose a dual-stage smoothing scheme that globally disperses heavy-tailed distributions and locally balances channel variance, significantly reducing quantization errors.
3. We design a calibration strategy that filters outliers and utilizes VGGT's inductive bias to construct frame-aware clusters, ensuring a representative and stable calibration set.
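The filter-then-cluster idea behind the calibration strategy can be sketched as follows. This section does not specify how NFDS computes its noise statistic or its frame-aware features, so the sketch treats both as given inputs: `scores` stands in for a per-sample deep-layer activation statistic and `features` for a per-sample descriptor. The z-score filter, the plain k-means clustering, and all names and values are assumptions made for illustration only.

```python
import math
import statistics


def filter_noisy(indices, scores, z_max=2.0):
    """Keep the sample indices whose score lies within z_max standard deviations of the mean."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    if std == 0:
        return list(indices)
    return [i for i in indices if abs(scores[i] - mean) / std <= z_max]


def kmeans(points, k, iters=20):
    """Small deterministic k-means: farthest-point initialisation, then Lloyd updates."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    # Final assignment against the last centroids.
    return [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]


def pick_calibration_set(features, scores, k, per_cluster):
    """Drop noisy samples, cluster the rest, and take a few samples from every cluster."""
    kept = filter_noisy(range(len(features)), scores)
    assign = kmeans([features[i] for i in kept], k)
    chosen = []
    for j in range(k):
        members = [i for i, a in zip(kept, assign) if a == j]
        chosen.extend(members[:per_cluster])
    return chosen


# Made-up data: two groups of similar samples plus one extreme sample (index 6).
features = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
            [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],
            [50.0, -40.0]]
scores = [1.0, 1.1, 0.9, 1.2, 1.0, 1.1, 30.0]

chosen = pick_calibration_set(features, scores, k=2, per_cluster=1)
```

Sampling from every cluster rather than uniformly at random is what keeps the calibration set diverse, and removing extreme samples first prevents a single outlier from forming its own cluster and skewing the estimated quantization ranges.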