ICML 2025
Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
Philippe Hansen-Estruch, David Yan, Ching-Yao Chuang, Orr Zohar, Jialiang Wang, Tingbo Hou, Tao Xu, Sriram Vishwanath, Peter Vajda, Xinlei Chen
Abstract
Visual tokenization through auto-encoding enhances state-of-the-art image and video generative models by compressing pixels into a latent space. However, questions persist about how the design of the auto-encoder affects both reconstruction and downstream generative performance. This paper investigates the impact of scaling auto-encoders for reconstruction and generation by replacing the convolutional backbone with an enhanced Vision Transformer for Tokenization (ViTok). The results show that scaling the auto-encoder bottleneck correlates with improved reconstruction, though its relationship with generative performance is more complex. In contrast, scaling the encoder does not lead to gains, while scaling the decoder improves reconstruction with minimal effect on generation. These findings indicate that scaling the existing auto-encoder paradigm does not significantly improve generative performance. When paired with Diffusion Transformers, ViTok achieves competitive image reconstruction and generation performance on 256p and 512p ImageNet-1K. For videos, ViTok achieves state-of-the-art reconstruction and generation performance on 128p UCF-101.