ICLR 2025

JetFormer: An autoregressive generative model of raw images and text

Michael Tschannen, André Susano Pinto, Alexander Kolesnikov

Abstract

Removing modeling constraints and unifying architectures across domains has been a key driver of recent progress in training large multimodal models. However, most of these models still rely on many separately trained components, such as modality-specific encoders and decoders, which can limit performance on certain tasks. For example, general-purpose (VQ-)VAEs for images can limit generalization to fine-grained dense prediction tasks because their latent representation is lossy.