ICLR 2025
JetFormer: An autoregressive generative model of raw images and text
Michael Tschannen, André Susano Pinto, Alexander Kolesnikov
Abstract
Removing modeling constraints and unifying architectures across domains has been a key driver of the recent progress in training large multimodal models. However, most of these models still rely on many separately trained components, such as modality-specific encoders and decoders, which can limit performance on certain tasks. For example, general-purpose (VQ-)VAEs for images can limit generalization to fine-grained dense prediction tasks because their latent representation is lossy.