ACL 2024

PIXAR: Auto-Regressive Language Modeling in Pixel Space

Yintao Tai, Xiyang Liao, Alessandro Suglia, Antonio Vergari

Abstract

Recent work showed the possibility of building open-vocabulary large language models (LLMs) that directly operate on pixel representations. These models are implemented as autoencoders that reconstruct masked patches of rendered text. However, these pixel-based LLMs are limited to discriminative tasks (e.g., classification) and, similar to BERT, cannot be used to generate text. Therefore, they cannot be used for generative tasks such as freeform question answering. In this work, we introduce PIXAR, the first pixel-based autoregressive LLM that performs text generation. Consisting of only a decoder, PIXAR can perform free-form generative tasks while keeping the number of parameters on par with previous encoder-decoder models. Furthermore, we highlight the challenges of generating text as non-noisy images and show this is due to using a maximum likelihood objective. To overcome this problem, we propose an adversarial pretraining stage that improves the readability and accuracy of PIXAR by 8.1 on LAMBADA and 8.5 on bAbI, making it comparable to GPT-2 on text generation tasks. This paves the way to build open-vocabulary LLMs that operate on perceptual input only and calls into question the necessity of the usual symbolic input representation, i.e., text as (sub)tokens.

language models (LLMs) allocate millions of parameters just for this.¹ Additionally, fixing a vocabulary a priori can lead to performance degradation due to unseen out-of-vocabulary (OOV) words (Kaddour et al., 2023). Tokenizers with smaller granularities, such as characters and bytes, can alleviate the OOV issue but are still brittle as they can suffer from orthographic attacks (Eger et al., 2020). On the other hand, humans are incredibly robust to a variety of text permutations (Rayner et al., 2006) because they leverage the graphical information in text (Sun et al., 2021).

To tackle these problems, Rust et al.
(2023) proposed PIXEL, a pixel-based LLM that treats text as images. Pixel-based embeddings remove the need for a finite vocabulary and keep the visual information of text, questioning whether we need symbolic representations of text as input at all, or if an LLM can learn symbols implicitly. PIXEL achieved comparable performance with BERT (Devlin et al., 2019) in a range of downstream classification and regression NLP tasks while being robust to character-level visual attacks (Eger et al., 2020). However, because of its close architectural similarities with BERT, PIXEL cannot deal with free-form generative tasks, such as generative question answering (Lawrence et al., 2019).

To fill this gap, we present PIXAR², the first pixel-based autoregressive LLM that can generate short sequences of text as images. PIXAR is to GPT-like architectures as PIXEL is to BERT-like architectures: it consists of a Transformer decoder (Radford et al., 2019) that autoregressively generates text image patches as output. Generating new text as pixels starting from pixels only is, however, more challenging than selecting symbolic tokens from a vocabulary (as GPT-like models) or reconstructing masked image patches (as in PIXEL). This is because the model has to learn to generate longer sequences of pixels. To this end, we introduce a two-stage pretraining strategy for PIXAR. First, following previous work on autoregressive LLMs (Radford et al., 2019) and image generation models (Chen et al., 2020a), PIXAR is trained by reconstructing the next patch of pixels derived from a large-scale corpus of rendered text using teacher-forcing. This maximum-likelihood approach, however, can generate image patches containing noisy text.
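The first pretraining stage described above (predict the next patch while conditioning on the gold patches seen so far) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the toy 2-pixel patches, and the use of mean squared error as the reconstruction loss (which corresponds to maximum likelihood under a fixed-variance Gaussian). The `copy_last_patch` predictor is a hypothetical stand-in for the Transformer decoder, not the PIXAR model.

```python
from typing import Callable, List, Tuple

Patch = List[float]  # one flattened image patch, pixel values in [0, 1]


def teacher_forcing_pairs(patches: List[Patch]) -> List[Tuple[List[Patch], Patch]]:
    """Pair every gold prefix with the gold patch that follows it.

    Under teacher forcing the model always sees the ground-truth prefix,
    never its own earlier predictions.
    """
    return [(patches[:t], patches[t]) for t in range(1, len(patches))]


def mse(pred: Patch, target: Patch) -> float:
    """Mean squared error between a predicted and a gold patch (assumed loss)."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def next_patch_loss(
    patches: List[Patch],
    predict: Callable[[List[Patch]], Patch],
) -> float:
    """Average next-patch reconstruction loss over one rendered sequence."""
    pairs = teacher_forcing_pairs(patches)
    return sum(mse(predict(prefix), target) for prefix, target in pairs) / len(pairs)


def copy_last_patch(prefix: List[Patch]) -> Patch:
    """Toy stand-in for the decoder: guess that the next patch repeats the last one."""
    return list(prefix[-1])


# Toy "rendered text" strip: 4 patches of 2 pixels each (hypothetical data).
strip = [[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
loss = next_patch_loss(strip, copy_last_patch)
print(loss)
```

A regression loss of this kind is minimised by averaging over plausible next patches, which is one intuition for why purely likelihood-based training can yield blurry or noisy glyphs.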
To mitigate this problem, we propose a second pretraining stage, where PIXAR is trained with an additional adversarial loss.

Our experiments in Section 4 show that 200 steps of stage 2 pretraining improve the readability of generated text significantly and achieve comparable performance with GPT-2 (Radford et al., 2019) on open-answer short generative tasks such as bAbI (Weston et al., 2015) and LAMBADA (Paperno et al., 2016). Additionally, PIXAR achieves better performance than PIXEL on the GLUE benchmark (Wang et al., 2018) while using a computational budget and a number of model parameters equivalent to the encoder part of PIXEL.

2 Beyond token-based LLMs

The idea of using pixel-based representations of text has been applied in various NLP tasks. For instance, Liu et al. (2017) used a CNN-based block to extract character-level visual representations of Chinese writing and, similarly, Sun et al. (2018) did so for text classification. Graphical features of Chinese

² The code is submitted with this submission.

conditioned on a sequence of observed (gold) patches