EMNLP 2023

Text Rendering Strategies for Pixel Language Models

Jonas F. Lotz, Elizabeth Salesky, Phillip Rust, Desmond Elliott

Cited by 3

Abstract

Pixel-based language models process text rendered as images, which allows them to handle any script, making them a promising approach to open vocabulary language modelling. However, recent approaches use text renderers that produce a large set of almost-equivalent input patches, which may prove sub-optimal for downstream tasks, due to redundancy in the input representations. In this paper, we investigate four approaches to rendering text in the PIXEL model (Rust et al., 2023), and find that simple character bigram rendering brings improved performance on sentence-level tasks without compromising performance on token-level or multilingual tasks. This new rendering strategy also makes it possible to train a more compact model with only 22M parameters that performs on par with the original 86M parameter model. Our analyses show that character bigram rendering leads to a consistently better model but with an anisotropic patch embedding space, driven by a patch frequency bias, highlighting the connections between image patch- and tokenization-based language models.
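As a rough illustration of the idea behind character bigram rendering, the sketch below splits a string into consecutive two-character chunks, each of which would be drawn into its own image patch. This is a minimal sketch and not the paper's renderer: the function name `char_bigrams` is hypothetical, the chunking ignores word boundaries (the abstract does not say how they are handled), and the step that actually draws each chunk as pixels is omitted.

```python
from collections import Counter


def char_bigrams(text: str) -> list[str]:
    """Split text into consecutive, non-overlapping two-character chunks.

    Hypothetical helper for illustration only. A trailing single
    character is kept as its own, shorter chunk.
    """
    return [text[i:i + 2] for i in range(0, len(text), 2)]


# Each chunk stands in for the content of one image patch.
chunks = char_bigrams("pixel models")

# Counting repeated chunks hints at why a fixed chunking can yield a
# smaller, more frequently reused set of distinct patches.
patch_counts = Counter(chunks)
```

Because the same pair of characters always maps to the same chunk under this scheme, identical chunks would produce identical patches, which is one way to picture the reduced input redundancy the abstract describes.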