ICLR 2026
Neologism Learning for Controllability and Self-Verbalization
John Hewitt, Oyvind Tafjord, Robert Geirhos, Been Kim
Cited by 5
Abstract
Humans invent new words when there is a rising demand for a new useful concept (e.g., doomscrolling). We explore and validate a similar idea in our communication with LLMs: introducing new words to better understand and control the models, expanding on the recently introduced neologism learning. This method introduces a new word by adding a new word embedding and training it on examples that exhibit the concept, with no other changes to the model parameters. We show that adding a new word allows for control of concepts such as flattery, incorrect answers, and text length, as well as more complex concepts in AxBench. We discover that neologisms can also further our understanding of the model via self-verbalization: models can describe in natural language what each new word means to them, for example explaining that a word representing the concept of incorrect answers means "a lack of complete, coherent, or meaningful answers..." To validate self-verbalizations, we introduce plug-in evaluation: we insert the verbalization into the context of a model and measure whether it controls the target concept. In some self-verbalizations, we find machine-only synonyms: words that seem unrelated to humans but cause similar behavior in machines. Finally, we show how neologism learning can jointly learn multiple concepts in multiple words.
INTRODUCTION
Language model alignment can be framed as a problem of communicating human values to machines and of understanding machine concepts, such as their interpretations of our values. Considerable (mechanistic) interpretability research aims to build tools, such as sparse autoencoders (Cunningham et al., 2023), steering vectors (Zou et al., 2023; Turner et al., 2023), and probes (Alain & Bengio, 2016; Burns et al., 2023), for more precisely discovering machine concepts or communicating human concepts (steering). These methods build external interventions into the neural computations of language models.
In contrast, when humans try to communicate more effectively with each other, they develop new language: new words that reference complex concepts. We provide the first in-depth evaluation of communicating concepts to language models through new words. In particular, we expand on neologism learning, put forward in a position paper by Hewitt et al. (2025). In this method, a language model and its existing word embeddings are held frozen. New words are introduced, each with a new word embedding. These new words are placed in natural language text, and their embeddings are trained to minimize a loss on a set of examples that exemplify a concept.
Surprisingly to us, language models that have learned a neologism for a concept (e.g., responses that are intentionally incorrect) are able to self-verbalize the neologism: that is, they can provide English meta-descriptions of what the neologism does. For example, Gemma-3-4B-IT self-verbalizes this incorrect-response neologism as causing responses characterized by the following, despite not being trained on descriptions of the neologism's intended behavior:
"neologism answers are characterized by a lack of complete, coherent, or meaningful answers. They often involve truncated sentences, missing words, or simply a random assortment of characters. They're like a digital shrug, a refusal to engage fully with the question. Basically, they're just... there." [1]
[1] The new word embedding for neologism is initialized to a neutral word not related to correctness.
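The training setup described above (frozen model, frozen existing embeddings, one new trainable embedding) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a stand-in "model" that is just a fixed output matrix followed by a softmax, and every name in it (`train_neologism`, `embed_table`, `out_weights`, `concept_targets`) is hypothetical. The only trained parameter is the new word's vector, initialized as a copy of an existing word's embedding, in the spirit of the neutral-word initialization mentioned in the footnote.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_probs(embedding, out_weights):
    """Frozen toy model (an assumption): logits[v] = out_weights[v] . embedding."""
    logits = [sum(w * x for w, x in zip(row, embedding)) for row in out_weights]
    return softmax(logits)


def mean_loss(embedding, out_weights, targets):
    """Mean cross-entropy of the target tokens given the word embedding."""
    probs = next_token_probs(embedding, out_weights)
    return -sum(math.log(probs[t]) for t in targets) / len(targets)


def train_neologism(embed_table, out_weights, init_token, targets,
                    steps=300, lr=0.1):
    """Gradient descent on ONE new vector; all other parameters are only read."""
    new_vec = list(embed_table[init_token])  # copy, so the table stays frozen
    vocab_size = len(out_weights)
    # Empirical frequency of each target token in the concept examples.
    freq = [0.0] * vocab_size
    for t in targets:
        freq[t] += 1.0 / len(targets)
    for _ in range(steps):
        probs = next_token_probs(new_vec, out_weights)
        # Gradient of mean cross-entropy w.r.t. logits is (probs - freq).
        dlogits = [p - f for p, f in zip(probs, freq)]
        # Chain rule through the frozen linear layer: grad = W^T dlogits.
        grad = [sum(dlogits[v] * out_weights[v][j] for v in range(vocab_size))
                for j in range(len(new_vec))]
        new_vec = [x - lr * g for x, g in zip(new_vec, grad)]
    return new_vec


random.seed(0)
VOCAB, DIM = 6, 4
# Frozen parameters of the toy model (random stand-ins, not real weights).
embed_table = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB)]
out_weights = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB)]
frozen_snapshot = ([row[:] for row in embed_table],
                   [row[:] for row in out_weights])

# Hypothetical "concept" data: tokens that should follow the new word.
concept_targets = [2, 2, 5, 2]

loss_before = mean_loss(embed_table[0], out_weights, concept_targets)
neologism_vec = train_neologism(embed_table, out_weights,
                                init_token=0, targets=concept_targets)
loss_after = mean_loss(neologism_vec, out_weights, concept_targets)
print("loss before:", loss_before)
print("loss after: ", loss_after)
```

In a real language model the same idea would presumably be realized by appending one row to the embedding matrix and passing gradients only to that row. The sketch shows the essential property: the loss on the concept examples drops while `embed_table` and `out_weights` are left unchanged.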