ICLR 2025
Controlling Language and Diffusion Models by Transporting Activations
Pau Rodríguez, Arno Blaas, Michal Klein, Luca Zappella, Nicholas Apostoloff, Marco Cuturi, Xavier Suau
Abstract
The increasing capabilities of large generative models and their ever more widespread deployment have raised concerns about their reliability, safety, and potential misuse. To address these issues, recent works have proposed to control model generation by steering model activations in order to effectively induce or prevent the emergence of concepts or behaviors in the generated output. In this paper, we introduce Activation Transport (ACT), a general framework to steer activations guided by optimal transport theory that generalizes many previous activation-steering works. ACT is modality-agnostic and provides fine-grained control over the model behavior with negligible computational overhead, while minimally impacting model abilities. We experimentally show the effectiveness and versatility of our approach by addressing key challenges in large language models (LLMs) and text-to-image diffusion models (T2Is). For LLMs, we show that ACT can effectively mitigate toxicity, induce arbitrary concepts, and increase their truthfulness. In T2Is, we show how ACT enables fine-grained style control and concept negation.

[Figure: example generations]
Once upon a time, there was an old man who lived in the forest. He had no family and he spent his days alone collecting mushrooms for food to survive on.
0.5: Once upon a time, there was an amazing woman named Sarah. She had the most beautiful smile and kindest heart you could ever imagine! Sarah loved to play soccer with her friends on Saturday mornings at 9am sharp every week.
1: Once upon a time, the only way to watch football was on TV. The game of soccer had been played in England since 1863 and by the early twentieth century it became one of Britain's most popular sports.

* Equal contribution.

Published as a conference paper at ICLR 2025

M.6 STYLE PROMPTS

Table 15: List of tags generated with Llama-8B-instruct (right) to induce different styles (left).
Anime: anime style, large expressive eyes, stylized hair, bold outlines, simplified colors, dynamic perspective, exaggerated features, angular shapes, chibis, manga inspired, emotive facial expressions, action sequences, speed lines, cell shading, graphic backgrounds, vibrant palettes
Art nouveau: Art Nouveau, Alphonse Mucha, Gustav Klimt, flowing lines, organic shapes, floral motifs, geometric patterns, ornamental designs, Jugendstil, Secessionism, symbolism, female figures, gold leaf, intricate details, turn of the century art, early 20th century
Impressionism: impressionism, Claude Monet, brush strokes, light, color, outdoor scenes, water lilies, haystacks, Rouen Cathedral, reflections, nature, atmospheric, vibrant colors, visible textures, 19th century art, French impressionism
Cyberpunk: cyberpunk, neon lights, urban jungles, high-tech architecture, augmented reality, AI technology, biopunk, futuristic cities, post-apocalyptic scenes, digital hacking, megacorporations, androids, dystopian societies, cybernetic enhancements, chromed details, glowing neon signs, rain-soaked streets