ICLR 2026

Task Tokens: A Flexible Approach to Adapting Behavior Foundation Models

Ron Vainshtein, Zohar Rimon, Shie Mannor, Chen Tessler

Abstract

Recent advancements in imitation learning for robotic control have led to transformer-based behavior foundation models (BFMs) that enable multi-modal, human-like control for humanoid agents. These models generate solutions when conditioned on high-level goals or prompts, for example, walking to a coordinate when conditioned on the position of the robot's pelvis. While BFMs excel at zero-shot generation of robust behaviors, they often require meticulous prompt engineering for specific tasks, potentially yielding suboptimal results. In this work, we introduce "Task Tokens", a method to effectively tailor BFMs to specific tasks while preserving their flexibility. Our approach integrates naturally within the transformer architecture of BFMs. Task Tokens trains a task-specific encoder (tokenizer), with the original BFM remaining untouched. Our method reduces trainable parameters per task by up to 125× and converges up to 6× faster compared to standard baselines. In addition, by keeping the original BFM unchanged, Task Tokens enables utilizing the pre-existing encoders. This allows incorporating user-defined priors, balancing reward design and prompt engineering. We demonstrate Task Tokens' efficacy across various tasks, including out-of-distribution scenarios, and show their compatibility with other prompting modalities. Our results suggest that Task Tokens offer a promising approach for adapting BFMs to specific control tasks while retaining their generalization capabilities.

Recent advances in imitation learning have facilitated the emergence of behavior foundation models (BFMs) designed for humanoid control (Peng et al., 2022; Won et al., 2022; Luo et al., 2024a; Tessler et al., 2024). These models generate diverse behaviors when trained on large-scale human demonstration data. In this work, we focus on a specific type of BFM, which we call Goal-Conditioned Behavior Foundation Models (GC-BFMs).
Methods such as Masked Trajectory Models and MaskedMimic fall into this category (Wu et al., 2023; Tessler et al., 2024). These methods use transformer architectures that process sequences of tokenized goals: high-level objectives such as "follow a path" or "reach with your right hand towards the object" are mapped to embedding tokens, which condition the model's behavior generation. Specifically, we focus on MaskedMimic, which has proven to be a particularly effective framework, demonstrating robust zero-shot generalization (the ability to handle new, unseen tasks without additional training) through its token-based goal-conditioning mechanism.

For real-world usage, BFMs must be flexible enough to solve a variety of tasks while remaining specialized enough to solve complex ones effectively. Despite MaskedMimic's proficiency in generating diverse motions from high-level goals, significant challenges persist in defining precise goal specifications, or prompts, for complex tasks. Typically, an environment-specific reward can be designed, but reward design is error-prone in complex, long-horizon tasks. In contrast, GC-BFMs provide a "prompt-engineering" interface in which the user specifies high-level goals; this can result in more stable motion but may be less intuitive for some tasks. Consider a game character tasked with walking to an object and striking it. Even in this simple task, both approaches fall short: with reward design, a commonly emerging error is that the character walks backward to the goal, while with prompting, it is hard to specify high-level goals that make the striking motion precisely hit the target. This