ICML 2025
Meta-Reinforcement Learning with Adaptation from Human Feedback via Preference-Order-Preserving Task Embedding
Siyuan Xu, Minghui Zhu
Abstract
This paper studies meta-reinforcement learning with adaptation from human feedback. The goal is to pre-train a meta-model that achieves few-shot adaptation to new tasks from human preference queries, without relying on reward signals. To solve this problem, we propose Adaptation via Preference-Order-preserving EMbedding (POEM). During meta-training, the framework learns a task encoder, which maps tasks to a preference-order-preserving task embedding space, and a decoder, which maps the embeddings to task-specific policies. During adaptation from human feedback, the task encoder enables efficient inference of a new task's embedding from preference queries, from which the task-specific policy is then obtained. We provide a theoretical guarantee that the adaptation process converges to the task-specific optimal policy, and we experimentally demonstrate state-of-the-art performance with substantial improvement over baseline methods.
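As a rough illustration of inferring a task embedding from preference queries, the toy sketch below fits an embedding to pairwise answers. It is not the POEM algorithm: it assumes, purely for illustration, that each trajectory has a fixed feature vector, that a task prefers the trajectory with the larger inner product with its embedding, and that answers follow a Bradley-Terry (logistic) model. The names `infer_task_embedding`, `z_true`, and the synthetic queries are all hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def infer_task_embedding(queries, dim, steps=300, lr=0.5):
    """Gradient ascent on the Bradley-Terry log-likelihood (an assumed model).

    queries: list of (feat_a, feat_b, label), label = 1.0 if a is preferred.
    Returns an embedding z whose scores z . feat try to reproduce the answers.
    """
    z = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for feat_a, feat_b, label in queries:
            diff = [a - b for a, b in zip(feat_a, feat_b)]
            p = sigmoid(dot(z, diff))  # model's P(a preferred over b)
            for k in range(dim):
                grad[k] += (label - p) * diff[k]
        z = [zk + lr * gk / len(queries) for zk, gk in zip(z, grad)]
    return z


# Synthetic "human" whose preferences come from a hidden embedding (assumption).
random.seed(0)
dim = 3
z_true = [1.0, -2.0, 0.5]
queries = []
for _ in range(200):
    fa = [random.gauss(0.0, 1.0) for _ in range(dim)]
    fb = [random.gauss(0.0, 1.0) for _ in range(dim)]
    label = 1.0 if dot(z_true, fa) > dot(z_true, fb) else 0.0
    queries.append((fa, fb, label))

z_hat = infer_task_embedding(queries, dim)

# Fraction of queries whose preference order the inferred embedding preserves.
agreement = sum(
    (dot(z_hat, fa) > dot(z_hat, fb)) == (label == 1.0)
    for fa, fb, label in queries
) / len(queries)
cosine = dot(z_hat, z_true) / math.sqrt(dot(z_hat, z_hat) * dot(z_true, z_true))
print("agreement:", agreement, "cosine:", cosine)
```

In this toy setting only the direction of the embedding is identifiable from noiseless comparisons, so the check compares preference order and cosine similarity rather than the raw vector. A learned encoder and decoder, as described in the abstract, would replace the fixed features and the linear score.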