EMNLP 2024

Towards Aligning Language Models with Textual Feedback

Saüc Abadal Lloret, Shehzaad Dhuliawala, Keerthiram Murugesan, Mrinmaya Sachan

Abstract

We present ALT (ALignment with Textual feedback), an approach that aligns language models with user preferences expressed in text. We argue that text is more expressive than simple comparative preferences: it lets users give richer feedback, which leads to more efficient and effective alignment. ALT aligns the model by conditioning its generations on the textual feedback. Our method relies solely on language modeling techniques and requires minimal hyperparameter tuning, while retaining the main benefits of RL-based alignment algorithms. We demonstrate the efficacy and efficiency of textual feedback across different tasks, including toxicity reduction, summarization, and dialogue response generation. Notably, ALT outperforms PPO in toxicity reduction and matches its performance on summarization with only 20% of the samples. We also explore using ALT with feedback from an existing LLM, examining both constrained and unconstrained feedback. Finally, we outline future directions for aligning models with natural language feedback.
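The central idea, conditioning generations on textual feedback and training with an ordinary language-modeling objective, can be illustrated with a minimal sketch of how feedback-conditioned training examples might be formatted. The `Sample` record, the prompt template, and the helper names below are hypothetical and are not taken from the paper. Whitespace splitting stands in for a real tokenizer. The sketch covers only data preparation, not the full ALT training loop.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    prompt: str
    generation: str
    feedback: str  # textual feedback on `generation`, e.g. from a user or an LLM


# Hypothetical template: the exact format used by ALT is an assumption here.
TEMPLATE = "Feedback: {feedback}\nInput: {prompt}\nOutput: "


def to_training_pair(sample: Sample) -> tuple[str, str]:
    """Return (context, target). The model is trained with a standard
    language-modeling loss on `target`, conditioned on `context`,
    which places the feedback before the input."""
    context = TEMPLATE.format(feedback=sample.feedback, prompt=sample.prompt)
    return context, sample.generation


def inference_prompt(prompt: str, desired_feedback: str) -> str:
    """At test time, condition on feedback that describes the desired behavior."""
    return TEMPLATE.format(feedback=desired_feedback, prompt=prompt)


def loss_mask(context: str, target: str) -> list[int]:
    """1 marks tokens that contribute to the loss (the generation only).
    Whitespace tokenization is a toy stand-in for a real tokenizer."""
    return [0] * len(context.split()) + [1] * len(target.split())


if __name__ == "__main__":
    s = Sample("Write a reply.", "Thanks for asking!", "Polite and helpful.")
    ctx, tgt = to_training_pair(s)
    print(repr(ctx))
    print(loss_mask(ctx, tgt))
    print(repr(inference_prompt("Write a reply.", "Polite and helpful.")))
```

In this sketch, the same template is used for training and inference, so at test time the model can be steered by supplying the feedback it should "earn". The loss mask restricts the objective to the generated text and leaves the feedback and prompt as conditioning context only.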