ICML 2025

CollabLLM: From Passive Responders to Active Collaborators

Shirley Wu, Michel Galley, Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, James Zou, Jure Leskovec, Jianfeng Gao

Abstract

Website: aka.ms/CollabLLM

[Figure 1 diagram. Labels: ① context state (x); ② response y from policy π_θ; ③ collaborative simulation with a real-world or simulated user (forward sampling, reward computation); ④ multiturn-aware reward, with extrinsic reward (e.g., performance) and intrinsic reward (interactivity, efficiency); online generation; RL fine-tuning. Example exchange: "I need to write about how optimism can improve our well-being." / "To get us started, what kind of tone are you aiming for?"]

Figure 1: COLLABLLM Framework. Given a context ①, the model generates a response ② to maximize long-term collaboration gains, termed Multiturn-aware Rewards (MR). During training, MRs are estimated via ③ collaborative simulation, which forward-samples conversations with simulated users. Finally, ④ reinforcement fine-tuning is applied using the MRs.
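To make the estimation step in Figure 1 concrete, the following is a minimal sketch of how a multiturn-aware reward for one candidate response might be approximated by forward sampling. It is an illustration under stated assumptions, not the paper's implementation: the function `estimate_multiturn_reward`, its parameters (`num_samples`, `horizon`, `seed`), and the toy stand-ins `toy_user`, `toy_model`, and `toy_score` are all hypothetical names, and the scoring rule is a made-up proxy for the extrinsic and intrinsic terms named in the figure.

```python
import random
from typing import Callable, List

# A conversation is a list of turns, oldest first. This sketch assumes the
# context starts with a user turn and that user and model turns alternate.
Conversation = List[str]


def estimate_multiturn_reward(
    context: Conversation,
    response: str,
    simulate_user: Callable[[Conversation, random.Random], str],
    generate_reply: Callable[[Conversation, random.Random], str],
    score_conversation: Callable[[Conversation], float],
    num_samples: int = 3,   # hypothetical: number of sampled futures
    horizon: int = 2,       # hypothetical: simulated user/model exchanges
    seed: int = 0,
) -> float:
    """Average a conversation-level score over forward-sampled futures."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        # Start every rollout from the same context plus candidate response.
        conversation = list(context) + [response]
        for _ in range(horizon):
            conversation.append(simulate_user(conversation, rng))
            conversation.append(generate_reply(conversation, rng))
        total += score_conversation(conversation)
    return total / num_samples


def toy_user(conversation: Conversation, rng: random.Random) -> str:
    # Hypothetical simulated user: states a tone preference at random.
    return rng.choice(["A formal tone, please.", "Keep it casual."])


def toy_model(conversation: Conversation, rng: random.Random) -> str:
    # Hypothetical policy used for the later turns of each rollout.
    return "Here is a draft in that tone."


def toy_score(conversation: Conversation) -> float:
    # Made-up proxy: reward any model turn that asks a question
    # (interactivity) and charge a small cost per word (efficiency).
    # With a one-turn context, model turns sit at odd indices.
    asked_question = any("?" in turn for turn in conversation[1::2])
    word_count = sum(len(turn.split()) for turn in conversation)
    return (1.0 if asked_question else 0.0) - 0.01 * word_count


context = ["I need to write about how optimism can improve our well-being."]
active = estimate_multiturn_reward(
    context, "To get us started, what kind of tone are you aiming for?",
    toy_user, toy_model, toy_score)
passive = estimate_multiturn_reward(
    context, "Here is an essay about optimism.",
    toy_user, toy_model, toy_score)
print(active > passive)
```

Under this toy scorer, the clarifying question receives the higher estimate because the score is computed over the whole sampled conversation rather than the single response. In a training loop of the kind the caption describes, such estimates would serve as the rewards for reinforcement fine-tuning; that step is not sketched here.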