ICML 2025
CollabLLM: From Passive Responders to Active Collaborators
Shirley Wu, Michel Galley, Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, James Zou, Jure Leskovec, Jianfeng Gao
Abstract
Website: aka.ms/CollabLLM
[Figure 1 diagram. Labeled elements: (1) context state, e.g., the user message "I need to write about how optimism can improve our well-being."; (2) response, e.g., "To get us started, what kind of tone are you aiming for?"; (3) collaborative simulation with a real-world or simulated user policy (forward sampling, reward computation); (4) Multiturn-aware Reward, combining extrinsic reward (e.g., performance) and intrinsic reward (interactivity, efficiency), used for online generation and RL fine-tuning.]
Figure 1: COLLABLLM Framework: Given a context (1), the model generates a response (2) to maximize long-term collaboration gains, termed Multiturn-aware Rewards (MR). During training, MRs are estimated via (3) collaborative simulation, which forward-samples conversations with simulated users. Finally, (4) reinforcement fine-tuning is applied using the MRs.
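To make the caption's estimation step concrete, the following is a minimal sketch of how a Multiturn-aware Reward could be approximated by forward sampling, not the authors' implementation. Every name here (`estimate_mr`, `toy_user`, `toy_model`, `toy_reward`) is hypothetical, the simulated user and model are hard-coded stand-ins for LLM calls, and the reward (a task-success proxy minus a small word-count penalty) is an assumed placeholder for the extrinsic and intrinsic terms shown in the figure.

```python
import random
from statistics import mean
from typing import Callable, List

Conversation = List[str]  # alternating user / model turns, oldest first


def estimate_mr(
    context: Conversation,
    response: str,
    simulate_user: Callable[[Conversation, random.Random], str],
    respond: Callable[[Conversation, random.Random], str],
    reward: Callable[[Conversation], float],
    num_samples: int = 3,
    horizon: int = 2,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate: average conversation-level reward over rollouts."""
    rng = random.Random(seed)
    returns = []
    for _ in range(num_samples):
        convo = list(context) + [response]
        # Forward-sample `horizon` future exchanges with the simulated user.
        for _ in range(horizon):
            convo.append(simulate_user(convo, rng))
            convo.append(respond(convo, rng))
        returns.append(reward(convo))
    return mean(returns)


# --- Toy stand-ins (assumptions; a real setup would call LLMs and use rng) ---
def toy_user(convo: Conversation, rng: random.Random) -> str:
    # The user reveals the desired tone only when the model asked a question.
    if convo[-1].endswith("?"):
        return "An upbeat, personal tone."
    return "That is not quite what I wanted."


def toy_model(convo: Conversation, rng: random.Random) -> str:
    # The model can match the tone only once it has appeared in the dialogue.
    if any("upbeat" in turn.lower() for turn in convo):
        return "Here is an upbeat, personal draft."
    return "Here is another generic draft."


def toy_reward(convo: Conversation) -> float:
    # Assumed extrinsic term: does the final draft match the user's intent?
    performance = 1.0 if "upbeat" in convo[-1].lower() else 0.0
    # Assumed intrinsic term: small efficiency penalty per word exchanged.
    cost = 0.001 * sum(len(turn.split()) for turn in convo)
    return performance - cost


context = ["I need to write about how optimism can improve our well-being."]
ask = "To get us started, what kind of tone are you aiming for?"
direct = "Here is a generic essay on optimism."

mr_ask = estimate_mr(context, ask, toy_user, toy_model, toy_reward)
mr_direct = estimate_mr(context, direct, toy_user, toy_model, toy_reward)
print(f"MR(clarifying question) = {mr_ask:.3f}")
print(f"MR(direct answer)       = {mr_direct:.3f}")
```

In this toy setting the clarifying question scores higher than the direct answer because its benefit only shows up in later turns, which is the kind of long-term gain a single-turn reward would miss. In the framework described by the caption, such estimates would then serve as the reward signal for reinforcement fine-tuning.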