ICML 2025

CollabLLM: From Passive Responders to Active Collaborators

Shirley Wu, Michel Galley, Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, James Zou, Jure Leskovec, Jianfeng Gao

Publisher

Abstract

Website: aka.ms/CollabLLM

[Figure 1 diagram. Labels: ① Context state (𝒙); ② Response 𝒚 from policy 𝝅_𝜽; ③ Collaborative Simulation (forward sampling of conversations #1, #2, #3 with a real-world or simulated user, then reward computation); ④ Multiturn-aware Reward on (𝒙, 𝒚), combining extrinsic reward (e.g., performance) and intrinsic reward (interactivity, efficiency). Stages: online generation, RL fine-tuning. Example dialogue: "I need to write about how optimism can improve our well-being." / "To get us started, what kind of tone are you aiming for?"]

Figure 1: CollabLLM framework. Given a context ①, the model generates a response ② to maximize long-term collaboration gains, termed Multiturn-aware Rewards (MR). During training, MRs are estimated via ③ collaborative simulation, which forward-samples conversations with simulated users. Finally, ④ reinforcement fine-tuning is applied using the MRs.
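The estimation step in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `estimate_mr`, `toy_user_turn`, and `toy_reward`, the parameters `num_samples` and `horizon`, and the 0.001 length penalty are all hypothetical, and a rule-based stand-in replaces the simulated user. The sketch assumes only what the caption states: a response is scored by forward-sampling future conversation turns and averaging a reward over the sampled conversations.

```python
import random
from statistics import mean
from typing import Callable, List

Conversation = List[str]  # turns in order, oldest first


def estimate_mr(
    context: Conversation,
    response: str,
    simulate_turn: Callable[[Conversation, random.Random], str],
    reward_fn: Callable[[Conversation], float],
    num_samples: int = 3,   # hypothetical: number of sampled futures
    horizon: int = 1,       # hypothetical: future turns per sample
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of a multiturn-aware reward for `response`."""
    rng = random.Random(seed)
    rewards = []
    for _ in range(num_samples):
        # Start each rollout from the context plus the candidate response.
        conversation = list(context) + [response]
        # Forward-sample future turns with the simulated user.
        for _ in range(horizon):
            conversation.append(simulate_turn(conversation, rng))
        # Score the whole sampled conversation, not the single response.
        rewards.append(reward_fn(conversation))
    return mean(rewards)


def toy_user_turn(conversation: Conversation, rng: random.Random) -> str:
    """Rule-based stand-in for a simulated user (assumption, not an LLM)."""
    if conversation[-1].strip().endswith("?"):
        return rng.choice(["An upbeat, hopeful tone.", "A calm, reflective tone."])
    return rng.choice(["That is not quite what I wanted.", "Can you change it?"])


def toy_reward(conversation: Conversation) -> float:
    """Toy extrinsic term minus a toy efficiency penalty (both assumptions)."""
    extrinsic = 1.0 if "tone" in conversation[-1] else 0.0
    length_penalty = 0.001 * sum(len(turn) for turn in conversation)
    return extrinsic - length_penalty


context = ["I need to write about how optimism can improve our well-being."]
active = "To get us started, what kind of tone are you aiming for?"
passive = "Here is a generic essay about optimism."

mr_active = estimate_mr(context, active, toy_user_turn, toy_reward)
mr_passive = estimate_mr(context, passive, toy_user_turn, toy_reward)
print(mr_active > mr_passive)
```

In this toy setup the clarifying question scores higher than the passive answer because the sampled user replies supply the missing detail. In the framework described by the caption, such estimates would then serve as the reward signal for reinforcement fine-tuning.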