ICLR 2026

Using Reinforcement Learning to Train Large Language Models to Explain Human Decisions

Jian-Qiao Zhu, Hanbo Xie, Dilip Arumugam, Robert Wilson, Thomas L. Griffiths

Cited by 6

Abstract

A central goal of cognitive modeling is to develop models that not only predict human behavior but also provide insight into the underlying cognitive mechanisms. While neural network models trained on large-scale behavioral data often achieve strong predictive performance, they typically fall short in offering interpretable explanations of the cognitive processes they capture. In this work, we explore the potential of pretrained large language models (LLMs) to serve as dual-purpose cognitive models, capable of both accurate prediction and interpretable explanation in natural language. Specifically, we employ reinforcement learning with outcome-based rewards to guide LLMs toward generating explicit reasoning traces that explain human risky choices. Our findings demonstrate that this approach produces high-quality explanations at scale alongside strong quantitative predictions of human decisions.
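
To make the idea of an outcome-based reward concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a two-option risky choice task, a hypothetical `<answer>A</answer>` output format, and a binary reward. The function name `outcome_reward`, the tag format, and the option labels are all assumptions introduced here for illustration. The key property it shows is that only the final predicted choice is scored against the human's choice, while the reasoning trace itself is left unconstrained.

```python
import re

# Hypothetical output format: the model writes free-form reasoning and ends
# with its predicted choice inside <answer>...</answer>. This format is an
# assumption for illustration, not something stated in the abstract.
ANSWER_PATTERN = re.compile(r"<answer>\s*([AB])\s*</answer>")


def outcome_reward(completion: str, human_choice: str) -> float:
    """Score a completion only by its outcome.

    Returns 1.0 if the predicted option matches the human's choice,
    and 0.0 otherwise (including when no answer can be parsed).
    The reasoning text before the answer is not evaluated.
    """
    match = ANSWER_PATTERN.search(completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1) == human_choice else 0.0


if __name__ == "__main__":
    completion = "The sure gain feels safer than the gamble. <answer>A</answer>"
    print(outcome_reward(completion, "A"))  # 1.0
    print(outcome_reward(completion, "B"))  # 0.0
```

Under these assumptions, a reinforcement learning procedure would sample reasoning traces, score each with a function of this kind, and update the model toward traces whose conclusions match observed human choices.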