WWW2026
Reinforcement Learning with Verbalized Probabilities for LLM Classification
Liyao Li, Hao Chen, Jiaming Tian, Wentao Ye, Lirong Gao, Chao Ye, Ningtao Wang, Xing Fu, Yu Cheng, Haobo Wang, Gang Chen, Junbo Zhao
Abstract
While Large Language Models (LLMs) excel at many reasoning tasks, they cannot natively produce calibrated multi-class probability distributions, which limits their use in high-stakes Web applications such as content moderation and fraud detection. Existing methods for eliciting probabilities from LLMs either sacrifice their crucial Chain-of-Thought (CoT) reasoning capabilities or suffer from poor calibration. To address this, we introduce a new paradigm, Verbalized Probability Distribution, and a novel training framework, RLVP (Reinforcement Learning with Verbalized Probabilities). RLVP fine-tunes an LLM to generate both an interpretable CoT and a complete, verbalized probability distribution. We overcome the "insufficient reward granularity" problem in standard Reinforcement Learning (RL) for classification by using soft probabilities from expert tabular models as a dense reward curriculum. Through large-scale joint training on 169 tabular tasks, we demonstrate that a single RLVP-trained model can surpass a strong, task-specific XGBoost baseline on up to 55% of tasks. More importantly, the trained model achieves state-of-the-art few-shot performance on unseen, heterogeneous Web benchmarks that mix structured data with free text, matching or exceeding expert models trained on the same limited data. This demonstrates strong generalization and knowledge transfer to complex Web data. Our work presents a viable path toward building general-purpose, probabilistically sound, and interpretable foundation models for the Web.
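To make the reward-granularity point concrete, the following is a minimal sketch of how a verbalized distribution might be scored against a teacher's soft probabilities. The abstract does not specify the output format or the reward function, so everything here is an assumption made for illustration: the `label: probability` text format, the function names, and the choice of one minus total variation distance as the dense reward.

```python
import re


def parse_verbalized_distribution(text, labels):
    """Extract 'label: probability' pairs from model output and renormalize.

    The 'label: probability' format is a hypothetical convention.
    Returns None if a label is missing or the probabilities sum to zero.
    """
    probs = {}
    for label in labels:
        pattern = rf"\b{re.escape(label)}\s*:\s*([0-9]*\.?[0-9]+)"
        match = re.search(pattern, text)
        if match is None:
            return None
        probs[label] = float(match.group(1))
    total = sum(probs.values())
    if total <= 0:
        return None
    return {label: p / total for label, p in probs.items()}


def sparse_reward(predicted, true_label):
    """Binary reward: 1.0 if the most probable label is correct, else 0.0."""
    if predicted is None:
        return 0.0
    return 1.0 if max(predicted, key=predicted.get) == true_label else 0.0


def dense_reward(predicted, teacher):
    """Assumed dense reward: 1 minus total variation distance to the teacher."""
    if predicted is None:
        return 0.0
    tv = 0.5 * sum(abs(predicted[k] - teacher[k]) for k in teacher)
    return 1.0 - tv


labels = ["fraud", "legit"]
# Stand-in for soft probabilities from an expert tabular model.
teacher = {"fraud": 0.7, "legit": 0.3}

overconfident = parse_verbalized_distribution("fraud: 0.9, legit: 0.1", labels)
closer = parse_verbalized_distribution("fraud: 0.6, legit: 0.4", labels)

for name, dist in [("overconfident", overconfident), ("closer", closer)]:
    print(name, sparse_reward(dist, "fraud"), round(dense_reward(dist, teacher), 3))
```

Both outputs pick the correct label, so the binary reward cannot tell them apart, while the distance-based reward favors the one nearer to the teacher's distribution. Other distances (for example, KL divergence or a Brier-style score) would serve the same illustrative purpose.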