ICLR 2026

RL of Thoughts: Navigating LLM Reasoning with Inference-time Reinforcement Learning

Qianyue Hao, Sibo Li, Jian Yuan, Yong Li

15 citations

Abstract

Despite rapid advances in large language models (LLMs), their token-level autoregressive nature constrains their complex reasoning capabilities. To enhance LLM reasoning, inference-time techniques such as Chain/Tree/Graph-of-Thought(s) improve performance cost-effectively by guiding reasoning through external logical structures without modifying the LLMs' parameters. However, these manually predefined, task-agnostic frameworks are applied uniformly across diverse tasks and lack adaptability. To address this, we propose RL-of-Thoughts (RLoT), in which we train a lightweight navigator model with reinforcement learning (RL) to generate task-adaptive logical structures at inference time, enhancing LLM reasoning. Specifically, we design five basic logic blocks from the perspective of human cognition. During reasoning, the trained RL navigator dynamically selects suitable logic blocks and combines them into task-specific logical structures according to problem characteristics. Experiments across multiple reasoning benchmarks (AIME, MATH, GPQA, etc.) with multiple LLMs (GPT, Llama, Qwen, and DeepSeek) show that RLoT outperforms established inference-time techniques in most cases, with improvements of up to 13.4% in challenging situations. Remarkably, with fewer than 3K parameters, our RL navigator makes sub-10B LLMs comparable to 100B-scale counterparts. Moreover, the RL navigator demonstrates strong transferability: a model trained on one specific LLM-task pair generalizes effectively to unseen LLMs and tasks. Our code is open-source at https://github.com/tsinghua-fib-lab/RL-LLM-Reasoning .

INTRODUCTION

Recent years have witnessed unprecedented advances in large language models (LLMs), which have achieved remarkable success across diverse natural language tasks (Chang et al., 2024), including translation (Xu et al., 2024), semantic analysis (Lan et al., 2024b;a), and information retrieval (Hao et al., 2024).
Despite these advancements, the inherent token-level autoregressive nature of LLMs poses a significant limitation for complex reasoning tasks (Zhao et al., 2023), such as solving mathematical problems (Ahn et al., 2024) or answering intricate questions (Zhuang et al., 2023). These tasks require sophisticated logical structures and long-term dependencies that go beyond the scope of simple sequential token prediction, leaving a considerable gap between current LLM capabilities and the demands of advanced reasoning applications.

Extensive research has been devoted to enhancing LLM reasoning. On the one hand, fine-tuning approaches attain substantial improvements over pretrained LLMs (Zhong et al., 2024; DeepSeek-AI et al., 2025; Team et al., 2025). However, these methods demand massive computational resources and large-scale datasets, making them costly to implement. On the other hand, inference-time techniques, exemplified by Chain-of-Thought (Wei et al., 2022), Tree-of-Thoughts (Yao et al., 2023), and Graph-of-Thoughts (Besta et al., 2024), offer a lightweight alternative by enhancing reasoning through predefined external logical structures. While cost-effective, their logical structures rely on manual design and are task-agnostic, lacking adaptability to diverse reasoning tasks.

Addressing such limitations in inference-time techniques presents significant challenges. First, reasoning tasks span various domains, including mathematics, STEM, commonsense, etc., where