NeurIPS 2025
Incentivizing Dual Process Thinking for Efficient Large Language Model Reasoning
Xiaoxue Cheng, Junyi Li, Zhenduo Zhang, Xinyu Tang, Xin Zhao, Xinyu Kong, Zhiqiang Zhang
Cited by 23
Abstract
Large reasoning models (LRMs) have demonstrated strong performance on complex reasoning tasks, but often suffer from overthinking, generating redundant content regardless of task difficulty. Inspired by the dual process theory in cognitive science, we propose Adaptive Cognition Policy Optimization (ACPO), a reinforcement learning framework that enables LRMs to achieve efficient reasoning through adaptive cognitive allocation and dynamic system switching. ACPO incorporates two key components: (1) system-aware reasoning tokens that explicitly represent the thinking modes, thereby making the model's cognitive process transparent, and (2) online difficulty estimation combined with a token length budget to guide adaptive system switching and reasoning during reinforcement learning. To this end, we propose a two-stage training strategy. The first stage uses supervised fine-tuning to cold-start the model, enabling it to generate reasoning paths with explicit thinking modes. In the second stage, we apply ACPO to further enhance adaptive system switching for difficulty-aware reasoning. Experimental results demonstrate that ACPO effectively reduces redundant reasoning while adaptively adjusting cognitive allocation based on task complexity, achieving efficient hybrid reasoning.
Recent advances in large reasoning models (LRMs) [1] have demonstrated remarkable success on complex tasks such as mathematical reasoning [2, 3, 4, 5], largely attributed to reinforcement learning that encourages the generation of detailed, step-by-step reasoning processes. LRMs improve answer accuracy through self-reflection and self-verification during long reasoning paths. As the reasoning length increases, model performance tends to improve accordingly [2, 6, 7]. Although the long chain-of-thought (CoT) [8] reasoning in LRMs is effective for solving complex problems, it often leads to overthinking [9, 10], producing redundant reasoning paths.
Most existing LRMs rely on fixed reasoning strategies, lacking the ability to dynamically switch between different thinking modes based on task complexity. This rigidity results in inefficient inference, particularly for simple problems that could be resolved more effectively with concise and direct reasoning. Several recent efforts have explored long CoT compression for efficient reasoning [11, 12]. One line of work fine-tunes LRMs using supervision from shorter chain-of-thought exemplars [13, 14], encouraging the model to arrive at correct answers with fewer intermediate steps. Another line introduces length penalties into reinforcement learning reward functions [3, 15, 16, 17], explicitly
* Equal Contribution.