ICLR2025

CodePlan: Unlocking Reasoning Potential in Large Language Models by Scaling Code-form Planning

Jiaxin Wen, Jian Guan, Hongning Wang, Wei Wu, Minlie Huang

Abstract

Despite the remarkable success of large language models (LLMs) on traditional natural language processing tasks, their planning ability remains a critical bottleneck in tackling complex multi-step reasoning tasks. Existing approaches mainly rely on prompting or task-specific fine-tuning, and often suffer from poor robustness and cross-task generalization. To address this limitation, we introduce CODEPLAN, a scalable framework that empowers LLMs to generate and follow code-form plans, i.e., pseudocode that outlines high-level, structured reasoning processes. By leveraging the structured and versatile nature of code, CODEPLAN effectively captures the rich semantics and control flows inherent to sophisticated reasoning tasks. Importantly, CODEPLAN allows automatic extraction of code-form plans from massive, wide-ranging text corpora without the need for curated, task-specific datasets. This enables it to scale up efficiently and improve LLMs' reasoning capabilities across diverse scenarios. To train CODEPLAN, we construct a large-scale dataset of 2M examples that integrate code-form plans with standard prompt-response pairs from existing corpora. With minimal computation overhead during both training and inference, CODEPLAN achieves a 25.1% relative improvement compared with directly generating responses, averaged across 13 challenging multi-step reasoning benchmarks spanning mathematical reasoning, symbolic reasoning, instruction-following, multi-hop QA, and decision-making tasks. Further analysis reveals CODEPLAN's increasing performance gains on more complex reasoning tasks, as well as significant data efficiency thanks to its generalization ability.

* Equal Contribution † Corresponding Authors

To train CODEPLAN, we construct a large-scale dataset with 2M examples in the form of ⟨prompt, code-form plan, response⟩. We validate the effectiveness of CODEPLAN in multiple models, including Mistral (Jiang et al., 2023) and Llama (Touvron et al., 2023; Dubey et al., 2024).
Extensive experiments show that CODEPLAN consistently and significantly outperforms directly generating responses without planning, yielding a 25.1% relative performance gain averaged across 13 challenging reasoning benchmarks spanning mathematical reasoning, symbolic reasoning, instruction-following, multi-hop question answering, and decision-making tasks. These results provide compelling evidence for the models' enhanced ability to tackle complex multi-step reasoning problems. By leveraging code-form plans as an intermediate representation during training, we pioneer a scalable framework for endowing LLMs with structured, versatile, and interpretable reasoning, a capability that has remained elusive when relying solely on natural language.

In summary, this work makes several pivotal contributions:

I. We introduce CODEPLAN, a novel, scalable framework that empowers LLMs to generate and follow code-form plans, i.e., pseudocode that outlines high-level, structured reasoning processes. This framework unlocks new frontiers for structured reasoning with LLMs, transcending the limitations imposed by the obscured implicit planning signals in natural language text.

II. CODEPLAN allows efficient and cost-effective training data construction from massive, wide-ranging corpora, enabling promising data scalability. We exemplify this by curating a large-scale dataset comprising 2M prompt-response pairs along with their corresponding code-form plans. This dataset also establishes a rich resource for future research on reasoning in LLMs.

III. We demonstrate CODEPLAN's remarkable efficacy and generality across 13 challenging reasoning benchmarks on multiple backbone models, scaling from 7B to 13B. Further analysis reveals its growing advantage over baselines as problem complexity increases, and its strong data efficiency.
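To make the ⟨prompt, code-form plan, response⟩ format concrete, the following is a minimal sketch of how one such training example might be represented and serialized. The class name, the helper functions, and the `<plan>` / `</plan>` delimiters are assumptions introduced for illustration only; the text above does not specify the actual serialization.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical delimiters; the actual plan markers are not given in the text.
PLAN_OPEN, PLAN_CLOSE = "<plan>", "</plan>"


@dataclass(frozen=True)
class PlanExample:
    """One <prompt, code-form plan, response> training triple."""
    prompt: str
    plan: str       # pseudocode outlining the high-level reasoning steps
    response: str   # final natural-language answer that follows the plan


def to_training_text(ex: PlanExample) -> str:
    """Lay out the triple so the plan is produced before the response."""
    return f"{ex.prompt}\n{PLAN_OPEN}\n{ex.plan}\n{PLAN_CLOSE}\n{ex.response}"


def split_generation(text: str) -> Tuple[str, str]:
    """Recover (plan, response) from text that uses the layout above."""
    head, _, tail = text.partition(PLAN_CLOSE)
    plan = head.partition(PLAN_OPEN)[2].strip()
    return plan, tail.strip()


example = PlanExample(
    prompt="What is 3 + 4 * 2?",
    plan="def solve():\n    product = 4 * 2\n    return 3 + product",
    response="4 * 2 = 8, and 3 + 8 = 11.",
)
serialized = to_training_text(example)
plan, response = split_generation(serialized)
```

Placing the plan between the prompt and the response means that a model trained on such text would emit the plan first and then condition its response on it, which is one plausible reading of "generate and follow code-form plans".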
METHODOLOGY

We formally define the multi-step reasoning task as follows: Given a prompt X that poses a problem, the goal is to generate a response Y that requires a comprehensive solution through a sequence of log-

[Figure residue: a code-form plan step, put(found_mug, "coffeemachine"), alongside an example decision-making trajectory. Act: think: To solve the task, I need to find and take a mug, then cool it with the fridge, then put it in the coffeemachine. Obs: OK.]
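The household example above (find and take a mug, cool it with the fridge, put it in the coffeemachine) suggests what a code-form plan for a decision-making task could look like. The sketch below is illustrative only: the `ToyHousehold` class and its `find`, `take`, `cool`, and `put` methods are hypothetical stand-ins that merely record actions, not the paper's plan format or any real environment API.

```python
from typing import List


class ToyHousehold:
    """Hypothetical stand-in environment that only records actions as text."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def find(self, obj: str) -> str:
        self.log.append(f"find {obj}")
        return obj

    def take(self, obj: str) -> str:
        self.log.append(f"take {obj}")
        return obj

    def cool(self, obj: str, appliance: str) -> None:
        self.log.append(f"cool {obj} with {appliance}")

    def put(self, obj: str, receptacle: str) -> None:
        self.log.append(f"put {obj} in {receptacle}")


def plan_cool_mug(env: ToyHousehold) -> List[str]:
    """Code-form plan: each line corresponds to one high-level step."""
    found_mug = env.take(env.find("mug"))   # step 1: find and take a mug
    env.cool(found_mug, "fridge")           # step 2: cool it with the fridge
    env.put(found_mug, "coffeemachine")     # step 3: place it in the coffeemachine
    return env.log


steps = plan_cool_mug(ToyHousehold())
```

Writing the plan as code makes the order of steps and the object passed between them (`found_mug`) explicit, whereas the natural-language "think" step leaves both implicit.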