ICLR 2026

Enhancing Language Model Reasoning with Structured Multi-Level Modeling

Siheng Xiong, Ali Payani, Faramarz Fekri

Abstract

Inference-time scaling enhances a model's reasoning by extending its chain-of-thought (CoT). However, existing approaches typically rely on a single policy trained with outcome-reward reinforcement learning (RL), which often suffers from long-horizon plan failures where the implicit plan drifts from valid strategies, especially for small LMs with limited capacity. To address this, we propose Multi-Level Reasoning (MLR), which reformulates long-CoT generation as a two-level stochastic process. A high-level planner generates structured step descriptors specifying both the reasoning mode and the semantic subgoal. The low-level executor then produces detailed reasoning conditioned on these descriptors, forming an alternating plan-execute loop. To maintain scalability, we adopt a minimal design where the base model serves as the low-level policy and a lightweight LoRA module implements the high-level policy. For training, we observe that outcome-reward RL provides sparse and delayed feedback for long trajectories (e.g., several thousand tokens), hindering credit assignment. We therefore introduce iterative Step-DPO, a process-level preference optimization scheme that leverages Twisted Sequential Monte Carlo (TSMC) to provide scalable stepwise supervision. This yields more effective training, improved stability, and higher accuracy. Extensive experiments on challenging math, science, and logical reasoning benchmarks show that, under the same reduced data budget (10% SFT and 5% preference relative to the DeepSeek-R1 distillation setup), MLR outperforms both SFT-based distillation and strong RL/preference-optimization baselines across multiple base models and tasks. Moreover, MLR exhibits slower performance degradation on long-horizon reasoning, demonstrating stronger robustness under extended CoT generation.
Published as a conference paper at ICLR 2026

• We perform extensive experiments on challenging benchmarks in math, science, and logical reasoning. Results show that our approach consistently outperforms both SFT-based distillation and strong RL/preference-optimization baselines under the same data budget.

INFERENCE-TIME SCALING VIA LONG CHAIN-OF-THOUGHT

Formulation. Given a query $q$, a reasoning model generates a CoT $c$ before producing the final response $a$, where $q$, $c$, and $a$ are all sequences of tokens, i.e., $c = (c[1], c[2], \dots, c[|c|])$. To improve performance, these models allocate more computation to reasoning, often realized as generating a longer $c$ with behaviors such as exploration, self-verification, and reflection. The generation of long CoTs follows standard autoregressive modeling: the probability of each token $c[l]$ depends on its preceding tokens $c[1:l-1]$, which enables the factorization of the joint likelihood of the entire sequence as:

$$p_\theta(c) = \prod_{l=1}^{|c|} p_\theta\big(c[l] \mid c[1:l-1]\big) \tag{1}$$

Note that, for notational simplicity, we omit the conditioning on $q$ in Eq. 1 and in the following derivations. Training the model $p_\theta$ involves maximizing the likelihood of each token conditioned on its prefix, i.e., optimizing $p_\theta(c[l] \mid c[1:l-1])$ over the training data.

Post-training. Guo et al. (2025) detail how they incentivize long-CoT generation from a base model through large-scale RL without relying on SFT. Specifically, they employ GRPO guided by a rule-based outcome reward. For each query $q$, GRPO samples a group of outputs $o_1, o_2, \dots, o_G$ from the old policy $\pi_{\theta_{\text{old}}}$, where each output is composed of a CoT followed by the final response, i.e., $o_i = [c_i, a_i]$, and then optimizes the policy $\pi_\theta$ by maximizing the corresponding objective.

Discussion on the weakness of single-policy long CoT.
The above approach of using single-policy long CoT enables inference-time scaling with LMs, but introduces several issues:

[Figure 5: Overview of the architecture. The model alternates between generating high-level descriptors and the corresponding low-level content. Ablations motivating the design are in Section B. Panel labels: High-level Abstraction, Low-level Details, Summarization.]

[…] generates the corresponding detailed content. The low-level policy is implemented with the base LM, which conditions on the sequence of prior descriptors and detailed contents, together with the current descriptor, to generate the next detailed content. The high-level policy is implemented with a lightweight LoRA module (Hu et al., 2022), which conditions on previous descriptors and summaries to produce the next descriptor. Since descriptors are much shorter than full reasoning content, this component remains compact and computationally efficient. The design rationale behind this architecture, as well as ablation studies, is provided in Section B. Additionally, we fine-tune an independent, lightweight LLM for summarization, which is shared across different base models.

ITERATIVE STEP-DPO WITH PROCESS-LEVEL PREFERENCES

To train our model effectively, we introduce an iterative Step-DPO pipeline that performs stepwise preference optimization with