ICLR 2026

OR-PRM: A Process Reward Model for Algorithmic Problem in Operations Research

Yilin Wang, Heng Zhou, Dongxing Mao, Linjie Li, Jingru Tan, Haochen Han, Zhengyuan Yang, Alex Jinpeng Wang, Min Li

Abstract

Large language models (LLMs) with Process Reward Models (PRMs) have shown strong reasoning ability, yet their potential in Operations Research (OR) remains unexplored. We present the first PRM tailored for OR, but find that directly training on mainstream datasets yields surprisingly weak performance. To understand this gap, we conduct a systematic analysis and identify the primary bottleneck: the datasets themselves, where over 30% of annotations are severely flawed. To overcome these limitations, we first collect all existing synthetic datasets and apply a carefully designed filtering pipeline to construct a high-quality seed dataset. Building upon this seed, we then construct OR-ProcessQA, the first large-scale dataset for OR with step-by-step supervision, in which diverse solution pathways are generated via Monte Carlo Tree Search (MCTS) and each step is validated for logical consistency by GPT-4o. On this foundation, we train OR-PRM, the first Process Reward Model in the OR domain, designed to evaluate and guide reasoning at every step rather than only the final outcome. Together, these advances enable OR-PRM to substantially improve LLMs' reasoning capability, achieving a maximum absolute improvement of 12.5% over the base model in Best-of-N settings and highlighting the power of process-oriented supervision for reliable problem solving in operations research.

[Figure: an example OR problem shown as Question, Modeling, and Code (coptpy/COPT) panels, with example error annotations such as implementation errors and "Infeasible Problem".]
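The Best-of-N setting mentioned in the abstract can be illustrated with a minimal sketch: sample N candidate step-by-step solutions, score every step with a process reward model, aggregate the step scores into one solution score, and keep the highest-scoring candidate. Everything named here is an assumption for illustration and is not taken from the paper: `toy_step_scorer` is a hypothetical stand-in for a trained PRM, and the `min` and `prod` aggregation rules are common choices, not necessarily the ones OR-PRM uses.

```python
from math import prod
from typing import Callable, List, Sequence

# Hypothetical interface: given the question and the steps so far, return the
# estimated probability in [0, 1] that the most recent step is correct.
StepScorer = Callable[[str, Sequence[str]], float]


def score_solution(question: str, steps: Sequence[str],
                   step_scorer: StepScorer, aggregate: str = "min") -> float:
    """Aggregate per-step scores into a single solution-level score."""
    step_scores = [step_scorer(question, steps[: i + 1])
                   for i in range(len(steps))]
    if not step_scores:
        return 0.0  # an empty solution gets the lowest score
    if aggregate == "min":   # a solution is only as good as its weakest step
        return min(step_scores)
    if aggregate == "prod":  # treat step scores as independent probabilities
        return prod(step_scores)
    raise ValueError(f"unknown aggregate: {aggregate!r}")


def best_of_n(question: str, candidates: List[List[str]],
              step_scorer: StepScorer, aggregate: str = "min") -> List[str]:
    """Return the candidate solution with the highest aggregated score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates,
               key=lambda s: score_solution(question, s, step_scorer, aggregate))


def toy_step_scorer(question: str, steps_so_far: Sequence[str]) -> float:
    """Toy stand-in for a PRM: distrust steps that are marked unfinished."""
    return 0.2 if "TODO" in steps_so_far[-1] else 0.9


if __name__ == "__main__":
    question = "Schedule the printing of three books at minimum cost."
    candidates = [
        ["Define decision variables", "TODO: add capacity constraints"],
        ["Define decision variables", "Add capacity constraints", "Solve"],
    ]
    print(best_of_n(question, candidates, toy_step_scorer))
```

In practice the step scorer would be a call to the trained reward model, and the candidates would be N sampled outputs of the base LLM. The choice of aggregation rule matters: `min` penalizes a single bad step heavily, while `prod` also penalizes long solutions.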