ICLR2026
ContextPRM: Leveraging Contextual Coherence for multi-domain Test-Time Scaling
Haotian Zhang, Liu Liu, Baosheng Yu, Jiayan Qiu, Likang Xiao, Yanwei Ren, Quan Chen, Xianglong Liu
摘要
Process reward models (PRMs) have demonstrated significant efficacy in enhancing the mathematical reasoning capabilities of large language models (LLMs) by leveraging test-time scaling (TTS). However, while most PRMs exhibit substantial gains in mathematical domains, the scarcity of domain-specific training data and knowledge-based learning patterns limits their generalization ability when faced with other domains. To address this limitation, we shift the learning objective from verifying domain-specific knowledge to modeling domainagnostic logical flow. Centering on contextual coherence between chain-of-thought (CoT) steps, our approach is realized through a novel data annotation and training framework, which enhances the model's generalization capabilities across diverse domains. For instance, our resulting model, ContextPRM, achieves a notable 6.5% average accuracy improvement over the majority voting baseline via weighted majority voting across nine nonmathematical domains in MMLU-Pro, including law, history, and philosophy, significantly surpassing the 2.2% improvement from VersaPRM and 0.5% gains from other mathematics-focused PRMs, demonstrating consistent performance across both mathematical and non-mathematical domains. Figure 1. Key results demonstrating ContextPRM's effectiveness. (Left): Training on a single non-math domain surpasses the prior multi-domain SOTA, VersaPRM. (Right): In non-math-adjacent domains, ContextPRM consistently outperforms all baselines under WMV sampling, showcasing superior generalization.