ICLR 2025

MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation

Zhongshen Zeng, Pengguang Chen, Shu Liu, Haiyun Jiang, Jiaya Jia

Abstract

In this work, we introduce a novel evaluation paradigm for Large Language Models, one that challenges them to engage in meta-reasoning. This approach addresses critical shortcomings in existing math problem-solving benchmarks, traditionally used to evaluate the cognitive capabilities of agents. Our paradigm shifts the focus from result-oriented assessments, which often overlook the reasoning process, to a more holistic evaluation that effectively differentiates the cognitive capabilities among models. For example, in our benchmark, GPT-4 demonstrates a performance five times better than GPT-3.5. The significance of this new paradigm lies in its ability to reveal potential cognitive deficiencies in LLMs that current benchmarks, such as GSM8K, fail to uncover due to their saturation and lack of effective differentiation among varying reasoning abilities. Our comprehensive analysis includes several state-of-the-art math models from both open-source and closed-source communities, uncovering fundamental deficiencies in their training and evaluation approaches.

1 Introduction

Pretrained on trillions of tokens and equipped with billions of parameters, today's large language models (OpenAI, 2023; Anthropic, 2023; Touvron et al., 2023) are capable of generating coherent text and have achieved superhuman performance on many tasks (Bubeck et al., 2023; Hendrycks et al., 2021). To differentiate the cognitive abilities of different models, math questions are often selected as a proxy evaluation task. However, despite the complexity and diversity of these math problems,