ACL 2025

Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning

Joykirat Singh, Akshay Uttama Nambi, Vibhav Vineet

Cited by 10

Abstract

Large Language Models (LLMs) have been applied to Math Word Problems (MWPs) with transformative impact, changing how these complex problems are approached and solved in various domains, including educational settings. However, the evaluation of these models often prioritizes final accuracy, overlooking the crucial aspect of reasoning capabilities. This work addresses this gap by focusing on the ability of LLMs to detect and correct reasoning mistakes. We introduce a novel dataset, MWP-MISTAKE, incorporating MWPs with both correct and incorrect reasoning steps generated through rule-based methods and smaller language models. Our comprehensive benchmarking reveals significant insights into the strengths and weaknesses of state-of-the-art models such as GPT-4o, GPT-4, GPT-3.5 Turbo, and others. We highlight GPT-4o's superior performance in mistake detection and rectification and the persistent challenges faced by smaller models. Additionally, we identify issues related to data contamination and memorization, which impact the reliability of LLMs in real-world applications. Our findings emphasize the importance of rigorous evaluation of reasoning processes and propose future directions to enhance the generalization and robustness of LLMs in mathematical problem-solving.

Math Word Problems (MWPs) convey mathematical concepts and calculations through written descriptions, typically involving narrative scenarios [28]. Solvers must extract relevant mathematical information from these narratives and apply appropriate principles to arrive at solutions. Studies [34, 15, 11] have demonstrated that LLMs are proficient at understanding the contextual subtleties of MWPs, translating textual descriptions into mathematical expressions, and delivering precise solutions. Central to this process is mathematical reasoning, which enables models to manage complex, multi-step problems, draw logical inferences, and provide accurate solutions.
Although foundational LLMs such as Claude-3-Opus [2], Gemini Ultra [29], and OpenAI's models achieve accuracy above 90% on datasets like GSM-8K (a grade school math dataset of linguistically diverse word problems) [9], a significant gap remains in our understanding of their mathematical reasoning capabilities [11]. Current research predominantly focuses on evaluating the final accuracy of MWPs [23, 35], neglecting the intricate reasoning processes necessary to derive solutions. We argue that the reasoning steps play a pivotal role, and