EMNLP 2025

Can Large Language Models Win the International Mathematical Games?

Alessio Cocchieri, Luca Ragazzi, Giuseppe Tagliavini, Lorenzo Tordi, Antonella Carbonaro, Gianluca Moro

Abstract

Recent advances in large language models (LLMs) have demonstrated strong mathematical reasoning abilities, even in visual contexts, with some models surpassing human performance on existing benchmarks. However, these benchmarks lack structured age categorization and clearly defined skill requirements and, crucially, were not designed to assess human performance in international competitions. To address these limitations, we introduce MATHGAMES, a new benchmark of 2,183 high-quality mathematical problems (both text-only and multimodal) in an open-ended format, sourced from an international mathematical games championship. Spanning seven age groups and a skill-based taxonomy, MATHGAMES enables a structured evaluation of LLMs' mathematical and logical reasoning abilities. Our experiments reveal a substantial gap between state-of-the-art LLMs and human participants: even 11-year-olds consistently outperform some of the strongest models, highlighting the need for further advances in mathematical reasoning. Further, our detailed error analysis offers valuable insights to guide future research. The data is publicly available at https://disi-unibo-nlp.github.io/math-games/.

* Equal contribution (co-first authors).

[Figure: number of participants per score (y-axis) against competition score in % (x-axis, 0 to 100), one panel per category, with the count shown beside each: C1 (11-13 y/o) 1413; C2 (13-15 y/o) 709; L1 (15-18 y/o) 338; L2 (18-20 y/o) 177; GP (20-25 y/o) 112; HC (25+ y/o) 25. Example questions are shown for three groups: Early Teenager (11-13 y/o), Late Teenager (13-18 y/o), and Adult (18-25+ y/o).]

Question: How many small spheres of different colors are there in the figure?

Question: In figure you see tennis balls placed on top of each other, forming at each "plane" of the squares, without holes in the middle. The highest level contains only one ball; the second, coming down, contains 4; the third contains 9 and so on. If you use 7714 balls, how many floors will your pyramid of tennis balls be constituted?
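As an illustration of the arithmetic behind the tennis-ball question above, the following minimal sketch adds up square layers (1, 4, 9, ...) until the ball count is reached. The function name `pyramid_levels` is hypothetical and not part of the benchmark or its evaluation code; the sketch only assumes that level k holds k² balls, as the question states.

```python
from typing import Optional


def pyramid_levels(total_balls: int) -> Optional[int]:
    """Return n such that 1^2 + 2^2 + ... + n^2 == total_balls.

    Returns None when no whole number of complete levels uses
    exactly total_balls balls.
    """
    used = 0   # balls placed so far
    level = 0  # number of completed levels
    while used < total_balls:
        level += 1
        used += level * level  # level k is a k-by-k square of balls
    return level if used == total_balls else None


if __name__ == "__main__":
    print(pyramid_levels(7714))  # prints 28
```

The result agrees with the closed form for the sum of the first n squares, n(n+1)(2n+1)/6, which gives 28 · 29 · 57 / 6 = 7714 for n = 28.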
Question: In figure you see a pentagonal tile, quite singular, whose sides BC and AE measure 1 dm while AB measures 2 dm. Which is in cm², rounded to the nearest cm², the area of our tile? (If necessary, use 1,414 for √2 and 1,732 for √3).