EMNLP 2025

Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time

Huihan Li, You Chen, Siyuan Wang, Yixin He, Ninareh Mehrabi, Rahul Gupta, Xiang Ren

Abstract

Large Language Models (LLMs) perform well on reasoning benchmarks but often fail when inputs are altered slightly, raising concerns about the extent to which their success relies on memorization. This issue is especially acute in Chain-of-Thought (CoT) reasoning, where spurious memorized patterns can trigger intermediate errors that cascade into incorrect final answers. We introduce STIM, a novel framework for Source-aware Token-level Identification of Memorization, which attributes each token in a reasoning chain to one of multiple memorization sources (local, mid-range, or long-range) based on their statistical co-occurrence with the token in the pretraining corpus. Our token-level analysis across tasks and distributional settings reveals that models rely more on memorization in complex or long-tail cases, and that local memorization is often the dominant driver of errors, accounting for up to 67% of wrong tokens. We also show that memorization scores from STIM can be effective in predicting the wrong tokens within an incorrect reasoning step. STIM offers a powerful tool for diagnosing and improving model reasoning and can generalize to other structured step-wise generation tasks. 1

* Equal Contribution
† Work done in Amazon AGI
1 https://github.com/INK-USC/STIM

Input Question: Last night Rick killed ten wolves and 15 cougars while hunting. Today Rick killed three times as many wolves as cougars and three fewer cougars than the previous night. How many animals did Rick kill?

Model Reasoning Steps: Rick killed 10 + 15 = 25 animals last night. Today Rick killed 3 * 3 = 9 wolves. Today Rick killed 15 - 3 = 12 cougars. Today Rick killed 9 + 12 = 21 animals. Total animals killed by Rick is 25 + 21 = 46. So the answer is 46.

Examples of Simultaneous Memorization Sources of Different Strength

Question: Here is a list: [pitanga, pitanga, yumberry, yumberry, pitanga, yumberry, pitanga, yumberry, pitanga, pitanga]. How many times does 'pitanga' appear on it?

Model Output: 1. The first element is 'pitanga'. ... 10. The tenth element is 'pitanga'. Now, counting the occurrences of 'pitanga': We have 'pitanga' appearing 5 times at positions 1, 2, 5, 7, and 10. So the answer is 5.

Analysis: The token "5" is incorrect and receives its highest influence from the local context, where the n-gram "appearing 5" has a high pretraining frequency. The STIM scores are local = 0.81, mid = 0.24, long = 0.03, suggesting that the model was influenced by locally frequent patterns rather than by actual counting.
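
The errors in the two examples above can be checked by recomputing the ground truth directly. The following is a minimal sketch for illustration only and is not part of STIM: all variable names are hypothetical, the wolf count assumes the reading "today's wolves = 3 x today's cougars", and treating the highest of the three reported scores as the dominant source is an assumption made here rather than a rule taken from the paper.

```python
# Ground-truth check for the two worked examples (illustrative; not part of STIM).

# Example 1: Rick's hunt.
last_night_wolves, last_night_cougars = 10, 15
today_cougars = last_night_cougars - 3   # "three fewer cougars than the previous night"
today_wolves = 3 * today_cougars         # assumed reading: three times today's cougars
rick_total = (last_night_wolves + last_night_cougars
              + today_wolves + today_cougars)

# Example 2: counting 'pitanga' in the list from the question.
items = ["pitanga", "pitanga", "yumberry", "yumberry", "pitanga",
         "yumberry", "pitanga", "yumberry", "pitanga", "pitanga"]
positions = [i for i, item in enumerate(items, start=1) if item == "pitanga"]

# Scores copied from the analysis; taking the argmax as the "dominant"
# source is an assumption of this sketch.
stim_scores = {"local": 0.81, "mid": 0.24, "long": 0.03}
dominant_source = max(stim_scores, key=stim_scores.get)

print("Rick total:", rick_total, "(model answered 46)")
print("pitanga positions:", positions, "count:", len(positions), "(model answered 5)")
print("dominant source:", dominant_source)
```

Under these assumptions, the first chain goes wrong at the step "3 * 3 = 9 wolves", and the second model output omits position 9 from its list of occurrences, which is consistent with the incorrect count token flagged in the analysis.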