ACL 2025
Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking
Yilong Chen, Junyuan Shang, Zhenyu Zhang, Yanxi Xie, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang
Abstract
Large language models (LLMs) face a performance ceiling as scaling parameters becomes impractical. Our observations indicate that while simple tokens are efficiently resolved in early layers with stable gradients, complex tokens trigger abrupt gradient spikes across layers, underscoring architectural limitations. Existing step-by-step reasoning methods, such as Chain-of-Thought, are hindered by their dependence on accurately generating critical tokens. We introduce Inner Thinking Transformer (ITT), a straightforward approach that enables models to "think" more deeply about important tokens by dynamically assigning extra inference steps through a token-wise dynamic depth architecture with residual iterative reasoning and step encoding. Experiments on LLaMA2-7B models at 355M, 1B, and 3B scales show that ITT consistently outperforms vanilla Transformers, with a 355M ITT model matching the performance of a 1B Transformer, offering a scalable, architecture-aware strategy to enhance LLM reasoning capabilities.
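The token-wise dynamic depth idea can be illustrated with a toy sketch. This is not the authors' implementation: the elementwise `toy_layer`, the externally supplied `scores` (standing in for a learned router), the scalar step offset `step_scale` (standing in for a learned step encoding), and all function names are assumptions made for illustration only. The sketch shows only the control flow: every token gets one pass, and the highest-scoring tokens receive extra residual passes, each tagged with its step index.

```python
import math


def toy_layer(x):
    """Stand-in for a Transformer block (assumption): elementwise tanh."""
    return [math.tanh(v) for v in x]


def inner_thinking(tokens, scores, max_steps=3, top_k=1, step_scale=0.1):
    """Toy token-wise dynamic depth with residual accumulation.

    tokens: list of hidden vectors (lists of floats).
    scores: per-token importance, a hypothetical router output.
    Returns the updated hidden vectors and the number of steps each token used.
    """
    # Step 1: every token passes through the layer once, with a residual add.
    hidden = [[h + f for h, f in zip(t, toy_layer(t))] for t in tokens]
    steps_used = [1] * len(tokens)

    # Pick the top_k highest-scoring tokens for additional thinking steps.
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    selected = order[:top_k]

    for step in range(1, max_steps):
        for i in selected:
            # Hypothetical step encoding: shift the input by a step-dependent offset.
            encoded = [v + step_scale * step for v in hidden[i]]
            # Residual iterative update: accumulate onto the previous state.
            update = toy_layer(encoded)
            hidden[i] = [h + u for h, u in zip(hidden[i], update)]
            steps_used[i] += 1

    return hidden, steps_used


tokens = [[0.0, 0.0], [1.0, -1.0], [0.5, 0.5]]
scores = [0.1, 0.9, 0.3]
hidden, steps_used = inner_thinking(tokens, scores, max_steps=3, top_k=1)
print(steps_used)
```

In this toy run only the second token, which has the highest score, receives the two extra steps; the other tokens keep their single-pass representation. A real model would presumably learn the routing scores and step encodings jointly with the layer weights.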