ICML 2025
Towards Global-level Mechanistic Interpretability: A Perspective of Modular Circuits of Large Language Models
Yinhan He, Wendy Zheng, Yushun Dong, Yaochen Zhu, Chen Chen, Jundong Li
Abstract
Mechanistic interpretability (MI) research aims to understand large language models (LLMs) by identifying computational circuits, i.e., subgraphs of model components with associated functional interpretations (FIs), that explain specific model behaviors. Current MI research mainly focuses on discovering task-specific circuits, which have two key limitations: (1) low generalization ability across diverse language tasks and (2) high costs due to the need for humans or advanced LLMs to interpret each computational node. To address these challenges, we propose a novel modular circuit (MC) vocabulary of task-agnostic functional units, each containing a small computational subgraph with its interpretation obtained by examining the subgraph's behavior on extensive corpora. By allowing different language tasks to share common MCs, our approach enables global interpretability while reducing costs by reusing established interpretations for new tasks. In addition, we propose five criteria for characterizing the MC vocabulary and present ModCirc, a novel global-level MI framework for discovering MC vocabularies in LLMs. We demonstrate ModCirc's effectiveness by showing that it can identify modular circuits that perform well on various metrics.