ICLR 2026

Hedonic Neurons: A Mechanistic Mapping of Latent Coalitions in Transformer MLPs

Tanya Chowdhury, Atharva Nijasure, Yair Zick, James Allan

Abstract

Fine-tuned Large Language Models (LLMs) encode rich task-specific features, but the form of these representations, especially within MLP layers, remains unclear. Empirical inspection of LoRA updates shows that new features concentrate in mid-layer MLPs, yet the scale of these layers obscures meaningful structure. Prior probing suggests that statistical priors may strengthen, split, or vanish across depth, motivating the need to study how neurons work together rather than in isolation. We introduce a mechanistic interpretability framework based on coalitional game theory, where neurons mimic agents in a hedonic game whose preferences capture their synergistic contributions to layer-local computations. Using top-responsive utilities and the PAC-Top-Cover algorithm, we extract stable coalitions of neurons (groups whose joint ablation has non-additive effects) and track their transitions across layers as persistence, splitting, merging, or disappearance. Applied to LLaMA, Mistral, and Pythia rerankers fine-tuned on scalar IR tasks, our method finds coalitions with consistently higher synergy than clustering baselines. By revealing how neurons cooperate to encode features, hedonic coalitions uncover higher-order structure beyond disentanglement and yield computational units that are functionally important, interpretable, and predictive across domains.

Introduction

Recent work has shown that LoRA fine-tuning can teach LLMs new tasks by updating only mid-level MLP layers, nearly matching full fine-tuning (Hu et al., 2022; Zhou et al., 2024; Nijasure et al., 2025). Yet inspection of these LoRA weight updates reveals little obvious structure: millions of parameters diffuse across neurons, obscuring which units encode task-specific features. We hypothesize that the key to isolating LoRA-emergent behaviour lies in identifying coalitions of neurons that consistently co-adapt under fine-tuning.
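To make the idea of non-additive joint ablation concrete, the following minimal sketch compares the effect of ablating a group of neurons together with the sum of their individual ablation effects. It is a toy illustration under stated assumptions, not the paper's utility definition: the saturating readout and the names `layer_output`, `ablation_effect`, and `synergy` are hypothetical and chosen only for this example.

```python
import numpy as np

def layer_output(h, w_out, mask):
    # Toy readout (assumption): weighted sum of unmasked activations,
    # saturating at 1.0 so that neurons can interact non-additively.
    return float(np.minimum(w_out @ (h * mask), 1.0))

def ablation_effect(h, w_out, neurons):
    # Drop in output when the given neurons are zeroed out together.
    mask = np.ones_like(h)
    mask[list(neurons)] = 0.0
    full = layer_output(h, w_out, np.ones_like(h))
    return full - layer_output(h, w_out, mask)

def synergy(h, w_out, coalition):
    # Joint ablation effect minus the sum of single-neuron effects.
    # A nonzero value means the group's contribution is non-additive.
    joint = ablation_effect(h, w_out, coalition)
    individual = sum(ablation_effect(h, w_out, [i]) for i in coalition)
    return joint - individual

# Hypothetical activations: neurons 0 and 1 are active, neuron 2 is silent.
h = np.array([1.0, 1.0, 0.0])
w_out = np.array([1.0, 1.0, 1.0])

print(synergy(h, w_out, [0, 1]))  # joint ablation matters, single ablations do not
print(synergy(h, w_out, [0, 2]))  # no interaction between these two
```

In this toy setting, removing either active neuron alone leaves the saturated output unchanged, while removing both eliminates it, so the pair shows nonzero synergy that per-neuron analysis would miss.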
Inspired by game theory, we model neurons as agents in a hedonic game (Dreze & Greenberg, 1980), where preferences reflect synergy with others. Though neurons are not literally rational, stochastic gradient descent imposes a form of selection pressure: directions that