ICLR 2025
Residual Stream Analysis with Multi-Layer SAEs
Tim Lawson, Lucy Farnik, Conor J. Houghton, Laurence Aitchison
Abstract
Sparse autoencoders (SAEs) learn interpretable directions or latents in the representation spaces of transformer language models. But we want to understand and control model behaviors, which span multiple layers. There are two options to link latents across layers:
- Match latents from SAEs trained at different layers, like Balcells et al. (2024), Balagansky et al. (2024), and Paulo et al. (2024)
- Learn latents that represent the same concept at multiple layers, like Yun et al. (2023) and Ghilardi et al. (2024)
Multi-Layer SAEs
We train a single SAE on the residual stream activation vectors from every layer of a transformer.
How similar are transformer layers?
We expect the vector spaces at different layers to be similar:
- Intuitively, due to residual connections (e.g., Elhage et al. 2021)
- Empirically, from path patching (e.g., Goldowsky-Dill et al. 2023)
The larger the model, the more similar the residual stream activation vectors at adjacent layers (cf. Lad et al. 2024
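
The two ideas above, pooling residual stream vectors from every layer into one training set and checking how similar adjacent layers are, can be illustrated with a minimal sketch. It does not load a real transformer or train an SAE: `toy_residual_stream` is a hypothetical stand-in in which each layer adds a small random update to the previous layer's vector, and all function names and parameters (`pool_layers`, `mean_adjacent_cosine`, `update_scale`) are assumptions made for illustration only.

```python
import math
import random


def toy_residual_stream(n_tokens, n_layers, d_model, update_scale=0.3, seed=0):
    """Hypothetical stand-in for a transformer's residual stream.

    Each layer adds a small random update to the previous layer's vector,
    mimicking a residual connection. Returns acts[layer][token] -> vector.
    """
    rng = random.Random(seed)
    current = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_tokens)]
    acts = []
    for _ in range(n_layers):
        current = [[x + update_scale * rng.gauss(0, 1) for x in vec] for vec in current]
        acts.append(current)
    return acts


def pool_layers(acts):
    """Flatten activations from every layer into one list of examples.

    A single SAE would then be trained on the vectors in this list, so the
    same latents are shared across layers. The layer and token indices are
    kept so that latent activations can later be grouped by layer.
    """
    return [(layer, token, vec)
            for layer, layer_acts in enumerate(acts)
            for token, vec in enumerate(layer_acts)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_adjacent_cosine(acts):
    """Mean cosine similarity between each token's vectors at adjacent layers."""
    sims = [cosine(acts[layer][token], acts[layer + 1][token])
            for layer in range(len(acts) - 1)
            for token in range(len(acts[layer]))]
    return sum(sims) / len(sims)


acts = toy_residual_stream(n_tokens=8, n_layers=4, d_model=16)
dataset = pool_layers(acts)
print(len(dataset))  # 8 tokens x 4 layers = 32 training examples
print(mean_adjacent_cosine(acts))
```

In this toy setup the adjacent-layer similarity is high by construction, because each layer's update is small relative to the vector it is added to. Whether and how strongly this holds in a real model is an empirical question that the sketch does not answer.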