ICML 2025

Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment

Harrish Thasarathan, Julian Forsyth, Thomas Fel, Matthew Kowal, Konstantinos G. Derpanis

Abstract

Figure 1. Overview of Universal Sparse Autoencoders. (A) We introduce Universal Sparse Autoencoders (USAEs), a method for discovering concepts shared across multiple deep neural networks. USAEs are trained simultaneously on the activations of multiple models and are constrained to share a single aligned, interpretable dictionary of discovered concepts. (B) We also demonstrate one immediate application of USAEs, Coordinated Activation Maximization, in which the inputs of multiple models are optimized to activate the same concept, revealing how each model encodes it. The visualizations reveal concepts at various levels of abstraction, such as 'curves' (top), 'animal haunch' (middle) and 'the faces of crowds' (bottom). Project: yorkucvil.github.io/UniversalSAE
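
To make panel (A) concrete, the following is a minimal NumPy sketch of one way a shared concept dictionary could be set up; it is an illustration, not the authors' implementation. It assumes each model gets its own linear encoder and decoder, that the shared code is sparsified with a ReLU followed by top-k selection, and that the code computed from one model's activations is asked to reconstruct every model's activations for the same inputs. The class name `UniversalSAESketch`, the dimensions, the top-k choice, and the summed mean-squared-error loss are all hypothetical choices made for this example, and the training loop is omitted.

```python
import numpy as np


def top_k(z, k):
    """Keep the k largest entries in each row of z and zero out the rest."""
    idx = np.argpartition(z, -k, axis=1)[:, -k:]
    out = np.zeros_like(z)
    np.put_along_axis(out, idx, np.take_along_axis(z, idx, axis=1), axis=1)
    return out


class UniversalSAESketch:
    """Hypothetical sketch: per-model encoders/decoders around one shared sparse code."""

    def __init__(self, model_dims, n_concepts, k, rng):
        self.k = k
        # One encoder and one decoder per model; the concept axis is shared (assumption).
        self.enc = [rng.normal(0.0, 0.1, (d, n_concepts)) for d in model_dims]
        self.dec = [rng.normal(0.0, 0.1, (n_concepts, d)) for d in model_dims]

    def encode(self, i, acts_i):
        """Map model i's activations (batch, d_i) to the shared sparse code (batch, n_concepts)."""
        z = np.maximum(acts_i @ self.enc[i], 0.0)  # ReLU
        return top_k(z, self.k)

    def decode(self, j, z):
        """Map a shared code back into model j's activation space (batch, d_j)."""
        return z @ self.dec[j]

    def loss(self, i, acts):
        """Encode from model i, then sum reconstruction errors over all models."""
        z = self.encode(i, acts[i])
        return float(sum(np.mean((self.decode(j, z) - a) ** 2)
                         for j, a in enumerate(acts)))


# Toy usage: random arrays stand in for three models' activations on the same 4 inputs.
rng = np.random.default_rng(0)
dims, n_concepts, k = [16, 24, 8], 32, 5
sae = UniversalSAESketch(dims, n_concepts, k, rng)
acts = [rng.normal(size=(4, d)) for d in dims]
code = sae.encode(0, acts[0])
recons = [sae.decode(j, code) for j in range(len(dims))]
total = sae.loss(0, acts)
```

Because every decoder reads the same code, concept index `c` refers to the same dictionary entry in all models under this setup, which is the property that an application such as Coordinated Activation Maximization would rely on.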