AAAI 2025

Scaling Effects on Latent Representation Edits in GPT Models (Student Abstract)

Austin L. Davis, Gita Sukthankar

Abstract

Probing classifiers are a technique for understanding and modifying the operation of neural networks: a smaller classifier is trained on the model's internal representations to solve a related probing task. Much like a neural electrode array, probing classifiers can help researchers both read and edit the internal representation of a neural network. This paper evaluates the use of probing classifiers to modify the internal hidden state of a chess-playing transformer. We demonstrate that the scale of the intervention vector should decay as a negative exponential function of the input length, so that model outputs remain semantically valid after the residual stream activations are edited.
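The scaling rule can be illustrated with a minimal sketch. The abstract does not give the exact functional form, its constants, or how the intervention vector is derived, so everything below is an assumption made for illustration: the names `intervention_scale` and `edit_residual`, the parameters `base_scale` and `decay_rate`, and the choice of a normalized linear-probe direction as the intervention vector are all hypothetical.

```python
import numpy as np


def intervention_scale(seq_len, base_scale=1.0, decay_rate=0.05):
    """Hypothetical schedule: base_scale * exp(-decay_rate * seq_len).

    Both constants are placeholders; the abstract does not report values.
    """
    return base_scale * np.exp(-decay_rate * seq_len)


def edit_residual(hidden, probe_direction, seq_len,
                  base_scale=1.0, decay_rate=0.05):
    """Add a length-scaled intervention vector to one residual-stream activation.

    hidden:          activation vector at the edited position, shape (d_model,)
    probe_direction: direction taken from a linear probe (an assumption here),
                     shape (d_model,)
    seq_len:         number of input tokens (or moves) seen so far
    """
    # Normalize so that the scale alone controls the size of the edit.
    unit = probe_direction / np.linalg.norm(probe_direction)
    return hidden + intervention_scale(seq_len, base_scale, decay_rate) * unit


# Usage: the same edit shrinks as the input grows longer.
rng = np.random.default_rng(0)
hidden = rng.normal(size=8)
direction = rng.normal(size=8)
short_edit = edit_residual(hidden, direction, seq_len=10)
long_edit = edit_residual(hidden, direction, seq_len=40)
print(np.linalg.norm(short_edit - hidden), np.linalg.norm(long_edit - hidden))
```

Under these assumptions, an edit applied after a long input perturbs the activation less than the same edit applied after a short input, which is the qualitative behavior the abstract describes.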