ACL2024

Steering Llama 2 via Contrastive Activation Addition

Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Matt Turner

Abstract

We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes. CAA computes "steering vectors" by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior, such as factual versus hallucinatory responses. During inference, these steering vectors are added at all token positions after the user's prompt with either a positive or negative coefficient, allowing precise control over the degree of the targeted behavior. We evaluate CAA's effectiveness on Llama 2 Chat using multiple-choice behavioral question datasets and open-ended generation tasks. We demonstrate that CAA significantly alters model behavior, is effective over and on top of traditional methods like finetuning and system prompt design, and minimally reduces capabilities. Moreover, we gain deeper insights into CAA's mechanisms by employing various activation space interpretation methods. CAA accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs).

1 Introduction

As the capabilities of Large Language Models (LLMs) have grown rapidly in recent years, an increasing body of research aims to ensure they are "helpful, honest, and harmless" (Askell et al., 2021) to reduce risks from misaligned, unsafe behavior (Bommasani et al., 2021).

Researchers have developed several techniques for aligning LLMs, such as Reinforcement Learning from Human Feedback (RLHF) (Ziegler et al., 2020), instruction finetuning (Wei et al., 2021), and prompt engineering (Brown et al., 2020).
However, many challenges remain, including collecting diverse and representative datasets for the target behaviors, preventing hallucination, and mitigating [...]

Liu et al. (2023) steer models to reduce toxicity and affect style transfer. Unlike CAA, they steer the attention activations rather than the residual stream and intervene at all transformer layers rather than a single layer.

Beyond steering behaviors, work on activation engineering has also motivated a formalization of "linear representation" (Park et al., 2023) and helped verify linear representations of sentiment in LLMs (Tigges et al., 2023).

3 Method

The key idea behind CAA is to generate a steering vector that can shift a language model's output distribution towards a desired behavior during inference. We create these steering vectors using pairs of prompts: one prompt demonstrating the desired behavior and one prompt demonstrating the opposite. By taking the average difference between the language model's activations on a set of paired prompts, we isolate the direction in the model's latent space corresponding to the target behavior.

Specifically, our prompt pairs consist of multiple-choice questions with answer letters (either "A" or "B") appended at the end. The two prompts contain the same question but end with different answers; the "positive" prompt ends with the letter corresponding to the behavior in question, and the "negative" prompt ends with the letter corresponding to its opposite.

To construct a steering vector, we compute the difference in the language model's activations at the position of the answer letter between all the positive and negative prompts.
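
The construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the activations are made-up 3-dimensional lists standing in for residual stream activations at the answer-letter position, and the names `mean_difference` and `add_steering` are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_difference(pos_acts: Sequence[Vector], neg_acts: Sequence[Vector]) -> Vector:
    """Average the (positive - negative) activation differences over prompt pairs.

    pos_acts[i] and neg_acts[i] are hypothetical stand-ins for the activations
    of pair i at one layer, read at the position of the answer letter.
    """
    if len(pos_acts) != len(neg_acts) or not pos_acts:
        raise ValueError("need a non-empty list of matched pairs")
    dim = len(pos_acts[0])
    total = [0.0] * dim
    for pos, neg in zip(pos_acts, neg_acts):
        for j in range(dim):
            total[j] += pos[j] - neg[j]
    return [t / len(pos_acts) for t in total]


def add_steering(activation: Vector, steering: Vector, multiplier: float) -> Vector:
    """Add the scaled steering vector to the activation at one token position."""
    return [a + multiplier * s for a, s in zip(activation, steering)]


# Toy activations for two prompt pairs (illustrative numbers only).
positive = [[1.0, 0.0, 2.0], [3.0, 1.0, 0.0]]
negative = [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]

v_md = mean_difference(positive, negative)
# A negative multiplier steers away from the behavior, a positive one towards it.
steered = add_steering([0.5, 0.5, 0.5], v_md, multiplier=-1.0)
```

In a real setting the vectors would come from a model's residual stream at a chosen layer, and `add_steering` would be applied at every token position after the user's prompt; how those activations are read and written depends on the model framework and is left out here.
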
This method of extracting the difference vector is called Mean Difference (MD) and has been shown to produce steering vectors similar to other techniques like PCA (Tigges et al., 2023).

Formally, given a dataset $D$ of (prompt $p$, positive completion $c_p$, negative completion $c_n$) triples, we calculate the MD vector $v_{MD}$ for a layer $L$ as:

$$v_{MD} = \frac{1}{|D|} \sum_{(p,\, c_p,\, c_n) \in D} \left[ a_L(p, c_p) - a_L(p, c_n) \right]$$

where $a_L(p, c)$ denotes the activations at layer $L$ at the position of the answer letter $c$ appended to prompt $p$.

[...] stream activations in two dimensions emerging suddenly after a particular layer. For instance, Figure 1 shows projected activations on the refusal contrastive dataset at layers 9 and 10 of Llama 2 7B Chat. The visible behavioral clustering emerges suddenly at layer 10. This trend is seen across our other datasets.

4 Effect of CAA on behaviors

4.1 Multiple-choice question datasets

We generate steering vectors for each behavioral dataset (generation dataset sizes provided in Appendix F). We then evaluate their steering effects on 50 held-out multiple-choice questions with the same format as our generation sets.

To find the optimal layer for steering, we sweep over all layers and perform CAA with multipliers of -1 and 1, assessing the effect size on the held-out test questions.

Charts of these sweeps are shown in Figure 2. Each line corresponds to a different behavior.

(a) Effect of CAA at different