ICLR 2025

Do LLMs "know" internally when they follow instructions?

Juyeon Heo, Christina Heinze-Deml, Oussama Elachqar, Kwan Ho Ryan Chan, Shirley You Ren, Andrew C. Miller, Udhyakumar Nallasamy, Jaya Narain

Abstract

A key to building safe and useful personal AI agents with LLMs lies in their ability to follow instructions precisely. Deployed models must strictly follow the instructions and constraints from users to ensure that the outputs are both safe and aligned with user intentions.