ICLR 2025
Do LLMs "know" internally when they follow instructions?
Juyeon Heo, Christina Heinze-Deml, Oussama Elachqar, Kwan Ho Ryan Chan, Shirley You Ren, Andrew C. Miller, Udhyakumar Nallasamy, Jaya Narain
Abstract
A key to building safe and useful personal AI agents with LLMs lies in their ability to follow instructions precisely. Deployed models must strictly follow the instructions and constraints from users to ensure that the outputs are both safe and aligned with user intentions.