KDD 2025

A Framework for Evaluating AI Agents in Open-Ended Conversations via Scripted Simulation

Clarice Wang, Yimin Shi, Xiaokui Xiao

Abstract

Traditional evaluations of conversational AI have primarily focused on "closed-ended" interactions, where a human user queries the AI system, such as in customer support. However, many advanced real-world applications, such as job interviews, podcast hosting, and legal or healthcare intake discussions, require "open-ended" interactions in which the AI must take initiative by formulating questions to fully understand the human user's story. As AI assumes broader roles that demand greater autonomy, evaluating its performance in open-ended conversations becomes significantly more complex. This paper introduces a novel framework for rigorously assessing AI agents in such open-ended interviews. In this framework, a secondary AI agent (Agent B) simulates a human interviewee by strictly following a structured script that defines topic strengths and weaknesses, along with guidelines for when to hint at or deviate from these topics. Meanwhile, the primary AI agent (Agent A) engages in dynamic questioning to uncover the script's underlying facts and narrative. By comparing the final conversation transcript to the original script, we assess Agent A using metrics such as completeness, consistency, and investigative depth. This approach not only establishes a new benchmark for open-ended conversational skills but also provides insight into how effectively an AI agent can detect and navigate strategic diversions in scripted behavior.
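
To make the transcript-to-script comparison concrete, the following is a minimal sketch of how a completeness score might be computed. It is an illustration under stated assumptions, not the paper's implementation: the `ScriptFact` structure, the `completeness` function, the `"A"`/`"B"` speaker labels, and the keyword-matching rule are all hypothetical, and a real system would likely use a more robust matcher (for example, an LLM judge) in place of keyword lookup.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ScriptFact:
    """One fact in the interviewee script (hypothetical structure)."""
    fact_id: str
    # Assumption: a fact counts as revealed when all keywords occur in one Agent B turn.
    keywords: tuple[str, ...]


def completeness(script: list[ScriptFact], transcript: list[tuple[str, str]]) -> float:
    """Fraction of script facts revealed in Agent B's turns.

    `transcript` is a list of (speaker, utterance) pairs with speaker "A" or "B".
    Only Agent B's turns are checked, since Agent A must elicit the facts.
    """
    if not script:
        return 0.0
    b_turns = [text.lower() for speaker, text in transcript if speaker == "B"]
    revealed = {
        fact.fact_id
        for fact in script
        for turn in b_turns
        if all(kw.lower() in turn for kw in fact.keywords)
    }
    return len(revealed) / len(script)


# Toy example: Agent A uncovers one of the two scripted facts.
script = [
    ScriptFact("led_team", ("led", "team")),
    ScriptFact("missed_deadline", ("missed", "deadline")),
]
transcript = [
    ("A", "Tell me about your last project."),
    ("B", "I led a team of five engineers."),
    ("A", "Did anything go wrong?"),
    ("B", "Mostly it went fine."),
]
score = completeness(script, transcript)
print(score)  # 0.5
```

In this toy run, Agent A never presses on the vague answer "Mostly it went fine," so the scripted weakness stays hidden and the score is 0.5. Consistency and investigative depth would need their own definitions, which the abstract does not specify.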