ACL2024

CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation

Quan Tu, Shilong Fan, Zihang Tian, Tianhao Shen, Shuo Shang, Xin Gao, Rui Yan

Abstract

Recently, the advent of large language models (LLMs) has revolutionized generative agents. Among them, Role-Playing Conversational Agents (RPCAs) attract considerable attention due to their ability to emotionally engage users. However, the absence of a comprehensive benchmark impedes progress in this field. To bridge this gap, we introduce CharacterEval, a Chinese benchmark for comprehensive RPCA assessment, complemented by a tailored high-quality dataset. The dataset comprises 1,785 multi-turn role-playing dialogues, encompassing 11,376 examples and featuring 77 characters derived from Chinese novels and scripts. It was carefully constructed, beginning with initial dialogue extraction via GPT-4, followed by rigorous human-led quality control, and enhanced with in-depth character profiles sourced from Baidu Baike. CharacterEval employs a multifaceted evaluation approach, encompassing thirteen targeted metrics on four dimensions. To facilitate the convenient evaluation for these subjective metrics in CharacterEval, we further developed CharacterRM, a role-playing reward model based on human annotations, which has a higher correlation with human judgment compared to GPT-4. Comprehensive experiments on CharacterEval demonstrate that Chinese LLMs exhibit more promising capabilities than GPT-4 in Chinese role-playing conversation¹.

1 Introduction

The development of large language models (LLMs) has marked the beginning of a new era in conversa-