NeurIPS 2024

WhodunitBench: Evaluating Large Multimodal Agents via Murder Mystery Games

Junlin Xie, Ruifei Zhang, Zhihong Chen, Xiang Wan, Guanbin Li

♡ Equal contribution. ♣ Corresponding authors.

Abstract

Recently, large language models (LLMs) have achieved superior performance, empowering the development of large multimodal agents (LMAs). An LMA expected to perform practical tasks must possess a range of capabilities, including multimodal perception, interaction, reasoning, and decision-making skills. However, existing benchmarks are limited in assessing compositional skills and actions