ACL 2024

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, Daniel Fried

Abstract

Autonomous agents capable of planning, reasoning, and executing actions on the web offer a promising avenue for automating computer tasks. However, most existing benchmarks focus on text-based agents, neglecting many natural tasks that require visual information to solve effectively. Given that most computer interfaces cater to human perception, visual information often augments textual data in ways that text-only models struggle to harness effectively. To bridge this gap, we introduce VisualWebArena, a benchmark designed to assess the performance of multimodal agents on realistic visually grounded web tasks. VisualWebArena comprises diverse and complex web-based tasks that evaluate various capabilities of autonomous multimodal agents. To perform well, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives. We evaluate state-of-the-art LLM-based autonomous agents, including several multimodal agents. Our analysis reveals several limitations of text-based LLM agents, gaps in the capabilities of state-of-the-art multimodal language agents, and insights towards building stronger autonomous agents for the web.