CVPR 2025

Sketchtopia: A Dataset and Foundational Agents for Benchmarking Asynchronous Multimodal Communication with Iconic Feedback

Mohd Hozaifa Khan, Ravi Kiran Sarvadevabhatla

Abstract

We introduce Sketchtopia, a large-scale dataset and AI framework designed to explore goal-driven, multimodal communication through asynchronous interactions in a Pictionary-inspired setup. Sketchtopia captures natural human interactions, including freehand sketches, open-ended guesses, and iconic feedback gestures, showcasing the complex dynamics of cooperative communication under constraints. It features over 20K gameplay sessions from 916 players, capturing 263K sketches, 10K erases, 56K guesses and 19.4K iconic feedbacks. We introduce multimodal foundational agents with capabilities for generative sketching, guess generation and asynchronous communication. Our dataset also includes 800 human-agent sessions for benchmarking the agents. We introduce novel metrics to characterize collaborative success, responsiveness to feedback and inter-agent asynchronous communication. Sketchtopia pushes the boundaries of multimodal AI, establishing a new benchmark for studying asynchronous, goal-oriented interactions between humans and AI agents. The dataset can be found at https://sketchtopia25.github.io/

Related Works

Sketch datasets: Existing sketch datasets primarily focus on recognition [13, 40], segmentation [22, 26, 36, 47, 49, 51], and retrieval [53], with labels assigned to the final sketch. In contrast, our dataset includes intermediate text "guess" annotations from gameplay, providing temporal grounding for sketches throughout the drawing process. Additionally, our dataset covers a wider array of categories, including abstract concepts (nouns, verbs, adjectives). Some works incorporate text guessing to simulate Pictionary [41], but lack the complexity and richness generated by interacting agents and gameplay dynamics.