ICLR 2026

Go-Browse: Training Web Agents with Structured Exploration

Apurva Gandhi, Graham Neubig

Cited by 19

Abstract

One of the fundamental problems in digital agents is their lack of understanding of their environment. For instance, a web browsing agent may get lost in unfamiliar websites, uncertain what pages must be visited to achieve its goals. To address this, we propose Go-Browse, a method for automatically collecting diverse and realistic web agent data at scale through structured exploration of web environments. Go-Browse achieves efficient exploration by framing data collection as a graph search, enabling reuse of information across exploration episodes. We instantiate our method on the WebArena benchmark, collecting a dataset of 10K successful task-solving trajectories and 40K interaction steps across 100 URLs. Fine-tuning a 7B parameter language model on this dataset achieves a success rate of 21.7% on the WebArena benchmark, beating GPT-4o mini by 2.4% and exceeding current state-of-the-art results for sub-10B parameter models by 2.9%.

1 INTRODUCTION

Despite their impressive and often superhuman performance in other domains, most pretrained LLMs do not perform well on GUI-based web agent tasks. For instance, on the WebArena benchmark (Zhou et al.) where humans achieve a 78% success rate, frontier models like GPT-4O (OpenAI, 2024a) and GPT-4O-MINI (OpenAI, 2024b) score only 38% and 19% respectively, while a smaller model like QWEN-2.5-7B-INSTRUCT (Yang et al., 2024) scores only 8%. On the other hand, models trained specifically for GUI-based interaction score much better, with CLAUDE-3.7-SONNET (Anthropic, 2025) scoring 45.4% and COMPUTER-USING AGENT (OpenAI, 2025) achieving 58%. This gap suggests that training on agent-specific interaction data is crucial for realizing effective web agents.

But collecting high-quality web agent data presents its own set of challenges. Human-generated trajectories offer one source for quality demonstrations but are notoriously expensive and time-consuming to collect for the vast datasets required.
One class of methods tries to automatically scale human-generated data or use humans-in-the-loop in the dataset collection process (Shen et al., 2024; Zhou et al., 2024; Lai et al., 2024). Another line of work attempts to improve scalability further by proposing fully unsupervised and automatic methods for data generation; for example, by generating synthetic demonstrations from wikiHow-style tutorial articles (Ou et al.) or by building an exploration policy that collects data by interacting with websites (Murty et al., 2024a;b). Among these unsupervised methods, the latter ones that directly explore web environments of interest perform significantly better than those that use indirect and more generic knowledge from the internet (16% (Murty et al., 2024b) vs. 6% (Ou et al.) success rate).

This gap underscores a fundamental problem in digital agents: their lack of prior understanding of the environments they are deployed on. Learning from a tutorial or even a human-generated demonstration on how to cancel an ongoing order on Amazon is unlikely to transfer to the myriad of other websites that a web agent may need to interact with. Instead, agents are likely to be more successful if they learn directly from environments they will encounter.

Footnote 1: We release our code, dataset and models at https://github.com/ApGa/Go-Browse.

In this work, we introduce GO-BROWSE, a method that automatically collects diverse, realistic, and tailored web agent data through systematic and structured exploration of websites. In particular,