ICLR2026

Go-Browse: Training Web Agents with Structured Exploration

Apurva Gandhi, Graham Neubig

Abstract

One of the fundamental problems in digital agents is their lack of understanding of their environment. For instance, a web browsing agent may get lost in unfamiliar websites, uncertain what pages must be visited to achieve its goals. To address this, we propose Go-Browse, a method for automatically collecting diverse and realistic web agent data at scale through structured exploration of web environments. Go-Browse achieves efficient exploration by framing data collection as a graph search, enabling reuse of information across exploration episodes. We instantiate our method on the WebArena benchmark, collecting a dataset of 10K successful task-solving trajectories and 40K interaction steps across 100 URLs. Fine-tuning a 7B parameter language model on this dataset achieves a success rate of 21.7% on the WebArena benchmark, beating GPT-4o mini by 2.4% and exceeding current state-of-the-art results for sub-10B parameter models by 2.9%.

1 INTRODUCTION

Despite their impressive and often superhuman performance in other domains, most pretrained LLMs do not perform well on GUI-based web agent tasks. For instance, on the WebArena benchmark (Zhou et al.) where humans achieve a 78% success rate, frontier models like GPT-4O (OpenAI, 2024a) and GPT-4O-MINI (OpenAI, 2024b) score only 38% and 19% respectively, while a smaller model like QWEN-2.5-7B-INSTRUCT (Yang et al., 2024) scores only 8%. On the other hand, models trained specifically for GUI-based interaction score much better, with CLAUDE-3.7-SONNET (Anthropic, 2025) scoring 45.4% and COMPUTER-USING AGENT (OpenAI, 2025) achieving 58%. This gap suggests that training on agent-specific interaction data is crucial for realizing effective web agents.
But collecting high-quality web agent data presents its own set of challenges. Human-generated trajectories offer one source for quality demonstrations but are notoriously expensive and time-consuming to collect for the vast datasets required.
One class of methods tries to automatically scale human-generated data or use humans-in-the-loop in the dataset collection process (Shen et al., 2024; Zhou et al., 2024; Lai et al., 2024). Another line of work attempts to improve scalability further by proposing fully unsupervised and automatic methods for data generation; for example, by generating synthetic demonstrations from wikiHow-style tutorial articles (Ou et al.) or by building an exploration policy that collects data by interacting with websites (Murty et al., 2024a;b). Among these unsupervised methods, the latter ones that directly explore web environments of interest perform significantly better than those that use indirect and more generic knowledge from the internet (16% (Murty et al., 2024b) vs. 6% (Ou et al.) success rate).
This gap underscores a fundamental problem in digital agents: their lack of prior understanding of the environments they are deployed on. Learning from a tutorial or even a human-generated demonstration on how to cancel an ongoing order on Amazon is unlikely to transfer to the myriad of other websites that a web agent may need to interact with. Instead, agents are likely to be more successful if they learn directly from environments they will encounter.
1 We release our code, dataset and models at https://github.com/ApGa/Go-Browse.
In this work, we introduce GO-BROWSE, a method that automatically collects diverse, realistic, and tailored web agent data through systematic and structured exploration of websites. In particular,