ACL 2024

SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, Zhiyong Wu

Cited by 33

Abstract

Graphical User Interface (GUI) agents are designed to automate complex tasks on digital devices, such as smartphones and desktops. Most existing GUI agents interact with the environment through extracted structured data, which can be notably lengthy (e.g., HTML) and occasionally inaccessible (e.g., on desktops). To alleviate this issue, we propose a novel visual GUI agent, SeeClick, which relies only on screenshots for task automation. In our preliminary study, we discovered a key challenge in developing visual GUI agents: GUI grounding, the capacity to accurately locate screen elements based on instructions. To tackle this challenge, we propose to enhance SeeClick with GUI grounding pre-training and devise a method to automate the curation of GUI grounding data. We have also created ScreenSpot, the first realistic GUI grounding benchmark that encompasses mobile, desktop, and web environments. After pre-training, SeeClick demonstrates significant improvement on ScreenSpot over various baselines. Moreover, comprehensive evaluations on three widely used benchmarks consistently support our finding that advancements in GUI grounding directly correlate with enhanced performance in downstream GUI agent tasks. The model, data and code will be open-sourced.

cution (Kim et al., 2023; Zheng et al., 2023).
However, GUI agents that depend on structured text face three inherent limitations: (1) Structured text is not always accessible, especially for iOS or desktop applications where acquiring such information is challenging (Shaw et al., 2023); (2) The verbose nature of structured text constitutes an inefficient context for LLMs, while also omitting crucial information such as layout, images, and icons (Deng et al., 2023); (3) The variety of structured text, including HTML, DOM, and Android VH, necessitates the curation of task-specific observation and action spaces (Kim et al., 2023; Zhou et al., 2023). These entrenched deficiencies in text-based approaches call for an alternative solution.

In this paper, we introduce SeeClick, a visual GUI agent built on Large Vision-Language Models (LVLMs). Inspired by human interaction with GUIs, as illustrated in Figure 1, SeeClick is designed to perform low-level actions like clicking or typing directly by observing interface screenshots. This approach bypasses the interaction with cumbersome structured text, enabling SeeClick to adapt to various GUI platforms. Building such visual agents presents a foundational challenge: GUI grounding, the capacity to accurately locate screen elements based on instructions, which is absent in current LVLMs. To tackle this challenge, SeeClick enhances the LVLM with a GUI grounding pre-training strategy. We devise a method to automate the curation of web grounding data and adapt public mobile UI datasets to obtain mobile grounding data. SeeClick employs the above-curated dataset for continual pre-training of the LVLM, enabling it to accurately locate elements such as text, widgets, and icons in various GUI environments.
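
To make the notion of GUI grounding concrete, the following minimal sketch scores predicted click points against target elements. It rests on assumptions not stated in this section: that a prediction counts as correct when the click point falls inside the target element's bounding box, and that coordinates are normalized to [0, 1]. The names `GroundingSample`, `click_hits_target`, and `grounding_accuracy` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


@dataclass(frozen=True)
class GroundingSample:
    """One instruction paired with the target element's bounding box.

    Coordinates are normalized to [0, 1] relative to the screenshot
    (an assumption of this sketch, not taken from the text above).
    """
    instruction: str
    bbox: Box


def click_hits_target(point: Point, bbox: Box) -> bool:
    """Return True if the predicted click point lies inside the box."""
    x, y = point
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(
    samples: Sequence[GroundingSample],
    predictions: Sequence[Point],
) -> float:
    """Fraction of samples whose predicted click lands on the target."""
    if len(samples) != len(predictions):
        raise ValueError("samples and predictions must have equal length")
    if not samples:
        return 0.0
    hits = sum(
        click_hits_target(point, sample.bbox)
        for sample, point in zip(samples, predictions)
    )
    return hits / len(samples)


# Toy data: the first click lands on its target, the second misses.
samples = [
    GroundingSample("open settings", (0.10, 0.10, 0.30, 0.20)),
    GroundingSample("tap the search bar", (0.50, 0.50, 0.90, 0.60)),
]
predictions = [(0.20, 0.15), (0.10, 0.10)]
accuracy = grounding_accuracy(samples, predictions)
```

In this toy case one of the two predicted points falls inside its box, so `accuracy` is 0.5. A stricter or looser criterion (for example, box overlap instead of point containment) could be substituted without changing the overall structure.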
Given that GUI grounding is a fundamental yet underexplored capacity for GUI agents, we establish ScreenSpot, the first realistic GUI grounding evaluation benchmark across various GUI platforms. ScreenSpot contains over 600 screenshots and 1200 instructions from iOS, Android, macOS, Windows, and webpages, and specifically includes both text-based elements and a variety of widgets and icons. Evaluation results confirm SeeClick's superiority over current LVLMs, validating the effectiveness of GUI grounding pre-training.

Finally, we adapt SeeClick to mobile and web agent tasks, including MiniWob (Shi et al., 2017), AITW (Rawles et al., 2023), and Mind2Web (Deng et al., 2023). As a purely vision-based agent, SeeClick achieves impressive performance. It surpasses the strong visual baseline Pix2Act while using merely 0.3% of the training data. Moreover, experimental results on these three benchmarks consistently support our finding that improvement in GUI grounding directly correlates with enhanced agent task performance.

Our main contributions are as follows:

• We develop a unified visual GUI agent, SeeClick, which relies solely on interface screenshots to perform clicking and typing actions across diverse GUI platforms.
• We prospectively explore GUI grounding for visual GUI agents, and enhance SeeClick with our proposed GUI grounding pre-training strategy.
• We create a realistic GUI grounding benchmark, ScreenSpot, encompassing more than 1200 instructions from various GUI platforms.
• Experimental results on ScreenSpot and three agent tasks demonstrate that enhancing agents' grounding capa