ICLR 2025

Digi-Q: Learning VLM Q-Value Functions for Training Device-Control Agents

Hao Bai, Yifei Zhou, Li Erran Li, Sergey Levine, Aviral Kumar

Abstract

A number of existing approaches for building foundation model agents rely on prompting or fine-tuning with human demonstrations, but these are not sufficient in dynamic environments (e.g., mobile device control). On-policy reinforcement learning (RL) should address these limitations, but collecting actual rollouts in an environment is often undesirable in truly open-ended agentic problems such as mobile device control or interacting with humans, where each unit of interaction is associated with a cost. In such scenarios, a method for policy learning that can utilize off-policy experience by training an action-value function is much more effective. In this paper, we develop an approach, called Digi-Q, to train VLM-based action-value Q-functions, which are then used to extract the agent policy. We study our approach in the mobile device control setting. Digi-Q trains the Q-function using offline temporal-difference (TD) learning on top of frozen, intermediate-layer features of a VLM. Compared to fine-tuning the whole VLM, this approach saves compute and enhances scalability. To make the VLM features amenable to representing the Q-function, we need to employ an initial phase of fine-tuning to amplify coverage over the actionable information needed for the value function. Once trained, we use this Q-function via a Best-of-N policy extraction operator that imitates the best action out of multiple candidate actions from the current policy, as ranked by the value function, enabling policy improvement without environment interaction. Digi-Q outperforms several prior methods on user-scale device control tasks in Android-in-the-Wild, attaining a 21.2% improvement over the prior best-performing method. In some cases, our Digi-Q approach already matches state-of-the-art RL methods that require interaction.
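To make the Best-of-N policy extraction step concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical `sample_action` callable standing in for the current policy and a hypothetical `q_value` callable standing in for the trained Q-function; the default `n=16` is likewise an arbitrary illustrative choice. It shows only the ranking idea stated above: sample several candidate actions, score each with the Q-function, and keep the highest-scoring one as an imitation target.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

State = Hashable
Action = Hashable


def best_of_n(
    state: State,
    sample_action: Callable[[State], Action],    # hypothetical: current policy sampler
    q_value: Callable[[State, Action], float],   # hypothetical: trained Q-function
    n: int = 16,                                 # illustrative value, not from the paper
) -> Tuple[Action, List[Tuple[Action, float]]]:
    """Sample n candidate actions and return the one the Q-function ranks highest.

    Also returns every (action, score) pair so the ranking can be inspected.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample_action(state) for _ in range(n)]
    scored = [(action, q_value(state, action)) for action in candidates]
    best_action, _ = max(scored, key=lambda pair: pair[1])
    return best_action, scored


def build_imitation_targets(
    states: Sequence[State],
    sample_action: Callable[[State], Action],
    q_value: Callable[[State, Action], float],
    n: int = 16,
) -> List[Tuple[State, Action]]:
    """Pair each logged state with its best-of-n action.

    The pairs can serve as supervised targets for the policy. Only logged
    states, the policy sampler, and the Q-function are used, so no
    environment interaction is needed.
    """
    return [(s, best_of_n(s, sample_action, q_value, n)[0]) for s in states]
```

In this sketch, policy improvement comes entirely from the ranking: the policy is trained to imitate whichever of its own samples the Q-function prefers, so the quality of the targets depends on how well the Q-function was fit to the offline data.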
The project is open-sourced at https://github.com/DigiRL-agent/digiq.

Recently, the community has been turning towards using reinforcement learning (RL) methods for training agentic policies. RL avoids the shortcomings of imitation and prompting by explicitly training the policy to solve tasks (Zhou et al., 2024b; Verma et al., 2022; Snell et al., 2023; Abdulhai et al., 2023). That said, the best-performing RL methods today for improving a policy in multi-step agentic tasks rely critically on interaction due to the use of policy gradient updates (Yao et al., 2023) coupled with Monte-Carlo values (Bai et al., 2024; Putta et al., 2024; Shao et al., 2024), which often require sufficient amounts of on-policy data to get a low-variance learning signal. The amount of on-policy data needed is likely even larger in non-stationary and dynamic environments (Bai et al., 2024). If, on the other hand, we could train a critic (i.e., an action-value function) that could score a policy's