CVPR 2025

VLMs-Guided Representation Distillation for Efficient Vision-Based Reinforcement Learning

Haoran Xu, Peixi Peng, Guang Tan, Yiqian Chang, Luntong Li, Yonghong Tian

Abstract

Vision-based Reinforcement Learning (VRL) aims to learn associations between visual inputs and optimal actions through interaction with the environment. Because visual data is high-dimensional and complex, learning a policy requires a high-quality state representation. To this end, existing VRL methods primarily rely on interaction-collected data combined with self-supervised auxiliary tasks. However, two key challenges remain: limited data samples and a lack of task-relevant semantic constraints. To tackle these challenges, we propose DGC, a method that Distills Guidance from Visual Language Models (VLMs), alongside self-supervised learning, into a Compact VRL agent. Notably, we leverage the state representation capabilities of VLMs rather than their decision-making abilities. Within DGC, a novel prompting-reasoning pipeline converts historical observations and actions into usable supervision signals, enabling semantic understanding within the compact visual encoder. By leveraging these distilled semantic representations, the VRL agent achieves significant improvements in sample efficiency. Extensive experiments on the CARLA benchmark demonstrate that DGC achieves state-of-the-art performance.
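
For illustration only, one common way to combine representation distillation with self-supervised learning is a weighted sum of loss terms. The Python sketch below is not the DGC objective, which the abstract does not specify. The cosine-based distillation term, the function names, and the weights `w_ssl` and `w_distill` are all assumptions introduced here. It assumes the VLM-derived supervision signal is available as one embedding vector per observation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon avoids division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-8)


def distillation_loss(student_feats, teacher_feats):
    """Mean (1 - cosine similarity) over a batch.

    student_feats: features from the compact visual encoder (hypothetical).
    teacher_feats: VLM-derived target embeddings (hypothetical).
    """
    losses = [
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_feats, teacher_feats)
    ]
    return sum(losses) / len(losses)


def total_loss(rl_loss, ssl_loss, distill, w_ssl=1.0, w_distill=1.0):
    """Weighted sum of RL, self-supervised, and distillation terms.

    The weights are illustrative hyperparameters, not values from the paper.
    """
    return rl_loss + w_ssl * ssl_loss + w_distill * distill


if __name__ == "__main__":
    student = [[1.0, 0.0], [0.0, 1.0]]
    teacher = [[1.0, 0.0], [1.0, 0.0]]
    d = distillation_loss(student, teacher)
    print(total_loss(rl_loss=1.0, ssl_loss=2.0, distill=d, w_ssl=0.5))
```

In a sketch of this kind, the teacher embeddings would be computed offline or at a lower frequency than policy updates, so that only the compact encoder runs at decision time. This is consistent with the abstract's emphasis on using VLMs for state representation rather than for decision-making.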