ICML 2025

Value-Based Deep RL Scales Predictably

Oleh Rybkin, Michal Nauman, Preston Fu, Charlie Victor Snell, Pieter Abbeel, Sergey Levine, Aviral Kumar

Abstract

Scaling data and compute is critical to the success of modern ML. However, scaling demands predictability: we want methods not only to perform well with more compute or data, but also to have their performance be predictable from small-scale runs, without running the large-scale experiment. In this paper, we show that value-based off-policy RL methods are predictable despite community lore regarding their pathological behavior. First, we show that the data and compute requirements to attain a given performance level lie on a Pareto frontier, controlled by the updates-to-data (UTD) ratio. By estimating this frontier, we can predict the data requirement when given more compute, and the compute requirement when given more data. Second, we determine the optimal allocation of a total resource budget across data and compute for a given performance level and use it to determine hyperparameters that maximize performance for a given budget. Third, this scaling is enabled by first estimating predictable relationships between hyperparameters, which are used to manage the effects of overfitting and plasticity loss unique to RL. We validate our approach using three algorithms (SAC, BRO, and PQL) on DeepMind Control, OpenAI Gym, and Isaac Gym, when extrapolating to higher levels of data, compute, budget, or performance.

[Figure 1 panels: (I) Compute-Data Pareto frontier; (II) Budget extrapolation; (III) Fits for multiple J. Panel labels: Isaac Gym, DMC, OpenAI Gym.]

Figure 1: Scaling properties when increasing compute C, data D, budget F, or performance J. Left: Pareto frontier of compute versus data requirements, controlled by the UTD ratio σ. We observe that we can trade off data for compute and vice versa, and that this relationship is predictable. Middle: Extrapolation from low to high performance. We observe that the optimal resource allocation controlled by σ evolves predictably with increasing budget, and can be used to extrapolate from low to high performance. Right: Pareto frontiers for several performance levels J.
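As a rough illustration of what "estimating the frontier from small-scale runs" could look like, the sketch below fits a curve to the data needed to reach a fixed performance level at a few small UTD ratios, then extrapolates to larger ones. Everything specific here is an assumption made for illustration: the pure power-law form D(σ) ≈ a · σ^(-b), the synthetic measurements, and the helper names `fit_power_law`, `predict_data`, and `predict_updates` do not come from the text above, which does not state a functional form. The only relation taken as given is the definition of the UTD ratio, namely that the number of gradient updates equals σ times the number of environment samples, which serves here as a proxy for compute at a fixed model size.

```python
import numpy as np


def fit_power_law(sigma, data_needed):
    """Fit data_needed ~ a * sigma**(-b) by least squares in log-log space.

    The power-law form is an illustrative assumption, not the paper's fit.
    """
    slope, intercept = np.polyfit(np.log(sigma), np.log(data_needed), deg=1)
    return float(np.exp(intercept)), float(-slope)


def predict_data(sigma, a, b):
    """Predicted environment samples needed at UTD ratio sigma."""
    return a * np.asarray(sigma, dtype=float) ** (-b)


def predict_updates(sigma, a, b):
    """Predicted gradient updates: UTD ratio times environment samples."""
    return np.asarray(sigma, dtype=float) * predict_data(sigma, a, b)


# Synthetic (made-up) measurements: samples needed to reach a fixed
# performance level J at three small UTD ratios.
small_sigma = np.array([1.0, 2.0, 4.0])
small_data = 1.0e6 * small_sigma ** (-0.5)

# Fit on the small-scale runs, then extrapolate to larger UTD ratios.
a_hat, b_hat = fit_power_law(small_sigma, small_data)
large_sigma = np.array([8.0, 16.0])
pred_data = predict_data(large_sigma, a_hat, b_hat)
pred_updates = predict_updates(large_sigma, a_hat, b_hat)

for s, d, u in zip(large_sigma, pred_data, pred_updates):
    print(f"sigma={s:g}: predicted samples={d:.3g}, predicted updates={u:.3g}")
```

Under this assumed form with 0 < b < 1, raising σ lowers the predicted data requirement while raising the number of updates, which is the data-for-compute trade-off that the Pareto frontier describes. Real measurements would be noisy, so the fit should be checked against held-out UTD ratios before trusting any extrapolation.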