ICLR 2025

Model-based RL as a Minimalist Approach to Horizon-Free and Second-Order Bounds

Zhiyong Wang, Dongruo Zhou, John C. S. Lui, Wen Sun

Abstract

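We consider a finite-horizon, time-homogeneous MDP $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, H, P^\star, r, s_0\}$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, $H \in \mathbb{N}^+$ is the horizon of each episode, $P^\star : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the unknown ground-truth transition, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the known reward signal, and $s_0$ is the fixed initial state.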
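To make the tuple concrete, the following is a minimal sketch of a tabular instance of such an MDP. It assumes finite state and action spaces indexed by integers, which the setup above does not require; the names `TabularMDP`, `rollout`, and `make_example_mdp`, as well as the numerical values, are hypothetical and chosen only for illustration. Time-homogeneity shows up as a single transition array `P` that is reused at every step $h$.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class TabularMDP:
    """Illustrative finite-horizon, time-homogeneous MDP (tabular assumption)."""

    P: np.ndarray  # shape (S, A, S); P[s, a] is a distribution over next states
    r: np.ndarray  # shape (S, A); known reward r(s, a)
    H: int         # horizon of each episode
    s0: int = 0    # fixed initial state

    def rollout(self, policy, rng):
        """Sample one episode and return its total reward.

        `policy` is any callable mapping (step h, state s) to an action index.
        """
        s, total = self.s0, 0.0
        for h in range(self.H):
            a = policy(h, s)
            total += float(self.r[s, a])
            # The same transition array is used at every step h.
            s = int(rng.choice(self.P.shape[2], p=self.P[s, a]))
        return total


def make_example_mdp():
    """Build a hypothetical 2-state, 2-action MDP with made-up numbers."""
    P = np.array([
        [[0.9, 0.1], [0.2, 0.8]],
        [[0.5, 0.5], [0.0, 1.0]],
    ])
    r = np.array([[0.0, 0.1], [0.2, 0.0]])
    return TabularMDP(P=P, r=r, H=5, s0=0)
```

For example, `make_example_mdp().rollout(lambda h, s: 0, np.random.default_rng(0))` samples one episode under the policy that always plays action 0. In the learning problem itself, `P` plays the role of $P^\star$ and would be hidden from the learner, while `r` is available to it.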