NeurIPS2023

Dynamic Regret of Adversarial Linear Mixture MDPs

Long-Fei Li, Peng Zhao, Zhi-Hua Zhou

被引用 6 次

摘要

We study reinforcement learning in episodic inhomogeneous MDPs with adversarial full-information rewards and the unknown transition kernel. We consider the linear mixture MDPs whose transition kernel is a linear mixture model and choose the dynamic regret as the performance measure. Denote by d the dimension of the feature mapping, H the length of each episode, K the number of episodes, P T the non-stationary measure, we propose a novel algorithm that enjoys an O √ d 2 H 3 K + H 4 (K + P T )(1 + P T ) dynamic regret under the condition that P T is known, which improves previously best-known dynamic regret for adversarial linear mixture MDP and adversarial tabular MDPs. We also establish an Ω √ d 2 H 3 K + HK(H + P T ) lower bound, indicating our algorithm is optimal in K and P T . Furthermore, when the non-stationary measure P T is unknown, we design an online ensemble algorithm with a meta-base structure, which is proved to achieve an T dynamic regret and here S T is the expected switching number of the best base-learner. The result can be optimal under certain regimes. Recent studies try to combine two lines of work to establish the theoretical foundation of adversarial MDPs with large state and action space. In particular, Cai et al. [32] study adversarial linear mixture 37th Conference on Neural Information Processing Systems (NeurIPS 2023).