NeurIPS2020

Improved Sample Complexity for Incremental Autonomous Exploration in MDPs

Jean Tarbouriech, Matteo Pirotta, Michal Valko, Alessandro Lazaric

13 citations

Abstract

We investigate the exploration of an unknown environment when no reward function is provided. Building on the incremental exploration setting introduced by Lim and Auer [1], we define the objective of learning the set of ε-optimal goal-conditioned policies attaining all states that are incrementally reachable within L steps (in expectation) from a reference state s 0 . In this paper, we introduce a novel modelbased approach that interleaves discovering new states from s 0 and improving the accuracy of a model estimate that is used to compute goal-conditioned policies to reach newly discovered states. The resulting algorithm, DisCo, achieves a sample complexity scaling as O(L 5 S L+ε Γ L+ε A ε -2 ), where A is the number of actions, S L+ε is the number of states that are incrementally reachable from s 0 in L + ε steps, and Γ L+ε is the branching factor of the dynamics over such states. This improves over the algorithm proposed in [1] in both ε and L at the cost of an extra Γ L+ε factor, which is small in most environments of interest. Furthermore, DisCo is the first algorithm that can return an ε/c min -optimal policy for any cost-sensitive shortest-path problem defined on the L-reachable states with minimum cost c min . Finally, we report preliminary empirical results confirming our theoretical findings. While the approaches reviewed above effectively leverage deep RL techniques and are able to achieve impressive results in complex domains (e.g., Montezuma's Revenge [15] or real-world robotic manipulation tasks [19]), they often lack substantial theoretical understanding and guarantees. Recently, some unsupervised RL objectives were analyzed rigorously. Some of them quantify how well the agent visits the states under a sought-after frequency, e.g., to induce a maximally entropic state distribution [20, 21, 22, 23] . While such strategies provably mimic their desired behavior via a Frank-Wolfe algorithmic scheme, they may not learn how to effectively reach any state of the environment and thus may not be sufficient to efficiently solve downstream tasks. Another relevant take is the reward-free RL paradigm of [24] : following its exploration phase, the agent is able to 34th Conference on Neural Information Processing Systems (NeurIPS 2020),