NeurIPS 2020

POMDPs in Continuous Time and Discrete Spaces

Bastian Alt, Matthias Schultheis, Heinz Koeppl

Cited by 9

Abstract

Many processes, such as discrete event systems in engineering or population dynamics in biology, evolve in discrete space and continuous time. We consider the problem of optimal decision making in such systems, with discrete state and action spaces, under partial observability. This places our work at the intersection of optimal filtering and optimal control. At the current state of research, a mathematical description for simultaneous decision making and filtering in continuous time with finite state and action spaces is still missing. In this paper, we give a mathematical description of a continuous-time partially observable Markov decision process (POMDP). By leveraging optimal filtering theory, we derive a Hamilton-Jacobi-Bellman (HJB) type equation that characterizes the optimal solution. Using techniques from deep learning, we approximately solve the resulting partial integro-differential equation. We present (i) an approach that solves the decision problem offline by learning an approximation of the value function and (ii) an online algorithm that provides a solution in belief space using deep reinforcement learning. We show the applicability on a set of toy examples, which pave the way for future methods providing solutions for high-dimensional problems.

Introduction

Continuous-time models have been studied extensively in machine learning and control. They are especially beneficial for reasoning about latent variables at time points that are not included in the data. In a broad range of topics, such as natural language processing [49], social media dynamics [31], or biology [18], to name just a few, the underlying process naturally evolves continuously in time. In many applications, the control of such time-continuous models is of interest. Numerous approaches already tackle the control problem for continuous state space systems; however, for many processes a discrete state space formulation is better suited.
This class of systems is discussed in the area of discrete event systems [10]. Decision making in these systems has a long history, yet if the state is not fully observed, acting optimally is notoriously hard. Many approaches resort to heuristics such as applying a separation principle between inference and control. Unfortunately, this can lead to weak performance, as the agent does not account for the effects of its decisions on future inference. In the past, this problem was also approached by using a discrete-time formulation such as a POMDP model [22]. Nevertheless, it is not always straightforward to discretize the problem, as doing so requires adding pseudo observations for time points without observations. Additionally, the time discretization can lead to problems when learning optimal controllers in the continuous-time setting [44]. A more principled way to approach this problem is to define the model in continuous time with a proper observation model and to solve the resulting model formulation. Still, it is not clear a priori how to design such a model, and even less how to control it optimally. In this paper, we provide a formulation of this problem by introducing a continuous-time analogue to the POMDP

34th Conference on Neural Information Processing Systems (NeurIPS 2020),