NeurIPS2021

Reinforcement Learning with State Observation Costs in Action-Contingent Noiselessly Observable Markov Decision Processes

Hyunji Alex Nam, Scott L. Fleming, Emma Brunskill

被引用 30 次

摘要

Many real-world problems that require making optimal sequences of decisions under uncertainty involve costs when the agent wishes to obtain information about its environment. We design and analyze algorithms for reinforcement learning (RL) in Action-Contingent Noiselessly Observable MDPs (ACNO-MDPs), a special class of POMDPs in which the agent can choose to either (1) fully observe the state at a cost and then act; or (2) act without any immediate observation information, relying on past observations to infer the underlying state. ACNO-MDPs arise frequently in important real-world application domains like healthcare, in which clinicians must balance the value of information gleaned from medical tests (e.g., blood-based biomarkers) with the costs of gathering that information (e.g., the costs of labor and materials required to administer such tests). We develop a Probably Approximately Correct (PAC) RL algorithm for tabular ACNO-MDPs that provides substantially tighter bounds compared to generic POMDP-RL algorithms, on the total number of episodes exhibiting worse than near-optimal performance. For continuous-state ACNO-MDPs, we propose a novel method of incorporating observation information that, when coupled with modern RL algorithms, yields significantly faster learning compared to other POMDP-RL algorithms in several simulated environments. *Equal contribution 35th Conference on Neural Information Processing Systems (NeurIPS 2021). ments in which state information at each time step can be noisy or missing are more appropriately modeled as partially observable Markov decision processes (POMDPs) rather than MDPs [29] . In this paper, we provide theory and algorithms for a special class of POMDPs in which state information is complete/noiseless when observed, but may be missing at any given time step if the agent chooses not to observe the state. We call this class of POMDPs Action-Contingent Noiselessly Observable MDPs (ACNO-MDPs), which can be useful for capturing a number of important realworld settings, such as: ACNO-MDPs in healthcare. Clinicians in Intensive Care Units (ICUs) frequently have to make sequences of treatment decisions under uncertainty for patients at risk. While accurate laboratory tests can inform such decisions, administration of these tests carry a significant cost for the patient and the health system [25] . These costs, together with the fact that frequent testing may be redundant and wasteful [12, 78] , appropriately lead clinicians to refrain from constantly observing the patient state (i.e., ordering laboratory tests) [1, 17] . Similarly, other settings -such as glucose monitoring to assist in insulin dosing recommendations [18] , or white blood cell count monitoring to assist in anti-HIV drug dosing [15] -have modern tools for accurate observations of biomarkers and could be appropriately modeled by our ACNO-MDP framework. ACNO-MDPs for user-adaptive experiences. Applications on mobile phones and other personal devices can collect information on the user's status, such as location, motion, inter-user contact, and background noise, in order to adaptively suggest features that enhance the user's experience [69, 70] . While these sensors can in theory always be kept active, battery consumption due to constant sensing would make such applications less desirable to the user. Modeling this problem as an ACNO-MDP could enable a policy that balances battery usage with personalizing the user's experience. 1 Our contributions. Our main contributions are: • Proposing a Probably Approximately Correct "Observe-then-Plan" algorithm for tabular ACNO-MDPs that fully observes while exploring, then employs POMDP planning using learned models. The resulting policy can select when to observe to achieve high expected reward in environments with state observations that are costly but optional.