ICML 2020

Designing Optimal Dynamic Treatment Regimes: A Causal Reinforcement Learning Approach

Junzhe Zhang

Cited 78 times

Abstract

A dynamic treatment regime (DTR) consists of a sequence of decision rules, one per stage of intervention, that dictates how to assign treatments to patients based on the evolving history of treatments and covariates. These regimes are particularly effective for managing chronic disorders and are arguably one of the critical ingredients underlying more personalized decision-making systems. All reinforcement learning algorithms for finding the optimal DTR in online settings will suffer $\Omega(\sqrt{|\mathcal{D}_{X \cup S}| T})$ regret on some environments, where $T$ is the number of experiments and $\mathcal{D}_{X \cup S}$ is the joint domain of the treatments $X$ and covariates $S$. This implies that $T = \Omega(|\mathcal{D}_{X \cup S}|)$ trials will be required to generate an optimal DTR. In many applications, the domains of $X$ and $S$ could be enormous, which means that the time required to ensure appropriate learning may be unattainable. We show that, if the causal diagram of the underlying environment is provided, one could achieve regret that is exponentially smaller than $|\mathcal{D}_{X \cup S}|$. In particular, we develop two online algorithms that satisfy such regret bounds by exploiting the causal structure underlying the DTR: one is based on the principle of optimism in the face of uncertainty (OFU-DTR), and the other uses posterior sampling (PS-DTR). Finally, we introduce efficient methods to accelerate these online learning procedures by leveraging the abundant, yet biased, observational (non-experimental) data.
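To make the scaling argument concrete, the following minimal sketch computes the size of a joint domain and the order of the worst-case regret lower bound. It reads the bound as growing like the square root of the domain size times T, which is an assumption about the intended formula. The function names, the constant `c`, and the example of 12 binary variables are hypothetical and do not come from the paper.

```python
import math


def joint_domain_size(domain_sizes):
    """Size of the joint domain: the product of the per-variable domain sizes."""
    return math.prod(domain_sizes)


def regret_lower_bound(domain_size, num_trials, c=1.0):
    """Order of the worst-case bound, c * sqrt(|D| * T).

    The constant c is hypothetical; the bound is only stated up to constants.
    """
    return c * math.sqrt(domain_size * num_trials)


# Hypothetical environment: 12 binary treatment and covariate variables in total.
d = joint_domain_size([2] * 12)

# Per-trial regret, sqrt(|D| / T) up to constants, only starts to shrink
# once the number of trials T is on the order of the domain size |D|.
for t in (d // 4, d, 4 * d):
    print(t, regret_lower_bound(d, t) / t)
```

Because the joint domain grows multiplicatively with each added variable, the number of trials needed under this bound grows exponentially in the number of treatments and covariates, which is the gap that exploiting the causal diagram is meant to narrow.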