ICML2024

Exploration by Optimization with Hybrid Regularizers: Logarithmic Regret with Adversarial Robustness in Partial Monitoring

Taira Tsuchiya, Shinji Ito, Junya Honda

被引用 3 次

摘要

Partial monitoring is a generic framework of online decision-making problems with limited feedback. To make decisions from such limited feedback, it is necessary to find an appropriate distribution for exploration. Recently, a powerful approach for this purpose, exploration by optimization (ExO), was proposed, which achieves optimal bounds in adversarial environments with follow-the-regularized-leader for a wide range of online decision-making problems. However, a naive application of ExO in stochastic environments significantly degrades regret bounds. To resolve this issue in locally observable games, we first establish a new framework and analysis for ExO with a hybrid regularizer. This development allows us to significantly improve existing regret bounds of best-of-both-worlds (BOBW) algorithms, which achieves nearly optimal bounds both in stochastic and adversarial environments. In particular, we derive a stochastic regret bound of O( a =a * k 2 m 2 log T /∆ a ), where k, m, and T are the numbers of actions, observations and rounds, a * is an optimal action, and ∆ a is the suboptimality gap for action a. This bound is roughly Θ(k 2 log T ) times smaller than existing BOBW bounds. In addition, for globally observable games, we provide a new BOBW algorithm with the first O(log T ) stochastic bound. making problems with limited feedback (Rustichini, 1999; Piccolboni & Schindelhauer, 2001) . A PM game with kactions and d-outcomes, denoted by G = (L, Φ), is defined by a loss matrix L ∈ [0, 1] k×d and feedback matrix Φ ∈ Σ k×d , where Σ is a set of feedback symbols.