AAAI2025

Non-stochastic Budgeted Online Pricing with Semi-Bandit Feedback

Xiang Liu, Hau Chan, Minming Li, Weiwei Wu, Long Tran-Thanh

摘要

Online control with non-stochastic disturbances and adversarially chosen convex cost functions, referred to as online non-stochastic control, has recently attracted increasing attention. We study online non-stochastic control with partial feedback, where learners can only access partially observed states and partially informed (bandit) costs. The problem setting arises naturally in real-world decision-making applications and strictly generalizes exceptional cases studied disparately by previous works. We propose the first online algorithm for this problem, with an O(T 3 /4 ) regret competing with the best policy in hindsight, where T denotes the time horizon and the O(•)-notation omits the poly-logarithmic factors in T . To further enhance the algorithms' robustness to changing environments, we then design a novel method with a two-layer structure to optimize the dynamic regret, a more challenging measure that competes with time-varying policies. Our method is based on the online ensemble framework by treating the controller above as the base learner. On top of that, we design two different meta-combiners to simultaneously handle the unknown variation of environments and the memory issue arising from the online control. We prove that the two resulting algorithms enjoy O(T 3 /4 (1 + P T ) 1 /2 ) and O(T 3 /4 (1 + P T ) 1 /4 + T 5 /6 ) dynamic regret respectively, where P T measures the environmental non-stationarity. Our results are further extended to unknown transition matrices. Finally, empirical studies in both synthetic linear and simulated nonlinear tasks validate our method's effectiveness, thus supporting the theoretical findings.