ICLR2026

Best-of-three-worlds Analysis for Dueling Bandits with Borda Winner

Zirui Hu, Tingyu Zhang, Fang Kong

摘要

The dueling bandits (DB) problem addresses online learning from relative preferences, where the learner queries pairs of arms and receives binary win-loss feedback. Most existing work focuses on designing algorithms for specific stochastic or adversarial environments. Recently, a unified algorithm has been proposed that achieves convergence across all settings. However, this approach relies on the existence of a Condorcet winner, which is often not achievable, particularly when the preference matrix changes in the adversarial setting. Aiming for a more general Borda winner objective, there currently exists no unified framework that simultaneously achieves optimal regret across these environments. In this paper, we explore how the follow-the-regularized-leader (FTRL) algorithm can be employed to achieve this objective. We investigate a hybrid negative entropy regularizer and demonstrate that it enables us to achieve Õ(K 1/3 T 2/3 ) regret in the adversarial setting, O(K log 2 T /∆ 2 min ) regret in the stochastic setting, and O(K log 2 T /∆ 2 min + (C 2 K log 2 T /∆ 2 min ) 1/3 ) regret in the corrupted setting, where K is the arm set size, T is the horizon, ∆ min is the minimum gap between the optimal and sub-optimal arms, and C is the corruption level. These results align with the state-of-the-art in individual settings, while eliminating the need to assume a specific environment type. We also present experimental results demonstrating the advantages of our algorithm over baseline methods across different environments. RELATED WORK Research has mostly targeted specific settings-stochastic, adversarial, and corrupted stochasticalong with key winner types, such as the Condorcet winner (an arm that beats all others more than half the time) and the Borda winner (an arm with the highest average preference score). In stochastic settings with Condorcet winners, where preferences stay constant, algorithms like RUCB perform well under Condorcet winners by balancing exploration and exploitation (Zoghi et al., 2014) ; fur-