NeurIPS2025

From Contextual Combinatorial Semi-Bandits to Bandit List Classification: Improved Sample Complexity with Sparse Rewards

Liad Erez, Tomer Koren

2 citations

Abstract

We study the problem of contextual combinatorial semi-bandits, where input contexts are mapped into subsets of size $m$ of a collection of $K$ possible actions. In each round, the learner observes the realized reward of the predicted actions. Motivated by prototypical applications of contextual bandits, we focus on the $s$ -sparse regime where we assume that the sum of rewards is bounded by some value $s\ll K$ . For example, in recommendation systems the number of products purchased by any customer is significantly smaller than the total number of available products. Our main result is for the $(\epsilon,\delta)$ -PAC variant of the problem for which we design an algorithm that returns an $\epsilon$ -optimal policy with high probability using a sample complexity of $\tilde{O}((poly(K/m)+sm/\epsilon^2) \log(|\Pi|/\delta))$ where $\Pi$ is the underlying (finite) class and $s$ is the sparsity parameter. This bound improves upon known bounds for combinatorial semi-bandits whenever $s\ll K$ , and in the regime where $s=O(1)$ , the leading term is independent of $K$ . Our algorithm is also computationally efficient given access to an ERM oracle for $\Pi$ . Our framework generalizes the list multiclass classification problem with bandit feedback, which can be seen as a special case with binary reward vectors. In the special case of single-label classification corresponding to $s=m=1$ , we prove an $O((K^7+1/\epsilon^2)\log(|H|/\delta))$ sample complexity bound, which improves upon recent results in this scenario. Additionally, we consider the regret minimization setting where data can be generated adversarially, and establish a regret bound of $\tilde O(|\Pi|+\sqrt{smT\log |\Pi|})$ , extending the result of Erez et al. (2024) who consider the simpler single label classification setting.