WWW2026

Revisiting IPS-based Algorithms for Off-Policy Evaluation of Contextual Bandits

Daria Korovaitceva, Marina Sheshukova, Evgeny Frolov, Sergey Samsonov

Abstract

Off-policy evaluation (OPE) is widely used to compare contextual bandit policies in recommender systems. Although many recent methodological developments propose novel OPE schemes, they are typically validated in synthetic environments, which do not necessarily possess the structure of real-world datasets. In this paper, we consider the inverse propensity score (IPS) method and its modifications, and study how empirical conclusions inferred from the data depend on the evaluation pipeline. We show that, even in synthetic environments, the rankings of different estimators are sensitive to random seeds, log generators, and sample size. Using a popular benchmark, the Open Bandit Dataset, we analyze logging behavior and data characteristics that may violate the i.i.d. assumption on log generation.