KDD 2025

Evaluating Decision Rules Across Many Weak Experiments

Winston Chou, Colin Gray, Nathan Kallus, Aurélien Bibaut, Simon Ejdemyr

Cited by 1

Abstract

Technology firms conduct randomized controlled experiments ("A/B tests") to learn which actions to take to improve business outcomes. In firms with mature experimentation platforms, experimentation programs can consist of many thousands of tests. To effectively scale experimentation, firms rely on decision rules: standard operating procedures for mapping the results of an experiment to a choice of treatment arm to launch to the general user population. Despite the critical role of decision rules in translating experimentation into business decisions, rigorous guidance on how to evaluate and choose decision rules is scarce. This paper proposes to evaluate decision rules based on their cumulative returns to business north star metrics. Although intuitive and easy to explain to decision-makers, this quantity can be difficult to estimate, especially when experiments have weak signal-to-noise ratios. We develop a cross-validation estimator that is much less biased than the naive plug-in estimator under conditions realistic to digital experimentation. We demonstrate the efficacy of our approach via a case study of 123 historical A/B tests at Netflix, where we used it to show that a new decision rule would have increased cumulative returns to the north star metric by an estimated 33%, directly leading to the adoption of the new rule.

CCS Concepts: • Mathematics of computing → Probability and statistics; • Computing methodologies → Rule learning; Cross-validation.
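To illustrate why the naive plug-in estimator can mislead when signal-to-noise ratios are weak, the following is a minimal simulation sketch. It is not the estimator developed in the paper. It assumes a simple two-fold sample split, Gaussian true effects and noise, and a hypothetical decision rule (`launch_if_positive`) that launches the treatment whenever the estimated effect is positive. All function names, parameter values, and the data-generating process are assumptions made for illustration.

```python
import numpy as np


def simulate(n_experiments=2000, tau=0.1, sigma=1.0, seed=0):
    """Simulate weak experiments (assumed data-generating process).

    delta  : true treatment effects, drawn from N(0, tau^2)
    d1, d2 : effect estimates from two independent halves of each experiment
    d_full : full-sample estimate, the average of the two halves
    """
    rng = np.random.default_rng(seed)
    delta = rng.normal(0.0, tau, n_experiments)
    # Each half uses half the data, so its noise variance is 2 * sigma^2;
    # averaging the two halves gives a full-sample noise variance of sigma^2.
    half_sd = sigma * np.sqrt(2.0)
    d1 = delta + rng.normal(0.0, half_sd, n_experiments)
    d2 = delta + rng.normal(0.0, half_sd, n_experiments)
    d_full = 0.5 * (d1 + d2)
    return delta, d1, d2, d_full


def launch_if_positive(estimate):
    """Hypothetical decision rule: launch treatment if the estimate is > 0."""
    return estimate > 0


def evaluate(rule, delta, d1, d2, d_full):
    """Return (true, naive plug-in, cross-validated) cumulative returns."""
    launched = rule(d_full)
    # Ground truth, available only in simulation: sum of true effects launched.
    true_return = float(np.sum(delta[launched]))
    # Naive plug-in: the same noisy estimates both select and score the arm,
    # so positive noise is counted as return.
    naive = float(np.sum(d_full[launched]))
    # Two-fold cross-validation: select on one half, score on the other,
    # then average over the two possible fold assignments.
    cv = 0.5 * float(np.sum(d2[rule(d1)]) + np.sum(d1[rule(d2)]))
    return true_return, naive, cv


if __name__ == "__main__":
    delta, d1, d2, d_full = simulate()
    true_return, naive, cv = evaluate(launch_if_positive, delta, d1, d2, d_full)
    print(f"true cumulative return:   {true_return:.2f}")
    print(f"naive plug-in estimate:   {naive:.2f}")
    print(f"cross-validated estimate: {cv:.2f}")
```

In this sketch the naive estimate is inflated because the noise that triggers a launch is also counted as return, whereas the held-out half is independent of the selection given the true effect. The split estimate selects on half of the data, so it evaluates a somewhat noisier version of the rule than the one applied to the full sample; how a cross-validation estimator should account for this is not specified in the abstract.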