NeurIPS2020
Adaptive Sampling for Stochastic Risk-Averse Learning
Sebastian Curi, Kfir Y. Levy, Stefanie Jegelka, Andreas Krause
被引用 61 次
摘要
In high-stakes machine learning applications, it is crucial to not only perform well on average, but also when restricted to difficult examples. To address this, we consider the problem of training models in a risk-averse manner. We propose an adaptive sampling algorithm for stochastically optimizing the Conditional Valueat-Risk (CVaR) of a loss distribution, which measures its performance on the α fraction of most difficult examples. We use a distributionally robust formulation of the CVaR to phrase the problem as a zero-sum game between two players, and solve it efficiently using regret minimization. Our approach relies on sampling from structured Determinantal Point Processes (DPPs), which enables scaling it to large data sets. Finally, we empirically demonstrate its effectiveness on large-scale convex and non-convex learning tasks. Risk Measures Risk aversion is a well-studied human behavior, in which agents assign more weight to adverse events than to positive ones (Pratt, 1978) . Approaches for modeling risk include using utility functions that emphasize larger losses (Rabin, 2013) ; prospect theory that re-scales the probability of events (Kahneman and Tversky, 2013); or direct optimization of coherent risk-measures (Artzner et al., 1999) . Rockafellar et al. (2000) introduce the CVaR as a particular instance of the latter class. The CVaR has found many applications, such as portfolio optimization (Krokhmal et al., 2002) or supply chain management (Carneiro et al., 2010) , as it does not rely on specific utility or weighing functions, which offers great flexibility. CVaR in Machine Learning The ν-SVM algorithm by Schölkopf et al. (2000) can be interpreted as optimizing the CVaR of the loss, as shown by Gotoh and Takeda (2016) . Also related, Shalev-Shwartz and Wexler (2016) propose to minimize the maximal loss among all samples. The maximal loss is the limiting case of the CVaR when α → 0. Fan et al. ( 2017 ) generalize this work to the top-k average loss. Although they do not mention the relationship to the CVaR, their learning criterion is equal to the CVaR for empirical measures. For optimization, they use an algorithm proposed by Ogryczak and Tamir (2003) to optimize the maximum of the sum of k functions; this algorithm is the same as the "truncated" algorithm of Rockafellar et al. (2000) to optimize the CVaR. Recent applications of the CVaR in ML include risk-averse bandits (Sani et al., 2012) , risk-averse reinforcement learning (Chow et al., 2017) , and fairness (Williamson and Menon, 2019) . All these use the original "truncated" formulation of Rockafellar et al. (2000) to optimize the CVaR. One of the major shortcomings of this formulation is that mini-batch gradient estimates have high variance. In this work, we address this via a method based on adaptive sampling, inspired by Shalev-Shwartz and Wexler ( 2016 ), that allows us to handle large datasets and complex (deep neural network) models. Distributionally Robust Optimization The CVaR also has a natural distributionally robust optimization (DRO) interpretation (Shapiro et al., 2009, Section 6.3), which we exploit in this paper. For example, Ahmadi-Javid ( 2012 ) introduces the entropic value-at-risk by considering a different DRO set. Duchi et al. (2016); Namkoong and Duchi (2017); Esfahani and Kuhn (2018); Kirschner et al. (2020) address related DRO problems, but with different uncertainty sets. We use the DRO formulation of the CVaR to phrase its optimization as a game. To solve the game, we propose an adaptive algorithm for the learning problem. Our algorithm is most related to (Namkoong and Duchi, 2016), who develop an algorithm for DRO sets induced by Cressie-Read f -divergences. Instead, we use a different DRO set that arises in common data sets (Mehrabi et al., 2019) and we provide an efficient algorithm to solve the DRO problem in large-scale datasets.