ACL 2022

Simulating Bandit Learning from User Feedback for Extractive Question Answering

Ge Gao, Eunsol Choi, Yoav Artzi

Abstract

We study learning from user feedback for extractive question answering by simulating feedback using supervised data. We cast the problem as contextual bandit learning, and analyze the characteristics of several learning scenarios with focus on reducing data annotation. We show that systems initially trained on few examples can dramatically improve given feedback from users on model-predicted answers, and that one can use existing datasets to deploy systems in new domains without any annotation effort, but instead improving the system on-the-fly via user feedback.

[...] simple binary feedback, and creates a contextual bandit learning scenario (Auer et al., 2002; Langford and Zhang, 2007). Figure 1 illustrates this learning signal and its potential.

We simulate user feedback using several widely used QA datasets, and use it as a bandit signal for learning. We study the empirical characteristics of the learning process, including its performance, sensitivity to initial system performance, and trade-offs between online and offline learning. We also simulate zero-annotation domain adaptation, where we deploy a QA system trained from supervised data in one domain and adapt it solely from user feedback in a new domain.

[...] $c_i, \ldots, c_j$ where $i, j \in [1, n]$ and $i \le j$ in the context $c$ as an answer. When relevant, we denote $\pi_\theta$ as a QA model parameterized by $\theta$.

We formalize learning as a contextual bandit process: at each time step $t$, the model is given a question-context pair $(q^{(t)}, c^{(t)})$, predicts an answer span $\hat{y}$, and receives a reward $r^{(t)} \in \mathbb{R}$. The learner's goal is to maximize the total reward $\sum_{t=1}^{T} r^{(t)}$. This formulation reflects a setup where, given a question-context pair, the QA system interacts with users, who validate the model-predicted answer in context, and provide feedback which is mapped to a numerical reward.
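The interaction loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `Example`, `simulated_reward`, `random_policy`, and `run_simulation` are hypothetical. The reward values (1.0 for a span that exactly matches the gold annotation, 0.0 otherwise) and the exact-match criterion are assumptions, since the text here only states that feedback is binary and mapped to a numerical reward. The policy is a random placeholder standing in for a real QA model $\pi_\theta$.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (i, j) token indices, 0-indexed and inclusive, with i <= j


@dataclass
class Example:
    question: str
    context: List[str]   # context tokens
    gold_span: Span      # supervised annotation, used only to simulate feedback


def simulated_reward(predicted: Span, gold: Span) -> float:
    """Simulate binary user feedback from supervised data.

    Assumption: reward is 1.0 on an exact span match and 0.0 otherwise.
    """
    return 1.0 if predicted == gold else 0.0


def random_policy(question: str, context: List[str], rng: random.Random) -> Span:
    """Hypothetical stand-in for the QA model: returns some valid span at random."""
    i = rng.randrange(len(context))
    j = rng.randrange(i, len(context))
    return (i, j)


def run_simulation(
    examples: List[Example],
    policy: Callable[[str, List[str], random.Random], Span],
    seed: int = 0,
) -> Tuple[float, List[Tuple[int, Span, float]]]:
    """Run T bandit steps; return the total reward and a per-step log."""
    rng = random.Random(seed)
    total = 0.0
    log = []
    for t, ex in enumerate(examples, start=1):
        span = policy(ex.question, ex.context, rng)  # model predicts an answer span
        r = simulated_reward(span, ex.gold_span)     # only the reward is observed
        log.append((t, span, r))
        total += r
    return total, log
```

Note that the learner never sees `gold_span` directly: it only observes the reward for the single span it predicted, which is what distinguishes the bandit setting from supervised learning.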
Learning Algorithm. We learn using policy gradient. Our learner is similar to REINFORCE (Sutton and Barto, 1998; Williams, 2004), but we use arg max to predict answers instead of Monte Carlo sampling from the model's output distribution.

We study online and offline learning, also referred to as on- and off-policy. In online learning (Algorithm 1), the model identity is maintained between prediction and update; the parameter values that are updated are the same that were used to generate the output receiving reward. This entails that a reward is only used once, to update the model after observing it. In offline learning (Algorithm 2), this relation between update and prediction does not hold. The learner observes reward, often across many examples, and may use it to update the model many times, even after the parameters drifted arbitrarily far from those that generated the prediction. In practice, we observe reward for the entire length of the simulation ($T$ steps) and then update for $E$ epochs. The reward is re-weighted to provide an unbiased estimation using inverse propensity score (IPS; Horvitz and Thompson, 1952). We clip the debiasing coefficient to avoid amplifying examples with large coefficients (line 10, Algorithm 2).

In general, offline learning is easier to implement because updating the model is not integrated with its deployment. Offline learning also uses a training loop that is similar to optimization practices in supervised learning. This allows iterating over the data multiple times, albeit with the same feedback