ACL2022

Probing the Robustness of Trained Metrics for Conversational Dialogue Systems

Jan Deriu, Don Tuggener, Pius von Däniken, Mark Cieliebak

Abstract

This paper introduces an adversarial method to stress-test trained metrics for the evaluation of conversational dialogue systems. The method leverages Reinforcement Learning to find response strategies that elicit optimal scores from the trained metrics. We apply our method to test recently proposed trained metrics. We find that they are all susceptible to giving high scores to responses generated by the rather simple and obviously flawed strategies that our method converges on. For instance, simply copying parts of the conversation context to form a response yields competitive scores or even outperforms responses written by humans.

Introduction

One major issue in developing conversational dialogue systems is the large effort required for evaluation. This hinders rapid development in the field because frequent evaluations are either impossible or very expensive. The goal is to create automated evaluation methods that increase efficiency. Unfortunately, methods such as BLEU (Papineni et al., 2002) have been shown not to be applicable to conversational dialogue systems (Liu et al., 2016).

Following this observation, a trend towards trained metrics for evaluating dialogue systems has emerged in recent years (Lowe et al., 2017; Deriu and Cieliebak, 2019; Mehri and Eskenazi, 2020; Deriu et al., 2020). The models are trained to take as input a pair of context and candidate response, and to output a numerical score that rates the candidate for the given context. These systems achieve high correlations with human judgments, which is very promising.

Unfortunately, these systems have been shown to suffer from instabilities. Sai et al. (2019) showed that small perturbations to the candidate response already confuse the trained metric. In this work, we go one step further: we propose a method that automatically finds strategies that elicit very high scores from the trained metric while being of obviously low quality.
Our method can be applied to automatically test the robustness of trained metrics against adversarial strategies that exploit certain weaknesses of the trained metric.
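
To make the setup concrete, the following minimal sketch illustrates the interface described above: a metric that maps a (context, response) pair to a numerical score, probed with a degenerate copy strategy. Everything in it is an assumption made for illustration. The names `toy_overlap_metric`, `copy_strategy`, and `probe` are hypothetical, the scorer is a simple word-overlap stand-in rather than any of the trained metrics cited here, and the copy strategy is hand-written rather than found by Reinforcement Learning.

```python
from typing import Callable, Dict, List

# Assumed interface: a metric maps (context turns, candidate response) to a score.
Metric = Callable[[List[str], str], float]


def toy_overlap_metric(context: List[str], response: str) -> float:
    """Hypothetical stand-in scorer, NOT a real trained metric.

    Returns the fraction of response tokens that also occur in the context.
    """
    context_tokens = {tok for turn in context for tok in turn.lower().split()}
    response_tokens = response.lower().split()
    if not response_tokens:
        return 0.0
    hits = sum(tok in context_tokens for tok in response_tokens)
    return hits / len(response_tokens)


def copy_strategy(context: List[str], n_tokens: int = 5) -> str:
    """Degenerate strategy: repeat the first n_tokens of the last context turn."""
    return " ".join(context[-1].split()[:n_tokens])


def probe(metric: Metric, context: List[str], human_response: str) -> Dict[str, float]:
    """Score a human-written response and the copied response with the same metric."""
    return {
        "human": metric(context, human_response),
        "copy": metric(context, copy_strategy(context)),
    }


# Invented example dialogue, for illustration only.
context = [
    "Do you have any plans for the weekend?",
    "I might go hiking if the weather is nice.",
]
human = "That sounds fun, which trail are you thinking of?"
scores = probe(toy_overlap_metric, context, human)
print(scores)
```

Under this toy scorer, the copied response outscores the human-written one even though it is clearly not a sensible reply. This kind of gap between score and quality is what a robustness probe looks for; whether a given trained metric shows it has to be tested on that metric itself.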