ACL2024
Evaluating Intention Detection Capability of Large Language Models in Persuasive Dialogues
Hiromasa Sakurai, Yusuke Miyao
Abstract
We investigate intention detection in persuasive multi-turn dialogs employing the largest available Large Language Models (LLMs). Much of the prior research measures the intention detection capability of machine learning models without considering the conversational history. To evaluate LLMs' intention detection capability in conversation, we modified the existing datasets of persuasive conversation and created datasets using a multiple-choice paradigm. It is crucial to consider others' perspectives through their utterances when engaging in a persuasive conversation, especially when making a request or reply that is inconvenient for others. This feature makes the persuasive dialogue suitable for the dataset of measuring intention detection capability. We incorporate the concept of 'face acts,' which categorize how utterances affect mental states. This approach enables us to measure intention detection capability by focusing on crucial intentions and to conduct comprehensible analysis according to intention types.

1 Introduction
Identifying the speaker's intention is crucial for maintaining a smooth conversation. Suppose a situation where Alice asks Bob for a donation to a specific charity, and Bob responds with an evasive answer such as 'Well, you know....' In this situation, we can assume that Bob is unwilling to donate, but since refusing the donation is psychologically burdensome, he wants Alice to sense his hesitation. The speaker's intentions can be conveyed without saying them out loud, and they also vary depending on the context of the conversation. We engage in conversations while estimating the speaker's intentions unconsciously, and this ability is essential for facilitating natural communication.
In recent years, there has been remarkable progress in developing LLMs such as ChatGPT or [...]
Dutt et al. (2020) incorporate the concept of face acts for analyzing dialogues in persuasive situations, where maintaining good relationships is particularly important. They identified face acts as factors influencing the success of persuasion, and they developed a machine learning model to track the conversation's dynamics, employing face acts and conversation histories. They divided face acts into eight categories based on the following three criteria.
• whether it is directed toward the speaker or the hearer (s/h)
• whether it is directed toward a positive or negative face (pos/neg)
[...] utterances, evaluating the model's intention detection capability. They did not employ LLMs, and how well LLMs can detect the intention of utterances from multi-turn persuasive dialogue is yet to be revealed.

3 Data
As mentioned in the previous section, prior studies on intention detection mostly did not use multi-turn dialogue data. A possible approach to evaluating intention detection capability is to take the persuasive dialogue dataset created by Dutt et al. (2020) and directly predict face acts from utterances. However, face acts are abstract intentions and not a well-known concept, so they are non-intuitive for humans to handle. They are also unlikely to be sufficiently acquired by LLMs through in-context learning, as face acts should be infrequent in the text data used for pretraining. Thus, instead of employing the face act prediction task as is, it is necessary to modify the task into a format applicable in zero-shot or few-shot scenarios in order to evaluate LLMs' intention detection capability.
We modify the persuasive dialogue data from Dutt et al. (2020) and create a dataset for evaluating intention detection capability. Instead of directly predicting face acts, we transform face acts into intention descriptions written in natural language to make the task comprehensible. An entry in our dataset is shown in Figure 1. The input of this task consists of the conversational history and four intention descriptions for the last utterance in the conversation. The output is one description out of the four options. This format is a reading comprehension style inspired by several previous dialogue reasoning studies (Cui et al., 2020; Huang [...]
[...] each utterance. We took a majority vote over the three descriptions and assigned a gold label if more than one worker chose the same intention description. We let workers annotate 691 utterances in total, and among them, 620 utterances had agreement from at least two of the three workers. In the following process, we create an intention classification problem for these 620 utterances. To assess the level of agreement among annotators, we calculated Krippendorff's alpha (Krippendorf