WWW2023
A Multi-view Meta-learning Approach for Multi-modal Response Generation
Zhiliang Tian, Zheng Xie, Fuqiang Lin, Yiping Song
被引用 7 次
摘要
As massive conversation examples are easily accessible on the Internet, we are now able to organize large-scale conversation corpora to build chatbots in a data-driven manner. Multi-modal social chatbots produce conversational utterances according to both textual utterances and vision signals. Due to the difficulty of bridging different modalities, the dialogue generation model of chatbots falls into local minima that only capture the mapping between textual input and textual output, as a result, it almost ignores the non-textual signals. Further, similar to the dialogue model with plain text as input and output, the generated responses from multi-modal dialogue also lack diversity and informativeness. In this paper, to address the above issues, we propose a Multi-View Meta-Learning (MultiVML) algorithm that groups samples in multiple views and customizes generation models to different groups. We employ a multi-view clustering to group the training samples so as to attend more to the unique information in non-textual modality. Tailoring different sets of model parameters for each group boosts the genereation diversity via meta-learning. We evaluate MultiVML on two variants of the OpenViDial benchmark datasets. The experiments show that our model not only better explore the information from multiple modalities, but also excels baselines in both quality and diversity.