ACL 2025

Does Your Voice Assistant Remember? Analyzing Conversational Context Recall and Utilization in Voice Interaction Models

Heeseung Kim, Che Hyun Lee, Sangkwon Park, Jiheum Yeom, Nohil Park, Sangwon Yu, Sungroh Yoon

Abstract

Recent advancements in multi-turn voice interaction models have improved user-model communication. However, while closed-source models effectively retain and recall past utterances, whether open-source models share this ability remains unexplored. To fill this gap, we systematically evaluate how well open-source interaction models utilize past utterances using ContextDialog, a benchmark we propose for this purpose. Our findings show that speech-based models have more difficulty than text-based ones, especially when recalling information conveyed in speech, and that even with retrieval-augmented generation, models still struggle with questions about past utterances. These insights highlight key limitations in open-source models and suggest ways to improve memory retention and retrieval robustness.¹

* Equal Contribution
† Corresponding Author
¹ ContextDialog: https://huggingface.co/datasets/ContextDialog/ContextDialog

[Figure: QA generation pipeline over a multi-round spoken dialog. Stage 1, Written-Form QA: GPT-4o with a generation prompt, then o1-mini with a validation prompt. Stage 2, Spoken QA: spoken question and answer generation with speaker-adaptive TTS, then automatic and human verification for pronunciation errors. Example dialog: "Hi! So do you like music? I love rock music and well many other genre's." / "Yes, I'm the same! I love music....". Example QA: "What genres of music did I say I liked?" / "You said you like rock music and many other genres!"]