EMNLP 2025

MULTIVOX: A Benchmark for Evaluating Voice Assistants for Multimodal Interactions

Ramaneswaran Selvakumar, Ashish Seth, Nishit Anand, Utkarsh Tyagi, Sonal Kumar, Sreyan Ghosh, Dinesh Manocha

Abstract

The rapid progress of Large Language Models (LLMs) has empowered omni models to act as voice assistants capable of understanding spoken dialogues. These models can process multimodal inputs beyond text, such as speech and visual data, enabling more context-aware interactions. However, current benchmarks fall short in comprehensively evaluating how well these models generate context-aware responses, particularly when it comes to implicitly understanding fine-grained speech characteristics, such as pitch, emotion, timbre, and volume, or the environmental acoustic context, such as background sounds. Additionally, they inadequately assess the ability of models to align paralinguistic cues with complementary visual signals to inform their responses. To address these gaps, we introduce MULTIVOX, the first omni voice assistant benchmark designed to evaluate the ability of voice assistants to integrate spoken and visual cues, including paralinguistic speech features, for truly multimodal understanding. Specifically, MULTIVOX includes 1000 human-annotated and recorded speech dialogues that encompass diverse paralinguistic features and a range of visual cues such as images and videos. Our evaluation on 10 state-of-the-art models reveals that, although humans excel at these tasks, current open-source models consistently struggle to produce contextually grounded responses.

1 General Intelligence (AGI) (Bubeck et al., 2023; Morris et al., 2024). While OLMs provide a wide range of applications (Xu et al., 2025), one of their primary use cases is developing omni-modal voice assistants (OVA) (Huang et al., 2024). Unlike traditional voice assistants that rely solely on speech instructions, OVAs powered by OLMs, such as GPT-4o (OpenAI, 2024) and Qwen2.5 Omni (Xu et al., 2025), can understand speech dialogues and reason over multimodal inputs, including images and videos.
Advancing the application of OLMs in voice assistants poses challenges not only in model development but also in constructing effective evaluation benchmarks. While existing OLM benchmarks like OmniBench (Li et al., 2024) incorpo-