ICLR 2026
Speech World Model: Causal State–Action Planning with Explicit Reasoning for Speech
Xuanru Zhou, Jiachen Lian, Henry Hong, Xinyi Yang, Gopala Anumanchipalli
Abstract
The man in the audio is speaking English and saying, "I know! Don't switch hands, okay?". It is difficult to determine the exact context of the conversation without further information.
The audio clip seems to be a humorous exchange between two people, possibly a teacher and a student, as indicated by the phrase "don't switch hands, okay?" This could be a playful instruction in a classroom setting, where the teacher is trying to keep the student focused on a task.
The audio is in a communication context that likely involves physical activity or interaction, such as passing objects, the speaker's fearful tone conveys a sense of urgency and insecurity. While on the surface their question, "don't switch hands okay?", simply asks for a 'yes' or 'no' answer, its underlying intent is to seek reassurance and a clear commitment from the other party. This implies a need for consistency and trust in the arrangement, as the speaker is looking for a definitive answer to resolve their fear and feel more secure in the situation.
Speech World Model (Ours)
It sounds like someone is giving instructions, possibly to someone else who's performing a task that requires focus, like a sport or a game. The speaker seems to be emphasizing not to switch hands, which might suggest they are guiding or coaching someone.