CCS2025

S2S-SED: A Speech-to-Speech Approach for Detection of Social Engineering Attacks in Audio Conversations

Leonardo Erlacher

Abstract

Voice-based social engineering attacks are becoming increasingly sophisticated, driven by advances in generative speech synthesis, neural voice cloning, and psychologically adaptive manipulation strategies. While large language models (LLMs) have demonstrated substantial capabilities in textual deception detection, their reliance on transcribed input omits essential prosodic and interactional cues, such as stress, urgency, and conversational dynamics, that are critical for identifying manipulative intent in spoken interactions. This constitutes a fundamental limitation of current LLM-based approaches to voice fraud detection. This doctoral research aims to investigate and develop S2S-SED, a novel speech-to-speech framework for the detection of social engineering attacks in audio conversations. Unlike transcription-dependent pipelines, the envisioned architecture processes raw audio input directly, encoding continuous speech into latent representations that implicitly preserve prosodic patterns, emotional tone, and semantic content. These embeddings are fed into a unified inference model capable of assessing conversational dynamics, such as stress, urgency, and turn-taking, without relying on intermediate text or auxiliary subsystems. Building upon recent advancements in audio language modeling, the project envisions the construction of a domain-specific, annotated dataset of voice-based social engineering scenarios. This dataset will serve as the foundation for training and evaluating an audio-native LLM tailored to deception detection. The research addresses a critical gap at the intersection of speech processing, AI-driven security, and human-machine communication, and aims to lay the groundwork for next-generation methods in voice threat mitigation.