EMNLP 2025

Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models

Hao Yang, Lizhen Qu, Ehsan Shareghi, Gholamreza Haffari

Abstract

Large Audio Language Models (LALMs) have extended the capabilities of Large Language Models (LLMs) by enabling audio-based human interactions. However, recent research has revealed that LALMs remain vulnerable to harmful queries due to insufficient safety alignment. Despite advances in defence measures for text and vision LLMs, effective safety-alignment strategies and audio-safety datasets specifically targeting LALMs are notably absent. Meanwhile, defence measures based on Supervised Fine-tuning (SFT) struggle to improve safety while avoiding over-rejection, which significantly compromises helpfulness. In this work, we propose an unsupervised safety fine-tuning strategy as a remedy that reshapes the model's representation space to enhance the safety alignment of existing LALMs while balancing the risk of over-rejection. Our experiments, conducted across three generations of Qwen LALMs, demonstrate that our approach significantly improves the safety of LALMs under three modality input conditions (audio-text, text-only, and audio-only) while increasing the over-rejection rate by only 0.88% on average.

Warning: this paper contains harmful examples.