WWW2026

Breaking Cross-modal Alignment in Embodied Intelligence: A Multimodal Adversarial Attack Framework for Vision-Language-Action Models

Zhihui Zhao, Xiaorong Dong, Yaowen Zheng, Xiaohui Chen, Yimo Ren, Hangbei Cheng, Yongle Chen, Limin Sun

Abstract

Vision–Language–Action (VLA) models underpin robotic and other embodied agents by mapping visual observations and language instructions to executable actions. Their wide distribution through open web model repositories, however, introduces new supply-chain risks: adversaries can craft attacks that manipulate the action outputs of VLAs, potentially leading to harmful real-world outcomes for embodied agents. To expose this vulnerability, we propose MAVLA, a novel multimodal adversarial attack framework. MAVLA serves as a modular front-end that integrates with a target VLA model and injects perturbations into task-relevant, structure-sensitive image regions to disrupt cross-modal alignment and induce deviations in the generated action instructions. To balance attack effectiveness with stealth, we design four loss functions that jointly maximize multimodal misalignment while keeping the perturbations visually inconspicuous. Extensive evaluations in simulated and real-world scenarios show that at a 40% perturbation ratio, the task success rate of VLAs drops by about 70%. Compared with conventional attack baselines, MAVLA achieves stronger attack effectiveness and better stealth at low overhead. Our work reveals a practical and previously underexplored threat to embodied systems and offers a red-team baseline to inform future defensive strategies and promote safer VLA deployment.