ICLR2025

Rethinking Audio-Visual Adversarial Vulnerability from Temporal and Modality Perspectives

Zeliang Zhang, Susan Liang, Daiki Shimada, Chenliang Xu

Abstract

While audio-visual learning equips models with a richer understanding of the real world by leveraging multiple sensory modalities, this integration also introduces new vulnerabilities to adversarial attacks. In this paper, we present a comprehensive study of the adversarial robustness of audio-visual models, considering both temporal and modality-specific vulnerabilities. We propose two powerful adversarial attacks: 1) a temporal invariance attack that exploits the inherent temporal redundancy across consecutive time segments and 2) a modality misalignment attack that introduces incongruence between the audio and visual modalities. These attacks are designed to thoroughly assess the robustness of audio-visual models against diverse threats. Furthermore, to defend against such attacks, we introduce a novel audio-visual adversarial training framework. This framework addresses key challenges in vanilla adversarial training by incorporating efficient adversarial perturbation crafting tailored to multi-modal data and an adversarial curriculum strategy. Extensive experiments on the Kinetics-Sounds dataset demonstrate that our proposed temporal and modality-based attacks achieve state-of-the-art effectiveness in degrading model performance, while our adversarial training defense substantially improves both adversarial robustness and adversarial training efficiency.

* Equal contribution. Listed order is random.

Published as a conference paper at ICLR 2025

Yang et al., 2023a). However, current attack methods fail to exploit them, potentially limiting their effectiveness. Conversely, by leveraging these characteristics, we can craft more potent attacks and develop improved robust learning strategies specifically tailored for audio-visual models. In this work, we rethink the adversarial vulnerability of audio-visual models through the lenses of temporal and modality perspectives. We begin with an empirical analysis to assess the vulnerability of existing models.
Our case study experiments reveal several key findings, including the presence of adversarial transferability within the audio-visual domain, and the significant impact of temporal consistency and modality correlations on model robustness. Leveraging these insights, we propose two novel adversarial attacks tailored to the unique properties of multi-modal data: 1) the temporal invariance attack, which targets robust and temporally consistent audio-visual features by introducing inconsistencies across consecutive frames, and 2) the modality misalignment attack, which crafts adversarial examples by inducing incongruencies between the audio and visual streams.

To mitigate the vulnerabilities exposed by these dedicated attacks, we propose a novel audio-visual adversarial training framework that serves as a robust defense mechanism. Our framework addresses critical challenges in robust multi-modal learning by incorporating efficient adversarial perturbation crafting techniques along with an adversarial curriculum training strategy. The proposed defense aims to significantly improve the robustness of audio-visual models against adversarial attacks with minimal impact on training efficiency.

Our contributions can be summarized as follows:

1. We first identify the existence of adversarial transferability in audio-visual learning, and introduce two powerful adversarial attacks, namely the Temporal Invariance-based Attack (TIA) and the Modality Misalignment-based Attack (MMA), to evaluate the adversarial robustness of audio-visual models comprehensively.
2. We propose efficient adversarial perturbation crafting and adversarial curriculum training aimed at enhancing both the robustness and efficiency of audio-visual models.
3. We validate the effectiveness of both our proposed attacks and defense mechanisms through extensive experiments conducted on the widely-used Kinetics-Sounds dataset.
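The two attack ideas described above can be illustrated with a toy sketch. This is not the TIA or MMA formulation from the paper, whose losses are not given in this section. Everything here is an assumption made for illustration: the objectives `misalignment` and `temporal_inconsistency`, the helper `sign_ascent`, the use of raw feature vectors in place of model features, the finite-difference gradients, and all hyperparameters. The sketch only shows how a perturbation bounded in the L-infinity norm could be pushed toward cross-modal incongruence or toward inconsistency across consecutive segments.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def misalignment(audio, visual):
    """Hypothetical cross-modal objective: 1 - mean cosine of paired segments."""
    sims = [cosine(a, v) for a, v in zip(audio, visual)]
    return 1.0 - sum(sims) / len(sims)


def temporal_inconsistency(segments):
    """Hypothetical temporal objective: mean squared difference of consecutive segments."""
    diffs = [
        sum((a - b) ** 2 for a, b in zip(prev, nxt))
        for prev, nxt in zip(segments, segments[1:])
    ]
    return sum(diffs) / len(diffs)


def sign_ascent(x, objective, eps=0.1, step=0.02, iters=20, h=1e-4):
    """PGD-style sign ascent on `objective`, keeping every |delta| <= eps.

    Gradients are estimated by central finite differences so the sketch
    needs no autodiff library; a real attack would backpropagate instead.
    """
    delta = [[0.0] * len(row) for row in x]
    for _ in range(iters):
        adv = [[xi + di for xi, di in zip(xr, dr)] for xr, dr in zip(x, delta)]
        signs = []
        for row in adv:
            row_signs = []
            for j in range(len(row)):
                base = row[j]
                row[j] = base + h
                up = objective(adv)
                row[j] = base - h
                down = objective(adv)
                row[j] = base  # restore before probing the next coordinate
                row_signs.append((up > down) - (up < down))
            signs.append(row_signs)
        # Take one signed step per coordinate, then clip to the eps budget.
        for d_row, s_row in zip(delta, signs):
            for j, s in enumerate(s_row):
                d_row[j] = max(-eps, min(eps, d_row[j] + step * s))
    return [[xi + di for xi, di in zip(xr, dr)] for xr, dr in zip(x, delta)]


random.seed(0)
# Five temporal segments with 4-dimensional features; the audio stream is a
# noisy copy of the visual one, so the clean pair starts out well aligned.
visual = [[random.uniform(0.5, 1.0) for _ in range(4)] for _ in range(5)]
audio = [[v + random.uniform(-0.05, 0.05) for v in seg] for seg in visual]

# Modality view: perturb only the audio stream to pull it away from the visual one.
adv_audio = sign_ascent(audio, lambda a: misalignment(a, visual))
# Temporal view: perturb the visual stream to break consistency across segments.
adv_visual = sign_ascent(visual, temporal_inconsistency)

print("misalignment:", misalignment(audio, visual), "->", misalignment(adv_audio, visual))
print("temporal inconsistency:", temporal_inconsistency(visual), "->",
      temporal_inconsistency(adv_visual))
```

In a realistic setting the two objectives would presumably be computed on features from the audio-visual model under attack and combined with its classification loss, with gradients obtained by backpropagation. The signed step with clipping is used here only because it is a common way to respect an L-infinity budget.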
RELATED WORK

AUDIO-VISUAL LEARNING

The field of audio-visual learning encompasses a wide range of tasks, including audio-visual event recognition (