CCS2024

Blind and Low-Vision Individuals' Detection of Audio Deepfakes

Filipo Sharevski, Aziz Zeidieh, Jennifer Vander Loop, Peter Jachim

Cited by 5

Abstract

Audio deepfakes are a form of deception in which convincing speech is synthesized through machine learning to give the impression of a human speaker. Audio deepfakes are emerging as an attractive vector for targeting users who rely on audio accessibility, such as individuals who are blind or have low vision. The critical reliance on speech both as a medium and as an affordance puts this population at undue risk of being deceived, as they rely solely on themselves to detect whether a piece of audio is a deepfake. To better understand the nature of this risk, considering the nuanced reliance on assistive technologies such as screen readers, we conducted a user study with n=16 blind and low-vision individuals from the US. Our participants achieved an overall discernment accuracy of 59%, and clips identified as deepfakes were actually deepfakes in only 50.8% of the cases (precision). The participants who self-identified as "low vision" performed slightly better (accuracy of 61%, precision of 64%) than those who self-identified as "blind" (accuracy of 55%, precision of 56%). Our qualitative results show that the participants in the "blind" group mostly considered a combination of inflection, imperfections in the voice, and the intensity of the speech delivery as discernment factors. The participants in the "low vision" group mostly used the speaker's pitch, enunciation, emotion, fluency, and articulation as discernment cues. Overall, participants felt that audio deepfakes have the potential to deceive visually impaired individuals with political disinformation, impersonate their voice in authentication and smart home settings, and specifically target them with voice phishing and enhanced scams.
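
To make the two reported metrics concrete, the following is a minimal Python sketch of how discernment accuracy and precision are conventionally computed from per-clip judgments. The function name `accuracy_and_precision` and the six example clips are hypothetical and purely illustrative; they are not the study's data or analysis code.

```python
import math


def accuracy_and_precision(truth, judged):
    """Compute (accuracy, precision) for deepfake judgments.

    truth:  list of bool, True if the clip really is a deepfake
    judged: list of bool, True if the listener called it a deepfake
    """
    if not truth or len(truth) != len(judged):
        raise ValueError("truth and judged must be non-empty and equal in length")

    pairs = list(zip(truth, judged))
    correct = sum(1 for t, j in pairs if t == j)            # right either way
    true_pos = sum(1 for t, j in pairs if t and j)          # deepfake, called deepfake
    false_pos = sum(1 for t, j in pairs if not t and j)     # real, called deepfake

    accuracy = correct / len(pairs)
    flagged = true_pos + false_pos
    # Precision is undefined when nothing was flagged as a deepfake.
    precision = true_pos / flagged if flagged else math.nan
    return accuracy, precision


# Hypothetical judgments for six clips (illustrative only, not study data).
truth = [True, True, False, False, True, False]
judged = [True, False, True, False, True, True]
accuracy, precision = accuracy_and_precision(truth, judged)
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}")
```

Accuracy counts every correct judgment, whether the clip was real or fake, while precision looks only at the clips a listener flagged as deepfakes and asks what fraction of those flags were right.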