CVPR 2023

DeAR: Debiasing Vision-Language Models with Additive Residuals

Ashish Seth, Mayur Hemani, Chirag Agarwal

Abstract

[Figure 1 panels: "Bias in VLMs": the CLIP image encoder and CLIP text encoder are compared by cosine similarity for the text "photo of a doctor" (values shown in the figure: 0.237, 0.241, 0.239, 0.244). (A) CLIP Attribution Maps for "Doctor" and "Nurse", before and after debiasing. (B) Zero-shot Object Detection with CLIP for the query "Doctor", without debiasing and with DeAR.]

Figure 1. We present DeAR, a framework to debias large vision-language models (VLMs) such as CLIP [45]. The bias is exhibited in the skewed similarity between specific language concepts and images of people with certain visual characteristics. (A) Attribution maps from the DeAR-augmented CLIP model indicate how the attribution for a text concept shifts from the person's facial characteristics to objective cues in the image. (B) Results for zero-shot object detection with CLIP-ODS [48], which uses CLIP, before and after debiasing show a clear improvement in the fairness of its detection results.
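The "skewed similarity" referred to in the caption can be illustrated with a small sketch. This is not the DeAR method itself, only one possible way to quantify the skew: it compares the mean cosine similarity of a single text embedding against two groups of image embeddings. The function names (`cosine_similarity`, `mean_similarity`, `similarity_gap`) and the toy three-dimensional vectors are assumptions made for illustration; in practice the vectors would come from the CLIP text and image encoders.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_similarity(text_emb, image_embs):
    """Average cosine similarity between one text embedding and a group of image embeddings."""
    return sum(cosine_similarity(text_emb, e) for e in image_embs) / len(image_embs)


def similarity_gap(text_emb, group_a, group_b):
    """Hypothetical skew measure: difference in mean similarity between two image groups.

    A value near zero means the text concept is about equally similar to both groups.
    """
    return mean_similarity(text_emb, group_a) - mean_similarity(text_emb, group_b)


# Toy vectors standing in for encoder outputs (assumed values, not real CLIP embeddings).
text_emb = [1.0, 0.0, 0.0]                    # e.g. the text "photo of a doctor"
group_a = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]]  # images of people from one group
group_b = [[0.5, 0.5, 0.0], [0.4, 0.6, 0.0]]  # images of people from another group

gap = similarity_gap(text_emb, group_a, group_b)
print(f"similarity gap: {gap:.3f}")
```

Under this toy measure, a debiased model would be expected to push the gap toward zero for concepts such as "doctor" that should not depend on a person's visual characteristics.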