ICCV 2023

FACET: Fairness in Computer Vision Evaluation Benchmark

Laura Gustafson, Chloé Rolland, Nikhila Ravi, Quentin Duval, Aaron Adcock, Cheng-Yang Fu, Melissa Hall, Candace Ross

74 citations

Abstract

Computer vision models have known performance disparities across attributes such as gender and skin tone. This means that during tasks such as classification and detection, model performance differs for certain classes based on the demographics of the people in the image. These disparities have been shown to exist, but until now there has not been a unified approach to measure these differences for common use cases of computer vision models. We present a new benchmark named FACET (FAirness in Computer Vision EvaluaTion), a large, publicly available evaluation set of 32k images for some of the most common vision tasks: image classification, object detection, and segmentation. For every image in FACET, we hired expert reviewers to manually annotate person-related attributes such as perceived skin tone and hair type, manually draw bounding boxes, and label fine-grained person-related classes such as disk jockey or guitarist. In addition, we use FACET to benchmark state-of-the-art vision models and present a deeper understanding of potential performance disparities and challenges across sensitive demographic attributes. With the exhaustive annotations collected, we probe models using single demographic attributes as well as multiple attributes using an intersectional approach (e.g., hair color and perceived skin tone). Our results show that classification, detection, segmentation, and visual grounding models exhibit performance disparities across demographic attributes and intersections of attributes. These harms suggest that not all people represented in datasets receive fair and equitable treatment in these vision tasks. We hope current and future results using our benchmark will contribute to fairer, more robust vision models. FACET is available publicly at https://facet.metademolab.com.
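The single-attribute and intersectional probing described in the abstract can be illustrated with a small sketch. It is not the benchmark's evaluation code: it assumes, for illustration only, that disparity is measured as the gap in per-group recall, and the record layout, the attribute values ("lighter", "darker", "dark", "light"), and the function names `recall_by_group` and `max_disparity` are all hypothetical.

```python
from collections import defaultdict


def recall_by_group(records, keys):
    """Return {group: recall}, where a group is the tuple of values for `keys`.

    Each record is assumed to describe one annotated person and to carry a
    boolean "correct" field saying whether the model handled that person
    correctly (hypothetical layout, not the FACET annotation format).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        group = tuple(rec[k] for k in keys)
        totals[group] += 1
        hits[group] += int(rec["correct"])
    return {group: hits[group] / totals[group] for group in totals}


def max_disparity(recalls):
    """Largest recall gap between any two groups."""
    values = list(recalls.values())
    return max(values) - min(values)


# Hypothetical toy data, made up purely to exercise the functions.
records = [
    {"skin_tone": "lighter", "hair_color": "dark", "correct": True},
    {"skin_tone": "lighter", "hair_color": "dark", "correct": True},
    {"skin_tone": "lighter", "hair_color": "light", "correct": True},
    {"skin_tone": "lighter", "hair_color": "light", "correct": False},
    {"skin_tone": "darker", "hair_color": "dark", "correct": True},
    {"skin_tone": "darker", "hair_color": "dark", "correct": False},
    {"skin_tone": "darker", "hair_color": "light", "correct": False},
    {"skin_tone": "darker", "hair_color": "light", "correct": False},
]

# Single attribute: group only by perceived skin tone.
single = recall_by_group(records, ["skin_tone"])

# Intersectional: group by perceived skin tone and hair color together.
intersectional = recall_by_group(records, ["skin_tone", "hair_color"])

print(single, max_disparity(single))
print(intersectional, max_disparity(intersectional))
```

On this toy data the intersectional grouping exposes a larger gap than the single-attribute grouping, which is the kind of effect an intersectional analysis is meant to surface.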
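
| Dataset | #people | #images | #videos | #boxes | #masks | Gender | Age | Skin tone | Race | Lighting | Additional | Task |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UTK-Face [98] | 20k | 20k | - | - | - | Yes | Yes | No | Yes | No | No | - |
| FairFace [57] | 108k | 108k | - | - | - | Yes | Yes | No | Yes | No | No | - |
| Gender Shades [8] | 1.2k | 1.2k | - | - | - | Yes | Yes | Yes | No | No | No | - |
| OpenImages MIAP [84] | 454k | 100k | - | 454k | * | Yes | Yes | No | No | No | No | C*, D, S* |
| [94] annotations for BDD100K [97] | 16k | 2.2k | - | 16k | * | No | No | Yes | No | Yes | No | D, S* |
| [100] annotations for COCO [63] | 28k | 16k | - | 28k | 28k | Yes | No | Yes | No | No | No | C*, D, S |
| Casual Conversations v1 [43] | 3k | N/A | 45k | - | - | Yes | Yes | Yes | No | Yes | Yes | - |
| Casual Conversations v2 [42] | 5.6k | N/A | 26k | - | - | Yes | Yes | Yes | No | Yes | Yes | - |
| Ours - FACET | 50k | 32k | - | 50k | 69k | Yes | Yes | Yes | No | Yes | Yes | C, D, S |

Columns #people through #masks give the dataset size; columns Gender through Additional indicate which apparent or self-reported attributes are annotated. In the Task column, C, D, and S denote classification, detection, and segmentation. * represents tasks/annotations that are not included in the fairness portion of the dataset but are included in the overall dataset; e.g., COCO has been used for multi-class classification [101, 92].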