NeurIPS 2022

Robustness Disparities in Face Detection

Samuel Dooley, George Z. Wei, Tom Goldstein, John Dickerson

Cited 12 times

Abstract

Facial analysis systems have been deployed by large companies and critiqued by scholars and activists for the past decade. Many existing algorithmic audits examine the performance of these systems on later-stage elements of facial analysis, such as facial recognition and age, emotion, or perceived gender prediction; however, a core component of these systems has been vastly understudied from a fairness perspective: face detection, sometimes called face localization. Since face detection is a prerequisite step in facial analysis systems, the bias we observe in face detection will flow downstream to other components such as facial recognition and emotion prediction. Additionally, no prior work has focused on the robustness of these systems under various perturbations and corruptions, which leaves open the question of how various people are impacted by these phenomena. We present a first-of-its-kind detailed benchmark of face detection systems, specifically examining the robustness to noise of commercial and academic models. We use both standard and recently released academic facial datasets to quantitatively analyze trends in face detection robustness. Across all the datasets and systems, we generally find that photos of individuals who are masculine presenting, older, of darker skin type, or photographed in dim lighting are more susceptible to errors than their counterparts in other identities.

[...] overlaps heavily with the fairness in machine learning literature; for additional coverage of that broader ecosystem and discussion of bias in machine learning writ large, we direct the reader to the survey works of [10] and [3].

Demographic effects in facial detection and recognition. The existence of differential performance of facial detection and recognition across groups and subgroups of populations has been explored in a variety of settings [8, 28, 37, 41, 56, 61].
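To make the idea of a per-group robustness comparison concrete, the following is a minimal sketch, not the benchmark's actual metric or code. It assumes each image carries a group label together with a detection score on the clean image and on a noise-corrupted copy, and it reports the relative drop in mean score per group. The function name `relative_drop_by_group`, the record layout, and the group labels are all hypothetical.

```python
from collections import defaultdict


def relative_drop_by_group(records):
    """Compute the relative performance drop under corruption for each group.

    records: iterable of (group, clean_score, corrupted_score) tuples.
    Returns {group: (mean_clean - mean_corrupted) / mean_clean}, or None for
    a group whose mean clean score is zero (the ratio is undefined there).
    This definition is an illustrative assumption, not the paper's metric.
    """
    # Per group: [sum of clean scores, sum of corrupted scores, count]
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for group, clean, corrupted in records:
        entry = sums[group]
        entry[0] += clean
        entry[1] += corrupted
        entry[2] += 1

    drops = {}
    for group, (clean_sum, corrupted_sum, n) in sums.items():
        mean_clean = clean_sum / n
        mean_corrupted = corrupted_sum / n
        if mean_clean == 0:
            drops[group] = None
        else:
            drops[group] = (mean_clean - mean_corrupted) / mean_clean
    return drops


# Hypothetical toy data: 1.0 means the face was detected, 0.0 means it was missed.
toy_records = [
    ("group_a", 1.0, 1.0),
    ("group_a", 1.0, 0.0),
    ("group_b", 1.0, 1.0),
    ("group_b", 1.0, 1.0),
]
drops = relative_drop_by_group(toy_records)
print(drops)
```

In this toy data the detector loses half of its detections on `group_a` under corruption and none on `group_b`; a gap of this kind between groups is what a robustness disparity refers to.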
In this work, we focus on measuring the impact of noise on a classification task, like that of [75]; indeed, a core focus of our benchmark is to quantify relative drops in performance conditioned on an input datapoint's membership in a particular group. We view our work as a benchmark: it focuses on quantifying and measuring, and decidedly does not provide a new method to "fix" or otherwise mitigate issues of demographic inequity in a system. Toward that latter point, existing work on "fixing" unfair systems can be split into three (or, arguably, four [64]) focus areas: pre-, in-, and post-processing. Pre-processing work largely focuses on dataset curation and preprocessing [e.g., 22, 60, 63, 71]. In-processing often constrains the ML training method or optimization algorithm itself [e.g.