CVPR2025

GroundingFace: Fine-grained Face Understanding via Pixel Grounding Multimodal Large Language Model

Yue Han, Jiangning Zhang, Junwei Zhu, Runze Hou, Xiaozhong Ji, Chuming Lin, Xiaobin Hu, Zhucun Xue, Yong Liu

Abstract

Multimodal Language Learning Models (MLLMs) have shown remarkable performance in image understanding, generation, and editing, with recent advancements achieving pixel-level grounding with reasoning. However, these models for common objects struggle with fine-grained face understanding. In this work, we introduce the FacePlayGround-240K dataset, the first pioneering largescale, pixel-grounded face caption and question-answer (QA) dataset that includes 240K images, 47 mask categories, 5.4M mask annotations, and 7.3M grounded regions, meticulously curated for alignment pretraining and instruction-tuning. We present the GroundingFace frame-→ Project lead. † Corresponding author. work, specifically designed to enhance fine-grained face understanding. This framework significantly augments the capabilities of existing grounding models in face part segmentation, face attribute comprehension, while preserving general scene understanding. Comprehensive experiments validate that our approach surpasses current state-of-the-art models in pixel-grounded face captioning/QA and various downstream tasks, including face captioning, referring segmentation, and zero-shot face attribute recognition.