ICLR 2026

UniF$^2$ace: A \underline{Uni}fied \underline{F}ine-grained \underline{Face} Understanding and Generation Model

Junzhe Li, Sifan Zhou, Liya Guo, Xuerui Qiu, Linrui Xu, TingTing Long, Chun Fan, Ming Li, Hehe Fan, Jun Liu, Shuicheng YAN

Abstract

Unified multimodal models (UMMs) have emerged as a powerful paradigm in fundamental cross-modality research, demonstrating significant potential in both image understanding and generation. However, existing research in the face domain faces two main challenges: (1) fragmented development, with existing methods failing to unify understanding and generation in a single model, which hinders progress toward artificial general intelligence; and (2) a lack of fine-grained facial attributes, which are crucial for high-fidelity applications. To address these issues, we propose UniF$^2$ace, the first UMM specifically tailored for fine-grained face understanding and generation. First, we introduce a novel theoretical framework with a Dual Discrete Diffusion (D3Diff) loss, which unifies masked generative models with discrete score matching diffusion and leads to a more precise approximation of the negative log-likelihood. The D3Diff loss significantly enhances the model's ability to synthesize high-fidelity facial details aligned with the text input. Second, we propose a multi-level grouped Mixture-of-Experts architecture that adaptively incorporates semantic and identity facial embeddings to compensate for the attribute forgetting that occurs as representations evolve. Finally, we construct UniF$^2$aceD-1M, a large-scale dataset comprising 130K fine-grained image-caption pairs and 1M visual question-answering pairs, spanning a much wider range of facial attributes than existing datasets. Extensive experiments demonstrate that UniF$^2$ace outperforms existing models of similar scale in both understanding and generation tasks, with 7.1% higher Desc-GPT and 6.6% higher VQA-score, respectively. Code is available in the supplementary materials.