ICLR 2026
HumanPCR: Probing MLLM Capabilities in Diverse Human-Centric Scenes
Keliang Li, Hongze Shen, Hao Shi, RuiBing Hou, Hong Chang, Jie Huang, Chenghao Jia, Wen Wang, Yiling Wu, Dongmei Jiang, Shiguang Shan, Xilin Chen
Abstract
The aspiration for artificial general intelligence, fueled by the rapid progress of multimodal models, demands human-comparable performance across diverse environments. We propose HumanPCR, an evaluation suite for probing MLLMs' capacity to understand human-related visual contexts across three hierarchical levels: Perception, Comprehension, and Reasoning (denoted by Human-P, Human-C, and Human-R, respectively). Human-P and Human-C feature over 6,000 human-verified multiple-choice questions covering numerous tasks across 9 dimensions, including but not limited to essential skills frequently overlooked by existing benchmarks. Human-R offers a challenging, manually curated video reasoning test that requires integrating multiple pieces of visual evidence, proactively extracting context beyond question cues, and applying human-like expertise. Each question includes human-annotated Chain-of-Thought (CoT) rationales with key visual evidence to support further research. Extensive evaluations of over 30 state-of-the-art models reveal significant challenges in human-centric visual understanding, particularly in tasks involving detailed spatial perception, temporal understanding, and mind modeling. Moreover, analysis of Human-R reveals that models struggle to extract essential proactive visual evidence from diverse human scenes and rely faultily on query-guided retrieval. Even advanced techniques such as scaling visual contexts and test-time thinking yield only limited benefits. We hope HumanPCR and our findings will advance the development, evaluation, and human-centric application of multimodal models.
* Equal contribution. Author order was determined randomly. † Corresponding author.
[Figure: HumanPCR task taxonomy]
Human-P
Spatiality: Spatial Relation, Object Existence, Human Presence
Posture: Body Posture, Hand State, Hand-Object Interaction, Body Orientation, Gaze Estimation
Appearance: Clothing Attribute, Accessory Recognition, Bodypart Visibility, Physical Attribute
Contact: Human-Object Contact, Human-Human Contact, Human Self-Contact
Identity: Face Recognition, Identity Clustering
Human-C
Behavior: Gesture, Emotion, Basic Action, Knowledge-Based Action
Procedure: Sequential Action, Goal Planning, Procedure Dependence, Multiple Human Sequential Action, Irrelevant Action
Relation: Human Comparison, Social Relation, Group Activity, Struggle Detection
Scene: Crowd Event, Cultural Ev