AAAI 2026
Measuring the Unmeasurable: Unveiling Latent Cognitive Capabilities of LLM
Cui Danxin, Sihang Jiang, Keyi Wang, Zhiyi Duan, Yanghua Xiao, Bi Yude, Jiaqing Liang, Minggui He, Shimin Tao, Yilun Liu
Abstract
As large language models (LLMs) are increasingly deployed in high-stakes domains such as education, healthcare, and law, accurately evaluating their nuanced reasoning processes becomes essential to ensuring their safety, reliability, and trustworthiness. However, most existing benchmarks evaluate LLMs at a coarse granularity: they lack a unified framework and rely on single-task datasets, overlooking the intermediate steps of complex reasoning. This results in redundant overlap across benchmarks, poor generalization to multifaceted real-world tasks, and underutilization of the rich reasoning traces generated by advanced LLMs.