ICLR 2026

Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks

Miao Jing, Mengting Jia, Junling Lin, Zhongxia Shen, Huan Gao, Mingkun Xu, Shangyang Li

Abstract

Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark specifically designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and encompasses three core task families: differential diagnosis, lesion recognition, and rationale generation. To ensure reliable evaluation, we develop a hybrid scoring pipeline that combines LLM-based graders, clinician validation, and semantic similarity metrics. Through systematic evaluation of state-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, we observe a sharp performance drop compared to conventional datasets. Error analysis shows that reasoning failures, rather than perceptual errors, dominate model shortcomings. Our findings highlight the necessity of a Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization, and depth-oriented, compact benchmarks such as Neural-MedBench for reasoning fidelity. We release Neural-MedBench at https://neuromedbench.github.io/ as an open and extensible diagnostic testbed, which guides the expansion of future benchmarks and enables rigorous yet cost-effective assessment of clinically trustworthy AI.

Introduction

Recent advances in vision-language models (VLMs) have led to striking improvements across a wide range of medical AI tasks. On standard benchmarks such as MedMNIST v2 (Yang et al., 2023) and MultiMedQA (Singhal et al., 2023), state-of-the-art models achieve near-human or even superhuman performance in label prediction and image-text alignment. These results have created an impression that medical VLMs are nearing clinical readiness. Yet, as the clinical reasoning literature has recently underscored (Schwartzstein, 2024), safe and effective diagnostic practice, especially in high-stakes fields such as neurology, demands more than classification accuracy: it requires multimodal synthesis, ambiguity resolution, and the capacity to justify conclusions in a manner consistent with clinical logic. Current benchmarks, despite their scale, rarely capture these aspects. We argue that this discrepancy