ICML2025
LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models
Parshin Shojaee, Ngoc-Hieu Nguyen, Kazem Meidani, Amir Barati Farimani, Khoa D. Doan, Chandan K. Reddy
Abstract
Large language models (LLMs) are increasingly applied to scientific research, yet prevailing science benchmarks probe decontextualized knowledge and overlook the iterative reasoning, hypothesis generation, and observation interpretation that drive scientific discovery. We introduce a scenariogrounded benchmark that evaluates LLMs across biology, chemistry, materials, and physics, where domain experts define research projects of genuine interest and decompose them into modular research scenarios from which vetted questions are sampled. The framework assesses models at two levels: (i) question-level accuracy on scenario-tied items and (ii) project-level performance, where models must propose testable hypotheses, design simulations or experiments, and interpret results. Applying this two-phase scientific discovery evaluation (SDE) framework to state-of-the-art LLMs reveals a consistent performance gap relative to general science benchmarks, diminishing return of scaling up model sizes and reasoning, and systematic weaknesses shared across top-tier models from different providers. Large performance variation in research scenarios leads to changing choices of the best performing model on scientific discovery projects evaluated, suggesting all current LLMs are distant to general scientific "superintelligence". Nevertheless, LLMs already demonstrate promise in a great variety of scientific discovery projects, including cases where constituent scenario scores are low, highlighting the role of guided exploration and serendipity in discovery. This SDE framework offers a reproducible benchmark for discovery-relevant evaluation of LLMs and charts practical paths to advance their development toward scientific discovery. This tight connection among questions, scenarios, and projects built in SDE reveals the true capability of LLMs in scientific discovery. Beyond per-question evaluation as in conventional science benchmarks, we also evaluate LLMs' performance at the level of open-ended scientific discovery projects. In this setting, LLMs are put into the loop of scientific discovery, where they are required to autonomously propose testable hypotheses, run simulations or experiments, and interpret the results to refine their original hypotheses, imitating an end-to-end scientific discovery process, where their discovery-orientated outcomes (e.g., polarisability of proposed transition metal complexes) are evaluated. This project-level evaluation reveals capability gaps and failure modes across the research pipeline. Applying this multi-level evaluation framework to state-of-the-art LLMs released over time yields a longitudinal, fine-grained benchmark that reveals where current models succeed, where they fail, and why. The resulting analysis suggests actionable avenues, spanning targeted training on problem formulation, diversifying data sources, baking in computational tool use in training, and designing reinforcement learning strategies in scientific reasoning, for steering LLM development toward scientific discovery. Results Question-level evaluations Performance gap in quiz-and discovery-type questions. To go beyond the conventional science Q&A benchmark where questions are sometimes assembled opportunistically, questions in SDE are collected in a completely different routine (Fig. 1c ). In each domain, a multi-member expert panel defined roughly ten common research scenarios where LLMs could plausibly help their ongoing projects. These scenarios span a broad spectrum, from those human experts are proficient (e.g., making decisions from specific experimental observations) to those effectively intractable to human experts without the assistance of tools (e.g., inferring oxidation and spin states solely from a transition metal complex structure). When feasible, questions were generated semi-automatically by sampling and templating from open datasets, 46 with NMR spectra to molecular structure mapping as an example. Otherwise, especially for experiment-related scenarios, questions were drafted manually by an expert. Every question underwent panel review, with inclusion contingent on consensus about the validity and correctness, resulting in 1,125 questions in the SDE benchmark (see Methods section, Research scenario and question collection). This design ties every question to a research scenario, ensuring that its correctness reflects progress on a practical scientific discovery project rather than decontextualized trivia, which also allows comparisons across LLMs at the same level of granularity. With the goal of understanding how the performance of popular coding, math, and expression benchmarks translates to scientific discovery, top-tier models from various providers (i.e., OpenAI, Anthropic, Grok, and DeepSeek) are evaluated through an adapted version of lm-evaluation-harness framework, which supports flexible evaluation through a claude-sonnet-4.5 deepseek-V3.1 deepseek-R1 gpt-4o gpt-5-chat gpt-5 grok-3 grok-4 b c d e