ICLR 2025

DSBench: How Far Are Data Science Agents from Becoming Data Science Experts?

Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wenlin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, Dong Yu

Abstract

In the era of data-driven decision-making, the complexity of data analysis necessitates advanced expertise and tools of data science, presenting significant challenges even for specialists. Large Language Models (LLMs) have emerged as promising aids as data science agents, assisting humans in data analysis and processing. Yet their practical efficacy remains constrained by the varied demands of real-world applications and complicated analytical processes. In this paper, we introduce DSEval, a novel evaluation paradigm, as well as a series of innovative benchmarks tailored for assessing the performance of these agents throughout the entire data science lifecycle. Incorporating a novel bootstrapped annotation method, we streamline dataset preparation, improve the evaluation coverage, and expand benchmarking comprehensiveness. Our findings uncover prevalent obstacles and provide critical insights to inform future advancements in the field.

* Correspondence to Kan Ren.
* Source code and data are available at https://github.com/MetaCopilot/dseval.

[Figure: example runtime session. In-memory data with columns country, landArea, pop2010, pop2023, pop2050. Query: "Calculate the population density for each country in 2023 and 2050. Result should be a new frame with "Country" as the index and "2023 Density" and "2050 Density" as the columns."]
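To make the example query above concrete, here is a minimal sketch of one way an agent might answer it with pandas. The column names (country, landArea, pop2010, pop2023, pop2050) and the required output layout come from the query itself. The numeric values, the function name `population_density`, and the choice of pandas are assumptions made for illustration and are not taken from the benchmark.

```python
import pandas as pd

# Toy in-memory data. The values are invented for illustration only
# and are not real statistics or benchmark data.
df = pd.DataFrame({
    "country": ["A", "B", "C"],
    "landArea": [100.0, 200.0, 50.0],
    "pop2010": [900, 3800, 240],
    "pop2023": [1000, 4000, 250],
    "pop2050": [1500, 5000, 200],
})


def population_density(frame: pd.DataFrame) -> pd.DataFrame:
    """Return population per unit of land area for 2023 and 2050.

    The result is a new frame indexed by "Country"; the input frame
    is left unmodified. (Hypothetical helper, not part of DSEval.)
    """
    result = pd.DataFrame({
        "Country": frame["country"],
        "2023 Density": frame["pop2023"] / frame["landArea"],
        "2050 Density": frame["pop2050"] / frame["landArea"],
    })
    return result.set_index("Country")


density = population_density(df)
print(density)
```

Building a fresh frame, rather than adding columns to `df`, follows the query's request for "a new frame" and leaves the session's existing in-memory data unchanged.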