KDD 2025

Revolutionizing Database Q&A with Large Language Models: Comprehensive Benchmark and Evaluation

Yihang Zheng, Bo Li, Zhenghao Lin, Yi Luo, Xuanhe Zhou, Chen Lin, Guoliang Li, Jinsong Su

Abstract

The development of Large Language Models (LLMs) has revolutionized question answering (QA) across various industries, including the database domain. However, a thorough evaluation of the capabilities of different LLMs in database QA is still lacking. To this end, we introduce DQABench, the first comprehensive database QA benchmark for LLMs. DQABench features an innovative LLM-based method to automate the generation, cleaning, and rewriting of the evaluation dataset, resulting in over 200,000 QA pairs in English and Chinese. These QA pairs cover a wide range of database-specific knowledge extracted from manuals, online communities, and DB instances, allowing for assessment of LLMs' Retrieval-Augmented Generation (RAG) and Tool Invocation Generation (TIG) capabilities in the database QA task. Furthermore, we propose a highly modular and scalable testbed, DQATestbed, with basic and advanced components such as Fine-tuning, Question Classification Routing (QCR), RAG, TIG, and Prompt Template Engineering (PTE). Finally, we provide an evaluation pipeline that computes various metrics throughout a standardized evaluation process to ensure accuracy and fairness. Our evaluation reveals the strengths and limitations of nine open-source and commercial LLMs, and the impact of various service components (e.g., fine-tuning, QCR, RAG, TIG). The proposed benchmark dataset is available at https://github.com/XMUDM/DQABench.
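To make the modular design concrete, the following is a minimal sketch of how a question classification router might dispatch a question to a general, RAG, or TIG handler. It is not the DQATestbed implementation: the category labels, the keyword rules, and the names `classify`, `route`, and `HANDLERS` are all hypothetical, and a real QCR component would presumably use a trained classifier or an LLM rather than keyword matching.

```python
from typing import Callable, Dict

# Hypothetical category labels; the benchmark's actual categories may differ.
GENERAL, RAG, TIG = "general", "rag", "tig"


def classify(question: str) -> str:
    """Toy keyword classifier standing in for a learned QCR model (assumption)."""
    q = question.lower()
    # Questions about a live instance would need tool invocation (TIG).
    if any(k in q for k in ("my database", "my instance", "current workload")):
        return TIG
    # Questions about product-specific knowledge would need retrieval (RAG).
    if any(k in q for k in ("manual", "documentation", "parameter")):
        return RAG
    # Everything else is treated as general database knowledge.
    return GENERAL


def answer_general(question: str) -> str:
    # Placeholder: a real handler would prompt the LLM directly.
    return f"[general] {question}"


def answer_rag(question: str) -> str:
    # Placeholder: a real handler would retrieve documents, then prompt the LLM.
    return f"[rag] {question}"


def answer_tig(question: str) -> str:
    # Placeholder: a real handler would call database tools, then prompt the LLM.
    return f"[tig] {question}"


# Each component is swappable because the router only depends on this mapping.
HANDLERS: Dict[str, Callable[[str], str]] = {
    GENERAL: answer_general,
    RAG: answer_rag,
    TIG: answer_tig,
}


def route(question: str, handlers: Dict[str, Callable[[str], str]] = HANDLERS) -> str:
    """Classify the question, then dispatch it to the matching handler."""
    return handlers[classify(question)](question)
```

Keeping the handlers in a plain mapping is one way to let individual components be replaced or disabled when measuring their impact; whether the actual testbed is organized this way is an assumption.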