EMNLP 2025

OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain

Shuting Wang, Jiejun Tan, Zhicheng Dou, Ji-Rong Wen

Cited by 6

Abstract

Retrieval-augmented generation (RAG) has emerged as a key application of large language models (LLMs), especially in vertical domains where LLMs lack domain-specific knowledge. Nevertheless, current RAG benchmarks often suffer from narrow scenarios and limited evaluation dimensions, hindering a comprehensive understanding of RAG models in real-world vertical applications. This paper introduces OmniEval, an omnidirectional and automatic RAG benchmark for the financial domain, characterized by its omnidirectional evaluation framework. First, we categorize RAG scenarios by five task classes and 16 financial topics, leading to a matrix-based structured assessment. Next, we leverage a multi-dimensional and auto-chained data generation pipeline that integrates LLM-based automatic generation and human annotation approaches, creating high-quality evaluation instances. Further, we adopt a multi-stage evaluation to assess both retrieval and generation performance, resulting in a holistic RAG evaluation. Finally, rule-based and LLM-based metrics are combined to build a multi-level evaluation system. Our experiments indicate that the performance of RAG systems varies across topics and tasks, highlighting the importance of multi-aspect and structured assessments to better locate the strengths and weaknesses of RAG systems. We release our code at https://github.com/RUC-NLPIR/OmniEval .
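To make the matrix-based structured assessment concrete, the following is a minimal sketch of how per-instance scores might be aggregated into a task-by-topic matrix and then scanned for weak cells. It is not taken from the OmniEval codebase: the function names `score_matrix` and `weakest_cells`, the `(task, topic, score)` record layout, and the placeholder labels and scores are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def score_matrix(records):
    """Average a metric over each (task, topic) cell.

    `records` is an iterable of (task, topic, score) tuples; this
    layout is an assumption of the sketch, not the benchmark's format.
    """
    cells = defaultdict(list)
    for task, topic, score in records:
        cells[(task, topic)].append(score)
    return {cell: mean(scores) for cell, scores in cells.items()}


def weakest_cells(matrix, k=3):
    """Return the k lowest-scoring (task, topic) cells with their means."""
    return sorted(matrix.items(), key=lambda item: item[1])[:k]


# Hypothetical task/topic labels and scores, for illustration only.
records = [
    ("task_a", "topic_1", 0.8),
    ("task_a", "topic_1", 0.6),
    ("task_a", "topic_2", 0.5),
    ("task_b", "topic_1", 0.9),
]
matrix = score_matrix(records)
```

Calling `weakest_cells(matrix, 1)` on this toy data picks out the single task and topic combination with the lowest mean score, which is the kind of per-cell diagnosis a structured assessment is meant to support. With five task classes and 16 topics, the full matrix would have up to 80 cells.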