EMNLP 2025

KoBLEX: Open Legal Question Answering with Multi-hop Reasoning

Jihyung Lee, Daehui Kim, Seonjeong Hwang, Hyounghun Kim, Gary Lee

Abstract

Large Language Models (LLMs) have achieved remarkable performance in general domains and are now extending into the expert domain of law. Several benchmarks have been proposed to evaluate LLMs' legal capabilities. However, these benchmarks fail to evaluate open-ended and provision-grounded Question Answering (QA). To address this, we introduce a Korean Benchmark for Legal EXplainable QA (KoBLEX), designed to evaluate provision-grounded, multi-hop legal reasoning. KoBLEX includes 226 scenario-based QA instances and their supporting provisions, created using a hybrid LLM-human expert pipeline. We also propose a method called Parametric provision-guided Selection Retrieval (PARSER), which uses LLM-generated parametric provisions to guide legally grounded and reliable answers. PARSER facilitates multi-hop reasoning on complex legal questions by generating parametric provisions and employing a three-stage sequential retrieval process. Furthermore, to better evaluate the legal fidelity of the generated answers, we propose Legal Fidelity Evaluation (LF-EVAL). LF-EVAL is an automatic metric that jointly considers the question, answer, and supporting provisions and shows a high correlation with human judgments. Experimental results show that PARSER consistently outperforms strong baselines, achieving the best results across multiple LLMs. Notably, compared to standard retrieval with GPT-4o, PARSER achieves 37.91 higher F1 and 30.81 higher LF-EVAL. Further analyses reveal that PARSER efficiently delivers consistent performance across reasoning depths, and ablations confirm its effectiveness.¹

* Equal contribution.
¹ The code and dataset are available at https://github.com/daehuikim/