EMNLP 2025
CourtReasoner: Can LLM Agents Reason Like Judges?
Sophia Simeng Han, Yoshiki Takashima, Shannon Zejiang Shen, Chen Liu, Yixin Liu, Roque K. Thuo, Sonia Knowlton, Ruzica Piskac, Scott J. Shapiro, Arman Cohan
Cited by 1
Abstract
LLMs are increasingly applied in the legal domain in tasks such as summarizing legal texts and providing basic legal advice. Yet their capacity to draft full judicial analyses, such as generating entire judicial reasoning sections in U.S. court decisions, remains under-explored. Given the continued adoption of LLMs and the significance of law to society at large, measuring LLMs' legal reasoning capabilities is a pressing task. We propose COURTREASONER, a novel expert-annotated judicial reasoning benchmark to evaluate the capabilities of LLM agents in complex legal reasoning. Sourcing U.S. court opinions, we construct benchmarks that measure LLMs' abilities to construct goal-oriented legal reasoning. COURTREASONER measures an agent's ability to argue both ways in a legal dispute, rather than its ability to answer simple questions. Our results show that more than 60% of frontier model outputs contain invalid arguments and more than 53% contain irrelevant citations when conducting complex legal reasoning. We also introduce a meta-evaluation benchmark to provide insights into the capabilities of LLMs as evaluators of legal reasoning. Our data, code, and full annotation guidelines are released for future research.

* 0 points: No relevant cases cited, or all cited cases are completely irrelevant to the analysis.
* 1 point: All cited cases have only remote or tangential relevance to the core legal analysis.
* 2 points: Most cited cases have only distant relevance, with few directly applicable precedents.
* 3 points: About half of the cited cases are only remotely relevant to the analysis, while the rest are relevant.
* 4 points: All or nearly all cited cases are highly relevant and directly applicable to the legal analysis.

## B. Constraints Extraction (Score: 0-4)

In order to use a conclusion from a cited case, the analysis must first identify which constraints are needed to reach that conclusion in the cited case.
This conclusion is the one that supports the argument the legal analysis is trying to make. Evaluate how well the analysis identifies the necessary conditions (constraints) that must be satisfied in the cited case to reach that conclusion.

* 0 points: No legal constraints identified, or the extraction is fundamentally incorrect.
* 1 point: Some constraints extracted but fewer than 3, or the extraction contains significant errors in interpretation.
* 2 points: At least 3 constraints extracted, but some are incorrectly formulated or incompletely articulated.
* 3 points: All necessary constraints (typically at least 3 plus any other applicable ones) are extracted, with only minor interpretive errors.
* 4 points: All constraints are fully and correctly extracted with precise legal terminology and interpretation.

## C. Argument Validity per Constraint (Score: 0-4)

Evaluate how well the legal arguments support each identified constraint. Factual accuracy is important here: the legal analysis must not exaggerate or change key phrases in the background information or facts. This aspect should be evaluated independently of citation relevance and constraint extraction.

* 0 points: No substantive arguments provided for any of the identified constraints.
* 1 point: Arguments provided for some constraints, but they are predominantly invalid, weak, or misapply legal principles.
* 2 points: Arguments provided for most constraints, but several are invalid, or significant constraints lack supporting arguments.
* 3 points: Arguments provided for all identified constraints; most are valid but contain minor logical inconsistencies or gaps.
* 4 points: Strong, valid arguments provided for each identified constraint, with sound legal reasoning throughout.
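
The three rubric aspects above are each scored on an integer 0-4 scale, so one evaluation can be held in a small record that rejects out-of-range values. The following is a minimal sketch, not the benchmark's released code: the class name `RubricScore`, the field names, and the helper `mean_by_aspect` are hypothetical, and the per-aspect averaging is only an illustrative aggregation, not necessarily how the paper's percentages were computed.

```python
from dataclasses import dataclass, asdict

# Score bounds taken from the rubric text ("Score: 0-4").
MIN_SCORE, MAX_SCORE = 0, 4

# Hypothetical identifiers for the three rubric aspects.
ASPECTS = ("citation_relevance", "constraints_extraction", "argument_validity")


def is_valid_score(value):
    """Return True only for integers in the rubric's 0-4 range."""
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return MIN_SCORE <= value <= MAX_SCORE


@dataclass(frozen=True)
class RubricScore:
    """One evaluator's scores for a single generated legal analysis."""
    citation_relevance: int
    constraints_extraction: int
    argument_validity: int

    def __post_init__(self):
        # Each aspect is checked on its own, mirroring the rubric's
        # instruction to evaluate the aspects independently.
        for name in ASPECTS:
            value = getattr(self, name)
            if not is_valid_score(value):
                raise ValueError(f"{name} must be an integer in 0-4, got {value!r}")


def mean_by_aspect(scores):
    """Average each aspect over a non-empty list of RubricScore records.

    Illustrative aggregation only (an assumption of this sketch).
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    rows = [asdict(s) for s in scores]
    return {name: sum(r[name] for r in rows) / len(rows) for name in ASPECTS}
```

For example, `mean_by_aspect([RubricScore(4, 3, 2), RubricScore(2, 1, 0)])` averages the two records aspect by aspect, while `RubricScore(5, 0, 0)` raises a `ValueError` because 5 falls outside the rubric's range.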