ACL 2025

MedCite: Can Language Models Generate Verifiable Text for Medicine?

Xiao Wang, Mengjue Tan, Qiao Jin, Guangzhi Xiong, Yu Hu, Aidong Zhang, Zhiyong Lu, Minjia Zhang

Cited by 9

Abstract

Existing LLM-based medical question-answering systems lack citation generation and evaluation capabilities, raising concerns about their adoption in practice. In this work, we introduce MedCite, the first end-to-end framework that facilitates the design and evaluation of citation generation with LLMs for medical tasks. We also introduce a novel multi-pass retrieval-citation method that generates high-quality citations. Our evaluation highlights the challenges and opportunities of citation generation for medical tasks and identifies design choices that have a significant impact on final citation quality. Our proposed method improves citation precision and recall over strong baseline methods, and we show that the evaluation results correlate well with annotations from professional experts.

Model                  Source  Domain   Cohen's Kappa Score
                                        Rec. Judge  Prec. Judge
SciFive-MedNLI         Open    Medical  0.2593      0.1945
JSL-MedPhi2-2.7B       Open    Medical  0.1845      0.2218
UltraMedical           Open    Medical  0.4518      0.2162
Llama-3.1-8B-Instruct  Open    General  0.5862      0.5422
mistral-7B-Instruct    Open    General  0.6211      0.4241
GPT-3.5-Turbo          Closed  General  0.3834      0.4075
GPT-4o                 Closed  General  0.4146      0.4075
GPT-4o-mini            Closed  General  0.3834      0.3894
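
The scores reported above are Cohen's kappa values, a measure of agreement between two raters that corrects for the agreement expected by chance. As a minimal sketch of how such a score can be computed, the Python snippet below compares a judge model's labels with expert annotations. The function name `cohens_kappa`, the binary label scheme (1 = the citation supports the statement, 0 = it does not), and the toy data are assumptions made for illustration; this section does not specify the paper's annotation protocol.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items both raters labelled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical toy data, not taken from the paper.
expert = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
judge = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]
kappa = cohens_kappa(expert, judge)
print(f"kappa = {kappa:.4f}")
```

On this toy data the two raters agree on 8 of 10 items (observed agreement 0.80) while chance agreement is 0.52, so kappa is about 0.58. Values near 0 indicate agreement no better than chance, and values near 1 indicate near-perfect agreement.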