ACL 2024

LooGLE: Can Long-Context Language Models Understand Long Contexts?

Jiaqi Li, Mengmeng Wang, Zilong Zheng, Muhan Zhang

Cited by 32

Abstract

Large language models (LLMs) are typically limited to processing texts within their context window size, which has spurred significant research efforts into enhancing LLMs' long-context understanding as well as developing high-quality benchmarks to evaluate this ability. However, prior datasets suffer from several shortcomings: documents that are short compared to the context window of modern LLMs; outdated documents that might have data leakage problems; and an emphasis on short dependency tasks only. In this paper, we present LooGLE, a Long Context Generic Language Evaluation benchmark. It features documents published after 2022, with over 24,000 tokens per document and 6,000 newly generated questions spanning varying dependency ranges in diverse domains. Human annotators meticulously crafted over 1,100 high-quality question-answer (QA) pairs with thorough cross-validation for a more precise assessment of LLMs' long dependency capabilities. We conduct a comprehensive evaluation of representative LLMs on LooGLE. The results indicate that most LLMs have shockingly bad long-context ability and fail to capture long dependencies in the context, even when their context window is large enough to fit the entire document. Our results shed light on enhancing the "true long-context understanding" ability of LLMs instead of merely enlarging their context window.