ACL 2024
∞Bench: Extending Long Context Evaluation Beyond 100K Tokens
Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Khai Hao, Xu Han, Zhen Leng Thai, Shuo Wang, Zhiyuan Liu, Maosong Sun
Abstract
Processing and reasoning over long contexts is crucial for many practical applications of Large Language Models (LLMs), such as document comprehension and agent construction. Despite recent strides in making LLMs process contexts with more than 100K tokens, there is currently no standardized benchmark to evaluate this long-context capability. Existing public benchmarks typically focus on contexts of around 10K tokens, limiting the assessment and comparison of LLMs in processing longer contexts. In this paper, we propose ∞BENCH, the first LLM benchmark featuring an average data length surpassing 100K tokens. ∞BENCH comprises synthetic and realistic tasks spanning diverse domains, presented in both English and Chinese. The tasks in ∞BENCH are designed to require a thorough understanding of long dependencies in the context, so that simply retrieving a limited number of passages from the context is not sufficient to solve them. In our experiments, based on ∞BENCH, we evaluate state-of-the-art proprietary and open-source LLMs tailored for processing long contexts. The results indicate that existing long-context LLMs still require significant advancements to effectively process contexts of 100K+ tokens. We further present three intriguing analyses regarding the behavior of LLMs processing long contexts.
Introduction
In recent years, large language models (LLMs) (Brown et al., 2020; OpenAI, 2023a; Touvron et al., 2023) have exhibited exceptional performance across a range of natural language processing (NLP) tasks (Qiu et al., 2020; Han et al., 2021). LLMs show a promising direction toward generalist task assistance, being capable of aiding users in practical tasks through conversational interactions.
These tasks include web navigation (Nakano et al., 2021), analysis of code repositories (Chen et al., 2021), and extraction of useful information from documents (Kočiský et al., 2018), indicating a step towards artificial general intelligence. For these LLM-based scenarios, the ability to process long contexts is increasingly critical, in addition to understanding fine-grained semantics and possessing extensive knowledge (Dong et al., 2023; Huang et al., 2023). Textual documents, historical dialogues, complex instructions, and cumbersome workflows, which constitute the data most directly processed in daily tasks, must be input to LLMs as long contexts for effective processing.
Despite this growing importance, LLMs consistently face challenges in processing long contexts, primarily due to the substantial computational resources required for long-sequence training (Dao et al., 2022; Dao, 2023) as well as the apparent inability to generalize to sequences longer than those encountered during training (Chen et al., 2023a; Peng et al., 2023b). LLMs are typically trained on sequences containing no more than 8K tokens (Touvron et al., 2023; Penedo et al., 2023; Biderman et al., 2023), and thus cannot handle contexts exceeding 8K tokens well. These limitations have largely restricted most LLMs from being applied to more complex tasks.
Figure 1: The performance of GPT-4, Kimi-Chat, YaRN-Mistral, and Claude 2 on ∞BENCH. A higher value represents better performance. (Tasks shown: En.Sum, En.QA, En.MC, En.Dia, Zh.QA, Code.Debug, Code.Run, Math.Calc, Math.Find, Retrieve.PassKey, Retrieve.Number, Retrieve.KV; axis ticks from 20 to 100.)