ICLR 2026
Frustratingly Simple Retrieval Improves Challenging, Reasoning-Intensive Benchmarks
Xinxi Lyu, Michael Duan, Rulin Shao, Pang Wei Koh, Sewon Min
Cited by 7
Abstract
Retrieval-augmented generation (RAG) has primarily been studied in limited settings, such as factoid question answering; more challenging, reasoning-intensive benchmarks have seen limited success from minimal RAG. In this work, we challenge this prevailing view on established, reasoning-intensive benchmarks: MMLU, MMLU Pro, AGI Eval, GPQA, and MATH. We identify a key missing component in prior work: a usable, web-scale datastore aligned with the breadth of pretraining data. To this end, we introduce COMPACTDS: a diverse, high-quality, web-scale datastore that achieves high retrieval accuracy and subsecond latency on a single node. The key insights are (1) most web content can be filtered out without sacrificing coverage, and a compact, high-quality subset is sufficient; and (2) combining in-memory approximate nearest neighbor (ANN) retrieval with on-disk exact search balances speed and recall. Using COMPACTDS, we show that a minimal RAG pipeline achieves consistent accuracy improvements across all benchmarks and model sizes (8B-70B), with relative gains of 10% on MMLU, 33% on MMLU Pro, 14% on GPQA, and 19% on MATH. No single data source suffices alone, highlighting the importance of diversity of sources (web crawls, curated math, academic papers, educational text). Finally, we show that our carefully designed in-house datastore matches or outperforms web search engines such as Google Search, as well as recently proposed, complex agent-based RAG systems, all while maintaining simplicity, reproducibility, and self-containment. We release COMPACTDS and our retrieval pipeline, supporting future research exploring retrieval-based AI systems.
* Equal contribution. Preprint. Under review.
Common Crawl, e.g., five billion tokens [6, 34, 4]. These efforts were still evaluated on perplexity or Wikipedia-based benchmarks (except for [6, 8] on MMLU, which we compare against).
We argue that prior datastores are either too narrow or too small to be broadly effective, or not practically usable. For example, MASSIVEDS [8] requires over 12TB of RAM to avoid multi-minute latency, making deployment infeasible in typical academic settings without distributed infrastructure. This work directly addresses these issues by proposing a datastore that is large and broad in coverage, yet compact enough to enable subsecond latency in a single-node deployment.
Agentic RAG. Recently, agentic RAG, which iteratively issues search queries, retrieves information, and reasons over results to perform reasoning-intensive tasks, has emerged as an active area of research. These approaches can be broadly divided into two categories: (1) prompt-based methods that do not require training [15, 16], and (2) training-based methods that fine-tune a reasoning LM to use search, typically via reinforcement learning [17, 18, 19, 20]. Much of this work uses web search engines, which are costly, hard to reproduce, and unstable, making them unsuitable for training, as also noted by [20]. Consequently, most training-based work uses an in-house Wikipedia datastore and evaluates only on Wikipedia-based benchmarks. Instead of optimizing for agentic RAG, our work focuses on minimal RAG, a fundamental building block that can be easily integrated into any retrieval-based AI system. The agentic RAG literature, however, highlights an emerging need for high-quality, general-purpose in-house datastores, particularly to enhance reproducibility, improve stability, and ensure cost efficiency.
Method
Two key ideas enable a high-quality, high-coverage retrieval datastore: data sources that match the breadth of pretraining corpora while filtering out low-quality web text (§3.1), and approximate nearest neighbor (ANN) search followed by exact search (§3.2). We discuss each component, then describe how an LLM is augmented with this retrieval (§3.3).
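The second idea, a cheap approximate pass followed by exact search over a short candidate list, can be illustrated with a minimal sketch. It is not the released COMPACTDS pipeline: the random-projection coarse index merely stands in for an in-memory ANN index, a NumPy array stands in for full-precision vectors read from disk, and the names `build_coarse_index`, `two_stage_search`, `n_candidates`, and `dim_coarse` are hypothetical.

```python
import numpy as np


def build_coarse_index(vectors, dim_coarse, seed=0):
    """Project vectors to a lower dimension (a stand-in for an in-memory ANN index)."""
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((vectors.shape[1], dim_coarse)) / np.sqrt(dim_coarse)
    return proj, vectors @ proj


def two_stage_search(query, proj, coarse_vectors, full_vectors, n_candidates, k):
    """Return the ids and exact scores of the top-k documents among the candidates."""
    # Stage 1 (approximate): score every document with the small projected vectors.
    coarse_scores = coarse_vectors @ (query @ proj)
    n_candidates = min(n_candidates, len(coarse_scores))
    candidates = np.argpartition(-coarse_scores, n_candidates - 1)[:n_candidates]

    # Stage 2 (exact): rescore only the candidates with full-precision vectors.
    # In a real deployment these rows would be read from disk, not from RAM.
    exact_scores = full_vectors[candidates] @ query
    order = np.argsort(-exact_scores)[:k]
    return candidates[order], exact_scores[order]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    docs = rng.standard_normal((1000, 64))   # hypothetical document embeddings
    query = rng.standard_normal(64)          # hypothetical query embedding
    proj, coarse = build_coarse_index(docs, dim_coarse=16)
    ids, scores = two_stage_search(query, proj, coarse, docs, n_candidates=100, k=5)
    print(ids, scores)
```

In this sketch, `n_candidates` controls the trade-off: a larger candidate list costs more exact rescoring but is less likely to miss a true nearest neighbor, and setting it to the datastore size reduces the procedure to brute-force exact search.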
COMPACTDS Data Sources
To match the breadth of pretraining corpora while achieving high quality and diversity, we construct COMPACTDS from the following data sources:
Web Crawl. To ensure wide coverage, we start with Common Crawl, which is widely used for pretraining and also constitutes 70% of MASSIVEDS [8]. However, we hypothesize that much of it is low-quality and unnecessary for retrieval. Therefore, we construct a compact, high-quality subset, High-quality CC, using a series of filtering steps. We take the union of C4 [35], a small curated subset, and DCLM-Baseline [36], which has undergone extensive manual and model-based filtering. We further filter DCLM-Baseline using the FineWeb-Edu classifier [33], which scores text by its educational value, with a threshold of 4.0. Overall, this process reduces the size of