ICLR 2026

SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models

Thinh Pham, Nguyen Phan Nguyen, Pratibha Zunjare, Weiyuan Chen, Yu-Min Tseng, Tu Vu

Cited by 45

Abstract

We introduce SEALQA, a challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results. SEALQA comes in three flavors: (1) SEAL-0 (main) and (2) SEAL-HARD, both of which assess factual accuracy and reasoning capabilities, where SEAL-0 targets the most challenging questions that frontier non-reasoning models (e.g., GPT-4.1) answer with near-zero accuracy; and (3) LONGSEAL, which extends SEALQA to test long-context, multi-document reasoning in "needle-in-a-haystack" settings. Our evaluation reveals critical limitations in current models. Even frontier reasoning models face significant challenges across SEALQA flavors. On SEAL-0, GPT-5 with tools achieves only 43.2% accuracy at its best reasoning effort. We also find that even advanced reasoning models (e.g., DEEPSEEK-R1) can be vulnerable to noisy search results. Notably, increasing test-time compute does not yield reliable gains across GPT-5 and the O-series of models, with performance often plateauing or even declining early. Finally, while current models are less affected by the "lost-in-the-middle" issue, they still fail to reliably identify relevant documents in LONGSEAL when faced with numerous distractors. To facilitate future work, we release SEALQA at https://huggingface.co/datasets/vtllms/sealqa.

1 Each question required over an hour on average: roughly 45 minutes to draft, plus additional time for review and revision. Many initial ideas were discarded as they failed to meaningfully challenge frontier LLMs.

2 For example, the widely used GPQA-DIAMOND (Rein et al., 2024), a compact set of 198 expert-vetted questions, demonstrates how a small, carefully curated dataset can effectively assess a model's reasoning ability.

3 Our questions often lead multiple models to fail across repeated attempts.