ACL 2025

CoIR: A Comprehensive Benchmark for Code Information Retrieval Models

Xiangyang Li, Kuicai Dong, Yi Quan Lee, Wei Xia, Hao Zhang, Xinyi Dai, Yasheng Wang, Ruiming Tang

Abstract

Despite the substantial success of Information Retrieval (IR) in various NLP tasks, most IR systems handle queries and corpora in natural language only, neglecting the domain of code retrieval. Code retrieval is critically important yet remains under-explored, and existing methods and benchmarks inadequately represent the diversity of code across domains and tasks. Moreover, many models have begun to overfit existing leaderboards, limiting their generalizability and real-world applicability. To address this gap, we introduce CoIR (Code Information Retrieval Benchmark), a robust and comprehensive benchmark specifically designed to evaluate code retrieval capabilities. CoIR consists of ten meticulously curated code datasets, all of which have undergone thorough manual inspection and processing. These datasets cover eight distinct retrieval tasks across seven diverse domains, ensuring a broad and rigorous assessment of code retrieval performance. We first discuss the construction of CoIR and its diverse dataset composition. We then evaluate ten widely used retrieval models on CoIR and find that even state-of-the-art systems have significant difficulty with code retrieval tasks. To ensure seamless integration, CoIR is released as a user-friendly Python framework that follows the data schema of MTEB and BEIR for consistent cross-benchmark evaluation. Through CoIR, we aim to invigorate research in the code retrieval domain by providing a versatile benchmarking tool that encourages further development and exploration of code retrieval systems.
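
To make the shared data schema concrete, the following minimal sketch shows the corpus/queries/qrels layout used by BEIR-style benchmarks, together with a toy evaluation loop. It is an illustration under stated assumptions, not the benchmark's actual API: the sample documents, the token-overlap `retrieve` function, and the choice of nDCG@10 as the metric are all hypothetical and are not taken from the abstract.

```python
import math
import re

# BEIR-style layout: documents, queries, and relevance judgments (qrels)
# are plain dictionaries keyed by string IDs. The entries are made up.
corpus = {
    "d1": {"title": "", "text": "def add(a, b): return a + b"},
    "d2": {"title": "", "text": "def read_file(path): return open(path).read()"},
    "d3": {"title": "", "text": "def multiply(a, b): return a * b"},
}
queries = {"q1": "add two numbers", "q2": "read a file from a path"}
qrels = {"q1": {"d1": 1}, "q2": {"d2": 1}}


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, corpus, k=10):
    """Hypothetical stand-in retriever: rank documents by token overlap."""
    q_tokens = tokenize(query)
    scores = {
        doc_id: len(q_tokens & tokenize(doc["title"] + " " + doc["text"]))
        for doc_id, doc in corpus.items()
    }
    # Sort by descending score, breaking ties by document ID.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in ranked[:k]]


def ndcg_at_k(ranked_ids, relevant, k=10):
    """nDCG@k for one query, given a ranking and its relevance labels."""
    dcg = sum(
        relevant.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    ideal = sorted(relevant.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def evaluate(corpus, queries, qrels, k=10):
    """Mean nDCG@k over all queries that have relevance judgments."""
    scores = [
        ndcg_at_k(retrieve(text, corpus, k), qrels[qid], k)
        for qid, text in queries.items()
        if qid in qrels
    ]
    return sum(scores) / len(scores)


mean_ndcg = evaluate(corpus, queries, qrels)
```

Because the three dictionaries are the only interface between data and model, swapping `retrieve` for a dense or sparse retriever leaves the evaluation loop unchanged, which is the practical benefit of a schema shared across benchmarks.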