EMNLP 2025

Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index

Hao Xu, Jiacheng Liu, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi


Abstract

Language models are trained mainly on massive text data from the Internet, and it becomes increasingly important to understand this data source. Exact-match search engines enable searching in large text corpora (counting string appearances and retrieving the enclosing documents), yet the high storage overhead hinders their application on Internet-scale data. We present INFINI-GRAM MINI, an efficient and scalable system that can make petabyte-level text corpora searchable. Based on the FM-index data structure (Ferragina and Manzini, 2000), which simultaneously indexes and compresses text, our system creates indexes with size only 44% of the corpus. INFINI-GRAM MINI greatly improves upon the best existing implementation of FM-index in terms of indexing speed (18×) and memory use during both indexing (3.2× reduction) and querying (down to a negligible amount). We index 83TB of Internet text in 99 days with a single CPU node with 128 vCPUs (or 19 hours if using 137 such nodes). We show one important use case of INFINI-GRAM MINI in a large-scale analysis of benchmark contamination. We find several core LM evaluation benchmarks to be heavily contaminated in Internet crawls (up to 74.2% in GSM8K), which could lead to overestimating the capabilities of language models if trained on such data. We host a benchmark contamination bulletin to share the contamination rate of many core and community-contributed benchmarks. We also release a web interface and an API endpoint to serve general search queries on INFINI-GRAM MINI indexes.

Project Home: infini-gram-mini.io
Web Interface: infini-gram-mini.io/demo
API Endpoint: api.infini-gram-mini.io
Documentation: infini-gram-mini.io/docs
Source Code: infini-gram-mini.io/code
Contam Bulletin: infini-gram-mini.io/bulletin
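To make the counting operation mentioned in the abstract concrete, here is a minimal Python sketch of how an FM-index answers "how many times does this string appear?" using backward search over the Burrows-Wheeler transform. It is a toy illustration only and not the INFINI-GRAM MINI implementation: the function names `build_fm_index` and `count` are hypothetical, the suffix array is built by naive sorting, and the occurrence table is stored uncompressed, whereas a practical system would rely on compressed rank structures and far more scalable construction.

```python
from collections import Counter


def build_fm_index(text, sentinel="\0"):
    """Build a toy FM-index: the C table and a full Occ table over the BWT.

    Hypothetical helper for illustration; assumes `sentinel` does not occur
    in `text` and sorts before every other character.
    """
    assert sentinel not in text
    s = text + sentinel
    # Naive suffix array: sort suffix start positions by the suffix itself.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # BWT: the character preceding each sorted suffix (wraps to the sentinel).
    bwt = "".join(s[i - 1] for i in sa)
    # c_table[ch] = number of characters in s strictly smaller than ch.
    counts = Counter(s)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[ch][i] = number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in counts:
            occ[ch][i + 1] = occ[ch][i] + (1 if ch == b else 0)
    return c_table, occ, len(s)


def count(pattern, index):
    """Count occurrences of a non-empty pattern via backward search."""
    c_table, occ, n = index
    lo, hi = 0, n  # half-open range of suffix-array rows matching so far
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("abracadabra")
print(count("abra", index))
print(count("cad", index))
```

Each step of the loop narrows the range of matching suffixes using only the C and Occ tables, so the query never scans the original text. That property is what lets an FM-index serve as both the index and a compressed representation of the corpus.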