AAAI2026
Intermediate N-Gramming: Deterministic and Fast N-Grams for Large N and Large Datasets
Ryan R. Curtin, Fred Lu, Edward Raff, Priyanka Ranade
Abstract
The number of n-gram features grows exponentially in n, making it computationally demanding to compute the most frequent n-grams even for n as small as 3. Motivated by our production machine learning system built on n-gram features, we ask: is it possible to accurately, deterministically, and quickly recover the top-k most frequent n-grams? We devise a multi-pass algorithm called Intergrams that constructs candidate n-grams from the preceding (n -1)-grams. By designing this algorithm with hardware in mind, our approach yields more than an order of magnitude speedup (up to 33×!) over the next known fastest algorithm, even when similar optimization are applied to the other algorithm. Using the empirical power-law distribution over n-grams, we also provide theory to inform the efficacy of our multi-pass approach. Our code is available at https://github.com/rcurtin/Intergrams .