ACL 2025
Rethinking KenLM: Good and Bad Model Ensembles for Efficient Text Quality Filtering in Large Web Corpora
Yungi Kim, Hyunsoo Ha, Sukyung Lee, Jihoo Kim, Seonghoon Yang, Chanjun Park
Abstract
With the increasing demand for substantial amounts of high-quality data to train large language models (LLMs), efficiently filtering large web corpora has become a critical challenge. For this purpose, KenLM, a lightweight n-gram-based language model that operates on CPUs, is widely used. However, the traditional method of training KenLM utilizes only high-quality data and, consequently, does not explicitly learn the linguistic patterns of low-quality data. To address this issue, we propose an ensemble approach that leverages two contrasting KenLMs: (i) Good KenLM, trained on high-quality data; and (ii) Bad KenLM, trained on low-quality data. Experimental results demonstrate that our approach significantly reduces noisy content while preserving high-quality content compared to the traditional KenLM training method. This indicates that our method can be a practical solution with minimal computational overhead for resource-constrained environments.
* Equal Contribution † Corresponding Author
These methods typically require GPU resources, which makes them impractical, especially when processing data that exceeds trillions of tokens. To efficiently filter large datasets, the most widely used method is KenLM (Heafield, 2011), a lightweight n-gram-based model that operates on CPUs. Many studies (Wenzek et al., 2019; Computer, 2023; Nguyen et al., 2023; Laurençon et al., 2024) use a KenLM trained on the high-quality Wikipedia dataset and measure perplexity (PPL) to identify low-quality content. Note that higher PPL scores indicate lower-quality or out-of-domain text, while lower PPL scores suggest that the text closely resembles the linguistic patterns of the high-quality data used to train KenLM. Low-quality data with high PPL scores are then filtered out. We argue that the traditional KenLM does not explicitly learn the linguistic patterns of low-quality data.
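The perplexity-based filtering described above can be illustrated with a minimal sketch. It assumes PPL is computed as 10 raised to the negative average log10 probability per token, which is how n-gram toolkits commonly report it. The function names, the threshold value, and the toy add-one-smoothed unigram scorer are all hypothetical stand-ins chosen for illustration; in practice the scorer would be replaced by a trained KenLM model, and the threshold would be tuned on held-out data.

```python
import math
from collections import Counter
from typing import Callable, Iterable, List


def perplexity(log10_prob: float, num_tokens: int) -> float:
    """Convert a total log10 probability into per-token perplexity."""
    return 10.0 ** (-log10_prob / num_tokens)


def make_unigram_scorer(corpus: Iterable[str]) -> Callable[[str], float]:
    """Toy stand-in for a trained language model (NOT KenLM itself).

    Returns a function that gives the total log10 probability of a text
    under an add-one-smoothed unigram model built from `corpus`.
    """
    counts = Counter(tok for line in corpus for tok in line.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen tokens

    def score(text: str) -> float:
        return sum(
            math.log10((counts[tok] + 1) / (total + vocab))
            for tok in text.lower().split()
        )

    return score


def filter_by_perplexity(
    docs: Iterable[str],
    score: Callable[[str], float],
    max_ppl: float,
) -> List[str]:
    """Keep documents whose perplexity does not exceed `max_ppl`."""
    kept = []
    for doc in docs:
        num_tokens = len(doc.split())
        if num_tokens == 0:
            continue  # empty documents carry no signal; drop them
        if perplexity(score(doc), num_tokens) <= max_ppl:
            kept.append(doc)
    return kept


if __name__ == "__main__":
    # Hypothetical "high-quality" reference text for the toy scorer.
    reference = ["the cat sat on the mat", "the dog sat on the rug"]
    scorer = make_unigram_scorer(reference)
    candidates = ["the cat sat on the mat", "win free cash now"]
    # The threshold of 10.0 is an arbitrary illustrative value.
    print(filter_by_perplexity(candidates, scorer, max_ppl=10.0))
```

In this toy setting, text resembling the reference data receives a low PPL and is kept, while text made up entirely of unseen tokens receives a high PPL and is discarded. A scorer trained only on high-quality data has no explicit model of what noisy text looks like, which is the limitation discussed next.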
Thus, while it assigns low PPL scores to data with high-quality linguistic patterns, it does not consistently assign high PPL scores to data with low-quality linguistic patterns. To address this issue, we propose an ensemble approach that utilizes the following two contrasting KenLMs: (i) Good KenLM, trained on high-quality data; and (ii) Bad KenLM, trained on noisy, low-quality data such as spam emails, hate speech, and informal social media text. Our empirical results show that this approach significantly reduces noisy content while preserving high-quality content compared to the traditional KenLM training method, making it a practical solution with minimal computational overhead for resource-constrained environments.
Related Work
As the demand for vast amounts of high-quality training data grows, it has become essential to filter large web corpora both effectively and efficiently. Among various filtering methods,