KDD 2026
Scaling Recommender Transformers to One Billion Parameters
Kirill Khrylchenko, Artem Matveev, Sergei S. Makeev, Vladimir Baikalov
Abstract
Large transformer models have been applied successfully in many real-world domains, including natural language processing, computer vision, and speech processing, yet scaling transformers for recommender systems remains a challenging problem. Recently, the Generative Recommenders framework was proposed as a way to scale beyond typical Deep Learning Recommendation Models (DLRMs). By reformulating recommendation as a sequential transduction task, it improves scaling properties in terms of compute. Nevertheless, the largest encoder configuration reported by the HSTU authors has only 176 million parameters, far fewer than the hundreds of billions (or even trillions) that are now common in language models. In this work, we present a recipe for training large transformer recommenders with up to one billion parameters. We show that autoregressive learning on user histories naturally decomposes into two subtasks, feedback prediction and next-item prediction, and demonstrate that this decomposition scales effectively across a wide range of transformer sizes. Furthermore, we report a successful deployment on a large-scale music platform serving millions of users. In online A/B tests, the proposed model increases total listening time by 2.26% and raises the likelihood of user likes by 6.37%, which is, to our knowledge, the largest improvement in recommendation quality reported for any deep-learning-based system in the platform's history.
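One way to read the decomposition into next-item prediction and feedback prediction is as an application of the chain rule of probability. The following is a sketch under assumed notation, not the paper's own formulation: $h_{<t}$ (the user history before step $t$), $i_t$ (the next item), and $f_t$ (the feedback on that item) are illustrative symbols introduced here. With these, the per-step autoregressive log-likelihood splits into two terms:

```latex
\log p(i_t, f_t \mid h_{<t})
  = \underbrace{\log p(i_t \mid h_{<t})}_{\text{next-item prediction}}
  + \underbrace{\log p(f_t \mid i_t, h_{<t})}_{\text{feedback prediction}}
```

The conditioning actually used in the paper (for example, any additional context features) may differ from this simplified form.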