WWW2026

SEER: Set Encoding for Efficient Representation in Large-Scale E-commerce

Yining Lin, Yuming Shen, Yipeng Zhang, Canran Xu

Abstract

Large Language Models (LLMs) perform well across many tasks but degrade when processing large collections of repetitive or highly similar inputs, a common scenario in applications such as near-duplicate search results and large e-commerce catalogs. In these settings, concatenation-based approaches (long-context prompting and supervised fine-tuning) suffer from attention saturation and a diminished signal-to-noise ratio, causing models to miss subtle but important distinctions as input size grows. We introduce SEER (Set Encoding for Efficient Representation), a framework that enables LLMs to handle massive sets of near-duplicate items through a single learned token. SEER first encodes individual items with a pretrained embedding model, then aggregates them using an adapter that captures inter-item relationships and preserves fine-grained differences while mitigating redundancy. To ensure both discriminative and generative capabilities, we propose a multi-task alignment strategy that supervises set-level descriptions across multiple semantic dimensions. Experiments on a large-scale e-commerce dataset demonstrate that SEER substantially outperforms in-context and fine-tuned LLM baselines, maintaining stable performance even when processing thousands of highly similar items. These results establish SEER as an effective and scalable approach for LLM processing of dense, redundant input sets.
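
The pipeline summarized above (per-item embeddings, an aggregating adapter, a single output token) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the adapter architecture, so the design used here (self-attention across items followed by pooling with one query vector), the names `encode_set`, `pool_query`, and `projection`, the dimensions, and the random values standing in for pretrained embeddings and learned parameters are all assumptions made for illustration.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention; returns one output row per query."""
    scale = math.sqrt(len(keys[0]))
    width = len(values[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)]
        )
    return outputs


def encode_set(item_embeddings, pool_query, projection):
    """Compress a set of item embeddings into one LLM-sized vector."""
    # Step 1: items attend to one another, so each row reflects its neighbours.
    mixed = attention(item_embeddings, item_embeddings, item_embeddings)
    # Step 2: a single query (random here, learned in practice) pools the set.
    pooled = attention([pool_query], mixed, mixed)[0]
    # Step 3: project the pooled vector to the (assumed) LLM hidden size.
    return [dot(row, pooled) for row in projection]


rng = random.Random(0)
EMBED_DIM, LLM_DIM, NUM_ITEMS = 8, 16, 200  # illustrative sizes only

# Stand-ins for the outputs of a pretrained embedding model (assumption).
items = [[rng.gauss(0, 1) for _ in range(EMBED_DIM)] for _ in range(NUM_ITEMS)]
# Stand-ins for learned adapter parameters (assumption).
pool_query = [rng.gauss(0, 1) for _ in range(EMBED_DIM)]
projection = [[rng.gauss(0, 1) for _ in range(EMBED_DIM)] for _ in range(LLM_DIM)]

set_token = encode_set(items, pool_query, projection)
print(len(set_token))  # length equals LLM_DIM, whatever NUM_ITEMS is
```

In this sketch the output has a fixed size regardless of how many items the set contains, and because no positional information is used, the result does not depend on item order (up to floating-point rounding). Both properties are consequences of the assumed design, not claims about SEER itself.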