ICLR2026
LeSTD: LLM Compression via Learning-based Sparse Tensor Decomposition
Yi Li, Zhichun Guo, Miao Yin, Bingzhe Li
Abstract
Large language models (LLMs) deliver the impressive capability, while their parameter scales hinder the deployment ability. Post-training matrix/tensor decomposition offers a promising strategy to alleviate this by exploiting structural redundancies within model weights. However, it faces the critical dense core bottleneck. This bottleneck caps achievable compression level as dense core tensor becomes a new storage burden. To solve this, we introduce LeSTD (Learningbased Sparse Tensor Decomposition), a two-stage, data-free compression framework. LeSTD first learns a high-quality shared basis for model weights, then applies a theoretically-ground pruning mechanism: guided by a derived closedform importance score, to create an ultra-sparse core tensor. Therefore, resulting a superior compression-accuracy trade-off: LeSTD achieves substantially higher compression ratios than dense-core methods without sacrificing performance. Experiments on the LLMs up to 30B parameters confirm that LeSTD consistently attains lower perplexity and higher task accuracy at matched compression levels, and critically, maintains strong performance under aggressive compression where prior methods degrade. Operationally, LeSTD executes the inference directly in its compressed domain, delivering significant throughput gains on standard hardware without requiring any custom kernels. along each mode, while the core tensor encodes the remaining cross-mode interactions. While the factor matrices can be relative small, the core tensor remains fully dense (Ahmadi-Asl et al., 2021; Wang & Yang, 2022) . To preserve model accuracy, the Tucker ranks must be sufficiently large, but the size of the core tensor grows polynomial with these ranks. This dense core rapidly becomes the new storage bottleneck, imposing a hard limit on the achievable compression ratio. This limitation exposes a fundamental gap: current tensor-based methods reduce dimensionality but fail to eliminate redundancy within the compressed latent space itself. To achieve the truly high-ratio compression, we must moving beyond low-rank approximation and into the domain of sparse tensor representation (Park et al., 2021) . To fill this critical gap, we propose LeSTD (Learning-based Sparse Tensor Decomposition), a framework that synergistically combines iterative basis optimization with the learned core tensor sparsity. LeSTD operates in two stages: first, it optimizes a high-quality shared basis for all attention heads; second, it learns an ultra-sparse representation for the core tensor within that basis. This integrated approach yields a representation that is both more accurate and vastly more compact. Main contributions of LeSTD are as follows: 1. The proposal of LeSTD, a data-free post-training compression framework where Stage I learns a high-quality shared basis via iterative optimization, and Stage II introduces a principled pruning strategy to create an ultra-sparse core tensor. 2. We provide a theoretical justification for the magnitude-based pruning in the Tucker-decomposed latent space. We derive a closed-form importance score for each core element, directly linking its magnitude to its impact on the Frobenius reconstruction error. This allows for a principled, rather than purely heuristic, sparsification of the core 3. We demonstrate how inference can operate directly on the compressed representation, avoiding the full weight reconstruction. It reduces arithmetic complexity and delivers practical throughput gains (tokens/sec) measured directly within the standard Transformers library (Wolf et al., 2020), requiring no specialized hardware or custom kernels. 4. Across GPT-J (6B), Llama2 (13B), and OPT (30B) on WikiText-2, MathQA, GSM8K, and Truth-fulQA, LeSTD consistently outperforms baselines at matched size fractions: maintaining higher accuracy under strong compression and delivering competitive-to-superior throughput. BACKGROUND AND PRELIMINARY Input Embedding Positional Encoding