ICLR2026
InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models
Haomin Wang, Jinhui Yin, Qi Wei, Wenguang Zeng, Lixin Gu, Shenglong Ye, Zhangwei Gao, Yaohui Wang, Yanting Zhang, Yuanqi Li, Yanwen Guo, Wenhai Wang, Kai Chen, Yu Qiao, Hongjie Zhang
Abstract
Published as a conference paper at ICLR 2026
tokens for tags, attributes, and coordinates to reduce sequence length while retaining geometric and hierarchical structure. These tokens are initialized with a subword-based strategy that anchors them in the pretrained embedding space, stabilizing early training and accelerating convergence. Training adopts a two-stage strategy that progresses from short static SVGs to longer illustrations and complex animations.
Through extensive experiments, we demonstrate that unified modeling effectively improves performance across understanding, editing, and generation tasks. Comprehensive evaluations further show that InternSVG surpasses both open-source and proprietary models on SArena and previous benchmarks. For example, on the SArena-Icon benchmark, InternSVG outperforms Claude-Sonnet-4, the strongest proprietary baseline on SVG tasks, with about 11% higher accuracy on understanding tasks, 34% higher PSNR on editing tasks, 56% lower FID on Text-to-SVG tasks, and 22% higher SSIM on Image-to-SVG tasks.
In summary, our contributions are as follows:
(1) We construct SAgoge, the largest and most comprehensive multimodal SVG dataset to date, encompassing static graphics and animations with over 16 million training samples. To enable rigorous and comparable evaluation, we further establish SArena, a companion benchmark that standardizes tasks and metrics across SVG understanding, editing, and generation.
(2) We propose InternSVG, a unified MLLM for SVG understanding, editing, and generation. It introduces SVG-specific tokenization with subword-initialized special tokens and adopts a two-stage training strategy to support effective cross-task generalization.
(3) We conduct extensive experiments to demonstrate the benefits of unified modeling. The results on SArena and prior benchmarks show that InternSVG outperforms traditional approaches as well as general-purpose open-source and proprietary models.
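As a rough illustration of the subword-based initialization of special tokens mentioned above, the sketch below seeds each new token's embedding with the mean of the pretrained embeddings of the subwords that spell out the text it abbreviates. The averaging rule, the token names (`<|tag_path|>`, `<|attr_fill|>`), the toy vocabulary, and the helper functions are all assumptions made for illustration; the text here does not specify the exact procedure.

```python
# Minimal sketch of subword-based initialization for new special tokens.
# Assumption: a new token's embedding is the mean of the pretrained embeddings
# of the subwords spelling out the text it replaces. This rule is illustrative
# and is not confirmed by the text above.

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def toy_tokenize(text, base_vocab):
    """Greedy longest-match tokenizer over a toy vocabulary (hypothetical).

    Raises ValueError if some position of the text matches no vocabulary entry.
    """
    pieces, i = [], 0
    while i < len(text):
        match = max((t for t in base_vocab if text.startswith(t, i)), key=len)
        pieces.append(match)
        i += len(match)
    return pieces


def add_special_tokens(vocab, embeddings, special_tokens):
    """Append special tokens, seeding each embedding from its subwords.

    vocab: dict mapping token -> row index into embeddings
    embeddings: list of vectors (the toy "pretrained" embedding table)
    special_tokens: dict mapping new token -> text it abbreviates
    """
    base_vocab = dict(vocab)  # tokenize only with the original vocabulary
    for token, text in special_tokens.items():
        pieces = toy_tokenize(text, base_vocab)
        vectors = [embeddings[base_vocab[p]] for p in pieces]
        vocab[token] = len(embeddings)
        embeddings.append(mean_vector(vectors))
    return vocab, embeddings


# Toy pretrained vocabulary with 2-d embeddings (made-up values).
vocab = {"<": 0, "path": 1, "fill": 2, "=": 3}
embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.0, 4.0]]

# Hypothetical special tokens and the SVG text each one stands for.
special_tokens = {"<|tag_path|>": "<path", "<|attr_fill|>": "fill="}

vocab, embeddings = add_special_tokens(vocab, embeddings, special_tokens)
print(embeddings[vocab["<|tag_path|>"]])   # mean of the "<" and "path" vectors
print(embeddings[vocab["<|attr_fill|>"]])  # mean of the "fill" and "=" vectors
```

Starting each new token inside the region of embedding space already occupied by related subwords, rather than at a random point, is one plausible reading of how such an initialization could stabilize early training.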
RELATED WORKS
2.1 SVG DATASETS AND BENCHMARKS
Most existing SVG datasets and benchmarks are limited in task coverage or data type and remain too small for effective model training, leading to fragmented evaluations and limited insight into generalization across tasks and complexity levels. SGP-Bench (Qiu et al., 2024) evaluates semantic comprehension and consistency in symbolic graphics programs. SVGEditBench (Nishina & Matsui, 2024) and its extension V2 (Nishina & Matsui, 2025) focus narrowly on instruction-based SVG editing measured by low-level syntactic metrics. On the generative side, SVG-Stack (Rodriguez et al., 2025), SVGX (Xing et al., 2025), MMSVG (Yang et al., 2025b), and ColorSVG-100K (Chen & Pan, 2025) address Text-to-SVG and Image-to-SVG generation, while VGBench (Zou et al., 2024) and UniSVG (Li et al., 2025) jointly evaluate understanding and generation. DeepSVG (Carlier et al., 2020) introduces a dataset of 100K SVG icons and explores generation, interpolation, and latent-space animation of static and limited animated graphics, but lacks rich editing instructions and image-conditioned generation. SVGenius (Chen et al., 2025) introduces a comprehensive benchmark covering understanding, editing, and generation with systematic complexity levels and multidimensional metrics, but it includes only about 2,400 queries, which is sufficient for evaluation yet inadequate for training.
In contrast, our SAgoge is substantially larger and more diverse, encompassing both static graphics and SVG animations. It unifies SVG understanding, editing, and generation. With approximately 16M task samples, its scale and diversity enable robust model training and comprehensive evaluation across the full spectrum of SVG tasks, addressing the coverage and scalability limitations of prior datasets.
SVG MODELING METHODS
Early research on SVG modeling treated vector graphics as sequences of geometric primitives and relied on specialized generative architectures trained on limited domains (