ICLR 2025

Towards Semantic Equivalence of Tokenization in Multimodal LLM

Shengqiong Wu, Hao Fei, Xiangtai Li, Jiayi Ji, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan

Abstract

https://sqwu.top/SeTok-web/

• Key design: vision tokenization, i.e., converting input visual signals into visual tokens

Existing MLLMs
• Fixed square patches fragment objects across multiple patches and disrupt the integrity of visual semantic units
• Codebooks introduce information loss
• Tokens are patch-level, whether continuous or discrete
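The fragmentation caused by fixed square patches can be illustrated with a toy example. The sketch below is not the paper's method; it only mimics patch-level tokenization on a small grid. The names `patchify` and `patches_touched`, the 8x8 image size, and the 4x4 patch size are all assumptions chosen for illustration.

```python
def patchify(height, width, patch):
    """Map each pixel (row, col) to the index of the fixed square patch containing it."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    cols = width // patch  # number of patches per row
    return [[(r // patch) * cols + (c // patch) for c in range(width)]
            for r in range(height)]


def patches_touched(patch_ids, mask):
    """Return the sorted patch indices that contain at least one object pixel."""
    touched = set()
    for r, row in enumerate(mask):
        for c, is_object in enumerate(row):
            if is_object:
                touched.add(patch_ids[r][c])
    return sorted(touched)


# Toy 8x8 image holding one 4x4 object centred in it (hypothetical data),
# tokenized with fixed 4x4 patches.
H = W = 8
P = 4
object_mask = [[2 <= r < 6 and 2 <= c < 6 for c in range(W)] for r in range(H)]
patch_ids = patchify(H, W, P)
fragments = patches_touched(patch_ids, object_mask)
print(fragments)  # → [0, 1, 2, 3]
```

Although the object is exactly the size of one patch, it straddles the patch grid and ends up split across all four tokens, so no single token covers the whole semantic unit.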