ICLR 2025
Towards Semantic Equivalence of Tokenization in Multimodal LLM
Shengqiong Wu, Hao Fei, Xiangtai Li, Jiayi Ji, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan
Summary
https://sqwu.top/SeTok-web/
• Key design: vision tokenization, i.e., converting input visual signals into visual tokens
Existing MLLMs
• Fixed square patches fragment objects across multiple patches and disrupt the integrity of visual semantic units
• Codebooks introduce information loss
• Patch-level continuous or discrete tokens
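
The fragmentation problem with fixed patches can be seen in a toy example. The following is a minimal Python sketch, not code from the paper: the helper names `patch_index` and `patches_covering`, the 8x8 image, and the patch size of 4 are all assumptions chosen for illustration. It counts how many fixed square patches a single small object falls into.

```python
def patch_index(row, col, patch_size, grid_width):
    """Return the flat index of the fixed square patch containing pixel (row, col)."""
    patches_per_row = grid_width // patch_size
    return (row // patch_size) * patches_per_row + (col // patch_size)


def patches_covering(mask, patch_size):
    """Return the sorted patch indices touched by the nonzero pixels of a binary mask."""
    width = len(mask[0])
    touched = set()
    for r, row in enumerate(mask):
        for c, value in enumerate(row):
            if value:
                touched.add(patch_index(r, c, patch_size, width))
    return sorted(touched)


# Hypothetical toy input: an 8x8 image holding one 3x3 object (rows 3-5, cols 3-5)
# that straddles the patch boundaries at row 4 and column 4.
SIZE, PATCH = 8, 4
mask = [[1 if 3 <= r <= 5 and 3 <= c <= 5 else 0 for c in range(SIZE)]
        for r in range(SIZE)]

fragments = patches_covering(mask, PATCH)
print(fragments)  # the single object is spread over all four patches: [0, 1, 2, 3]
```

Although the object is one semantic unit, a fixed 4x4 grid assigns its pixels to four different tokens. This is the kind of split that the notes above attribute to patch-level tokenization.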