ACL 2025

mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding

Anwen Hu, Haiyang Xu, Liang Zhang, Jiabo Ye, Ming Yan, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou

Abstract

image Continue-pretraining, and Multi-task Finetuning. DocOwl2 sets a new state-of-the-art across multi-page document understanding benchmarks and reduces first token latency by more than 50%. Compared to single-image MLLMs trained on similar data, DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens. Our code, models, and data are released at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl2.