CVPR 2025

DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding

Wenhui Liao, Jiapeng Wang, Hongliang Li, Chengyu Wang, Jun Huang, Lianwen Jin

Abstract

We adopt the pre-training tasks listed in Table A. These tasks help align layout and visual features with the LLM's feature space while enhancing the LLM's understanding of document content.