ACL 2025

PM3-KIE: A Probabilistic Multi-Task Meta-Model for Document Key Information Extraction

Birgit Kirsch, Héctor Allende-Cid, Stefan Rüping

Abstract

Key Information Extraction (KIE) from visually rich documents is commonly approached as either fine-grained token classification or coarse-grained entity extraction. While token-level models capture spatial and visual cues, entity-level models better represent logical dependencies and align with real-world use cases. We introduce PM3-KIE, a probabilistic multi-task meta-model that incorporates both fine-grained and coarse-grained models. It serves as a lightweight reasoning layer that jointly predicts entities and all of their appearances in a document. PM3-KIE incorporates domain-specific schema constraints to enforce logical consistency and integrates large language models for semantic validation, thereby reducing extraction errors. Experiments on two public datasets, DeepForm and FARA, show that PM3-KIE outperforms three state-of-the-art models and a stacked ensemble, achieving a statistically significant 2% improvement in F1 score.

Introduction

Key Information Extraction (KIE) focuses on identifying structured key-value pairs from visually rich documents (VRDs) (Huang et al., 2019) based on a predefined schema that specifies the target information types. This capability is essential for automating business document processing across industries such as finance and law. Automating information extraction significantly reduces operational costs: processing a single invoice, for example, can cost $13.11 and take up to eight days (Girsch-Bock, 2024; Cohen and York, 2020). Despite recent advances, extracting structured data remains challenging for state-of-the-art models, particularly for documents with complex schemas or semi-structured layouts (Wang et al., 2023b). KIE can generally be approached through two distinct paradigms, as illustrated in Figure 1: