EMNLP 2025
SEAL: Structure and Element Aware Learning Improves Long Structured Document Retrieval
Xinhao Huang, Zhibo Ren, Yipeng Yu, Ying Zhou, Zulong Chen, Zeyi Wen
Abstract
In long structured document retrieval, existing methods typically fine-tune pre-trained language models (PLMs) using contrastive learning on datasets that lack explicit structural information. This practice suffers from two critical issues: 1) current methods fail to leverage structural features and element-level semantics effectively, and 2) there is a lack of datasets containing structural metadata. To bridge these gaps, we propose SEAL, a novel contrastive learning framework. It leverages structure-aware learning to preserve semantic hierarchies and masked element alignment for fine-grained semantic discrimination. Furthermore, we release StructDocRetrieval, a long structured document retrieval dataset with rich structural annotations. Extensive experiments on both released and industrial datasets across various modern PLMs, along with online A/B testing, demonstrate consistent performance improvements; for example, SEAL boosts NDCG@10 from 79.41% to 82.59% on BGE-M3. The resources are available at this URL.
* Equal Contribution † Corresponding Author
[Figure: an example query ("Could you give me an example of a Python program to get started?") and a structured document (<h1>Python Introduction</h1> <p>Python is a high-level programming language. Getting started:</p>), shown alongside variants of the document text without tags and with [CLS] tokens.]
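For readers unfamiliar with the reported metric, the following is a minimal sketch of how NDCG@10 can be computed for a single query. It is not the evaluation code used in this work. The function names dcg_at_k and ndcg_at_k are hypothetical, the gain is assumed to be linear in the relevance label (some evaluations use 2^rel - 1 instead), and the input list is assumed to hold the relevance labels of all judged candidates for the query, in ranked order.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k relevance labels.

    The item at 0-based position `rank` is discounted by log2(rank + 2),
    so the top-ranked item has a discount of log2(2) = 1.
    Assumption: linear gain (the label itself), not 2**rel - 1.
    """
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """NDCG@k for one query: DCG of the given ranking divided by the DCG
    of the ideal ranking (the same labels sorted in descending order).

    Assumption: `relevances` covers every judged candidate for the query.
    Returns 0.0 when no candidate is relevant, to avoid dividing by zero.
    """
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(relevances, k) / ideal_dcg


if __name__ == "__main__":
    # Hypothetical labels: the only relevant document is ranked second.
    print(ndcg_at_k([0, 1, 0, 0]))
```

A dataset-level score such as the one reported above is typically the mean of this per-query value over all test queries.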