ACL 2025
Re³Syn: A Dependency-Based Data Synthesis Framework for Long-Context Post-training
Zhiyang Zhang, Ziqiang Liu, Huiming Wang, Renke Shan, Li Kuang, Lu Wang, De Wen Soh
Cited by 4
Abstract
An important trend in large language models (LLMs) is the development of longer context windows. However, training LLMs with long context windows to model lengthy inputs effectively is often hindered by the scarcity of naturally long-context data. Existing methods that construct long-context data by concatenating short documents overlook a crucial characteristic of long-context data quality, namely semantic dependency. In this paper, we propose a novel framework called Retrieval, Dependency Recognition, and Reorder for data Synthesis (Re³Syn), which leverages semantic similarity to retrieve relevant documents and form several batches. Within each batch, the framework comprehensively recognizes dependencies and uses them, along with a reorder algorithm, to organize the short documents into coherent long-context data. Comprehensive experiments on multiple benchmarks indicate that the data generated by Re³Syn has longer dependencies and significantly enhances the model's long-context capabilities.
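The three stages named in the abstract (retrieval, dependency recognition, reorder) can be illustrated with a minimal sketch. The abstract does not specify the retriever, the dependency recognizer, or the reorder algorithm, so everything here is an assumption made for illustration: bag-of-words cosine similarity stands in for semantic retrieval, a caller-supplied predicate `depends_on` stands in for dependency recognition, and Kahn's topological sort stands in for the reorder step. All function names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict, deque

def bow(text):
    # Bag-of-words vector; an assumed stand-in for a semantic embedding.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_batches(docs, batch_size):
    # Greedy grouping: seed with the first unused document, then add
    # the most similar unused documents until the batch is full.
    vecs = [bow(d) for d in docs]
    unused = list(range(len(docs)))
    batches = []
    while unused:
        seed = unused.pop(0)
        ranked = sorted(unused, key=lambda j: cosine(vecs[seed], vecs[j]),
                        reverse=True)
        chosen = ranked[:batch_size - 1]
        unused = [j for j in unused if j not in chosen]
        batches.append([seed] + chosen)
    return batches

def reorder(batch, depends_on):
    # Kahn's topological sort: each document is placed after the
    # documents it depends on. depends_on(b, a) means "b depends on a".
    indeg = {i: 0 for i in batch}
    succ = defaultdict(list)
    for a in batch:
        for b in batch:
            if a != b and depends_on(b, a):
                succ[a].append(b)
                indeg[b] += 1
    queue = deque(i for i in batch if indeg[i] == 0)
    order = []
    while queue:
        i = queue.popleft()
        order.append(i)
        for j in succ[i]:
            indeg[j] -= 1
            if indeg[j] == 0:
                queue.append(j)
    # Documents caught in a dependency cycle keep their original order.
    order += [i for i in batch if i not in order]
    return order

def synthesize(docs, batch_size, depends_on):
    # Concatenate each reordered batch into one long-context sample.
    samples = []
    for batch in retrieve_batches(docs, batch_size):
        order = reorder(batch, lambda b, a: depends_on(docs[b], docs[a]))
        samples.append("\n\n".join(docs[i] for i in order))
    return samples

# Toy data and a toy dependency rule (both hypothetical): a later
# "part" of a series depends on an earlier part.
def part_no(text):
    return int(text.split("part ")[1].split(":")[0])

def toy_depends_on(b_text, a_text):
    return part_no(b_text) > part_no(a_text)

docs = [
    "cat story part 2: the cat returns home",
    "stock report part 1: markets open",
    "cat story part 1: the cat leaves home",
    "stock report part 2: markets close",
]
long_docs = synthesize(docs, 2, toy_depends_on)
```

In this toy run, the two cat-story documents land in one batch and the two stock-report documents in another, and within each batch the earlier part is placed first. A real pipeline would replace the similarity function and the dependency predicate with learned models; the sketch only shows how the three stages fit together.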