EMNLP 2024

PhiloGPT: A Philology-Oriented Large Language Model for Ancient Chinese Manuscripts with Dunhuang as Case Study

Yuqing Zhang, Baoyi He, Yihan Chen, Hangqi Li, Han Yue, Shengyu Zhang, Huaiyong Dou, Junchi Yan, Zemin Liu, Yongquan Zhang, Fei Wu

Abstract

Philology, the study of ancient manuscripts, demands years of professional training in extensive knowledge memorization and manual textual retrieval. Although these requirements align closely with the strengths of recent Large Language Models (LLMs), the scarcity of high-quality, specialized training data has hindered direct application. To bridge this gap, we curated PhiloCorpus-ZH, a rich collection of ancient Chinese texts spanning a millennium and covering 30 diverse topics, including firsthand folk copies. This corpus enabled the development of PhiloGPT, the first LLM tailored for discovering ancient Chinese manuscripts. To tackle complex philological tasks such as restoration, attribution, and linguistic analysis, we introduced the PhiloCoP framework. Modeled on the analytical patterns of philologists, PhiloCoP improves the LLM's handling of historical linguistic peculiarities such as phonetic loans, polysemy, and syntactic inversions. We further integrated these tasks into PhiloBenchmark, establishing a new standard for evaluating ancient Chinese LLMs on philology tasks. Deployed in practical scenarios, PhiloGPT has enabled Dunhuang specialists to resolve philology tasks, such as identifying duplicated copies of a text and assisting archaeologists with text completion, demonstrating its potential in real-world applications.