ACL2024

RECOST: External Knowledge Guided Data-efficient Instruction Tuning

Qi Zhang, Yiming Zhang, Haobo Wang, Junbo Zhao

Abstract

In the current landscape of large language models (LLMs), instruction tuning serves as an essential step. Considering the high computational overhead, data-efficient instruction tuning was proposed to reduce the training data size in this process, aiming at selecting high-quality instructional data. Nevertheless, we argue that most current data-efficient instruction-tuning methods are highly dependent on the quality of the original instruction-tuning dataset. When it comes to datasets synthesized by LLMs, a common scenario in this field, dirty samples will even be selected with a higher probability than other samples. To address these challenges, we utilize external knowledge (relevant examples or paragraphs) to evaluate those samples synthesized by LLMs with an in-context-based relative predictive entropy. Based on the new metric, we propose a framework, dubbed RECOST, which integrates external-knowledge-based re-ranking and diversity-consistent sampling into a single pipeline. Through extensive experiments on several synthetic datasets (Alpaca and Alpaca-gpt4), we demonstrate the effectiveness of our method and achieve even better results with only 1% of the full dataset.

1 Introduction

Large Language Models (LLMs) (Brown et al., 2020) have demonstrated remarkable capabilities in numerous fields of natural language processing (NLP) with advances in training datasets and the scale of model parameters. Behind this phenomenon, instruction tuning serves as an essential step to help pre-trained LLMs align with human cognition (Ouyang et al., 2022; Peng et al., 2023; Chung et al., 2022).
Instruction tuning refers to fine-tuning LLMs on instruction-response pairs to endow them with instruction-following capability and activate the knowledge gained in the pre-training period.

[...] limitations on the development of the field of data [...]

[...] limitations of the predictive entropy in vanilla LLMs as outlined above, we utilize external information to evaluate samples synthesized by LLMs. Despite the suboptimal performance of this dataset in generative tasks (Wang et al., 2023), its authenticity is significantly assured. However, in the data-efficient instruction-tuning scenario for LLMs, this cost is unacceptable. Recognizing the importance of maintaining efficiency, we instead intuitively leverage pre-trained LLMs' intrinsic in-context learning (ICL) capabilities, treating these truthful samples as demonstrations. Building on this foundation, we introduce a concept: in-context-knowledge-based relative predictive entropy, which serves as another dimension of uncertainty for vanilla LLMs.

In this paper, we propose RECOST (REtrieval, RE-rank, COreset sampling, and Supervised fine-Tuning), a framework that encompasses an in-context-knowledge-based re-ranking module and a diversity-consistent sampling module to avoid an overly homogeneous data distribution after re-ranking. With extensive experiments on synthetic datasets including Alpaca and Alpaca-gpt4, RECOST demonstrates its superiority over previous methods and remarkably surpasses the model trained on the full dataset with merely 1% and 10% of the training data on three benchmarks including the Alpagasus test [...]

7 Limitations

The primary limitation of our work lies in the necessity of incorporating additional external knowledge.
However, thanks to the development of traditional NLP tasks during the pre-LLM era and the current advancements in retrieval-based ICL, we can easily obtain a vast amount of authentic and reliable external knowledge to meet our requirements. Overall, our research empirically validates the feasibility of integrating exogenous knowledge in the data filtering process based on synthetic data. Although this introduces a minor overhead in data preprocessing, it significantly outperforms previous methods within an acceptable cost margin, offering new perspectives in the realm of data efficiency.
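
As an illustration of the data-filtering idea summarized above, the following is a minimal sketch of how a relative predictive entropy could be used to re-rank synthetic samples. The excerpt here does not give the exact formula, so everything in the sketch is an assumption for illustration: predictive entropy is approximated by the length-normalized negative log-likelihood of the response tokens, the relative score is taken as the ratio of the entropy with in-context demonstrations to the entropy without them, and the names `predictive_entropy`, `relative_predictive_entropy`, `rerank`, `with_ctx`, and `without_ctx` are hypothetical. The per-token probabilities are supplied by hand; in practice they would come from a pre-trained LLM.

```python
import math
from typing import Dict, List, Sequence


def predictive_entropy(token_probs: Sequence[float]) -> float:
    """Length-normalized negative log-likelihood of the response tokens.

    ASSUMPTION: this is a common proxy for predictive entropy, not
    necessarily the exact definition used by RECOST.
    """
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def relative_predictive_entropy(
    probs_with_context: Sequence[float],
    probs_without_context: Sequence[float],
) -> float:
    """Ratio of entropy with demonstrations to entropy without them.

    ASSUMPTION: the ratio form is chosen for illustration only.
    """
    baseline = predictive_entropy(probs_without_context)
    if baseline == 0.0:
        raise ValueError("baseline entropy is zero; ratio is undefined")
    return predictive_entropy(probs_with_context) / baseline


def rerank(samples: List[Dict]) -> List[Dict]:
    """Sort samples by ascending relative predictive entropy.

    Each sample is a dict with hypothetical keys "with_ctx" and
    "without_ctx" holding per-token probabilities of the response.
    """
    return sorted(
        samples,
        key=lambda s: relative_predictive_entropy(s["with_ctx"], s["without_ctx"]),
    )


if __name__ == "__main__":
    toy_samples = [
        {"id": "b", "with_ctx": [0.25], "without_ctx": [0.5]},
        {"id": "a", "with_ctx": [0.5], "without_ctx": [0.25]},
    ]
    for sample in rerank(toy_samples):
        score = relative_predictive_entropy(sample["with_ctx"], sample["without_ctx"])
        print(sample["id"], round(score, 3))
```

Under this assumed definition, a score below 1 means the retrieved demonstrations make the response more predictable to the model, and a score above 1 means they make it less predictable. Which end of the ranking should be kept, and how the diversity-consistent sampling step then selects from it, are not specified in this excerpt and are left out of the sketch.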