WWW 2023
PROD: Progressive Distillation for Dense Retrieval
Zhenghao Lin, Yeyun Gong, Xiao Liu, Hang Zhang, Chen Lin, Anlei Dong, Jian Jiao, Jingwen Lu, Daxin Jiang, Rangan Majumder, Nan Duan
33 citations
Abstract
Knowledge distillation is an effective way to transfer knowledge from a strong teacher to an efficient student model. Ideally, the better the teacher, the better the student should perform. However, this expectation does not always hold. A strong teacher model often yields a poor student via distillation because of the non-negligible gap between teacher and student. To bridge the gap, we propose PROD, a PROgressive Distillation method, for dense retrieval. PROD consists of teacher progressive distillation and data progressive distillation, which together gradually improve the student. To alleviate catastrophic forgetting, we introduce a regularization term in each distillation process. We conduct extensive experiments on seven datasets, including five widely used publicly available benchmarks (MS MARCO Passage, TREC Passage 19, TREC Document 19, MS MARCO Document, and Natural Questions) and two industry datasets (Bing-Rel and Bing-Ads). PROD achieves state-of-the-art performance among distillation methods for dense retrieval. Our 6-layer student model even surpasses most existing 12-layer models on all five public benchmarks. The code and models are released at https://github.com/microsoft/SimXNS.
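To make the idea of a distillation step with a forgetting regularizer concrete, here is a minimal sketch in plain Python. It is not the PROD implementation: the abstract does not give the loss, so the KL-divergence form, the use of the previous-stage student as the regularization target, and the names `distill_loss`, `prev_student_scores`, and `reg_weight` are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Turn relevance scores over candidate passages into a distribution."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distill_loss(student_scores, teacher_scores, prev_student_scores,
                 reg_weight=0.5):
    """Hypothetical loss for one distillation stage.

    The first term pulls the student toward the current teacher. The second
    term (an assumed form of the regularizer) keeps the student close to its
    own predictions from the previous stage, to limit forgetting.
    """
    p_student = softmax(student_scores)
    p_teacher = softmax(teacher_scores)
    p_prev = softmax(prev_student_scores)
    distill_term = kl_divergence(p_teacher, p_student)
    reg_term = kl_divergence(p_prev, p_student)
    return distill_term + reg_weight * reg_term
```

In a progressive setup, a loss of this kind would be applied once per stage, with `teacher_scores` coming from a stronger teacher at each stage and `prev_student_scores` from the student checkpoint saved at the end of the previous stage. The scores here are per-query relevance scores over the same list of candidate passages.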