NDSS 2025

Unleashing the Power of Generative Model in Recovering Variable Names from Stripped Binary

Xiangzhe Xu, Zhuo Zhang, Zian Su, Ziyang Huang, Shiwei Feng, Yapeng Ye, Nan Jiang, Danning Xie, Siyuan Cheng, Lin Tan, Xiangyu Zhang

Abstract

Decompilation aims to recover the source-code form of a binary executable. It has many security applications, such as malware analysis, vulnerability detection, and code hardening. A prominent challenge in decompilation is recovering variable names. We propose a novel technique that leverages the strengths of generative models while mitigating model biases. We build a prototype, GenNm, from the pre-trained generative models CodeGemma-2B, CodeLlama-7B, and CodeLlama-34B. We fine-tune GenNm on decompiled functions and teach the models to leverage contextual information. When querying a function, GenNm includes names from its callers and callees, providing rich contextual information within the model's input token limit. We mitigate model biases by aligning the output distribution of the models with the symbol preferences of developers. Our results show that GenNm improves the state-of-the-art name recovery precision by 5.6–11.4 percentage points on two commonly used datasets. In the most challenging setup, where ground-truth variable names are not seen in the training dataset, it improves the state of the art by 32% in relative terms (from 17.3% to 22.8%).
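To make the idea of context-augmented querying concrete, the following is a minimal sketch of how names from callers and callees might be packed into a query under a token budget. It is an illustration under stated assumptions, not the prototype's implementation: the `Function` structure, the `build_query` helper, the comment-style hint format, and the whitespace-based `count_tokens` stand-in for a real tokenizer are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Function:
    """A decompiled function and any names already recovered for it (hypothetical structure)."""
    name: str
    body: str
    known_names: dict[str, str] = field(default_factory=dict)  # placeholder -> recovered name


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting stands in for the model's real tokenizer.
    return len(text.split())


def build_query(target: Function, neighbors: list[Function], token_budget: int) -> str:
    """Prepend caller/callee name hints to the target body without exceeding the budget."""
    used = count_tokens(target.body)
    hints: list[str] = []
    for fn in neighbors:
        for placeholder, name in fn.known_names.items():
            hint = f"// {fn.name}: {placeholder} -> {name}"
            cost = count_tokens(hint)
            if used + cost > token_budget:
                # Budget exhausted: stop adding context and return what fits.
                return "\n".join(hints + [target.body])
            hints.append(hint)
            used += cost
    return "\n".join(hints + [target.body])
```

In this sketch the target function body is always kept, and neighbor hints are added greedily in the order given, so a caller of `build_query` would list the most informative callers and callees first.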