ICSE2024

Prism: Decomposing Program Semantics for Code Clone Detection through Compilation

Haoran Li, Siqian Wang, Weihong Quan, Xiaoli Gong, Huayou Su, Jin Zhang

5 citations

Abstract

Code clone detection (CCD) is of critical importance in software engineering, while semantic similarity is a key evaluation factor for CCD. The embedding technique, which represents an object using a numerical vector, is utilized to generate code representations, where code snippets with similar semantics (clone pairs) should have similar vectors. However, due to the diversity and flexibility of high-level program languages, the code representation of clone pairs may be inconsistent. Assembly code provides the program execution trace and can normalize the diversity of high-level languages in terms of the program behavior semantics. After revisiting the assembly language, we find that different assembly codes can align with the computational logic and memory access patterns of cloned pairs. Therefore, the use of multiple assembly languages can capture the behavior semantics to enhance the understanding of programs. Thus, we propose Prism, a new method for code clone detection fusing behavior semantics from multiple architecture assembly code, which directly captures multilingual domains' syntax and semantic information. Additionally, we introduce a multi-feature fusion strategy that leverages global information interaction to expand the representation space. This fusion process allows us to capture the complementary information from each feature and leverage the relationships between them to create a more expressive representation of the code. After testing the OJClone dataset, the Prism model exhibited exceptional performance with precision and recall scores of 0.999 and 0.999, respectively.