ASE 2022

On the Naturalness of Bytecode Instructions

Yoon-Ho Choi, Jaechang Nam

Cited by 2

Abstract

Bytecode is used in software analysis and related approaches because of advantages such as high availability and a simple specification. To leverage these advantages when training language models on bytecode, it is important to understand the naturalness of bytecode and its characteristics. However, the naturalness of bytecode has not been actively explored. In this paper, we experimentally demonstrate the naturalness of bytecode instructions and investigate its characteristics through an empirical assessment of 10 open-source Java projects. We show that, at the method level, bytecode instructions are more natural than source code representations and less natural than abstract syntax tree representations. Furthermore, we find no correlation between the naturalness of bytecode instructions and that of source code representations at the method level. Our study suggests that researchers need to treat the naturalness of bytecode instructions differently from that of source code. We expect these findings to help future work on automated software engineering tasks that use bytecode models, such as automated debugging and vulnerability detection.

CCS CONCEPTS • Software and its engineering → Software notations and tools; • Computing methodologies → Natural language processing; • General and reference → Surveys and overviews.