ICLR2026

SysMoBench: Evaluating AI on Formally Specifying Complex Real-World Systems

Qian Cheng, Ruize Tang, Emilie Ma, Finn Hackett, Peiyang He, Yiming Su, Ivan Beschastnikh, Yu Huang, Xiaoxing Ma, Tianyin Xu

被引用 1 次

DOI arXiv 出版方

摘要

Formal models are essential to specifying large, complex computer systems and verifying their correctness, but are notoriously expensive to write and maintain. Recent advances in generative AI show promise in generating certain forms of specifications. However, existing work mostly targets small code, not complete systems. It is unclear whether AI can deal with realistic system artifacts, as this requires abstracting their complex behavioral properties into formal models. We present SYSMOBENCH, a benchmark that evaluates AI's ability to formally model large, complex systems. We focus on concurrent and distributed systems, which are keystones of today's critical computing infrastructures, encompassing operating systems and cloud infrastructure. We use TLA + , the de facto specification language for concurrent and distributed systems, though the benchmark can be extended to other specification languages. We address the primary challenge of evaluating AI-generated models by automating metrics like syntactic and runtime correctness, conformance to system code, and invariant correctness. SYSMOBENCH currently includes eleven diverse system artifacts: the Raft implementation of Etcd and Redis, the leader election of ZooKeeper, the Spinlock, Mutex, and Ringbuffer in Asterinas OS, etc., with more being added. SYSMOBENCH enables us to understand the capabilities and limitations of today's LLMs and agents, putting tools in this area on a firm footing and opening up promising new research directions. INTRODUCTION Formal models are essential to specifying computer systems and reasoning about their correctness. They provide a mathematical foundation to document and verify the design of complex systems, such as distributed protocols and concurrent algorithms (Lamport, 2002; Tasiran et al., 2003; Newcombe et al., 2015; Hackett et al., 2023b) . Recently, formal models are used to describe system implementations-system code that runs on user devices and in production environments. Such models, which we refer to as system models, enable verification of system code via comprehensive testing and model checking (Bornholt et al., 2021; Tang et al., 2024; Ouyang et al., 2025; Tang et al., 2025) . For example, system models of Apache ZooKeeper (a distributed coordination system) were used to detect deep bugs that violate system safety and verify their fixes (Ouyang et al., 2025 ). However, system models are notoriously expensive to write and maintain. Different from protocols and algorithms, system code contains low-level details, is more complex, and constantly evolves. Hence, synthesis of system models is an open challenge (e.g., TLAi+ Challenge (2025)). Recent advances in generative AI, represented by large language models (LLMs) and agentic techniques, show promise in generating function-level specifications, in the form of pre-and postconditions (