USENIX Security 2025

zkGPT: An Efficient Non-interactive Zero-knowledge Proof Framework for LLM Inference

Wenjie Qu, Yijun Sun, Xuanming Liu, Tao Lu, Yanpei Guo, Kai Chen, Jiaheng Zhang

Abstract

Large Language Models (LLMs) are widely employed for their ability to generate human-like text. However, service providers may deploy smaller models to reduce costs, potentially deceiving users. Zero-Knowledge Proofs (ZKPs) offer a solution by allowing providers to prove the correctness of LLM inference without compromising the privacy of model parameters. Existing solutions either do not support LLM architectures or suffer from significant inefficiency and tremendous overhead. To address this issue, this paper introduces several new techniques. We propose new methods to efficiently prove linear and nonlinear layers in LLMs, reducing computation overhead by orders of magnitude. To further enhance efficiency, we propose constraint fusion to reduce the overhead of proving nonlinear layers and circuit squeeze to improve parallelism. We implement our efficient protocol, specifically tailored for popular LLM architectures like GPT-2, and deploy optimizations to enhance performance. Experiments show that our scheme can prove GPT-2 inference in less than 25 seconds. Compared with state-of-the-art systems such as Hao et al. (USENIX Security'24) and ZKML (EuroSys'24), our work achieves nearly 279× and 185× speedup, respectively.

* Equal contribution. † Corresponding author.

process is black-boxed to users. The service providers will not reveal the model for users to validate the service since the model parameters are trade secrets [10]. Hence, it is difficult for users to validate the service's integrity, e.g., that the specified model actually computed the output.

The zero-knowledge proof (ZKP) [29] is a promising solution to prove the integrity of machine learning (ML) services. In ZKP, a prover produces a proof π to convince the verifier that the result of a public function f on public input x and the secret input of the prover w is indeed y = f(x, w). The secret input w is usually referred to as the witness.
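The prover/verifier interface described above can be illustrated with a toy example. The sketch below is not the protocol of this paper: it is a textbook Schnorr proof made non-interactive with the Fiat-Shamir transform, and it uses the stand-in function f(x, w) = x^w mod P in place of a model. The names `P`, `Q`, `G`, `f`, `challenge`, `prove`, and `verify` are assumptions introduced only for illustration, and the group is far too small to offer any real security.

```python
import hashlib
import secrets

# Toy parameters (illustrative assumption, insecure): P = 2*Q + 1 with P, Q prime.
# G = 4 is a square mod P, so it generates the subgroup of prime order Q.
P, Q, G = 1019, 509, 4


def f(x, w):
    """Stand-in public function: y = x^w mod P (x public, w secret witness)."""
    return pow(x, w, P)


def challenge(*values):
    """Fiat-Shamir challenge: hash the public transcript into Z_Q."""
    data = "|".join(str(v) for v in values).encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % Q


def prove(x, w):
    """Prover: returns y = f(x, w) and a proof pi = (t, s) of knowledge of w."""
    y = f(x, w)
    r = secrets.randbelow(Q)      # fresh random nonce hides w
    t = pow(x, r, P)              # commitment
    c = challenge(x, y, t)        # challenge derived by hashing, no interaction
    s = (r + c * w) % Q           # response
    return y, (t, s)


def verify(x, y, proof):
    """Verifier: checks x^s == t * y^c (mod P) using public values only."""
    t, s = proof
    c = challenge(x, y, t)
    return pow(x, s, P) == (t * pow(y, c, P)) % P


if __name__ == "__main__":
    x, w = G, 123                 # x must lie in the order-Q subgroup
    y, pi = prove(x, w)
    print(verify(x, y, pi))       # an honest proof is accepted: True
```

The verifier never sees w: it receives only y and π = (t, s), yet an honest proof always passes the check because x^s = x^r · (x^w)^c. In an LLM setting, f would be the model's computation, x the user's prompt, and w the model weights, which requires proof systems far more expressive than this single-equation example.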
ZKP guarantees that the verifier will reject with overwhelming probability if the prover cheats in the computation, while the proof π does not reveal any additional information about the secret w. Using f as the function that represents the calculation of the model, x as user input, and w as model weights within the ZKP protocol, the service provider can prove the correctness of LLM inference.

In real-world applications, regulatory authorities can frequently and anonymously query the LLM service and require the service provider to prove the validity of the generated answers via ZKP. If the provider fails to provide a valid proof, they may face penalties. This approach effectively safeguards consumer rights while imposing only a small overhead on existing infrastructure.

In recent years, extensive research has been conducted on ZKP for ML, covering a range of models from classic decision trees to modern neural networks (NN). However, most of these works either do not support LLM architectures [9, 18, 19, 35, 37, 40, 61, 68], or support LLMs [10, 31, 41, 55] but face significant limitations, such as inefficiency and high communication overhead. For example, concurrent work [10] requires over an hour to generate a single proof for the inference of . Such lengthy proof generation times render these approaches impractical for real-world applications involving LLMs.

The core challenge in developing practical ZKP schemes for LLMs lies in efficiently proving the different layers involved in LLMs, such as matrix multiplication, attention [57], GeLU [32], and normalization [5]. This challenge arises for two main reasons: (1) the linear layers in LLMs are S x •S y S z . This approach allows