CCS2025

ML-Cube: Accelerating Module-Lattice-Based Cryptography using Machine Learning Accelerators with a Memory-Less Design

Tian Zhou, Fangyu Zheng, Zhuoyu Xie, Wenxu Tang, Guang Fan, Yijing Ning, Yi Bian, Jingqiang Lin, Jiwu Jing

Abstract

The rapid advancement of AI technologies has led to a dramatic surge in computational demands, driving significant breakthroughs in ML accelerators. The powerful performance of these accelerators has attracted the attention of cryptography researchers, and recent studies have begun to explore their use in accelerating cryptographic operations. However, treating these accelerators as black boxes leads to high latency and strict concurrency requirements, which hinder their practical deployment. In this paper, we go beyond the black-box treatment of ML accelerators and introduce ML-Cube (ML3), a novel memory-less framework that leverages ML accelerators to implement module-lattice-based PQC, namely FIPS 203 ML-KEM and FIPS 204 ML-DSA. The performance benefits of ML-Cube arise from our thorough analysis of ML accelerator internals. Rather than treating the accelerators as black boxes, we dissect their operating mechanisms and design tailored mathematical transformations for cryptographic acceleration. This enables memory-less (I)NTT and polynomial multiplication that minimize external memory dependencies and reduce latency. We further address the high latency and excessive parallelism demands of traditional SIMT-based implementations by fully parallelizing both the ML-KEM and ML-DSA schemes. Our experiments show that our Tensor Core-based (I)NTT achieves a 2.03x to 3.56x speedup over a highly optimized CUDA-core implementation. Moreover, our memory-less polynomial multiplication attains a 10x speedup, and the full ML-KEM reaches up to a 3.58x speedup with less than one-tenth of the latency of the SOTA approach (CHES '24). Additionally, our enhanced ML-DSA implementation offers a 30% to 55% throughput improvement over the previous SOTA methods (TDSC '24) under the server-oriented model.
Importantly, by confining core computations within registers, our approach inherently mitigates memory disclosure and cache-based side-channel attacks, thereby enhancing overall security.
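
As a rough illustration of why an NTT can be mapped onto matrix-multiply hardware at all, the following sketch writes a small cyclic NTT as a matrix-vector product modulo q and uses it for polynomial multiplication. It is not the paper's memory-less design or its mathematical transformations: the toy size N = 16, the cyclic ring (ML-KEM itself works in a negacyclic ring with an incomplete NTT), and all function names are assumptions made for illustration. Only the modulus q = 3329 and the root of unity 17 are taken from FIPS 203.

```python
# Toy sketch: an NTT expressed as a matrix-vector product mod q.
# Assumptions (not from the paper): N = 16, cyclic ring Z_q[x]/(x^N - 1),
# and every function name below. Requires Python 3.8+ for pow(x, -1, q).

Q = 3329                        # ML-KEM modulus (FIPS 203)
ZETA = 17                       # primitive 256th root of unity mod Q (FIPS 203)
N = 16                          # toy transform size, for illustration only
OMEGA = pow(ZETA, 256 // N, Q)  # primitive N-th root of unity mod Q


def ntt_matrix(omega, n=N, q=Q):
    """Build the n x n matrix W with W[i][j] = omega^(i*j) mod q."""
    return [[pow(omega, i * j, q) for j in range(n)] for i in range(n)]


def matvec_mod(mat, vec, q=Q):
    """Matrix-vector product with the result reduced mod q."""
    return [sum(m * v for m, v in zip(row, vec)) % q for row in mat]


W = ntt_matrix(OMEGA)                  # forward transform matrix
W_INV = ntt_matrix(pow(OMEGA, -1, Q))  # same construction with omega^-1
N_INV = pow(N, -1, Q)                  # scaling factor 1/N mod Q


def ntt(a):
    """Forward NTT as a single matrix-vector product."""
    return matvec_mod(W, a)


def intt(a_hat):
    """Inverse NTT: multiply by the inverse matrix, then scale by 1/N."""
    return [(x * N_INV) % Q for x in matvec_mod(W_INV, a_hat)]


def cyclic_mul(a, b):
    """Multiply two polynomials mod (x^N - 1, Q) via pointwise products."""
    a_hat, b_hat = ntt(a), ntt(b)
    return intt([(x * y) % Q for x, y in zip(a_hat, b_hat)])


def schoolbook_cyclic_mul(a, b):
    """Reference O(N^2) multiplication mod (x^N - 1, Q)."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            c[(i + j) % N] = (c[(i + j) % N] + a[i] * b[j]) % Q
    return c
```

Batching many input polynomials as the columns of one matrix turns the matrix-vector product into a matrix-matrix product, which is the shape of operation that matrix-multiply units are built for; how the paper keeps intermediate values in registers and handles modular reduction is not shown here.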