ICLR 2026

TileLang: Bridge Programmability and Performance in Modern Neural Kernels

Lei Wang, Yu Cheng, Yining Shi, Zhiwen Mo, Zhengju Tang, Wenhao Xie, Tong Wu, Lingxiao Ma, Yuqing Xia, Jilong Xue, Fan Yang, Zhi Yang

Abstract

Achieving high performance in modern AI increasingly requires kernels co-designed with the underlying hardware, but writing efficient kernels remains challenging because of hardware-level complexity and the limited fine-grained control offered by compilers such as Triton. In this paper, we introduce TileLang, a programmable tile-level system that provides explicit primitives for memory placement, data movement, and parallel scheduling. Using a unified fused tile-level dataflow graph (FTG), TileLang streamlines kernel development by combining tile recommendation, which guides developers with hardware-aware defaults, and tile inference, which automates completion through constraint propagation. TileLang enables concise expression of a wide range of AI algorithms in fewer than 70 lines of Python, reducing code size by up to 85.5% compared with manual implementations. Our evaluation shows that TileLang delivers speedups of 1.08× to 10.58× over Triton on NVIDIA H100 (3.02× on average) and 1.01× to 11.56× on AMD GPUs (2.65× on average), effectively bridging programmability and performance. Our code is available at https://github.com/tile-ai/tilelang.