WWW2026

HeteroSim: Towards High-Fidelity Heterogeneous LLM Training Simulation on GPUs

Xiaofei Yue, Fangming Zhao, Fulun Ye, Jiongchi Yu, Zhaoxuan Li, Tingting Li, Ziming Zhao, Jianwei Yin

Abstract

Modern Large Language Model (LLM) training clusters increasingly mix heterogeneous GPUs, diverse intra-node fabrics, and inter-node interconnects, combined with varied parallelism strategies. Exploring the resulting design space, which heterogeneity enlarges further, through real deployments is prohibitively slow and costly. Existing simulators, which are primarily designed and tuned for homogeneous clusters, either trade fidelity for speed or require heavyweight workflows with non-negligible overhead. We propose HeteroSim, a high-fidelity simulation framework for heterogeneous LLM training systems. It introduces three components: (i) an LLM training workload compiler that captures realistic training graphs, microbatching schedules, and compute-communication overlap; (ii) a heterogeneity-aware computation planner that uses roofline-style scaling across GPU generations; and (iii) a collective communication planner that reproduces NCCL-like behaviors with per-link models, message channelization, and configurable routing. Across a wide range of heterogeneity levels, experimental results show that HeteroSim achieves simulation accuracy close to real-system measurements while keeping simulation overhead low.
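To make the idea of roofline-style scaling across GPU generations concrete, the following is a minimal sketch and not HeteroSim's actual planner. It assumes a basic two-term roofline (kernel time is bounded by either peak compute or memory bandwidth) and rescales a time measured on one GPU to another by the ratio of their roofline bounds. The names `GPU`, `roofline_time`, and `scale_time`, as well as the device figures, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GPU:
    """Hypothetical device description; the numbers used below are illustrative."""
    name: str
    peak_flops: float     # peak compute throughput, FLOP/s
    mem_bandwidth: float  # peak memory bandwidth, bytes/s


def roofline_time(flops: float, bytes_moved: float, gpu: GPU) -> float:
    """Lower-bound kernel time under a basic roofline model.

    The kernel is limited by whichever is slower: doing the arithmetic
    or moving the data through memory.
    """
    compute_time = flops / gpu.peak_flops
    memory_time = bytes_moved / gpu.mem_bandwidth
    return max(compute_time, memory_time)


def scale_time(measured_time: float, flops: float, bytes_moved: float,
               src: GPU, dst: GPU) -> float:
    """Rescale a time measured on `src` to an estimate for `dst`.

    Assumption: the kernel reaches the same fraction of the roofline
    bound on both devices, so only the ratio of the bounds matters.
    """
    ratio = roofline_time(flops, bytes_moved, dst) / roofline_time(flops, bytes_moved, src)
    return measured_time * ratio


# Two made-up GPU generations (not real product specifications).
gpu_a = GPU("gen-A", peak_flops=100e12, mem_bandwidth=1e12)
gpu_b = GPU("gen-B", peak_flops=300e12, mem_bandwidth=2e12)

# A compute-bound kernel scales with the compute ratio (3x faster on gen-B).
compute_bound = scale_time(0.03, flops=2e12, bytes_moved=1e9, src=gpu_a, dst=gpu_b)

# A memory-bound kernel scales with the bandwidth ratio (2x faster on gen-B).
memory_bound = scale_time(0.008, flops=1e9, bytes_moved=4e9, src=gpu_a, dst=gpu_b)
```

The point of the sketch is that the same kernel can speed up by different factors on a newer GPU depending on whether it is compute-bound or memory-bound, which is why a single per-device speedup constant is not enough in a heterogeneous cluster.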