WWW2026

llm-tuna - Hyperparameter Optimization for LLM Inference

Thameem Abbas Ibrahim Bathusha, Aanya Sharma, Andy Huynh, Rehan Samaratunga, Ashish Kamra

Abstract

The performance of large language model (LLM) inference engines depends heavily on deployment-specific parameters such as batch size, parallelism, CUDA graphs, and GPU and network topology. With every new LLM release, finding the optimal setting of these parameters becomes an open problem again. Moreover, as developers find new applications for LLMs, the parameter space keeps growing, particularly for large-scale inference deployments that span multiple nodes (e.g., llm-d [5]). In this work, we demonstrate llm-tuna, an open-source framework that automates parameter search for vLLM-based [4] inference workloads across a diverse set of hardware setups. Our framework uses Bayesian optimization via Optuna to select the best configuration and parallelizes trials across multiple nodes to enable large-scale studies. We evaluate our system on a set of mixture-of-experts (MoE) models (Qwen3-30B-A3B, Kimi-K2-INT4, GPT-OSS-120B, DeepSeek-R1-671B), configuring each experiment to avoid artifacts from KV-cache preemption. Our method achieves output throughput gains of up to 32.9% while never suggesting a configuration that performs worse than vLLM's baseline. Our results demonstrate the potential of systematic auto-tuning to expose non-obvious performance regimes and guide operator defaults.
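
As a rough illustration of the tuning loop described above, the following sketch shows one way a search could be guaranteed never to return a configuration worse than the baseline: the baseline is evaluated first and is only replaced by a strictly better trial. This is not llm-tuna's implementation. Random search stands in for Optuna's Bayesian sampler so that the sketch has no dependencies, the parameter names and values are hypothetical, and `synthetic_throughput` is a made-up stand-in for an actual benchmark run.

```python
import random

# Hypothetical search space: names and values are illustrative only,
# not actual vLLM flags or llm-tuna settings.
SEARCH_SPACE = {
    "batch_size": [64, 128, 256, 512],
    "tensor_parallel_size": [1, 2, 4, 8],
    "cuda_graphs": [True, False],
}

# Hypothetical baseline configuration (assumed, not taken from the paper).
BASELINE = {"batch_size": 256, "tensor_parallel_size": 1, "cuda_graphs": True}


def synthetic_throughput(cfg):
    """Made-up objective standing in for a real benchmark (tokens/s)."""
    score = 1000.0
    score -= 0.5 * abs(cfg["batch_size"] - 128)
    score -= 20.0 * abs(cfg["tensor_parallel_size"] - 4)
    if cfg["cuda_graphs"]:
        score += 50.0
    return score


def tune(objective, space, baseline, n_trials, seed=0):
    """Search the space, starting from the baseline as the incumbent.

    Random sampling is used here in place of Bayesian optimization.
    Because the baseline is scored first and only strictly better
    trials replace it, the result is never worse than the baseline.
    """
    rng = random.Random(seed)
    best_cfg, best_score = dict(baseline), objective(baseline)
    for _ in range(n_trials):
        cfg = {name: rng.choice(choices) for name, choices in space.items()}
        score = objective(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


baseline_score = synthetic_throughput(BASELINE)
best_cfg, best_score = tune(synthetic_throughput, SEARCH_SPACE, BASELINE, n_trials=20)
gain_pct = 100.0 * (best_score - baseline_score) / baseline_score
print(f"best config: {best_cfg}, gain over baseline: {gain_pct:.1f}%")
```

In a real study the objective would launch an inference server with the sampled configuration and measure output throughput, and the sampler would use the history of earlier trials to propose the next configuration rather than drawing uniformly at random.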