WWW2026

LaTune: Lightweight and Adaptive Configuration Tuning for LLM Inference on Edge Devices

Siqi Zhong, Mugeng Liu, Haiyang Shen, Chongyang Pan, Yun Ma

Abstract

Large Language Models (LLMs) are increasingly deployed on edge devices to address privacy and latency concerns in modern Web applications. While numerous studies focus on inference frameworks, the critical problem of tuning runtime configurations remains largely underexplored. Tuning is particularly challenging on edge devices because of severe budget limitations and the dynamic variability of system resources. To address these challenges, we draw on key insights regarding parameter sensitivity, configuration transferability, and rank stability to propose LaTune, a lightweight and adaptive tuning framework. LaTune efficiently finds optimal runtime configurations by combining three complementary components: parameter selection, which focuses the search on the most impactful parameters; knowledge transfer, which leverages historical data to accelerate the search; and two-stage optimization, which dynamically selects the best configuration based on real-time resource constraints. Experiments across four edge devices and LLMs show that LaTune achieves up to 3.93x higher hypervolume and up to 6.90x higher throughput than baselines. It speeds up tuning by 2-3x, converging within 10-20 iterations, and ensures robust execution under heavy contention where static methods fail. Our code is open-sourced at https://github.com/pkuaiweb/LaTune.
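For readers unfamiliar with the hypervolume metric reported above, the following is a minimal sketch of the standard two-objective hypervolume indicator, not LaTune's own evaluation code. It assumes both objectives are maximized and measured against a reference point; the function names `pareto_front` and `hypervolume_2d` and the sample points are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, assuming both objectives are maximized."""
    front: List[Point] = []
    # Visit points from largest to smallest first objective; a point is kept
    # only if its second objective beats every point already kept.
    for p in sorted(set(points), key=lambda q: (-q[0], -q[1])):
        if not front or p[1] > front[-1][1]:
            front.append(p)
    return front


def hypervolume_2d(points: List[Point], ref: Point) -> float:
    """Area dominated by `points` and bounded below by the reference point `ref`."""
    # Points that do not strictly improve on the reference contribute nothing.
    candidates = [p for p in points if p[0] > ref[0] and p[1] > ref[1]]
    area = 0.0
    prev_y = ref[1]
    # The front is ordered by descending x and ascending y, so each point
    # adds one horizontal strip of the dominated region.
    for x, y in pareto_front(candidates):
        area += (x - ref[0]) * (y - prev_y)
        prev_y = y
    return area


# Hypothetical (objective_1, objective_2) scores for four configurations.
configs = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0), (1.0, 1.0)]
hv = hypervolume_2d(configs, ref=(0.0, 0.0))
```

A larger hypervolume means the set of configurations found covers more of the objective space, which is why it is a common way to compare multi-objective tuners.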