WWW2026
Adaptive Model and Strategy Routing for Cost-Efficient LLM Services
Zhihong Pan, Kai Zhang, Yuze Zhao, Yupeng Han
Abstract
In web-based AI services, providers typically host multiple large language models (LLMs) that differ in capability and API cost. Meanwhile, an LLM's performance depends not only on its inherent capacity but also on the reasoning strategy it employs; together, these determine both answer quality and computational cost. A key challenge is therefore how to adaptively allocate models and strategies to achieve high-quality responses under constrained costs. To address this challenge, we propose Route-To-Reason (RTR), a unified routing framework that jointly selects a suitable LLM and reasoning strategy according to query complexity and user budget. Specifically, RTR learns dense vector representations of models and strategies that capture their behavioral characteristics in handling different queries. Leveraging these embeddings, RTR builds a routing table that estimates the cost and performance of each model-strategy pair. During inference, RTR consults this routing table to dynamically assign the most appropriate pair, enabling adaptive and cost-efficient reasoning tailored to query difficulty and budget. Extensive experiments across multiple reasoning benchmarks show that RTR achieves comparable or higher accuracy than the best single LLM while substantially reducing both token usage and API cost (by up to 60%), yielding a superior trade-off between performance and efficiency. By lowering the overhead of large-scale LLM inference, RTR contributes to cost-aware and environmentally sustainable deployment of web-based AI services.

CCS Concepts
• Computing methodologies → Natural language generation; Neural networks.
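
The routing step described in the abstract can be illustrated with a minimal sketch. It assumes the routing table already holds a predicted quality and a predicted cost for each model-strategy pair for a given query. The scoring rule (quality minus a budget-dependent penalty on cost), the function name `route`, the parameter `lam`, and all model names and numbers are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]            # (model name, reasoning strategy)
Estimate = Tuple[float, float]    # (predicted quality, predicted cost)


def route(table: Dict[Pair, Estimate], lam: float) -> Pair:
    """Return the pair with the highest quality - lam * cost.

    Assumption: a linear trade-off, where a larger `lam` models a
    tighter user budget. This rule is illustrative only.
    """
    if not table:
        raise ValueError("routing table is empty")
    return max(table, key=lambda pair: table[pair][0] - lam * table[pair][1])


# Hypothetical per-query estimates for three model-strategy pairs.
routing_table: Dict[Pair, Estimate] = {
    ("small-llm", "direct"): (0.60, 0.10),
    ("small-llm", "chain-of-thought"): (0.72, 0.30),
    ("large-llm", "chain-of-thought"): (0.90, 1.00),
}

for lam in (0.0, 0.3, 1.0):
    print(lam, route(routing_table, lam))
```

With no cost penalty the sketch picks the strongest pair; as `lam` grows, it shifts toward cheaper pairs, which mirrors the budget-dependent behavior the abstract describes.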