ICLR 2025

RouteLLM: Learning to Route LLMs from Preference Data

Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M. Waleed Kadous, Ion Stoica

Abstract

Large language models (LLMs) excel at a wide range of tasks, but choosing the right model often involves balancing performance and cost. Powerful models offer better results but are expensive, while smaller models are more cost-effective but less capable. To address this trade-off, we introduce a training framework for learning efficient router models that dynamically select between a stronger and a weaker LLM during inference. Our framework leverages human preference data and employs data augmentation techniques to enhance performance. Evaluations on public benchmarks show that our approach can reduce costs by over 2 times without sacrificing response quality. Moreover, our routers exhibit strong generalization capabilities, maintaining performance even when routing between LLMs not included in training. This highlights the potential of our framework to deliver cost-effective, high-performance LLM solutions.

Introduction

Recent advances in large language models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language tasks. From open-ended conversation and question answering to text summarization and code generation, LLMs have shown an impressive level of fluency and understanding (Achiam et al., 2023; Bubeck et al., 2023). This rapid progress has been enabled by a combination of architectural innovations, such as the Transformer architecture (Vaswani et al., 2017), and the scaling up of data and training infrastructure (Brown et al., 2020; Radford et al., 2019). However, not all LLMs are created equal: there exists wide variation in the sizes of different LLMs, which in turn affects the resources required to serve them. LLMs also differ in the data on which they are trained, which leads to variations in the strengths, weaknesses, and capabilities of different models.

Broadly speaking, larger models tend to be more capable but come at a higher cost, while smaller models tend to be less capable but cheaper to serve. This heterogeneous landscape presents a dilemma in the practical deployment of LLMs. Although routing all user queries to the largest and most capable model ensures high-quality results, it is prohibitively expensive. Conversely, routing queries to smaller models can cut costs by more than 50x (e.g., Claude-3 Haiku vs. Opus¹), but may result in lower-quality responses, as the smaller model may not handle complex queries effectively. LLM routing (Ding et al., 2024; Hu et al., 2024) offers an effective solution by first processing each user query through a router, which then determines the most suitable LLM to handle the query. The router can direct simpler queries to smaller models and more complex ones to larger models, thereby balancing response quality with cost efficiency.
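
To make the routing idea concrete, the following is a minimal sketch of a two-model router, written under stated assumptions rather than as the method developed in this paper. It assumes the router reduces to a scalar score per query compared against a threshold. The names `make_router`, `toy_difficulty`, `blended_cost`, `strong-model`, and `weak-model` are hypothetical, and the word-count heuristic is only a stand-in for a learned router. The cost helper uses the 50x price ratio quoted above together with an illustrative 40% routing fraction that is not a reported result.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Route:
    """Outcome of one routing decision."""
    model: str
    score: float


def make_router(
    score_fn: Callable[[str], float],
    threshold: float,
    strong: str = "strong-model",  # hypothetical model identifiers
    weak: str = "weak-model",
) -> Callable[[str], Route]:
    """Build a router that sends a query to `strong` when its score
    is at or above `threshold`, and to `weak` otherwise."""
    def route(query: str) -> Route:
        score = score_fn(query)
        model = strong if score >= threshold else weak
        return Route(model=model, score=score)
    return route


def toy_difficulty(query: str) -> float:
    """Hypothetical stand-in for a learned router: word count scaled
    to [0, 1]. Longer queries are treated as harder."""
    return min(len(query.split()) / 20.0, 1.0)


def blended_cost(strong_fraction: float, strong_cost: float, weak_cost: float) -> float:
    """Average per-query cost when `strong_fraction` of the traffic
    goes to the strong model and the rest to the weak model."""
    return strong_fraction * strong_cost + (1.0 - strong_fraction) * weak_cost


router = make_router(toy_difficulty, threshold=0.5)
simple = router("What is 2 + 2?")
hard = router(
    "Derive the closed-form solution for ridge regression and explain "
    "how the regularization strength affects bias and variance"
)
print(simple.model, hard.model)

# Illustrative numbers only: a 50x price ratio and 40% of queries sent
# to the strong model, compared with sending every query to it.
relative_cost = blended_cost(0.4, strong_cost=50.0, weak_cost=1.0) / 50.0
print(f"cost relative to always-strong: {relative_cost:.3f}")
```

Raising the threshold sends more queries to the weaker model and lowers cost at some risk to quality, while lowering it does the opposite. In this sketch the threshold is the single knob that trades response quality against cost.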