SIGMOD2025
SpareLLM: Automatically Selecting Task-Specific Minimum-Cost Large Language Models under Equivalence Constraint
Saehan Jo, Immanuel Trummer
2 citations
Abstract
We introduce SpareLLM, Selecting P assable A nd R esource- E fficient LLM s, a novel LLM framework designed to minimize the inference costs (i.e., resource-efficient) of large-scale NLP tasks while ensuring sufficient result quality (i.e., passable). It enables users to specify an equivalence constraint in terms of the equivalence of outputs to those of the most powerful LLM. SpareLLM then generates results that deviate from the outputs of this LLM only with a probability below a user-defined threshold. SpareLLM employs a profiling phase that evaluates the performance of multiple LLMs to identify those that meet the user-defined equivalence level. It optimizes the tradeoff between profiling overheads and the anticipated cost savings resulting from profiling. Moreover, SpareLLM further reduces inference costs by strategically leveraging a mix of LLMs. Our experiments on five real-world datasets show that SpareLLM achieves significant cost savings, up to 8.6x, while generating equivalent outputs in 90% of cases compared to GPT-4-Turbo. Compared to recent LLM cascading baselines, SpareLLM demonstrates a superior tradeoff between cost and accuracy, accounting for 91.1% and 83.8% of the points on the Pareto curve for OpenAI and Llama models.