ICML 2025

Optimizing Temperature for Language Models with Multi-Sample Inference

Weihua Du, Yiming Yang, Sean Welleck

Abstract

Multi-sample aggregation strategies, such as majority voting and best-of-N sampling, are widely used in contemporary large language models (LLMs) to enhance predictive accuracy across various tasks. A key challenge in this process is temperature selection, which significantly impacts model performance. Existing approaches either rely on a fixed default temperature or require labeled validation data for tuning, which are often scarce and difficult to obtain. This paper addresses the challenge of automatically identifying the (near)-optimal temperature for different LLMs using multi-sample aggregation strategies, without relying on task-specific validation data. We provide a comprehensive analysis of temperature's role in performance optimization, considering variations in model architectures, datasets, task types, model sizes, and predictive accuracy. Furthermore, we propose a novel entropy-based metric for automated temperature optimization, which consistently outperforms fixed-temperature baselines. Additionally, we incorporate a stochastic process model to enhance interpretability, offering deeper insights into the relationship between temperature and model performance. Our code is available at https://github.com/StigLidu/dualdistill.

ing how to optimize the sampling process to enhance LLM performance under different conditions, including variations in training datasets, task types, and model sizes. A crucial open question is how to tune temperature, a key hyperparameter that controls the smoothness of the system-learned distribution. Intuitively, increasing the temperature leads to a smoother distribution, enhancing the diversity of sampled outputs. However, excessively high temperatures can introduce many low-quality samples, making aggregation more challenging (Holtzman et al., 2019; Renze & Guven, 2024).
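
The smoothing effect of temperature can be illustrated with a minimal sketch. It assumes the standard temperature-scaled softmax, in which logits are divided by the temperature before normalization. The logits, the helper names (`softmax_with_temperature`, `entropy`, `majority_vote`), and the temperature values are hypothetical choices for illustration. The entropy computed here is the plain Shannon entropy of a toy distribution, not the entropy-based metric proposed in this paper.

```python
import math
from collections import Counter


def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def majority_vote(answers):
    """Return the most frequent answer among the sampled outputs."""
    return Counter(answers).most_common(1)[0][0]


# Hypothetical scores for four candidate answers (illustration only).
logits = [2.0, 1.0, 0.5, -1.0]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    # Higher temperature spreads probability mass and raises the entropy.
    print(f"T={t}: probs={[round(p, 3) for p in probs]} "
          f"entropy={entropy(probs):.3f}")

# Majority voting over a hypothetical set of sampled final answers.
print(majority_vote(["42", "41", "42", "7", "42"]))
```

At a low temperature almost all of the probability mass sits on the top-scoring candidate, so repeated samples tend to agree; at a high temperature the mass spreads out, so samples become more diverse and majority voting has more low-probability candidates to contend with.
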
Conversely, lowering the temperature results in a highly concentrated distribution, reducing diversity and potentially omitting high-quality samples. Striking the right balance between over-sampling and under-sampling is therefore essential for optimizing LLM performance. A common practice in prior evaluations is to use the same temperature across all methods despite variations in training datasets, task types, model sizes, and aggregation strategies. This practice is suboptimal because the best temperature can differ across these conditions. An alternative approach is to empirically tune the temperature using labeled validation data for each task, dataset, model size, and aggregation strategy (Zhang et al., 2024a; Dhuliawala et al., 2024). However, such a process is tedious, time-consuming, and heavily dependent on the availability of labeled validation data, limiting its applicability when such data are scarce.

In this paper, we present the first systematic investigation of how temperature affects LLM performance under multi-sample aggregation strategies across various conditions. Furthermore, we propose a principled algorithmic solution for automated temperature optimization without requiring labeled validation data. Our key idea is as follows:

1. We use the confidence score of each model as a self-assessment measure.