ICML 2025

Loss Functions and Operators Generated by f-Divergences

Vincent Roulet, Tianlin Liu, Nino Vieillard, Michael Eli Sander, Mathieu Blondel

Abstract

The logistic loss (a.k.a. cross-entropy loss) is one of the most popular loss functions for multiclass classification. It is also the loss function of choice for next-token prediction in language modeling. It is associated with the Kullback-Leibler (KL) divergence and the softargmax operator. In this work, we build upon Fenchel-Young losses to construct convex loss functions generated by f-divergences. Our loss functions generalize the logistic loss in two directions: i) by replacing the KL divergence with f-divergences and ii) by allowing non-uniform reference measures. We instantiate our framework for numerous f-divergences, recovering existing losses and creating new ones. By analogy with the logistic loss, the loss function generated by an f-divergence is associated with an operator, which we dub f-softargmax. We derive a novel parallelizable bisection algorithm for computing the f-softargmax associated with any f-divergence. On the empirical side, one goal of this paper is to determine the effectiveness of loss functions beyond the classical cross-entropy in a language model setting, including pre-training, post-training (SFT), and distillation. We show that the loss function generated by the α-divergence (which is equivalent to Tsallis α-negentropy in the case of unit reference measures) with α = 1.5 performs well across several tasks.
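As a rough illustration of the objects named in the abstract, the sketch below computes the softargmax and the logistic loss, and then a sparse Tsallis-style operator with α = 1.5 found by bisection on a scalar threshold. This is not the algorithm derived in the paper: it assumes unit reference measures and uses the α-entmax parameterization from earlier work on Fenchel-Young losses, which may differ in scaling from the paper's convention. The function names (`softargmax`, `logistic_loss`, `entmax_bisect`) are hypothetical.

```python
import numpy as np

def softargmax(theta):
    # Numerically stable softargmax: the operator associated with the KL divergence.
    e = np.exp(theta - np.max(theta))
    return e / e.sum()

def logistic_loss(theta, y):
    # Logistic (cross-entropy) loss: logsumexp(theta) - theta[y].
    m = np.max(theta)
    return m + np.log(np.sum(np.exp(theta - m))) - theta[y]

def entmax_bisect(theta, alpha=1.5, n_iter=50):
    # Assumed convention: p_i = max((alpha - 1) * theta_i - tau, 0) ** (1 / (alpha - 1)),
    # with the threshold tau chosen so that the entries of p sum to 1.
    s = (alpha - 1.0) * theta
    lo, hi = s.max() - 1.0, s.max()  # total mass is >= 1 at lo and 0 at hi
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        mass = np.sum(np.maximum(s - tau, 0.0) ** (1.0 / (alpha - 1.0)))
        if mass >= 1.0:
            lo = tau  # too much mass: raise the threshold
        else:
            hi = tau  # too little mass: lower the threshold
    p = np.maximum(s - lo, 0.0) ** (1.0 / (alpha - 1.0))
    return p / p.sum()  # renormalize to absorb the remaining bisection error

theta = np.array([2.0, 1.0, -1.0])
p = softargmax(theta)       # dense: every class gets positive probability
q = entmax_bisect(theta)    # sparse: low-scoring classes can get exactly zero
print(p, logistic_loss(theta, 0), q)
```

With a one-hot target on class y, the logistic loss equals -log p[y], which is the KL divergence between the one-hot target and the softargmax output. The bisection loop only needs elementwise operations and a sum, so under these assumptions it could be applied to many score vectors at once by vectorizing over a batch dimension.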