ICLR 2025

VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models

Lisa Dunlap, Krishna Mandal, Trevor Darrell, Jacob Steinhardt, Joseph E. Gonzalez

Abstract

Large language models (LLMs) often exhibit subtle yet distinctive characteristics in their outputs that users intuitively recognize, but struggle to quantify. These "vibes" (such as tone, formatting, or writing style) influence user preferences, yet traditional evaluations focus primarily on the singular vibe of correctness. We introduce VibeCheck, a system for automatically comparing a pair of LLMs by discovering identifying traits of a model ("vibes") that are well-defined, differentiating, and user-aligned. VibeCheck iteratively discovers vibes from model outputs and then utilizes a panel of LLM judges to quantitatively measure the utility of each vibe. We validate that the vibes generated by VibeCheck align with those found in human discovery and run VibeCheck on pairwise preference data from real-world user conversations with Llama-3-70b vs GPT-4. VibeCheck reveals that Llama has a friendly, funny, and somewhat controversial vibe. These vibes predict model identity with 80% accuracy and human preference with 61% accuracy. Lastly, we run VibeCheck on a variety of models and tasks including summarization, math, and captioning to provide insight into differences in model behavior. VibeCheck discovers vibes like Command X prefers to add concrete intros and conclusions when summarizing in comparison to TNGL, Llama-405b often overexplains its thought process on math problems compared to GPT-4o, and GPT-4 prefers to focus on the mood and emotions of the scene when captioning compared to Gemini-1.5-Flash.

Code and vibe visualizer found at https://bench-mark.org/

Introduction

vibe check: A process by which a group obtains a subjective assessment of another person, place, or thing.
- Urban Dictionary

How a large language model writes a story, explains a concept, or edits an essay can be evaluated along many different dimensions such as creativity, formatting, and writing style. However, most evaluations focus on one dimension: "correctness".
State-of-the-art evaluation methods remain largely focused on measuring accuracy for question answering and analytical reasoning tasks (Hendrycks et al., 2021a; Wang et al., 2019b;a; Hendrycks et al., 2021c), and methods that aim to provide a more holistic view of LLMs (Zhang et al., 2024; Padlewski et al., 2024; Mehri & Eskenazi, 2020b) rely on predefined concepts like conciseness, clarity, and trustworthiness to measure a model's performance. These evaluation approaches fail to capture the open-ended nature of LLM applications and their critical dependence on subjective user preferences and the context of the task. For instance, tone and creativity might be crucial in creative writing, whereas efficiency and readability are crucial in coding tasks. To inform users which model best suits their needs, we require flexible evaluation methods that can both discover and measure the relevant axes of evaluation for a given task.

When interacting with a set of LLMs for an extended period, a user can often tell which model generated a particular response by looking at certain traits of the outputs. We define these identifying traits of models as "vibes". For instance, users have found that Llama-3 outputs tend to be more friendly than outputs from GPT-4 and Claude, which tend to be more formal (see Figure 1); in other words, Llama-3 ranks high on the friendliness vibe, defined by the axis formal → friendly. Using these insights, we might select Llama for customer service tasks and Claude for coding tasks.
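
To make the notion of a vibe as an axis more concrete, the following is a minimal sketch of how a panel's judgments along one axis (here formal → friendly) might be aggregated for a pair of models. The `Vibe` class, the +1/0/-1 vote encoding, the function names, and the vote values are all illustrative assumptions, not the scoring scheme or data used by VibeCheck.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Vibe:
    """A vibe described as an axis with a low and a high end (hypothetical type)."""
    name: str
    low: str
    high: str


def panel_score(judge_votes):
    """Average one prompt's votes from a panel of judges.

    Assumed encoding: +1 if model A's output sits closer to the high end
    of the axis, -1 if model B's does, 0 if the judge sees no difference.
    """
    return mean(judge_votes)


def vibe_summary(votes_per_prompt):
    """Average the per-prompt panel scores into one number for the model pair."""
    return mean(panel_score(votes) for votes in votes_per_prompt)


friendliness = Vibe(name="friendliness", low="formal", high="friendly")

# Made-up votes from three judges on three prompts (not data from the paper).
votes = [
    [1, 1, 0],
    [1, -1, 1],
    [0, 0, 1],
]

avg = vibe_summary(votes)
leaning = friendliness.high if avg > 0 else friendliness.low
print(f"{friendliness.name}: model A leans {leaning} (mean score {avg:.2f})")
```

Under this assumed encoding, a mean score above zero means model A's outputs sit closer to the friendly end of the axis than model B's, and a score near zero means the axis does not separate the two models.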