ICML 2025

FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees

Fan Nie, Xiaotian Hou, Shuhang Lin, James Zou, Huaxiu Yao, Linjun Zhang

Abstract

The propensity of Large Language Models (LLMs) to generate hallucinations and non-factual content undermines their reliability in high-stakes domains, where rigorous control over Type I errors (the conditional probability of incorrectly classifying hallucinations as truthful content) is essential. Despite its importance, formal verification of LLM factuality with such guarantees remains largely unexplored. In this paper, we introduce FACTTEST, a novel framework that statistically assesses whether an LLM can confidently provide correct answers to given questions with high-probability correctness guarantees. We formulate factuality testing as a hypothesis testing problem to enforce an upper bound on Type I errors at user-specified significance levels. Notably, we prove that our framework also ensures strong Type II error control under mild conditions and can be extended to maintain its effectiveness when covariate shifts exist. Our approach is distribution-free and works for any number of human-annotated samples. It is model-agnostic and applies to any black-box or white-box LM. Extensive experiments on question-answering (QA) and multiple-choice benchmarks demonstrate that FACTTEST effectively detects hallucinations and improves the model's ability to abstain from answering unknown questions, yielding an accuracy improvement of over 40%.

• We propose FACTTEST, a novel statistical testing framework that evaluates the factuality of LLMs while enabling them to decline unknown questions with user-specified Type I error guarantees.
• We prove that our statistical framework achieves strong power control under mild conditions, ensuring that the predictor can also maintain a low Type II error. This power analysis applies broadly to standard Neyman-Pearson (NP) classification problems and is not limited to this setting.
• We extend our framework to accommodate covariate shifts by approximating density ratios and employing rejection sampling, thereby enhancing its robustness for real-world applications.
• We demonstrate that FACTTEST effectively detects hallucinations while maintaining the Type I error below user-specified significance levels, achieving an improvement of over 40% in accuracy compared to pretrained models without any fine-tuning. Additionally, it surpasses training-based baselines by 30% using only half of the fine-tuning data.

STATISTICAL FACTUALITY TESTING

In this section, we formulate the evaluation of factuality in LLMs as a statistical hypothesis testing problem and introduce our FACTTEST framework to overcome hallucination issues.

PROBLEM FORMULATION

We consider a text generation task in which a language model $M$ generates an answer $M(q)$ to a question $q$. Our goal is to statistically evaluate whether $M$ can correctly answer $q$. We formulate this objective as a hypothesis testing problem with the following hypotheses:

$H_0$: The model $M$ cannot answer the question $q$ correctly.
$H_1$: The model $M$ can answer the question $q$ correctly.

For any question-answer pair $(q, a)$, where $a$ is one of the correct answers to question $q$, we apply $M$ to generate an answer $M(q)$. The question-generated answer pair $(q, M(q))$ is deemed correct if the null hypothesis $H_0$ is rejected, i.e., $M(q)$ aligns with $a$; otherwise, it is deemed incorrect. Let $P_0$ and $P_1$ represent the distributions of all possible incorrect and correct question-generated answer pairs $(q, M(q))$, respectively. Given a dataset $\{(q_i, a_i)\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} P_{q,a}$ comprising $n$ question-answer pairs, where $\mathcal{Q}$ and $\mathcal{A}$ denote the sets of all possible questions and answers, respectively, and $P_{q,a}$ is a distribution over $\mathcal{Q} \times \mathcal{A}$, we apply $M$ to generate answers for all $n$ questions, resulting in the set $\mathcal{D} = \{(q_1, M(q_1), a_1), \ldots, (q_n, M(q_n), a_n)\}$.
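The construction of the set of triples can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the paper: `toy_model` stands in for the language model, `build_dataset` and `split_by_correctness` are hypothetical helpers, and normalized exact match is only one possible way to decide whether a generated answer "aligns with" the reference answer.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (question q, generated answer M(q), reference answer a)
Pair = Tuple[str, str]         # (question q, generated answer M(q))


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def is_correct(generated: str, reference: str) -> bool:
    """Assumed alignment rule: normalized exact match with the reference."""
    return normalize(generated) == normalize(reference)


def build_dataset(model: Callable[[str], str],
                  qa_pairs: List[Tuple[str, str]]) -> List[Triple]:
    """Apply the model to every question, producing triples (q, M(q), a)."""
    return [(q, model(q), a) for q, a in qa_pairs]


def split_by_correctness(dataset: List[Triple]) -> Tuple[List[Pair], List[Pair]]:
    """Split (q, M(q)) pairs into incorrect and correct ones.

    The two lists play the role of samples from the distributions of
    incorrect and correct question-generated answer pairs.
    """
    incorrect = [(q, g) for q, g, a in dataset if not is_correct(g, a)]
    correct = [(q, g) for q, g, a in dataset if is_correct(g, a)]
    return incorrect, correct


def toy_model(question: str) -> str:
    """Hypothetical stand-in for a language model: a fixed lookup table."""
    answers = {"Capital of France?": "Paris", "2 + 2?": "5"}
    return answers.get(question, "unknown")


qa_pairs = [("Capital of France?", "paris"), ("2 + 2?", "4")]
D = build_dataset(toy_model, qa_pairs)
incorrect_pairs, correct_pairs = split_by_correctness(D)
```

In this toy run, the first question ends up among the correct pairs and the second among the incorrect ones. A real evaluation would likely replace the exact-match rule with a task-appropriate correctness check.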
Since the distribution $P_{M(q) \mid q}$ of $M(q)$ produced by $M$ given the question $q$ is fully determined by $M$ and independent of $a$, we know that $\mathcal{D} \overset{\text{i.i.d.}}{\sim} P_{q, M(q), a} = P_{q,a} \, P_{M(q) \mid q}$. Then our goal is to construct a predictor $f_\alpha : \mathcal{Q} \times \mathcal{A} \to \{0, 1\}$ that classifies a pair