EMNLP 2025

F²Bench: An Open-ended Fairness Evaluation Benchmark for LLMs with Factuality Considerations

Tian Lan, Jiang Li, Yemin Wang, Xu Liu, Xiangdong Su, Guanglai Gao


Abstract

Warning: This paper contains content that may be offensive or harmful. With the growing adoption of large language models (LLMs) in NLP tasks, concerns about their fairness have intensified. Yet, most existing fairness benchmarks rely on closed-ended evaluation formats, which diverge from real-world open-ended interactions. These formats are prone to position bias and introduce a "minimum score" effect, where models can earn partial credit simply by guessing. Moreover, such benchmarks often overlook factuality considerations rooted in historical, social, physiological, and cultural contexts, and rarely account for intersectional biases. To address these limitations, we propose F²Bench: an open-ended fairness evaluation benchmark for LLMs that explicitly incorporates factuality considerations. F²Bench comprises 2,568 instances across 10 demographic groups and two open-ended tasks. By integrating text generation, multi-turn reasoning, and factual grounding, F²Bench aims to more accurately reflect the complexities of real-world model usage. We conduct a comprehensive evaluation of several LLMs across different series and parameter sizes. Our results reveal that all models exhibit varying degrees of fairness issues. We further compare open-ended and closed-ended evaluations, analyze model-specific disparities, and provide actionable recommendations for future model development. Our code and dataset are publicly available at https://github.com/VelikayaScarlet/F2Bench.
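The "minimum score" effect and position bias mentioned above can be illustrated with a small sketch. It is not taken from the paper or its repository: the function names are hypothetical, and the setup (single-answer multiple-choice items) is an assumption made only for illustration.

```python
from fractions import Fraction


def guessing_floor(num_options: int) -> Fraction:
    """Expected accuracy of uniform random guessing on a single-answer
    item with `num_options` choices (hypothetical helper)."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return Fraction(1, num_options)


def always_first_accuracy(answer_positions: list[int]) -> Fraction:
    """Accuracy of a policy that always picks option 0 regardless of
    content, given the 0-based position of the correct answer per item."""
    if not answer_positions:
        raise ValueError("answer_positions must not be empty")
    hits = sum(1 for position in answer_positions if position == 0)
    return Fraction(hits, len(answer_positions))


# A four-option item gives a content-blind guesser 1/4 of the credit.
floor = guessing_floor(4)

# If the correct answer sits in the first slot for half of the items,
# a model that always answers with the first option scores 1/2.
biased = always_first_accuracy([0, 1, 0, 2])
```

Under these assumptions, a closed-ended score never starts from zero, and a model's preference for a particular option position can move its score without any change in its underlying behavior. Open-ended generation has neither a fixed option set nor option positions.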