ACL 2024

FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability

Congying Xia, Chen Xing, Jiangshu Du, Xinyi Yang, Yihao Feng, Ran Xu, Wenpeng Yin, Caiming Xiong

Abstract

This paper presents FOFO, a pioneering benchmark for evaluating large language models' (LLMs) ability to follow complex, domain-specific formats, a crucial yet underexamined capability for their application as AI agents. Despite LLMs' advancements, existing benchmarks fail to assess their format-following proficiency adequately. FOFO fills this gap with a diverse range of real-world formats and instructions, developed through an AI-Human collaborative method. Our evaluation across both open-source (e.g., Llama 2, WizardLM) and closed-source (e.g., GPT-4, PALM2, Gemini) LLMs highlights three key findings: open-source models significantly lag behind closed-source ones in format adherence; LLMs' format-following performance is independent of their content generation quality; and LLMs' format proficiency varies across different domains. These insights suggest the need for specialized tuning for format-following skills and highlight FOFO's role in guiding the selection of domain-specific AI agents. FOFO will be publicly released, contributing a critical tool for advancing LLM evaluation and application.

1 Introduction

Large language models (LLMs) show great promise in automating diverse tasks, from medi-