ACL 2025

Are LLMs Truly Graph-Savvy? A Comprehensive Evaluation of Graph Generation

Ege Demirci, Rithwik Kerur, Ambuj Singh

Abstract

While large language models (LLMs) have demonstrated impressive capabilities across diverse tasks, their ability to generate valid graph structures remains underexplored. We evaluate fifteen state-of-the-art LLMs on five specialized graph generation tasks spanning delivery networks, social networks, quantum circuits, gene-disease networks, and transportation systems. We also test the LLMs under three prompting strategies: direct, iterative feedback, and program-augmented. Models with explicit reasoning modules (o3-mini-high, o1, Claude 3.7 Sonnet, DeepSeek-R1) solve more than twice as many tasks as their general-purpose peers, regardless of parameter count. Error analysis reveals two recurring failure modes: smaller Llama models often violate basic structural constraints, whereas Claude models respect topology but mismanage higher-order logical rules. Allowing models to refine their answers iteratively yields uneven gains, underscoring fundamental differences in error-correction capacity. This work demonstrates that graph understanding stems from specialized training methodologies rather than scale, establishing a framework for developing truly graph-savvy language models. Results and verification scripts are available at github.com/Are-LLMs-Truly-Graph-Savvy.