ICLR 2026
USTBench: Benchmarking and Dissecting Spatiotemporal Reasoning Capabilities of LLMs as Urban Agents
Siqi Lai, Yansong Ning, Zirui Yuan, Zhixi Chen, Hao Liu
Cited by 7
Abstract
Large language models (LLMs) have shown emerging potential in spatiotemporal reasoning, making them promising candidates for building urban agents that support diverse urban downstream applications. Despite this promise, existing studies primarily evaluate urban LLM agents on outcome-level metrics (e.g., prediction accuracy, traffic efficiency), offering limited insight into their underlying reasoning processes. As a result, the strengths and limitations of urban LLM agents in spatiotemporal reasoning remain poorly understood. To this end, we introduce USTBench, the first benchmark to evaluate LLMs' spatiotemporal reasoning abilities as urban agents across four decomposed dimensions: spatiotemporal understanding, forecasting, planning, and reflection with feedback. Specifically, USTBench supports five diverse urban decision-making tasks and four spatiotemporal prediction tasks, all running within our constructed interactive city environment, UAgentEnv. The benchmark includes 62,466 structured QA pairs for process-level evaluation, together with standardized end-to-end task assessments, enabling fine-grained diagnostics and broad task-level comparison across diverse urban scenarios. Through extensive evaluation of thirteen leading LLMs, we reveal that although LLMs show promising potential across various urban downstream tasks, they still struggle with long-horizon planning and reflective adaptation in dynamic urban contexts. Notably, recent advanced reasoning models (e.g., DeepSeek-R1) trained on general logic or mathematical problems do not consistently outperform non-reasoning LLMs. This discrepancy highlights the need for domain-specialized adaptation methods to enhance urban spatiotemporal reasoning. Overall, USTBench provides a foundation for building more adaptive and effective LLM-based urban agents and a broad range of smart city applications. Our project is available at https://github.com/usail-hkust/USTBench.
Introduction
Urban systems are inherently complex and dynamic, characterized by continuous fluctuations across space and time. By learning from large-scale spatiotemporal data, traditional data-driven methods have made progress in prediction and decision support [1, 68, 46, 54, 45]. However, they often fall short in generalizing to unseen scenarios and in providing transparent reasoning for reliable decision-making [26, 21]. Recently, advanced large language models (LLMs) (e.g., DeepSeek-R1 [15]) have emerged as intelligent urban agents [21, 26, 75, 35, 36], owing to their growing reasoning ability to integrate diverse information, adapt across tasks, and offer detailed interpretation through natural language. To fully leverage their potential, it is essential to systematically evaluate LLMs' spatiotemporal reasoning abilities, that is, their capacity to infer spatiotemporal dynamics and interact with evolving urban environments. Such evaluation is key to understanding their readiness for real-world urban challenges.