CCS2025

The Odyssey of robots.txt Governance: Measuring Convention Implications of Web Bots in Large Language Model Services

Jian Cui, Mingming Zha, XiaoFeng Wang, Xiaojing Liao

Abstract

Web content is an essential element for large language model (LLM) services, supporting both training and inference processes. To manage content access by web bots from LLM service vendors (i.e., LLM bots), web content publishers are increasingly incorporating content access rules into robots.txt, a long-established web content management protocol. However, the rise of proprietary LLM bots, such as OpenAI's ChatGPT-User and Google's Google-Extended, has raised concerns about the transparency of web content access and whether these bots adhere to robots.txt rules. Moreover, there is limited understanding of these LLM bots and of their impact on web publishers and broader web content governance. To fill this gap, we present a systematic analysis of 18 LLM bots on 582,281 robots.txt files. Our findings reveal a significant increase in robots.txt rules associated with LLM bots, particularly in domains that fall into the finance and news categories. Despite this increased adoption, web publishers face challenges in managing robots.txt configurations due to the complexity of the LLM ecosystem and the involvement of third-party brokers. Furthermore, we identified several cases of robots.txt violations, including instances where LLMs memorized web content from restricted domains, and where ChatGPT-User ignored robots.txt and accessed restricted content. These results highlight gaps in current web content governance and underscore the need for enforceable content management mechanisms that respect web publishers' intentions and content control.
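As a concrete illustration of the kind of rule discussed above, the following minimal sketch shows how a robots.txt file might restrict LLM bots and how a compliant client could check those rules. It uses Python's standard `urllib.robotparser`; the robots.txt content, the `example.com` URLs, and the helper name `allowed` are hypothetical and are not taken from the measured dataset or from the paper's methodology.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written only for illustration.
ROBOTS_TXT = """\
User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /articles/

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if the hypothetical rules above permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is made.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ChatGPT-User", "https://example.com/index.html"),
        ("Google-Extended", "https://example.com/articles/1"),
        ("Google-Extended", "https://example.com/about"),
        ("OtherBot", "https://example.com/articles/1"),
    ]:
        print(agent, url, allowed(agent, url))
```

Note that such a check only describes what a cooperating bot should do. robots.txt is advisory, so nothing in the file itself prevents a bot from fetching a disallowed path, which is the enforcement gap the abstract points to.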