WWW2026

URLBank: Data-Driven URL Discovery via Temporal Link Graphs

Felipe Marineli, Valerio Cetorelli, Valter Crescenzi, Tim Furche, Xiaonan Guo

Abstract

Web-scale editorial crawling must balance coverage and freshness within tight politeness and request budgets. Yet manual seed management remains common in production systems despite being inefficient: it oversamples redundant seeds while missing high-yield ones. URLBank replaces manual curation with a label-free controller that infers optimal seed selection directly from temporal crawl telemetry. It identifies candidate entry points, estimates their stability (the persistence of links across crawler revolutions) and productivity (the rate of first-seen publications), and ranks them by greedy marginal gain on a shared-credit coverage objective. In a shadow A/B evaluation spanning 5,238 sites, URLBank consistently achieves higher coverage, greater efficiency, and earlier discovery under identical conditions. Gains remain stable across Top-K budgets, reaching near-complete coverage with far fewer seeds. Deployed alongside Meltwater's production crawler Pulitzer, URLBank operates with versioned policies, ranked prefixes for crawl budgets, and integrated health diagnostics, making allocation transparent, auditable, and reversible. Together, these results demonstrate that temporal signals, combined through an interpretable greedy objective, yield large, measurable improvements in industrial-scale coverage, resource efficiency, and freshness.
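
As a rough illustration of ranking seeds by greedy marginal gain, the sketch below may help. It is not URLBank's implementation: the abstract does not define the shared-credit objective, so the sketch substitutes a plain stability-weighted set-coverage objective. The function name `greedy_rank`, the input shapes (`seed_to_pubs`, `stability`), and the multiplicative weighting are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def greedy_rank(seed_to_pubs: Dict[str, Set[str]],
                stability: Dict[str, float],
                k: int) -> List[str]:
    """Rank up to k seeds by stability-weighted marginal coverage gain.

    Hypothetical inputs (not from the paper):
      seed_to_pubs: candidate seed -> publications first seen via that seed
      stability:    candidate seed -> score in [0, 1]
    """
    covered: Set[str] = set()
    ranked: List[str] = []
    remaining = set(seed_to_pubs)

    def gain(seed: str) -> float:
        # Assumed objective: only publications not yet covered earn credit,
        # scaled by how persistent the seed's links are.
        return stability.get(seed, 0.0) * len(seed_to_pubs[seed] - covered)

    while remaining and len(ranked) < k:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=gain)
        if gain(best) <= 0:
            break  # no remaining seed adds new coverage
        ranked.append(best)
        covered |= seed_to_pubs[best]
        remaining.remove(best)
    return ranked
```

Under these assumptions, a seed whose publications are already reachable from higher-ranked seeds has zero marginal gain and is left out of the ranking, which is one way redundant seeds could be pruned from a Top-K budget.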