VLDB2025
Schuyler: Self-Supervised Clustering of Tables in Relational Databases
Lukas Laskowski, Fabian Panse, Michael Hladik, Jan Portisch, Felix Naumann
摘要
Databases are integral to modern applications. They enable the efficient processing of vast amounts of data, making it possible to build services that support millions of users. However, these systems often comprise hundreds of interconnected tables, complicating maintenance and comprehension. To effectively operate them, having an overview of the database is of utmost importance. Database table clustering involves grouping semantically related tables, which simplifies many database management, analysis, and integration tasks. We present Schuyler, a system that clusters database tables by combining structural and semantic features of the database. Specifically, Schuyler fine-tunes a large language model in a self-supervised manner using triplet-loss to produce high-quality embeddings representing table semantics. Subsequently, these embeddings are clustered to achieve a database table clustering. Our approach requires no labeled training data and, thus, is applicable to arbitrary databases. To validate Schuyler and benchmark it against state-of-the-art competitors, we introduce a benchmark collection consisting of five real-world databases. These databases vary significantly in size (29–481 tables) and complexity (3–47 clusters) and reflect diverse real-world challenges. Our results demonstrate that Schuyler consistently achieves superior clustering performance, improving the state-of-the-art on average by 0.13 ARI (adjusted Rand index) and 0.10 AMI (adjusted mutual information).