KDD2024

Graph Machine Learning Meets Multi-Table Relational Data

Quan Gan, Minjie Wang, David Wipf, Christos Faloutsos

3 citations

Abstract

While graph machine learning, and notably graph neural networks (GNNs), have gained immense traction in recent years, application is predicated on access to a known input graph upon which predictive models can be trained. And indeed, within the most widely-studied public evaluation benchmarks such graphs are provided, with performance comparisons conditioned on curated data explicitly adhering to this graph. However, in real-world industrial applications, the situation is often quite different. Instead of a known graph, data are originally collected and stored across multiple tables in a repository, at times with ambiguous or incomplete relational structure. As such, to leverage the latest GNN architectures it is then up to a skilled data scientist to first manually construct a graph using intuition and domain knowledge, a laborious process that may discourage adoption in the first place. To narrow this gap and broaden the applicability of graph ML, we survey existing tools and strategies that can be combined to address the more fundamental problem of predictive tabular modeling over data native to multiple tables, with no explicit relational structure assumed a priori. This involves tracing a comprehensive path through related table join discovery and fuzzy table joining, column alignment, automated relational database (RDB) construction, extracting graphs from RDBs, graph sampling, and finally, graph-centric trainable predictive architectures. Although efforts to build deployable systems that integrate all of these components while minimizing manual effort remain in their infancy, this survey will nonetheless reduce barriers to entry and help steer the graph ML community towards promising research directions and wider real-world impact.