VLDB2025

APEX-DAG: Library and Language independent Pipeline EXtraction

Sebastian Eggers, Nina Zukowska, Ziawasch Abedjan

Abstract

Modern data-driven systems often rely on complex pipelines to process and transform data for downstream machine learning (ML) tasks. Extracting these pipelines and understanding their structure is critical for ensuring transparency, performance optimization, and maintainability, especially in large-scale projects. In this work, we introduce a novel system, APEX-DAG (Automating Pipeline EXtraction with Dataflow, Static Code Analysis, and Graph Attention Networks), which automates the extraction of data pipelines from computational notebooks or scripts. Unlike execution-based methods, APEX-DAG leverages static code analysis to identify the dataflow, transformations, and dependencies within ML workflows without executing or modifying the code. Further, after an initial training phase, our system can identify pipelines that are built with previously unseen libraries.
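To illustrate what identifying dataflow without executing the code can look like, the following is a minimal sketch built on Python's standard `ast` module. It is not APEX-DAG's implementation: the function name `extract_dataflow_edges`, the edge format, and the example script are assumptions made for illustration, and the sketch handles only simple variable assignments.

```python
import ast


def extract_dataflow_edges(source: str) -> list[tuple[str, str, str]]:
    """Return (input_name, output_name, operation) edges for simple assignments.

    Hypothetical helper for illustration only; the source is parsed, never executed.
    """
    tree = ast.parse(source)
    edges = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Assign):
            continue
        # Only plain variable targets are handled (no tuples, attributes, subscripts).
        targets = [t.id for t in node.targets if isinstance(t, ast.Name)]
        # Label the edge with the called function or method, if the value is a call.
        if isinstance(node.value, ast.Call):
            operation = ast.unparse(node.value.func)
        else:
            operation = "assign"
        # Every name read on the right-hand side counts as an input.
        # Note: this includes module aliases such as `pd`.
        inputs = sorted(
            {
                n.id
                for n in ast.walk(node.value)
                if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)
            }
        )
        for target in targets:
            for name in inputs:
                edges.append((name, target, operation))
    return edges


# Assumed example script; the file "data.csv" is never opened because nothing runs.
EXAMPLE = """
import pandas as pd
df = pd.read_csv("data.csv")
clean = df.dropna()
features = clean.drop(columns=["label"])
"""

edges = extract_dataflow_edges(EXAMPLE)
for edge in edges:
    print(edge)
```

Because the script is only parsed, neither pandas nor the data file needs to be available. Classifying what each operation means (for example, loading versus cleaning) for libraries not seen during training is the part the abstract attributes to the learned model, and this sketch does not attempt it.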