VLDB 2025

Unlocking the Power of CI/CD for Data Pipelines in Distributed Data Warehouses

Hongtao Yang, Zhichen Xu, Sergey Yudin, Andrew Davidson

Cited by 1

Abstract

Ensuring the reliability of data pipelines is critical for modern data-driven organizations, yet building robust Continuous Integration (CI) in large, distributed data warehouses remains a significant challenge. Complexities arising from distributed ownership, the high cost of replicating production environments, and the rapid evolution of business logic lead to fragile pipelines and costly failures. This paper introduces a novel CI framework designed to address these challenges, achieving 94.5% pre-production issue detection in YouTube's data warehouse while dramatically reducing resource consumption. Our key innovation is a production-configuration-driven testing methodology that enables scalable, isolated testing directly within the production environment. This approach reduces testing overhead and ensures high test fidelity. Furthermore, we present a lineage-aware impact analysis framework that automatically propagates data quality checks across distributed pipeline components based on an algebraic dependency model, ensuring data consistency and promoting cross-team collaboration. This production-proven solution provides a practical blueprint for CI/CD in complex, large-scale environments.
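To make the idea of lineage-aware propagation of data quality checks more concrete, here is a minimal sketch in Python. It is not the paper's algebraic dependency model, which the abstract does not specify; it substitutes a plain breadth-first traversal over a table-level lineage graph. The function names (`downstream_tables`, `propagate_checks`), the table names, and the check strings are all hypothetical and chosen only for illustration.

```python
from collections import deque


def downstream_tables(lineage, changed):
    """Return every table reachable from `changed`, in breadth-first order.

    `lineage` maps a table name to the tables that read from it
    (an assumed representation, not the one used in the paper).
    """
    seen = {changed}
    order = []
    queue = deque([changed])
    while queue:
        table = queue.popleft()
        for child in lineage.get(table, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def propagate_checks(lineage, checks, changed):
    """Collect the checks to run when `changed` is modified.

    Includes the changed table's own checks plus those registered on
    every downstream table; tables without checks are omitted.
    """
    affected = [changed] + downstream_tables(lineage, changed)
    return {t: checks[t] for t in affected if checks.get(t)}


# Hypothetical lineage: two aggregates feed one dashboard table.
LINEAGE = {
    "raw_events": ["daily_views", "daily_watch_time"],
    "daily_views": ["creator_dashboard"],
    "daily_watch_time": ["creator_dashboard"],
}

# Hypothetical checks, possibly owned by different teams.
CHECKS = {
    "daily_views": ["row_count > 0"],
    "creator_dashboard": ["no_null_creator_id"],
}

plan = propagate_checks(LINEAGE, CHECKS, "raw_events")
print(plan)
```

In this sketch, a change to `raw_events` selects the checks on `daily_views` and `creator_dashboard` even though neither table was edited directly, which is the sense in which impact analysis can surface cross-team breakage before a change reaches production.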