VLDB 2025

Relational Data Models for Genetic VCF Data

Mohamed Sabri Hafidi, Ozan Kahramanogullari, Anton Dignös, Johann Gamper

Abstract

The Variant Call Format (VCF) and its binary counterpart (BCF) are commonly used in bioinformatics for storing gene sequence variations. While VCF files provide compact storage, querying them requires specialized tools and scripts, so they miss out on the rich functionality of database management systems and on the potential for integration into multiomics pipelines. In this paper, we leverage relational database management systems (RDBMSs) to enhance efficiency and flexibility in storing and querying large-scale genetic datasets. We map the VCF file structure to narrow, wide, and array-based data models, which are further refined using JSON data structures, resulting in eight data models. Our experimental evaluation shows that RDBMSs provide competitive performance in comparison with specialized state-of-the-art tools while making full-fledged database capabilities available for genetic data analysis.
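To make the distinction between narrow, wide, and array-based layouts concrete, the following is a minimal sketch and not the schema used in the paper. It assumes a hypothetical VCF body with three variants and two samples (`s1`, `s2`), uses SQLite through Python's standard `sqlite3` module as a stand-in for a full RDBMS, and emulates an array column by storing a JSON-encoded list as text. All table names, column names, and helper functions are illustrative assumptions.

```python
import json
import sqlite3

# Hypothetical VCF body lines (header omitted): CHROM POS ID REF ALT QUAL FILTER INFO FORMAT s1 s2
VCF_LINES = [
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\t1/1",
    "chr1\t250\t.\tC\tT\t.\tPASS\t.\tGT\t0/0\t0/1",
    "chr2\t75\t.\tG\tA\t.\tPASS\t.\tGT\t1/1\t0/0",
]
SAMPLES = ["s1", "s2"]  # assumed sample names


def load(conn):
    """Load the same variants into three illustrative layouts."""
    cur = conn.cursor()
    # Narrow: one row per (variant, sample) pair.
    cur.execute("CREATE TABLE narrow (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT,"
                " sample TEXT, gt TEXT)")
    # Wide: one row per variant, one column per sample.
    cur.execute("CREATE TABLE wide (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT,"
                " s1 TEXT, s2 TEXT)")
    # Array-like: one row per variant, genotypes kept as a JSON list in a text column.
    cur.execute("CREATE TABLE arr (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, gts TEXT)")
    for line in VCF_LINES:
        fields = line.split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        gts = fields[9:]  # genotype columns follow the FORMAT column
        for sample, gt in zip(SAMPLES, gts):
            cur.execute("INSERT INTO narrow VALUES (?, ?, ?, ?, ?, ?)",
                        (chrom, pos, ref, alt, sample, gt))
        cur.execute("INSERT INTO wide VALUES (?, ?, ?, ?, ?, ?)",
                    (chrom, pos, ref, alt, *gts))
        cur.execute("INSERT INTO arr VALUES (?, ?, ?, ?, ?)",
                    (chrom, pos, ref, alt, json.dumps(gts)))
    conn.commit()


def homozygous_alt_narrow(conn, sample):
    """Narrow layout: the sample is a value, so it can be a bound parameter."""
    return conn.execute(
        "SELECT chrom, pos FROM narrow WHERE sample = ? AND gt = '1/1'"
        " ORDER BY chrom, pos", (sample,)).fetchall()


def homozygous_alt_wide(conn, sample):
    """Wide layout: the sample is a column name, so it is validated, not bound."""
    if sample not in SAMPLES:
        raise ValueError(f"unknown sample: {sample}")
    return conn.execute(
        f"SELECT chrom, pos FROM wide WHERE {sample} = '1/1'"
        " ORDER BY chrom, pos").fetchall()


def homozygous_alt_array(conn, sample):
    """Array-like layout: the sample is a position in the stored list."""
    idx = SAMPLES.index(sample)
    rows = conn.execute("SELECT chrom, pos, gts FROM arr ORDER BY chrom, pos")
    return [(chrom, pos) for chrom, pos, gts in rows if json.loads(gts)[idx] == "1/1"]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load(conn)
    print(homozygous_alt_narrow(conn, "s1"))
    print(homozygous_alt_wide(conn, "s1"))
    print(homozygous_alt_array(conn, "s1"))
```

In this sketch, the same question (which variants are homozygous for the alternate allele in a given sample) is answered by filtering on a value in the narrow layout, on a column in the wide layout, and on a list position in the array-like layout. The array filter is done in Python here only to avoid depending on SQLite's optional JSON functions; an RDBMS with native array or JSON types would evaluate it inside the query.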