SIGMOD 2025

Adda: Towards Efficient in-Database Feature Generation via LLM-based Agents

Kuan Lu, Zhihui Yang, Sai Wu, Ruichen Xia, Dongxiang Zhang, Gang Chen


Abstract

Integrating machine learning (ML) analytics into existing database management systems (DBMSs) not only eliminates the need for costly data transfers to external ML platforms but also ensures compliance with regulatory standards. While some DBMSs have integrated functionality for training and applying ML models, these tasks remain challenging, particularly because of limited support for automatic feature engineering (AutoFE), which is crucial for optimizing ML model performance. In this paper, we introduce Adda, an agent-driven in-database feature generation tool designed to automatically create high-quality features for ML analytics directly within the database. Adda interprets ML analytics tasks described in natural language and generates code for feature construction using large language models (LLMs) integrated with specialized agents. This code is then translated into SQL statements using a predefined set of operators and compiled just-in-time (JIT) into user-defined functions (UDFs). The result is a seamless, fully in-database solution for feature generation, specifically tailored to ML analytics tasks. Extensive experiments across 14 public datasets, with five ML tasks per dataset, show that Adda improves AUC by up to 33.2% and reduces end-to-end latency by up to 100x compared to MADlib.
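To make the translation step concrete, the following is a minimal sketch of how a feature expression might be rendered into SQL through a fixed operator table. It is not Adda's implementation: the operator names (`add`, `sub`, `mul`, `div`, `log`), the `Col` and `Op` classes, and the functions `to_sql` and `feature_query` are all hypothetical, and the sketch covers only expression-to-SQL rendering, not LLM code generation or JIT compilation into UDFs.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union

# Hypothetical operator table (an assumption, not Adda's operator set):
# operator name -> (arity, SQL template).
SQL_OPERATORS: Dict[str, Tuple[int, str]] = {
    "add": (2, "({0} + {1})"),
    "sub": (2, "({0} - {1})"),
    "mul": (2, "({0} * {1})"),
    "div": (2, "({0} / NULLIF({1}, 0))"),  # NULL instead of a division-by-zero error
    "log": (1, "LN({0} + 1)"),
}


@dataclass(frozen=True)
class Col:
    """Reference to a column of the source table."""
    name: str


@dataclass(frozen=True)
class Op:
    """Application of a named operator to sub-expressions."""
    name: str
    args: Tuple["Expr", ...]


Expr = Union[Col, Op]


def quote_ident(name: str) -> str:
    """Quote an identifier using standard SQL double-quote escaping."""
    return '"' + name.replace('"', '""') + '"'


def to_sql(expr: Expr) -> str:
    """Recursively render a feature expression as a SQL scalar expression."""
    if isinstance(expr, Col):
        return quote_ident(expr.name)
    if expr.name not in SQL_OPERATORS:
        raise ValueError(f"unsupported operator: {expr.name}")
    arity, template = SQL_OPERATORS[expr.name]
    if len(expr.args) != arity:
        raise ValueError(f"{expr.name} expects {arity} argument(s)")
    return template.format(*(to_sql(arg) for arg in expr.args))


def feature_query(table: str, features: Dict[str, Expr]) -> str:
    """Build a SELECT statement that computes each named feature."""
    columns = ",\n  ".join(
        f"{to_sql(expr)} AS {quote_ident(name)}" for name, expr in features.items()
    )
    return f"SELECT\n  {columns}\nFROM {quote_ident(table)}"
```

For example, `to_sql(Op("div", (Col("income"), Col("household_size"))))` renders a ratio feature as a single SQL expression. Restricting generated code to a closed operator table means that anything outside the table is rejected with an error rather than passed to the database.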