VLDB 2025
Semantic Operators and Their Optimization: Towards AI-Based Data Analytics with Accuracy Guarantees
Liana Patel, Siddharth Jha, Melissa Z. Pan, Harshit Gupta, Parth Asawa, Carlos Guestrin, Matei Zaharia
Cited 16 times
Abstract
The semantic capabilities of large language models (LLMs) have the potential to enable rich analytics and reasoning over vast knowledge corpora. Unfortunately, existing systems either empirically optimize expensive LLM-powered operations with no performance guarantees, or limit their support to simple batched-inference primitives. We introduce semantic operators, the first formalism with statistical accuracy guarantees for general-purpose AI-based operations with natural language parameters (e.g., filtering, sorting, joining or aggregating records using natural language criteria). Each operator can be implemented by multiple AI algorithms, which compose individual model invocations to orchestrate the model over the data. Our programming model specifies the expected behavior of each operator with a high-quality reference algorithm, and we develop an optimization framework that reduces cost while providing accuracy guarantees for individual operators. Using this approach, we propose several novel optimizations to accelerate semantic filtering, joining, group-by and top-k operations by up to 1,000×. We implement semantic operators in the LOTUS system and demonstrate LOTUS' effectiveness on real, bulk-semantic processing applications, including fact-checking, biomedical multi-label classification, search, and topic analysis. We show that the semantic operator model is expressive, capturing state-of-the-art AI pipelines in a few operator calls, and making it easy to express new pipelines that match or exceed the quality of recent LLM-based analytic systems by up to 170%, while offering accuracy guarantees. Overall, LOTUS programs match or exceed the accuracy of state-of-the-art AI pipelines for each task while running up to 3.6× faster than the highest-quality baselines. LOTUS is publicly available at https://github.com/lotus-data/lotus.