KDD2024

Empower an End-to-end Scalable and Interpretable Data Science Ecosystem using Statistics, AI and Domain Science

Xihong Lin

摘要

The data science ecosystem encompasses data fairness, statistical, ML and AI methods and tools, interpretable data analysis and results, and trustworthy decision-making. Rapid advancements in AI have revolutionized data utilization and enabled machines to learn from data more effectively. Statistics, as the science of learning from data while accounting for uncertainty, plays a pivotal role in addressing complex real-world problems and facilitating trustworthy decision-making. In this talk, I will discuss the challenges and opportunities involved in building an end-to-end scalable and interpretable data science ecosystem using the analysis of whole genome sequencing studies and biobanks that integrates statistics, ML/AI, and genomic and health science as an example. Biobanks collect whole genome data, electronic health records and epidemiological data. I will illustrate key points using the analysis of multi-ancestry whole genome sequencing studies and biobanks by discussing a few scalable and interpretable statistical and ML/AI methods, tools and data science resources.