ICML 2025

A Mathematical Framework for AI-Human Integration in Work

L. Elisa Celis, Lingxiao Huang, Nisheeth K. Vishnoi

Abstract

Job structure is underspecified
Example: O*NET, a comprehensive database maintained by the U.S. Department of Labor that provides standardized descriptions of >1000 jobs
Challenges:
• How do tasks depend on skills?
• How to evaluate performance at the level of a skill, task, or job?

Human evaluations conflate subskills
Example: KPI
Subskills involved:
• 🧠 Diagnose (reasoning)
• 🛠 Fix + test code (execution)
Problems:
• Same score ≠ same skills
• Failures are uninterpretable
Challenges:
• Conflate reasoning with execution
• Lack of standardization
• Obscure where intervention is needed for upskilling

AI benchmarks evaluate isolated skills
Example: Big-Bench Lite
What's missing:
• No diagnosis, prioritization, or multi-step task context
• No way to assess judgment or adaptation
• No notion of job-level success
Challenges:
• AI is evaluated on fragments
• Statistical noise in evaluation
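
The "Same score ≠ same skills" problem noted above can be made concrete with a toy calculation. The sketch below is illustrative only and is not taken from the source: the `Worker` class, the numeric values, and the rule that a single KPI equals the product of a reasoning success rate and an execution success rate (i.e., both steps must succeed, independently) are all assumptions chosen to show how one aggregate number can hide different subskill profiles.

```python
# Toy illustration (all names and numbers are hypothetical assumptions).
from dataclasses import dataclass


@dataclass(frozen=True)
class Worker:
    name: str
    diagnose: float  # assumed success rate of the reasoning subskill
    fix: float       # assumed success rate of the execution subskill

    def kpi(self) -> float:
        # Assumption: a bug counts as resolved only if diagnosis AND
        # fix both succeed, and the two steps succeed independently.
        return self.diagnose * self.fix

    def weakest_subskill(self) -> str:
        # The subskill with the lower success rate is where an
        # upskilling intervention would plausibly be targeted.
        return "diagnose" if self.diagnose < self.fix else "fix"


a = Worker("A", diagnose=0.9, fix=0.5)
b = Worker("B", diagnose=0.5, fix=0.9)

same_score = a.kpi() == b.kpi()
same_skills = (a.diagnose, a.fix) == (b.diagnose, b.fix)

print(same_score, same_skills)                     # True False
print(a.weakest_subskill(), b.weakest_subskill())  # fix diagnose
```

Under these assumptions the two workers are indistinguishable by KPI, yet they would need opposite interventions, which is the sense in which a conflated score obscures where upskilling is needed.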