ICML 2025

A Mathematical Framework for AI-Human Integration in Work

L. Elisa Celis, Lingxiao Huang, Nisheeth K. Vishnoi

Abstract

Job structure is underspecified
Example: O*NET, a comprehensive database maintained by the U.S. Department of Labor that provides standardized descriptions of >1000 jobs
Challenges:
• How do tasks depend on skills?
• How should performance be evaluated at the level of a skill, a task, and a job?

Human evaluations conflate subskills
Example: KPI
Subskills involved:
• 🧠 Diagnose (reasoning)
• 🛠 Fix + test code (execution)
Problems:
• Same score ≠ same skills
• Failures are uninterpretable
Challenges:
• Conflate reasoning with execution
• Lack of standardization
• Obscure where intervention is needed for upskilling

AI benchmarks evaluate isolated skills
Example: Big-Bench Lite
What's missing:
• No diagnosis, prioritization, or multi-step task context
• No way to assess judgment or adaptation
• No notion of job-level success
Challenges:
• AI is evaluated on fragments
• Statistical noise in evaluation