ICML 2025
A Mathematical Framework for AI-Human Integration in Work
L. Elisa Celis, Lingxiao Huang, Nisheeth K. Vishnoi
Abstract
Job structure is underspecified
Example: O*NET, a comprehensive database maintained by the U.S. Department of Labor that provides standardized descriptions of >1000 jobs
Challenges:
• How do tasks depend on skills?
• How should performance be evaluated at the level of a skill, a task, and a job?

Human evaluations conflate subskills
Example: KPI
Subskills involved:
• Diagnose (reasoning)
• Fix + test code (execution)
Problems:
• Same score ≠ same skills
• Failures are uninterpretable
Challenges:
• Conflates reasoning with execution
• Lack of standardization
• Obscures where intervention is needed for upskilling

AI benchmarks evaluate isolated skills
Example: Big-Bench Lite
What's missing:
• No diagnosis, prioritization, or multi-step task context
• No way to assess judgment or adaptation
• No notion of job-level success
Challenges:
• AI is evaluated on fragments
• Statistical noise in evaluation
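
The point that the same score does not imply the same skills can be made concrete with a toy calculation. The sketch below is only an illustration, not the framework's own model. It assumes a task succeeds only when both the reasoning subskill (diagnose) and the execution subskill (fix + test code) succeed, so a KPI-style score is taken to be the product of the two. The `Worker` class, the function names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Worker:
    name: str
    reasoning: float  # hypothetical success rate at diagnosing the problem
    execution: float  # hypothetical success rate at fixing and testing the code


def task_score(w: Worker) -> float:
    """Aggregate KPI-style score, ASSUMING the task needs both subskills to succeed."""
    return w.reasoning * w.execution


def weakest_subskill(w: Worker) -> str:
    """Name the subskill where upskilling would help this worker most."""
    return "reasoning" if w.reasoning < w.execution else "execution"


# Two hypothetical workers with mirrored subskill profiles.
a = Worker("A", reasoning=0.9, execution=0.6)
b = Worker("B", reasoning=0.6, execution=0.9)

for w in (a, b):
    print(w.name, task_score(w), weakest_subskill(w))
```

Both workers get an identical aggregate score, yet the useful intervention differs: worker A needs help with execution and worker B with reasoning. An evaluation that reports only the aggregate cannot tell the two apart.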