ICML 2025
A Mathematical Framework for AI-Human Integration in Work
L. Elisa Celis, Lingxiao Huang, Nisheeth K. Vishnoi
Abstract
Problems:

Job structure is underspecified
Example: O*NET, a comprehensive database maintained by the U.S. Department of Labor, provides standardized descriptions of more than 1,000 jobs.
Challenges:
• How do tasks depend on skills?
• How should performance be evaluated at the level of a skill, a task, and a job?

Human evaluations conflate subskills
Example: KPI
Subskills involved:
• 🧠 Diagnose (reasoning)
• 🛠 Fix + test code (execution)
Same score ≠ same skills. Failures are uninterpretable.
Challenges:
• Conflates reasoning with execution
• Lack of standardization
• Obscures where intervention is needed for upskilling

AI benchmarks evaluate isolated skills
Example: Big-Bench Lite
What's missing:
• No diagnosis, prioritization, or multi-step task context
• No way to assess judgment or adaptation
• No notion of job-level success
Challenges:
• AI is evaluated on fragments
• Statistical noise in evaluation
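
To make the "same score ≠ same skills" point concrete, here is a minimal Python sketch. It is an illustration only, not the authors' model: the `SubskillProfile` class, the two workers, their numbers, and the assumption that a task succeeds only when both the reasoning and the execution subskill succeed independently (so the task score is their product) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubskillProfile:
    """Hypothetical per-subskill success rates for one worker (human or AI)."""
    reasoning: float  # e.g. diagnosing the bug
    execution: float  # e.g. fixing and testing the code


def task_score(profile: SubskillProfile) -> float:
    # Assumption: the task needs both subskills and they succeed independently,
    # so the aggregate score is the product of the two success rates.
    return profile.reasoning * profile.execution


def weakest_subskill(profile: SubskillProfile) -> str:
    # The subskill with the lower success rate is where upskilling would help most.
    return "reasoning" if profile.reasoning < profile.execution else "execution"


# Two made-up workers with mirrored subskill profiles.
worker_a = SubskillProfile(reasoning=0.9, execution=0.6)
worker_b = SubskillProfile(reasoning=0.6, execution=0.9)

for name, worker in [("A", worker_a), ("B", worker_b)]:
    print(name, task_score(worker), weakest_subskill(worker))
```

Both workers receive the same aggregate task score, yet they need opposite interventions. An aggregate metric alone cannot tell them apart, which is the sense in which failures become uninterpretable.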