ICLR 2026

Bayesian Ensemble for Sequential Decision-Making

Rui Liu, Enmin Zhao, Lu Wang, Yu Li, Ming Pang, Changping Peng, Zhangang Lin, Ching Law, Jingping Shao

Cited by 2

Abstract

Large language models increasingly serve as autonomous decision-making agents in domains where errors have measurable costs: hiring (missed qualified candidates versus wasted interviews), medical triage (missed emergencies versus unnecessary escalations), and fraud detection (approved fraud versus declined legitimate transactions). Current architectures are built on a flawed foundation: they query LLMs for discriminative probabilities p(state|evidence), apply arbitrary confidence thresholds, and execute actions without considering cost asymmetries or uncertainty quantification. We prove this approach is formally inadequate for sequential decision-making and propose a mathematically principled alternative that treats multiple LLMs as approximate likelihood functions rather than classifiers. For each possible state, we elicit p(evidence|state) through contrastive prompting, aggregate across diverse models via robust statistics, and apply Bayes' rule with explicit priors. This generative modeling perspective enables four critical capabilities: (1) proper sequential belief updating as evidence accumulates, (2) cost-aware action selection through expected utility maximization, (3) principled information gathering via value-of-information calculations, and (4) improved fairness through ensemble bias mitigation. We instantiate this framework in resume screening, where hiring mistakes cost $40,000, wasted interviews cost $2,500, and phone screens cost $150. Experiments across 1,000 resumes evaluated by five diverse LLMs (GPT-4o, Claude 3.5 Sonnet, Gemini Pro, Grok, DeepSeek) demonstrate that our approach reduces total costs by $294,000 (34% improvement) compared to the best single-LLM baseline while improving demographic parity by 45% (reducing maximum group difference from 22 to 5 percentage points). Ablation studies reveal that multi-LLM aggregation contributes 51% of cost savings, sequential updating 43%, and disagreement-triggered information gathering 20%. Critically, we prove these gains are not merely empirical accidents but necessary consequences of correcting the mathematical foundations of LLM-based decision-making.
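The pipeline described in the abstract (elicit p(evidence|state) from several models, aggregate robustly, apply Bayes' rule, then choose the action with the lowest expected cost) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The two-state space, the use of the median as the robust statistic, the prior, the per-model likelihood values, and the mapping of the $40,000 and $2,500 figures onto a cost matrix are all assumptions made for illustration, as are the function names.

```python
import math
from statistics import median

# Hypothetical two-state space; the abstract does not enumerate the states.
STATES = ("qualified", "unqualified")

# Assumed cost matrix COSTS[action][state], loosely based on the dollar
# figures quoted in the abstract. Treating the $40,000 mistake as
# "rejecting a qualified candidate" is an illustrative assumption.
COSTS = {
    "interview": {"qualified": 0.0, "unqualified": 2500.0},
    "reject": {"qualified": 40000.0, "unqualified": 0.0},
}


def aggregate_log_likelihoods(per_model):
    """Per state, take the median over models of log p(evidence | state).

    The median stands in for the unspecified "robust statistics".
    """
    return {s: median(m[s] for m in per_model) for s in STATES}


def bayes_update(prior, log_lik):
    """Posterior proportional to prior times likelihood, in log space."""
    unnorm = {s: math.log(prior[s]) + log_lik[s] for s in STATES}
    top = max(unnorm.values())  # subtract the max for numerical stability
    weights = {s: math.exp(v - top) for s, v in unnorm.items()}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


def expected_cost(posterior, action):
    """Expected cost of an action under the current belief."""
    return sum(posterior[s] * COSTS[action][s] for s in STATES)


def best_action(posterior):
    """Action that minimizes expected cost."""
    return min(COSTS, key=lambda a: expected_cost(posterior, a))


if __name__ == "__main__":
    # Made-up likelihoods from three hypothetical models; the third is an
    # outlier that the median is meant to dampen.
    per_model = [
        {"qualified": math.log(0.6), "unqualified": math.log(0.2)},
        {"qualified": math.log(0.5), "unqualified": math.log(0.25)},
        {"qualified": math.log(0.9), "unqualified": math.log(0.05)},
    ]
    prior = {"qualified": 0.2, "unqualified": 0.8}
    posterior = bayes_update(prior, aggregate_log_likelihoods(per_model))
    print(posterior, best_action(posterior))
```

With these made-up numbers the posterior probability of "qualified" is about 0.43, below a naive 0.5 threshold, yet interviewing has the lower expected cost (about $1,429 versus about $17,143 for rejecting) because the two errors are priced so differently. Sequential updating amounts to passing the returned posterior back in as the prior when the next piece of evidence arrives, assuming the pieces of evidence are conditionally independent given the state.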