ICLR 2026

Bayesian Ensemble for Sequential Decision-Making

Rui Liu, Enmin Zhao, Lu Wang, Yu Li, Ming Pang, Changping Peng, Zhangang Lin, Ching Law, Jingping Shao

Abstract

Large language models increasingly serve as autonomous decision-making agents in domains where errors have measurable costs: hiring (missed qualified candidates versus wasted interviews), medical triage (missed emergencies versus unnecessary escalations), and fraud detection (approved fraud versus declined legitimate transactions). Current architectures are built on a flawed foundation: they query LLMs for discriminative probabilities p(state|evidence), apply arbitrary confidence thresholds, and execute actions without accounting for cost asymmetries or quantifying uncertainty. We prove this approach is formally inadequate for sequential decision-making and propose a mathematically principled alternative that treats multiple LLMs as approximate likelihood functions rather than classifiers. For each possible state, we elicit p(evidence|state) through contrastive prompting, aggregate across diverse models via robust statistics, and apply Bayes' rule with explicit priors. This generative modeling perspective enables four critical capabilities: (1) proper sequential belief updating as evidence accumulates, (2) cost-aware action selection through expected utility maximization, (3) principled information gathering via value-of-information calculations, and (4) improved fairness through ensemble bias mitigation. We instantiate this framework in resume screening, where hiring mistakes cost $40,000, wasted interviews cost $2,500, and phone screens cost $150. Experiments across 1,000 resumes evaluated by five diverse LLMs (GPT-4o, Claude 3.5 Sonnet, Gemini Pro, Grok, DeepSeek) demonstrate that our approach reduces total costs by $294,000 (34% improvement) compared to the best single-LLM baseline while improving demographic parity by 45% (reducing maximum group difference from 22 to 5 percentage points). Ablation studies reveal that multi-LLM aggregation contributes 51% of cost savings, sequential updating 43%, and disagreement-triggered information gathering 20%. Critically, we prove these gains are not merely empirical accidents but necessary consequences of correcting the mathematical foundations of LLM-based decision-making.
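
The pipeline described in the abstract (per-state likelihoods, robust aggregation across models, a Bayes update, then expected-cost action selection) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the use of the median as the robust statistic, the two-state model, the example likelihood values, and the way the abstract's dollar figures are mapped onto actions are all assumptions made for illustration.

```python
from statistics import median

STATES = ("qualified", "unqualified")

# Hypothetical cost table, cost[action][state]. The dollar amounts come from
# the abstract; assigning $40,000 to rejecting a qualified candidate and
# $2,500 to interviewing an unqualified one is an assumption.
COSTS = {
    "interview": {"qualified": 0.0, "unqualified": 2_500.0},
    "reject": {"qualified": 40_000.0, "unqualified": 0.0},
}


def aggregate_likelihoods(per_model):
    """Combine p(evidence | state) across models with the median.

    The abstract says only "robust statistics"; the median is one
    possible choice, assumed here.
    """
    return {s: median(m[s] for m in per_model) for s in STATES}


def bayes_update(prior, likelihood):
    """Return the posterior p(state | evidence) via Bayes' rule."""
    unnormalized = {s: prior[s] * likelihood[s] for s in STATES}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("evidence has zero probability under every state")
    return {s: v / total for s, v in unnormalized.items()}


def expected_cost(action, belief):
    """Expected cost of an action under the current belief."""
    return sum(belief[s] * COSTS[action][s] for s in STATES)


def best_action(belief):
    """Pick the action with the lowest expected cost."""
    return min(COSTS, key=lambda a: expected_cost(a, belief))


# Made-up likelihoods from five models; the last one is an outlier that
# the median largely ignores.
per_model = [
    {"qualified": 0.60, "unqualified": 0.30},
    {"qualified": 0.50, "unqualified": 0.40},
    {"qualified": 0.70, "unqualified": 0.20},
    {"qualified": 0.55, "unqualified": 0.35},
    {"qualified": 0.10, "unqualified": 0.90},
]
prior = {"qualified": 0.2, "unqualified": 0.8}  # explicit prior (assumed)

likelihood = aggregate_likelihoods(per_model)
posterior = bayes_update(prior, likelihood)
action = best_action(posterior)
```

For sequential updating, the posterior from one piece of evidence would be passed back in as the prior for the next call to `bayes_update`. Under this assumed cost table, interviewing has the lower expected cost whenever the posterior probability of "qualified" exceeds 2,500 / 42,500, about 5.9%, which shows how asymmetric costs replace an arbitrary confidence threshold.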