ICLR2026

Towards a Foundation Model for Crowdsourced Label Aggregation

Hao Liu, Jiacheng Liu, Feilong Tang, Long Chen, Jiadi Yu, Yanmin Zhu, Qiwen Dong, Yichuan Yu, Xiaofeng Hou

摘要

Inferring ground truth from noisy, crowdsourced labels is a fundamental challenge in machine learning. For decades, the dominant paradigm has relied on dataset-specific parameter estimation, a non-scalable method that fails to transfer knowledge. Recent efforts toward universal aggregation models do not account for the structural and behavioral complexities of human-annotated crowdsourcing, resulting in poor real-world performance. To address this gap, we introduce CrowdFM, a foundation model for crowdsourced label aggregation. At its core, CrowdFM is a bipartite graph neural network that is pre-trained on a vast, domain-randomized synthetic dataset to learn diverse behavioral patterns. By leveraging a size-invariant initialization and attention-based message passing, it learns universal principles of collective intelligence and generalizes to new, unseen datasets. Extensive experiments on 22 real-world benchmarks show that our single, fixed model consistently matches or surpasses bespoke, per-dataset methods in both accuracy and efficiency. Furthermore, the representations learned by CrowdFM readily support diverse downstream applications, such as worker assessment and task assignment. Codes are available at https://github.com/liiuhaao/CrowdFM.