WWW2026

AlignCP: Noise-Aware Preference Alignment for LLMs via Confidence and Polarity Reweighting

Chen Cheng, Hefei Xu, Le Wu

Abstract

Large Language Models (LLMs) are now widely deployed across modern web services, but their safe and trustworthy use in real-world settings depends critically on accurate alignment with human preferences. Preference alignment is typically achieved with methods such as reinforcement learning or direct preference optimization (DPO), whose effectiveness in practice hinges on the quality of the labeled preference data. However, a fundamental practical challenge remains: preference datasets inevitably contain noise. Through a systematic analysis of mainstream preference datasets, we find that roughly 25% of preference pairs show clear inconsistencies between reward-model evaluations and human annotations. Such inconsistent examples do not convey reliable preference signals, so training directly on them not only fails to improve alignment but can even degrade model behavior. To address this problem, we propose Noise-Aware Preference Alignment for LLMs via Confidence and Polarity Reweighting (AlignCP), a fully automated, human-free framework for noise-aware preference alignment. AlignCP derives two interpretable metrics from reward-model outputs: Confidence, which measures the reliability of each preference judgment, and Polarity, which indicates whether the reward-model ranking agrees with the original human label. These metrics are combined to assign a training weight to each sample, amplifying high-confidence, label-consistent pairs while down-weighting or discarding low-confidence, contradictory, or otherwise noisy pairs. Unlike methods that rely on human re-inspection, repeated relabeling, or heavy computational reruns, AlignCP performs automated data-quality control with minimal overhead and no human intervention. Empirical results show that AlignCP substantially outperforms existing data-centric alignment approaches on standard preference benchmarks and remains more robust under noisy supervision.
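
As a rough illustration of the reweighting idea summarized above, the following Python sketch derives a confidence score and a polarity from two reward-model scores and combines them into a per-sample training weight. The abstract does not state the exact formulas, so everything here is an assumption made for illustration: the tanh-based confidence, the sign-based polarity, the parameters `min_conf` and `contra_scale`, and all function names are hypothetical and should not be read as AlignCP's actual definitions.

```python
import math


def confidence(r_chosen: float, r_rejected: float) -> float:
    """Hypothetical confidence in [0, 1): grows with the reward margin.

    tanh(|d| / 2) equals |2 * sigmoid(d) - 1|, i.e. how far a
    Bradley-Terry style preference probability is from a coin flip.
    """
    return math.tanh(abs(r_chosen - r_rejected) / 2.0)


def polarity(r_chosen: float, r_rejected: float) -> int:
    """Hypothetical polarity: +1 if the reward model ranks the
    human-chosen response at least as high as the rejected one, else -1."""
    return 1 if r_chosen >= r_rejected else -1


def sample_weight(
    r_chosen: float,
    r_rejected: float,
    min_conf: float = 0.1,      # assumed threshold below which a pair is discarded
    contra_scale: float = 0.5,  # assumed cap on the weight of contradictory pairs
) -> float:
    """Combine confidence and polarity into one training weight (illustrative only)."""
    c = confidence(r_chosen, r_rejected)
    if c < min_conf:
        return 0.0  # low confidence: discard the pair
    if polarity(r_chosen, r_rejected) > 0:
        return 1.0 + c  # label-consistent: amplify with confidence
    # Contradictory: the more confidently the reward model disagrees
    # with the human label, the smaller the weight.
    return contra_scale * (1.0 - c)


if __name__ == "__main__":
    pairs = [(2.0, 0.0), (0.0, 2.0), (0.05, 0.0)]
    for r_c, r_r in pairs:
        print(r_c, r_r, round(sample_weight(r_c, r_r), 3))
```

In a DPO-style setup, a weight of this kind would typically multiply each pair's loss term before averaging over the batch. Whether AlignCP applies its weights in exactly that way is not specified in the abstract.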