ICLR 2026

PALC: Preference Alignment via Logit Calibration

Sanghyun Lee, Hoh Peter In

Abstract

Aligning large language models with human preferences typically requires computationally intensive training or complex reward architectures. We introduce PALC (Preference Alignment via Logit Calibration), a parameter-efficient framework that achieves test-time alignment by calibrating the model's outputs directly in vocabulary space. Steering methods intervene in the entangled hidden space and therefore risk unintended side effects from feature superposition. PALC instead treats hidden states strictly as read-only context and applies its interventions in the naturally disentangled logit space. A bottleneck architecture learns to compress the base model's hidden states and generate position-dependent calibration vectors, requiring only a fraction of the base model's parameters. This design sidesteps the superposition problem inherent in representation engineering while eliminating the computational overhead of guided decoding methods. A single scaling factor adjusts alignment strength at runtime without retraining, letting practitioners trade off preserving model capabilities against enforcing preferences. Experiments show that PALC outperforms most test-time alignment methods while maintaining near-baseline inference speed. Our ablations indicate that human preferences concentrate on surprisingly low-dimensional manifolds, which supports our architectural choices. By establishing vocabulary-space intervention as an effective alignment paradigm, PALC makes preference alignment accessible for resource-constrained deployments where traditional methods are infeasible, and opens new avenues for scalable and adaptive AI alignment. Our code is available at https://github.com/s4n9hyun/PALC.
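The calibration step described in the abstract can be illustrated with a minimal sketch for a single token position. This is not the released implementation: the function names `matvec` and `calibrate`, the `tanh` nonlinearity, the toy dimensions, and the random weights are all assumptions made for illustration. The sketch only shows the general shape of the idea, namely that the hidden state is read but never modified, a low-dimensional bottleneck produces a vocabulary-sized calibration vector, and a scalar `alpha` scales that vector before it is added to the base logits.

```python
import math
import random

def matvec(vec, mat):
    """Multiply a row vector of length n by an n x m matrix (list of rows)."""
    return [sum(v * row[j] for v, row in zip(vec, mat))
            for j in range(len(mat[0]))]

def calibrate(hidden, base_logits, w_down, w_up, alpha=1.0):
    """Return base logits plus a scaled calibration vector (hypothetical sketch).

    hidden      : hidden state at one position, used as read-only context
    base_logits : logits of the frozen base model at that position
    w_down      : d_model x r projection into the bottleneck (assumed shape)
    w_up        : r x vocab projection into vocabulary space (assumed shape)
    alpha       : runtime alignment strength; 0.0 recovers the base model
    """
    # Compress the hidden state into r dimensions (tanh is an assumption).
    z = [math.tanh(x) for x in matvec(hidden, w_down)]
    # Expand the bottleneck code into one calibration value per vocabulary entry.
    delta = matvec(z, w_up)
    # Intervene only in logit space; the hidden state itself is untouched.
    return [b + alpha * d for b, d in zip(base_logits, delta)]

# Toy dimensions, chosen only for illustration.
random.seed(0)
d_model, r, vocab = 8, 2, 5
hidden = [random.gauss(0, 1) for _ in range(d_model)]
base_logits = [random.gauss(0, 1) for _ in range(vocab)]
w_down = [[random.gauss(0, 0.1) for _ in range(r)] for _ in range(d_model)]
w_up = [[random.gauss(0, 0.1) for _ in range(vocab)] for _ in range(r)]

unaligned = calibrate(hidden, base_logits, w_down, w_up, alpha=0.0)
aligned = calibrate(hidden, base_logits, w_down, w_up, alpha=1.0)
stronger = calibrate(hidden, base_logits, w_down, w_up, alpha=2.0)
```

In this sketch, setting `alpha=0.0` returns the base logits unchanged, and the shift away from the base logits grows linearly with `alpha`, which is one simple way a single scaling factor could control alignment strength without retraining.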