ICLR 2026

BIRD: Behavior Induction via Representation-structure Distillation

Galen Pogoncheff, Michael Beyeler

Abstract

Human-aligned deep learning models exhibit behaviors consistent with human values, such as robustness, fairness, and honesty. Transferring these behavioral properties to models trained on different tasks or data distributions remains challenging: aligned behavior is easily forgotten during fine-tuning, and collecting task-specific data that preserves this behavior can be prohibitively costly. We introduce BIRD (Behavior Induction via Representation-structure Distillation), a flexible framework for transferring aligned behavior by matching the internal representation structure of a student model to that of a teacher. Applied to out-of-distribution robustness in image classification, BIRD outperforms fine-tuning, transfer learning, and continual learning methods, improving robust accuracy by up to 16% over the next strongest baseline. It remains effective even when the teacher is trained on a much simpler dataset and is 25× smaller than the student. In a large-scale study of over 400 teacher-student pairs, we show that three interpretable and computable properties of the teacher's representations (i.e., task relevance, behavioral relevance, and complementary knowledge) explain up to 85% of the variance in transfer success. These insights offer practical guidance for teacher selection and design. BIRD turns small, well-aligned models into scalable alignment seeds, removing a key bottleneck in deploying safe AI systems in the wild.

In the context of adversarial robustness, this issue has been addressed by fine-tuning robust feature encoders [6, 11] or by incorporating learning strategies that mitigate catastrophic forgetting [6, 12]. However, these approaches typically assume that robust features will generalize across domains, an assumption that holds only when pre-training datasets are sufficiently large and diverse, and such datasets are costly to obtain and train on.
Recent work in weak-to-strong generalization offers a promising alternative: train a small, well-aligned "weak" model and use it to supervise a larger "strong" model [9]. However, these approaches

Preprint. Under review.