EMNLP 2024

LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models

Hayder Elesedy, Pedro M. Esperança, Silviu Vlad Oprea, Mete Ozay

Cited by 5

Abstract

Guardrails have emerged as a comprehensive method of content moderation for large language models (LLMs), complementing safety alignment from fine-tuning. However, existing model-based guardrails are too memory-intensive for use on resource-constrained computational devices such as mobile phones, an increasing number of which run LLM-based applications locally. We introduce LoRA-Guard, a parameter-efficient guardrail adaptation method that relies on knowledge sharing between LLMs and guardrail models. LoRA-Guard extracts language features from the LLMs and adapts them for the content moderation task using low-rank adapters in a dual-path design, which prevents any performance degradation on the generative task. We show that LoRA-Guard outperforms existing guardrail approaches while using 100-1000x fewer guardrail parameters, enabling on-device content moderation.

Version note: changes in this version (v2) relative to v1: separate output heads for safe/unsafe classification and harm category classification (§4.3), training on the BeaverTails dataset (§4.2.1), use of recent chat models (§4.1), and comparison with recent guard models (§5).
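The dual-path idea in the abstract can be illustrated with a toy example. The following is only a minimal sketch under stated assumptions, not the authors' implementation: it uses a single linear layer in place of a full transformer, pure-Python lists in place of tensors, and the names `DualPathLinear`, `GuardHeads`, the rank of 2, and the 3 harm categories are all hypothetical choices made for illustration. It shows a frozen weight shared by both paths, a low-rank adapter applied only on the guard path, and two separate output heads as mentioned in the version note.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Random Gaussian matrix, standing in for learned weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class DualPathLinear:
    """Toy layer (hypothetical): frozen base weight W shared by both paths,
    plus a low-rank update B @ A that is applied only on the guard path."""

    def __init__(self, d_in, d_out, rank, rng):
        self.W = rand_matrix(d_out, d_in, rng)  # frozen, shared with the chat model
        self.A = rand_matrix(rank, d_in, rng)   # adapter factor (would be trained)
        self.B = rand_matrix(d_out, rank, rng)  # adapter factor (would be trained)

    def forward(self, x, guard=False):
        base = matvec(self.W, x)
        if not guard:
            # Generative path: the adapter is bypassed, so the output is
            # exactly what the unmodified model would produce.
            return base
        # Guard path: add the low-rank correction B @ (A @ x).
        delta = matvec(self.B, matvec(self.A, x))
        return [b + d for b, d in zip(base, delta)]

    def adapter_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def base_params(self):
        return len(self.W) * len(self.W[0])


class GuardHeads:
    """Two separate linear heads (assumed shapes): one safe/unsafe logit
    and one logit per harm category."""

    def __init__(self, d_feat, n_categories, rng):
        self.safety = rand_matrix(1, d_feat, rng)
        self.category = rand_matrix(n_categories, d_feat, rng)

    def forward(self, feat):
        return matvec(self.safety, feat)[0], matvec(self.category, feat)


rng = random.Random(0)
layer = DualPathLinear(d_in=16, d_out=16, rank=2, rng=rng)
heads = GuardHeads(d_feat=16, n_categories=3, rng=rng)  # 3 categories is arbitrary
x = [rng.gauss(0.0, 1.0) for _ in range(16)]

gen_out = layer.forward(x)                  # generative path
guard_feat = layer.forward(x, guard=True)   # guard path
unsafe_logit, category_logits = heads.forward(guard_feat)

print("adapter parameters:", layer.adapter_params())
print("base parameters:   ", layer.base_params())
```

Because the generative path never reads the adapter, its output is identical to that of the base layer alone, which is the sense in which a dual-path design avoids degrading generation. In this toy layer the adapter adds rank × (d_in + d_out) parameters on top of the d_in × d_out frozen ones; the 100-1000x figure in the abstract is the paper's own measurement against existing guard models and is not reproduced by this sketch.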