EMNLP 2025
Rule Discovery for Natural Language Inference Data Generation Using Out-of-Distribution Detection
Juyoung Han, Hyunsun Hwang, Changki Lee
Abstract
Natural Language Inference (NLI) is a fundamental task in Natural Language Processing (NLP). However, adapting NLI models to new domains remains challenging because collecting domain-specific training data is costly. Prior work proposed 15 sentence transformation rules to automate training data generation, but these rules do not sufficiently capture the diversity of natural language. We propose a novel framework that combines Out-of-Distribution (OOD) detection with BERT-based clustering to identify premise-hypothesis pairs in the SNLI dataset that the existing rules do not cover, and we derive four new transformation rules from these pairs. Using these rules with Chain-of-Thought (CoT) prompting and Large Language Models (LLMs), we generate high-quality training data and augment the SNLI dataset. Our method yields consistent accuracy improvements across dataset sizes, ranging from +0.85 percentage points (%p) at 2k samples to +0.15%p at 550k samples. Furthermore, a distribution-aware augmentation strategy enhances performance across all scales. Beyond manual explanations, we extend our framework to automatically generated explanations (CoT-Ex), demonstrating that they provide a scalable alternative to human-written explanations and enable reliable rule discovery.
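As a rough illustration of the detect-then-cluster idea summarized above, the following minimal sketch flags candidate vectors that lie far from a reference distribution and then groups the flagged vectors. It rests on several assumptions that are not taken from the paper: the Mahalanobis distance stands in for the OOD detector, a small k-means routine stands in for the clustering step, random vectors stand in for BERT embeddings of premise-hypothesis pairs, and all function names and the 0.99 quantile threshold are hypothetical.

```python
import numpy as np


def mahalanobis_scores(reference, candidates):
    """Distance of each candidate vector from the reference distribution.

    `reference` stands in for embeddings of pairs covered by existing rules
    (an assumption for this sketch); higher scores mean "more OOD".
    """
    mean = reference.mean(axis=0)
    # Small ridge term keeps the covariance matrix invertible.
    cov = np.cov(reference, rowvar=False) + 1e-6 * np.eye(reference.shape[1])
    inv_cov = np.linalg.inv(cov)
    diff = candidates - mean
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))


def kmeans(points, k, iters=50, seed=0):
    """Plain k-means (Lloyd's algorithm); returns one cluster label per point."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels


def find_uncovered_clusters(reference, candidates, k, quantile=0.99):
    """Flag candidates scoring above a reference quantile, then cluster them."""
    threshold = np.quantile(mahalanobis_scores(reference, reference), quantile)
    mask = mahalanobis_scores(reference, candidates) > threshold
    labels = kmeans(candidates[mask], k)
    return mask, labels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for sentence-pair embeddings (not real SNLI data).
    covered = rng.normal(size=(500, 8))
    mixed = np.vstack([
        rng.normal(size=(100, 8)),            # resembles the covered pairs
        rng.normal(loc=8.0, size=(20, 8)),    # one planted "uncovered" group
        rng.normal(loc=-8.0, size=(20, 8)),   # another planted group
    ])
    mask, labels = find_uncovered_clusters(covered, mixed, k=2)
    print("flagged:", int(mask.sum()), "cluster sizes:", np.bincount(labels))
```

In a setup like this, each resulting cluster would be inspected (manually or with generated explanations) to decide whether it corresponds to a candidate new transformation rule; that inspection step is outside the scope of the sketch.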