WWW2026

Distribution-Aligned Synthetic Text Generation via Tail-Aware Enhancement

Yuan Fan, Xiaoyuan Liu, Bo Liu, Wubing Wang, Jia Sun, Wenzhi Chen, Huaikang Fang, Lifeng Tao, Fan Mo

Abstract

Recent advances in generative AI have popularized synthetic content for training, offering a practical alternative to costly data curation while addressing privacy concerns. However, accumulating evidence shows that the indiscriminate reuse of synthetic data can induce model collapse, a degenerative process that contracts the learned distribution and erodes rare features. For instance, when models are iteratively trained on their own synthetic outputs, the upper tail of the perplexity distribution compresses substantially, with high-percentile values dropping by nearly half, a clear indicator of severe diversity loss. To counter this, we introduce DASGen, a framework for Distribution-Aligned Synthetic Text Generation via tail-aware enhancement. Our method first identifies underrepresented regions via embedding-space mining, then steers a frozen, hosted LLM with semantically structured prompts and a discriminative diversity objective to enrich tail features. Because the approach is training-free, it can be deployed directly in existing data pipelines. Extensive evaluations on the Yelp and ICLR'25 review benchmarks show that DASGen significantly outperforms competitive baselines in tail coverage (98.54% on Yelp; 92.00% on ICLR'25) while also improving downstream accuracy. Overall, DASGen provides a practical path to synthesizing distribution-aligned text by explicitly enhancing tail regions, producing synthetic corpora with broader coverage and greater diversity for more reliable long-tailed applications.
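The tail-compression symptom described above can be made concrete with a small diagnostic. The following is a minimal sketch, not the authors' implementation: it assumes per-document perplexities have already been computed for a real and a synthetic corpus, and the function names `percentile` and `tail_compression_ratio` as well as the default percentile of 95 are hypothetical choices made for illustration.

```python
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Return the q-th percentile (0 <= q <= 100) with linear interpolation."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 <= q <= 100.0:
        raise ValueError("q must be in [0, 100]")
    ordered = sorted(values)
    # Fractional index of the requested percentile in the sorted list.
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] + (ordered[hi] - ordered[lo]) * frac


def tail_compression_ratio(
    real_ppl: Sequence[float],
    synth_ppl: Sequence[float],
    q: float = 95.0,  # assumed percentile; the abstract says only "high-percentile"
) -> float:
    """Ratio of synthetic to real perplexity at percentile q.

    A value well below 1 suggests the upper tail of the synthetic
    corpus is compressed relative to the real corpus.
    """
    return percentile(synth_ppl, q) / percentile(real_ppl, q)
```

Under this sketch, a ratio near 0.5 at a high percentile would correspond to the "dropping by nearly half" pattern mentioned in the abstract, whereas a ratio near 1 would indicate that the upper tail is preserved.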