ICLR 2026

TSM-Bench: Detecting LLM-Generated Text in Real-World Wikipedia Editing Practices

Gerrit Quaremba, Elizabeth Black, Denny Vrandecic, Elena Simperl

Abstract

Automatically detecting machine-generated text (MGT) is critical to maintaining the knowledge integrity of user-generated content (UGC) platforms such as Wikipedia. Existing detection benchmarks primarily focus on generic text generation tasks (e.g., "Write an article about machine learning.").
However, editors frequently employ LLMs for specific writing tasks (e.g., summarisation).
These task-specific MGT instances tend to resemble human-written text more closely due to their constrained task formulation and contextual conditioning. In this work, we show that a range of MGT detectors struggle to identify task-specific MGT reflecting real-world editing on Wikipedia.
We introduce TSM-Bench, a multilingual, multi-generator, and multi-task benchmark for evaluating MGT detectors on common, real-world Wikipedia editing tasks.
Our findings demonstrate that (i) average detection accuracy drops by 10–40% compared to prior benchmarks, and (ii) a generalisation asymmetry exists: fine-tuning on task-specific data enables generalisation to generic data, even across domains, but not vice versa. We further show that models fine-tuned exclusively on generic MGT overfit to superficial artefacts of machine generation.
Our results suggest that, contrary to what prior benchmarks indicate, most detectors remain unreliable for automated detection in real-world contexts such as UGC platforms.
TSM-Bench therefore provides a crucial foundation for developing and evaluating future models.