ACL 2025

Large Language Models Struggle to Describe the Haystack without Human Help: A Social Science-Inspired Evaluation of Topic Models

Zongxia Li, Lorena Calvo-Bartolomé, Alexander Miserlis Hoyle, Paiheng Xu, Daniel Kofi Stephens, Juan Francisco Fung, Alden Dima, Jordan Lee Boyd-Graber

Abstract

A common use of NLP by social scientists is to understand large document collections. Recent work on data exploration and content analysis has shifted from probabilistic topic models to Large Language Models (LLMs). Yet the effectiveness of LLMs in helping users understand content in real-world applications remains underexplored. This study compares the knowledge users gain from unsupervised LLMs, supervised LLMs, and traditional topic models across two datasets. While unsupervised LLMs generate more human-readable topics, their topics are overly generic for domain-specific datasets and do not help users learn much about the documents. Adding human supervision to LLM generation improves data exploration by mitigating hallucination and over-genericity but requires greater human effort. Traditional topic models, such as Latent Dirichlet Allocation (LDA), remain effective for exploration but are less user-friendly. LLMs struggle to describe the haystack of large corpora without human help, particularly for domain-specific data, and face scaling and hallucination limitations due to context length constraints.