WWW 2026

CIDC: Cluster Identification-Guided Dual Correction for Robust Short Text Clustering

Yuhua Zhao, Zhixin Han, Xuan Li, Peiyu Xu, Hang Gao, Mengting Hu, Tiegang Gao

Abstract

The rapid growth of online short texts has made specialized analysis essential, because such texts are sparse and carry limited information. Short text clustering (STC) automatically groups unlabeled texts into meaningful clusters and supports applications such as sentiment analysis, spam filtering, and social media personalization. For massive online content, deep clustering uncovers semantic categories by measuring distances in the representation space and uses the resulting cluster assignments as pseudo-labels for self-supervised training. Aligning these pseudo-labels with the true category distribution is therefore crucial for effective training, particularly under the class imbalance and distribution skew commonly observed in web data. To address this challenge, we propose the Cluster Identification-Guided Dual Correction (CIDC) framework, which generates reliable pseudo-labels to guide deep clustering. Specifically, given cluster partitions and model-estimated class distributions, we perform Cluster Category Identification (CCI) at each training epoch to determine the most probable category for each cluster. This identification provides the foundation for two modules, Pseudo-Label Correction (PLC) and Prototype-Based Correction (PBC), which jointly improve pseudo-label reliability and representation learning. In the PLC module, samples whose model-estimated class distributions conflict with the category assigned to their cluster are corrected, improving semantic alignment within clusters. In the PBC module, representative and reliable prototypes are selected according to cluster categories and model predictions to guide training, further strengthening the discriminability of the learned representations. Extensive experiments demonstrate that CIDC consistently outperforms existing methods in clustering accuracy and mutual information, particularly in unsupervised settings characterized by class imbalance and noisy data.
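To make the three steps named above more concrete, the following is a minimal, illustrative sketch and not the paper's implementation. The abstract does not specify the exact rules, so everything below is an assumption: CCI is approximated by taking, for each cluster, the category with the largest summed model probability; PLC relabels a sample to the model's predicted category only when that prediction conflicts with the cluster category and exceeds a hypothetical confidence threshold `tau`; PBC keeps, per category, the most confident samples on which the cluster category and the model prediction agree. All function and parameter names are hypothetical.

```python
from collections import defaultdict


def argmax(values):
    """Index of the largest entry in a list of floats."""
    return max(range(len(values)), key=values.__getitem__)


def identify_cluster_categories(clusters, probs):
    """CCI (assumed rule): each cluster takes the category with the
    largest summed model probability over its members."""
    sums = {}
    for c, p in zip(clusters, probs):
        acc = sums.setdefault(c, [0.0] * len(p))
        for k, v in enumerate(p):
            acc[k] += v
    return {c: argmax(acc) for c, acc in sums.items()}


def correct_pseudo_labels(clusters, probs, cluster_to_cat, tau=0.9):
    """PLC (assumed rule): start from the cluster category and switch to
    the model's prediction only if it conflicts and is confident (>= tau)."""
    labels = []
    for c, p in zip(clusters, probs):
        cat = cluster_to_cat[c]
        pred = argmax(p)
        labels.append(pred if pred != cat and p[pred] >= tau else cat)
    return labels


def select_prototypes(clusters, probs, cluster_to_cat, per_category=1):
    """PBC (assumed rule): per category, keep the most confident samples
    whose cluster category and model prediction agree."""
    candidates = defaultdict(list)
    for i, (c, p) in enumerate(zip(clusters, probs)):
        cat = cluster_to_cat[c]
        if argmax(p) == cat:
            candidates[cat].append((p[cat], i))
    return {
        cat: [i for _, i in sorted(items, reverse=True)[:per_category]]
        for cat, items in candidates.items()
    }


# Toy data: five texts, two clusters, two categories (made up for illustration).
clusters = [0, 0, 0, 1, 1]
probs = [[0.8, 0.2], [0.7, 0.3], [0.05, 0.95], [0.1, 0.9], [0.4, 0.6]]

cluster_to_cat = identify_cluster_categories(clusters, probs)
pseudo_labels = correct_pseudo_labels(clusters, probs, cluster_to_cat, tau=0.9)
prototypes = select_prototypes(clusters, probs, cluster_to_cat, per_category=1)
```

On this toy input, the third text sits in cluster 0 but the model assigns it to category 1 with probability 0.95, so the sketch relabels it, while texts 0 and 3 are chosen as prototypes for categories 0 and 1. Because the per-cluster argmax can map several clusters to the same category, a one-to-one assignment (for example, via the Hungarian algorithm) could be substituted for the identification step; whether CIDC does so is not stated in the abstract.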