WWW2026

Multimodal Topic Discovery in Web Media via von Mises-Fisher Mixture Neural Topic Models

Dayu Guo, Zhiwen Luo, Nizar Bouguila, Wentao Fan

Abstract

Topic modeling plays a critical role in organizing and understanding large-scale web content. Neural topic models (NTMs) based on variational autoencoders (VAEs) have achieved notable success on textual data, but they remain limited in handling the multimodal nature of modern web content. Existing unimodal models and multimodal extensions often suffer from posterior collapse and fail to capture the directional semantics inherent in both text and images, resulting in incoherent topics and limited interpretability. To address these challenges, we propose MM-vNTM (MultiModal Neural Topic Model with von Mises-Fisher Mixtures), a framework for web-scale topic discovery over multimodal data. MM-vNTM takes pre-aligned cross-modal embeddings as inputs and jointly models document-level representations of the text and image modalities in a shared hyperspherical latent space. Furthermore, it defines topics as mixtures of von Mises-Fisher (vMF) distributions in the L2-normalized word embedding space, explicitly capturing directional similarity. Experiments on multimedia web datasets demonstrate that MM-vNTM consistently outperforms state-of-the-art unimodal and multimodal baselines in overall topic quality, highlighting its effectiveness for real-world web scenarios.
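
For readers unfamiliar with the vMF distribution, the standard density on the unit hypersphere is recalled below, together with a generic mixture form for a single topic. The first two lines are the textbook definition. The third line is only an illustrative assumption: the abstract does not specify how MM-vNTM parameterizes its mixtures, so the component count $K$, the weights $\pi_{tk}$, and the per-topic indexing are hypothetical notation.

```latex
% Standard vMF density for a unit vector x on S^{d-1},
% with mean direction \mu (\|\mu\| = 1) and concentration \kappa > 0.
f(x;\mu,\kappa) = C_d(\kappa)\,\exp\!\left(\kappa\,\mu^{\top}x\right),
\qquad
C_d(\kappa) = \frac{\kappa^{d/2-1}}{(2\pi)^{d/2}\, I_{d/2-1}(\kappa)}

% I_{\nu} denotes the modified Bessel function of the first kind of order \nu.

% Assumed (hypothetical) mixture form for topic t with K components:
p(x \mid t) = \sum_{k=1}^{K} \pi_{tk}\, f(x;\mu_{tk},\kappa_{tk}),
\qquad
\sum_{k=1}^{K}\pi_{tk} = 1,\quad \pi_{tk}\ge 0
```

Because both $x$ and $\mu$ have unit norm, $\mu^{\top}x$ equals the cosine of the angle between them, so each component's density depends on $x$ only through cosine similarity to its mean direction. This is the sense in which a vMF-based topic captures directional similarity in an L2-normalized embedding space.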