WWW2026

TargetMR: Learning Modality Target for Multimodal Recommendation

Gu Tang, Jinghe Wang, Jiang Bo, Ze Zhao, Jianping Zhou, Xiaoying Gan, Luoyi Fu, Xinbing Wang, Chenghu Zhou

Abstract

The rapid development of web services has led to an explosion of multimodal content, making multimodal recommender systems (MRSs) vital tools for mitigating information overload. Current MRSs have achieved remarkable progress by incorporating advanced technologies such as Graph Neural Networks (GNNs) and Large Language Models (LLMs). However, these studies still suffer from the semantic shift problem. An item's multimodal content usually contains multiple objects, including the target object (the core content of the item) and auxiliary objects (decorations of the item). Existing MRSs overlook this distinction and fail to prevent auxiliary objects from dominating the representation, which leads to biased item representations. To address this issue, we propose a model-agnostic framework, "TargetMR". Concretely, TargetMR comprises two core modules: the Object Disentangler and the Object Identifier. The Object Disentangler decouples item text and images into multiple objects via text syntactic parsing and image segmentation. The Object Identifier performs knowledge distillation based on LLMs to efficiently identify the target text object. It then identifies the target image object through cross-modal semantic evaluation. Moreover, this module refines the representation of the target image object by optimizing the semantic correlation. Owing to its model-agnostic design, TargetMR can be integrated into various backbone MRSs. Extensive experiments on three benchmark datasets show that TargetMR consistently improves the performance of five backbone MRSs, with an average improvement of 12.26%. Our code is available at https://github.com/gutang-97/TargetMR/.
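As an illustration of the cross-modal semantic evaluation step, the following minimal sketch picks the image segment whose embedding is closest to the embedding of the target text object. The abstract does not specify the scoring function, so cosine similarity over a shared text-image embedding space is an assumption here, and the names `cosine_similarity` and `select_target_image_object` are hypothetical rather than taken from the TargetMR code.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_target_image_object(
    text_target_emb: Sequence[float],
    segment_embs: Sequence[Sequence[float]],
) -> int:
    """Return the index of the image segment most similar to the target text object.

    Assumption: text and image-segment embeddings live in one shared space,
    so a plain similarity score is meaningful across modalities.
    """
    if not segment_embs:
        raise ValueError("need at least one image segment")
    scores = [cosine_similarity(text_target_emb, seg) for seg in segment_embs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy usage: three segment embeddings, one target text embedding.
text_emb = [1.0, 0.0]
segments = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
target_index = select_target_image_object(text_emb, segments)
```

In this toy case the second segment (index 1) points in nearly the same direction as the text embedding, so it is selected as the target image object; the other segments would be treated as auxiliary objects.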