EMNLP 2024

UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation

Xiangyu Zhao, Yuehan Zhang, Wenlong Zhang, Xiao-Ming Wu

Abstract

The fashion domain includes a range of real-world multimodal tasks, such as multimodal retrieval and generation. Recent advancements in AI-generated content, particularly large language models for text and diffusion models for visuals, have spurred significant research interest in applying these multimodal models to fashion. However, fashion models must also effectively handle embedding tasks, like image-to-text and text-to-image retrieval. Moreover, current unified fashion models often lack the capability for image generation. In this work, we present UniFashion, a unified framework that tackles the challenges of multimodal generation and retrieval tasks in the fashion domain, by integrating image and text generation with retrieval tasks. UniFashion unifies embedding and generative processes through the use of a diffusion model and LLM, enabling controllable and high-fidelity generation. Our model significantly outperforms previous state-of-the-art models focused on single tasks across various fashion-related challenges and can be easily adapted to manage complex vision-language tasks. This study highlights the synergistic potential between multimodal generation and retrieval, offering a promising avenue for future research in the fashion domain. The source code is available at https://github.com/xiangyu-mm/UniFashion.