CVPR 2025

RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models

Haoran Hao, Jiaming Han, Changsheng Li, Yu-Feng Li, Xiangyu Yue

Abstract

concept editing via updating the external database. To further improve generation quality and alignment with user-specific information, we design a pipeline for data collection and create a specialized dataset for personalized training of MLLMs. Based on the dataset, we train a series of MLLMs as personalized multimodal assistants. By pretraining on a large-scale dataset, RAP-MLLMs can generalize to infinite visual concepts without additional finetuning. Our models demonstrate outstanding flexibility and generation quality across a variety of tasks, such as personalized image captioning, question answering and visual recognition. The code, data and models are available at https://hoar012.github.io/RAP-Project/.

               Number of images    Data requirements for personalization               Support
Method         Positive  Negative  Caption  Description  Question-Answer  Recognition  Real-time edit  Text-only QA
Fine-tuning    n         -         Yes      Yes          No               No           ✗               ✓
MyVLM [2]      n         150       Yes      No           Yes              Yes          ✗               ✗
Yo'LLaVA [32]  n         200       No       No           Yes              Yes          ✗               ✓
RAP (Ours)     1         -         No       Yes          No               No           ✓               ✓

through vision-language alignment brings powerful multimodal LLMs (MLLMs) [12, 15, 29, 33, 45, 51, 56]. MLLMs have shown significant improvement in various tasks, such as image description and question answering, highlighting their potential as humans' assistants. However, their lack of user-specific knowledge continues to limit their effectiveness as personalized assistants in daily life.
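To make the idea of real-time concept editing through an external database more concrete, the following is a minimal sketch, not the authors' implementation. It assumes each user concept is stored as one record holding a name, a text description, and a feature vector, which matches the comparison table only loosely (one image plus a description per concept). The names `Concept`, `ConceptDatabase`, `upsert`, `retrieve`, and `build_prompt` are hypothetical, the 2-D vectors are toy stand-ins for real image features, and the prompt layout is an illustrative assumption.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Concept:
    """One user-specific concept (hypothetical record layout)."""
    name: str
    description: str
    embedding: List[float]  # toy stand-in for an image feature vector


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ConceptDatabase:
    """External store: editing a concept is a record update, not retraining."""

    def __init__(self) -> None:
        self._records: Dict[str, Concept] = {}

    def upsert(self, name: str, description: str, embedding: List[float]) -> None:
        # Adding and editing are the same operation: overwrite the record.
        self._records[name] = Concept(name, description, list(embedding))

    def remove(self, name: str) -> None:
        self._records.pop(name, None)

    def retrieve(self, query: List[float], k: int = 1) -> List[Concept]:
        # Rank stored concepts by similarity to the query features.
        ranked = sorted(
            self._records.values(),
            key=lambda c: cosine(query, c.embedding),
            reverse=True,
        )
        return ranked[:k]


def build_prompt(question: str, concepts: List[Concept]) -> str:
    """Prepend retrieved concept descriptions to the question (assumed format)."""
    context = "\n".join(f"<{c.name}>: {c.description}" for c in concepts)
    return f"{context}\n{question}" if context else question


db = ConceptDatabase()
db.upsert("my_dog", "A white corgi named Bo.", [1.0, 0.0])
db.upsert("my_mug", "A blue ceramic mug.", [0.0, 1.0])

query = [0.9, 0.1]  # pretend features of a new photo of the dog
print(build_prompt("What is in the image?", db.retrieve(query)))

# Editing the concept takes effect on the very next query.
db.upsert("my_dog", "A white corgi named Bo who wears a red collar.", [1.0, 0.0])
print(build_prompt("What is in the image?", db.retrieve(query)))
```

In this sketch the model weights never change: the second prompt differs from the first only because the stored description was overwritten, which is the sense in which an edit is "real-time".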