ACL 2024

Aligning Large Multimodal Models with Factually Augmented RLHF

Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liangyan Gui, Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, Trevor Darrell

Abstract

Large Multimodal Models (LMMs) are built across modalities, and misalignment between two modalities can result in "hallucination": generating textual outputs that are not grounded in the multimodal information in context. To address this multimodal misalignment issue, we adapt Reinforcement Learning from Human Feedback (RLHF) from the text domain to the task of vision-language alignment, where human annotators are asked to compare two responses and pinpoint the more hallucinated one, and the vision-language model is trained to maximize the simulated human rewards. We propose a new alignment algorithm called Factually Augmented RLHF that augments the reward model with additional factual information such as image captions and ground-truth multi-choice options, which alleviates the reward hacking phenomenon in RLHF and further improves performance. We also enhance the GPT-4-generated training data (for vision instruction tuning) with previously available human-written image-text pairs to improve the general capabilities of our model. To evaluate the proposed approach in real-world scenarios, we develop a new evaluation benchmark, MMHAL-BENCH, with a special focus on penalizing hallucinations. As the first LMM trained with RLHF, our approach achieves remarkable improvement on the LLaVA-Bench dataset, reaching the 96% performance level of the text-only GPT-4 (while the previous best methods only achieve the 87% level), and an improvement of 60% on MMHAL-BENCH over other baselines. We open-source our code, model, and data at https://llava-rlhf.github.io.

[Figure: qualitative examples comparing model responses]

Question: Where is this photo taken?
Ground Truth: From the slogan "Welcome to Houston, Texas", we can infer that this photo is taken at an airport in Houston, Texas.
LLaVA: This photo is taken at an airport.
LLaVA-SFT+: This photo is taken at the Houston airport.
LLaVA-RLHF: This photo is taken in the baggage claim area of an airport, specifically in the lobby of the George Bush Intercontinental Airport in Houston, Texas. The large sign in the background indicates the airport's name and location. The presence of multiple people, handbags, and suitcases in the scene further supports the fact that this is indeed an airport setting.

Question: What is the gender of the person riding the motorcycle in the image?
Ground Truth: No one is riding the motorcycle in the image.

[Figure: diagram with panels labeled Human, LMM-SFT, LMM-RLHF, Captions, RM Data, PPO Data, Sampled Output, and Reward Score. Text fragments recovered from the panels: "This image shows the menu of a coffee shop called Roly's Café."; captions "Javier's Tacos, Mexican Fast Food, Open 24 hours"; sampled output "A: American Fast Food" with the annotations "[The sign is not very clear, so perhaps]" and "[The RL model's output is clearly contradictory to the image captions]", Reward Score: 0.0; "A: The sleeping environment on the couch provides the cat with a comfortable and cozy space to rest."; "(A) A: The cat is resting on a black couch with its front paws tucked under its chest."]
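The abstract describes two ingredients: a reward model trained from pairwise human comparisons, and the augmentation of that reward model's input with factual information such as image captions and ground-truth multi-choice options. The sketch below illustrates both ideas in plain Python. It is not the authors' implementation: the prompt layout in `build_reward_input` is a hypothetical template, and the Bradley-Terry style loss in `pairwise_preference_loss` is a common choice for RLHF reward models that is assumed here, since this section does not specify the training objective.

```python
import math
from typing import Optional, Sequence


def build_reward_input(
    question: str,
    response: str,
    captions: Sequence[str] = (),
    choices: Optional[Sequence[str]] = None,
) -> str:
    """Assemble the text a reward model would score.

    Hypothetical template (an assumption, not the paper's format): factual
    context (captions, multi-choice options) is placed before the question
    and the candidate response, so the reward model can check the response
    against it.
    """
    parts = []
    if captions:
        parts.append("Captions:")
        parts.extend(f"- {c}" for c in captions)
    if choices:
        parts.append("Options:")
        parts.extend(f"({chr(ord('A') + i)}) {c}" for i, c in enumerate(choices))
    parts.append(f"Question: {question}")
    parts.append(f"Response: {response}")
    return "\n".join(parts)


def pairwise_preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(r_preferred - r_rejected)).

    Assumed objective. The preferred response is the one annotators judged
    to be less hallucinated. Computed as softplus(-margin) in a numerically
    stable form.
    """
    margin = reward_preferred - reward_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    text = build_reward_input(
        question="Where is this photo taken?",
        response="This photo is taken at an airport.",
        captions=['A sign reads "Welcome to Houston, Texas".'],
    )
    print(text)
    # The loss shrinks as the preferred response is scored higher.
    print(pairwise_preference_loss(2.0, 0.0) < pairwise_preference_loss(0.0, 2.0))
```

In this sketch the factual context is ordinary text prepended to the reward model's input, so no architectural change is implied. Whether the actual system injects captions this way is an assumption.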