This paper explores preference distillation for large vision language models (LVLMs), improving their ability to generate helpful and faithful responses anchored to the visual context. We first build a vision-language feedback (VLFeedback) dataset utilizing AI annotation. Specifically, responses are generated by models sampled from a pool of 12 LVLMs, conditioned on multi-modal instructions sourced from various datasets. We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations. Furthermore, the preference supervision is distilled into Qwen-VL-Chat through the direct preference optimization (DPO) method. The resulting model, Silkie, achieves 6.9% and 9.5% relative improvements on the MME benchmark regarding the perception and cognition capabilities, respectively. Silkie also demonstrates reduced hallucination, setting a new state-of-the-art score of 3.02 on the MMHal-Bench benchmark. Further analysis shows that DPO with our VLFeedback dataset mainly boosts the fine-grained perception and complex cognition abilities of LVLMs, and brings more comprehensive improvements compared to human-annotated preference datasets.
We sample multi-modal instructions from various sources, covering different capabilities of LVLMs. We further build a model pool consisting of 12 LVLMs.
We further use GPT-4V as the annotator to assess the quality of each response regarding helpfulness, visual faithfulness, and ethical considerations.
We improve Qwen-VL-Chat by performing DPO on our VLFeedback dataset, using the efficient LoRA tuning method. After DPO training, the resulting model, Silkie, achieves promising results compared with other models that use similar-sized LLMs as the backbone.
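For reference, the core of the DPO objective used in this kind of preference distillation can be sketched as follows. This is a minimal, illustrative implementation over scalar summed log-probabilities of a single preference pair; the function name, scalar interface, and default `beta` are our assumptions, not the paper's exact training code.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are the summed token log-probabilities of the chosen
    (preferred) and rejected responses under the trained policy and
    the frozen reference model. beta scales the implicit reward.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # Loss is -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the policy assigns relatively higher probability to the GPT-4V-preferred response than the reference model does, which is what pushes the distilled model toward the annotated preferences.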
@article{2023vlfeedback,
author = {Lei Li and Zhihui Xie and Mukai Li and Shunian Chen and Peiyi Wang and Liang Chen and Yazheng Yang and Benyou Wang and Lingpeng Kong},
title = {Silkie: Preference Distillation for Large Visual Language Models},
journal = {arXiv preprint arXiv:2312.10665},
year = {2023}
}
This website is adapted from Nerfies and LLaVA-RLHF, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the authors of the multi-modal instruction tuning datasets and open-source projects, including LLaVA, LLaVA-RLHF and Qwen-VL. We also thank Runxin Xu for his great help on the project.
Usage and License Notices: The data, code and checkpoint are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of Qwen-VL and GPT-4. The dataset is CC BY NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.
Related Links: [LLaVA] [LLaVA-RLHF] [Qwen-VL]