VLFeedback and Silkie

Preference Distillation for Large Visual Language Models

The University of Hong Kong
The Chinese University of Hong Kong (Shenzhen)
Peking University
  *Equal Contribution

VLFeedback is the first open-source GPT-4V-annotated vision-language preference dataset, covering 80k instructions sampled from various sources, with responses decoded from 12 large vision-language models (LVLMs) such as GPT-4V, the LLaVA series, and Qwen-VL.
Based on Qwen-VL-Chat, we present Silkie, obtained by performing DPO on our VLFeedback dataset. Compared with the original model, Silkie achieves 6.9% and 9.5% relative improvements on the MME benchmark in perception and cognition capabilities, respectively. In addition, Silkie sets a new state-of-the-art score of 3.02 on MMHal-Bench for hallucination evaluation.


This paper explores preference distillation for large vision-language models (LVLMs), improving their ability to generate helpful and faithful responses anchored in the visual context. We first build a vision-language feedback (VLFeedback) dataset using AI annotation. Specifically, responses are generated by models sampled from a pool of 12 LVLMs, conditioned on multi-modal instructions sourced from various datasets. We adopt GPT-4V to assess the generated outputs with respect to helpfulness, visual faithfulness, and ethical considerations. The preference supervision is then distilled into Qwen-VL-Chat through the direct preference optimization (DPO) method. The resulting model, Silkie, achieves 6.9% and 9.5% relative improvements on the MME benchmark in perception and cognition capabilities, respectively. Silkie also demonstrates reduced hallucination, setting a new state-of-the-art score of 3.02 on the MMHal-Bench benchmark. Further analysis shows that DPO with our VLFeedback dataset mainly boosts the fine-grained perception and complex cognition abilities of LVLMs, and brings more comprehensive improvements than human-annotated preference datasets.

Multimodal Instructions and AI Preference Data

We sample multi-modal instructions from various sources, covering different capabilities of LVLMs. We further build a model pool consisting of 12 LVLMs.
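As a rough illustration of decoding responses from models sampled out of the pool, the sketch below assigns a random subset of models to each instruction. The model names, the helper `assign_models`, and the subset size `k=4` are hypothetical placeholders chosen for the example; this page does not state how many models respond to each instruction.

```python
import random

# Placeholder names standing in for the 12 LVLMs in the pool (hypothetical).
MODEL_POOL = [f"lvlm_{i}" for i in range(12)]

def assign_models(instructions, pool, k=4, seed=0):
    """Pick k distinct models from the pool for every instruction.

    k=4 is an assumption for illustration only.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    return {inst: rng.sample(pool, k) for inst in instructions}

assignment = assign_models(["describe the image", "read the sign"], MODEL_POOL)
```

Each selected model would then decode one response for its instruction, yielding several candidate responses to compare.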

We then use GPT-4V as the annotator to assess the quality of each response with respect to helpfulness, visual faithfulness, and ethical considerations.
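One way such per-aspect ratings could be turned into preference pairs is sketched below. It assumes each response carries a numeric score for every aspect, averages them into an overall score, and pairs a higher-scored response (chosen) with a lower-scored one (rejected). The field names, the averaging rule, and the tie handling are assumptions for illustration, not a description of the released dataset format.

```python
from itertools import combinations

def build_preference_pairs(responses):
    """Build (chosen, rejected) pairs from annotated responses.

    responses: list of dicts with 'text' and 'scores' (aspect -> number).
    Assumption: the overall score is the plain mean of the aspect scores.
    """
    def overall(r):
        return sum(r["scores"].values()) / len(r["scores"])

    pairs = []
    for a, b in combinations(responses, 2):
        score_a, score_b = overall(a), overall(b)
        if score_a == score_b:
            continue  # ties carry no preference signal, so skip them
        chosen, rejected = (a, b) if score_a > score_b else (b, a)
        pairs.append({"chosen": chosen["text"], "rejected": rejected["text"]})
    return pairs

# Toy annotations (made-up scores, not taken from VLFeedback).
toy = [
    {"text": "A", "scores": {"helpfulness": 5, "visual_faithfulness": 5, "ethics": 5}},
    {"text": "B", "scores": {"helpfulness": 3, "visual_faithfulness": 4, "ethics": 5}},
    {"text": "C", "scores": {"helpfulness": 4, "visual_faithfulness": 4, "ethics": 4}},
]
pairs = build_preference_pairs(toy)
```

In this toy case B and C tie on the mean score, so only the two pairs that prefer A are kept.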

Silkie: A Better Aligned LVLM

We improve Qwen-VL-Chat by performing DPO on our VLFeedback dataset, using the parameter-efficient LoRA tuning method. After DPO training, the resulting model, Silkie, achieves promising results compared with other models that use similar-sized LLMs as the backbone.
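For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. It takes the summed log-probabilities of the chosen and rejected responses under the trained policy and under a frozen reference model. The function name and the default `beta=0.1` are assumptions for the example; this is not the project's training code, which also involves LoRA adapters and batched tensors.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    beta=0.1 is an illustrative default, not a value from this page.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# When the policy equals the reference, the loss is log(2).
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# When the policy favors the chosen response more than the reference does,
# the loss drops below log(2).
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

Minimizing this loss pushes the policy to raise the likelihood of the chosen response relative to the rejected one, while `beta` controls how far it may drift from the reference model.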

(Left) In-depth analysis on the MME benchmark for the performance improvements. Our VLFeedback dataset brings clearer gains in OCR recognition and code reasoning tasks.
(Right) Relative performance improvement by performing DPO with RLHF-V preference data and a subset of our VLFeedback dataset. Our GPT-4V annotated preference dataset brings more consistent improvements on four benchmarks.

Comparison Examples

Silkie locates the wooden stools with a red flower without making misleading assertions (Left), and correctly answers the science-related question (Right), exhibiting better perception and cognition capabilities.

On a challenging query asking the model to generate a report for a diagram of the weather forecasting process, Silkie generates a well-structured report that satisfies the word-count requirement.


    author      = {Lei Li and Zhihui Xie and Mukai Li and Shunian Chen and Peiyi Wang and Liang Chen and  Yazheng Yang and  Benyou Wang and  Lingpeng Kong},
    title       = {Silkie: Preference Distillation for Large Visual Language Models},
    publisher   = {arXiv:2312.10665},
    year        = {2023}


This website is adapted from Nerfies and LLaVA-RLHF, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the authors of the multi-modal instruction tuning datasets and open-source projects, including LLaVA, LLaVA-RLHF and Qwen-VL. We would also like to thank Runxin Xu for his great help on the project.

Usage and License Notices: The data, code and checkpoint are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of Qwen-VL and GPT-4. The dataset is licensed under CC BY NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.

Related Links: [LLaVA] [LLaVA-RLHF] [Qwen-VL]