About LLaVA

LLaVA is a novel end-to-end trained large multimodal model that combines a vision encoder and Vicuna for general-purpose visual and language understanding, achieving impressive chat capabilities in the spirit of the multimodal GPT-4 and setting a new state-of-the-art accuracy on Science QA.

Abstract

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks in the language domain, but the idea is less explored in the multimodal field.

  1. Multimodal Instruct Data.

We present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data.
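As a rough illustration of this idea (not the authors' exact prompts or released pipeline), a language-only GPT-4 can be given a purely textual rendering of an image, such as its captions and object bounding-box coordinates, and asked to write a conversation about it. In the sketch below, the prompt wording, the `call_gpt4` placeholder, and the sample fields are assumptions for illustration.

```python
# Minimal sketch of text-only visual instruction data generation.
# The prompt wording, the call_gpt4 placeholder, and the example fields
# are illustrative assumptions, not the authors' released pipeline.

def build_prompt(captions, boxes):
    """Render an image as text: captions plus labeled bounding boxes."""
    caption_block = "\n".join(captions)
    box_block = "\n".join(
        f"{label}: [{x1:.3f}, {y1:.3f}, {x2:.3f}, {y2:.3f}]"
        for label, (x1, y1, x2, y2) in boxes
    )
    return (
        "You are an AI visual assistant. You cannot see the image, but you are "
        "given its captions and object locations (normalized coordinates).\n\n"
        f"Captions:\n{caption_block}\n\nObjects:\n{box_block}\n\n"
        "Generate a multi-turn conversation between a user asking about the "
        "image and an assistant answering as if it could see the image."
    )


def call_gpt4(prompt: str) -> str:
    """Placeholder for a call to a language-only GPT-4 endpoint."""
    raise NotImplementedError("Plug in your own GPT-4 client here.")


if __name__ == "__main__":
    captions = ["A group of people standing around a food truck."]
    boxes = [("person", (0.10, 0.20, 0.35, 0.90)),
             ("truck", (0.40, 0.15, 0.95, 0.80))]
    # Print the textual "image" that would be sent to the language-only model.
    print(build_prompt(captions, boxes))
```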

  2. LLaVA Model.

We introduce LLaVA (Large Language-and-Vision Assistant), an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding.
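At a high level, the model wires a vision encoder into the LLM through a learned projection that maps image features into the LLM's word-embedding space. The PyTorch module below is a minimal sketch of that connection under assumed dimensions; the class name, variable names, and sizes are ours for illustration, not the repository's actual implementation.

```python
import torch
import torch.nn as nn


class VisionLanguageConnector(nn.Module):
    """Sketch: project vision-encoder patch features into the LLM's
    token-embedding space and prepend them to the text embeddings.

    Dimensions are assumptions for illustration (e.g. vision features of
    size 1024 and an LLM hidden size of 4096); see the official repository
    for the real code.
    """

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # A projection layer maps visual features to "visual tokens".
        self.projector = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_features: torch.Tensor,
                text_embeddings: torch.Tensor) -> torch.Tensor:
        # image_features:  (batch, num_patches, vision_dim) from a frozen
        #                  vision encoder.
        # text_embeddings: (batch, seq_len, llm_dim) from the LLM's
        #                  embedding table for the tokenized instruction.
        visual_tokens = self.projector(image_features)
        # The concatenated sequence is fed to the LLM as input embeddings.
        return torch.cat([visual_tokens, text_embeddings], dim=1)


if __name__ == "__main__":
    connector = VisionLanguageConnector()
    img = torch.randn(1, 256, 1024)   # 256 image patches
    txt = torch.randn(1, 32, 4096)    # 32 instruction tokens
    print(connector(img, txt).shape)  # torch.Size([1, 288, 4096])
```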

  3. Performance.

Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%.

  4. Open-source.

We make the GPT-4-generated visual instruction-tuning data, our model, and our code base publicly available.

Visit Official Website

https://llava-vl.github.io/

Reviews
shadow die twice
LLaVA demonstrates a very promising approach, inspiring everyone to reproduce and even surpass GPT-4's multimodal capabilities.
Moonwalker
Could you share details on the training GPUs (batch size, steps, etc.) and training time?
Community Posts
zkyoplan
LLaVA
LLaVA looks like GPT-4 now.