About LLaVA-Med

LLaVA-Med was initialized from the general-domain LLaVA and then continually trained with a curriculum: first biomedical concept alignment, then full instruction tuning. We evaluated LLaVA-Med on standard visual conversation and question-answering tasks.

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna, and GPT-4. The dataset is licensed under CC BY-NC 4.0 (non-commercial use only), and models trained on the dataset should not be used outside of research purposes.

LLaVA-Med Dataset


Data statistics of the biomedical multimodal instruction-following data: (a, b) The root verb-noun pairs of instructions and responses, where the inner circle of the plot represents the root verb of the output response and the outer circle represents the direct nouns. (c) The distribution of images and QA pairs across the five domains; one image is shown per domain.

LLaVA-Med Performance


Performance comparison of multimodal chat instruction-following abilities, measured by the relative score from language-only GPT-4 evaluation.
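One plausible way to read "relative score" is as a model's average judge rating expressed as a percentage of the average rating given to the language-only GPT-4 reference answers. The sketch below illustrates that reading only. The function name `relative_score`, the 1-10 rating scale, and the numbers are hypothetical and are not taken from the LLaVA-Med evaluation code or results.

```python
from statistics import mean


def relative_score(candidate_scores, reference_scores):
    """Return the candidate's mean rating as a percentage of the reference's mean rating.

    Assumption: both lists hold numeric judge ratings for the same questions,
    in the same order.
    """
    if len(candidate_scores) != len(reference_scores):
        raise ValueError("need one candidate and one reference score per question")
    if not reference_scores:
        raise ValueError("no scores given")
    ref_mean = mean(reference_scores)
    if ref_mean == 0:
        raise ValueError("reference mean is zero")
    return 100.0 * mean(candidate_scores) / ref_mean


# Made-up judge ratings on a 1-10 scale, for illustration only.
candidate = [6, 5, 7, 4]
reference = [9, 8, 9, 8]
print(f"{relative_score(candidate, reference):.1f}")  # prints 64.7
```

Under this reading, a score of 100 would mean the model's answers were rated on par with the reference answers on average.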


Example 1: comparison of medical visual chat. The language-only GPT-4 is treated as the performance upper bound, because the gold captions and inline mentions are fed to GPT-4 as context, so it does not need to understand the raw image.


Example 2: comparison of medical visual chat. LLaVA tends to hallucinate or to refuse to provide domain-specific, knowledgeable responses.


Performance comparison of fine-tuned LLaVA-Med on established medical VQA datasets.



Community Posts
Hayo News
Recently, Microsoft researchers demonstrated a model called LLaVA-Med, aimed mainly at biomedical research. With the help of this model, a patient's pathological condition can be inferred from CT scans, X-ray images, and similar inputs.
To train this model, the Microsoft researchers cooperated with a number of hospitals to obtain large-scale datasets of biomedical image-text pairs, covering a fairly comprehensive range that includes chest X-ray, MRI, histology, pathology, and CT images.
The resulting model is reported to have "excellent multimodal dialogue ability", and "LLaVA-Med is ahead of other advanced models in the industry in some indicators on three standard biomedical data sets used to answer visual questions".