About VisualGLM-6B

VisualGLM-6B is an open-source, multi-modal dialog language model that supports images, Chinese, and English. The language model is based on ChatGLM-6B, with 6.2 billion parameters; the image part builds a bridge between the visual model and the language model by training BLIP2-Qformer, bringing the total model to 7.8 billion parameters.

VisualGLM-6B is pre-trained on 30M high-quality Chinese image-text pairs from the CogView dataset and 300M filtered English image-text pairs, with Chinese and English weighted equally. This training approach better aligns visual information with the semantic space of ChatGLM; in the subsequent fine-tuning stage, the model is trained on long visual question-answering data to generate answers that match human preferences.

VisualGLM-6B is trained with the SwissArmyTransformer (sat for short) library, a toolkit that supports flexible modification and training of Transformers and efficient fine-tuning methods such as LoRA and P-tuning. This project provides a Huggingface interface consistent with user habits, as well as a sat-based interface.

Combined with model quantization technology, users can deploy the model locally on consumer-grade graphics cards (a minimum of 8.7GB of GPU memory is required at the INT4 quantization level).
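As a rough back-of-envelope illustration of where these figures come from (the 1.6B visual split below is inferred from the 7.8B total minus the 6.2B language model; real usage adds activations, KV cache, and framework overhead on top of the weights):

```python
# Back-of-envelope weight-memory estimate in GB. Treat these as lower bounds:
# actual GPU usage also includes activations and framework overhead.
def weight_gb(params: float, bytes_per_param: float) -> float:
    return params * bytes_per_param / 1024**3

chatglm_params = 6.2e9   # language model part
visual_params = 1.6e9    # assumed visual part (7.8B total - 6.2B ChatGLM)

# FP16 stores 2 bytes per parameter; INT4 stores 0.5 bytes per parameter.
fp16_total = weight_gb(chatglm_params + visual_params, 2)
int4_mixed = weight_gb(chatglm_params, 0.5) + weight_gb(visual_params, 2)

print(f"FP16 weights: ~{fp16_total:.1f} GB")            # roughly 14.5 GB
print(f"INT4 ChatGLM + FP16 visual: ~{int4_mixed:.1f} GB weights")
```

This makes it plausible why FP16 inference needs about 15GB while INT4 (which quantizes only the ChatGLM part) fits in under 9GB once overhead is included.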


The open-source VisualGLM-6B model aims to advance large-model technology together with the open-source community. Developers and users are urged to abide by the open-source license and not to use the open-source models, code, or derivatives of this project for any purpose that may harm the country or society, or for any service that has not undergone a security assessment and filing. At present, this project has not officially developed any applications based on VisualGLM-6B, including websites, Android apps, Apple iOS applications, or Windows apps.

Because VisualGLM-6B is still at version v1, it is currently known to have quite a few limitations, such as factuality/hallucination problems in image descriptions, insufficient capture of image detail, and some limitations inherited from the language model. Although the model strives to ensure the compliance and accuracy of the data at every stage of training, the accuracy of its output cannot be guaranteed due to the small size of the VisualGLM-6B model and its susceptibility to probabilistic randomness, and the model is easily misled (see the Limitations section for details). Later versions of VisualGLM will work to optimize these problems. This project does not assume the risks or responsibilities of data security or public-opinion risks caused by the open-source model and code, nor any risks or responsibilities arising from the model being misled, misused, disseminated, or improperly exploited.


VisualGLM-6B can answer questions about knowledge related to an image's description, and can also combine common sense or put forward interesting points of view.

Friendly Links

  • XrayGLM is an X-ray diagnostic question-answering project based on VisualGLM-6B, fine-tuned on an X-ray diagnosis dataset; it can answer medical questions based on X-ray images.


  • StarGLM is a project based on ChatGLM/VisualGLM-6B fine-tuned on astronomical datasets; it can answer questions related to the light curves of variable stars.



Model inference

Use pip to install the dependencies:

```
pip install -i https://pypi.org/simple -r requirements.txt
```

Users in mainland China can use the Aliyun mirror instead:

```
pip install -i https://mirrors.aliyun.com/pypi/simple/ -r requirements.txt
```


This installs the deepspeed library (which supports sat training) by default. This library is not necessary for model inference, and some Windows environments run into problems when installing it. To bypass the deepspeed installation, change the commands to:

```
pip install -i https://mirrors.aliyun.com/pypi/simple/ -r requirements_wo_ds.txt
pip install -i https://mirrors.aliyun.com/pypi/simple/ --no-deps "SwissArmyTransformer>=0.4.4"
```

To call the model with the Huggingface transformers library (the dependencies above are also required!), you can use the following code (where the image path is a local path):

```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True).half().cuda()
image_path = "your image path"
response, history = model.chat(tokenizer, image_path, "Describe this image.", history=[])
print(response)
response, history = model.chat(tokenizer, image_path, "Where is this picture probably taken?", history=history)
print(response)
```

The code above will automatically download the model implementation and parameters via transformers. The full model implementation can be found on the Hugging Face Hub. If downloading the model parameters from the Hugging Face Hub is slow, you can manually download the model parameter files from here and load the model locally; for details, see Loading Models from Local. For quantization of the transformers-based model, CPU inference, Mac MPS backend acceleration, and so on, please refer to the low-cost deployment of ChatGLM-6B.

If you use the SwissArmyTransformer library to call the model, the method is similar, and you can use the environment variable SAT_HOME to determine the download location of the model. In this repository directory:

```python
import argparse
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
from model import chat, VisualGLMModel
model, model_args = VisualGLMModel.from_pretrained('visualglm-6b', args=argparse.Namespace(fp16=True, skip_init=True))
from sat.model.mixins import CachedAutoregressiveMixin
model.add_mixin('auto-regressive', CachedAutoregressiveMixin())
image_path = "your image path or URL"
response, history, cache_image = chat(image_path, model, tokenizer, "Describe this picture.", history=[])
print(response)
response, history, cache_image = chat(None, model, tokenizer, "Where is this picture probably taken?", history=history, image=cache_image)
print(response)
```

The sat library also makes it easy to perform parameter-efficient fine-tuning.

Model fine-tuning

Multi-modal tasks are widely distributed and of many types, and pre-training often cannot cover everything. Here we provide an example of few-shot fine-tuning, using 20 annotated images to enhance the model's ability to answer questions about the "background" of an image.

Unzip fewshot-data.zip and run the following command:

```
bash finetune/finetune_visualglm.sh
```

Three fine-tuning methods are currently supported:

  • LoRA: In the sample, LoRA fine-tuning with rank=10 is added to layers 0 and 14 of the ChatGLM model; the --layer_range and --lora_rank parameters can be adjusted according to the specific scenario and data volume.

  • QLoRA: If resources are limited, consider using bash finetune/finetune_visualglm_qlora.sh. QLoRA quantizes the linear layers of ChatGLM to 4 bits, so fine-tuning needs only 9.8GB of GPU memory.

  • P-tuning: You can replace --use_lora with --use_ptuning, but this is not recommended unless the model's application scenario is very fixed.

After training, you can use the following command to infer:

```
python cli_demo.py --from_pretrained your_checkpoint_path --prompt_zh 这张图片的背景里有什么内容?
```


Comparison of effects before and after fine-tuning

If you want to merge the LoRA parameters into the original weights, call merge_lora(), for example:

```python
from finetune_visualglm import FineTuneVisualGLMModel
import argparse

model, args = FineTuneVisualGLMModel.from_pretrained(
    'checkpoints/finetune-visualglm-6b-05-19-07-36',
    args=argparse.Namespace(
        fp16=True,
        skip_init=True,
        use_gpu_initialization=True,
        device='cuda',
    ))
model.get_mixin('lora').merge_lora()
args.layer_range = []
args.save = 'merge_lora'
args.mode = 'inference'
from sat.training.model_io import save_checkpoint
save_checkpoint(1, model, None, None, args)
```

Fine-tuning requires installing the deepspeed library; currently this process only supports Linux systems. More sample descriptions and instructions for Windows systems will be completed in the near future.

Deployment tools

Command line Demo

python cli_demo.py

The program will automatically download the sat model and start an interactive dialogue on the command line. Enter an instruction and press Enter to generate a reply; enter clear to clear the conversation history, and enter stop to terminate the program.

The cli_demo program provides the following hyperparameters to control the generation process and quantization accuracy:

```
usage: cli_demo.py [-h] [--max_length MAX_LENGTH] [--top_p TOP_P] [--top_k TOP_K]
                   [--temperature TEMPERATURE] [--english] [--quant {8,4}]

optional arguments:
  -h, --help            show this help message and exit
  --max_length MAX_LENGTH
                        max length of the total sequence
  --top_p TOP_P         top p for nucleus sampling
  --top_k TOP_K         top k for top k sampling
  --temperature TEMPERATURE
                        temperature for sampling
  --english             only output English
  --quant {8,4}         quantization bits
```
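For intuition about the --top_p flag, here is a toy sketch of nucleus (top-p) filtering over a small probability distribution. This illustrates the general technique only; it is not the sampler actually used in cli_demo.py:

```python
def nucleus_filter(probs, top_p):
    """Keep the smallest set of token indices whose cumulative probability
    (taken in descending order) reaches top_p; the rest are never sampled."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    return sorted(kept)

print(nucleus_filter([0.5, 0.3, 0.1, 0.1], 0.75))  # → [0, 1]
```

A lower top_p restricts sampling to fewer, higher-probability tokens, making output more deterministic; temperature similarly flattens or sharpens the distribution before this filtering.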

Note that during training, the prompt for English question answering is Q: A:, while for Chinese it is 问:答:. The Chinese prompt is used in the web demo, so English replies will be worse and mixed with Chinese; if you need replies in English, please use the --english option of cli_demo.py.
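The two templates described above can be sketched as follows (a reconstruction for illustration only, not the repository's actual prompt-building code; the exact spacing may differ):

```python
def build_prompt(question: str, english: bool = False) -> str:
    """Wrap a question in the QA template described above."""
    if english:
        return f"Q:{question} A:"    # English template
    return f"问:{question} 答:"      # Chinese template

print(build_prompt("Describe this image.", english=True))
```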

We also provide a typewriter-effect command-line tool inherited from ChatGLM-6B, which uses the Huggingface model:

python cli_demo_hf.py

We also support model-parallel multi-GPU deployment (you need to update to the latest version of sat; if you downloaded a checkpoint earlier, you also need to delete it manually and download it again):

torchrun --nnode 1 --nproc-per-node 2 cli_demo_mp.py


Web version Demo


We provide a Gradio-based web demo. First install Gradio: pip install gradio. Then download and enter this repository and run web_demo.py:

```
git clone https://github.com/THUDM/VisualGLM-6B
cd VisualGLM-6B
python web_demo.py
```

The program will automatically download the sat model, start a web server, and print its address; open the printed address in a browser to use the demo.

We also provide a typewriter-effect web demo inherited from ChatGLM-6B. This tool uses the Huggingface model and will run on port :8080 after startup:

python web_demo_hf.py

Both web demos accept the command-line parameter --share to generate a public Gradio link, and accept --quant 4 or --quant 8 to use 4-bit or 8-bit quantization to reduce GPU memory usage.

API deployment

First install the additional dependencies with pip install fastapi uvicorn, then run api.py in the repository:

python api.py

The program will automatically download the sat model. By default, it is deployed on local port 8080 and invoked via the POST method. Below is an example request using curl; in general, you can also make the POST request from code.

```
echo "{\"image\":\"$(base64 path/to/example.jpg)\",\"text\":\"Describe this image\",\"history\":[]}" > temp.json
curl -X POST -H "Content-Type: application/json" -d @temp.json http://127.0.0.1:8080
```
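The same request can be made from Python. A minimal sketch using only the standard library (the URL assumes the default local port 8080 mentioned above; the payload fields mirror the curl example):

```python
import base64
import json
import urllib.request

API_URL = "http://127.0.0.1:8080"  # default port used by api.py

def build_payload(image_bytes: bytes, text: str, history=None) -> dict:
    """Encode the image as base64 and assemble the JSON body the API expects."""
    return {
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "text": text,
        "history": history or [],
    }

def query(image_path: str, text: str, history=None) -> dict:
    """POST a request to the running api.py server and return the parsed reply."""
    with open(image_path, "rb") as f:
        payload = build_payload(f.read(), text, history)
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example (requires the server to be running):
# result = query("path/to/example.jpg", "Describe this image")
# print(result["response"])
```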

The return value obtained is

```
{
  "response": "This picture shows a cute cartoon alpaca standing on a transparent background. The alpaca has fluffy ears, large eyes, and a white body with brown spots.",
  "history": [["Describe this image", "This picture shows a cute cartoon alpaca standing on a transparent background. The alpaca has fluffy ears, large eyes, and a white body with brown spots."]],
  "status": 200,
  "time": "2023-05-16 20:20:10"
}
```


To use the API with the Huggingface model instead, run:

python api_hf.py


In the Huggingface implementation, the model is loaded at FP16 precision by default, and running the code above requires about 15GB of GPU memory. If your GPU memory is limited, you can try loading the model in quantized mode, as follows:

Modify as needed; currently only 4/8-bit quantization is supported. The following quantizes only ChatGLM, since quantizing the ViT introduces large errors:

model = AutoModel.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True).quantize(8).half().cuda()

In the sat implementation, you need to first pass an argument to change the loading location to cpu, and then perform quantization. Do it as follows; see cli_demo.py for details:

```python
from sat.quantization.kernels import quantize
model = quantize(model.transformer, args.quant).cuda()
# Specify model.transformer to quantize only ChatGLM; the error of ViT quantization is large
```
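For intuition about why quantization saves memory at a small cost in accuracy, here is a toy absmax int8 quantizer. This only illustrates the general idea; it is not the sat CUDA kernel:

```python
def quantize_int8(weights):
    """Absmax quantization: map floats to int8 range [-127, 127] with one shared scale."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize_int8(q, scale):
    """Recover approximate floats; the rounding error is the quantization loss."""
    return [x * scale for x in q]

q, scale = quantize_int8([0.5, -1.0, 0.25])
print(q, [round(x, 4) for x in dequantize_int8(q, scale)])
```

Each weight now occupies 1 byte instead of 2 (FP16) or 4 (FP32), which is where the memory savings of --quant 8 come from; 4-bit packing halves that again.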


This project is at the v1 stage, and the parameters and compute of both the visual and language models are relatively small. We have summarized the main directions for improvement as follows:

  • Image description factuality/hallucination issues. When generating long image descriptions, as the text gets further from the image, the language model dominates and may generate content that does not exist in the image based on the context.
  • Attribute mismatch problem. In multi-object scenes, certain attributes of some objects are often wrongly placed on other objects.
  • Resolution issue. This project uses a resolution of 224*224, which is also the most commonly used size in visual models; however, finer-grained understanding requires larger resolution and more computation.
  • Due to data and other reasons, the model currently lacks Chinese OCR ability (it has some English OCR ability); we will add this ability in subsequent versions.


