Visual-Chinese-LLaMA-Alpaca (VisualCLA) is a multimodal Chinese large model developed on top of the Chinese LLaMA & Alpaca project. VisualCLA adds an image encoding module to the Chinese LLaMA/Alpaca model so that the LLaMA model can receive visual input. On this basis, the model is pre-trained on Chinese image-text data to align image and text representations and give it basic multimodal understanding; it is further fine-tuned on multimodal instruction data to strengthen its ability to understand, execute, and hold dialogues about multimodal instructions.
This project is still under development. The current release is a preview test version, and model performance is still being optimized.
The main contents of this project:
Visual-Chinese-LLaMA-Alpaca (VisualCLA) is a Chinese multimodal model that supports both image and text input. Built on the Chinese Alpaca model, VisualCLA adds an image encoding module so that the Chinese Alpaca model can understand visual information.
VisualCLA consists of three parts: Vision Encoder, Resampler and LLM:
Vision Encoder: uses a ViT architecture to encode the input image into a sequence of image representations. The released VisualCLA model uses CLIP-ViT-L/14 as the architecture and initialization weights of the image encoder.
Resampler: a 6-layer BERT-like structure, similar in design and function to the Perceiver Resampler in Flamingo or the Q-Former in BLIP-2. It resamples the image representation through trainable query vectors, reducing its length, and a linear layer then aligns the image representation to the LLM's hidden dimension. The parameters of this part are trained from scratch.
LLM: uses the LLaMA model, initialized with Chinese-Alpaca-Plus 7B.
The image is encoded by the Vision Encoder and mapped to a fixed-length representation by the Resampler. The image and text representations are then concatenated and fed into the LLM, which generates a response based on the image and the text instruction.
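The sketch below is a minimal PyTorch illustration of this data flow: trainable query vectors resample the ViT output to a fixed length, a linear layer projects it to the LLM dimension, and the result is concatenated with the text embeddings before being fed to the LLM. The module name `SimpleResampler`, the query count, the hidden sizes, and the use of standard Transformer decoder layers are illustrative assumptions, not the project's actual implementation.

```python
# Minimal sketch of the VisualCLA-style forward pass described above.
# Names, query count, and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class SimpleResampler(nn.Module):
    """Resamples a variable-length image feature sequence to a fixed length."""

    def __init__(self, vision_dim=1024, llm_dim=4096, num_queries=64,
                 num_layers=6, num_heads=8):
        super().__init__()
        # Trainable query vectors that attend to the image features.
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(
            d_model=vision_dim, nhead=num_heads, batch_first=True
        )
        # BERT-like stack: the queries cross-attend to the ViT output.
        self.blocks = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Linear layer aligning the image representation to the LLM dimension.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_feats):                 # (B, N_patches, vision_dim)
        B = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        resampled = self.blocks(q, image_feats)     # (B, num_queries, vision_dim)
        return self.proj(resampled)                 # (B, num_queries, llm_dim)


def multimodal_forward(vision_encoder, resampler, llm, pixel_values, text_embeds):
    """Encode the image, resample to a fixed length, prepend to the text, run the LLM."""
    image_feats = vision_encoder(pixel_values).last_hidden_state   # ViT patch sequence
    image_embeds = resampler(image_feats)                          # fixed-length, LLM-dim
    inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)  # image tokens first
    return llm(inputs_embeds=inputs_embeds)
```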
**Training Strategy**

Similar to Chinese-LLaMA-Alpaca, VisualCLA uses LoRA to fine-tune the model efficiently. The trainable parameters include the LoRA parameters of the image encoder, the LoRA parameters of the LLM, and all parameters of the Resampler (refer to the model structure diagram; a sketch of this setup follows the list below). The training process is divided into two stages:

- Multimodal pre-training: the model is trained on Chinese image-text pair data to generate the corresponding text description (caption) for an image.
- Multimodal instruction fine-tuning: starting from the model obtained in the previous stage, fine-tuning is performed on a multimodal instruction dataset constructed from a variety of supervised task data. The dataset covers task types such as visual question answering, visual reasoning, open-domain question answering, and OCR. A portion of plain-text instruction data is also mixed in to compensate for the scarcity of multimodal data and to alleviate forgetting of the instruction-following ability. This stage uses the same instruction template as the Chinese-Alpaca model.
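As a rough illustration of this parameter-efficient setup, the sketch below applies LoRA adapters to the vision encoder and the LLM with the PEFT library while keeping the Resampler fully trainable. The LLM checkpoint path, the target module lists, and the LoRA hyperparameters are assumptions for illustration, not the project's released training configuration.

```python
# Sketch of the LoRA-based trainable-parameter setup described above.
# Checkpoint path, target modules, and LoRA hyperparameters are assumptions.
import torch
from transformers import CLIPVisionModel, LlamaForCausalLM
from peft import LoraConfig, get_peft_model

# Vision encoder (CLIP-ViT-L/14) with LoRA on its attention projections.
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
vision_lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
vision_encoder = get_peft_model(vision_encoder, vision_lora)

# LLM initialized from a Chinese-Alpaca-Plus-7B checkpoint (hypothetical local path).
llm = LlamaForCausalLM.from_pretrained("path/to/chinese-alpaca-plus-7b")
llm_lora = LoraConfig(
    r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"
)
llm = get_peft_model(llm, llm_lora)

# Resampler (module from the earlier sketch): all parameters are trained from scratch,
# so they stay fully trainable rather than receiving LoRA adapters.
resampler = SimpleResampler(vision_dim=1024, llm_dim=4096)

# The optimizer sees only the LoRA parameters (the PEFT wrapper freezes the rest)
# plus the full Resampler.
trainable = [
    p
    for module in (vision_encoder, llm, resampler)
    for p in module.parameters()
    if p.requires_grad
]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```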