Chinese versions of the open-source Llama 2 now cover both language and multimodal large models, all fully available for commercial use
On July 19, Meta finally released Llama 2, free for commercial use, bringing a major shake-up to the open-source large-model field.
The Llama 2 model family comes in 7-billion, 13-billion, and 70-billion parameter variants, was trained on 40% more data than the previous generation, delivers excellent performance, and supports multiple languages.
The fly in the ointment is that the Llama 2 corpus is still dominated by English (89.7%), with Chinese accounting for only 0.13%. This makes it difficult for Llama 2 to hold fluent, in-depth conversations in Chinese.
The community's "first" open-source Chinese Llama 2 model
The good news is that the day after Meta AI open-sourced Llama 2, the first open-source Chinese Llama 2 model that could be downloaded and run appeared in the community. Called "Chinese Llama 2 7B", it was released by the domestic AI startup LinkSoul.AI.
In just two weeks, the project has passed 10,000 downloads on Hugging Face and 1,200 stars on GitHub.
According to the project page, the Chinese-Llama-2-7b release includes the fully commercially usable Chinese Llama 2 model, Chinese-English SFT datasets, and a chat model optimized on them.
Project address: https://github.com/LinkSoul-AI/Chinese-Llama-2-7b
Currently, ordinary users can experience "Chinese Llama-2 7B Chat" online.
Trial address: https://huggingface.co/spaces/LinkSoul/Chinese-Llama-2-7b
For example, you can ask questions in English and have it answer in Chinese:
Or converse directly in Chinese, and it responds accurately and fluently:
A highlight is its flexible switching between Chinese and English:
Some users have already tried it and report that it works well:
Beyond the language model: two more open-source Chinese multimodal large models
After releasing the first open-source Chinese Llama 2 language model, the LinkSoul.AI team turned to speech-text and vision-text multimodal models, both still in their early stages worldwide, and again took the lead in open-sourcing models in this area, free for domestic developers to download and use commercially.
The two newly open-sourced Chinese multimodal large models are:
- LLaSM, the first open-source multimodal dialogue model supporting bilingual Chinese-English speech-to-text, led by the LinkSoul.AI team in collaboration with top domestic AI teams including the Beijing Academy of Artificial Intelligence, Peking University, and Zero One Wanwu
- Chinese-LLaVA, the first Llama 2-based multimodal model supporting bilingual Chinese-English vision-to-text
Both models are open-sourced under the Apache-2.0 license and are fully available for commercial use.
Shi Yemin, head of the LinkSoul.AI development team, said: "Worldwide, there is still no reliable open-source model for making a model that can hear and see the world. We hope to do our part to close the gap between China's large-model ecosystem and the international state of the art."
Speech-to-Text Multimodal Open Source Dialogue Model (LLaSM)
LinkSoul.AI has open-sourced LLaSM, a commercially usable bilingual Chinese-English speech-language assistant, together with the Chinese-English speech SFT dataset LLaSM-Audio-Instructions. LLaSM is the first open-source, commercially usable dialogue model supporting Chinese-English speech-text multimodal dialogue.
Compared with traditional text-input solutions, LLaSM's convenient voice interaction greatly improves the experience of using large models, while avoiding the cumbersome pipeline and error accumulation of ASR-based solutions.
- Project address: https://github.com/LinkSoul-AI/LLaSM
- Dataset: https://huggingface.co/datasets/LinkSoul/LLaSM-Audio-Instructions
Below is an example of a speech-to-text conversation with LLaSM.
LLaSM also has an accompanying paper.
Model, code and data address: https://huggingface.co/spaces/LinkSoul/LLaSM
Vision-to-Text Multimodal Open-Source Dialogue Model (Chinese-LLaVA)
LinkSoul.AI has open-sourced Chinese-LLaVA, a commercially usable bilingual Chinese-English vision-language assistant, together with the Chinese-English visual SFT dataset Chinese-LLaVA-Vision-Instructions. It is an open-source, commercially usable dialogue model supporting Chinese-English vision-text multimodal dialogue.
- Project address: https://github.com/LinkSoul-AI/Chinese-LLaVA
- Dataset: https://huggingface.co/datasets/LinkSoul/Chinese-LLaVA-Vision-Instructions
Below is an example of a vision-text dialogue with Chinese-LLaVA.
Model, code and data address: https://huggingface.co/spaces/LinkSoul/Chinese-LLaVa
Interpretation of the unified multimodal model architecture
Large language models have demonstrated powerful capabilities in many areas and, to some extent, have raised hopes of achieving artificial general intelligence (AGI). Multimodal models provide a channel for information exchange between modalities, letting visual and speech information complement text semantics, so that the large language model can hear and see the world, taking a further step toward AGI.
Therefore, the key to training a multimodal model is fusing the complementary information of different modalities while making full use of existing large-language-model capabilities. LinkSoul.AI's open-source speech-language and vision-language multimodal models both adopt the framework shown in the figure below.
First, a modality encoder encodes the features of each input modality; then, in a multimodal feature-alignment pre-training stage, a modality adapter is learned to align each modality's input features with the large language model.
The modality adapter and the large language model are then fine-tuned end to end on instruction datasets from different modalities in a supervised fine-tuning (SFT) stage. This stage mixes cross-modal instruction data with text-only instruction data for multi-task training, which the LinkSoul.AI team believes helps avoid over-reliance on a single modality and its biases, and naturally lets one model serve multiple modalities.
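The encoder → adapter → LLM pipeline described above can be sketched in a few lines of plain Python. This is a toy illustration, not the project's code: all class names are placeholders, and plain lists stand in for real feature tensors.

```python
class ModalityEncoder:
    """Stand-in for a frozen encoder (e.g. Whisper for audio, CLIP for images)."""
    def encode(self, raw_input):
        # Toy featurizer: map each input element to a number in [0, 1).
        return [float(hash(x) % 100) / 100.0 for x in raw_input]

class ModalityAdapter:
    """Small trainable module that projects encoder features into the LLM's embedding space."""
    def __init__(self, scale=1.0):
        self.scale = scale  # the only "parameter" of this toy adapter
    def project(self, features):
        return [f * self.scale for f in features]

class LargeLanguageModel:
    """Stand-in LLM that consumes projected modality features plus instruction tokens."""
    def generate(self, modality_embeddings, instruction_tokens):
        # A real LLM would attend over the concatenated sequence; here we just
        # report the shape of what the model receives.
        return f"response(modal_dims={len(modality_embeddings)}, text_tokens={len(instruction_tokens)})"

def multimodal_forward(encoder, adapter, llm, raw_input, instruction_tokens):
    features = encoder.encode(raw_input)                 # 1. frozen modality encoder
    embeddings = adapter.project(features)               # 2. trainable modality adapter
    return llm.generate(embeddings, instruction_tokens)  # 3. LLM over the fused sequence

out = multimodal_forward(ModalityEncoder(), ModalityAdapter(), LargeLanguageModel(),
                         raw_input=["audio_frame_0", "audio_frame_1"],
                         instruction_tokens=["please", "transcribe", "this", "audio"])
print(out)  # response(modal_dims=2, text_tokens=4)
```

The adapter is the only piece that must be learned from scratch, which is what makes the alignment pre-training stage cheap relative to training the LLM itself.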
The LinkSoul.AI team's next step is to further integrate speech, vision, and text so that the large language model supports both speech and visual modalities at once.
In the pre-training stage, both the modality encoder and the large-language-model parameters are frozen, and the adapter is trained on cross-modal speech/vision-text pairs, with the optimization objective of generating the correct response to the input instruction.
Specifically, for the speech modality, Whisper serves as the feature encoder: it is kept frozen and used to extract features from the audio input. The public Chinese and English automatic speech recognition (ASR) datasets Aishell, LibriSpeech, Magicdata, and Primewords are used.
For each (audio, text_label) sample, an instruction matching the sample's language is randomly drawn from the pre-training speech instruction table (see the data section below) to form a sample in the format (audio, instruction, text_label); during training the model predicts text_label.
For the visual modality, CLIP is used as the image feature extractor, and mBART translates LLaVA's open-source visual pre-training data into Chinese to generate Chinese image-text pairs. Both Chinese and English data are used in pre-training so that the model better supports Chinese.
Pre-training thus aligns the features of each modality with the large language model. In the supervised fine-tuning stage, only the modality encoder's weights stay frozen; the modality adapter and large-language-model parameters are unfrozen and fine-tuned on cross-modal instruction data.
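The two-stage freezing scheme described above can be summarized in a small sketch. The module names are placeholders, and the boolean flags stand in for per-parameter `requires_grad` settings in a real framework such as PyTorch.

```python
def trainable_groups(stage):
    """Which parameter groups are updated in each training stage (illustrative)."""
    if stage == "pretrain":
        # Feature-alignment pre-training: only the modality adapter learns.
        return {"modality_encoder": False, "modality_adapter": True, "llm": False}
    if stage == "sft":
        # Supervised fine-tuning: adapter and LLM learn; the encoder stays frozen.
        return {"modality_encoder": False, "modality_adapter": True, "llm": True}
    raise ValueError(f"unknown stage: {stage}")

print(trainable_groups("pretrain"))
print(trainable_groups("sft"))
```

Keeping the encoder frozen in both stages means the Whisper/CLIP checkpoints are used purely as fixed feature extractors.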
To address the near-total absence of public speech multimodal instruction data, the speech-text multimodal instruction dataset LLaSM-Audio-Instructions is constructed from the public datasets WizardLM, ShareGPT, and GPT-4-LLM: speech input serves as the instruction, and the model predicts the corresponding text output.
For the visual modality, the LLaVA open-source visual instruction dataset is likewise translated into Chinese with mBART to produce a Chinese visual instruction dataset, and training proceeds in the same way.
Modality-alignment pre-training datasets
Audio first: the speech multimodal pre-training data uses the public Chinese and English automatic speech recognition (ASR) datasets Aishell, LibriSpeech, Magicdata, and Primewords.
An instruction set is also constructed, and for each (audio, text_label) sample, an instruction is randomly selected according to the sample's language to build a data sample (instruction, audio, text_label).
Table 1: English Simple Instruction Set
Table 2: Chinese Simple Instruction Set
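The random instruction selection described above is straightforward to sketch. The instruction strings below are hypothetical stand-ins in the spirit of Tables 1 and 2; the real instruction tables ship with the project.

```python
import random

# Abbreviated, illustrative instruction tables (the real ones are in Tables 1 and 2).
INSTRUCTIONS = {
    "en": ["Please transcribe this audio.", "What does this recording say?"],
    "zh": ["请转写这段语音。", "这段录音说了什么？"],
}

def build_pretrain_sample(audio, text_label, language, rng=random):
    """Turn an ASR pair (audio, text_label) into (instruction, audio, text_label)."""
    instruction = rng.choice(INSTRUCTIONS[language])
    return {"instruction": instruction, "audio": audio, "text_label": text_label}

sample = build_pretrain_sample("clip_001.wav", "你好，世界", language="zh")
print(sample["instruction"] in INSTRUCTIONS["zh"])  # True
```

During pre-training the model sees the instruction and the audio features and is optimized to predict `text_label`.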
Then vision: for the visual modality, LLaVA's open-source visual pre-training data is translated into Chinese with mBART to generate Chinese image-text pairs, improving the model's Chinese ability.
Instruction fine-tuning datasets
Audio first: when building the audio dataset, all dialogue data is first carefully filtered to remove content unsuitable for speech synthesis, including code, symbols, URLs, and other unreadable text. To further ensure quality, the chatbot's answer in each conversation turn is filtered again, and turns that contain no valuable information are discarded. Finally, speech data is generated with the Microsoft Azure speech synthesis API.
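The cleaning steps above can be sketched with a couple of heuristic filters. The exact rules the team used are not public, so the patterns and thresholds below are illustrative only.

```python
import re

# Illustrative filters for text that cannot be read aloud naturally.
URL_RE = re.compile(r"https?://\S+")
CODE_RE = re.compile(r"```|\bdef \w+\(|[{};<>]{2,}")  # crude code/markup detector

def speakable(text, min_words=3):
    """Keep only answers suitable for speech synthesis (hypothetical heuristic)."""
    if URL_RE.search(text) or CODE_RE.search(text):
        return False  # code, markup, and URLs are unreadable as speech
    # Drop turns with too little information to be worth synthesizing.
    return len(text.split()) >= min_words

answers = [
    "The capital of France is Paris.",
    "See https://example.com for details.",
    "def foo(): pass",
    "Yes.",
]
kept = [a for a in answers if speakable(a)]
print(kept)  # ['The capital of France is Paris.']
```

Only the surviving answers are then passed to the TTS API to produce the audio side of each instruction pair.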
Then vision: for the visual modality, the LLaVA open-source visual instruction dataset is localized into Chinese with mBART to generate Chinese multimodal instruction data, giving the model the ability to follow Chinese visual instructions.
To help the open-source community quickly experience multimodal large models and jointly advance research on them, the training data is open-sourced in the project and available for download from the Hugging Face repository.
For the LinkSoul.AI team, these two open-source, commercially usable multimodal large models not only bring speech and vision capabilities to the large-model ecosystem but also contribute to its multilingual side.
Moreover, because the models are completely free for commercial use, they are of exceptional value to domestic individual developers and startups.
- Aishell: https://www.openslr.org/33/
- LibriSpeech: https://huggingface.co/datasets/librispeech_asr
- Magicdata: https://openslr.org/68/
- Primewords: https://openslr.org/47/
- Whisper: https://huggingface.co/openai/whisper-large-v2
- CLIP: https://huggingface.co/openai/clip-vit-large-patch14
- LLaVA: https://llava-vl.github.io/
- mBART: https://arxiv.org/pdf/2001.08210.pdf, https://huggingface.co/facebook/mbart-large-50-one-to-many-mmt
- WizardLM: https://github.com/nlpxucan/WizardLM
- ShareGPT: https://sharegpt.com/
- GPT-4-LLM: https://arxiv.org/abs/2304.03277
- Microsoft Azure Speech Synthesis API: https://azure.microsoft.com/en-us/products/ai-services/ai-speech