The first comprehensive evaluation of Llama-2: open-source models at home and abroad compete
In July 2023, the development of large language models (LLMs) entered a new stage, and open source became a hot topic.
- On July 6, Shanghai Artificial Intelligence Laboratory and SenseTime jointly released the InternLM (Shusheng Puyu) open-source system ( https://github.com/InternLM ). They not only open-sourced the lightweight InternLM-7B model, but also took the lead in open-sourcing a full-chain tool system covering data, training, and evaluation, all under a completely free commercial license;
- On July 14, Zhipu AI made ChatGLM2-6B free for commercial use;
- On July 19, Meta open-sourced the more powerful Llama-2, along with a more permissive commercial license.
Facing this new wave of open-source language models, Turing Award winner Yann LeCun commented on Twitter:
This is going to change the landscape of the LLM market.
However, can the performance of these open-source models live up to the industry's eager expectations?
After obtaining the Llama-2 series of open-source models, we conducted a comprehensive evaluation of them using OpenCompass (https://opencompass.org.cn).
How strong is Llama-2?
Llama-2 introduces many technical improvements over Llama-1, bringing real gains in model performance, inference efficiency, and safety. The important improvements are as follows:
- The model architecture adopts Grouped-Query Attention (GQA) to improve inference efficiency, and the context length is doubled from 2K to 4K tokens.
- The pre-training corpus is increased from 1.4T tokens to 2T tokens.
- The supervised fine-tuning (SFT) stage places more emphasis on data quality: a smaller amount of higher-quality SFT data yields significantly better results than millions of public SFT examples.
- Three safety-training techniques, Supervised Safety Fine-Tuning, Safety RLHF, and Safety Context Distillation, are introduced to improve the model's safety.
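To make the GQA improvement concrete, here is a minimal sketch of grouped-query attention: several query heads share one key/value head, shrinking the KV cache. The dimensions and head counts below are illustrative toys, not Llama-2's actual hyperparameters.

```python
# Minimal sketch of Grouped-Query Attention (GQA).
# Toy dimensions, chosen for illustration only.
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """q: (n_heads, seq, d); k, v: (n_kv_heads, seq, d).
    Each group of query heads shares one key/value head."""
    n_heads, seq_len, d = q.shape
    group_size = n_heads // n_kv_heads       # query heads per KV head
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group_size                  # shared KV head for this query head
        scores = q[h] @ k[kv].T / np.sqrt(d)  # scaled dot-product scores
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        out[h] = weights @ v[kv]
    return out

# 8 query heads sharing 2 KV heads: the KV cache is 4x smaller
# than standard multi-head attention with 8 KV heads.
q = np.random.randn(8, 4, 16)
k = np.random.randn(2, 4, 16)
v = np.random.randn(2, 4, 16)
print(grouped_query_attention(q, k, v, n_kv_heads=2).shape)  # (8, 4, 16)
```

With `n_kv_heads == n_heads` this reduces to standard multi-head attention, and with `n_kv_heads == 1` it becomes multi-query attention; GQA sits between the two.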
Comprehensive gains over the previous generation, but still hard-pressed to match ChatGPT
So, what is the overall capability of Llama-2?
The official technical report presents results on about 20 datasets, but the capability dimensions it evaluates are limited, and the set of comparison models is not comprehensive enough.
Here we use the open-source evaluation tool OpenCompass to comprehensively evaluate each released Llama-2 model on more than 40 evaluation sets, measuring large-model capability across five dimensions: academic subjects, language, knowledge, understanding, and reasoning.
The results can be summarized in the following radar chart:
The following table lists the performance of Llama, Llama-2, and ChatGPT on several representative evaluation sets:
For more comprehensive and detailed evaluation results, please refer to https://opencompass.org.cn.
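To illustrate how per-dataset results roll up into the five capability dimensions behind a radar chart like this, here is a simplified aggregation sketch. The dataset-to-dimension mapping and the scores below are hypothetical placeholders, not OpenCompass's actual implementation or numbers.

```python
# Illustrative sketch: averaging per-dataset scores within each
# capability dimension. Mapping and scores are hypothetical.
from collections import defaultdict

DIMENSIONS = {
    "MMLU": "subjects", "C-Eval": "subjects",
    "Flores": "language",
    "TriviaQA": "knowledge",
    "RACE": "understanding",
    "GSM8K": "reasoning", "MATH": "reasoning",
}

def dimension_scores(dataset_scores):
    """Group per-dataset accuracies by dimension and average each group."""
    buckets = defaultdict(list)
    for dataset, score in dataset_scores.items():
        buckets[DIMENSIONS[dataset]].append(score)
    return {dim: sum(s) / len(s) for dim, s in buckets.items()}

hypothetical = {"MMLU": 69.8, "C-Eval": 50.1, "Flores": 30.2,
                "TriviaQA": 70.5, "RACE": 85.0, "GSM8K": 63.5, "MATH": 13.5}
print(dimension_scores(hypothetical))
```

Each radar-chart axis is then one dimension's averaged score, which is why a model can lead on individual datasets yet trail on the aggregated axis.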
Comprehensive improvement over the previous generation:
In overall capability, Llama-2-70B (green) outperforms Llama-1-65B (purple), with significant gains across language, knowledge, reasoning, comprehension, and academic-subject dimensions. For example, the score on the comprehensive test set MMLU rises from 63.71 to 69.75, and on GSM8K from 54.51 to 63.46.
Chat and base models are broadly on par:
Compared with the base model Llama-2-70B (green), the fine-tuned and aligned Llama-2-70B-Chat (yellow) has broadly comparable overall capability: it improves on the base in language, reasoning, and understanding, while declining slightly in academic subjects and knowledge. For example, on the translation evaluation set Flores and the code evaluation set HumanEval, the Chat model improves by over 40% and 20% relative, respectively, while dropping about 10% relative on MMLU and TriviaQA.
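The relative changes quoted above follow the usual formula (new − old) / old. A quick sketch, using hypothetical scores for illustration only (not the measured values):

```python
def relative_change(base, chat):
    """Relative change of the Chat model vs. the base model, in percent."""
    return 100.0 * (chat - base) / base

# Hypothetical scores, for illustration only:
print(relative_change(20.0, 29.0))   # +45%: "over 40% relative improvement"
print(relative_change(69.8, 63.2))   # about -9.5%: "~10% relative decrease"
```

Note that a large relative gain on a low-scoring set (like translation for a base model) can coexist with a small absolute change, which is why both views are worth reporting.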
There is still a large gap to ChatGPT:
Compared with ChatGPT-0613 (blue), Llama-2-70B-Chat (yellow) still has catching up to do, especially in reasoning, comprehension, and academic subjects. On the math evaluation set MATH and the code evaluation set HumanEval in particular, the gap is more than twofold.
Chinese proficiency is clearly weak
Chinese makes up only a small proportion of Llama's training corpus, and the fine-tuning stage was not optimized for Chinese, so the current Llama-2-Chat remains weak on Chinese tasks.
A typical symptom: given a question in Chinese, the model often still answers in English.
To better understand Llama-2's Chinese and English capabilities, we analyzed the Chinese and English datasets in OpenCompass separately.
The results show that:
- In English, Llama-2 is already close to ChatGPT in language ability, knowledge level, and comprehension.
- In Chinese, Llama-2 is inferior to ChatGPT across the board. This suggests that Llama-2 itself is not a particularly good choice as a base model for directly supporting Chinese applications.
- In reasoning ability, whether in Chinese or English, Llama-2 still lags far behind ChatGPT. Evidently, for large models, improving reasoning ability is much harder than improving basic language ability.
Safety alignment makes the model overcautious
A distinctive feature of Llama-2 is its relatively thorough safety alignment scheme during training, which greatly improves value alignment and safety.
But in our tests, we also found that Llama-2 does not strike a particularly good balance between safety and capability: the model is overly cautious and refuses to answer many ordinary questions.
Domestic models do not fall behind
In recent months, domestic large models have developed rapidly: many companies and research institutions have released their own models, including some with over a hundred billion parameters.
So how do domestic large models compare with Llama-2? Many in the community are asking this question.
Comparison of Heavyweight Models
The 70B-or-larger models released by domestic institutions are generally not yet open source, and many are only available through limited closed-beta APIs, so complete evaluation data for many domestic models is still hard to obtain.
On OpenCompass, the hundred-billion-parameter InternLM (Shusheng Puyu) model, InternLM-104B, released by Shanghai Artificial Intelligence Laboratory and SenseTime together with several universities, already has comprehensive evaluation results.
Based on these results, we compared InternLM-104B with ChatGPT and Llama-2:
In the heavyweight comparison, InternLM performed well, leading both Llama-2 and ChatGPT on most mainstream evaluation sets. Specifically, InternLM-104B outperforms ChatGPT on 34 of the 43 evaluation sets, and Llama-2-70B on 41 of them.
A large lead on Chinese exams:
On the Chinese exam evaluation set C-Eval and the college entrance examination set GAOKAO-Bench, InternLM-104B significantly exceeds Llama-2-70B.
A slight edge in language ability:
On basic language tasks in Chinese and English, including word comprehension, idiom, and translation evaluation sets, InternLM-104B holds the advantage, with an even larger lead on the Chinese sets.
In reading comprehension, the "scholar" (Shusheng) lives up to its name:
On all kinds of Chinese and English reading-comprehension evaluation sets, InternLM-104B shows a clear advantage, excelling at summarizing and understanding key information in text passages.
Superior reasoning skills:
On common-sense, mathematical, and comprehensive reasoning datasets, InternLM-104B performs consistently and holds a definite advantage over Llama-2-70B.
Knowledge QA is a draw: On knowledge question-answering evaluation sets such as BoolQ, CommonSenseQA, TriviaQA, and Natural Questions, the two models perform comparably, with no obvious difference in knowledge level.
Trading wins on code:
InternLM-104B and Llama-2-70B are comparable in coding ability, trading wins on the HumanEval and MBPP datasets.
Comparison of Lightweight Models
While the heavyweight race is neck and neck, competition among open-source models on the 7B lightweight track is just as lively.
Among the many domestic open-source models, Baichuan-7B from Baichuan Intelligence, ChatGLM2-6B from Tsinghua University and Zhipu AI, and InternLM-7B from Shanghai Artificial Intelligence Laboratory have attracted wide industry attention.
We compared these domestic models with Llama-2-7B in a comprehensive evaluation:
The following table lists the performance of these 7B-level models on several representative evaluation sets:
The results show that Llama-2 has a clear advantage in knowledge ability.
But in academic subjects, language, reasoning, and comprehension, both InternLM and ChatGLM2 have surpassed Llama-2, with InternLM clearly in the lead.
Free commercial use: a single spark
A few months ago, Llama's open-source release ignited the community, benefiting many developers and researchers and spawning the entire "alpaca family" of models. Unfortunately, its license restricted commercial use, shutting companies out.
On July 6, at the World Artificial Intelligence Conference, the InternLM (Shusheng Puyu) open-source system was officially released, with InternLM-7B open-sourced under a free commercial license.
Since then, open-source models such as ChatGLM2-6B and Llama-2 have opened up free commercial use one after another, following the trend and answering the community's call.
We believe this single spark in the open-source community will start a prairie fire across the industry, further lowering the barrier to applying large models.
(This article is published with the authorization of QbitAI; the views are the author's alone.)