Llama2 is released! Comprehensive analysis of performance, parameters, architecture, and training methodology!
Llama2 has been released, and this version is now available for commercial use. I have compiled some known information in detail:
Performance and Parameters of Llama2
How to Use and Limitations
Llama2 Model Architecture
Training Methodology of Llama2
👇Below is the detailed information
Performance and Parameters of Llama2:
Llama2 comes in three sizes: 7B, 13B, and 70B parameters. Its training data is 40% larger than Llama1's, and its context length is twice as long. Pretraining used 2 trillion tokens with a context length of 4096. According to Meta, Llama2 outperforms other open-source language models on a range of external benchmarks, including reasoning, coding, proficiency, and knowledge tests.
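As a quick sanity check, the ratios above can be turned back into the Llama1 figures they imply. This is only arithmetic on the numbers quoted in this post; the variable names are illustrative and the derived values are not quoted from Meta.

```python
# Figures stated in the text above.
llama2_tokens = 2.0e12   # 2 trillion pretraining tokens
llama2_context = 4096    # context length in tokens
data_growth = 1.40       # "40% larger" training data
context_growth = 2       # "twice the context length"

# Implied Llama1 figures (derived here, not quoted from Meta).
llama1_tokens = llama2_tokens / data_growth
llama1_context = llama2_context // context_growth

print(f"Implied Llama1 tokens:  {llama1_tokens / 1e12:.2f} trillion")
print(f"Implied Llama1 context: {llama1_context} tokens")
```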
How to Use and Limitations
Unlike Llama1, whose weights were initially leaked, Llama2 is licensed by Meta for commercial use. Products with more than 700 million monthly active users must apply to Meta separately for a commercial license. The license also prohibits using Llama materials, or any output or results from Llama, to improve any other large language model.
Llama2 Model Architecture
Llama2-Chat is built on the Llama2 series of pretrained language models, which use a standard Transformer architecture with several optimizations, such as pre-normalization, the SwiGLU activation function, and Rotary Position Embeddings (RoPE). Llama2-Chat is optimized through supervised fine-tuning followed by reinforcement learning with human feedback (RLHF): after the initial supervised fine-tuning, the model is improved iteratively with rejection sampling and PPO (Proximal Policy Optimization). Llama2-Chat models were trained with 7 billion, 13 billion, 34 billion, and 70 billion parameters, although the 34 billion version was not part of the public release. Training used publicly available data and did not use any Meta user data.
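To make two of these optimizations concrete, here is a minimal pure-Python sketch of a pre-normalized feed-forward block with a SwiGLU activation. It assumes RMSNorm as the normalization (the variant used in the Llama family, though this post does not name it), and the weight names (w_gate, w_up, w_down) are hypothetical. It operates on a single token vector and is an illustration, not Meta's implementation.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def silu(z):
    """SiLU (Swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

def pre_norm_ffn_block(x, norm_weight, w_gate, w_up, w_down):
    """Pre-normalization: normalize the sub-layer input, then add the residual."""
    normed = rms_norm(x, norm_weight)
    ffn_out = swiglu_ffn(normed, w_gate, w_up, w_down)
    return [xi + fi for xi, fi in zip(x, ffn_out)]

# Toy example: model width 2, hidden width 3, arbitrary weights.
x = [1.0, -2.0]
w_gate = [[0.5, 0.1], [0.2, 0.3], [0.0, 1.0]]
w_up = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w_down = [[0.1, 0.2, 0.3], [0.3, 0.2, 0.1]]
print(pre_norm_ffn_block(x, [1.0, 1.0], w_gate, w_up, w_down))
```

The point of pre-normalization is that the normalization is applied to the input of each sub-layer rather than to its output, so the residual path carries the unnormalized signal.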
Training Methodology of Llama2
1. Pre-training
Pretraining used publicly available online data totaling 2 trillion tokens. The data was cleaned, and websites known to contain large amounts of personal information were removed. The model uses the standard Transformer architecture with optimizations such as RoPE (Rotary Position Embeddings).
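The idea behind RoPE can be sketched in a few lines: each consecutive pair of dimensions in a query or key vector is rotated by an angle proportional to the token position. The version below is a minimal pure-Python illustration that assumes the common base of 10000 and the interleaved pairing of dimensions; the function name apply_rope is hypothetical and real implementations work on batched tensors.

```python
import math

def apply_rope(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle position * base**(-i/d)."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# The attention score depends only on the relative offset between positions:
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
score_near_start = dot(apply_rope(q, 5), apply_rope(k, 3))
score_shifted = dot(apply_rope(q, 12), apply_rope(k, 10))
print(abs(score_near_start - score_shifted) < 1e-9)
```

Because both vectors are rotated, the dot product between a query at position m and a key at position n depends only on m - n, which is what makes the encoding relative.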
2. Supervised fine-tuning
Supervised fine-tuning used approximately 30,000 high-quality, human-annotated examples. The training loss was computed on the answer tokens only, not on the prompt tokens.
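Training on the answer rather than the prompt is usually done by masking the prompt tokens out of the loss. The sketch below assumes per-token log-probabilities from the model are already available; the function name sft_loss and the mask layout are hypothetical.

```python
import math

def sft_loss(token_log_probs, is_answer_token):
    """Mean negative log-likelihood over answer tokens only; prompt tokens are ignored."""
    masked = [-lp for lp, keep in zip(token_log_probs, is_answer_token) if keep]
    if not masked:
        raise ValueError("sequence contains no answer tokens")
    return sum(masked) / len(masked)

# Two prompt tokens followed by two answer tokens.
log_probs = [math.log(0.1), math.log(0.2), math.log(0.5), math.log(0.25)]
answer_mask = [False, False, True, True]
loss = sft_loss(log_probs, answer_mask)
print(f"{loss:.4f}")
```

Only the last two tokens contribute here, so a poor prediction of the prompt itself has no effect on the gradient.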
3. Reinforcement Learning with Human Feedback
Human preference data was collected by asking annotators to compare responses and choose the better one. A reward model was trained on these comparisons to score responses. Rejection sampling and PPO (Proximal Policy Optimization) were then used for iterative refinement.
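Two of these steps are easy to sketch. The first function is the standard pairwise ranking loss for training a reward model on "chosen versus rejected" comparisons; the second is best-of-N rejection sampling, which draws several replies and keeps the one the reward model scores highest. The generate and reward_model callables are hypothetical stand-ins, and PPO is not shown. This is a sketch under those assumptions, not the exact recipe Meta used.

```python
import math
import random

def pairwise_ranking_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

def rejection_sample(prompt, generate, reward_model, num_samples=4):
    """Draw num_samples replies and return the one with the highest reward."""
    candidates = [generate(prompt) for _ in range(num_samples)]
    return max(candidates, key=lambda reply: reward_model(prompt, reply))

# Toy stand-ins: a random "model" and a reward that simply prefers longer replies.
rng = random.Random(0)
replies = ["ok", "a longer reply", "the longest reply of all"]

def toy_generate(prompt):
    return rng.choice(replies)

def toy_reward(prompt, reply):
    return float(len(reply))

print(pairwise_ranking_loss(2.0, 0.0) < pairwise_ranking_loss(0.0, 2.0))
print(rejection_sample("Hello", toy_generate, toy_reward))
```

The loss is small when the reward model already ranks the human-preferred reply higher and large when it ranks the pair the wrong way round.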
For safety, safety and helpfulness data was collected for supervised fine-tuning, a separate safety reward model was trained, and safety was further improved with methods such as context distillation.
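The distillation step mentioned above, usually called context distillation, can be sketched as follows: replies are generated with a safety instruction prepended to the prompt, and the model is then fine-tuned on those replies paired with the original prompt alone. The generate callable, the pre-prompt text, and the dictionary layout are all hypothetical; this only illustrates how the training pairs are built.

```python
def build_context_distillation_pairs(prompts, generate, safety_preprompt):
    """Generate with the safety pre-prompt, but store the pair without it."""
    pairs = []
    for prompt in prompts:
        safe_reply = generate(safety_preprompt + "\n" + prompt)
        pairs.append({"prompt": prompt, "response": safe_reply})
    return pairs

# Toy stand-in for a model: it just reports how much context it was given.
def toy_generate(text):
    return f"(reply conditioned on {len(text)} characters of context)"

pairs = build_context_distillation_pairs(
    ["How do I pick a lock?"],
    toy_generate,
    "You are a safe and responsible assistant.",
)
print(pairs[0]["prompt"])
print(pairs[0]["response"])
```

Fine-tuning on such pairs teaches the model to respond as if the safety instruction were present even when it is not.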
Human evaluations of helpfulness on about 4,000 prompts found Llama2-Chat comparable to ChatGPT and similar models. Human evaluations of safety on about 2,000 prompts found it outperforming several of the models it was compared against.