Llama2 released! Comprehensive analysis of performance, parameters, architecture and training methods!
Llama 2 has been released, and this version is available for commercial use. I have organized what is known so far:
- Llama2 performance and parameters
- How to use and restrictions
- Model architecture of Llama2
- Llama2 training method
👇 Details below
Llama2 performance and parameters
- Llama 2 comes in three sizes: 7B, 13B and 70B
- Llama 2 has 40% more training data than Llama 1, and the context length is twice that of Llama 1.
- It was pre-trained on 2 trillion tokens, with a context length of 4096
- According to Meta, Llama 2 outperforms other open-source language models on a number of external benchmarks, including reasoning, coding, proficiency, and knowledge tests.
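As a quick sanity check, the figures above can be turned around to see what they imply about Llama 1. This is only arithmetic on the numbers quoted in this list, taken at face value; the variable names are illustrative.

```python
# Figures quoted in the list above.
llama2_tokens = 2.0e12      # 2 trillion pre-training tokens
llama2_context = 4096       # context length in tokens

# "40% more training data" means llama2 = 1.4 * llama1.
implied_llama1_tokens = llama2_tokens / 1.4

# "Twice the context length" means llama2 = 2 * llama1.
implied_llama1_context = llama2_context // 2

print(f"Implied Llama 1 tokens:  {implied_llama1_tokens / 1e12:.2f} trillion")
print(f"Implied Llama 1 context: {implied_llama1_context}")
```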
How to use and restrictions
- Unlike the first version, whose weights were leaked, this time Meta permits commercial use.
- Products with more than 700 million monthly active users need to apply to Meta for a separate license
- The Llama Materials, or any output or results from them, may not be used to improve any other large language model (other than Llama 2 or its derivatives).
Model architecture of Llama2
- Llama 2-Chat is based on the Llama 2 series of pre-trained language models. Llama 2 uses the standard Transformer architecture.
- Llama 2-Chat is optimized with supervised fine-tuning and reinforcement learning from human feedback (RLHF). Supervised fine-tuning is performed first, and then reinforcement learning methods, including rejection sampling and PPO, are applied for iterative improvement.
- Several optimizations are employed, such as pre-normalization (RMSNorm), the SwiGLU activation function, and rotary positional embeddings (RoPE).
- Llama 2-Chat was trained in 7 billion, 13 billion, 34 billion and 70 billion parameter versions (the 34B model was not released). Training used publicly available data, and no Meta user data was used.
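To make the three optimizations named above more concrete, here is a minimal sketch in plain Python on single vectors. It is an illustration under assumptions, not Meta's implementation: the function names, the `eps` value, the RoPE base of 10000 and the choice to rotate adjacent pairs of dimensions are all picked for the example, and real models apply these operations to batched tensors.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then scale by a learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

def prenorm_ffn_block(x, norm_weight, w_gate, w_up, w_down):
    """Pre-normalization: normalize the sublayer input, then add the residual."""
    y = swiglu_ffn(rms_norm(x, norm_weight), w_gate, w_up, w_down)
    return [a + b for a, b in zip(x, y)]

def rope(x, position, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by the angle position * base**(-2i/d)."""
    d = len(x)
    out = list(x)
    for i in range(d // 2):
        angle = position * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out

if __name__ == "__main__":
    q, k = [0.5, -1.0, 2.0, 0.25], [1.5, 0.5, -0.5, 1.0]
    dot = lambda a, b: sum(p * r for p, r in zip(a, b))
    # Both pairs of positions are 2 apart, so the two dot products agree
    # up to floating-point rounding.
    print(dot(rope(q, 3), rope(k, 1)), dot(rope(q, 7), rope(k, 5)))
```

The property worth noticing in `rope` is that the dot product of a rotated query and a rotated key depends only on the distance between their positions, which is what lets attention scores encode relative position.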
Llama2's training methodology
1. Pre-training
- Pre-training is performed using publicly available online data, totaling 2 trillion tokens. The data was cleaned, and some websites containing a large amount of personal information were removed. The model adopts the standard Transformer architecture with some optimizations such as RoPE.
2. Supervised fine-tuning
- Supervised fine-tuning uses high-quality human-annotated data (about 30,000 examples). The loss is computed on answer tokens only, not on prompt tokens.
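Training on the answer but not on the prompt amounts to masking the prompt positions out of the next-token loss. The following is a minimal sketch of that idea in plain Python, assuming per-position logits are already available; `sft_loss` and `is_answer_token` are hypothetical names chosen for the example.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def sft_loss(logits_per_position, target_ids, is_answer_token):
    """Mean cross-entropy over answer positions; prompt positions are skipped."""
    total, count = 0.0, 0
    for logits, target, keep in zip(logits_per_position, target_ids, is_answer_token):
        if not keep:
            continue  # prompt token: contributes nothing to the loss
        total += -log_softmax(logits)[target]
        count += 1
    return total / count if count else 0.0

if __name__ == "__main__":
    uniform = [0.0, 0.0, 0.0, 0.0]           # toy vocabulary of 4 tokens
    logits = [uniform] * 4                   # 2 prompt positions + 2 answer positions
    targets = [1, 2, 3, 0]
    mask = [False, False, True, True]        # True marks answer tokens
    print(sft_loss(logits, targets, mask))   # equals log(4) for uniform logits
```

Because masked positions are skipped entirely, whatever the model predicts over the prompt has no effect on the loss or on the gradient.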
3. Reinforcement Learning Based on Human Feedback
- Collect human preference data by letting annotators compare responses and choose the better one. Train a reward model to score the responses. Tune the model iteratively using rejection sampling and PPO.
- Collect safety and helpfulness data for supervised fine-tuning. Train a separate safety reward model. Enhance safety using methods such as context distillation.
- Human evaluation of helpfulness on about 4,000 prompts puts it on par with ChatGPT and other models. Human evaluation of safety on about 2,000 prompts shows it outperforming multiple baseline models.
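The reward-model and rejection-sampling steps in stage 3 can be sketched as follows. This is an illustration under assumptions: the text above does not give the loss, so the code uses a common pairwise ranking objective with an optional margin, and rejection sampling is shown in its simplest best-of-N form. The names `ranking_loss`, `best_of_n` and `reward_fn` are hypothetical.

```python
import math

def ranking_loss(score_chosen, score_rejected, margin=0.0):
    """Pairwise loss: -log(sigmoid(score_chosen - score_rejected - margin)).

    The loss is small when the reward model scores the human-preferred
    response well above the rejected one.
    """
    diff = score_chosen - score_rejected - margin
    # Stable form of log(1 + exp(-diff)) for both signs of diff.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

def best_of_n(candidates, reward_fn):
    """Rejection sampling, simplest form: keep the highest-scoring candidate."""
    return max(candidates, key=reward_fn)

if __name__ == "__main__":
    print(ranking_loss(2.0, 0.0))   # preferred response scored higher: low loss
    print(ranking_loss(0.0, 2.0))   # preferred response scored lower: high loss
    # Toy reward function (response length), standing in for a trained reward model.
    print(best_of_n(["a", "bbb", "cc"], len))
```

In a real pipeline the selected responses would then serve as fine-tuning targets, and `reward_fn` would be the trained reward model rather than a toy function.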