Llama 2 released! A comprehensive analysis of performance, parameters, architecture and training methods

Hayo News
July 19th, 2023
Llama 2 has been released, and this version is available for commercial use. I have organized what is known so far:

  • Llama2 performance and parameters
  • How to use and restrictions
  • Model architecture of Llama2
  • Llama2 training method

👇The following is the detailed information

Llama2 performance and parameters

  • Llama 2 comes in three sizes: 7B, 13B and 70B parameters.
  • Llama 2 was trained on 40% more data than Llama 1, and its context length is twice that of Llama 1.
  • It was pre-trained on 2 trillion tokens and has a context length of 4096 tokens.
  • According to Meta, Llama 2 outperforms other open-source language models on a number of external benchmarks, including reasoning, coding, proficiency, and knowledge tests.
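As a quick sanity check of these figures, here is a small arithmetic sketch. The Llama 1 baselines (1.4 trillion tokens for its largest models and a 2048-token context) are assumptions taken from the Llama 1 release, not from this article.

```python
# Assumed Llama 1 baselines (from the Llama 1 release, not from this article).
LLAMA1_TOKENS = 1.4e12
LLAMA1_CONTEXT = 2048

# Llama 2 figures as stated in the article.
LLAMA2_TOKENS = 2.0e12
LLAMA2_CONTEXT = 4096

data_increase = LLAMA2_TOKENS / LLAMA1_TOKENS - 1   # fractional increase in training tokens
context_ratio = LLAMA2_CONTEXT / LLAMA1_CONTEXT     # ratio of context lengths

print(f"{data_increase:.0%} more data, {context_ratio:.0f}x context")
# prints "43% more data, 2x context"
```

Under these assumptions the increase works out to roughly 43%, which matches the rounded "40% more" in Meta's announcement.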

How to use and restrictions

  • Unlike Llama 1, whose weights were leaked, Llama 2 is officially open for commercial use.
  • Products with more than 700 million monthly active users need to request a separate license from Meta.
  • The Llama Materials, or any output or results of the Llama Materials, may not be used to improve any other large language model.

Model architecture of Llama2

  • Llama 2-Chat is based on the Llama 2 series of pre-trained language models. Llama 2 uses the standard Transformer architecture.
  • Llama 2-Chat is optimized with supervised fine-tuning and reinforcement learning from human feedback (RLHF). Supervised fine-tuning is performed first, and then reinforcement learning algorithms, including rejection sampling and PPO, are applied for iterative improvement.
  • Several optimizations are employed, such as pre-normalization, the SwiGLU activation function, and Rotary Position Embedding (RoPE).
  • Llama 2-Chat comes in 7 billion, 13 billion and 70 billion parameter versions; Meta also trained a 34 billion parameter version that was not part of the initial release. Training used publicly available data, and no Meta user data was used.
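To make pre-normalization and SwiGLU more concrete, here is a minimal pure-Python sketch of one feed-forward sub-layer. It assumes RMSNorm as the normalization (the choice described in the Llama papers); the function names, weight shapes and the plain-list representation are illustrative only and are not Meta's implementation.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then apply a learned scale."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]

def silu(z):
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

def pre_norm_ffn_block(x, norm_weight, w_gate, w_up, w_down):
    """Pre-normalization: normalize the sub-layer input, then add the residual."""
    normed = rms_norm(x, norm_weight)
    out = swiglu_ffn(normed, w_gate, w_up, w_down)
    return [xi + oi for xi, oi in zip(x, out)]
```

The point of pre-normalization is that the residual path (`x` itself) is never normalized; only the input to the sub-layer is, which tends to make deep stacks easier to train.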

Llama2's training methodology

1. Pre-training

  • Pre-training is performed on publicly available online data, totaling 2 trillion tokens. The data was cleaned, and some websites containing a large amount of personal information were removed. The model adopts the standard Transformer architecture with some optimizations such as RoPE.
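RoPE, mentioned above, encodes position by rotating pairs of vector components through an angle proportional to the token position. The following is a minimal sketch, assuming adjacent components are paired and a frequency base of 10000; real implementations differ in how they pair components, and `apply_rope` is an illustrative name.

```python
import math

def apply_rope(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle position * base**(-i/d)."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)   # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# Attention scores depend only on the relative offset (here 2 in both cases).
score_a = dot(apply_rope(q, 5), apply_rope(k, 3))
score_b = dot(apply_rope(q, 2), apply_rope(k, 0))
```

Because each pair is only rotated, vector norms are preserved, and the dot product between a rotated query and a rotated key depends on the difference of their positions rather than on the absolute positions.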

2. Supervised fine-tuning

  • Supervised fine-tuning uses high-quality human-annotated data (about 30,000 examples). The loss is computed on answer tokens only, not on prompt tokens.
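Computing the loss on answer tokens only can be sketched as follows. This is a minimal illustration that assumes per-token log-probabilities are already available from the model; `masked_sft_loss` and its arguments are hypothetical names.

```python
import math

def masked_sft_loss(token_log_probs, is_answer_token):
    """Mean negative log-likelihood over answer tokens; prompt tokens are ignored."""
    answer_lps = [lp for lp, keep in zip(token_log_probs, is_answer_token) if keep]
    if not answer_lps:
        raise ValueError("sequence contains no answer tokens")
    return -sum(answer_lps) / len(answer_lps)

# Two prompt tokens followed by two answer tokens (made-up probabilities).
log_probs = [math.log(0.5), math.log(0.1), math.log(0.25), math.log(0.25)]
mask = [False, False, True, True]
loss = masked_sft_loss(log_probs, mask)
```

In this example the two prompt tokens contribute nothing, so the loss equals the negative log of 0.25, no matter how poorly the model predicts the prompt.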

3. Reinforcement Learning Based on Human Feedback

  • Human preference data is collected by having annotators compare responses and choose the better one. A reward model is then trained to score responses, and the chat model is tuned iteratively using rejection sampling and the PPO algorithm.
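Two pieces of this step can be sketched compactly: a pairwise ranking loss for the reward model, and rejection sampling as picking the best of several candidates. The sketch assumes scalar rewards; the optional `margin` follows the form of the loss described in the Llama 2 paper, while the function names and the toy reward function are illustrative. PPO is not shown.

```python
import math

def pairwise_ranking_loss(reward_chosen, reward_rejected, margin=0.0):
    """-log(sigmoid(r_chosen - r_rejected - margin)), computed stably."""
    diff = reward_chosen - reward_rejected - margin
    if diff > 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

def best_of_n(candidates, reward_fn):
    """Rejection sampling step: keep the candidate with the highest reward."""
    return max(candidates, key=reward_fn)

# Toy usage: the loss is small when the chosen response scores higher.
good_order = pairwise_ranking_loss(2.0, 0.0)
bad_order = pairwise_ranking_loss(0.0, 2.0)
# Hypothetical reward function: response length stands in for a reward model.
picked = best_of_n(["a", "bbb", "cc"], reward_fn=len)
```

In training, the candidates selected by `best_of_n` would be used as new fine-tuning targets, so the model is pulled toward outputs the reward model prefers.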

4. Safety

  • Safe and helpful demonstration data is collected for supervised fine-tuning. A separate safety reward model is trained. Safety is further improved with methods such as context distillation.
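The distillation step, called context distillation in the Llama 2 paper, can be sketched as a data-construction routine: generate a response with a safety preprompt prepended, then keep the response paired with the original prompt only, so that fine-tuning teaches the model to behave as if the preprompt were present. The `generate` callable, the preprompt text and the function name are hypothetical placeholders.

```python
def build_context_distillation_examples(prompts, generate, safety_preprompt):
    """Pair each raw prompt with the response produced under a safety preprompt."""
    examples = []
    for prompt in prompts:
        # The preprompt is only visible at generation time.
        safe_response = generate(safety_preprompt + "\n" + prompt)
        # The stored training example omits the preprompt.
        examples.append({"prompt": prompt, "response": safe_response})
    return examples

# Stand-in for a real model call, used only to show the data flow.
def fake_generate(text):
    return "RESPONSE TO: " + text

examples = build_context_distillation_examples(
    ["How do I stay safe online?"],
    fake_generate,
    safety_preprompt="You are a safe and responsible assistant.",
)
```

The resulting examples would then be used for ordinary supervised fine-tuning.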

5. Evaluation

  • In human evaluation of helpfulness on about 4,000 prompts, Llama 2-Chat was on par with ChatGPT and other models. In human evaluation of safety on about 2,000 prompts, it outperformed multiple baseline models.

