DeepSpeed Chat: One-click RLHF training, making your ChatGPT-like model at the 100-billion-parameter scale up to 15x faster and cheaper
In recent days, ChatGPT and similar models have triggered a wave of excitement in the field of artificial intelligence (AI) and reshaped the digital world. ChatGPT-like models are surprisingly versatile, performing tasks such as summarization, programming, and translation with results comparable to or better than those of human experts. To make training and deploying models like ChatGPT easier, the AI open-source community has made various attempts (such as ChatLLaMa, Alpaca, Vicuna, Databricks-Dolly, etc.).
However, despite the great efforts of the open-source community, there is still no scalable system supporting end-to-end reinforcement learning from human feedback (RLHF), which makes it difficult to train a powerful ChatGPT-like model. For example, training a ChatGPT-like model with 6.7 billion parameters using existing open-source systems usually requires expensive multi-GPU, multi-node clusters, resources that are out of reach for most data scientists and researchers. Moreover, even with such computing resources, existing open-source systems typically achieve less than 5% of the maximum efficiency these machines can deliver. In short, even with expensive multi-GPU clusters, existing solutions cannot easily, quickly, and economically train state-of-the-art ChatGPT-like models with hundreds of billions of parameters.
The training of ChatGPT-style models is based on the RLHF method from the InstructGPT paper, which differs considerably from the common pre-training and fine-tuning of large language models and leaves existing deep learning systems with various limitations when training such models. Therefore, to make ChatGPT-style models accessible to ordinary data scientists and researchers, and to make RLHF training truly widespread in the AI community, we are releasing DeepSpeed-Chat. DeepSpeed-Chat has the following three core capabilities:
(i) Simplified training and enhanced inference experience for ChatGPT-style models: a single script completes multiple training steps, taking a model pre-trained with Hugging Face, running it through all three steps of InstructGPT-style training with the DeepSpeed-RLHF system, and producing your own ChatGPT-like model. Additionally, we provide an easy-to-use inference API for users to test conversational interactions after training.
(ii) DeepSpeed-RLHF module: DeepSpeed-RLHF replicates the training pipeline from the InstructGPT paper, with a one-to-one correspondence to its three steps: a) supervised fine-tuning (SFT), b) reward model fine-tuning, and c) reinforcement learning from human feedback (RLHF). In addition, we provide data abstraction and blending functions so users can train with multiple datasets from different sources.
(iii) DeepSpeed-RLHF system: We integrate DeepSpeed's training engine and inference engine into a unified hybrid engine (DeepSpeed Hybrid Engine, or DeepSpeed-HE) for RLHF training. DeepSpeed-HE seamlessly switches between inference and training modes within RLHF, allowing it to take advantage of various optimizations from DeepSpeed-Inference, such as tensor parallelism and high-performance CUDA kernels for language generation, while the training part benefits from ZeRO- and LoRA-based memory optimization strategies. DeepSpeed-HE also performs intelligent memory management and data caching automatically across the different stages of RLHF.
The DeepSpeed-RLHF system has unparalleled efficiency in large-scale training, making complex RLHF training fast, affordable, and easy to scale up:
Efficient and economical: DeepSpeed-HE is more than 15x faster than existing systems, making RLHF training fast and affordable. For example, DeepSpeed-HE can train an OPT-13B model in just 9 hours and an OPT-30B model in just 18 hours on the Azure cloud, at a cost of less than $300 and $600, respectively.
Excellent scalability: DeepSpeed-HE supports training models with hundreds of billions of parameters and exhibits excellent scalability on multi-node, multi-GPU systems. A 13-billion-parameter model can be trained in only 1.25 hours, while a huge 175-billion-parameter model takes less than a day with DeepSpeed-HE.
VERY IMPORTANT DETAILS: The data in the above two tables (i.e., Table 1 and Table 2) are for step 3 of RLHF training, based on real datasets and measured DeepSpeed-RLHF training throughput. The training runs for one epoch over a total of 135 million (135M) tokens: 67.5 million query tokens (131.9k queries, each with sequence length 256) and 67.5 million generated/answer tokens (131.9k answers, each with sequence length 256), with a maximum global batch size of approximately 5 million tokens (1,024 query-answer pairs). Before making any cost or end-to-end time comparisons with DeepSpeed-RLHF, we advise readers to take note of these settings. For more details, see our benchmark settings page.
Democratizing RLHF Training : With just a single GPU, DeepSpeed-HE can support training models with over 13 billion parameters. This enables data scientists and researchers who do not have access to multi-GPU systems to easily create not only lightweight RLHF models, but also large and powerful models for different usage scenarios.
Next, we will take a deeper look at the three functions of DeepSpeed-Chat introduced above.
2. A simple, efficient, and economical ChatGPT training and inference experience
Next, we show how to train OPT-13B and OPT-66B models with the DeepSpeed-RLHF system using a single command. Even with limited time, you can train an OPT-1.3B model in about two hours on a single consumer-grade GPU. We also demonstrate how to use our DeepSpeed-Chat RLHF API to develop your own models:
Easily train your first ChatGPT-like model using DeepSpeed-Chat's RLHF example
a) Only one script is needed to complete all three stages of RLHF training and generate your first ChatGPT model!
The following example shows how a single script produces a final 13-billion-parameter ChatGPT-like model, using the pre-trained OPT-13B as the actor model and OPT-350M as the reward model:
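The commands below sketch that one-click launch, following the layout and flags of the DeepSpeedExamples repository at the time of writing; paths and flag names may have changed since, so treat this as illustrative rather than authoritative:

```shell
# Fetch the DeepSpeed-Chat application and launch all three RLHF steps with
# one script (actor: OPT-13B, reward model: OPT-350M, single DGX node).
git clone https://github.com/microsoft/DeepSpeedExamples.git
cd DeepSpeedExamples/applications/DeepSpeed-Chat/
pip install -r requirements.txt
python train.py --actor-model facebook/opt-13b \
                --reward-model facebook/opt-350m \
                --deployment-type single_node
```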
In about half a day, your 13-billion-parameter model will be fully trained. The table below shows how long each of the three steps takes:
Table 4. Time required to train OPT-13b using DeepSpeed-Chat for different RLHF steps on a single DGX node (8x A100-40G).
b) Test the trained model using DeepSpeed-Chat's inference API
Load the trained model into the provided simple API, and you can not only ask the model questions but also hold multi-turn conversations.
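As a sketch (the script name and flag follow the DeepSpeed-Chat repository at the time of writing, and the checkpoint path is a placeholder), serving the trained actor model for conversation looks like:

```shell
# Start an interactive chat session backed by your trained actor model.
# The path below is a placeholder for your own checkpoint directory.
python chat.py --path ./output/actor-models/13b
```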
Want to try different model sizes and configurations? Easy to do!
We understand that users often like to experiment with different model sizes and configurations to meet their varying training time, resource, and quality needs. With DeepSpeed-Chat, this is easy: for example, to train a larger, higher-quality model on a GPU cluster for research or business, simply use the same script with your desired model size (e.g., 66 billion parameters) and number of GPUs (e.g., 64):
Within 9 hours, you will have a 66-billion-parameter ChatGPT-like model ready to use in your favorite front-end GUI:
Table 5. Time required to train OPT-66b using DeepSpeed-Chat for different RLHF steps on 8 DGX nodes (64x A100-80G).
If you only have around 1-2 hours, such as a coffee or lunch break, you can also try training a small model with DeepSpeed-Chat. For example, we have prepared a training example for a 1.3-billion-parameter model on a single dataset, so you can test our framework on your consumer-grade GPU. By the time you are back from lunch, your model will be ready to use!
Table 6. Time required to train OPT-1.3b using DeepSpeed-Chat for different RLHF steps on a single consumer-grade A6000-48G.
Use DeepSpeed-Chat's RLHF API to customize your own RLHF training process
DeepSpeed-Chat allows users to build their own RLHF training pipelines using our flexible APIs, which can be used to construct custom RLHF training strategies. We hope these capabilities provide a common interface and backend for creating various RLHF algorithms in research exploration.
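To illustrate the shape of such a pipeline without depending on the actual DeepSpeed-Chat classes, here is a self-contained toy sketch of the generate-then-train loop the APIs expose; every name in it (ToyRLHFEngine, ToyPPOTrainer, and so on) is hypothetical, not the real API:

```python
import random

random.seed(0)  # deterministic toy rewards

class ToyRLHFEngine:
    """Stand-in for the engine that bundles the step-3 models (actor, critic, etc.)."""
    def __init__(self):
        self.actor_weight = 0.0   # toy "parameters"
        self.critic_weight = 0.0

class ToyPPOTrainer:
    """Stand-in for a PPO trainer built on top of the engine."""
    def __init__(self, engine):
        self.engine = engine

    def generate_experience(self, prompt_batch):
        # Inference phase: the actor generates an answer per prompt and the
        # reward model scores it (here, a random stand-in score).
        return [(p, p + " <answer>", random.random()) for p in prompt_batch]

    def train_rlhf(self, experience):
        # Training phase: update actor and critic from the collected experience.
        mean_reward = sum(r for _, _, r in experience) / len(experience)
        self.engine.actor_weight += 0.1 * mean_reward
        self.engine.critic_weight += 0.1 * mean_reward
        return mean_reward

engine = ToyRLHFEngine()
trainer = ToyPPOTrainer(engine)
for prompt_batch in [["Q1", "Q2"], ["Q3", "Q4"]]:
    exp = trainer.generate_experience(prompt_batch)  # experience generation
    trainer.train_rlhf(exp)                          # PPO update
```

The point of the pattern is the strict alternation between an experience-generation call and a training call, which is exactly the boundary where the Hybrid Engine described later switches between inference and training optimizations.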
3. Overview of the complete RLHF training process
To achieve a seamless training experience, we follow the approach of the InstructGPT paper and integrate an end-to-end training pipeline in DeepSpeed-Chat, as shown in Figure 1.
Our process consists of three main steps:
- Step 1: Supervised fine-tuning (SFT) - use curated human responses to fine-tune the pre-trained language model for various queries;
- Step 2: Reward model fine-tuning - train a separate reward model (RW, usually smaller than the SFT model) using a dataset containing multiple human-scored answers to the same query;
- Step 3: RLHF training - using the Proximal Policy Optimization (PPO) algorithm, further fine-tune the SFT model with reward feedback from the RW model.
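The core objectives behind steps 2 and 3 can be written in a few lines. The sketch below shows the standard formulations (a pairwise ranking loss for the reward model, and a per-token KL-shaped reward for PPO); it illustrates the general technique, not DeepSpeed-Chat's exact implementation, and the kl_coef value is an arbitrary example:

```python
import math

def reward_pair_loss(r_chosen, r_rejected):
    # Step 2: pairwise ranking loss. It pushes the reward of the human-preferred
    # answer above the rejected one: -log(sigmoid(r_chosen - r_rejected)).
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def ppo_token_reward(rm_score, logp_actor, logp_ref, kl_coef=0.1, is_last=False):
    # Step 3: per-token reward used by PPO. Each token pays a KL penalty for
    # drifting from the frozen SFT reference policy; the reward-model score is
    # added only on the last token of the generated answer.
    r = -kl_coef * (logp_actor - logp_ref)
    if is_last:
        r += rm_score
    return r

# A correctly ordered pair yields a much smaller loss than a reversed one.
good = reward_pair_loss(2.0, -1.0)   # preferred answer scored higher
bad = reward_pair_loss(-1.0, 2.0)    # preferred answer scored lower
```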
In step 3, we provide two additional functions to help improve model quality:
- Exponential Moving Average (EMA) - EMA-based checkpoints can be selected for final evaluation
- Hybrid training - mixes the pre-training objective (i.e. next word prediction) with the PPO objective to prevent performance loss on public benchmarks like SQuAD2.0
These two training features, EMA and hybrid training, are often neglected by other open-source frameworks, since omitting them does not prevent training from completing. However, according to InstructGPT, EMA checkpoints generally yield better response quality than conventional final checkpoints, and hybrid training helps the model retain the problem-solving ability of its pre-trained baseline. We therefore provide these features so users can fully reproduce the training experience described in InstructGPT and pursue higher model quality.
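As a minimal sketch of the two features (the EMA decay and the pre-training coefficient below are illustrative assumptions; 27.8 is the coefficient reported in the InstructGPT paper, not necessarily DeepSpeed-Chat's default):

```python
def ema_update(ema_params, params, decay=0.999):
    # EMA: keep a slowly moving average of the actor's parameters; this copy
    # can be checkpointed and used for the final evaluation.
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

def mixed_loss(ppo_loss, pretrain_loss, ptx_coef=27.8):
    # Hybrid training: blend the PPO objective with the next-word-prediction
    # pre-training objective to limit regression on public benchmarks.
    return ppo_loss + ptx_coef * pretrain_loss

ema = ema_update([0.0], [1.0])  # moves 0.1% of the way toward the new value
```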
In addition to maintaining high consistency with the InstructGPT paper, we also provide a convenient feature to support researchers and practitioners in training their own RLHF models with multiple data sources:
- Data abstraction and blending capabilities: DeepSpeed-Chat can train with datasets from multiple different sources for better model quality. It is equipped with (1) an abstract dataset layer to unify the format of different datasets; and (2) data splitting/blending functions, so that multiple datasets are properly blended and then split across the three training stages.
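A toy version of that split/blend step might look as follows; the function name, stage keys, and 20/40/40 split fractions are illustrative assumptions, not DeepSpeed-Chat's actual API or defaults:

```python
import random

def split_and_blend(datasets, fractions=(0.2, 0.4, 0.4), seed=1234):
    # Shuffle each source dataset, then carve disjoint slices for the three
    # training stages so no sample is reused across SFT, reward-model, and
    # RLHF training, while every stage sees a blend of all sources.
    rng = random.Random(seed)
    stages = {"sft": [], "reward": [], "rlhf": []}
    for data in datasets:
        data = data[:]          # copy so the caller's list is untouched
        rng.shuffle(data)
        n = len(data)
        a = int(fractions[0] * n)
        b = a + int(fractions[1] * n)
        stages["sft"] += data[:a]
        stages["reward"] += data[a:b]
        stages["rlhf"] += data[b:]
    return stages

# Two toy "sources" of 10 samples each, blended across the three stages.
stages = split_and_blend([list(range(10)), list(range(100, 110))])
```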
The earlier sections show how a model trained with the complete DeepSpeed-Chat pipeline performs in multi-turn conversation.
4. DeepSpeed Hybrid Engine – a unified high-efficiency hybrid engine that powers and optimizes RLHF training
The first two steps of the DeepSpeed-Chat pipeline resemble regular fine-tuning of large models, and thanks to the flexible combination of ZeRO-based memory optimizations and parallelism strategies in DeepSpeed training, they achieve good scale and speed. The third step of the pipeline, however, is the most challenging part in terms of performance. Each iteration must efficiently handle two phases: a) an inference phase that generates answers to provide input for training, and b) a training phase that updates the weights of the actor and reward models, plus the interaction and scheduling between the two. This introduces two major difficulties: (1) memory cost, since multiple copies of the SFT and RW models must be kept throughout the third stage; and (2) the generation phase, which, if not properly accelerated, is slow and significantly slows down the entire third stage. In addition, the two important optional features we added in step 3, Exponential Moving Average (EMA) collection and hybrid training, incur extra memory and training costs.
To address these challenges, we combined the full system capabilities of DeepSpeed training and inference into a unified infrastructure we call the Hybrid Engine. It uses the original DeepSpeed engines for the high-speed training mode while effortlessly applying the DeepSpeed inference engine for the generation/evaluation mode, providing a significantly faster training system for the third stage of RLHF training. As Figure 2 shows, the transition between the DeepSpeed training and inference engines is seamless: by enabling the typical eval and train modes on the actor model, DeepSpeed selects different optimizations when running the inference and training pipelines, running the model faster and improving overall system throughput.
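The mode-switching idea can be sketched with a self-contained toy class; everything here is illustrative (the real DeepSpeed-HE swaps kernel implementations and memory layouts, not a string flag):

```python
class ToyHybridEngine:
    """Toy sketch: one model object, served by two sets of optimizations."""
    def __init__(self):
        self.mode = "train"

    def eval(self):
        # Switch to generation mode: conceptually, fused inference kernels,
        # tensor-parallel layout, and KV caching.
        self.mode = "eval"

    def train(self):
        # Switch back to training mode: conceptually, ZeRO-sharded parameters,
        # gradients, and optimizer states.
        self.mode = "train"

    def step(self, prompts):
        self.eval()                         # experience-generation phase
        answers = [p + "!" for p in prompts]
        self.train()                        # update phase
        return answers

he = ToyHybridEngine()
out = he.step(["a", "b"])   # one RLHF iteration: generate, then train
```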
During inference execution in the experience-generation phase of RLHF training, the DeepSpeed Hybrid Engine uses a lightweight memory management system to handle the KV cache and intermediate results, together with highly optimized inference CUDA kernels and tensor parallelism, significantly improving throughput (tokens per second) compared with existing solutions.
During training execution, the Hybrid Engine applies a variety of memory optimization techniques, such as DeepSpeed's ZeRO family of technologies and the popular LoRA method. These techniques are mutually compatible in the Hybrid Engine and can be combined to deliver the highest training efficiency.
DeepSpeed-HE can seamlessly change the model partitioning between training and inference, supporting tensor-parallelism-based inference and ZeRO-based sharding for training. It also reconfigures the memory system to maximize available memory in each mode, and further improves performance by avoiding memory-allocation bottlenecks and supporting large batch sizes. The Hybrid Engine integrates a broad range of system technologies from DeepSpeed training and inference, pushing past the limits of existing RLHF training and delivering unparalleled scale and system efficiency for RLHF workloads.
5. DeepSpeed RLHF: Unparalleled Scale and Efficiency via Hybrid Engine
As mentioned earlier, DeepSpeed-HE is a powerful combined system for inference and training, designed to enable DeepSpeed-RLHF to achieve superior scale and efficiency on a variety of hardware, making RLHF training fast, economical, and easily accessible to the AI community.
In terms of efficiency and cost, as shown in Table 1, DeepSpeed-HE can train an OPT-13B model in only 9 hours and an OPT-30B model in 18 hours on the Azure cloud, costing less than $300 and $600, respectively. In terms of speed and scalability, as shown in Table 2, even a 13B model can be trained in 1.25 hours, while the huge 175B model can be trained in under a day on a 64-GPU cluster. In terms of accessibility and the democratization of RLHF, DeepSpeed-HE can train models with more than 13 billion parameters on a single GPU, as shown in Table 3.
Throughput and model size scalability comparison with existing RLHF systems
Compared with other RLHF systems, such as Colossal-AI and native-PyTorch-powered HuggingFace, DeepSpeed-RLHF excels in both system performance and model scalability:
- In terms of throughput, DeepSpeed achieves more than a 10x improvement in RLHF training on a single GPU (Figure 3). In multi-GPU setups, it is 6-19x faster than Colossal-AI and 1.4-10.5x faster than HuggingFace DDP (Figure 4).
- In terms of model scalability, Colossal-AI can run models of up to 1.3B parameters on a single GPU and up to 6.7B on a single A100-40G node, while DeepSpeed-HE can run 6.5B and 50B models on the same hardware, an improvement of up to 7.5x.
Thus, with over an order of magnitude higher throughput, DeepSpeed-HE can train larger actor models under the same time budget than existing RLHF systems such as Colossal-AI or HuggingFace DDP, or train similarly sized models at one-tenth of the cost.
This efficiency stems from DeepSpeed-HE's acceleration of the generation phase of RLHF using DeepSpeed's inference optimizations. Figure 5 shows the time breakdown of an RLHF training iteration for a 1.3B-parameter model: the majority of the time is spent in the generation phase. By leveraging DeepSpeed's high-performance inference kernels, DeepSpeed-HE achieves up to a 9x throughput improvement over HuggingFace and 15x over Colossal-AI in this phase, resulting in unparalleled end-to-end efficiency.
Effective Throughput and Scalability Analysis
(I) Effective throughput analysis. In step 3 of RLHF training, the effective throughput of DeepSpeed-HE depends on the throughput it achieves in both the generation and RL training phases. In our RLHF pipeline (see the benchmark settings for details), the generation phase accounts for about 20% of total computation, while the RL training phase accounts for the remaining 80%. Despite its smaller share, the former can take the majority of the end-to-end time, since the actor model must run once for each generated token, making it memory-bandwidth-bound and difficult to drive at high throughput. In contrast, the RL training phase is compute-intensive: it needs only a few forward and backward passes, with each sample containing the full 512 tokens from the prompt and generation, so it can achieve good throughput.
To maximize the effective throughput, DeepSpeed-HE optimizes both phases. First, it uses the largest batch size possible to increase efficiency in both phases. Second, during generation, it uses high-performance CUDA kernels to maximize GPU memory bandwidth utilization when the model fits on a single GPU, and tensor parallelism (TP) otherwise. Using TP instead of ZeRO during generation reduces inter-GPU communication and keeps GPU memory bandwidth utilization high.
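The resulting end-to-end rate can be computed directly from the per-phase throughputs and the 20%/80% FLOP split quoted above; the per-phase TFLOP numbers in the example are made up for illustration:

```python
def effective_throughput(gen_tflops, train_tflops, gen_flops_frac=0.2):
    # End-to-end TFLOPs/GPU is a weighted harmonic blend of the two phases:
    # total_flops / (gen_flops / gen_rate + train_flops / train_rate).
    train_frac = 1.0 - gen_flops_frac
    total_time = gen_flops_frac / gen_tflops + train_frac / train_tflops
    return 1.0 / total_time

# Example: a memory-bandwidth-bound generation phase (10 TFLOPs/GPU) drags the
# overall rate far below the compute-bound training phase (100 TFLOPs/GPU),
# even though generation is only 20% of the FLOPs.
eff = effective_throughput(10.0, 100.0)
```

This arithmetic is why accelerating generation, the smaller phase by FLOPs, dominates the end-to-end gain: here generation consumes 0.02 of the 0.028 time units, i.e. over 70% of the wall clock.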
Figure 6 shows the best effective throughput (in TFLOPs/GPU) that DeepSpeed-HE achieves across model sizes from 1.3B to 175B, along with the throughput of the generation and training phases separately. DeepSpeed-HE is most efficient for models in the 6.7B-66B range. Beyond that, up to 175B, throughput drops because limited memory restricts the batch sizes that can be supported, yet it still achieves 1.2x the efficiency of the small 1.3B model. The per-GPU throughput of these giant models may improve further when we scale them to more GPUs with more memory.
Furthermore, we would like to point out that, as shown in Figure 2, the effective performance of our system is up to 19x higher than that of existing systems, which suggests that they operate at less than 5% of peak performance. This illustrates both how challenging it is to optimize RLHF workloads and the effectiveness of our system in the face of that challenge.
(II) Scalability analysis. The best effective throughput for different model sizes is achieved at different numbers of GPUs, partly because larger models require more memory to run. Below, we discuss the scalability properties of DeepSpeed-HE.
Figure 7 shows that DeepSpeed-RLHF achieves good overall scaling on clusters of up to 64 GPUs. On closer inspection, however, DeepSpeed-RLHF training achieves super-linear scaling at small scale, followed by near-linear or sub-linear scaling at larger scale. This is due to the interplay between memory availability and the maximum global batch size.
The core of DeepSpeed-HE's training is based on ZeRO, which partitions the model states across the GPUs. As the number of GPUs increases, the memory consumption per GPU decreases, freeing room for larger per-GPU batch sizes and thus enabling super-linear scaling. At large scale, however, the fixed maximum global batch size caps the per-GPU batch size, leading to near-linear or sub-linear scaling even though available memory keeps growing. As a result, for a given maximum global batch size (here, 1,024 sequences of length 512 each), DeepSpeed-HE achieves its best throughput and cost-effectiveness at the boundary between super-linear and sub-linear scalability. The exact balance point depends primarily on the largest batch size runnable per GPU, which is in turn a function of available memory and the global batch size.
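This interplay can be illustrated with a toy memory model (all sizes below are made-up illustrative numbers, not measurements): partitioning model states across more GPUs frees memory for larger micro-batches, until the fixed global batch size becomes the cap:

```python
def per_gpu_batch(model_state_gb, gpu_mem_gb, per_sample_gb,
                  max_global_batch, n_gpus):
    # ZeRO partitions model states across GPUs, so per-GPU state memory
    # shrinks as 1/n_gpus, leaving more room for activation memory and thus
    # a larger per-GPU micro-batch, until the fixed global batch size caps it.
    free = gpu_mem_gb - model_state_gb / n_gpus
    mem_limited = int(free / per_sample_gb)
    global_limited = max_global_batch // n_gpus
    return max(0, min(mem_limited, global_limited))

# Memory-bound at small scale (super-linear region) vs. capped by the global
# batch size at large scale (sub-linear region).
small = per_gpu_batch(64, 40, 2, 1024, 2)    # memory-limited
large = per_gpu_batch(64, 40, 2, 1024, 64)   # global-batch-limited
```

Going from 2 to 64 GPUs (32x) raises the total batch from 8 to 1,024 samples (128x) in this toy model, which is exactly the super-linear-then-capped behavior described above.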
6. Release: Try DeepSpeed Chat now!
We are very happy to announce that DeepSpeed-Chat is now open-sourced and available to the AI community.
- If you find our results useful or like our open-source work, please give a ⭐ to our DeepSpeed and DeepSpeedExamples repositories.
- Visit our DeepSpeed-Chat GitHub page to get started: GitHub Landing Page
- We will continue to improve DeepSpeed-Chat based on your feedback and support. Our roadmap shows which features are currently supported and which are planned to be supported in the future.
DeepSpeed-Chat is part of the larger DeepSpeed ecosystem, which includes numerous deep learning systems and modeling techniques. For more information,
- Visit our website for detailed blog posts, tutorials, and helpful documentation.
- You can also follow our English Twitter and Japanese Twitter to learn about the latest developments of DeepSpeed. We will also authorize the KAIYUANSHE WeChat official account to publish our Chinese blog as soon as possible.
DeepSpeed welcomes your contributions! We encourage you to report issues, contribute PRs, and join discussions on the DeepSpeed GitHub page; see our Contributing Guidelines for details. We are open to collaborations with universities, research laboratories, and companies on deep learning research and on applying DeepSpeed to AI models and applications that empower the real world. For such requests (and other requests not suitable for GitHub), please email email@example.com directly.