The era of having your own ChatGPT! Microsoft releases DeepSpeed Chat, with one-click training of models with hundreds of billions of parameters


Hayo News
April 13th, 2023

Want a ChatGPT of your own? Microsoft has released DeepSpeed Chat, a system framework that makes training ChatGPT-like models far simpler and more efficient. High-quality ChatGPT-style models of every size are now within easy reach.

Unlock hundred-billion-parameter ChatGPT-like models with one click

As is well known, because OpenAI keeps ChatGPT closed, the open-source community has released models such as LLaMA, Alpaca, Vicuna, and Databricks Dolly so that more people can use ChatGPT-like models. However, without an end-to-end, scalable RLHF training system, it has remained very difficult to train ChatGPT-like models. DeepSpeed Chat fills exactly this gap.

Even better, DeepSpeed Chat keeps costs down. Previously, expensive multi-GPU setups were out of reach for many researchers, and even those with access to multi-GPU clusters could not train ChatGPT-like models with hundreds of billions of parameters using existing systems. Now, with the DeepSpeed-HE hybrid engine, training an OPT-66B model takes 2.1 days and costs about $1,620. On a multi-node, multi-GPU system, DeepSpeed-HE can train an OPT-13B model in 1.25 hours for about $320, and an OPT-175B model in less than a day for about $5,120.
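As a rough sanity check on those figures, assuming the 175B run takes close to a full 24 hours (an assumption, since the post only says "less than a day"):

$5,120 ÷ (64 GPUs × 24 hours) ≈ $3.3 per GPU-hour of A100-80G compute.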

Even Elvis, a former Meta AI researcher, excitedly shared the release, calling it a big deal and wondering how DeepSpeed Chat compares with ColossalChat.

Next, let's look at how DeepSpeed-Chat performs. After training with DeepSpeed-Chat, a 1.3-billion-parameter "ChatGPT" handled a question-and-answer session very well: it not only followed the context of the questions, but also gave answers close to the reference answers.

In multi-turn dialogue, this 1.3-billion-parameter "ChatGPT" also performed beyond expectations.

One line of code to generate your own ChatGPT

Of course, before trying it out, you need to set up the environment:

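As a rough sketch of the environment setup (these steps follow the DeepSpeed-Chat repository README at the time of release and may have changed since):

# Install DeepSpeed; 0.9.0 or later includes DeepSpeed-Chat support.
pip install "deepspeed>=0.9.0"

# Fetch the DeepSpeed-Chat example code and its Python dependencies.
git clone https://github.com/microsoft/DeepSpeedExamples.git
cd DeepSpeedExamples/applications/DeepSpeed-Chat/
pip install -r requirements.txt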

In the time it takes to drink a coffee: training the 1.3-billion-parameter version of ChatGPT

If you only have about one to two hours for a coffee or lunch break, you can also try training a small "toy" model with DeepSpeed-Chat. The team has prepared a training example for a 1.3B model that can be run on a consumer-grade GPU. Best of all, by the time you get back from your lunch break, it will be ready to go.

python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --num-gpus 1

This example runs on a consumer-grade NVIDIA A6000 GPU with 48GB of video memory.

One GPU node, 13 billion parameters in half a day

If you only have half a day and a single server node, you can use a pre-trained OPT-13B as the actor model and OPT-350M as the reward model to generate a 13-billion-parameter ChatGPT-like model:

python train.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --num-gpus 8

This example runs on a single DGX node with 8 NVIDIA A100-40G GPUs.

A cost-saving cloud option: training a 66-billion-parameter model

If you have access to a multi-node cluster or cloud resources and want to train a larger, higher-quality model, simply specify the desired model size (e.g., 66B) and number of GPUs (e.g., 64) in the following command:

python train.py --actor-model facebook/opt-66b --reward-model facebook/opt-350m --num-gpus 64

This example runs on 8 DGX nodes with 8 NVIDIA A100-80G GPUs each.

Specifically, the time and cost required by the DeepSpeed-RLHF system depend on the model size and hardware configuration; the OPT-13B, OPT-66B, and OPT-175B figures quoted above are representative examples.

What is DeepSpeed Chat?

DeepSpeed Chat is a general system framework that enables end-to-end RLHF training of ChatGPT-like models, helping users generate their own high-quality ChatGPT-like models. It offers three core capabilities:

Simplified training and enhanced inference for ChatGPT-like models

Developers can run all of the training steps with a single script, and then use the inference API for interactive conversational testing once training is complete.
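For instance, once training finishes, the DeepSpeed-Chat examples include a simple chat script for interactive testing. The path below is a placeholder for wherever your trained actor model was saved, and the exact flags may differ between versions:

python chat.py --path <path-to-your-trained-actor-model>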

DeepSpeed-RLHF module

The DeepSpeed-RLHF module reproduces the training pipeline from the InstructGPT paper and provides data abstraction and blending capabilities, so that developers can train with data drawn from multiple different sources.

DeepSpeed-RLHF system

The DeepSpeed-RLHF system integrates DeepSpeed's training engine and inference engine into a unified hybrid engine (DeepSpeed Hybrid Engine or DeepSpeed-HE) for RLHF training.

DeepSpeed-HE switches seamlessly between inference and training modes, leveraging the various optimizations from DeepSpeed-Inference to achieve very high efficiency in large-scale training. DeepSpeed-HE is more than 15 times faster than existing systems, making RLHF training fast and affordable.

In addition, DeepSpeed-HE supports training models with hundreds of billions of parameters and exhibits excellent scalability on multi-node, multi-GPU systems.

DeepSpeed-HE can not only easily create lightweight RLHF models, but also create large and powerful models for different usage scenarios.

RLHF training process

In order to provide a seamless training experience, the researchers followed the idea of InstructGPT and included a complete end-to-end training process in DeepSpeed-Chat.

DeepSpeed-Chat's RLHF training pipeline includes several optional features and consists of three main steps:

Step 1: Supervised fine-tuning (SFT), which uses curated human responses to fine-tune the pre-trained language model on a variety of queries.

Step 2: Reward model fine-tuning, which trains an independent reward model (RW), usually smaller than the SFT model, on a dataset containing multiple human-provided answers to the same query.

Step 3: RLHF training, in which the SFT model is further fine-tuned using reward feedback from the RW model via the Proximal Policy Optimization (PPO) algorithm.

For step 3, the researchers also provide two optional features to improve model quality: 1. Exponential Moving Average (EMA) collection, which allows an EMA-based checkpoint to be selected for the final evaluation. 2. Hybrid training, which mixes the pre-training objective (i.e., next-word prediction) with the PPO objective to prevent performance regression on public benchmarks such as SQuAD 2.0.
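For reference, the objective optimized in step 3 can be written in the InstructGPT style, where the reward from the RW model is regularized by a KL penalty against the SFT model and, when hybrid training is enabled, blended with the pre-training objective (beta and gamma are weighting coefficients):

\mathrm{objective}(\phi) = \mathbb{E}_{(x,y)\sim \pi_\phi^{\mathrm{RL}}}\Big[ r_\theta(x,y) - \beta \log \tfrac{\pi_\phi^{\mathrm{RL}}(y\mid x)}{\pi^{\mathrm{SFT}}(y\mid x)} \Big] + \gamma\, \mathbb{E}_{x\sim D_{\mathrm{pretrain}}}\big[ \log \pi_\phi^{\mathrm{RL}}(x) \big]

Setting gamma to zero recovers plain PPO-based RLHF without the hybrid pre-training term.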

The EMA and hybrid-training features are often omitted by other open-source frameworks, since leaving them out does not prevent training from running. However, according to InstructGPT, EMA checkpoints tend to give better response quality than the final trained weights, and hybrid training helps the model retain the benchmark-solving ability it had after pre-training. The researchers therefore provide both features so that users can fully reproduce the training experience described in InstructGPT. Beyond staying closely aligned with the InstructGPT paper, they also provide functions that let developers train their own RLHF models with a variety of data resources: DeepSpeed-Chat includes (1) a dataset abstraction layer that unifies the format of different datasets, and (2) data splitting/blending functions that properly mix multiple datasets and then split them across the three training stages.
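As a minimal sketch of what such blending and splitting amounts to, the snippet below merges two hypothetical prompt datasets that share a unified record format and splits the result across the three training stages; the function name, record format, and split ratios are illustrative assumptions, not DeepSpeed-Chat's actual API:

import random

def blend_and_split(datasets, split_ratios=(0.2, 0.4, 0.4), seed=0):
    """Merge several prompt datasets, shuffle them, and split the result
    across the three RLHF stages (SFT, reward model, PPO)."""
    rng = random.Random(seed)
    merged = [record for ds in datasets for record in ds]
    rng.shuffle(merged)
    n_sft = int(split_ratios[0] * len(merged))
    n_rm = int(split_ratios[1] * len(merged))
    return {
        "sft": merged[:n_sft],                 # step 1: supervised fine-tuning
        "reward": merged[n_sft:n_sft + n_rm],  # step 2: reward model training
        "ppo": merged[n_sft + n_rm:],          # step 3: RLHF / PPO prompts
    }

# Two hypothetical sources already converted to a shared {"prompt": ...} format.
source_a = [{"prompt": f"a-{i}"} for i in range(100)]
source_b = [{"prompt": f"b-{i}"} for i in range(50)]
splits = blend_and_split([source_a, source_b])
print({stage: len(records) for stage, records in splits.items()})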

DeepSpeed Hybrid Engine

In terms of scale and speed, steps 1 and 2 of the RLHF pipeline resemble regular fine-tuning and are powered by ZeRO-based optimizations and flexible parallelism strategies in DeepSpeed training. Step 3 is the most complex part from a performance standpoint: each iteration must efficiently handle two phases, an inference phase for token/experience generation that produces the inputs for training, and a training phase that updates the weights of the actor and reward models, along with the interaction and scheduling between the two. The main challenges are the memory cost and slow speed of the answer-generation stage, plus the additional memory and training cost of the EMA collection and hybrid training features.

To address these challenges, the researchers combined the full system capabilities of DeepSpeed training and inference into a unified infrastructure, the hybrid engine. By switching the actor model between the typical eval and train modes, DeepSpeed applies different optimizations to the inference and training pipelines, running the model faster and improving overall system throughput.

During inference, the hybrid engine uses a lightweight memory management system to handle the KV cache and intermediate results; during training, it enables memory-optimization techniques such as DeepSpeed's ZeRO family and Low-Rank Adaptation (LoRA). These system optimizations are designed to be compatible with one another and can be combined to deliver the highest training efficiency under the unified hybrid engine. The hybrid engine can seamlessly change the model partitioning between training and inference, supporting tensor-parallel inference and ZeRO-based sharded training, and it can also reconfigure the memory system to maximize available memory in each mode. This avoids memory-allocation bottlenecks, supports large batch sizes, and greatly improves performance. The Hybrid Engine pushes the boundaries of modern RLHF training, delivering unparalleled scale and system efficiency for RLHF workloads.
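Conceptually, the alternation that the hybrid engine manages looks like the toy PyTorch loop below. This is only an illustration of the eval/train switching within one RLHF iteration, not the DeepSpeed Hybrid Engine itself, which additionally repartitions the model and manages the KV cache across modes:

import torch
import torch.nn as nn

# Toy stand-in for the actor language model and its optimizer.
actor = nn.Linear(16, 16)
optimizer = torch.optim.AdamW(actor.parameters(), lr=1e-5)

for iteration in range(3):
    # Generation (inference) phase: eval mode, no gradients.
    # In RLHF this is where answers/experiences are generated.
    actor.eval()
    with torch.no_grad():
        prompts = torch.randn(4, 16)
        experience = actor(prompts)

    # Training phase: train mode, gradients on; weights are updated
    # from the collected experience (placeholder loss below).
    actor.train()
    advantage = experience.mean()                 # placeholder advantage estimate
    loss = -(advantage * actor(prompts).mean())   # stands in for the PPO policy loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()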

Performance evaluation

Compared with existing systems such as Colossal-AI or HuggingFace DDP, DeepSpeed-Chat achieves more than an order of magnitude higher throughput, so it can train larger actor models under the same latency budget, or train similarly sized models at lower cost. For example, on a single GPU, DeepSpeed improves RLHF training throughput by more than 10x. And while both CAI-Coati and HF-DDP can run at most a 1.3B model, DeepSpeed can run a 6.5B model on the same hardware, roughly 5 times larger.

On multiple GPUs within a single node, DeepSpeed-Chat's system throughput is 6-19 times higher than CAI-Coati's and 1.4-10.5 times higher than HF-DDP's.

According to the team, one of the key reasons why DeepSpeed-Chat can achieve such excellent results is the acceleration provided by the hybrid engine during the generation phase.

Reference link: https://github.com/microsoft/DeepSpeed
