GPT-4 insider mega-leak! 1.8 trillion parameters, 13 trillion training tokens, $63 million in training costs
GPT-4's details have been leaked to the industry yet again: parameters, architecture, training dataset, token count, training and inference costs, all in one go. Given the authors' track record, this revelation carries a certain reference value.
Just now, OpenAI's GPT-4 was once again "open sourced" by industry insiders.
The leak includes highly specific details such as GPT-4's architecture, training and inference infrastructure, parameter count, training dataset, token count, costs, and its Mixture of Experts (MoE) design.
In particular, it covers how OpenAI weighed the trade-offs behind different choices, and how it gets past the biggest bottleneck in large-model inference.
Who is behind such a weighty revelation?
The authors of the article are two SemiAnalysis contributors named Dylan Patel and Gerald Wong.
It is worth mentioning that Dylan Patel was also one of the authors behind the leak of Google's internal memo ("We have no moat, and neither does OpenAI"), which caused an uproar in the industry.
DeepMind CEO Demis Hassabis recently confirmed the authenticity of that leaked memo from a Google engineer in an interview with The Verge.
Evidently, Dylan Patel does have some inside channels, which lends today's revelations a bit more credibility.
Mobvoi CEO Li Zhifei also weighed in on the leak.
Many companies can make GPT-4
In the view of the article's authors, the reason OpenAI does not open up is not to protect humanity from being destroyed by AI, but because what it has built is reproducible.
He even predicts that in the future, all of the major Chinese and American internet giants and AI startups will be able to build a model that matches or even surpasses GPT-4.
But he also concedes that GPT-4 is a great achievement by OpenAI, condensing ingenious design, a complex architecture, and a host of clever engineering trade-offs.
OpenAI's most durable moat is that it has feedback from real users, the industry's top engineering talent, and the continued lead conferred by its first-mover advantage.
First of all, the authors believe that GPT-4 contains a total of about 1.8 trillion parameters across 120 layers, whereas GPT-3 has only about 175 billion parameters.
In other words, the scale of GPT-4 is more than 10 times that of GPT-3.
Earlier internet rumors put GPT-4 at 1 trillion parameters, which now looks like an underestimate.
To keep costs reasonable, OpenAI built GPT-4 with a mixture-of-experts (MoE) architecture.
Specifically, GPT-4 has 16 experts, each an MLP with roughly 111 billion parameters, and two of these experts are routed to on each forward pass.
Although the literature discusses many advanced algorithms for choosing which experts each token is routed to, the algorithm OpenAI reportedly uses for GPT-4 is actually quite simple.
In addition, about 55 billion parameters are shared across the attention mechanism.
Each forward-pass inference (generating one token) uses only about 280 billion parameters and 560 TFLOPs.
This stands in stark contrast to a purely dense model, which would require about 1.8 trillion parameters and roughly 3,700 TFLOPs per forward pass.
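As a sanity check on those figures, here is a small back-of-the-envelope calculation in Python; the expert count, per-expert size, and shared attention parameters are the leaker's estimates, not confirmed specifications.

    # Back-of-the-envelope check of the leaked GPT-4 MoE figures (estimates, not confirmed specs).
    N_EXPERTS = 16                # claimed number of MLP experts
    PARAMS_PER_EXPERT = 111e9     # ~111B parameters per expert
    ACTIVE_EXPERTS = 2            # experts routed to per forward pass
    SHARED_ATTN_PARAMS = 55e9     # ~55B shared attention parameters

    total_params = N_EXPERTS * PARAMS_PER_EXPERT + SHARED_ATTN_PARAMS
    active_params = ACTIVE_EXPERTS * PARAMS_PER_EXPERT + SHARED_ATTN_PARAMS

    print(f"total parameters:          {total_params / 1e12:.2f}T")          # ~1.83T
    print(f"active per forward pass:   {active_params / 1e9:.0f}B")          # ~277B
    print(f"dense / MoE compute ratio: {total_params / active_params:.1f}x") # ~6.6x, matching 3700/560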
The composition of the data set
OpenAI trained GPT-4 with 13 trillion tokens.
The 13 trillion tokens are not all unique: because high-quality tokens are scarce, the dataset spans multiple epochs over the same data.
It also includes millions of rows of instruction fine-tuning data from Scale AI and from internal sources.
However, the author of the report said that they did not find much information on these RLHF data.
The context length in the pre-training phase was 8K (sequence length); the 32K version was fine-tuned from the pre-trained 8K version.
The batch size was ramped up over several days on the cluster, ultimately reaching 60 million tokens.
Of course, since not every expert sees every token, this works out to "only" 7.5 million tokens per expert.
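The 7.5-million figure follows directly from the routing arithmetic; a minimal check, assuming perfectly balanced top-2 routing:

    # Tokens seen by each expert in one 60M-token batch, assuming balanced top-2 routing.
    GLOBAL_BATCH_TOKENS = 60e6
    N_EXPERTS = 16
    EXPERTS_PER_TOKEN = 2

    tokens_per_expert = GLOBAL_BATCH_TOKENS * EXPERTS_PER_TOKEN / N_EXPERTS
    print(f"{tokens_per_expert / 1e6:.1f}M tokens per expert")  # 7.5M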
The parallelization strategy across all of those A100 GPUs matters a great deal.
OpenAI uses 8-way tensor parallelism, because that is the most NVLink supports.
Beyond that, the authors hear that OpenAI uses 15-way pipeline parallelism.
In theory, 15 pipeline stages is quite a lot once data communication and compute time are considered.
But given the limits of memory capacity, that many stages makes sense.
With pure pipeline plus tensor parallelism, the FP16 parameters alone come to about 30 GB per GPU.
Once the KV cache and overhead are added, the architecture makes sense in theory, provided most of the GPUs OpenAI uses are 40 GB A100s.
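The 30 GB figure can be reproduced from the claimed parallelism layout; the sketch below assumes the full 1.8T parameters are evenly sharded across one 8 x 15 replica, which is a simplification.

    # FP16 weight memory per GPU under 8-way tensor x 15-way pipeline parallelism (simplified).
    TOTAL_PARAMS = 1.8e12
    BYTES_PER_PARAM_FP16 = 2
    TENSOR_PARALLEL = 8
    PIPELINE_PARALLEL = 15

    gpus_per_replica = TENSOR_PARALLEL * PIPELINE_PARALLEL                              # 120 GPUs
    weights_per_gpu_gb = TOTAL_PARAMS * BYTES_PER_PARAM_FP16 / gpus_per_replica / 1e9
    print(f"~{weights_per_gpu_gb:.0f} GB of FP16 weights per GPU")                      # ~30 GB
    # On a 40 GB A100 that leaves only ~10 GB for KV cache, activations and overhead.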
OpenAI may be using ZeRO Stage 1, and possibly block-level FSDP or hybrid sharded data parallelism.
Why didn't they use full-model FSDP? Probably because of the higher communication cost.
Although OpenAI has a high-speed network between most nodes, it does not cover all nodes.
Among them, at least some clusters will have much lower connection bandwidth than others.
However, the authors say they do not quite understand how OpenAI avoids generating huge pipeline "bubbles" in every batch at such a high degree of pipeline parallelism; most likely OpenAI simply absorbed these costs.
OpenAI trained GPT-4 with about 2.15e25 FLOPs, on roughly 25,000 A100s, over 90 to 100 days, at a utilization of 32% to 36%.
This fairly low utilization is partly due to the sheer number of failures, which forced training to restart from earlier checkpoints, including the bubble cost mentioned above.
The training cost wasted in such cases is extremely high.
Another reason is that an all-reduce across that many GPUs is very expensive.
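The 32-36% utilization claim is easy to cross-check against the A100's peak throughput; the sketch below assumes the standard 312 TFLOP/s dense BF16 peak per A100.

    # Cross-check of the claimed 32-36% utilization (MFU). Assumes 312 TFLOP/s BF16 peak per A100.
    TRAIN_FLOPS = 2.15e25
    N_GPUS = 25_000
    A100_PEAK_FLOPS = 312e12
    SECONDS_PER_DAY = 86_400

    for days in (90, 100):
        peak_flops = N_GPUS * A100_PEAK_FLOPS * days * SECONDS_PER_DAY
        print(f"{days} days -> MFU ~ {TRAIN_FLOPS / peak_flops:.0%}")  # ~35% and ~32%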
This diagram assumes that inefficiencies come from the inability to fuse every operation, from the memory bandwidth required by the attention mechanism, and from hardware overhead equivalent to parameter reads. In reality, even with an optimized library such as Nvidia's FasterTransformer, the total overhead can be even larger.
The authors suspect that if the cluster is really a set of smaller clusters with weaker networking between them, then the non-blocking connection speed within each part of the cluster is 800G/1.6T, while the connections between those parts are only 200G/400G.
If OpenAI's cloud-compute cost is roughly $1 per A100-hour, then under these conditions the training run alone cost about $63 million.
This does not include all the experiments, failed training and other costs, such as data collection, RLHF, human cost, etc.
If you take into account the factors just mentioned, the real cost is much higher.
Also, this assumes that someone else buys the chips, networking and data centers, bears the capex of building the systems, and leases them to OpenAI.
Today, however, at $2 per H100-hour, the same pre-training could be done on about 8,192 H100s in just 55 days, at a cost of about $21.5 million.
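Both dollar figures can be roughly reproduced from the quoted hourly rates; the GPU counts and durations are the leaker's claims, and the arithmetic below is only an approximation.

    # Rough reconstruction of the quoted training costs from GPU-hours.
    a100_hours = 25_000 * 100 * 24            # ~25k A100s for ~100 days
    print(f"A100 run: ~${a100_hours * 1.00 / 1e6:.0f}M at $1/A100-hour")   # ~$60M, near the ~$63M claim

    h100_hours = 8_192 * 55 * 24              # ~8,192 H100s for ~55 days
    print(f"H100 run: ~${h100_hours * 2.00 / 1e6:.1f}M at $2/H100-hour")   # ~$21.6M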
The figure above shows the parameter and token counts of several publicly known advanced models. The line is Google DeepMind's Chinchilla scaling observation (smoothed, with large error bars), and each point on the line shows the theoretical FLOPs required to train a model with that many parameters and tokens.
However, the author of the report said that by the end of this year, at least nine companies will have H100 clusters larger than the above-mentioned size.
While not all of these companies will devote their entire fleet to training a single model, any that do will have a model even larger than GPT-4.
For example, Meta will have more than 100,000 H100s by the end of this year, but a considerable part of them will be distributed in its own data center for inference.
But its largest single cluster will still exceed 25,000 H100s.
In short, by the end of this year, many companies will have enough computing resources to train GPT-4-sized models.
This table shows the theoretically optimal cost of training a model on Nvidia A100s, without accounting for the staffing required, ML Ops tooling, data collection and preprocessing, failure recovery, one-shot/few-shot learning examples, inference, and many other cost components.
Trade-offs in Mixture-of-Experts Models
MoE (mixture of experts) is a great way to reduce the number of parameters used during inference while increasing the total parameter count.
That extra capacity is needed so that each training token can encode more information, because obtaining enough high-quality tokens is very difficult.
Had OpenAI truly been chasing the best possible performance, they would have needed to train on twice as many tokens.
That being said, OpenAI made quite a few trade-offs.
For example, MoE is very hard to handle at inference time, because not every part of the model is used for every token that is generated.
This means that some parts may be dormant while other parts are working.
This situation can significantly reduce utilization when servicing users.
Researchers have shown that using 64 to 128 experts yields better loss than using 16 experts, but that is pure research.
There are many reasons to use relatively few experts; one reason OpenAI chose 16 is that models with more experts have a harder time generalizing across many tasks.
It is also harder to reach convergence with more experts.
For such an enormous training run, OpenAI chose to be conservative about the number of experts.
Moreover, using fewer experts also helps their inference infrastructure; there are all kinds of difficult trade-offs in moving to a mixture-of-experts inference architecture.
The authors start with the basic trade-offs of LLM inference, then move on to the problems OpenAI faces and the choices it has made.
Before getting into the inference trade-offs, a side note: after talking with all of the LLM companies, the authors found that Nvidia's FasterTransformer inference library is quite bad, and TensorRT even more so.
This means that if Nvidia does not fix it, people will need to create their own solutions from scratch.
There are three main trade-offs in large-language-model inference, along the batch-size (number of users served concurrently) and chip-count dimensions:
The model must respond within reasonable latency. Nobody wants to wait several seconds in a chat application before output starts arriving. Prefill (input tokens) and decode (output tokens) take different amounts of time to process.
The model must output a certain number of tokens per second; human users need roughly 30 tokens per second. Both lower and higher throughput are acceptable for various other use cases.
The hardware running the model must reach high utilization, or the cost becomes prohibitive. Higher latency and lower throughput can be used to batch more user requests together for higher utilization, but that makes things harder.
The key to LLM inference is balancing memory bandwidth against compute.
LLM theoretical bandwidth requirements: assume the largest model that can run on an iPhone 14 is about 1 billion FP16 parameters, or about 4 billion int4 parameters; this is the basic limit for smartphone-based LLMs, and anything larger will not fit.
Simply put, each parameter must be read and there are 2 FLOPs associated with it.
As a result, the compute-to-bandwidth ratio of most chips (an H100 SXM has only 3 TB/s of memory bandwidth but 2,000 TFLOP/s of FP8 compute) is completely out of balance for batch-size-1 inference.
If there is only one user (batch size 1), the memory bandwidth required to read each parameter each time a token is generated dominates the inference time, while the computation time is almost negligible.
To serve large language models efficiently for many users, the batch size must exceed 1, so that multiple users share the cost of reading the parameters. For example, at a batch size of 256 or 512, you get 512 or 1,024 FLOPs of work for every byte of memory read.
This ratio is closer to the H100's balance between memory bandwidth and FLOPS. This helps achieve higher utilization, but at the cost of higher latency.
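The numbers above imply a simple rule: with roughly 2 FLOPs per parameter per token and, as the 512/1024 figures suggest, one byte per parameter, arithmetic intensity grows linearly with batch size. A minimal sketch under those assumptions:

    # Arithmetic intensity (FLOPs per byte of weights read) vs batch size.
    # Assumes ~2 FLOPs per parameter per token and 1-byte (int8/FP8) weights.
    def flops_per_byte(batch_size: int, bytes_per_param: int = 1) -> float:
        return 2 * batch_size / bytes_per_param

    for b in (1, 256, 512):
        print(f"batch {b:>3}: {flops_per_byte(b):>6.0f} FLOPs per byte")
    # batch 1 is hopelessly bandwidth-bound; batch 256-512 approaches the H100's
    # ~670 FLOPs-per-byte balance point (2,000 TFLOP/s over ~3 TB/s).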
Memory capacity is considered by many to be a major bottleneck for LLM inference, since large models require multiple chips for inference, and higher memory capacity means they can fit on fewer chips.
However, it is actually better to use more chips so that latency is lower, throughput is increased, and larger batch sizes can be used for higher utilization.
GPT-4 Inference Tradeoffs and Infrastructure
As mentioned above, GPT-4 inference is already very difficult, and being an MoE model introduces a whole new set of difficulties.
Each forward pass that generates tokens can be routed to a different set of experts. This poses a problem with the trade-off between throughput, latency, and utilization at larger batch sizes.
OpenAI's GPT-4 has 16 experts, and each forward pass is routed to 2 of them.
This means that if the batch size is 8, each expert's parameter read may only have a batch size of 1.
Worse, this could mean that one expert has a batch size of 8 while others have batch sizes of 4, 1, or 0.
For each generated token, the routing algorithm sends the forward pass in a different direction, so token-to-token latency and per-expert batch sizes vary significantly.
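A toy simulation makes this fragmentation concrete; routing here is random rather than learned, but the imbalance effect at small batch sizes is the same.

    # Toy illustration: a batch of 8 requests routed to 2 of 16 experts each.
    import random
    from collections import Counter

    random.seed(0)
    N_EXPERTS, TOP_K, BATCH = 16, 2, 8

    counts = Counter()
    for _ in range(BATCH):
        for expert in random.sample(range(N_EXPERTS), TOP_K):
            counts[expert] += 1

    print([counts.get(e, 0) for e in range(N_EXPERTS)])
    # Most experts end up with a batch of 0 or 1, a few with 2 or 3: exactly the
    # per-expert fragmentation described above.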
Inference infrastructure is one of the main reasons why OpenAI chose a smaller number of experts. If they choose more experts, memory bandwidth becomes the bottleneck for inference.
OpenAI's inference clusters usually reach batch sizes of 4k+, which means that even with the best possible load balancing across experts, each expert sees a batch of only around 500; it takes very heavy usage to get there.
According to the leak, OpenAI runs inference on clusters of 128 GPUs, with several such clusters spread across multiple data centers and geographies.
Inference uses 8-way tensor parallelism and 16-way pipeline parallelism. Each 8-GPU node holds only about 130B parameters, i.e. under 30 GB per GPU in FP16 and under 15 GB in FP8/int8.
This allows running inference on a 40GB A100 as long as the KV cache size for all batches is not too large.
Layers containing different experts on different nodes are not split because that would cause network traffic to be too irregular and recomputing the KV cache between each generated token would be too expensive.
For the future extension of MoE model and conditional routing, the biggest difficulty is how to deal with the routing of KV cache.
The model has 120 layers, so they could simply be spread across 15 different nodes, but because the first node also has to handle data loading and the embedding layer, it makes sense to put fewer layers on the head node of the inference cluster.
There are also rumors about "speculative decoding" (discussed below), which would likewise explain why the head node needs to hold fewer layers.
Compared with the 175-billion-parameter Davinci model, GPT-4 costs three times as much, even though its feed-forward parameters only grow by a factor of 1.6.
This is mainly because GPT-4 requires a larger cluster and achieves lower utilization.
The authors estimate that inferring GPT-4's 8k sequence length costs $0.0049 per 1,000 tokens on 128 A100s, and $0.0021 per 1,000 tokens on 128 H100s.
Note that this assumes fairly high utilization and keeps the batch size high.
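For context, here is the throughput that the $0.0049-per-1k-tokens figure would imply if the same ~$1 per A100-hour rate from the training estimate were applied; that hourly rate is an assumption, since the authors' own cost model is not spelled out.

    # Cluster throughput implied by $0.0049 per 1k tokens on 128 A100s, assuming ~$1/A100-hour.
    GPUS = 128
    DOLLARS_PER_GPU_HOUR = 1.00
    COST_PER_1K_TOKENS = 0.0049

    cluster_dollars_per_hour = GPUS * DOLLARS_PER_GPU_HOUR
    tokens_per_hour = cluster_dollars_per_hour / COST_PER_1K_TOKENS * 1000
    print(f"~{tokens_per_hour / 3600:,.0f} tokens/s across the cluster")   # ~7,300 tokens/s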
But it is clear that OpenAI's utilization is sometimes very low.
The authors hypothesize that OpenAI shuts down clusters during off-peak hours, reconfigures the nodes, resumes training of smaller test models, and experiments with various new techniques to reduce inference cost.
Had OpenAI not done so, their utilization would have been lower and their costs would have more than doubled.
In addition, OpenAI is also using Multi-Query Attention (MQA).
Paper address: https://arxiv.org/pdf/1911.02150.pdf
In short, only a single head is needed for the keys and values (queries keep multiple heads), which significantly shrinks the memory footprint of the KV cache.
Even so, GPT-4 at 32k context definitely cannot run on 40 GB A100s, and even the 8k version is capped in its maximum batch size.
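To see why MQA matters, and why 32k context still does not fit on a 40 GB A100, here is a rough KV-cache size calculation; the head count and head dimension below are hypothetical placeholders, since the leak does not give GPT-4's hidden dimensions.

    # KV-cache size under multi-head vs multi-query attention (hypothetical dims, FP16 cache).
    def kv_cache_gb(batch, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
        return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9  # 2x = keys + values

    N_LAYERS, N_HEADS, HEAD_DIM = 120, 96, 128   # 120 layers from the leak; heads/dim are assumed

    print(kv_cache_gb(8, 32_768, N_LAYERS, N_HEADS, HEAD_DIM))  # multi-head: ~1,500 GB for batch 8
    print(kv_cache_gb(8, 32_768, N_LAYERS, 1,       HEAD_DIM))  # MQA:        ~16 GB for batch 8
    # Even with MQA, a 32k cache plus ~30 GB of weights overruns a 40 GB A100 at modest batch sizes.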
OpenAI implements variable batch sizes and continuous batching.
This allows some flexibility in maximum latency while optimizing inference cost.
It has been revealed that OpenAI uses "speculative decoding" in GPT-4 inference, although this remains completely unconfirmed.
The token-to-token variation in latency, and the difference between simple retrieval tasks and more complex ones, seem to suggest it is possible, but there are too many variables to be sure.
Here, the authors explain the technique, with appropriate modifications and some added detail, based on the text of a DeepMind study, "Accelerating LLM Inference with Staged Speculative Decoding".
Using LLM is usually divided into two phases.
The first is prefill: the prompt text is fed through the model to generate the KV cache and the logits (the probability distribution over possible output tokens) for the first output token. This step is usually fast because the entire prompt can be processed in parallel.
The second stage is decoding: a token is chosen from the output logits and fed back into the model, which then produces the logits for the next token. This repeats until the desired number of tokens has been generated.
Because decoding must happen sequentially, the weights have to be streamed through the compute units each time just to generate a single token, so the arithmetic intensity (FLOPs of compute per byte of memory bandwidth) of this second stage is very low when it runs in small batches. Decoding is therefore usually the most expensive part of autoregressive generation.
This is why the input token is much cheaper than the output token in OpenAI's API calls.
The basic idea of "speculative decoding" is to use a smaller, faster draft model to decode several tokens in advance, then feed them to the large oracle model as a single batch.
If the draft model's predictions are correct, i.e. the larger model agrees with them, several tokens can be decoded with a single batch, saving a great deal of memory bandwidth and time.
However, if the larger model rejects a token predicted by the draft model, the remaining batch is discarded and the algorithm naturally reverts to standard token-by-token decoding.
"Speculative decoding" may also be accompanied by a rejection sampling scheme to sample from the original distribution. It's worth noting that this is only useful in small-batch settings where bandwidth is the bottleneck.
Speculative decoding, which trades computation for bandwidth, is an attractive performance engineering target for two key reasons:
First, it does not degrade model quality. Second, the gains it provides are usually orthogonal to other methods, because its performance comes from converting "sequential execution" into "parallel execution".
The approach currently in use predicts a single draft sequence per batch. However, this does not scale well to large batch sizes or to low draft-model alignment.
Intuitively, the probability that the two models agree on a long contiguous sequence of tokens drops exponentially, which means the gains from speculative decoding fade quickly as arithmetic intensity increases.
The whistleblower believes that if OpenAI uses "speculative decoding", they may only use it in sequences of about 4 tokens.
As an aside, the whole conspiracy theory that OpenAI has nerfed GPT-4 and degraded its quality may simply come down to their letting the oracle model accept lower-probability sequences from the speculative-decoding draft model.
It has also been speculated that Bard uses "speculative decoding", because Google waits for the entire sequence to be fully generated before sending it to the user, but in the authors' view that guess is completely wrong.
Visual multimodality is the least impressive part of GPT-4, at least compared to leading research.
Of course, no one has yet commercialized the results of multimodal LLM research.
According to the leak, GPT-4's vision capability is a visual encoder separate from the text encoder, with cross-attention between the two; the architecture is similar to Flamingo, and it adds more parameters on top of GPT-4's 1.8T.
The multimodal capability was fine-tuned with roughly another 2 trillion tokens after text-only pre-training.
Reportedly, OpenAI had originally hoped to train the vision model from scratch, but the approach was not mature enough, so they had no choice but to fine-tune starting from the text model.
As for the next-generation model, GPT-5, its training is expected to build the vision model from scratch and to be able to generate images, and perhaps even audio.
One of the main purposes of this visual capability is to let autonomous agents read web pages and transcribe the content of images and video.
It is worth mentioning that the data OpenAI used to train the multimodal model includes "joint data" (LaTeX/text), web-page screenshots, and YouTube videos (sampled frames, with Whisper run to obtain transcripts).
An interesting fact about LLM over-optimization is that vision models have a very different IO cost from text models: data loading for the vision model takes roughly 150 times the IO of the text model.
Each vision token is 600 bytes, versus 4 bytes per text token.
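The 150x figure follows directly from those two byte counts:

    # Per-token data-loading cost, vision vs text (figures as claimed above).
    BYTES_PER_VISION_TOKEN = 600
    BYTES_PER_TEXT_TOKEN = 4
    print(f"{BYTES_PER_VISION_TOKEN / BYTES_PER_TEXT_TOKEN:.0f}x more IO per vision token")  # 150x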
So a great deal of work has to go into image compression. This matters a lot to hardware vendors, who are optimizing hardware 2-3 years out around LLM use cases and ratios.
They may find themselves in a world where every model has powerful visual and audio capabilities, and find their architectures poorly suited to it.
In general, model architectures will certainly evolve beyond the dense text-based models, and even the MoE models, that we see today.