The Secret of GPT-4's Architecture Leaked
GPT-4's details are leaked.
It is over.
GPT-4 is more than 10x the size of GPT-3. We believe it has a total of ~1.8 trillion parameters across 120 layers.
Mixture Of Experts - Confirmed.
OpenAI was able to keep costs reasonable by utilizing a mixture of experts (MoE) model.They utilizes 16 experts within their model, each is about ~111B parameters for MLP. 2 of these experts are routed to per forward pass.
While the literature talks a lot about advanced routing algorithms for choosing which experts to route each token to, OpenAI’s is allegedly quite simple, for the current GPT-4 model.
There roughly ~55B shared parameters for attention.
Each forward pass inference (generation of 1 token) only utilizes ~280B parameters and ~560 TFLOPs. This contrasts with the ~1.8 trillion parameters and ~3,700 TFLOP that would be required per forward pass of a purely dense model.
GPT-4 is trained on ~13T tokens.
These are not unique tokens, they count the epochs as more tokens as well.
Epoch number: 2 epochs for text-based data and 4 for code-based data.
There is millions of rows of instruction fine-tuning data from ScaleAI & internally.
There was an 8k context length (seqlen) for the pre-training phase. The 32k seqlen version of GPT-4 is based on fine-tuning of the 8k after the pre-training.
The batch size was gradually ramped up over a number of days on the cluster, but by the end, OpenAI was using a batch size of 60 million! This, of course, is “only” a batch size of 7.5 million tokens per expert due to not every expert seeing all tokens.
For the real batch size: Divide this number by the seq len to get the real batch size. just stop with this misleading numbers already.
To parallelize across all their A100s GPUs They utilized 8-way tensor parallelism as that is the limit for NVLink.
Beyond that, they are using 15-way pipeline parallelism.
(likely used ZeRo Stage 1. It is possible they used block-level FSDP)
OpenAI’s training FLOPS for GPT-4 is ~2.15e25, on ~25,000 A100s for 90 to 100 days at about 32% to 36% MFU.
Part of this extremely low utilization is due to an absurd number of failures requiring checkpoints that needed to be restarted from.
If their cost in the cloud was about $1 per A100 hour, the training costs for this run alone would be about $63 million.
(Today, the pre-training could be done with ~8,192 H100 in ~55 days for $21.5 million at $2 per H100 hour.)
Mixture of Expert Tradeoffs
There are multiple MoE tradeoffs taken: For example, MoE is incredibly difficult to deal with on inference because not every part of the model is utilized on every token generation.
This means parts may sit dormant when other parts are being used. When serving users, this really hurts utilization rates.
Researchers have shown that using 64 to 128 experts achieves better loss than 16 experts, but that’s purely research.
There are multiple reasons to go with fewer experts. One reason for OpenAI choosing 16 experts is because more experts are difficult to generalize at many tasks. More experts can also be more difficult to achieve convergence with.
With such a large training run, OpenAI instead chose to be more conservative on the number of experts.
GPT-4 Inference Cost
GPT-4 costs 3x that of the 175B parameter Davinchi.This is largely due to the larger clusters required for GPT-4 and much lower utilization achieved.
AN estimate of it's costs is $0.0049 cents per 1k tokens for 128 A100s to inference GPT-4 8k seqlen and $0.0021 cents per 1k tokens for 128 H100’s to inference GPT-4 8k seqlen. It should be noted, we assume decent high utilization, and keeping batch sizes high.
OpenAI are using MQA just like everybody else.Because of that only 1 head is needed and memory capacity can be significantly reduced for the KV cache. Even then, the 32k seqlen GPT-4 definitely cannot run on 40GB A100s, and the 8k is capped on max bsz.
OpenAI implements both variable batch sizes and continuous batching. This is so as to allow some level of maximum latency as well optimizing the inference costs.
It is a separate vision encoder from the text encoder, with cross-attention. The architecture is similar to Flamingo. This adds more parameters on top of the 1.8T of GPT-4. It is fine-tuned with another ~2 trillion tokens, after the text only pre-training.
On the vision model, OpenAI wanted to train it from scratch, but it wasn’t mature enough, so they wanted to derisk it by starting with text.
One of the primary purposes of this vision capability is for autonomous agents able to read web pages and transcribe what’s in images and video.
Some of the data they train on is joint data (rendered LaTeX/text), screen shots of web page, youtube videos: sampling frames, and run Whisper around it to get transcript.
[Dont want to say "I told you so" but..]
OpenAI might be using speculative decoding on GPT-4's inference. (not sure 100%)
The idea is to use a smaller faster model to decode several tokens in advance, and then feeds them into a large oracle model as a single batch.
If the small model was right about its predictions – the larger model agrees and we can decode several tokens in a single batch.
But if the larger model rejects the tokens predicted by the draft model then the rest of the batch is discarded. And we continue with the larger model.
The conspiracy theory that the new GPT-4 quality had been deteriorated might be simply because they are letting the oracle model accept lower probability sequences from the speculative decoding model.
The inference runs on a cluster of 128 GPUs.
There are multiple of these clusters in multiple datacenters in different locations.
It is done in 8-way tensor parallelism and 16-way pipeline parallelism.
Each node of 8 GPUs has only ~130B parameters, or…
The model has 120, so it fits in 15 different nodes.[Possibly the there are less layers on the first node since it needs to also compute the embeddings]
According to these numbers: OpenAI should have trained on 2x the tokens if they were trying to go by chinchilla's optimal.
[let alone surpass it like we do]
This goes to show that they are struggling to get high quality data.
Why no FSDP?
A possible reason for this could be that some of the hardware infra they secured is of an older generation.
This is pretty common at local compute clusters as the organisation usually upgrade the infra in several "waves" to avoid a complete pause of operation.…
They trained on 13T tokens.CommonCrawl & RefinedWeb are both 5T.
Remove the duplication of tokens from multiple epochs and we get to a much reasonable number of "unaccounted for" tokens: The "secret" data.
Which by this point we already get rumors that parts of it came from twitter, reddit & youtube.
[Rumors that start to become lawsuits]
Some speculations are:
LibGen (4M+ books) Sci-Hub (80M+ papers) All of GitHub
My own opinion:
The missing dataset it a custom dataset of college textbooks collected by hand for as much courses as possible.
This is very easy to convert to txt file and than with self-instruct into instruction form.
This creates the "illusion" that GPT-4 "is smart" no matter who use it.
Computer scientist? sure! it can help you with your questions about P!=NPPhilosophy major? It can totally talk to you about epistemology.
Don't you see?It was trained on the textbooks. It is so obvious.
There are also papers that try to extract by force memorized parts of books from GPT-4 to understand what it trained on.
There are some books it knows so well that it had seen them for sure.
Moreover, If i remember correctly: It even know the unique ids of project Euler exes.