A 2080 Ti can run a 70B model, and a new framework speeds up LLM inference by 11 times.
Work that used to require an 80GB A100 costing 160,000 yuan can now be done on a 24GB RTX 4090 costing less than 20,000 yuan!
PowerInfer, an open source inference framework launched by Shanghai Jiao Tong University's IPADS Laboratory, speeds up large model inference by 11 times.
Even without quantization, it can run a 40B model at FP16 precision on a personal computer; with quantization added, a 2080 Ti can run a 70B model smoothly.
By combining the unique characteristics of large models with hybrid CPU/GPU computation, PowerInfer achieves fast inference on personal computers with limited VRAM.
Compared with llama.cpp, PowerInfer achieves up to 11 times acceleration, allowing the 40B model to output ten tokens per second on a personal computer.
ChatGPT, which we are most familiar with, sometimes crashes due to excessive traffic. On the other hand, there are also data security issues.
Open-source models can address both problems, but without a high-performance graphics card, they often run painfully slowly:
The emergence of PowerInfer just solves this pain point.
PowerInfer drew an enthusiastic response as soon as it was released, earning 500+ stars in less than 24 hours, including one from Georgi Gerganov, the author of llama.cpp.
The source code and paper for PowerInfer are now public. Let's take a look at how powerful its acceleration really is.
Inference speed up to 11 times
On consumer-grade hardware platforms equipped with an x86 CPU and an NVIDIA GPU, PowerInfer was benchmarked on a series of LLMs ranging from 7B to 175B parameters, and its end-to-end inference speed was compared with that of llama.cpp, the best-performing inference framework on the same platform.
For FP16 models, PowerInfer achieved an average speedup of 7.23x on a high-end PC (PC-High) equipped with a 13th-generation Intel Core i9 and a single RTX 4090, with a peak speedup of 11.69x on Falcon-40B.
Across all test cases, PowerInfer averaged 8.32 tokens/s, reaching up to 16.06 tokens/s on OPT-30B and 12.94 tokens/s on Falcon-40B.
With PowerInfer, today's consumer-grade platforms can run 30-40B level LLM smoothly and run 70B level LLM at an acceptable speed.
△ Average token generation speed of PowerInfer across models and output lengths. The vertical axis is the speedup ratio; the number above each bar is the tokens generated per second.
Model quantization is a very common technology for end-side LLM inference, and PowerInfer also supports the inference of INT4 quantized models.
PowerInfer tested the inference speed of a series of INT4 quantized models on high-end PCs (PC-High) and mid-to-low-end PCs (PC-Low) equipped with a single RTX 2080Ti.
On PC-High, PowerInfer can run 40-70B models at high speed, reaching a peak of 29.09 tokens/s and achieving an average speedup of 2.89x, up to 4.28x.
It can even run OPT-175B-scale models on consumer-grade hardware.
On mid-to-low-end PCs (PC-Low), PowerInfer can smoothly run 30-70B models, achieving an average speedup of 5.01x and up to 8.00x. This is mainly because, after INT4 quantization, most of the model's hot neurons can be placed in VRAM.
△ PowerInfer's inference speed on INT4-quantized models. The vertical axis is the speedup ratio; the number above each bar is the tokens generated per second.
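To make the INT4 idea concrete, here is a minimal NumPy sketch of symmetric group-wise 4-bit quantization. This is not PowerInfer's actual quantization code; the group size and scaling scheme are assumptions for the example.

```python
import numpy as np

def quantize_int4(w, group=32):
    """Symmetric group-wise INT4 quantization: one scale per group of weights.
    The int4 value range is [-8, 7]; we map the group's max magnitude to 7."""
    w = w.reshape(-1, group)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / np.maximum(scale, 1e-12)), -8, 7)
    return q.astype(np.int8), scale  # stored as int8 here for simplicity

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = quantize_int4(w)
err = np.abs(dequantize(q, s).reshape(-1) - w).max()
print(f"max abs quantization error: {err:.3f}")
```

The point of the group-wise scale is that the reconstruction error is bounded by half a quantization step per group, which is why 4-bit weights remain usable for inference.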
Finally, the end-to-end inference speed of PowerInfer running on PC-High was compared with that of the SOTA framework vLLM running on an A100, a top cloud accelerator. The test models were OPT-30B and Falcon-40B (ReLU) at FP16 precision.
With an input length of 64, the speed gap between PowerInfer and the A100 shrinks from 93-94% to 28-29%; in a pure generation scenario with an input length of 1, the gap narrows further, to as little as 18%.
This shows that, through sparse activation and hybrid CPU/GPU inference, PowerInfer largely closes the inference speed gap between consumer-grade graphics cards and top server-grade accelerators.
△ Performance comparison of PowerInfer on 4090 and vLLM on A100
So, how does PowerInfer achieve high-speed inference on consumer-grade hardware?
Take full advantage of model and hardware features
PowerInfer's secret to high-speed inference is to exploit the high locality of sparse activation in dense models and to match it to the respective computing strengths of the CPU and GPU.
What is "sparse activation"?
Recently, the Mixtral MoE model has taken the AI community by storm, bringing sparse models back into everyone's field of vision.
An interesting fact is that LLMs regarded as dense models, such as OPT and LLaMA (ReLU), also exhibit sparse activation.
What is sparse activation of dense models?
Similar to how an input token in an MoE model only needs to activate one or two expert modules in the FFN layer, in the dense FFN layers of the OPT model, for example, only a small fraction of neurons (experiments show about 10%) need to be activated to guarantee output correctness.
Although other neurons participate in the calculation, they do not significantly contribute to the output.
In other words, every neuron in the dense model is an expert!
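To make "sparse activation" concrete, here is a toy NumPy sketch of a ReLU FFN layer. The dimensions and the negative bias are illustrative assumptions: real models' sparsity emerges from training, not from a bias trick, but the effect on computation is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy FFN layer with a ReLU activation, as in OPT. A "neuron" is one
# column of W1; it is activated when ReLU keeps its pre-activation positive.
hidden_dim, ffn_dim = 64, 256
W1 = rng.standard_normal((hidden_dim, ffn_dim)) / np.sqrt(hidden_dim)
b1 = -1.5  # negative bias makes this toy sparse, mimicking real models

x = rng.standard_normal(hidden_dim)
pre_act = x @ W1 + b1
activated = pre_act > 0
print(f"activated neurons: {activated.sum()}/{ffn_dim} "
      f"({activated.mean():.0%}); the rest contribute nothing after ReLU")
```

Every neuron whose pre-activation falls below zero outputs exactly zero after ReLU, so its downstream computation can be skipped without changing the result.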
△ The picture on the left is from Alexander Clark's paper (arXiv: 2101.03961)
In an MoE model, a routing module placed before the expert FFN layers dispatches each input to one or two experts for computation. How, then, can the sparse activations in a dense model be routed? That is, how can we know before computation which "expert" neurons will contribute to the result?
The answer is to add a route prediction module to the dense model.
Before serving, PowerInfer first profiles the model offline: it runs inference on a general dataset to obtain the correspondence between each layer's inputs and its activated neurons, then trains a small predictive routing module for each layer of the dense model. At inference time, this module predicts which neurons each input will activate, and only the neurons selected by the router (the "experts") are computed.
In tests on multiple downstream tasks, PowerInfer's routing module introduced almost no additional accuracy loss.
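The payoff of an accurate predictor can be sketched as follows: if the router correctly predicts the activated neurons, computing only those columns reproduces the full FFN output exactly. This toy uses the true activation mask as a stand-in for PowerInfer's trained predictor; dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, ffn = 64, 256
W1 = rng.standard_normal((hidden, ffn)) / np.sqrt(hidden)
W2 = rng.standard_normal((ffn, hidden)) / np.sqrt(ffn)
b1 = -1.5  # negative bias -> sparse ReLU activations in this toy

def ffn_full(x):
    h = np.maximum(x @ W1 + b1, 0.0)
    return h @ W2

def ffn_sparse(x, predicted):
    # Compute only the columns of W1 / rows of W2 for predicted neurons.
    idx = np.flatnonzero(predicted)
    h = np.maximum(x @ W1[:, idx] + b1, 0.0)
    return h @ W2[idx, :]

x = rng.standard_normal(hidden)
mask = (x @ W1 + b1) > 0  # stand-in for a perfect predictor
assert np.allclose(ffn_full(x), ffn_sparse(x, mask))
```

Because unpredicted neurons would have output exactly zero after ReLU, skipping them loses nothing; a mispredicting router, by contrast, would drop nonzero contributions, which is why predictor accuracy matters.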
Inference locality brought about by sparse activation
Another interesting fact about sparse activation: although different input tokens activate different sets of neurons, if inference is run on enough data and the activation distributions are superimposed, PowerInfer finds that a small number of neurons have a much higher overall probability of being activated.
In other words, statistically, the activation of large-model neurons follows a power-law distribution (a statistical pattern in which a small number of events occur far more frequently than the many others).
As shown in figure (a) below, for one FFN layer in the OPT-30B and LLaMA (ReGLU)-70B models, 26% and 43% of the neurons respectively contribute 80% of the activations.
At the scale of the entire model, as shown in (b) below, 17% and 26% of neurons contribute 80% of the activation.
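The "small fraction of neurons covers 80% of activations" statistic can be reproduced on synthetic data. The sketch below assumes Zipf-distributed activation counts purely for illustration; the real fractions come from PowerInfer's offline profiling, not from this model.

```python
import numpy as np

n = 10_000
# Simulated per-neuron activation frequencies following a Zipf (power) law,
# a stand-in for the statistics PowerInfer collects offline.
counts = 1.0 / np.arange(1, n + 1)
counts /= counts.sum()

# Smallest set of hottest neurons whose activations cover 80% of the total.
order = np.argsort(counts)[::-1]
cum = np.cumsum(counts[order])
k = int(np.searchsorted(cum, 0.80)) + 1
print(f"{k / n:.1%} of neurons account for 80% of activations")
```

Under this toy distribution roughly a seventh of the neurons cover 80% of the activity, the same qualitative shape as the 17-26% figures reported above.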
Therefore, when only the computations that contribute to the final output are considered, LLM inference exhibits locality: accesses to the weights tend to concentrate in a small region rather than spread evenly across all neurons. In program terms, this is exactly memory locality: accesses to the address space cluster in certain regions instead of being uniformly distributed.
In a typical personal computer, the GPU has less memory but stronger compute, making it suitable for frequently accessed, computationally intensive tasks; the CPU has a larger memory capacity but relatively weaker compute, making it suitable for tasks that are accessed infrequently and are less compute-intensive.
Ideally, then, a small number of frequently accessed neurons should be stored in VRAM, while the many rarely accessed neurons are better kept in main memory and computed by the CPU.
This inspired PowerInfer to design a CPU/GPU hybrid inference system based on locality features.
CPU/GPU hybrid inference design
Based on the neurons' power-law distribution and the resulting locality, PowerInfer statically analyzes the hotness of each neuron in advance, loading a small number of hot neurons into GPU memory and the remaining cold neurons into CPU memory.
Loading the model at neuron granularity in this mixed fashion means that, within a single layer, some neurons live on the GPU and others on the CPU.
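A minimal greedy version of this placement is sketched below. PowerInfer actually computes placement from hardware bandwidth and capacity (as described later); the sizes, VRAM budget, and Pareto-distributed "heat" values here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons = 1000
heat = rng.pareto(1.0, n_neurons)    # stand-in for profiled activation frequency
bytes_per_neuron = 32 * 1024         # toy per-neuron weight size
vram_budget = 8 * 1024 * 1024        # toy 8 MiB of free VRAM

# Greedy placement: the hottest neurons fill VRAM first; the rest stay in RAM.
order = np.argsort(heat)[::-1]
n_gpu = vram_budget // bytes_per_neuron
gpu_neurons = order[:n_gpu]
cpu_neurons = order[n_gpu:]

covered = heat[gpu_neurons].sum() / heat.sum()
print(f"{n_gpu}/{n_neurons} neurons on GPU cover {covered:.0%} of activations")
```

Because of the power-law heat distribution, even a small VRAM budget captures a disproportionate share of the activation traffic, which is what makes the hot/cold split worthwhile.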
To this end, PowerInfer designed a fine-grained CPU/GPU hybrid inference engine.
For example, in the figure below, for the input of a certain layer, PowerInfer will first predict that the input will activate neurons 3, 4, and 5.
Then the CPU and GPU will respectively perform calculations on the neurons located in their memories based on the prediction information.
Specifically, in the figure below, the fourth neuron is computed on the CPU while the third and fifth are computed on the GPU; the results from both sides are then merged on the GPU.
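The merge step can be sketched as follows: because each FFN neuron's contribution is independent (ReLU applies per neuron), partial results computed over disjoint neuron sets on the "CPU" and "GPU" simply sum to the full result. Dimensions and the placement/prediction here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, ffn = 64, 256
W1 = rng.standard_normal((hidden, ffn)) / np.sqrt(hidden)
W2 = rng.standard_normal((ffn, hidden)) / np.sqrt(ffn)

# Static placement from offline profiling: hot neurons on GPU, cold on CPU.
gpu_resident = np.zeros(ffn, dtype=bool)
gpu_resident[: ffn // 4] = True          # pretend the first quarter is "hot"

def partial_ffn(x, idx):
    """FFN restricted to the given neuron indices (one device's share)."""
    h = np.maximum(x @ W1[:, idx], 0.0)
    return h @ W2[idx, :]

x = rng.standard_normal(hidden)
predicted = rng.random(ffn) < 0.1        # stand-in for the router's prediction

# Each "device" computes only its resident, predicted neurons; the partial
# results are then summed (in PowerInfer, merged on the GPU).
gpu_part = partial_ffn(x, np.flatnonzero(predicted & gpu_resident))
cpu_part = partial_ffn(x, np.flatnonzero(predicted & ~gpu_resident))
merged = gpu_part + cpu_part

reference = partial_ffn(x, np.flatnonzero(predicted))
assert np.allclose(merged, reference)
```

Splitting by neuron index is lossless precisely because the FFN output is a sum of per-neuron contributions; only the final reduction needs to cross the CPU/GPU boundary.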
△ PowerInfer hybrid computing method
The overall architecture of PowerInfer
Overall, PowerInfer has developed an innovative CPU/GPU hybrid inference engine using sparse activation based on dense models and the locality properties it introduces.
When onboarding a large language model (LLM), PowerInfer first trains the model's predictive routing modules in an offline stage and analyzes the model's activation characteristics in depth.
At the same time, the optimal neuron placement strategy is calculated based on key information such as the bandwidth and capacity of the target hardware.
On this basis, PowerInfer will optimally distribute neurons in memory or video memory based on these calculation results.
During the online inference phase, the CPU and GPU separately process the neurons stored in their memory, and then the results of these independent calculations are efficiently combined on the GPU.
△ PowerInfer overall architecture diagram
Summary and Outlook
For end-side users, PowerInfer’s efficient inference framework opens up new possibilities.
First, it enables PC users to run advanced large-scale language models locally without the need for expensive specialized hardware.
This not only promotes the popularization of artificial intelligence applications, but also provides unprecedented opportunities for enthusiasts, researchers, and small businesses.
PowerInfer also has huge potential when it comes to cloud deployment.
Modern cloud CPUs are also equipped with powerful AMX compute units. By exploiting the heterogeneity between CPUs and GPUs, PowerInfer can plausibly achieve higher serving throughput with fewer high-end accelerator cards.