Transformer has a successor! MSRA proposes a new large-model architecture: 8x faster inference and 70% less memory usage
Microsoft's new large-model architecture officially challenges Transformer!
The paper's title says it plainly:
Retentive Network (RetNet): A Successor to Transformer for Large Language Models.
The paper proposes a new retention mechanism to replace attention. The researchers, from Microsoft Research Asia and Tsinghua University, make no secret of their ambition and state boldly:
RetNet achieves good scaling results, parallel training, low-cost deployment, and efficient inference. These properties make this architecture a strong successor to Transformer for large language models.
The experimental data also shows that on language modeling tasks:
RetNet achieves perplexity comparable to Transformer;
Inference is 8.4x faster;
Memory usage is reduced by 70%;
It scales well.
And once the model exceeds a certain size, RetNet performs better than Transformer.
Does Transformer really have a successor? Let's look at the details.
Solving the "impossible triangle"
The importance of Transformer in large language models is beyond doubt. Whether it is OpenAI's GPT series, Google's PaLM, or Meta's LLaMA, they are all based on Transformer.
But Transformer is not perfect: its parallel training comes at the cost of inefficient inference, with a complexity of O(N) per step. It is also a memory-hungry model, and the longer the sequence, the more memory it takes up.
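To make the memory growth concrete, here is a rough back-of-the-envelope sketch. It assumes plain multi-head attention with key/value caching and 16-bit values, and the model shape (32 layers, width 4096) is a hypothetical example, not a figure from the paper.

```python
def kv_cache_bytes(seq_len, n_layers, d_model, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    Each layer keeps one key vector and one value vector of width
    d_model per token, hence the factor of 2.
    """
    return 2 * n_layers * seq_len * d_model * bytes_per_value


# Hypothetical model shape, for illustration only.
N_LAYERS, D_MODEL = 32, 4096

for seq_len in (2048, 4096, 8192):
    gib = kv_cache_bytes(seq_len, N_LAYERS, D_MODEL) / 2**30
    print(f"{seq_len:>5} tokens -> {gib:.1f} GiB of cached keys/values")
```

Under these assumptions the cache doubles every time the sequence length doubles, and each newly generated token must still be compared against every cached key, which is where the O(N) cost per step comes from.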
Researchers have, of course, tried to improve on Transformer before. However, each of the main research directions gives something up:
Linear attention can reduce the cost of inference, but its performance is poor;
Recurrent neural networks cannot be trained in parallel.
In other words, these neural network architectures face an "impossible triangle". Its three corners are parallel training, low-cost inference, and good scalability.
What the researchers of RetNet want to do is to make the impossible possible.
Specifically, building on Transformer, RetNet replaces the standard self-attention mechanism with a multi-scale retention mechanism.
Compared with the standard self-attention mechanism, the retention mechanism has several characteristics:
It introduces a position-dependent exponential decay term in place of softmax, which simplifies the computation and keeps information from earlier steps in decayed form.
It expresses position information in complex space, replacing absolute or relative position encodings, which makes conversion to a recurrent form straightforward.
In addition, the retention mechanism uses multi-scale decay rates to increase the expressiveness of the model, and exploits the scale invariance of GroupNorm to improve the numerical precision of the retention layer.
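The decay term described above can be sketched in a few lines. This is a simplified single-head illustration, not the authors' implementation: it leaves out the complex-space position encoding and GroupNorm, and the function names and toy inputs are made up for the example.

```python
def decay_mask(n, gamma):
    """D[i][j] = gamma**(i - j) if j <= i, else 0 (causal plus exponential decay)."""
    return [[gamma ** (i - j) if j <= i else 0.0 for j in range(n)]
            for i in range(n)]


def parallel_retention(Q, K, V, gamma):
    """Single-head retention in parallel form: (Q K^T * D) V, with no softmax.

    Q, K, V are lists of n vectors (lists of floats).
    """
    n = len(Q)
    D = decay_mask(n, gamma)
    out = []
    for i in range(n):
        # Decayed similarity between position i and every position j.
        scores = [sum(q * k for q, k in zip(Q[i], K[j])) * D[i][j]
                  for j in range(n)]
        # Weighted sum of value vectors; future positions have weight 0.
        out.append([sum(scores[j] * V[j][c] for j in range(n))
                    for c in range(len(V[0]))])
    return out


# Toy inputs, made up for illustration.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out = parallel_retention(Q, K, V, gamma=0.5)
print(out)
```

With gamma = 0.5, a token two steps back contributes a quarter of the weight it would have at the current position. In a multi-scale setup, each head would run the same computation with its own gamma, so some heads forget quickly and others slowly.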
△ Dual representation of RetNet
Each RetNet block contains two modules: a multi-scale retention (MSR) module and a feed-forward network (FFN) module.
The retention mechanism supports representing sequences in three forms:
Parallel;
Recurrent;
Chunkwise recurrent, a hybrid of the parallel and recurrent representations: the input sequence is divided into chunks, computation within each chunk follows the parallel representation, and computation across chunks follows the recurrent representation.
Of these, the parallel representation lets RetNet use GPUs efficiently for parallel training, just like Transformer.
The recurrent representation achieves O(1) inference complexity per step, reducing memory usage and latency.
The chunkwise recurrent representation handles long sequences more efficiently.
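A small sketch can show why the three forms are interchangeable. It assumes a simplified single-head, real-valued retention with one decay rate gamma (no complex-space position encoding, no GroupNorm); the function names and toy data are illustrative, not taken from the paper's code.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def parallel_retention(Q, K, V, gamma):
    """Parallel form: every position is computed from the full sequence."""
    return [[sum(dot(Q[i], K[j]) * gamma ** (i - j) * V[j][b]
                 for j in range(i + 1))
             for b in range(len(V[0]))]
            for i in range(len(Q))]


def recurrent_retention(Q, K, V, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + K_n^T V_n, o_n = Q_n S_n."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # fixed-size state, independent of length
    out = []
    for q, k, v in zip(Q, K, V):
        S = [[gamma * S[a][b] + k[a] * v[b] for b in range(d_v)]
             for a in range(d_k)]
        out.append([sum(q[a] * S[a][b] for a in range(d_k)) for b in range(d_v)])
    return out


def chunkwise_retention(Q, K, V, gamma, chunk):
    """Chunkwise form: parallel inside a chunk, recurrent across chunks."""
    d_k, d_v = len(K[0]), len(V[0])
    R = [[0.0] * d_v for _ in range(d_k)]  # state carried between chunks
    out = []
    for start in range(0, len(Q), chunk):
        q_c, k_c, v_c = (X[start:start + chunk] for X in (Q, K, V))
        B = len(q_c)
        for i in range(B):
            row = []
            for b in range(d_v):
                # Inside the chunk: parallel form with a decay mask.
                inner = sum(dot(q_c[i], k_c[j]) * gamma ** (i - j) * v_c[j][b]
                            for j in range(i + 1))
                # Across chunks: read the decayed state of earlier chunks.
                cross = gamma ** (i + 1) * sum(q_c[i][a] * R[a][b]
                                               for a in range(d_k))
                row.append(inner + cross)
            out.append(row)
        # Fold this chunk into the carried state.
        R = [[gamma ** B * R[a][b]
              + sum(gamma ** (B - 1 - j) * k_c[j][a] * v_c[j][b] for j in range(B))
              for b in range(d_v)] for a in range(d_k)]
    return out


def max_abs_diff(A, B):
    return max(abs(x - y) for ra, rb in zip(A, B) for x, y in zip(ra, rb))


# Toy inputs, made up for illustration.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.5]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5], [0.5, 0.5]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

par = parallel_retention(Q, K, V, gamma=0.5)
rec = recurrent_retention(Q, K, V, gamma=0.5)
chk = chunkwise_retention(Q, K, V, gamma=0.5, chunk=2)
print(max_abs_diff(par, rec), max_abs_diff(par, chk))
```

All three functions produce the same outputs up to floating-point error. The recurrent version only ever keeps the small state S, which is why the per-step cost during generation does not grow with the sequence length.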
In this way, RetNet makes the "impossible triangle" possible. The following are the comparison results of RetNet and other architectures:
Experimental results on language modeling tasks further prove the effectiveness of RetNet.
The results show that RetNet achieves perplexity similar to Transformer (PPL, a measure of language model quality; the lower the better).
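For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-probability a model assigns to the correct tokens. The sketch below uses the standard definition; the probabilities are made-up numbers, not results from the paper.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-probability of the true tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# Made-up probabilities assigned to the correct next token at each step.
confident = perplexity([0.9, 0.8, 0.95, 0.85])
unsure = perplexity([0.25, 0.25, 0.25, 0.25])
print(confident, unsure)
```

A model that gives every correct token a probability of 0.25 has a perplexity of about 4, meaning it is as uncertain as if it were choosing uniformly among four options. A more confident model scores lower.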
With 7 billion parameters and an input sequence length of 8k, RetNet's inference speed reaches 8.4 times that of Transformer, and its memory usage is 70% lower.
During training, RetNet also does better than the standard Transformer with FlashAttention on both memory savings and speedup, at 25-50% and 7x respectively.
It is worth mentioning that the inference cost of RetNet is independent of the sequence length, and the inference latency is insensitive to the batch size, allowing high throughput.
In addition, RetNet outperforms Transformer when the model parameter size is greater than 2 billion.
RetNet's research team is from Microsoft Research Asia and Tsinghua University.
The co-first authors are Sun Yutao and Dong Li.
Sun Yutao, an undergraduate in the Department of Computer Science at Tsinghua University, is currently an intern at Microsoft Research Asia.
Dong Li is a researcher at Microsoft Research Asia. He is also one of the authors of the widely discussed paper on a "Transformer that can remember 1 billion tokens".
The corresponding author of the RetNet paper is Furu Wei. He is a global research partner at Microsoft Research Asia, and the 1-billion-token Transformer also came from his research team.